Blockchains have sparked great enthusiasm in the data science community, which believes this technology will be THE solution for data authenticity, data privacy protection, data quality guarantees, smooth data access and real-time analysis. With data being considered the new digital oil, data science and blockchain seem to be the perfect match. Indeed, data science allows people and organizations to extract valuable knowledge from huge volumes of structured or unstructured data, while blockchain would provide security and reliability for the manipulated data. But does it sound too good to be true?
Blockchain is a way to implement a decentralized repository (a.k.a. Distributed Ledger Technology) managed by a group of participants who do not need to trust each other. It groups data records into blocks that are cryptographically signed and chained by linking each block back to its predecessor. Blockchain was initially proposed for cryptocurrencies (e.g., Bitcoin); this first generation of blockchain applications is called Blockchain 1.0. Later, smart contracts were introduced, paving the way for decentralized applications, referred to as Blockchain 2.0. Today, Blockchain 3.0 explores a wider spectrum of target applications such as e-health, smart cities, identity management, etc.
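To make the back-linking idea concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not how any particular blockchain is implemented: the block layout, the function names (`block_hash`, `build_chain`, `is_valid`) and the sample records are hypothetical, SHA-256 over a JSON encoding stands in for the real block format, and signatures, consensus and networking are left out entirely.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # assumed placeholder predecessor for the first block


def block_hash(index, prev_hash, records):
    """Return the SHA-256 digest of a block's contents (hypothetical layout)."""
    payload = json.dumps(
        {"index": index, "prev_hash": prev_hash, "records": records},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(batches):
    """Group each batch of records into a block linked to its predecessor."""
    chain = []
    prev_hash = GENESIS_PREV
    for index, records in enumerate(batches):
        digest = block_hash(index, prev_hash, records)
        chain.append(
            {"index": index, "prev_hash": prev_hash,
             "records": records, "hash": digest}
        )
        prev_hash = digest
    return chain


def is_valid(chain):
    """Recompute every block hash and check every back-link."""
    prev_hash = GENESIS_PREV
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["prev_hash"], block["records"])
        if expected != block["hash"]:
            return False
        prev_hash = block["hash"]
    return True


chain = build_chain([["record A"], ["record B"], ["record C"]])
chain_ok = is_valid(chain)

# Changing a record in an early block breaks the stored hash of that block.
tampered = [dict(block) for block in chain]
tampered[0] = dict(tampered[0], records=["record A (modified)"])
tampered_ok = is_valid(tampered)
```

In this sketch, altering any record makes validation fail unless every later hash is recomputed as well. Note that the check only says the blocks were not modified after the fact; it says nothing about whether the records were correct when they were written.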
Big data is one of the possible Blockchain 3.0 applications. Deepa et al. recently published a survey on the use of blockchain technology for big data, which shows that projects try to apply blockchain-based solutions at different steps of big data processing. These include big data acquisition (data collection, data transmission and data sharing), big data storage (securing decentralized file systems or detecting malicious updates in databases) and big data analytics (machine learning model sharing, decentralized intelligence and trusted decision-making in machine learning).
Although blockchain technology appears to be a good candidate to secure big data, it is not flawless: security threats and vulnerabilities have been identified at each layer of the blockchain stack model. First of all, blockchains depend on the underlying network services, so attacks on routing protocols or on DNS can harm a blockchain network.
At the consensus layer, the core component that directly dictates the behavior and performance of the blockchain, the situation is also complex. The classic Proof of Work protocol is far from being a panacea and makes little sense from an environmental point of view. In addition, most miners gather in mining pools to increase their processing capability and thus their chance of adding a new block to the blockchain. At the time of writing, the blockchain.com website estimates that six Bitcoin mining pools (F2Pool, AntPool, Poolin, ViaBTC, Huobi.pool and SlushPool) represent 63% of the hash rate. If they colluded, they could launch a 51% attack and destabilize the whole Bitcoin network. Consequently, more and more consensus algorithms are being studied, proposed and extended, such as proof of stake, proof of authority, proof of activity, RBFT, YAC, etc. However, an ideal consensus algorithm is still missing, as almost all algorithms have significant disadvantages in one way or another with respect to their security and performance, as a recent survey of consensus algorithms concludes.
The Replicated State Machine layer, which is responsible for the interpretation and execution of transactions, can be vulnerable too. Blockchain technology does not guarantee the reliability of the data, only the integrity of the blocks. For instance, Karapapas et al. showed how to offer ransomware as a service using Ethereum smart contracts. Confidentiality of data is also not always built into the blockchain. Finally, blockchain is implemented as software running on computers, so attackers can exploit security holes and misconfigurations. For example, white hat hackers found more than 40 bugs in blockchain and cryptocurrency platforms during a one-month bug bounty session in 2019; four of them were buffer overflows that made it possible to inject arbitrary code.
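To see why a colluding majority matters for the 51% attack mentioned above, one can use the simple gambler's-ruin model from the Bitcoin white paper: an attacker controlling a fraction q of the hash rate eventually catches up with an honest chain that is z blocks ahead with probability (q/p)^z when q < p = 1 - q, and with probability 1 otherwise. The sketch below is only an illustration of that formula. The function name `catch_up_probability` is hypothetical, the 63% share is the figure quoted in the text, and the six-block lead and the 10% comparison case are assumptions chosen for the example.

```python
def catch_up_probability(q, z):
    """Probability that an attacker with hash-rate fraction q ever catches up
    with an honest chain that is z blocks ahead (gambler's-ruin model)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a fraction between 0 and 1")
    if z < 0:
        raise ValueError("z must be a non-negative number of blocks")
    p = 1.0 - q  # fraction of the hash rate held by honest miners
    if q >= p:
        # With at least half of the hash rate, catching up is certain.
        return 1.0
    return (q / p) ** z


# Six pools colluding with 63% of the hash rate (figure quoted in the text),
# starting six blocks behind (assumed lead).
colluding = catch_up_probability(0.63, 6)

# A minority attacker with 10% of the hash rate (assumed, for comparison).
minority = catch_up_probability(0.10, 6)
```

Under this model the colluding pools succeed with certainty regardless of the honest chain's lead, whereas the success probability of a minority attacker shrinks exponentially with each additional block.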
To conclude, blockchain technology offers promising features for big data. However, one should acknowledge its current technical limitations. Legal aspects are another consideration: the European Parliamentary Research Service observed many points of tension between blockchains and the GDPR. Once all these issues have been addressed, then yes, blockchain will be a serious candidate for being the reliability solution for big data.
By Romain Laborde
 “Why Data Scientists Are Falling in Love with Blockchain Tech,” Techopedia.com. https://www.techopedia.com/why-data-scientists-are-falling-in-love-with-blockchain-technology/2/33356 (accessed Apr. 21, 2021).
 I. Rallo, “Six use cases in Blockchain Analysis,” Data Science Central, Mar. 15, 2021. https://www.datasciencecentral.com/profiles/blogs/six-use-cases-in-blockchain-analysis (accessed Apr. 21, 2021).
 “What Makes Blockchain and Data Science a Perfect Combination.” https://www.rubiscape.io/blog/focus-on-data-diversity-to-make-your-ai-initiatives-successful-0 (accessed Apr. 21, 2021).
 D. Di Francesco Maesa and P. Mori, “Blockchain 3.0: applications survey,” Journal of Parallel and Distributed Computing, vol. 138, pp. 99–114, Apr. 2020, doi: 10.1016/j.jpdc.2019.12.019.
 N. Deepa et al., “A survey on blockchain for big data: Approaches, opportunities, and future directions,” arXiv preprint arXiv:2009.00858, 2020.
 N. Tariq et al., “The Security of Big Data in Fog-Enabled IoT Applications Including Blockchain: A Survey,” Sensors, vol. 19, no. 8, Art. no. 8, Jan. 2019, doi: 10.3390/s19081788.
 N. Zahed Benisi, M. Aminian, and B. Javadi, “Blockchain-based decentralized storage networks: A survey,” Journal of Network and Computer Applications, vol. 162, p. 102656, Jul. 2020, doi: 10.1016/j.jnca.2020.102656.
 Y. Liu, F. R. Yu, X. Li, H. Ji, and V. C. M. Leung, “Blockchain and Machine Learning for Communications and Networking Systems,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 1392–1431, Secondquarter 2020, doi: 10.1109/COMST.2020.2975911.
 X. Li, P. Jiang, T. Chen, X. Luo, and Q. Wen, “A survey on the security of blockchain systems,” Future Generation Computer Systems, vol. 107, pp. 841–853, 2020.
 M. Saad et al., “Exploring the attack surface of blockchain: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 1977–2008, 2020.
 Y. Wen, F. Lu, Y. Liu, and X. Huang, “Attacks and countermeasures on blockchains: A survey from layering perspective,” Computer Networks, vol. 191, p. 107978, 2021.
 I. Homoliak, S. Venugopalan, D. Reijsbergen, Q. Hum, R. Schumi, and P. Szalachowski, “The Security Reference Architecture for Blockchains: Toward a Standardized Model for Studying Vulnerabilities, Threats, and Defenses,” IEEE Communications Surveys & Tutorials, vol. 23, no. 1, pp. 341–390, 2020.
 M. Sadek Ferdous, M. Jabed Morshed Chowdhury, M. A. Hoque, and A. Colman, “Blockchain Consensus Algorithms: A Survey,” arXiv e-prints, p. arXiv-2001, 2020.
 “Bitcoin mining in China could soon generate as much carbon emissions as some European countries, study finds,” CNN Business. https://www.cnn.com/2021/04/09/business/bitcoin-mining-emissions/index.html (accessed Apr. 21, 2021).
 “pools,” Blockchain.com. https://www.blockchain.com/charts/pools (accessed May 03, 2021).
 C. Karapapas, I. Pittaras, N. Fotiou, and G. C. Polyzos, “Ransomware as a Service using Smart Contracts and IPFS,” in 2020 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), 2020, pp. 1–5.
 Mix, “Security researchers found over 40 bugs in blockchain platforms in 30 days,” TNW | Hardfork, Mar. 14, 2019. https://thenextweb.com/news/blockchain-cryptocurrency-vulnerability-bug (accessed Apr. 28, 2021).
 M. Finck, “Blockchain and the General Data Protection Regulation: Can distributed ledgers be squared with European data protection law?,” European Parliamentary Research Service, PE 634.445, Jul. 2019. [Online]. Available: https://www.europarl.europa.eu/RegData/etudes/STUD/2019/634445/EPRS_STU(2019)634445_EN.pdf.