Two septillion bytes by 2025: the advent of the Internet and of wireless networks has led to a massive accumulation of data. “If we were to store all of today’s information on Blu-ray, we would need twenty-three piles of disks stretching to the moon,” explains Marc Antonini, a research professor at the Computer Science, Signal Processing, and Systems laboratory (I3S) at Sophia Antipolis (southeastern France).1 A crisis is unfolding, forcing Internet giants to build ever more data centres, often in cold regions because of the enormous cooling demands these facilities generate.
The world’s data in a shoe box
The chemistry and molecules of living matter have drawn the interest of various researchers in the quest for better-adapted storage systems. Marc Antonini has focused on DNA, a single gramme of which can theoretically contain up to 455 exabytes of information, or 455 quintillion bytes. All of the world’s data would thus fit in a shoe box.
Given the pressing need and the improvement of sequencing techniques, the idea is increasingly appealing. “DNA has the advantage of being extremely compact and resistant to the passage of time,” Antonini points out. “We can sequence that of mammoths, which is tens of thousands of years old, whereas systems on hard drives have to be duplicated every five years as a precaution, and those on magnetic tapes every twenty years.” DNA could replace these tedious and energy-consuming processes.
The scientist and his team are working on OligoArchive, a three-year project that has received €3 million in funding from the European Commission. It brings together the Institute of Molecular and Cellular Pharmacology (IPMC),2 I3S, the Eurecom Graduate School and Research Centre in Digital Sciences, Imperial College London (UK), and the Irish start-up HelixWorks Technologies Limited. Together they are seeking to develop a proof of concept for each stage of DNA storage: synthesising and storing data, and retrieving it as efficiently as possible. The project’s goal is to build a DNA disk: a fully functional end-to-end prototype demonstrating that DNA could one day replace current archival storage technologies on magnetic tape.
One of the main stumbling blocks, however, is price. Whether it is natural or synthetic, DNA consists of sequences of four nucleotides, also known as bases. Storage systems use these as the symbols of a quaternary (base-four) code, as opposed to the binary code of computers. Yet it currently costs one dollar to synthesise two hundred nucleotides, and encoding a single image requires a few thousand of them, which makes it prohibitively expensive to convert the gigantic mass of data that needs to be dealt with.
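To get a feel for these figures, here is a back-of-the-envelope sketch in Python. It uses the price quoted above (one dollar per two hundred nucleotides) and assumes, for simplicity, an ideal density of two bits per nucleotide with no redundancy, addressing, or error-correction overhead. The function name and constants are illustrative only and do not come from the OligoArchive project.

```python
import math

NUCLEOTIDES_PER_DOLLAR = 200   # price quoted in the article
BITS_PER_NUCLEOTIDE = 2        # assumption: ideal quaternary code, no overhead


def synthesis_cost_usd(num_bytes: int) -> float:
    """Rough synthesis cost in dollars for storing num_bytes on DNA."""
    nucleotides = math.ceil(num_bytes * 8 / BITS_PER_NUCLEOTIDE)
    return nucleotides / NUCLEOTIDES_PER_DOLLAR


if __name__ == "__main__":
    sizes = [("1 kB", 1_000), ("1 MB", 1_000_000), ("1 GB", 1_000_000_000)]
    for label, size in sizes:
        print(f"{label}: about ${synthesis_cost_usd(size):,.0f}")
```

Under these assumptions, a one-kilobyte file needs 4,000 nucleotides, or about 20 dollars, and a single gigabyte would cost in the order of 20 million dollars. Real systems add overhead, so actual costs would be higher.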
Hot and cold data
Solutions exist to overcome this problem, such as not conserving everything on DNA, and making a distinction between cold and hot data. “Cold data is that which is accessed only rarely, not to say never, such as old digitised photos that have accumulated on the cloud, or administrative archives. This stock grows by 60% each year, while the storage capacity of current systems only improves by 20%, which leads to the construction of even more data centres.”
This cold data does not have to be accessible with the same immediacy as items used every day. It is therefore an excellent candidate for alternative forms of storage such as synthetic DNA, because it goes through fewer successive rounds of encoding and decoding. “It would be invaluable for the cultural heritage sector, which could easily keep multiple copies of film or museum archives. As the fire at Universal Studios in Hollywood in 2008 unfortunately showed, a number of master tapes were lost because they had not been duplicated.”
The OligoArchive team is looking at solutions to reduce costs, such as limiting the number of nucleotides needed to store the same amount of information. As previously mentioned, DNA consists of four different nucleotides called A, C, G, and T. A simple first encoding technique assigns a pair of binary digits (bits) to each: 00 for A, 01 for C, 10 for G, and 11 for T. This is referred to as transcoding.
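The transcoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the function names are hypothetical, and the choice to read each byte from its most significant bit first is an assumption, since the article does not specify a bit order.

```python
# Mapping given in the article: two bits per nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Transcode bytes into a nucleotide string (4 nucleotides per byte)."""
    # Assumption: each byte is written most significant bit first.
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Reverse the transcoding; expects a whole number of bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4 nucleotides")
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)           # CAGACGGC
    print(decode(strand))   # b'Hi'
```

Each byte becomes exactly four nucleotides, so the length of the strand, and hence its synthesis cost, is fixed by the size of the input.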
Circumventing the rules of living matter
However, while the synthetic DNA code generated to represent a piece of digital data contains no genetic information that can be understood by the world of living matter, it remains subject to some of its rules. For example, if a nucleotide is repeated too many times without interruption, sequencing becomes prone to errors. Transcoding cannot easily manage this, nor can it control the length (and hence the cost) of the DNA sequences generated. To mitigate these problems, researchers propose integrating an encoding system directly at the level of digital data compression. The challenge is to create sequences of DNA code that can hold, on average, even more digital data in the same number of nucleotides, thereby reducing the cost of synthesis. The team is also developing algorithms that automatically correct, during decoding, the errors introduced when the DNA code is sequenced.
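The repetition problem is easy to reproduce with naive two-bit transcoding. The Python sketch below measures the longest uninterrupted run of a single nucleotide in a strand. The function names and the threshold `MAX_RUN` are hypothetical and chosen purely for illustration; the article gives no specific limit, and the sketch does not show the team's compression-level encoding.

```python
from itertools import groupby

BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
MAX_RUN = 3  # hypothetical threshold, not a figure from the article


def transcode(data: bytes) -> str:
    """Naive two-bits-per-nucleotide transcoding (most significant bit first)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def longest_run(strand: str) -> int:
    """Length of the longest run of one repeated nucleotide (0 if empty)."""
    return max((len(list(group)) for _, group in groupby(strand)), default=0)


if __name__ == "__main__":
    for payload in (b"Hi", b"\x00\x00", b"\xff"):
        strand = transcode(payload)
        run = longest_run(strand)
        status = "ok" if run <= MAX_RUN else "too repetitive"
        print(f"{payload!r}: {strand} (longest run {run}, {status})")
```

Two zero bytes, for instance, transcode to eight consecutive A nucleotides. Runs of identical bytes are common in digital files, which is one reason a fixed mapping alone gives no control over such repetitions.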
“When we speak on the telephone, problems can occur with the encoding channels and the sound becomes choppy, or communication is even cut. The noise introduced by DNA sequencing produces a similar phenomenon. We are therefore striving to make encoding more robust. We would also like to standardise compression systems beyond our study group, and are contributing to the JPEG International Standardization Committee with this goal in mind.” The team has given itself three years to provide its first proofs of concept, and to pave the way for the practical use of artificial DNA storage.