DNA Data Storage
"We are at the dawn of an era where biology and digital technology converge in the most unexpected ways."
This fusion of disciplines presents groundbreaking opportunities, particularly in data storage. Because stored data needs a physical medium and physical space, there is growing concern about whether we will have enough capacity to hold tomorrow’s data. Even if we put every solid-state drive and hard disk to use, it is not a stretch to say that we may eventually exhaust our silicon and spatial reserves. Even though the options available today still meet our needs, it is imperative to think of revolutionary solutions to the problem. A solid idea, even a draft of what’s to come, may spare systems engineers in government and at top firms a good many headaches.
Frankly, a candidate solution already exists. If we told you that the entire Internet, estimated at over 100 zettabytes of data, could in principle be stored in a container small enough to hold in one hand, would you believe it? We need only look to the molecule that carries heredity in our bodies: DNA.
With these awesome thoughts in mind, my colleague Anirban Pal and I interviewed Professor Soumya De, Associate Professor at IIT Kharagpur, to learn more and delve a little deeper into the topic.
Why DNA?
“The limitless option that DNA allows is the freedom to store as much data as possible, incomprehensible to human terms,” in Prof. Soumya De's words. Basically, data storage at the molecular level has incredible potential. For those who would like to learn more about the technicalities, it comes down to synthesizing DNA strands whose base sequences encode the information we need to store.
Mechanism of DNA Data Storage
DNA data storage starts with the translation of binary data (0s and 1s) into the four nucleotide bases of DNA: adenine (A), cytosine (C), guanine (G), and thymine (T). Encoding schemes have been devised to accomplish this translation. One such method is to assign binary pairs to particular nucleotides; e.g., '00' to 'A', '01' to 'C', '10' to 'G', and '11' to 'T'. This ensures that the binary information is correctly represented in a DNA synthesis-compatible format.
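The two-bit mapping above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `encode_to_dna` is our own, and real encoding schemes add further constraints that this sketch ignores.

```python
# Mapping taken from the text: each pair of bits becomes one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def encode_to_dna(data: bytes) -> str:
    """Translate raw bytes into a nucleotide string, two bits per base."""
    # Write every byte as 8 bits, then read the bit string two bits at a time.
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(encode_to_dna(b"Hi"))  # CAGACGGC
```

Each byte becomes exactly four bases, so the two-character message "Hi" turns into an eight-base strand.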
To enhance the reliability of data storage, error correction codes such as Reed-Solomon or fountain codes are incorporated during the encoding process. These codes add redundant information to the sequence, making it possible to detect and correct errors caused by synthesis, degradation, or sequencing mistakes.
For instance, if a DNA strand is damaged or partially lost, error correction can help reconstruct missing parts and recover the original information.
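To get a feel for how redundancy allows a lost strand to be rebuilt, here is a deliberately simplified sketch. It uses a single XOR parity block, which is far weaker than the Reed-Solomon or fountain codes mentioned above and can recover only one missing block. The helper names are hypothetical, and the blocks are assumed to be of equal length.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings position by position."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(blocks: list[bytes]) -> bytes:
    """Build one redundant block: the XOR of all data blocks."""
    return reduce(xor_bytes, blocks)


def recover_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost block from the survivors and the parity block."""
    return reduce(xor_bytes, surviving, parity)


blocks = [b"DNA ", b"data", b"demo"]
parity = make_parity(blocks)

# Pretend the middle block was lost during storage.
recovered = recover_missing([blocks[0], blocks[2]], parity)
print(recovered)  # b'data'
```

The parity block costs one extra strand of storage, which is the general trade-off with any error correction code: more redundancy buys more tolerance to loss.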
Since DNA molecules cannot be infinitely long, the encoded data is split into multiple short DNA strands, typically 100–200 base pairs each. To ensure proper reassembly, each fragment is labeled with an index sequence, similar to page numbers in a book. This allows sequencing technologies to read and reconstruct the data correctly.
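The page-number idea can be sketched as follows. We assume, purely for illustration, a fixed four-base index written in base 4 at the start of every fragment. Real systems choose their own index length and layout, and all names here are hypothetical.

```python
BASES = "ACGT"
INDEX_LEN = 4  # assumption: 4 bases can address up to 4**4 = 256 fragments


def index_to_bases(index: int) -> str:
    """Write a fragment number as a fixed-width base-4 'page number'."""
    digits = []
    for _ in range(INDEX_LEN):
        index, remainder = divmod(index, 4)
        digits.append(BASES[remainder])
    return "".join(reversed(digits))


def bases_to_index(prefix: str) -> int:
    """Read the base-4 page number back into an integer."""
    value = 0
    for base in prefix:
        value = value * 4 + BASES.index(base)
    return value


def fragment(sequence: str, payload_len: int) -> list[str]:
    """Split a long sequence into short strands, each prefixed with its index."""
    chunks = [sequence[i:i + payload_len]
              for i in range(0, len(sequence), payload_len)]
    if len(chunks) > 4 ** INDEX_LEN:
        raise ValueError("too many fragments for the chosen index length")
    return [index_to_bases(i) + chunk for i, chunk in enumerate(chunks)]


def reassemble(fragments: list[str]) -> str:
    """Sort strands by their index and join the payloads back together."""
    ordered = sorted(fragments, key=lambda f: bases_to_index(f[:INDEX_LEN]))
    return "".join(f[INDEX_LEN:] for f in ordered)


sequence = "CAGACGGCTTAACCGG"
strands = fragment(sequence, payload_len=6)
# Strands come out of a test tube in no particular order; reverse them here.
print(reassemble(list(reversed(strands))) == sequence)  # True
```

Because every strand carries its own index, the order in which the strands are read back does not matter.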
After the binary data has been encoded as nucleotide sequences, these sequences are physically produced through DNA synthesis. In this process, nucleotides are joined in the order prescribed by the encoded information. Chemical methods, such as phosphoramidite synthesis, are the most commonly used. However, DNA synthesis today remains slow and expensive, with production rates quoted in megabytes per hour. Synthesis can also introduce errors, which calls for strong error correction.
Once synthesized, the DNA strands are kept in stable conditions so that they last longer. The inherent stability of DNA and its high information density make it a good candidate for long-term data storage. Encapsulated DNA can survive for decades at ambient temperatures and probably longer under controlled conditions, such as those in data centers.
To retrieve the stored data, scientists first sequence the DNA, which means reading the order of its nucleotide bases (A, C, G, and T). High-throughput sequencing technologies quickly scan the DNA strands and generate a digital version of the sequence. This sequence is then decoded using special algorithms that reverse the original encoding process, converting the DNA back into binary (0s and 1s). Since errors can occur during storage or sequencing, built-in error correction codes help detect and fix any mistakes, ensuring that the retrieved data is accurate and matches the original information.
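The decoding step can be sketched by reversing the two-bit mapping described earlier ('A' for '00', 'C' for '01', 'G' for '10', 'T' for '11'). This is a minimal illustration that assumes an error-free read; the function name is our own.

```python
# Reverse of the encoding table: one nucleotide back to two bits.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def decode_from_dna(sequence: str) -> bytes:
    """Turn a sequenced nucleotide string back into the original bytes."""
    if len(sequence) % 4 != 0:
        raise ValueError("expected a whole number of bytes (4 bases each)")
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    # Regroup the bit string into bytes, 8 bits at a time.
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(decode_from_dna("CAGACGGC"))  # b'Hi'
```

In practice the error correction step would run before this conversion, so that the bases fed into the decoder already match what was originally written.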
Each one of these steps, from encoding and synthesis to storage and sequencing, relies on highly intricate bioinformatics, careful chemical engineering, and rigorous error management. There is ongoing research to make all of these steps faster, more efficient, and cheaper, so that DNA storage can become economically viable enough to replace conventional electronic storage methods.
Here's a cool fact: the Netflix series ‘Biohackers’, an awesome watch, has its first episode stored entirely in synthetic DNA designed by professors at ETH Zurich. These guys know their jobs well, setting the standard for the cutting-edge technologies of the future. This is just the beginning, as we have only just tapped into the vast world of biotechnology and the wonders that await us.
Using DNA gives us both very high storage density (about 1 exabyte/mm³) and durability. Unlike hard drives, which last 3-5 years, or tapes, which last 10-30 years, DNA can last thousands of years if stored properly. DNA storage also does not require electricity to maintain data, making it an eco-friendly alternative. These claims might sound too good to be true, and for now they partly are: a lot of discussion still needs to take place about efficiency and cost-effectiveness.
Here's another mind-boggling fact. A Facebook data center is almost the size of ten football fields, while the same information could be held in a tablespoonful of DNA. This fact alone surely raises some eyebrows. But it can be made a reality. To do that, we need full support from governments for biotechnology and related research. Awareness and the right policies, working hand in hand with talented and hard-working scientists, can go a long way towards making this possible.
The Challenges of DNA Storage
There is a need to educate people about the awesomeness and seriousness of this matter. There is a hardwired, preconceived notion amongst the public, reinforced by media glorification, that biotechnologists are mad scientists doing crazy mutation-based animal and human experiments and creating supervillains. The common folk will need a lot of convincing to warm up to the idea of DNA as a data storage option.
Cost is another hurdle. It is still relatively pricey to synthesize and sequence DNA, with writing estimated at $3,500 per megabyte, although this is expected to fall as the technology improves.
Speed is a problem too. Writing and reading DNA data are nowhere near the speeds of today's hard drives: 400 bytes per second is far, far slower than an SSD.
Finally, DNA is prone to mutations (small errors), so the encoding techniques used must be advanced enough to ensure that the data stays correct.
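To put the figures above in perspective, here is a rough back-of-the-envelope calculation for writing a single gigabyte. It simply scales the quoted $3,500 per megabyte and 400 bytes per second, and it assumes decimal units (1 MB = 1,000,000 bytes). Treat the result as an order-of-magnitude estimate, not a price quote.

```python
# Figures quoted in the text.
COST_PER_MB_USD = 3_500
WRITE_SPEED_BYTES_PER_S = 400

# Assumption: decimal units, 1 MB = 1,000,000 bytes.
BYTES_PER_MB = 1_000_000
SECONDS_PER_DAY = 86_400


def write_cost_usd(size_bytes: int) -> float:
    """Estimated cost of synthesizing the given amount of data."""
    return size_bytes / BYTES_PER_MB * COST_PER_MB_USD


def write_time_days(size_bytes: int) -> float:
    """Estimated time to write the data at the quoted speed, in days."""
    return size_bytes / WRITE_SPEED_BYTES_PER_S / SECONDS_PER_DAY


one_gb = 1_000_000_000
print(f"Cost: ${write_cost_usd(one_gb):,.0f}")       # Cost: $3,500,000
print(f"Time: {write_time_days(one_gb):.1f} days")   # Time: 28.9 days
```

At these rates, a single gigabyte would cost millions of dollars and take about a month to write, which is why DNA is discussed as an archival medium and not as a replacement for everyday drives.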
Conclusion
DNA storage is not about saving all your favourite movies or photos. It is, rather, about safeguarding human knowledge and history so that they can be passed down to future generations. Whether it is ancient manuscripts, medical records, or an entire library of books, DNA might just hold the key to eternal data preservation. The next time you want to upgrade your hard drive, just imagine having your whole digital life in one droplet of liquid. Believe it or not, that future is closer than you think! Bringing this idea into daily life would pave the way for our development as a species and let us explore the intricacies of the untapped world of our genomes.