Convert Text to a DNA Sequence with Python
Deoxyribonucleic acid (DNA) is a promising storage medium, capable of storing and archiving our abundance of data.
When bits are converted to bases, data can be encoded in DNA.
To encode data is to convert it from one form to another. Encoded images, audio recordings, videos and characters can be used to compile and execute programs, transmit, store and compress/decompress data, and convert files. I decided to encode input text into a DNA sequence in hopes of storing it.
Here’s how I did it.
ABOUT THE PROJECT
Before the text is mapped to a nucleotide sequence, it must be converted to a binary sequence. In digital electronics, binary sequences represent instructions to the computer and types of data using bits and bytes. Binary digits, or bits, store either 0 or 1, rendering them the smallest units of storage. 8 bits grouped together make a byte, such as 01011010 and 01000001.
A nucleotide is a compound comprising pentose (5 carbon atoms) sugar, a phosphate group and a nitrogenous base. In DNA, the 4 nitrogenous bases are adenine, guanine, cytosine, and thymine. Their sequence will code for proteins and carry all genetic information.
Nucleotides are joined together by covalent bonds between the phosphate group of a nucleotide and the 3rd carbon atom of the pentose sugar in the next. Thus, the pairing of nucleotides produces the DNA backbone of sugar — phosphate — sugar — phosphate — and so on. This sequence forms a polynucleotide chain and 2 coiled polynucleotide chains produce DNA’s iconic double helix.
The entire program is based on a simple mapping, in which 2 bits of a sequence are converted into nucleotides, each represented by a single letter (A, G, C or T).
When DNA is sequenced, it is possible to determine the order of bases and in plain sequence format, they can be represented by a single letter. To bring about a new or modified sequence, it’s possible to move A, G, C or T around.
For more about digital data storage in DNA, read my previous article, The Future of Digital Data Storage Lies in DNA.
DISSECTING THE PROJECT
At first, I set a variable (original_str) with the user’s input. The text can vary in length and content; it’s all up to the user.
Then, I used the bytearray(source, encoding) method, where source is the input text (original_str) and encoding is UTF-8 (“utf-8). UTF-8 represents every character in the input text with 1, 2, 3 or 4 bytes.
It’s thus possible to use the built-in format(value, format_spec) method to convert each integer value (x) to binary and keep the leading zeros by specifying the formatting (“08b”). The resulting binary number includes all leading 0s and has 8 bits in total. The list of bytes is consequently attached with the string.join(iterable) method and becomes a string (binary_str).
Next, the binary string becomes a list (binary_list), divided into items that are 2 bits long. Because 2 bits turn into 1 nucleotide, each item in binary_list should have a length of 2.
I then define a dictionary (DNA_encoding), in which every key represents a possible 2-bit item in binary_list and every value is its corresponding nucleotide.
An empty list (DNA_list) is also set, to which values in DNA_encoding are appended. A for loop iterates on every 2-bit item in binary_list and another for loop iterates over every key in DNA_encoding. When the 2-bit item matches a key, the value of that key is added to DNA_list.
At the end, a new string (DNA_str) is produced when all nucleotides in DNA_list are joined together.
Finally, the input text (original_str), the binary string (binary_str) and the ensuing nucleotide sequence (DNA_str) are all printed (\n is just to format the output).
Here’s an example of the output:
WHAT’S NEXT FOR THE PROJECT?
Next, this program could convert documents, pictures and videos to a binary sequence and thus, a nucleotide sequence. By including more than just text, different types of information can be encoded and stored in DNA. Also, a file including the resulting sequence could be downloaded to be used for in silico analyses with computational and statistical techniques.
Additionally, various rules and algorithms could better protect the data, thanks to more guarded encoding. From a range of algorithms with varying degrees of security, the user could select the one that would best meet their needs and help them accomplish their goal.
Finally, a website could be built to make the program more accessible and allow anyone to encode personal information into a base sequence. There are many text-to-binary and binary-to-text converters online, why not a text-to-nucleotide one?