What can I research for thesis on DNA data storage from math?

I'm a math researcher, and I just finished a master's degree in error-correcting codes. I recently saw a presentation on DNA-based data storage that I loved.

I wish to continue my studies (in applied mathematics) on this subject.

I received a PhD offer, and when I mentioned the subject, the prospective director was receptive, but he is not well informed on the subject either, so he proposed that I come up with a research path…

What do you think would be an interesting research topic (and one of value to the community) on DNA data storage from the perspective of error-correcting codes?

Thanks for the help!


50 Best Genetics Research Topics For Academic Papers

The study of genetics takes place across different levels of the education system in academic facilities all around the world. It is an academic discipline that seeks to explain the mechanism of heredity and genes in living organisms. First discovered back in the 1850s, the study of genetics has come a pretty long way, and it plays such an immense role in our everyday lives. Therefore, when you are assigned a genetics research paper, you should pick a topic that is not only interesting to you but one that you understand well.


Preserving Research

Amy Maxmen
Aug 1, 2013

As a graduate student in Harvard’s organismic and evolutionary biology department in the early 2000s, I wanted to publicly share all of the research that went into my doctoral thesis in order to contribute to the small body of scientific literature on the little-known group of marine arthropods I studied, sea spiders. However, after I published a few reports and successfully defended my PhD, my drive to submit the final chapter of my thesis to a journal dissolved because of the expense and time involved. Yet, on the rare occasions when researchers have asked to see it, I regretted that it languished on my bookshelf. Although the chapter is far from earth-shattering, it might provide a stepping stone for another biologist.

“There is a need for science to be communicated faster to other researchers and the public, so by putting manuscripts online in places like the [preprint.

Luckily, sharing is cheaper and faster now that online, open-access collections for biology are flourishing as researchers realize the benefits of uploading unpublished reports of negative results, observations, grant applications, protocol notes, and yes, their unpublished theses onto the Web for others to peruse. In January, I finally uploaded my thesis chapter on sea spider metamorphosis onto several sites. Within 3 weeks, a zoologist from Germany e-mailed me to ask how to cite it and whether I was still following that line of study.

In addition, unpublished uploads may directly contribute to one’s career. This year, the National Science Foundation announced that grant reviewers would take note of citable and accessible “products” in addition to publications. Because online repositories grant unpublished reports a digital object identifier, or DOI, that can be referenced in a citation, these uploads may now improve a scientist’s reputation.

Submission is typically free and relatively simple. However, how readable, usable, and findable the report is to others remains up to you. In order to explore how several online repositories function, I uploaded my thesis chapter as a test, and spoke with experts who have turned to the Web for similar reasons.

THINKING ABOUT UPLOADING A MANUSCRIPT?

Researchers list various reasons for uploading unpublished material: to get feedback on a paper before submission; to help others learn why a grant was accepted or rejected so that they need not repeat the same mistakes; to place a time stamp on their data or ideas; to share observations and protocols that could be useful to other scientists; and to post movies and other data in formats that most journals cannot handle. Here are a few tips to get the most out of your post.

Choose your words wisely
Search engines pick over the title and abstract of uploaded reports. Therefore, it’s important to think about your wording. “It’s cute to have a title like ‘To Be or Not to Be,’” says physicist Paul Ginsparg, founder of the first major preprint server, arXiv. “But since that does not convey the essential content, it will be missed by your target audience.” Ginsparg complimented me on the title I had chosen for the thesis chapter I uploaded to the arXiv, “Sea Spider Development: How the encysting Anoplodactylus eroticus matures from a buoyant nymph to a grounded adult.” He says that it includes words that a nonspecialist may Google in addition to technical terms like “nymph” and “encysting” that researchers in the field might use to search for the paper. In addition, Ginsparg advises researchers to attach plenty of metadata, such as keywords ranging from the general to the specific, to every upload.

Check the license
Before hitting the submit button on a particular repository, read its licensing information carefully. Many repositories now offer Creative Commons (CC) licenses. The most common type, “CC BY,” allows anyone to read and distribute a paper as long as they give proper acknowledgment to the author. This way, anyone who wishes to post the content on Wikipedia or another website need not worry about infringement, as long as they reference the author. A subcategory of the Creative Commons license, “CC BY-NC,” adds the clause that others cannot distribute the report for commercial purposes. If an author intends to also submit the report to a peer-reviewed journal, this option is better, as journals tend to want the exclusive right to distribute the article for commercial purposes.

Compress huge files and append raw data
Some repositories boast that they offer unlimited upload size, but that might not be a blessing. If you manage to upload a huge file before the server times out, the report may cause the browser to perform poorly and readers may not be able to download the file without a high-speed connection. For this reason, Ginsparg recommends that researchers compress figures into a single PDF, but also upload a separate file in a format that preserves the raw data.

CHOOSING A HOST

arXiv (LAUNCHED 1991)
Theoretical physicists have posted unpublished reports on arXiv.org for more than a decade, and recently, a growing number of biologists are doing so, too. (See graph on this page.)

The subheading for biology, “Quantitative Biology” is a loose one, with subject matter ranging from cancer to epigenetics.

Number of uploaded reports: About 860,000 reports from a variety of scientific disciplines

Number of biology-related reports: 7,200 registered under the quantitative biology category

Cost: Uploads are free. As of 2001, the website is hosted and handled by Cornell University Library in Ithaca, New York.

Submitting: Anyone can upload a report, provided you have an organization or institutional affiliation.

Searchability: The local arXiv search engine indexes the author’s name, keywords, and words in the title and abstract. It also combs through the text of a PDF (a suggested and common format for uploads), but slightly less thoroughly.

Pro: Reputation. With 2 million downloads weekly, Google and other search engines discover papers on arXiv quickly, and most researchers immediately recognize the website as a mainstay in online publishing.

Con: Usability. There is no comment feature, so if another researcher wants to critique the work, she must send an e-mail. Also, most quantitative biology uploads are in PDF format, as arXiv suggests. As such, researchers cannot update data within a report that has been compressed.


FIGSHARE (LAUNCHED 2011)
Use of figshare boomed after Nature recommended the site as an alternative when they stopped accepting submissions to Nature Precedings, an online preprint journal (figshare is a sister company of Nature Publishing Group). Figshare’s content includes supplemental data associated with published papers, as well as unpublished data sets and reports, conference presentations, and more.

Number of uploads: Hundreds of thousands, but many are supplemental data associated with peer-reviewed manuscripts

Number of registered users: Thousands of active users, primarily in the life sciences

Cost: Generally free. The site plans to sustain itself by working with publishers, such as F1000Research and PLOS, who pay for figshare services to help with visual content that those journals cannot easily handle.

Submitting: Each upload is free and limited to 250 MB, and users can upload as many projects as they like, as long as the uploads are public. Privacy, or partial privacy with a handful of selected collaborators, is also an option; however, it limits researchers to 1 GB total. If there is a demand for unlimited space, founder Mark Hahnel says he can set up premium accounts for a small fee.

Pro: Usability. Figshare features an intuitive user interface. In addition, Hahnel put special effort into how video data and other nontraditional formats are displayed because of his frustration that he could not easily share his own videos of cell dynamics. Finally, figshare encourages feedback by making it as simple to leave comments below the manuscript as it is on YouTube or a discussion board.

Con: Youth. As a relatively recent site for scientific data, preprints, and published papers, figshare has yet to prove its staying power.


ResearchGate (LAUNCHED 2008)
COLLECTING AT ALL LEVELS: My graduate work focused on the evolution of arthropods, using sea spiders as a model. Some of the sea spiders were collected from rocks along the Pacific coast of Japan. The confocal microscope image (insert) shows a juvenile sea spider’s nervous system tagged with a fluorescent marker and color-coded to indicate depth. My goal in uploading the last chapter of my doctoral thesis was to share more of my data with other scientists. COURTESY OF AMY MAXMEN KATSUMI MIYAZAK

ResearchGate focuses on a researcher’s academic network more than the other sites. It initially creates this network by asking a user to invite coauthors, and it automatically locates them by scanning the user’s published research. When people in your network upload unpublished reports, a notification appears on your home page (unless the authors have requested privacy). Most of the content currently on ResearchGate consists of published peer-reviewed material and science-related forum posts; however, cofounder Ijad Madisch expanded the database in December 2012 to include non-peer-reviewed posts. In part, Madisch made the change because “80 percent of the experiments I tried did not work, and I never shared those negative results,” he says. “I was sure someone else had made the same mistakes, and I wanted to be able to find them.”

Number of biology-related posts: More than 100,000 non-peer-reviewed uploads, including many data sets

Number of registered users: As of mid-July, almost 630,000 biologists have signed up for ResearchGate.

Submitting: Users sign in with an e-mail attached to an academic institution.

Searchability: Because ResearchGate smoothly accrues a large collection of published research, a search for the topic “sea spider,” for example, returns a library of information, published and unpublished alike.

Cost: Uploads are free. Companies and institutions can post job ads on the site for a fee.

Pro: Usability. Users receive a score based on the number of publications in peer-reviewed journals and the impact factor of the journals, as well as an “RG” score based on their participation with the site. This score could be submitted as part of a grant application, although the value of its impact remains to be seen. Also, feedback is social. Readers can post questions about a report to a forum that all users see.

Con: Networking. Some researchers may dislike publicly sharing their query about a report with a forum, and may be turned off by requests from ResearchGate to invite colleagues, or by the Facebook-like home page with a running stream of updates from other scientists.

INSTITUTIONAL REPOSITORIES (ONLINE BEGINNING IN THE EARLY 1990s)

SAVING DATA: During the course of my research, I gathered a vast number of microscope images, DNA sequences, and other data. COURTESY OF AMY MAXMEN

Most universities encourage their researchers to submit dissertations and published manuscripts to their repositories. The digital repository called DASH (Digital Access to Scholarship at Harvard) at my alma mater, Harvard University, also permits the submission of unpublished reports, but Stuart Shieber, the founder and former director of Harvard’s Office for Scholarly Communications, says that researchers rarely use it for this function. My review of these repositories is based on DASH, but the capabilities of different institutions vary.

Number of reports on DASH: 12,309. Most are published reports from a wide variety of fields. An additional 625 dissertations are uploaded from the College of Arts and Sciences.

Searchability: People who wish to find reports on digital institutional repositories around the world can search for them at base-search.net/.

Pro: Reputation. Because membership requires a university affiliation, readers may feel assured that the research derives from a qualified source. Whereas newer platforms may lose ground over time, those hosted by a university will likely stand the test of time, even if they remain underutilized.

Con: Usability. Because submissions are manually vetted, my chapter did not appear online for 5 weeks after I uploaded it in mid-January. Also, readers cannot leave comments or click a button to send a message to the author. Finally, the system felt less flexible and less intuitive than other online repositories mentioned here. 


Advanced Biology Requirement

At least eighteen units in approved advanced Biology courses (numbered 300 or above) are required. Courses that may be counted toward these 18 units are listed following Biol 2960 and Biol 2970 in the section 'Courses for Biology-Major Credit'. At least one course in each of three distribution areas (A-C) and an advanced laboratory course must be taken; each of these courses counts toward the required 18 advanced biology units. Up to 6 units of Bio 500 may be counted toward the 18 advanced biology units.

Three Areas of Biology Required (Fall 2020 offerings in bold):

  • Area A: Plant Biology and Genetic Engineering (Biol 3041); Human Genetics (Biol 324); Cell Biology (Biol 334); Eukaryotic Genomes (Biol 3371); Microbiology (Biol 349); Immunology (Biol 424); Infectious Diseases: History, Pathology, and Prevention (Biol 4492); General Biochemistry (Biol 451); General Biochemistry I (Biol 4810); General Biochemistry II (Biol 4820)
  • Area B: Endocrinology (Biol 3151); Principles in Human Physiology (Biol 328); Principles of the Nervous System (Biol 3411); Introduction to Neuroethology (Biol 3421); Genes, Brains and Behavior (Biol 3422); How Plants Work: Physiology, Growth and Metabolism (Biol 4023); Biological Clocks (Biol 4030); Developmental Biology (Biol 4071); Principles of Human Anatomy and Development (Biol 4580)
  • Area C: Woody Plants of Missouri (Biol 3220); Darwin and Evolutionary Controversies (Biol 347); Evolution (Biol 3501); Animal Behavior (Biol 370); Introduction to Ecology (Biol 381); Population Genetics and Microevolution (Biol 4181); Macroevolution (Biol 4182); Molecular Evolution (Biol 4183); Community Ecology (Biol 419); Disease Ecology (Biol 4195); Behavioral Ecology (Biol 472)

Getting Past the Errors

But like all data storage methods, DNA has a few shortcomings as well. The most significant upfront hurdle is cost. Hawkins says that the cost of current methods is similar to that of an Apple Hard Disk 20 back in 1980. Back then, about 20 megabytes of storage (roughly the amount of data you'd need to download a 15-minute video) went for about $1,500.

Beyond that, DNA is also error-prone. Recall the four nucleotide bases that make up the DNA ladder. On average, DNA introduces about one mistake per 100 to 1,000 nucleotides. These can take three forms: substitutions, insertions, and deletions.

In a substitution mutation, a single letter in a string of nucleotides may be switched out for another. In the graphic below, cytosine is replaced with thymine. The strands of DNA remain the same length. In an insertion or deletion, though, the DNA gets an extra nucleotide base, or removes one. But unlike errors in computer code, there is no space left behind where a removed base once lived, which can quickly become problematic when you go to decode the data stored in the DNA.

Hawkins likes to compare this to English words: "A deletion of the letter 'L' turns 'world' into 'word.' Additionally, inserting an 'S' then turns it into 'sword.' Correctly reading 'world' from 'sword' is hard not only because sword is still a valid English word, but because all the letters shifted around."
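The difference between the three error types can be made concrete with a short sketch. It compares a sent strand with a received strand position by position, which is how a naive decoder without resynchronization would read it. The strand and the helper name `positional_mismatches` are illustrative assumptions, not taken from the research described here.

```python
def positional_mismatches(sent: str, received: str) -> int:
    """Count positions where the two strings disagree, index by index.

    Any difference in length is counted as additional mismatches.
    """
    diffs = sum(a != b for a, b in zip(sent, received))
    return diffs + abs(len(sent) - len(received))


original = "ACGTTGCAAC"  # hypothetical stored strand

# Substitution: the C at index 1 becomes a T; the length is unchanged.
substituted = original[:1] + "T" + original[2:]

# Deletion: the C at index 1 is dropped; every later base shifts left by one.
deleted = original[:1] + original[2:]

# Insertion: an extra G is added at index 1; every later base shifts right by one.
inserted = original[:1] + "G" + original[1:]

sub_errors = positional_mismatches(original, substituted)
del_errors = positional_mismatches(original, deleted)
ins_errors = positional_mismatches(original, inserted)

print(sub_errors, del_errors, ins_errors)
```

A single substitution damages one position, while a single insertion or deletion shifts everything after it, so a position-by-position reader sees many errors from one event.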

Other forms of DNA storage got past these replication errors by repeating the code for the data 10 to 15 times over, but that's a massive waste of space. In the new method described in the team's research paper, however, they build the data into the DNA in a lattice shape, wherein each bit of data reinforces the next, so that it only needs to be read once.
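The repetition approach mentioned above can be sketched as a column-wise majority vote over several copies. This is a minimal illustration under the assumption that the copies are already aligned and contain only substitution errors; real pipelines must first align reads to cope with insertions and deletions. The function name `majority_vote` and the sample reads are hypothetical.

```python
from collections import Counter


def majority_vote(copies: list[str]) -> str:
    """Return the most common base in each column of equal-length reads."""
    if len({len(c) for c in copies}) != 1:
        raise ValueError("majority vote assumes aligned, equal-length copies")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))


stored = "ACGTACGT"
reads = [
    "ACGTACGT",
    "ACTTACGT",  # substitution at index 2
    "ACGTACGA",  # substitution at index 7
    "ACGTACGT",
    "CCGTACGT",  # substitution at index 0
]

recovered = majority_vote(reads)
overhead = len(reads)  # nucleotides spent per payload nucleotide

print(recovered == stored, overhead)
```

The vote recovers the strand here, but only by spending five nucleotides for every nucleotide of payload, which is the waste of space the passage refers to.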

They also developed an algorithm that overcomes insertion, deletion, and substitution errors all at once, making DNA-based digital data storage far more efficient. It's why the team could so readily fit "The Wizard of Oz" onto strands of DNA without replicating the combination of A, C, T, and G bases many times over.


BIOL191 HM - Biology Colloquium (taken twice)

Instructor: Staff

Offered: Fall and Spring

Description: Oral presentations and discussions of selected topics including recent developments. Participants include biology majors, faculty members, and visiting speakers. Required for junior and senior biology majors. No more than 2.0 credits can be earned for departmental seminars/colloquia.

Prerequisites: HMC Biology (including joint majors) only.

MATH198 HM - Undergraduate Mathematics Forum (preferably taken in the junior year)

Instructors: Castro, Jacobsen, Orrison, Weinburd, Zinn-Brooks H, Zinn-Brooks L

Offered: Fall and Spring

Description: The goal of this course is to improve students' ability to communicate mathematics, both to a general and technical audience. Students will present material on assigned topics and have their presentations evaluated by students and faculty. This format simultaneously exposes students to a broad range of topics from modern and classical mathematics. Required for all majors; recommended for all joint CS-math majors and mathematical biology majors, typically in the junior year.

MCBI199 HM - Joint Colloquium for the Mathematical and Computational Biology Major

Instructor: Staff

Offered: Fall and Spring

Description: Students registered for joint colloquium must attend a fixed number of colloquium talks during the semester in any field(s) related to their interests. The talks may be at any member of The Claremont Colleges or a nearby university and may be in any of a wide array of fields including biology, mathematics, computer science and other science and engineering disciplines including bioengineering, cognitive science, neuroscience, biophysics, and linguistics. Students enrolled in the joint colloquium are required to submit a short synopsis of each talk that they attend. No more than 2.0 credits can be earned for departmental seminars/colloquia.


DNA: The Ultimate Hard Drive

When it comes to storing information, hard drives don't hold a candle to DNA. Our genetic code packs billions of gigabytes into a single gram. A mere milligram of the molecule could encode the complete text of every book in the Library of Congress and have plenty of room to spare. All of this has been mostly theoretical—until now. In a new study, researchers stored an entire genetics textbook in less than a picogram of DNA—one trillionth of a gram—an advance that could revolutionize our ability to save data.

A few teams have tried to write data into the genomes of living cells. But the approach has a couple of disadvantages. First, cells die—not a good way to lose your term paper. They also replicate, introducing new mutations over time that can change the data.

To get around these problems, a team led by George Church, a synthetic biologist at Harvard Medical School in Boston, created a DNA information-archiving system that uses no cells at all. Instead, an inkjet printer embeds short fragments of chemically synthesized DNA onto the surface of a tiny glass chip. To encode a digital file, researchers divide it into tiny blocks of data and convert these data not into the 1s and 0s of typical digital storage media, but rather into DNA’s four-letter alphabet of As, Cs, Gs, and Ts. Each DNA fragment also contains a digital "barcode" that records its location in the original file. Reading the data requires a DNA sequencer and a computer to reassemble all of the fragments in order and convert them back into digital format. The computer also corrects for errors; each block of data is replicated thousands of times so that any chance glitch can be identified and fixed by comparing it to the other copies.
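The block-plus-barcode idea can be sketched in a few lines. This is an illustrative toy, not the Church team's actual scheme: it assumes a simple mapping of 2 bits per base (00→A, 01→C, 10→G, 11→T), a 2-byte address prefix as the "barcode," and error-free reads. All names and sizes are hypothetical.

```python
BASES = "ACGT"       # assumed mapping: 00->A, 01->C, 10->G, 11->T
ADDRESS_BYTES = 2    # assumed size of the address "barcode"


def bytes_to_dna(data: bytes) -> str:
    """Map each byte to four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11] for byte in data for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(dna: str) -> bytes:
    """Inverse of bytes_to_dna for error-free strands."""
    values = [BASES.index(ch) for ch in dna]
    out = bytearray()
    for i in range(0, len(values), 4):
        a, b, c, d = values[i:i + 4]
        out.append((a << 6) | (b << 4) | (c << 2) | d)
    return bytes(out)


def encode_file(data: bytes, block_size: int = 4) -> list[str]:
    """Split data into blocks and prefix each with its block index."""
    fragments = []
    for index, start in enumerate(range(0, len(data), block_size)):
        address = index.to_bytes(ADDRESS_BYTES, "big")
        fragments.append(bytes_to_dna(address + data[start:start + block_size]))
    return fragments


def decode_file(fragments: list[str]) -> bytes:
    """Reassemble blocks in address order, whatever order they were read in."""
    blocks = {}
    for fragment in fragments:
        raw = dna_to_bytes(fragment)
        blocks[int.from_bytes(raw[:ADDRESS_BYTES], "big")] = raw[ADDRESS_BYTES:]
    return b"".join(blocks[i] for i in sorted(blocks))


message = b"DNA stores data"
fragments = encode_file(message)
shuffled = list(reversed(fragments))  # a pool of DNA has no inherent order
print(decode_file(shuffled) == message)
```

Because each fragment carries its own address, the decoder does not depend on the order in which fragments come off the sequencer.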

To demonstrate its system in action, the team used the DNA chips to encode a genetics book co-authored by Church. It worked. After converting the book into DNA and translating it back into digital form, the team’s system had a raw error rate of only two errors per million bits, amounting to a few single-letter typos. That is on par with DVDs and far better than magnetic hard drives. And because of their tiny size, DNA chips are now the storage medium with the highest known information density, the researchers report online today in Science.

Don’t replace your flash drive with genetic material just yet, however. The cost of the DNA sequencer and other instruments "currently makes this impractical for general use," says Daniel Gibson, a synthetic biologist at the J. Craig Venter Institute in Rockville, Maryland, "but the field is moving fast and the technology will soon be cheaper, faster, and smaller." Gibson led the team that created the first completely synthetic genome, which included a "watermark" of extra data encoded into the DNA. The researchers used a three-letter coding system that is less efficient than the Church team's but has built-in safeguards to prevent living cells from translating the DNA into proteins. "If DNA is going to be used for this purpose, and outside a laboratory setting, then you would want to use DNA sequence that is least likely to be expressed in the environment," he says. Church disagrees. Unless someone deliberately "subverts" his DNA data-archiving system, he sees little danger.


Abstract

Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed–Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine–cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.
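The sequence constraints mentioned in the abstract (avoiding excess repeats and keeping windowed GC content within bounds) can be checked with a short sketch. The thresholds and function names below are hypothetical choices for illustration, not values taken from the paper.

```python
from itertools import groupby


def max_homopolymer_run(strand: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty strand)."""
    return max((len(list(group)) for _, group in groupby(strand)), default=0)


def windowed_gc(strand: str, window: int) -> list[float]:
    """GC fraction of every length-`window` slice of the strand.

    Returns an empty list when the strand is shorter than the window.
    """
    return [
        sum(base in "GC" for base in strand[i:i + window]) / window
        for i in range(len(strand) - window + 1)
    ]


def satisfies_constraints(
    strand: str,
    max_run: int = 3,      # assumed limit on homopolymer length
    window: int = 8,       # assumed window size
    gc_low: float = 0.25,  # assumed lower GC bound
    gc_high: float = 0.75, # assumed upper GC bound
) -> bool:
    """True when the strand passes both the run-length and GC checks."""
    if max_homopolymer_run(strand) > max_run:
        return False
    return all(gc_low <= gc <= gc_high for gc in windowed_gc(strand, window))


print(satisfies_constraints("ACGTACGTACGT"))
print(satisfies_constraints("AAAAACGTACGT"))
print(satisfies_constraints("GCGCGCGCGCAT"))
```

A checker like this only detects violations; an encoder that respects constraints has to avoid producing them in the first place, which costs some code rate.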

DNA is an ideal molecular-scale storage medium for digital information (1–7). An arbitrary digital message can be encoded as a DNA sequence and chemically synthesized as a pool of oligonucleotide strands. These strands can be stored, duplicated, or transported through space and time. DNA sequencing can then be used to recover the digital message, hopefully exactly. Advances in the cost and scale of DNA synthesis and sequencing are increasingly making DNA-based information storage economically feasible. While synthesis today costs ∼$0.001 per nucleotide, some observers project a decrease of orders of magnitude (8). A strand of DNA containing the four natural nucleotides can encode a maximum of 2 bits per DNA character. With this maximum code rate (defined as rate r = 1.0), no error correction is possible, because there is no redundancy in the message. However, both DNA synthesis and sequencing introduce errors in the underlying DNA pools, requiring efficient error-correcting codes (ECCs) to extract the underlying information. An ECC reduces the code rate but is necessary to protect against errors when a message is encoded as DNA characters, and, later, when decoding DNA characters back to message bits.

An ECC must correct the three kinds of errors associated with DNA: substitutions of one base by another, as well as spurious insertions or deletions of nucleotides in the DNA strand (indels). Indels represent more than 50% of observed DNA errors (Fig. 1A). However, most DNA encoding schemes use ECCs that can only correct substitutions, a standard task in coding theory (9–12). The coding theory literature reports only a few ECCs that correct for deletions, and there are no well-established methods for all three of deletions, insertions, and substitutions (13, 14). Prior DNA storage implementations correct for indels by sequencing to high depth, followed by multiple alignment and consensus base calling (Fig. 1B) (1, 3, 6). This approach represents an inefficient “repetition” ECC. Moreover, repetition ECCs only correct errors associated with DNA sequencing. Correcting synthesis errors using this approach also requires pooling multiple synthesis reactions, which is the most costly and time-consuming step in DNA-based information storage (2). Finally, alignment and consensus decoding does not scale well beyond small proof-of-principle experiments. In sum, ECCs that require high-depth repetition in the stored DNA have very small code rates because a large number of stored nucleotides are required per recovered message bit.

Fig. 1. (A) Distribution of insertion and deletion errors (indels) in a typical DNA storage pipeline (Table 1). ins, insertion; del, deletion; sub, substitution. (B) (Left) Existing DNA-based encoding methods require sequence-level redundancy, strand alignment, and consensus calling to reduce indel errors. (Right) HEDGES corrects indel and substitution errors from a single read. (C) Overview of the interleaved encoding pipeline used throughout this paper. (D) HEDGES encoding algorithm in the simplest case: half-rate code, no sequence constraints. The HEDGES encoding algorithm is a variant of plaintext auto-key, but with redundancy introduced because (in the case of a half-rate code, for example) 1 bit of input generates 2 bits of output. Hashing each bit value with its strand ID, bit index, and a few previous bits “poisons” bad decoding hypotheses, allowing for correction of indels. (E) An example HEDGES encode, encoding bit 9 of the shown data strand (red box). As in D, half-rate code, no sequence constraints. (F) The HEDGES decoding algorithm is a greedy search on an expanding tree of hypotheses. Each hypothesis simultaneously guesses one or more message bits v_i, its bit position index i, and its corresponding DNA character position index k. A “greediness parameter” P_ok (see SI Appendix, Supplementary Text) limits exponential tree growth: Most spawned nodes are never revisited. (G) Illustration of a simplified HEDGES decode. The example bit strand message is encoded and then sequenced with an insertion error. Blue squares give decoding action order: 1, initialize Start node; 2 to 5, explore best hypothesis at each step; and 6, traceback and output the best hypothesis message. DNA image credit: freepik.com.

Here, we describe an algorithm to achieve high code rates with a minimum requirement for redundancy in the stored DNA. We adapt the coding theory approach of constructing an “inner” code (so termed because it is closest to the physical channel, the DNA) to correct most indel and substitution errors. The inner code translates between a string of {A, C, G, T} and an intermediate binary string of {0, 1}, with no added or dropped bits even in the presence of indels in the DNA string. An efficient “outer” code corrects residual errors with extremely high probability. Our inner code, termed HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search), is optimized for real-world DNA-based information storage: 1) It finds and corrects indels, or converts them to substitutions (which it also usually corrects). 2) It admits varying code rates, with correspondingly greater tolerance of DNA errors at lower code rates. 3) It is adaptable to the experimental constraints on DNA synthesis, for example, balanced GC content and the avoidance of homopolymer runs. 4) It has, effectively, zero strand ordering errors, removing a source of large bursts of errors. Although this paper’s main contribution is an efficient indel-correcting code, we also develop a specific implementation of the outer Reed–Solomon (RS) code for DNA-based storage. The RS code is applied “diagonally” across multiple DNA strands (Fig. 1C) to more evenly distribute synthesis and sequencing errors, which improves error correction performance (15). We test our strategy (both in silico and in vitro) with degraded DNA oligonucleotide pools. Based on these experiments, we use computer simulations to demonstrate that this coding strategy enables error-free exabyte (10^18)-scale DNA storage.

Discussion

HEDGES is designed to be flexible with respect to DNA strand lengths, DNA sequencing and synthesis technologies, choices of outer code, and interleaving details. The most important feature of HEDGES is that it always either 1) recovers “perfect” synchronization of the individual DNA strand to which it is applied (that is, completely eliminates insertion and deletion errors) or else 2) signals that it is unable to do so by a decode failure. Here “perfect” means that our reported bit and byte error rates, which are small enough to be completely corrected by a standard outer code such as RS, are already inclusive of any residual instances of missynchronization.

In the feasible (green) regions of Fig. 2, HEDGES decode failures occur about every 10^4 to 10^5 nucleotides (bottom cells). Two strategies are possible: 1) We can keep these strands and mark as erasures the bits after the failure point, or 2) we can, instead, use another strand from the pool showing the same strand ID, thus increasing the sequencing depth requirement by a tiny amount. The performance values shown in Fig. 2 use strategy 1; those in Table 2 use strategy 2. Importantly, HEDGES allows constraints on the encoded DNA strands such as reducing homopolymer runs and maintaining a balanced GC content. SI Appendix, Fig. S3, when compared to Fig. 2, shows that such constraints impose little penalty on both the code rate and error correction level. Thus, we demonstrate that both are viable strategies for error correction.

We performed both in silico and in vitro experiments to validate HEDGES across a variety of error rates. Such statistical analyses of rare events, based on both experimental data and simulations, should be a required part of all future proposals for DNA data storage. HEDGES performance on real DNA with observed total errors of ∼1% and ∼3% (Tables 1 and 2) was comparable to computer simulation at the same total DNA error rates and to the statistical model we built using simple Poisson random errors (Fig. 2). In both cases, HEDGES demonstrates the feasibility of large-scale error-free recovery at code rates up to 0.6 (1.2 bits per nucleotide) for ∼1% DNA errors and 0.5 (1 bit per nucleotide) for ∼3% DNA errors. Error-free exabyte-scale storage is feasible at DNA error rates as large as 7 to 10% with a code rate of 0.25 (0.5 bits per nucleotide). Thus, HEDGES paves the way for robust error correction in large-scale but error-prone pooled synthesis of large DNA libraries.



a Laboratory of Chemical Biology and State Key Laboratory of Rare Earth Resources Utilization, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun, Jilin 130022, P. R. China

b University of Chinese Academy of Sciences, Beijing 100039, P. R. China

c University of Science and Technology of China, Hefei, Anhui 230029, P. R. China

Abstract

DNA metallization has witnessed tremendous growth and development, from the initial simple synthesis aimed at manufacturing conductive metal nanowires to the current fabrication of various nanostructures for applications in areas as diverse as nanolithography, energy conversion and storage, catalysis, sensing, and biomedical engineering. To this end, we present a comprehensive review summarizing the research on DNA metallization that has appeared since the concept was first proposed in 1998. We start with a brief presentation of the basic knowledge of DNA and its unique advantages in the template-directed growth of metal nanomaterials, followed by a systematic summary of the various synthetic methods developed to date to deposit metals on DNA scaffolds. Then, the use of DNAs with different sequences, conformations, and structures for tuning the synthesis of feature-rich metal nanostructures is discussed. Afterwards, the discussion turns to the applications of these metal nanomaterials in the fields mentioned above, emphasizing the key role DNA metallization plays in enabling high performance. Finally, the current status and some future prospects and challenges in this field are summarized. As such, this review should help promote the further development of DNA metallization by attracting researchers from various communities, including chemistry, biology, physiology, material science, and nanotechnology as well as other disciplines.


Supplementary Information 1

This file contains Supplementary Tables 1-4, Supplementary Figures 1-9, Supplementary Methods and Data, a Supplementary Discussion and Supplementary references. This file was replaced on 14 February 2013 to correct the DNA sequence in Supplementary Figure 8, which was misaligned. (PDF 2027 kb)

Supplementary Information 2

This file contains the full formal specification of the digital information encoding scheme. (PDF 244 kb)

Supplementary Information 3

This file contains the FastQC QC report on the Illumina HiSeq 2000 sequencing run. (PDF 411 kb)

Supplementary Data 1

This zipped file contains the five original files encoded and decoded in this study, namely wssnt10.txt (ASCII text file containing text of all 154 Shakespeare sonnets), watsoncrick.pdf (PDF of Watson & Crick’s (1953) paper describing the structure of DNA), MLK_excerpt_VBR_45-85.mp3 (MP3 file containing a 26 s excerpt from Martin Luther King's 1963 "I Have A Dream" speech), EBI.jp2 (JPEG 2000 format medium resolution colour photograph of the European Bioinformatics Institute) and View_huff3.cd.new (ASCII text file defining the Huffman code used to convert bytes of encoded files to base 3). (ZIP 646 kb)

Supplementary Data 2

This file contains the GATK ErrorRatePerCycle report on the Illumina HiSeq 2000 sequencing run. (TXT 6 kb)