Double-Masked Error Correction Code Transformer and Coding Schemes for High-Density DNA Storage with Low Error Rates | Twist Bioscience (2025)

Publications

Thesis,&nbspJan 2024

Double-Masked Error Correction Code Transformer and Coding Schemes for High-Density DNA Storage with Low Error Rates

Seong-Joon, P

Product Used

Oligo Pools

Abstract

In this dissertation, two main topics are discussed: improving the decoding per- formance of short length codes using the error correction code transformer and a de- sign of high information density constrained codes and error correcting codes that reduces error rates and costs in deoxyribo nucleic acid (DNA) storage. For these two topics, four main contributions are addressed as follows: i) a design of the double- masked (DM) error correction code transformer (ECCT) architecture, ii) encoding and decoding schemes utilizing the soft information of variable-length (VL) reads that im- prove the trade-off between writing and reading costs in DNA storage, iii) the integra- tion method of the balanced bit insertion-based constrained (BIC) code and the error correcting code for DNA storage, and iv) the design method of the iterative constrained coding scheme which is robust to error propagation in DNA storage. First, the DM ECCT architecture utilizing two different mask matrices is proposed. In general, the masked self-attention block of the ECCT utilizes the mask matrix de- rived from the parity check matrix to facilitate learning the relevance between code- word bits. Since a large number of parity check matrices exists for the same codebook, it is important to identify the optimal parity check matrix for ECCT. The systematic mask matrix derived from the systematic form of the parity check matrix is shown to be suitable for ECCT architecture. It shows an improvement in both decoding perfor- mance and computational complexity. One surprising observation is that the system- atic form of the matrices is typically helpful for the error correcting code encoding process rather than decoding, but it is shown that they play a pivotal role also in the decoding performance of ECCT. In addition, I propose the DM ECCT, which consists of two parallel masked self-attention blocks that utilize two different mask matrices. The DM ECCT significantly improves the decoding performance by capturing diverse features of parity check matrix from two different mask matrices and outperforms the past neural network-based error correcting code decoders. Second, encoding and decoding schemes of DNA storage utilizing the soft infor- mation of VL reads are introduced. Conventionally, the encoding scheme of DNA stor- age employs multiple error correcting codes to handle all error types in DNA storage. However, the proposed scheme efficiently reduces the writing cost by employing only a single low-density parity-check (LDPC) code in DNA storage without additional redundancy. The decoding scheme is designed to reduce the reading cost by fully uti- lizing the soft information of VL reads. Rather than discarding reads with deletion errors, the proposed clustering and multiple sequence alignment steps enable the de- coder to fully utilize shortened reads. In these steps, abnormal reads are discarded and only high-confidence reads are used during the decoding process. Finally, a practical DNA storage experiment demonstrates that the proposed schemes highly reduce both writing and reading costs. Third, an integration coding method that combines constrained codes and error correcting codes is proposed. First, BIC codes are proposed, which convert the binary data to oligo sequences satisfying the RL constraint. When the specific conditions are satisfied, BIC codes insert dummy bits in the predetermined positions to meet the RL constraint. The tight lower and upper bounds on the average information density of the BIC codes are derived, demonstrating that the BIC code can achieve the capacity by 99.62% for the maximum RL of three. In addition, balanced BIC (B-BIC) codes are proposed to satisfy the GC-content constraint by combining BIC codes with the modi- fied Knuths balancing technique for DNA sequences. Finally, I propose an integration coding method that combines B-BIC codes with LDPC codes. The simulation results demonstrate that the proposed coding method ensures high error correction capability while satisfying RL and GC-content constraints in DNA storage. Lastly, the encoding method of constrained code robust to error propagation in DNA storage is introduced. By applying randomization step, M -ary mapping (M = 3 · 4m−1) step, and verification step, the maximum RL of m and the GC-content ratio between 0.5±α can be satisfied, for any m and α. The proposed M -ary mapping table is constructed based on the substitution error probability between bases. Considering substitution error probabilities, it is designed in a greedy way to have only a one-bit difference for adjacent symbols. This mapping table is shown to reduce the average bit error caused by one base substitution error compared to the randomly generated mapping table. Since the encoding method converts binary data into M -ary symbol and then into DNA sequences of length m, when errors occur in the DNA sequence, they remain in that DNA sequence of length m and do not propagate to other DNA sequences. This characteristic demonstrates that the proposed method is robust to error propagation. Keywords: Channel coding, coding theory, constrained codes, deoxyribo nucleic acid (DNA), DNA storage, error correction code transformer (ECCT), low-density parity-check (LDPC) codes, machine learning, neural network, neural network-based decoder Student number: 2019-25142

Product Used

Oligo Pools

Related Publications

All Publications

Double-Masked Error Correction Code Transformer and Coding Schemes for High-Density DNA Storage with Low Error Rates | Twist Bioscience (2025)

References

Top Articles
Latest Posts
Recommended Articles
Article information

Author: The Hon. Margery Christiansen

Last Updated:

Views: 6125

Rating: 5 / 5 (50 voted)

Reviews: 81% of readers found this page helpful

Author information

Name: The Hon. Margery Christiansen

Birthday: 2000-07-07

Address: 5050 Breitenberg Knoll, New Robert, MI 45409

Phone: +2556892639372

Job: Investor Mining Engineer

Hobby: Sketching, Cosplaying, Glassblowing, Genealogy, Crocheting, Archery, Skateboarding

Introduction: My name is The Hon. Margery Christiansen, I am a bright, adorable, precious, inexpensive, gorgeous, comfortable, happy person who loves writing and wants to share my knowledge and understanding with you.