Publications
Thesis, Jan 2024
Double-Masked Error Correction Code Transformer and Coding Schemes for High-Density DNA Storage with Low Error Rates
Seong-Joon, P
Product Used
Oligo Pools
Abstract
In this dissertation, two main topics are discussed: improving the decoding per- formance of short length codes using the error correction code transformer and a de- sign of high information density constrained codes and error correcting codes that reduces error rates and costs in deoxyribo nucleic acid (DNA) storage. For these two topics, four main contributions are addressed as follows: i) a design of the double- masked (DM) error correction code transformer (ECCT) architecture, ii) encoding and decoding schemes utilizing the soft information of variable-length (VL) reads that im- prove the trade-off between writing and reading costs in DNA storage, iii) the integra- tion method of the balanced bit insertion-based constrained (BIC) code and the error correcting code for DNA storage, and iv) the design method of the iterative constrained coding scheme which is robust to error propagation in DNA storage. First, the DM ECCT architecture utilizing two different mask matrices is proposed. In general, the masked self-attention block of the ECCT utilizes the mask matrix de- rived from the parity check matrix to facilitate learning the relevance between code- word bits. Since a large number of parity check matrices exists for the same codebook, it is important to identify the optimal parity check matrix for ECCT. The systematic mask matrix derived from the systematic form of the parity check matrix is shown to be suitable for ECCT architecture. It shows an improvement in both decoding perfor- mance and computational complexity. One surprising observation is that the system- atic form of the matrices is typically helpful for the error correcting code encoding process rather than decoding, but it is shown that they play a pivotal role also in the decoding performance of ECCT. In addition, I propose the DM ECCT, which consists of two parallel masked self-attention blocks that utilize two different mask matrices. The DM ECCT significantly improves the decoding performance by capturing diverse features of parity check matrix from two different mask matrices and outperforms the past neural network-based error correcting code decoders. Second, encoding and decoding schemes of DNA storage utilizing the soft infor- mation of VL reads are introduced. Conventionally, the encoding scheme of DNA stor- age employs multiple error correcting codes to handle all error types in DNA storage. However, the proposed scheme efficiently reduces the writing cost by employing only a single low-density parity-check (LDPC) code in DNA storage without additional redundancy. The decoding scheme is designed to reduce the reading cost by fully uti- lizing the soft information of VL reads. Rather than discarding reads with deletion errors, the proposed clustering and multiple sequence alignment steps enable the de- coder to fully utilize shortened reads. In these steps, abnormal reads are discarded and only high-confidence reads are used during the decoding process. Finally, a practical DNA storage experiment demonstrates that the proposed schemes highly reduce both writing and reading costs. Third, an integration coding method that combines constrained codes and error correcting codes is proposed. First, BIC codes are proposed, which convert the binary data to oligo sequences satisfying the RL constraint. When the specific conditions are satisfied, BIC codes insert dummy bits in the predetermined positions to meet the RL constraint. The tight lower and upper bounds on the average information density of the BIC codes are derived, demonstrating that the BIC code can achieve the capacity by 99.62% for the maximum RL of three. In addition, balanced BIC (B-BIC) codes are proposed to satisfy the GC-content constraint by combining BIC codes with the modi- fied Knuths balancing technique for DNA sequences. Finally, I propose an integration coding method that combines B-BIC codes with LDPC codes. The simulation results demonstrate that the proposed coding method ensures high error correction capability while satisfying RL and GC-content constraints in DNA storage. Lastly, the encoding method of constrained code robust to error propagation in DNA storage is introduced. By applying randomization step, M -ary mapping (M = 3 · 4m−1) step, and verification step, the maximum RL of m and the GC-content ratio between 0.5±α can be satisfied, for any m and α. The proposed M -ary mapping table is constructed based on the substitution error probability between bases. Considering substitution error probabilities, it is designed in a greedy way to have only a one-bit difference for adjacent symbols. This mapping table is shown to reduce the average bit error caused by one base substitution error compared to the randomly generated mapping table. Since the encoding method converts binary data into M -ary symbol and then into DNA sequences of length m, when errors occur in the DNA sequence, they remain in that DNA sequence of length m and do not propagate to other DNA sequences. This characteristic demonstrates that the proposed method is robust to error propagation. Keywords: Channel coding, coding theory, constrained codes, deoxyribo nucleic acid (DNA), DNA storage, error correction code transformer (ECCT), low-density parity-check (LDPC) codes, machine learning, neural network, neural network-based decoder Student number: 2019-25142
Product Used
Oligo Pools
Related Publications
All Publications