RNA sequencing makes it possible to study an individual’s transcriptome and to determine allelic expression ratios. Single-molecule protocols generate multi-kilobase reads that are longer than most transcripts, so a single read can span a complete haplotype isoform. This allows the reads to be partitioned into the two parental haplotypes. Although single-molecule reads are long, their relatively high error rate limits the ability to accurately detect genetic variants and assemble them into haplotype-specific isoforms.
Researchers from UCLA have developed HapIso (Haplotype-specific Isoform Reconstruction), a method that tolerates the relatively high error rate of the single-molecule platform and partitions the isoform reads into the parental alleles. Phasing the reads according to their allele of origin allows the method to efficiently distinguish read errors from true biological mutations. HapIso uses a k-means clustering algorithm to group the reads into two meaningful clusters, maximizing the similarity of the reads within each cluster and minimizing the similarity of reads from different clusters. Each cluster corresponds to a parental haplotype. The researchers used family pedigree information to evaluate their approach. Experimental validation suggests that HapIso tolerates the relatively high error rate and accurately partitions the reads into the parental alleles of the isoform transcripts. They also applied HapIso to novel clinical single-molecule RNA-Seq data to estimate the allele-specific expression (ASE) of genes of interest. The method was able to correct the reads and identify a Glu1883Lys point mutation of clinical significance, validated by the GeneDx HCM Panel. Furthermore, HapIso is the first method able to reconstruct haplotype-specific isoforms from long single-molecule reads.
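The clustering idea can be illustrated with a minimal sketch. It assumes each read has already been reduced to a row of 0/1 values (1 for a mismatch against the reference, 0 for the reference allele) and runs a plain Lloyd-style k-means with k = 2. The function names, the farthest-row initialisation and the toy data are assumptions made for illustration only; they are not taken from the HapIso code.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length binary rows differ."""
    return sum(x != y for x, y in zip(a, b))

def two_means(rows, max_iters=50, seed=0):
    """Partition binary rows into two clusters with Lloyd's algorithm (k = 2).

    Hypothetical helper: initialisation picks one random row and the row
    farthest from it, which is an assumption, not HapIso's documented choice.
    """
    rng = random.Random(seed)
    first = rng.choice(rows)
    second = max(rows, key=lambda r: hamming(r, first))
    centroids = [[float(x) for x in first], [float(x) for x in second]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each read joins the nearest centroid
        # (squared Euclidean distance).
        new_labels = []
        for row in rows:
            dists = [sum((x - m) ** 2 for x, m in zip(row, c)) for c in centroids]
            new_labels.append(0 if dists[0] <= dists[1] else 1)
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the column-wise mean of its reads.
        for c in (0, 1):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data: three noisy reads from each of two made-up haplotypes.
reads = [
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 1],  # one simulated sequencing error
    [1, 0, 0, 0, 0, 1],  # one simulated sequencing error
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0],  # one simulated sequencing error
    [0, 1, 1, 1, 1, 0],  # one simulated sequencing error
]
labels, centroids = two_means(reads)
print(labels)
```

In this toy input the true haplotypes differ at every variable position, so a single error per read does not move a read closer to the wrong centroid. Which cluster receives label 0 depends on the random initial row, so only the grouping is meaningful, not the label values.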
Overview of HapIso
(A) The algorithm takes as input long single-molecule reads that have been mapped to the reference genome. (B) The transcribed segments are identified as contiguous regions of equivalently covered positions. (C) The aligned nucleotides of each transcribed segment are condensed into a binary matrix whose width equals the number of variable positions. An entry is "1" if a mismatch is observed at that position and "0" if the read matches the reference allele. (D) Reads restricted to the transcribed segment (rows of the binary matrix) are partitioned into two clusters using the 2-means clustering algorithm. Each cluster corresponds to a local haplotype. (E) A segment graph is constructed to incorporate the linkage between the alleles. The edges of the graph connect the local haplotypes. The minimum number of corrections is applied to partition the graph into two independent components corresponding to the full-length parental gene haplotypes. (F) An error-correction protocol is applied to the reads from the same cluster. The protocol corrects the sequencing errors and produces corrected haplotype-specific isoforms.
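Steps (C) and (F) can be made concrete with a small sketch. It assumes the reads of one transcribed segment are equal-length strings aligned to the reference without insertions or deletions, which is a strong simplification for real single-molecule data. The summary does not spell out the error-correction protocol, so the per-column majority vote used here is a stand-in chosen for illustration, and both function names are hypothetical.

```python
from collections import Counter

def binary_matrix(reads, reference):
    """Step (C), simplified: condense aligned reads into a 0/1 matrix.

    Only variable positions (where at least one read differs from the
    reference) are kept. 1 = mismatch, 0 = reference allele.
    """
    variable = [
        i for i, ref_base in enumerate(reference)
        if any(read[i] != ref_base for read in reads)
    ]
    matrix = [[int(read[i] != reference[i]) for i in variable] for read in reads]
    return variable, matrix

def consensus_of_cluster(cluster_reads):
    """Step (F), assumed protocol: per-column majority vote within a cluster."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*cluster_reads)
    )

# Made-up reference segment and reads.
reference = "ACGTAC"
reads = ["ACGTAC", "ATGTAC", "ATGTAC", "ACGTAA"]

variable, matrix = binary_matrix(reads, reference)
print(variable)  # positions where any read differs from the reference
print(matrix)    # one row per read, one column per variable position

# Suppose clustering placed these three reads in the same haplotype cluster.
print(consensus_of_cluster(["ATGTAC", "ATGTAC", "ATGTAA"]))
```

The rows of `matrix` are the kind of input a 2-means step such as (D) would cluster. A majority vote removes isolated errors only when most reads in the cluster agree at a position, so it is reliable mainly where coverage is adequate.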
Availability – The open source Python implementation of HapIso is freely available for download at https://github.com/smangul1/HapIso/.