Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Now, an international team led by researchers at USC has sequenced a Chinese individual, HX1, by single-molecule real-time (SMRT) long-read sequencing, constructed a physical map using NanoChannel arrays, and generated a de novo assembly of 2.93 Gb (contig N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific sequences, including 4.1 Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. These results imply that improved characterization of functional variation in the genome may require the use of a range of genomic technologies on diverse human populations.
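The contig and scaffold N50 values quoted above follow the standard definition: the length L such that sequences of length L or longer together cover at least half of the total assembly. A minimal sketch of that calculation is shown here; the function name `n50` and the toy lengths are illustrative assumptions, not the authors' pipeline or the HX1 data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, arbitrary units).
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))  # prints 80: 100 + 80 = 180 covers half of 300
```

Because N50 depends only on the length distribution, a scaffold N50 (22.0 Mb here) is typically larger than the contig N50 (8.3 Mb) of the same assembly, since scaffolding joins contigs into longer sequences.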
Novel gene inferred from Iso-Seq long-read RNA sequencing
(a) Integrative Genomics Viewer display of alignment files generated from Iso-Seq. Over 100 long reads can be mapped to this locus on chr20q13.12 in the GRCh38 assembly. (b) UCSC Genome Browser screenshot of the predicted transcript models. The transcripts are not detected in RNA-Seq data from nine cell lines in ENCODE. This gene is conserved in primates but not in other vertebrate species, and is not in segmental duplication regions or simple repeat regions. (c) PCR validation of the transcript TCONS_0035154 using a primer pair that targeted exons 1 and 5. Several PCR products of different sizes can be detected, representing different isoforms. MC239 is a Caucasian sample and MA296 is an East Asian sample. (d) Sanger sequencing confirmed the splicing events predicted by the Iso-Seq data.