Comparing microbial sequencing data is critical to understanding the dynamics of microbial communities. Alignment-based tools for analyzing metagenomic datasets require reference sequences and read alignments. The available alignment-free dissimilarity approaches model the background sequences with a Fixed Order Markov Chain (FOMC), yielding promising results for the comparison of microbial communities. However, in FOMC, the number of parameters grows exponentially with the order of the Markov Chain (MC). Under a fixed high order, the parameters might not be accurately estimated because sequencing depth is limited.
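To make the exponential growth concrete, here is a minimal sketch that counts the free transition parameters of a fixed-order chain. It assumes a four-letter nucleotide alphabet and the standard counting argument (each context has one probability per symbol, minus one for the sum-to-one constraint). The function name `fomc_free_parameters` is hypothetical and is not part of the authors' pipeline.

```python
def fomc_free_parameters(order: int, alphabet_size: int = 4) -> int:
    """Count free transition parameters of a fixed-order Markov chain.

    Assumption: every one of the alphabet_size**order contexts has its own
    distribution over the next symbol, with (alphabet_size - 1) free values.
    """
    contexts = alphabet_size ** order
    return contexts * (alphabet_size - 1)


# Print the parameter count for a few even orders over the DNA alphabet.
for order in range(0, 11, 2):
    print(f"order {order:2d}: {fomc_free_parameters(order):,} free parameters")
```

At order 10 the count already exceeds three million, so many contexts would be observed rarely or never in a shallow sample.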
Here, researchers from the University of Southern California investigate an alternative to FOMC: modeling background sequences in metatranscriptomic data with the data-driven Variable Length Markov Chain (VLMC). The VLMC, originally designed for long sequences, was extended to high-throughput sequencing reads, and strategies to estimate the corresponding parameters were developed. The flexible number of parameters in VLMC avoids estimating the vast number of parameters of a high-order MC under limited sequencing depth. Unlike FOMC, where the order is selected manually, VLMC determines the MC order adaptively. Several beta diversity measures based on VLMC were applied to compare bacterial RNA-Seq and metatranscriptomic datasets. Experiments show that VLMC outperforms FOMC in modeling the background sequences in transcriptomic and metatranscriptomic samples.
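The idea of an adaptive order can be illustrated with a toy variable-length model, in which the next base is predicted from the longest stored context that matches the end of the sequence seen so far. The sketch below is only an illustration of that lookup rule. The contexts and probabilities in `TOY_VLMC` are made up, the helper names are hypothetical, and it does not reproduce the authors' parameter-estimation strategy for reads.

```python
import math

# Hypothetical toy model: contexts of different lengths, each mapping to a
# distribution over the next base. The empty context is the fallback.
TOY_VLMC = {
    "": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "CA": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
}


def longest_context(history: str, model: dict) -> str:
    """Return the longest stored context that is a suffix of `history`."""
    max_len = max(len(context) for context in model)
    for length in range(min(max_len, len(history)), -1, -1):
        suffix = history[len(history) - length:]
        if suffix in model:
            return suffix
    raise KeyError("the model must contain the empty context")


def read_log_likelihood(read: str, model: dict) -> float:
    """Sum log-probabilities of each base given its longest matching context."""
    total = 0.0
    for i, base in enumerate(read):
        context = longest_context(read[:i], model)
        total += math.log(model[context][base])
    return total


print(longest_context("GCA", TOY_VLMC))  # uses the order-2 context "CA"
print(longest_context("GG", TOY_VLMC))   # falls back to the empty context
```

Because only three contexts are stored, the effective order varies from 0 to 2 depending on the preceding bases, rather than being fixed in advance.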
Locations of the 88 metatranscriptomic samples from the global ocean, the reference tree, and the clustering trees based on different dissimilarity measures and background sequence models
(a) The distribution of the sampling locations. The map is based on OpenStreetMap, and the cartography in the OpenStreetMap map tiles is licensed under CC BY-SA. (b) The clustering tree with VLMC using and k = 6. (c) The clustering tree with FOMC using and k = 6. *‘SWGE’ samples were collected from different locations on two research cruises in the Equatorial North Atlantic Ocean and the South Pacific Subtropical Gyre.
Availability – A software pipeline is available at https://d2vlmc.codeplex.com