Many applications of multiple sequence alignments (MSAs) involve sequences that are not ideal: they can be incomplete, contain errors or simply be highly divergent, with many mismatches. This is very common when working with high-throughput sequencing data, for example in alignments based on de novo assembled transcripts, long-read sequencing data or mixed metagenomic datasets. MSAs based on these sequences often contain many gaps and regions of low-quality alignment. It is still common to edit MSAs manually before further analysis, but this approach is time-consuming and not easily reproducible.
We have developed a new command line tool, CIAlign, which can be used to automatically solve some of the most common problems with MSAs. CIAlign was developed by Charlotte Tumescheit, Andrew Firth and Katherine Brown in the Firth Lab, part of the Virology Division of the University of Cambridge Department of Pathology.
CIAlign is freely available at github.com/KatyBrown/CIAlign and is described in full in our bioRxiv preprint at doi.org/10.1101/2020.09.14.291484.
CIAlign targets four common features of complex MSAs:
- Low quality or incomplete ends of sequences leading to gaps and mismatches
- Insertions in a minority of sequences dominating the alignment
- Unexpectedly divergent sequences
- Very short sequences
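To make two of these features concrete, the following is a minimal Python sketch of removing very short sequences and removing alignment columns where an insertion is present in only a minority of sequences. It is a simplified illustration, not CIAlign's own implementation: the function names, the toy alignment and the thresholds (`min_length`, `min_fraction`) are all assumptions chosen for the example.

```python
def ungapped_length(seq, gap="-"):
    """Count the residues in an aligned sequence, ignoring gap characters."""
    return sum(1 for c in seq if c != gap)


def remove_short(alignment, min_length):
    """Keep only sequences with at least min_length non-gap residues."""
    return {
        name: seq
        for name, seq in alignment.items()
        if ungapped_length(seq) >= min_length
    }


def remove_minority_insertions(alignment, min_fraction, gap="-"):
    """Drop columns where fewer than min_fraction of sequences have a residue."""
    seqs = list(alignment.values())
    n = len(seqs)
    width = len(seqs[0])
    keep = [
        i for i in range(width)
        if sum(s[i] != gap for s in seqs) / n >= min_fraction
    ]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment: seq4 is very short, seq3 carries an insertion.
toy = {
    "seq1": "ATG--CGTA",
    "seq2": "ATG--CGTA",
    "seq3": "ATGTTCGTA",
    "seq4": "-----CG--",
}

filtered = remove_short(toy, min_length=5)
cleaned = remove_minority_insertions(filtered, min_fraction=0.5)
print(cleaned)
```

In this toy case the short sequence is removed first, so the insertion columns are then judged against the three remaining sequences. The order in which such steps run can change the result, which is one reason to choose functions and thresholds with the downstream analysis in mind.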
CIAlign also provides additional functions for working with MSAs. We have developed a new type of alignment visualisation, which shows the whole alignment in a single, publication-ready image. CIAlign can also generate various types of consensus sequence, sequence similarity matrices, sequence logos and coverage plots.
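As a rough illustration of what a consensus sequence and per-column coverage mean, here is a short Python sketch that computes both for a toy alignment. It assumes a simple majority-rule consensus and defines coverage as the fraction of sequences with a non-gap character in each column. The function names and data are hypothetical, and the definitions CIAlign itself uses may differ.

```python
from collections import Counter


def column_coverage(alignment, gap="-"):
    """Fraction of sequences with a non-gap character in each column."""
    seqs = list(alignment.values())
    n = len(seqs)
    return [sum(s[i] != gap for s in seqs) / n for i in range(len(seqs[0]))]


def majority_consensus(alignment):
    """Most frequent character in each column (gaps are counted like residues)."""
    seqs = list(alignment.values())
    consensus = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical toy alignment with one gap and one mismatch.
example = {
    "seqA": "ATGCA",
    "seqB": "ATG-A",
    "seqC": "TTGCA",
}

coverage = column_coverage(example)
consensus = majority_consensus(example)
print(consensus)
print(coverage)
```

Ties between equally frequent characters are not handled specially here. A real tool needs an explicit rule for ties and for gap-dominated columns.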
CIAlign is designed to be highly customisable. Users are encouraged to think carefully about their sequences and their downstream application and are able to select the most appropriate functions and settings to use. It is also highly transparent, with unambiguous log files allowing the user to clearly and reproducibly report their methodology.