A recurring challenge in interpreting genomic data is the assessment of results in the context of existing reference databases. Currently, there is no tool implementing automated, easy programmatic access to curated reference information stored in a diverse collection of large, public genomic databases.
gget is a free and open-source command-line tool and Python package that enables efficient querying of genomic reference databases, such as Ensembl. gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying required for genomic data analysis in a single line of code.
The manual and source code are available at https://github.com/pachterlab/gget.
Preprint: https://www.biorxiv.org/content/10.1101/2022.05.17.492392v3.full
gget consists of nine tools:
- gget ref: Fetch File Transfer Protocol (FTP) links and metadata for reference genomes or annotations from Ensembl by species.
- gget search: Fetch genes or transcripts from Ensembl using free-form search terms.
- gget info: Fetch extensive gene or transcript metadata from Ensembl, UniProt, and NCBI by Ensembl ID.
- gget seq: Fetch nucleotide or amino acid sequences of genes or transcripts from Ensembl or UniProt by Ensembl ID.
- gget blast: BLAST (Altschul et al., 1990, 1997) a nucleotide or amino acid sequence against any BLAST database.
- gget blat: Find the genomic location of a nucleotide or amino acid sequence using BLAT (Kent, 2002).
- gget muscle: Align multiple nucleotide or amino acid sequences to each other using the Muscle5 algorithm (Edgar, 2021).
- gget enrichr: Perform an enrichment analysis on a list of genes using Enrichr (Chen et al., 2013; Xie et al., 2021; Kuleshov et al., 2016) and an extensive collection of gene set libraries, including KEGG (Kanehisa and Goto, 2000; Kanehisa, 2019; Kanehisa et al., 2021) and Gene Ontology (Ashburner et al., 2000; Gene Ontology Consortium, 2021).
- gget archs4: Find the genes most correlated with a gene of interest, or retrieve the gene’s tissue expression atlas, using ARCHS4 (Lachmann et al., 2018).
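As an illustration of how the separate modules listed above might be chained from Python, the following is a minimal sketch rather than an official workflow. The helper name `summarize_hits` and its `api` parameter are hypothetical and exist only so the function can be exercised offline with a stand-in object. The sketch also assumes that the search result exposes an `ensembl_id` column, which may differ between gget versions.

```python
def summarize_hits(term, species, max_hits=3, api=None):
    """Search Ensembl for `term`, then fetch metadata and sequences for the top hits.

    `api` is a hypothetical hook: any object offering search/info/seq.
    When omitted, the installed gget package is used.
    """
    if api is None:
        import gget as api  # needs `pip install gget` and an internet connection

    # Free-form search, restricted to one species.
    hits = api.search([term], species)

    # Assumption: the result has an "ensembl_id" column (check your gget version).
    ids = list(hits["ensembl_id"])[:max_hits]
    if not ids:
        return {"ids": [], "info": None, "sequences": []}

    return {
        "ids": ids,
        "info": api.info(ids),      # metadata from Ensembl, UniProt, and NCBI
        "sequences": api.seq(ids),  # sequences for the same Ensembl IDs
    }
```

With gget installed, a call such as `summarize_hits("gaba", "homo_sapiens")` would query the live databases, so the result depends on their current contents.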
Each gget tool accesses data stored in one or several public databases, as depicted in Figure 1. gget fetches the requested data in real time, so each query returns the latest information available in the source database. The one exception is gget muscle, which compiles the Muscle5 algorithm (Edgar, 2021) locally and therefore does not require an internet connection.
gget info combines information from Ensembl, NCBI, and UniProt (Cunningham et al., 2022; NCBI Resource Coordinators, 2013; UniProt Consortium, 2021) to provide the user with a comprehensive summary of the available information about a gene or transcript. This also enables users to assess whether data from different sources are consistent.
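The cross-source consistency check mentioned above can be sketched as follows. This is not gget's implementation or output format: the function `find_conflicts`, the record layout, and the field names are hypothetical, and the values are made-up placeholders. The sketch only shows the idea of comparing what each source reports for the same field.

```python
def find_conflicts(records, fields):
    """Return {field: {source: value}} for every field on which sources disagree.

    `records` maps a source name to a dict of field values (hypothetical layout).
    Missing or empty values are ignored rather than counted as disagreement.
    """
    conflicts = {}
    for field in fields:
        # Collect the value each source reports for this field, if any.
        reported = {
            source: record[field]
            for source, record in records.items()
            if record.get(field) not in (None, "")
        }
        # Compare case-insensitively and without surrounding whitespace.
        normalized = {str(value).strip().casefold() for value in reported.values()}
        if len(normalized) > 1:
            conflicts[field] = reported
    return conflicts


# Placeholder values for illustration only; not real database content.
records = {
    "ensembl": {"symbol": "GENE1", "chromosome": "X"},
    "ncbi": {"symbol": "gene1 ", "chromosome": "X"},
    "uniprot": {"symbol": "GENE1A", "chromosome": None},
}
conflicts = find_conflicts(records, ["symbol", "chromosome"])
```

Here only `symbol` is flagged, because one source reports a different name, while the missing chromosome value is skipped.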
gget blast accesses the NCBI server (NCBI Resource Coordinators, 2013) through HTTP requests and therefore does not require downloading a reference BLAST database, unlike existing BLAST tools (Buchfink et al., 2021; Camacho et al., 2009). The whole self-contained gget package is approximately 3 MB after installation.
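To make the idea of a BLAST search over HTTP concrete, the sketch below builds the requests used by NCBI's public BLAST URL API: a submission (`CMD=Put`) followed by status polling (`CMD=Get`) with the returned request ID. It is an assumption that gget blast follows this general pattern, and this is not its actual code. The helper names are hypothetical, and nothing is sent over the network.

```python
import re
from urllib.parse import urlencode

# Public endpoint of the NCBI BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_body(sequence, program="blastn", database="nt"):
    """URL-encode the form body for submitting a search (CMD=Put)."""
    return urlencode(
        {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": sequence}
    )


def parse_request_id(response_text):
    """Extract the request ID (RID) from the submission response, or None."""
    match = re.search(r"RID = (\S+)", response_text)
    return match.group(1) if match else None


def build_poll_query(request_id):
    """URL-encode the query string for checking the status of a search (CMD=Get)."""
    return urlencode(
        {"CMD": "Get", "FORMAT_OBJECT": "SearchInfo", "RID": request_id}
    )
```

A client would POST the submission body to `BLAST_URL`, read the RID from the response, and then poll until the search is reported as ready. Because the search runs on NCBI's servers, no reference database has to be stored locally.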
Usage and documentation
gget can be installed from the command line by running ‘pip install gget’. Figure 1 depicts one use case for each gget tool with the corresponding output.
Each gget tool features an extensive manual, available as function documentation in a Python environment or as standard output using the help flag [-h] on the command line. The complete manual with examples can be viewed in the gget repository at https://github.com/pachterlab/gget. A separate gget examples repository at https://github.com/pachterlab/gget_examples includes example workflows that can be run immediately in Google Colaboratory (Bisong, 2019).