Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomics and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas (TCGA), a comprehensive archive of tumoral data containing the results of high-throughput experiments, mainly Next Generation Sequencing, for more than 30 cancer types.
Researchers from Uninettuno International University have developed TCGA2BED, a software tool to search and retrieve TCGA data and convert them into the structured BED format for seamless use and integration. It also supports conversion into the CSV, GTF, JSON, and XML standard formats. Furthermore, TCGA2BED extends TCGA data with information extracted from other genomic databases (i.e., NCBI Entrez Gene, HGNC, UCSC, and miRBase). The researchers also provide and maintain an automatically updated data repository with the publicly available Copy Number Variation, DNA-methylation, DNA-seq, miRNA-seq, and RNA-seq (V1, V2) experimental data of TCGA converted into the BED format, together with the associated clinical and biospecimen metadata in attribute-value text format.
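To make the idea of format conversion concrete, the following is a minimal Python sketch, not TCGA2BED's own code, of how one genomic record could be serialized as a tab-delimited BED-style row and as CSV or JSON. It assumes only the standard BED convention that the first three columns are chromosome, start, and end; the remaining column names and all values are hypothetical placeholders, not the schema or data that TCGA2BED produces.

```python
import csv
import io
import json

# Hypothetical record: field names and values are placeholders for
# illustration only, not TCGA2BED's schema or real TCGA data.
record = {
    "chrom": "chr1",
    "start": 1000,
    "end": 2000,
    "strand": "+",
    "gene_symbol": "GENE_A",
    "value": 12.5,
}

# Coordinates come first, as in BED; the later columns are assumptions.
COLUMNS = ["chrom", "start", "end", "strand", "gene_symbol", "value"]


def to_bed_line(rec):
    """Join the fields with tabs to form one BED-style row."""
    return "\t".join(str(rec[c]) for c in COLUMNS)


def to_csv_line(rec):
    """Serialize the same fields as one comma-separated row."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow([rec[c] for c in COLUMNS])
    return buf.getvalue()


def to_json(rec):
    """Serialize the same fields as a JSON object."""
    return json.dumps({c: rec[c] for c in COLUMNS})


print(to_bed_line(record))
print(to_csv_line(record))
print(to_json(record))
```

Because every output is derived from the same ordered column list, the three serializations stay consistent with one another, which is the property that makes a single structured format convenient for integration.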
Interaction diagram of the TCGA2BED software architecture
The architecture is composed of: (a) the controller, which executes the operations (e.g., download, conversion) specified either in an XML input configuration file or through the user interface; (b) the TCGA retrieval system, which searches and retrieves TCGA genomic data of multiple types (i.e., CNV, DNA-seq, DNA-methylation, miRNA-seq, and RNA-seq V1 and V2) and their associated clinical and biospecimen metadata; (c) the BioParser, which converts the genomic data into the tab-delimited BED format and the corresponding clinical and biospecimen metadata into tab-delimited attribute-value text format. Dashed blue and solid green arrowed lines correspond to the paths of data download and conversion, respectively; from left to right, blue thick-line rectangles refer to software components, and green thin-line ones represent the BioParser extensions with the links to the four external databases for additional genomic data retrieval (i.e., UCSC, HGNC, NCBI Entrez Gene, and miRBase). The Roman (Arabic) numerals refer to the sequence of download (conversion) operations that a user can perform.
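The two outputs of the conversion step described in the caption can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the BioParser's implementation: the function names, the attribute names, and the lookup table standing in for an external database are all hypothetical. It shows metadata written as one tab-delimited attribute-value pair per line, and a record extended with fields looked up by gene symbol.

```python
def write_metadata(meta):
    """Render a dict as tab-delimited attribute-value lines, one pair per line."""
    return "\n".join(f"{key}\t{value}" for key, value in sorted(meta.items()))


def read_metadata(text):
    """Parse attribute-value lines back into a dict, skipping blank lines."""
    pairs = (line.split("\t", 1) for line in text.splitlines() if line.strip())
    return {key: value for key, value in pairs}


def annotate(rec, gene_table):
    """Return a copy of rec extended with fields found under its gene symbol."""
    extra = gene_table.get(rec["gene_symbol"], {})
    return {**rec, **extra}


# Hypothetical inputs: attribute names, identifiers, and values are placeholders.
metadata = {"sample_type": "tumor", "patient_id": "EXAMPLE-0001"}
gene_table = {"GENE_A": {"external_id": "ID-0000"}}  # stand-in for an external database
record = {"chrom": "chr1", "start": 1000, "end": 2000, "gene_symbol": "GENE_A"}

text = write_metadata(metadata)
print(text)
print(read_metadata(text) == metadata)
print(annotate(record, gene_table))
```

Keeping the annotation step as a plain lookup that returns a new record means a missing gene symbol leaves the original fields unchanged, which is one simple way to tolerate gaps in an external source.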
The availability of the valuable TCGA data in BED format reduces the time needed to put them to use: huge amounts of cancer genomic data can be handled efficiently and in an integrated way, and can be searched, retrieved, and extended with additional information. The BED format thus supports investigators in performing knowledge discovery analyses on all tumor types in TCGA, with the final aim of understanding pathological mechanisms and aiding cancer treatments.
Availability – The TCGA2BED software is available for multiple operating systems, as an executable Java jar with a graphical user interface, at http://bioinf.iasi.cnr.it/tcga2bed/.