Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. A team led by researchers at Johns Hopkins University discusses the status of the human gene catalogue and recent efforts to complete it. Besides the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; the researchers look at recent advances that offer paths towards identifying their functions and eventually completing the human gene catalogue. Finally, they examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes, so that the human gene catalogue can be used in clinical settings.
Predicted and observed human gene counts over time
Counts of protein-coding, pseudogene and non-coding genes are shown. The timepoints before 2003 and after 2023 (dashed lines) represent an average of predictions from the literature and extrapolations from this Perspective, respectively. The timepoints from 2003 to 2023 are based on 20 iterations of the NCBI RefSeq annotation of the human reference genome, including both curated and predicted genes.