In a preprint, scientists have announced the first complete, telomere-to-telomere sequence of the human genome, more than two decades after the first draft was published. This was made possible by new sequencing technologies [1].
Mind the gaps
The human genome was first reported as sequenced two decades ago by the Human Genome Project and Craig Venter’s company Celera Genomics. This original draft was announced with great fanfare at the White House on June 26, 2000. It covered roughly 90% of the genome and took scientists ten long years to produce. The task was declared complete three years later, in 2003, but almost 8% of the genome remained undeciphered.
Certain hard-to-read DNA regions were left out due to technological constraints. With the previous generation of sequencing technologies, scientists had to shred multiple copies of the DNA into small chunks several hundred base pairs long and sequence them separately. Afterwards, they would try to recreate the order of the fragments by looking for unique overlaps: if the end of one chunk matches the beginning of another, the two probably come from adjacent stretches of the genome. This worked for most regions but not all of them. The regions that defy this older type of sequencing are rife with tandem repeats, such as AGAGAGAG, sometimes spanning thousands of bases [2]. Chunks taken from inside such a repeat all look alike and overlap one another in many different ways, so until recently there was no way to tell where they belong or how long the repeat really is.
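The problem can be illustrated with a toy example. The Python sketch below is only an illustration, not the assembly software used by any of the projects mentioned here: the sequences, the read lengths and the helper names (make_genome, reads_of_length, overlap_lengths) are all made up for the purpose. It builds two tiny "genomes" that differ only in how many times AG is repeated and shows that short reads cannot tell them apart, while reads long enough to span the repeat can.

```python
from __future__ import annotations


def make_genome(repeat_count: int) -> str:
    """Toy genome (hypothetical): unique flanks around an AG tandem repeat."""
    return "TTCA" + "AG" * repeat_count + "CCGT"


def reads_of_length(genome: str, read_len: int) -> set[str]:
    """All distinct substrings ('reads') of the given length."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


def overlap_lengths(a: str, b: str) -> list[int]:
    """Every length n at which the last n bases of a equal the first n of b."""
    return [n for n in range(1, min(len(a), len(b)) + 1) if a.endswith(b[:n])]


SHORT = make_genome(5)  # repeat is 10 bases long
LONG = make_genome(9)   # repeat is 18 bases long

if __name__ == "__main__":
    # Reads shorter than the repeat: both genomes yield the same set of reads,
    # so the reads alone cannot reveal how long the repeat is.
    print(reads_of_length(SHORT, 6) == reads_of_length(LONG, 6))

    # Reads long enough to span the shorter repeat plus its flanks:
    # the two genomes now produce different reads.
    print(reads_of_length(SHORT, 16) == reads_of_length(LONG, 16))

    # A read that touches a unique flank overlaps its neighbour in one way...
    print(overlap_lengths("TTCAAG", "CAAGAG"))
    # ...while a read from inside the repeat overlaps its copies in several.
    print(overlap_lengths("AGAGAG", "AGAGAG"))
```

Real assemblers work with millions of error-containing reads and far more elaborate data structures, but the underlying ambiguity is the same: inside a repeat, overlaps stop being unique.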
These regions are located mostly near centromeres, the “knots” that hold the two copies of a duplicated chromosome together in the familiar X shape, and near telomeres, i.e., at the ends of the chromosomes. This is why the initiative that gave us this final reading of the human genome is called the Telomere-to-Telomere (T2T) Consortium.
New kids on the block
In recent years, several new technologies have emerged that can sequence longer chunks of DNA. With their help, the sequencing of the human genome has been inching towards completion. For this last push, the scientists used methods developed by two companies: PacBio from California and Oxford Nanopore from Oxford, UK.
Both methods allow sequencing long chunks of DNA, with Oxford Nanopore claiming an upper limit in the seven digits, that is, reads more than a million bases long. In Oxford Nanopore’s technology, a single DNA molecule is sequenced while being slowly driven through a nanopore, like a thread through the eye of a needle. This approach sacrifices accuracy for length, with a 15% error rate, while PacBio’s method, though capped at 20 thousand base pairs per read, is 99.9% accurate. This makes the two technologies complementary.
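To put this trade-off into rough numbers, the short Python sketch below multiplies read length by error rate to get the expected number of wrong bases per read. The rates and the 20-thousand cap are the figures quoted above; the read length of exactly one million bases for Oxford Nanopore is an assumption standing in for "seven digits", and errors are assumed to be spread evenly along the read, which is a simplification.

```python
from fractions import Fraction


def expected_errors(read_length: int, error_rate: Fraction) -> Fraction:
    """Expected number of wrong bases in one read, assuming uniform errors."""
    return read_length * error_rate


def bases_per_error(error_rate: Fraction) -> Fraction:
    """Average number of bases read for every one error."""
    return 1 / error_rate


# Figures quoted in the article.
NANOPORE_ERROR_RATE = Fraction(15, 100)       # 15% error rate
PACBIO_ERROR_RATE = 1 - Fraction(999, 1000)   # 99.9% accuracy
PACBIO_READ_LENGTH = 20_000                   # stated cap per read

# Assumption: one million bases as a stand-in for a "seven-digit" read.
NANOPORE_READ_LENGTH = 1_000_000

if __name__ == "__main__":
    for name, length, rate in [
        ("Oxford Nanopore", NANOPORE_READ_LENGTH, NANOPORE_ERROR_RATE),
        ("PacBio", PACBIO_READ_LENGTH, PACBIO_ERROR_RATE),
    ]:
        print(name,
              "expected errors per read:", float(expected_errors(length, rate)),
              "| bases per error:", float(bases_per_error(rate)))
```

Under these assumptions, the very long read can bridge a large repeat but carries many errors, while the shorter read is nearly clean, which is one way to see why combining the two is attractive.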
200 million new base pairs
The researchers have fully sequenced all 22 autosomal (non-sex) chromosomes and chromosome X (the male-exclusive chromosome Y was not sequenced). This added almost 200 million base pairs of novel sequence that contain more than 2,000 paralogous gene copies. Paralogs are copies of an original gene that were inserted into other places in the DNA. Most of these copies become defunct, while others retain their original functions or develop new ones. A small but notable share of the newly found gene copies, 115, are predicted to code for proteins, and many more probably code for RNA that is not translated into proteins but is still used in gene regulation, for example through RNA interference.
Short arms matter too
Five of our 23 chromosome pairs are highly asymmetric: the arm on one side of the centromere is much shorter than the arm on the other. These are called acrocentric chromosomes. Their short arms are particularly rich in repeats and duplications, so they had not been adequately sequenced until now. The new approach provides highly accurate sequences of the short arms of all acrocentric chromosomes.
The researchers note that genes on these short arms are known to affect such important cellular processes as ribosome biogenesis and nucleolus formation. They have also been implicated in genetic conditions such as Down syndrome. A full sequencing of these regions can bring us closer to understanding these aspects of cellular life and human health.
We need more references
In addition to the gaps, which have now been all but eliminated, the human reference genome has another problem: it is a mosaic assembled from the DNA of 13 anonymous volunteers. It does a decent job as a genomic map that helps sequence new genomes and can tell us where particular genes are located, but it fails to account for some of the differences that exist between real-world genomes. Scientists have been arguing for a while that we need more reference genomes, each fully derived from a single person, covering various ethnic backgrounds. New sequencing technologies can make this task easier.
Conclusion
Full sequencing of the human genome, if confirmed, is an important milestone that opens the door to many new discoveries. Soon, we will be able to better understand how our centromeres work and discover the role of long tandem repeats. Scientists will also be able to look deeper into genetic differences between people of various backgrounds, which is important for studying ethnicity-based genetic diseases.
Literature
[1] Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A. V., Mikheenko, A., … & Phillippy, A. M. (2021). The complete sequence of a human genome. bioRxiv.
[2] Warburton, P. E., Hasson, D., Guillem, F., Lescale, C., Jin, X., & Abrusan, G. (2008). Analysis of the largest tandemly repeated DNA families in the human genome. BMC Genomics, 9(1), 1-18.