singularityhub.com

Thousands of Undiscovered Genes May Be Hidden in DNA ‘Dark Matter’

Thousands of new genes are hidden inside the “dark matter” of our genome.

A new study found that some of these tiny DNA snippets, previously thought to be noise left over from evolution, can make miniproteins. The finding could open a new universe of treatments, from vaccines to immunotherapies for deadly brain cancers.

The preprint, not yet peer-reviewed, is the latest from a global consortium that hunts down potential new genes. Ever since the Human Genome Project completed its first draft at the turn of the century, scientists have tried to decipher the genetic book of life. Buried within the four genetic letters—A, T, C, and G—and the proteins they encode is a wealth of information that could help tackle our most frustrating medical foes, such as cancer.

The Human Genome Project’s initial findings came as a surprise. Scientists found fewer than 30,000 genes that build our bodies and keep them running, roughly a third of the number previously predicted. Now, roughly 20 years later, as the technologies that sequence our DNA or map proteins have become increasingly sophisticated, scientists are asking: “What have we missed?”

The new study filled the gap by digging into relatively unexplored portions of the genome. Called “non-coding,” these parts haven’t yet been linked to any proteins. Combining several existing datasets, the team zeroed in on thousands of potential new genes that make roughly 3,000 miniproteins.

Whether these proteins are functional remains to be tested, but initial studies suggest some are involved in a deadly childhood brain cancer. The team is releasing their tools and results to the wider scientific community for further exploration. The platform isn’t just limited to deciphering the human genome; it can delve into the genetic blueprint of other animals and plants as well.

Even though mysteries remain, the results “help provide a more complete picture of the coding portion of the genome,” Ami Bhatt at Stanford University told Science.

What’s in a Gene?

A genome is like a book without punctuation. Sequencing one is relatively easy today, thanks to cheaper costs and higher efficiency. Making sense of it is another matter.

Ever since the Human Genome Project, scientists have searched our genetic blueprint to find the “words,” or genes, that make proteins. These DNA words are further broken down into three-letter codons, each one encoding a specific amino acid—the building block of a protein.
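To make the codon idea concrete, here is a minimal Python sketch that reads a DNA string three letters at a time and looks up the amino acid for each codon. It assumes the standard genetic code and a sequence that is already in the correct reading frame. The names `CODON_TABLE` and `translate` are illustrative only and are not taken from the study.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
# "*" marks a stop codon. All names here are illustrative, not from the study.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at position 0.

    Trailing bases that do not fill a whole codon are ignored, and any
    codon containing an unrecognized letter is reported as "X".
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein)


print(translate("ATGGCTTAA"))  # ATG -> M, GCT -> A, TAA -> stop: prints "MA*"
```

In a cell the lookup happens on messenger RNA, where U replaces T. The sketch works on the DNA letters directly to match the four letters used in this article.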

A gene, when turned on, is transcribed into messenger RNA. These molecules shuttle genetic information from DNA to the cell’s protein-making factory, called the ribosome. Picture the ribosome as a sliced bun, with an RNA molecule running through it like a piece of bacon.

When first defining a gene, scientists focus on open reading frames. These are made of specific DNA sequences that dictate where a gene starts and stops. Like a search function, this approach scans the genome for potential genes, which are then validated with lab experiments based on myriad criteria. These include whether they can make proteins of a certain size: more than 100 amino acids. Sequences that meet the mark are compiled into GENCODE, an international database of officially recognized genes.
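The “search function” idea can be sketched in a few lines of Python. This is a simplified illustration, not the consortium’s pipeline or the GENCODE annotation procedure. It assumes an ATG start codon and the three standard stop codons, and it scans only the forward strand. The function name `find_orfs` and the parameter `min_codons` are hypothetical. The default of 100 mirrors the size criterion described above.

```python
# Simplified open-reading-frame scan (illustrative; not the study's method).
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF here runs from the first ATG seen in a frame to the next
    in-frame stop codon. `end` is the index just past the stop codon.
    Only ORFs with at least `min_codons` codons before the stop are kept.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i  # open a candidate reading frame
            elif codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # keep scanning for the next candidate
    return orfs


# A toy sequence: 5 coding codons followed by a stop, in frame 2.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(toy, min_codons=5))  # [(2, 20, 2)]
print(find_orfs(toy))                # [] since it is far below 100 codons
```

The second call shows why short sequences slip through: with a 100-codon cutoff, a tiny reading frame like the toy example is never reported, which is the kind of blind spot that the “maybe-genes” discussed later fall into.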

Genes that encode proteins have attracted the most attention because they aid our understanding of disease and inspire ways to treat it. But much of our genome is “non-coding,” in that large sections of it don’t make any known proteins.

For years, these chunks of DNA were considered junk—the defunct remains of our evolutionary past. Recent studies, however, have begun revealing hidden value. Some bits regulate when genes turn on or off. Others, such as telomeres, protect against the degradation of DNA as it replicates during cell division and ward off aging.

Still, the dogma was that these sequences don’t make proteins.

A New Lens

Recent evidence is piling up that non-coding areas do have protein-making segments that affect health.

One study found that a small missing section in supposedly non-coding areas caused inherited bowel troubles in infants. In mice genetically engineered to mimic the same problem, restoring the DNA snippet—not yet defined as a gene—reduced their symptoms. The results highlight the need to go beyond known protein-coding genes to explain clinical findings, the authors wrote.

Dubbed non-canonical open reading frames (ncORFs), or “maybe-genes,” these snippets have popped up across human cell types and diseases, suggesting they have physiological roles.

In 2022, the consortium behind the new study began peeking into potential functions, hoping to broaden our genetic vocabulary. Rather than sequencing the genome, they looked at datasets that sequenced RNA as it was being turned into proteins in the ribosome.

The method captures the actual output of the genome, including extremely short amino acid chains normally thought too small to count as proteins. Their search produced a catalog of over 7,000 human “maybe-genes,” some of which made microproteins that were eventually detected inside cancer and heart cells.

But overall, at that time “we did not focus on the questions of protein expression or functionality,” wrote the team. So, they broadened their collaboration in the new study, welcoming specialists in protein science from over 20 institutions across the globe to make sense of the “maybe-genes.”

They also included several resources that provide protein databases from various experiments—such as the Human Proteome Organization and the PeptideAtlas—and added data from published experiments that use the human immune system to detect protein fragments.

In all, the team analyzed over 7,000 “maybe-genes” from a variety of cells: healthy cells, cancerous cells, and immortal cell lines grown in the lab. At least a quarter of these “maybe-genes” translated into over 3,000 miniproteins. These are far smaller than normal proteins and have a unique amino acid makeup. They also seem to be more attuned to parts of the immune system, meaning they could potentially help scientists develop vaccines, autoimmune treatments, or immunotherapies.

Some of these newly found miniproteins may not have a biological role at all. But the study gives scientists a new way to interpret potential functions. For quality control, the team organized each miniprotein into a different tier, based on the amount of evidence from experiments, and integrated them into an existing database for others to explore.

We’re just beginning to probe our genome’s dark matter. Many questions remain.

“A unique capacity of our multi-consortium collaboration is the ability to develop consensus on the key challenges” that we feel need answers, wrote the team.

For example, some experiments used cancer cells, meaning that certain “maybe-genes” might be active only in those cells and not in normal ones. Should they be called genes?

From here, deep learning and other AI methods may help speed up analysis. Although annotating genes is “historically rooted in manual inspection” of the data, wrote the authors, AI can churn through multiple datasets far faster, if only as a first pass to find new genes.

How many might scientists discover? “50,000 is in the realm of possibility,” study author Thomas Martinez told Science.

