Paving a Data-Savvy Path to Ultra-High-Throughput Genomics

Colorectal cancer. Credit: National Pathology Imaging Cooperative (NPIC)

The All of Us project, like the Mount Sinai Million Health Discovery Program and the Taiwan Precision Medicine Initiative, aims to enroll 1 million volunteers. The UK Biobank already has 500,000 participants and has made 200,000 whole genomes available to scientists. This will be followed by the release of whole-genome sequencing (WGS) data for another 300,000 participants in early 2023.

With such large-scale genomic studies under way worldwide, faster sequencing instruments arriving, and new types of data to integrate, a major question is whether analysis and storage capabilities are up to the task of making sense of all this data. Most importantly, will any of it benefit patients?

Rami Mehio
Head of Global Software and Informatics, Illumina

No one is more aware of this problem than the people who make the sequencers that are churning out much of this data.

Illumina is, of course, working overtime to meet this growing need, said Rami Mehio, Head of Global Software and Informatics at Illumina. Mehio sees gaps today, such as the incorporation of proteomics and spatial genomics, but hopes that solutions will soon emerge that help the field continue to thrive.

Junhyong Kim
Second Professor, Department of Computer and Information Science
University of Pennsylvania; Co-Director, Penn Program in Single Cell Biology

The range of data has also expanded greatly. “Multimodality data is now available for millions of cells, [and] the key is how we integrate them,” said Junhyong Kim, co-director of the University of Pennsylvania Single Cell Biology Program.

The very future of drug discovery and development is at stake. “The mining of data on human diversity, using not only genetics but also proteomics and transcriptomics, will very likely dominate drug discovery and development.” That was the view expressed in a recent interview with our sister publication, GEN Biotechnology.

As shown by Genomics England, these data also have the potential to transform patient care (see “Final goal: clinical application” below). This project is slowly but steadily introducing the gold standard of cancer diagnosis and treatment across the UK’s National Health Service (NHS). It requires next-generation sequencing, the ability to analyze the output, and vast amounts of new data.

Advances in data management

This field has already come a long way. For one thing, sequencers are doing more of the data management work automatically. A decade ago, the data from a sequencer was still an image that required a lot of processing, but today’s advanced instruments skip many of these steps, providing only the data researchers need.

And for big data projects, there are now data compression, tiered storage options, and software that automatically migrates older data to cheaper storage and consolidates potentially duplicate files. Companies like AWS, Dell, Google, IBM, and Microsoft Health (Azure) are meeting the demand for flexible storage.

“You can imagine precision medicine operations and diagnostic labs generating massive amounts of data,” explains Mehio. “They run the data, get the results, leave it in expensive, accessible storage for six months, and then the software automatically moves it to a cheaper, less accessible storage system.”
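The rule Mehio describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tier names, the `RunData` record, and the 180-day window (standing in for “six months”) are hypothetical and do not represent Illumina’s software or any vendor’s storage API.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical tier labels; real storage systems use their own names.
HOT = "hot"    # expensive, quickly accessible storage
COLD = "cold"  # cheaper, less accessible storage


@dataclass
class RunData:
    run_id: str
    results_date: date  # date the analysis results were produced
    tier: str = HOT


def choose_tier(results_date: date, today: date, hot_days: int = 180) -> str:
    """Return HOT while the data is within the retention window, else COLD."""
    age = today - results_date
    return HOT if age <= timedelta(days=hot_days) else COLD


def apply_policy(runs, today: date, hot_days: int = 180):
    """Update each run's tier in place and return the runs whose tier changed."""
    moved = []
    for run in runs:
        new_tier = choose_tier(run.results_date, today, hot_days)
        if new_tier != run.tier:
            run.tier = new_tier
            moved.append(run)
    return moved
```

In this sketch, a scheduled job would call `apply_policy` with the current date and then hand the returned runs to whatever mechanism actually copies the files to the cheaper tier.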

In addition to sequencer and software updates, Illumina, a leader in the sequencing instrument space, has responded to demand by acquiring Enancio, a company that developed data compression software for the space. “This type of compression is specific to the genome,” Mehio says. “It accounts for overlapping parts of the genome. There are other compression solutions, [but] this reduces data by a factor of 5 without losing important information,” he adds.

As more high-throughput instruments come online and data from fields such as proteomics and spatial genomics become more widely used, analysis and storage will be further squeezed.

What advice does Mehio have for people starting large-scale genomics projects today?

“From the beginning, set compression to minimize footprint. Find a way to store variants in the database as cheaply as possible. You might need to access that data later, so keep it, but make sure it’s on a cheaper storage option,” he says.

But this is a big challenge for scientists who are not at large companies where everything is already in place.

“There are a lot of questions that come to mind in this area,” says Mark Kalinich, co-founder and CSO of Watershed Informatics. “There are two main obstacles [that] prevent you from turning data into insight: [1] inaccessible computational infrastructure and [2] today’s tools are fragmented and fragile.”

This means that wet lab scientists who generate large amounts of sequencing data need to understand how to translate that data into something interpretable. Companies must decide not only how to store all this data, but also how to interpret it.

“Many of these bioinformatics tools are outdated and potentially incompatible,” says Kalinich. “The size and variety of data in this field is growing exponentially,” he adds. “There is a need that is not being met by the explosion of capabilities.”

Today’s infrastructure, even including the cloud, is flexible but not very accessible, Kalinich says. “You can do it all in the cloud, just like you can build an entire Hoover Dam out of cement,” he says. “The cloud can provide storage, but the remaining question is the computing needed to power it and the proper bioinformatics needed to make it productive.”

Data sharing challenges

Data sharing, previously a thorny subject because of privacy concerns, has finally come to the fore.

Britain is leading the way. The UK Biobank is a prospective cohort study of 500,000 participants who were aged 40-69 years when recruited from 2006 to 2010. The study was established to “enable research into the lifestyle, environmental, and genomic determinants of life-threatening and disabling illnesses in middle-aged and older adults.”

Data collected at recruitment included self-reported lifestyle and medical information (subsequently supplemented by information from health records), various physical measurements (blood pressure, anthropometry, spirometry, etc.), and biological samples (blood, urine, and saliva). All data can be viewed in UK Biobank’s online data showcase, including summary statistics for each data field available for research.

Kári Stefánsson, MD, Dr.Med
Founder and CEO, deCODE

“The UK Biobank is a very unusual company. It turned out to be difficult,” says Kári Stefánsson, founder and CEO of deCODE.

Meanwhile, All of Us released nearly 100,000 WGS sequences this March. About 50% of the data are from individuals who identify with racial or ethnic groups that have historically been underrepresented in research. The project also published data on 20,000 people infected with SARS-CoV-2.

The project also contains a lot of external data from questionnaires. In late 2021, All of Us launched the Social Determinants of Health (SDOH) survey to collect information on various social and environmental factors in people’s daily lives. These factors include neighborhood safety, food and housing security, and experience of discrimination and stress.

The COVID-19 Participant Experience (COPE) survey asked participants about the impact of COVID-19 on their mental health, well-being, and daily life. The survey, which was rolled out six times between May 2020 and February 2021, allowed researchers to understand how COVID-19 affected participants over time.

All of Us Program Chief Data Officer Andrea Ramirez said: “One of our goals is to make the data widely available so that the methodology is transparent, but the identities of the participants are indistinguishable.”

Of course, sharing brings many data integration issues. “Multimodal data integration requires knowing whether the data are consistent [i.e., measured in the same way] or incomparable,” says Kim.

Ramirez echoes that point. “We’re bringing in external data,” she says. “But the standards are not always the same. We have our own internal quality control, but we serve a very diverse group of researchers and the quality standards are not always the same.”

Final goal: clinical application

Then there is the issue of translating genomics into the clinical arena. That’s the whole point. The UK has also played a leading role in this area. Since 2020, Genomics England has performed whole-genome sequencing of all childhood cancer, sarcoma, and acute leukemia patients being treated in the UK National Health Service (NHS). It has now begun sequencing patients with triple-negative breast cancer, glioma, and ovarian cancer.

This project is for National Health Service Genomic Medicine Service (NHS GMS) patients. They may be offered whole genome sequencing as part of their clinical care and will be asked if they would like to donate that data and/or biological sample for research.

Parker Moss
Chief Ecosystem and Partnerships Officer, Genomics England

Genomics England claims to have the world’s largest clinical genomics dataset on cancer. “We do both germline and tumor sequencing, and we do it in depth, so we don’t stop sequencing until we’ve covered all the genes,” said Parker Moss, chief ecosystem and partnerships officer at Genomics England.

Half of each tumor sample is placed in paraffin and cut into slices, which are then digitized. Digital images of tumor biopsies, genomic sequence data, and other imaging data such as radiology are used in combination to assess a patient’s prognosis and determine optimal treatment.

Genomic data are analyzed using specialized natural language processing (NLP). The images, Moss said, can then be represented as a matrix of 1,000 x 1,000 pixels.

Patients whose data are ingested into this research platform come from 80 different hospitals. To digitize these images, therefore, the physical slides must first be obtained from the hospitals and sent to the National Pathology Imaging Cooperative (NPIC) in Leeds, Genomics England’s partner in this work. According to Moss, the project includes more than 60 petabytes of data, mostly genomes, but with a growing proportion of image data.

There are clinical centers around the world that offer such services, but Genomics England stands out for its systemization of processes. Hopefully, more data sharing, new tools, and new projects will make patient services like these truly global.

Malorye Branca is a freelance science writer based in Acton, Massachusetts.