Indexing All Life’s Known Biological Sequences | bioRxiv

peter.suber's bookmarks 2024-05-23

Summary:

Abstract: The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making it full-text searchable and easily accessible to researchers in life and data science is an unsolved problem. In this work, we take advantage of recently developed, very efficient data structures and algorithms for representing sequence sets. We make petabases of DNA sequences across all clades of life, including viruses, bacteria, fungi, plants, animals, and humans, fully searchable and make the indexes available to the research community. The indexes are a highly compressed representation of the input sequences (up to 5800×) and fit on a single consumer hard drive (≈100 USD). We present the underlying methodological framework, called MetaGraph, that allows us to scalably index very large sets of DNA or protein sequences using annotated de Bruijn graphs. We demonstrate the feasibility of indexing the full extent of existing sequencing data and present new approaches for efficient and cost-effective full-text search at an on-demand cost of $0.10 per queried Mbp. We explore several practical use cases to mine existing archives for interesting associations and demonstrate the utility of our indexes for integrative analyses.
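Illustration (not from the paper): the abstract's central idea, an annotated de Bruijn graph, can be sketched in a few lines of Python. This is a toy, uncompressed version that assumes a node-centric view in which each k-mer is a node, edges are implied by (k-1)-overlaps, and the annotation is simply the set of sample labels containing that k-mer. The function names, the sample labels, and the `min_fraction` parameter are hypothetical and are not MetaGraph's API; the cost helper only restates the quoted figure of $0.10 per queried Mbp.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each k-mer (a graph node) to the set of sample labels containing it."""
    index = defaultdict(set)
    for label, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return index


def query(index, seq, k, min_fraction=1.0):
    """Return {label: fraction of the query's k-mers found in that sample},
    keeping only labels that reach min_fraction (hypothetical threshold)."""
    query_kmers = list(kmers(seq, k))
    if not query_kmers:
        return {}
    hits = defaultdict(int)
    for kmer in query_kmers:
        # .get() avoids creating empty entries for unseen k-mers
        for label in index.get(kmer, ()):
            hits[label] += 1
    total = len(query_kmers)
    return {
        label: count / total
        for label, count in hits.items()
        if count / total >= min_fraction
    }


def query_cost_usd(num_bases, usd_per_mbp=0.10):
    """Estimate on-demand cost from the rate quoted in the abstract."""
    return num_bases / 1_000_000 * usd_per_mbp


if __name__ == "__main__":
    # Hypothetical toy samples; real inputs would be sequencing read sets.
    samples = {"sampleA": "ACGTACGT", "sampleB": "TTGACGTA"}
    index = build_index(samples, k=4)
    print(query(index, "GTACG", k=4))
    print(query_cost_usd(5_000_000))
```

A plain dictionary of sets grows far too large at the scale the abstract describes; the compression it reports (up to 5800×) relies on the succinct graph and annotation representations the paper presents, which this sketch does not attempt to reproduce.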

 

Link:

https://www.biorxiv.org/content/10.1101/2020.10.01.322164v3

From feeds:

Open Access Tracking Project (OATP) » peter.suber's bookmarks

Tags:

oa.new oa.biology oa.data oa.discoverability oa.search

Date tagged:

05/23/2024, 09:02

Date published:

05/23/2024, 05:02