Treat Biological Data as a Strategic Resource

Congress must authorize the Department of Energy (DOE) to create a Web of Biological Data (WOBD), a single point of entry for researchers to access high-quality data.

Congress should authorize the National Institute of Standards and Technology (NIST) to create standards that researchers must meet to ensure that U.S. biological data are ready for use in AI models.

Congress should authorize and fund the Department of the Interior (DOI) to create a Sequencing Public Lands Initiative to collect new data from U.S. public lands that researchers can use to drive innovation.

Congress should authorize the National Science Foundation (NSF) to establish a network of “cloud labs,” giving researchers state-of-the-art tools to make data generation easier.

Biological data lie at the heart of emerging biotechnologies and are defined by the National Institute of Standards and Technology (NIST) as “the information, including associated descriptors, derived from the structure, function, or process of a biological system(s) that is either measured, collected, or aggregated for analysis.”228

Biological data include a wide variety of human data as well as data from animals, plants, fungi, bacteria, and viruses that comprise the rich biological landscape of the United States. These biological data enable scientists to discover, design, and optimize everything from individual components of cells to the behavior of whole groups of organisms to the inputs and outputs of biomanufacturing processes.

Biological data are especially important for unlocking AI’s potential. Just as large language model (LLM) chatbots such as ChatGPT are trained on vast amounts of text from the internet, biological design tools and scientific language models are trained on troves of biological data from research efforts.

Biological Data Definition

Biological data are “the information, including associated descriptors, derived from the structure, function, or process of a biological system(s) that is either measured, collected, or aggregated for analysis.”ˡⁱⁱⁱ Biological data and associated metadata illuminate how biology behaves, from individual components of cells to the behavior of whole groups of organisms and their ecosystems. Biological data also describe the conditions necessary to produce medications such as vaccines and antibodies, materials such as mushroom leather or spider silks, and chemicals produced from microbes.

If the United States is to cement its global lead in biotechnology, it must do more to develop high-quality data. The country has failed to provide high-quality data in a usable way, address gaps in data holdings, invest in automated biological data collection, or build the infrastructure needed to ensure that the United States fully leverages its wealth of biological data. The federal government has even failed to maximize the scientific discoveries and innovations already held in its existing collections of biological specimens.

China’s approach to biological data involves accessing and exploiting publicly available data from around the world, including from the United States, while harvesting its own domestic datasets and closing them off to the rest of the world.230 This approach gives China an asymmetric advantage in exploiting biological data and highlights its lack of data-sharing reciprocity. Many Chinese Communist Party (CCP) policies explicitly state that the government intends to prioritize the collection and use of biological data, as do statements from China’s medical AI industry.231 Accordingly, the U.S. government must ensure that China cannot obtain bulk and sensitive biological data from the United States.

Recommendation 4.1A

Congress must authorize the Department of Energy (DOE) to create a Web of Biological Data (WOBD), a single point of entry for researchers to access high-quality data.

Currently, U.S. biological data are generated from a wide variety of sources and organized with different purposes in mind. These data are organized differently across organizations in academia, government, and industry, and even across individual labs within the same organization.232

This uncoordinated approach makes collating large datasets a burdensome process for researchers, slowing potential discoveries. It might take months to answer a single question, assuming the information exists in the first place.

There are several noteworthy examples of biological databases created by federal departments and agencies, but each is incomplete for a future that requires data for new AI models. For example, the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) is one of the most comprehensive genomic databases in the world.233 But its datasets are in reality spread over different databases and data types and are not designed to be used comprehensively, a key requirement for training AI models. Targeted programs to make biological data more compatible would help to ensure that efforts such as the NCBI drive the future of biotechnology. The Joint Genome Institute at Lawrence Berkeley National Laboratory leads an exemplary data program on microbial sequences and ecosystems, but the program is focused on a small subset of microbiome data.234 Expanding efforts like this to include a larger class of organisms and other types of biological information, such as protein data, would add valuable tools needed for the future of biotechnology.

Having the ability to standardize, combine, and analyze biological data generated from different places, organisms, or experiments is critical to advancing research and training AI models. In many cases, the combination of different datasets is more valuable than the individual parts.
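As a rough illustration of what standardizing and combining data from different sources can involve, the sketch below renames source-specific fields to a shared schema and groups the resulting records by species. The lab names, field names, and sample values are all hypothetical and are not drawn from any real database or from the WOBD proposal.

```python
# Hypothetical field mappings: each source names the same concepts differently.
FIELD_MAPS = {
    "lab_a": {"organism": "species", "seq": "sequence", "sample": "sample_id"},
    "lab_b": {"Species_Name": "species", "DNA": "sequence", "ID": "sample_id"},
}


def standardize(record, source):
    """Rename source-specific fields to the shared schema and normalize values."""
    mapping = FIELD_MAPS[source]
    out = {mapping[key]: value for key, value in record.items() if key in mapping}
    out["species"] = out["species"].strip().lower()
    out["sequence"] = out["sequence"].strip().upper()
    out["source"] = source  # keep provenance so records stay traceable
    return out


def combine(datasets):
    """Standardize every record and group the results by species."""
    combined = {}
    for source, records in datasets.items():
        for record in records:
            row = standardize(record, source)
            combined.setdefault(row["species"], []).append(row)
    return combined


# Illustrative records only; the values are made up.
datasets = {
    "lab_a": [{"organism": "Escherichia coli ", "seq": "atgc", "sample": "A1"}],
    "lab_b": [{"Species_Name": "escherichia coli", "DNA": "ATGG", "ID": "B7"}],
}
combined = combine(datasets)
```

Once both records share field names and value conventions, they fall under the same species key and can be analyzed together, which is the kind of curation work a shared resource could take off individual researchers.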

The creation of a resource that combines biological datasets in a usable way would allow researchers to spend less time curating biological data and more time testing hypotheses, training models, and designing novel biological functions. Such a resource would:

  • serve as a single point of entry for researchers to access different sources of biological data, all of which would be standardized, usable, and interoperable;
  • enable discovery with advanced computational methods; and
  • protect and control access to U.S. biological data.

To create this resource, Congress must authorize the Department of Energy (DOE) to create the Web of Biological Data (WOBD), a comprehensive central biological data infrastructure that would serve as a single point of entry for accessing biological data, have built-in security and access controls, and provide opportunities for advanced computation and analysis. The WOBD would start with data collected from federally funded efforts and could later expand to include data from other sources.

Web of Biodata Will Make Using Biological Data Easier

The WOBD would:

  • serve as an access point for high-quality biological data from different locations;
  • host new biological data;
  • develop and maintain tools for using these biological data such as bioinformatics pipelines, models, and ontologies (i.e., the categories, properties, and relationships between concepts and conventions that define a field); and
  • require that any datasets included on the platform be standardized.

This centralized resource would have the added benefit of incorporating cybersecurity and access controls into the earliest stages of its design and development. There are many considerations when designing security and access controls for biological data. For example, plant genome sequences from basic research projects would need different access controls and cybersecurity protocols than sensitive medical records or human genomic data. The WOBD is intended to encompass many different types of biological data, and as it expands, it would need to build in security carefully and comply with all applicable privacy laws.
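One way to picture access controls that vary by data type is a tiered check, sketched below. The tier names, the data-type labels, and the assignment of data types to tiers are assumptions made for illustration; the report does not specify how the WOBD would implement access control.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical access tiers, ordered from least to most restricted."""
    OPEN = 0
    REGISTERED = 1
    CONTROLLED = 2


# Assumed mapping from data type to the minimum tier a user must hold.
REQUIRED_TIER = {
    "plant_genome": Tier.OPEN,
    "human_genome": Tier.CONTROLLED,
    "medical_record": Tier.CONTROLLED,
}


def can_access(user_tier, data_type):
    """Return True if the user's tier meets the requirement for this data type."""
    # Unknown data types default to the strictest tier.
    required = REQUIRED_TIER.get(data_type, Tier.CONTROLLED)
    return user_tier >= required
```

Defaulting unrecognized data types to the strictest tier is a deliberate choice in this sketch: a dataset added without a classification stays restricted until someone reviews it.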

Security Considerations for Biological Data

Security considerations differ across types of biological data. Safeguards implemented on the Web of Biological Data (WOBD) should be proportionate to the sensitivity of the data, ensuring that access is appropriately managed while encouraging scientific collaboration.

Much of the security and access control implemented through the WOBD would be decided on a case-by-case basis, but some basic distinctions exist among types of biological data. Although not exhaustive, these include: