Tremendous amounts of biomedical research data have been generated and collected at an ever-increasing speed and scale. For example, PubMed, a primary source of biomedical research literature covering nearly 100 years of publications, is adding articles at an exponential rate: 90% of all articles have been added in the last X years. In addition to the research literature, there is a broad spectrum of data available, from genomic, demographic, administrative, clinical, and policy repositories.
As a number of researchers have pointed out, although individual data sets and repositories are increasingly available, they are limited with respect to interoperability and reuse by the community. Rather than becoming easier to access and utilize, the data ecosystem is becoming less integrated, making the problems associated with information processing and interpretation more difficult.
As the amount of data increases, these problems only get worse. Some efforts are perennially underway to impose formal ontologies and universal descriptors upon datasets to make them more computationally manageable, but these efforts themselves are subject to the same problems they set out to solve. In trying to fit all data sets, they become overly complex and hard for people to understand and use. People end up customizing them for particular data types and the ontologies themselves diversify and require unwieldy mappings from one to the other.
The only realistic solution to making effective use of the scale and diversity of data that are available is computational. Specifically, we require computational systems that are designed to understand the ambiguities inherent to all forms of human communication and knowledge representation. The real power of these data sets comes from interlinking them and deriving latent knowledge that is spread across a scale beyond the abilities of individual people to extract. The more dissimilar the data types, the harder they are to link, but the greater the potential for novel discovery.
We are focused on providing state-of-the-art disambiguation services for large-scale biomedical data linking. We have developed highly scalable and accurate natural language processing techniques to extract and resolve references to research objects such as genes, proteins, diseases, and patient groups across a variety of sources including publications, grants, patents, social media, and clinical trials. In addition, we have constructed machine learning models capable of inferring highly accurate links between research objects.
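As a rough illustration of what resolving references to research objects involves, the sketch below matches tokens in free text against a small lexicon and separates unambiguous matches from ambiguous ones. It is a toy under stated assumptions, not a description of our production techniques: the lexicon, the placeholder identifiers, and the function names `normalize` and `resolve_mentions` are all hypothetical.

```python
import re

# Hypothetical toy lexicon: normalized surface form -> candidate identifiers.
# The identifiers are made-up placeholders, not records from a real database.
LEXICON = {
    "tp53": ["GENE:0001"],
    "p53": ["GENE:0001"],
    "cat": ["GENE:0002", "GENE:0003"],  # deliberately ambiguous symbol
}


def normalize(token):
    """Lowercase the token and drop punctuation such as hyphens."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def resolve_mentions(text):
    """Return (resolved, ambiguous) lists of mentions found in the text."""
    resolved, ambiguous = [], []
    for token in re.findall(r"[A-Za-z0-9\-]+", text):
        candidates = LEXICON.get(normalize(token), [])
        if len(candidates) == 1:
            resolved.append((token, candidates[0]))
        elif len(candidates) > 1:
            # More than one candidate: context would be needed to decide.
            ambiguous.append((token, candidates))
    return resolved, ambiguous


resolved, ambiguous = resolve_mentions("TP53 and p-53 interact with CAT")
```

Even this toy case shows why simple lookup is not enough: spelling variants must be normalized before they match, and a short symbol such as "CAT" cannot be resolved without additional context.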
Contrary to misleading marketing claims, developing the technical capabilities to successfully match and combine information from disparate biomedical data sources requires substantial domain-specific knowledge and expertise.
It is crucial to distinguish between generic data linking software and the domain-specific knowledge and technical expertise required to implement the features needed to unambiguously link biomedical data at a large scale.
The easier a product claims to be to use, the less likely it is to work as advertised for your specific needs.
The real power of interlinked data goes beyond mapping identifiers from one database onto identifiers in another. Although this task may be challenging in its own right (e.g., because of inconsistent formatting, ambiguity, poor data quality, or conflicting values), it pales in comparison to the analytical power that can be derived from extracting and interlinking information from unstructured text sources such as research articles, legal documents, and social media. The ability to identify and extract biomedical entities such as genes, proteins, diseases, individual researchers, and research organizations, to track their development from early-stage research through to patented drugs and medical devices, and to follow subsequent progress or setbacks is extremely valuable to academic institutions, funding agencies, private-sector enterprises, and financial and legal institutions.
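To make the identifier-mapping challenges concrete, the following minimal sketch links records from two sources on a normalized identifier and reports fields whose values disagree. It assumes the identifiers are DOI-like strings that differ only in prefix, case, and surrounding whitespace; the record layout, the example identifiers, and the function names `normalize_id` and `link_records` are hypothetical.

```python
def normalize_id(raw):
    """Strip whitespace, case differences, and common DOI prefixes."""
    value = raw.strip().lower()
    for prefix in ("doi:", "https://doi.org/"):
        if value.startswith(prefix):
            value = value[len(prefix):]
    return value


def link_records(source_a, source_b, key="id"):
    """Link records sharing a normalized key and list conflicting fields."""
    index = {normalize_id(rec[key]): rec for rec in source_b}
    links, conflicts = [], []
    for rec in source_a:
        norm = normalize_id(rec[key])
        match = index.get(norm)
        if match is None:
            continue  # no counterpart in the second source
        links.append((rec, match))
        # Compare only the fields present in both records, skipping the key.
        for field in sorted(rec.keys() & match.keys()):
            if field != key and rec[field] != match[field]:
                conflicts.append((norm, field, rec[field], match[field]))
    return links, conflicts


# Made-up records: same identifier written two ways, with differing years.
source_a = [{"id": "DOI:10.1234/ABC.1", "year": 2020}]
source_b = [{"id": " https://doi.org/10.1234/abc.1", "year": 2021}]
links, conflicts = link_records(source_a, source_b)
```

The sketch only detects conflicts; deciding which value to trust is a separate problem that generally depends on knowledge of the sources involved.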
From an academic perspective, high-quality deeply linked data provides researchers with crucial knowledge of the research landscape and developing scientific trends. It also provides administrators with important tools for calibrating institutional performance. Institutions with access to deeply interlinked datasets lead the way in research productivity and impact.
From a funding perspective, deeply interlinked data holds the promise of speeding scientific discovery and enabling effective stewardship of publicly funded research.
From a commercial perspective, deeply linked biomedical research data that ties together all stages of the drug development pipeline affords a distinct competitive advantage. Machine learning models trained on high-quality deeply linked data can quickly identify promising new areas of success based on previous exemplars. Deeply linked data plus large-scale machine learning models provide nearly unimaginable power to home in on undiscovered opportunities that minimize financial risk. Companies that adopt this data-driven approach to investing, whether venture capital or internal strategic initiatives, are more likely to succeed than those that rely solely on typical data sets and dashboard-based hunt-and-peck approaches to information gathering.
About The Analytic Research Institute
The ARI brings together technical capability, analytical experience, and deep subject matter expertise to deliver AI training and solutions to policy and decision-makers in the Federal Government as well as the private sector. The ARI can effect change throughout the process of your project, or help optimize individual elements of your plan. Take advantage of our services or embed ARI experts in your office to help manage your projects.
Wilkinson, Mark D., et al. “The FAIR Guiding Principles for scientific data management and stewardship.” Scientific Data, vol. 3, 160018, 15 Mar. 2016, doi:10.1038/sdata.2016.18
Yu, Hao, et al. “The effect of mentee and mentor gender on scientific productivity of applicants for NIH training fellowships.” bioRxiv, 2021, https://www.biorxiv.org/content/10.1101/2021.02.02.429450v1