A diverse amalgam of variably structured data embodies our expanding knowledge of biological phenomena, including human health and disease. The PubMed biomedical citation database adds more than one million entries to its collection of 30 million literature items each year. These documents cover both experimental observations and clinical reports; more than two million of the latter type (i.e., clinical case reports, or CCRs) are in publication. As biomedical text varies extensively in content, vocabulary, and style across subdomains, it is challenging for even the most advanced natural language processing strategies to impose a consistent structure upon these documents. Most results therefore remain fragmented, unstructured, and difficult to compare without careful manual curation. At the same time, biomedical knowledgebases (KBs) are rich sources of structured observations, yet they are not inherently interoperable. How may we organize, unify, and discover novel relationships among massive sets of heterogeneous biomedical observations? What is necessary to consistently link structured knowledge with that extracted from unstructured sources? Perhaps most importantly, how may we best extend efforts to impose structure on experimental reports and CCRs to the reports clinicians produce daily?
We have developed resources to address these challenges. To provide a starting point for developing methods to identify high-level concepts within clinical narratives, we assembled a set of metadata acquired from clinical case reports (MACCRs). This expert-curated, publicly available data set contains publication metadata, patient demographics, and descriptions of clinical events and observations for 3,100 CCRs. We see the MACCR set as a resource to help clinicians, researchers, and machine learning systems understand how disease presentations are described and how they may be written about more clearly. We have also constructed a semantic typing system, Annotation for Case Reports using Open Biomedical Annotation Terms (ACROBAT), which defines a set of concepts, categories, and relations for representing medical language. The resource includes a set of 200 CCR texts manually annotated with ACROBAT. The annotation rules we use are sufficiently broad to generalize to other document types written in biomedical language, and to our knowledge this is the only collection of CCRs deeply annotated for both entities and relations. We are currently employing these resources, along with emerging approaches in biomedical text processing such as transfer learning and language model-driven methods, to develop a platform for biomedical information extraction. Our goal is to impose sufficient structure on experimental and clinical observations to transform them into consistent, searchable knowledge graphs. These living documents can unify heterogeneous information to support strategies for data comparison, biomarker identification, and disease diagnosis.
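To make the pipeline from annotated text to knowledge graph concrete, the following is a minimal sketch of how entity and relation annotations over a case report might be flattened into searchable triples. The entity types, relation labels, and example sentence here are illustrative assumptions for exposition, not the actual ACROBAT vocabulary or data.

```python
# Hypothetical sketch: turning entity/relation annotations into
# (subject, predicate, object) triples for a knowledge graph.
# Semantic types and relation labels below are assumed for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str      # text span from the case report
    semtype: str   # assumed semantic type, e.g. "Sign/Symptom"


@dataclass(frozen=True)
class Relation:
    head: Entity
    label: str     # assumed relation label, e.g. "modifies"
    tail: Entity


# Toy annotations for a hypothetical sentence:
# "The patient presented with severe chest pain."
pain = Entity("chest pain", "Sign/Symptom")
severity = Entity("severe", "Severity")
relations = [Relation(severity, "modifies", pain)]

# Flatten annotations into triples, the basic unit of a knowledge graph.
triples = [(r.head.text, r.label, r.tail.text) for r in relations]
print(triples)  # [('severe', 'modifies', 'chest pain')]
```

Once many reports are reduced to triples of this shape, heterogeneous observations can be merged into a single graph and queried consistently, which is the kind of unification the platform described above aims to support.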
Authors: J. Harry Caufield, Yichao Zhou, Anders O. Garlid, Yunsheng Bai, Yijiang Zhou, Quan Cao, Jessica Lee, Sanjana Murali, Sarah Spendlove, David A. Liem, Kai-Wei Chang, Yizhou Sun, Wei Wang, and Peipei Ping