
Chris Mungall (Lawrence Berkeley National Laboratory)
Open Knowledge Bases in the Age of Generative AI
(Keynote talk for joint BOSC/BOKR session)
ABSTRACT: The scientific and clinical community relies on the active development of a wide range of interlinked knowledge bases to plan experiments, interpret omics data, and help with the diagnosis and treatment of disease. These knowledge bases rely on expert curation and community ontologies to provide accurate and structured information that can be used algorithmically. The advent of generative AI and agentic methods presents fantastic opportunities for accelerating curation and increasing the breadth and depth of coverage. Open knowledge bases also present opportunities to generative AI, in the form of a trusted backbone of knowledge that can mitigate the hallucinations that plague large language models. However, the pace of development of AI, combined with misunderstandings about both its strengths and weaknesses, poses significant dangers. In this talk, I will present our recent work on the use of agentic AI to assist with manual knowledge base tasks, particularly those involving complex ontology development and maintenance. I will present a realistic picture of the challenges we face, but also strategies to mitigate them, and a path towards a future where agents, curators, and others can work together to leverage and integrate open source tools and data along with the combined knowledge of the scientific community.
Dr. Chris Mungall is a Senior Scientist at Berkeley Lab, where he heads the Biosystems Data Science department in the Environmental Genomics and Systems Biology Division. Chris’s research interests center around the capture, computational integration, and dissemination of biological research data, and the development of methods for using this data to elucidate biological mechanisms underpinning the health of humans and of the planet. He and his team have led the creation of key biological ontologies for the integration of resources covering gene function, anatomy, phenotypes and the environment, including the Uberon anatomy ontology, the Cell Ontology (CL), and the Mondo disease ontology. He is also one of the cofounders of the OBO Foundry. For decades, he has been a strong advocate for open-source bioinformatics software, open standards, and open science.
Chris, who has a PhD in bioinformatics from the University of Edinburgh, is a PI on the Gene Ontology (GO), the Monarch Initiative, the Alliance of Genome Resources, Phenomics First, and the NCATS Biomedical Data Translator, as well as metadata lead for the National Microbiome Data Collaborative (NMDC). In 2017, Chris was the first person to be awarded the Exceptional Contributions to Biocuration Award by the International Society for Biocuration. In 2020, he received a Berkeley Lab Early Scientific Career Director’s Award.

Christine Orengo (University College London)
Working together to develop, promote and protect our data resources: Lessons learnt developing CATH and TED
ABSTRACT: The CATH protein domain structure classification was the vision of the pioneering computational scientist Janet Thornton. Algorithms developed by Orengo and Taylor in the lab of Willie Taylor enabled the analyses that laid the foundations for CATH. Since then, the Orengo team have taken CATH forward in many ways. Working closely with the protein sequence, structural and evolutionary biology communities provided the focus and feedback to shape the resource.
Maintaining the value and integrity of CATH has necessitated continuously embracing new types of data as they became relevant and developing the appropriate tools to handle them. For example, CATH was recently expanded >400-fold with predicted structures from AlphaFold Database (AFDB) using novel AI-based tools.
CATH is also a partner resource in InterPro and was used by the Structural Genomics Consortia in the US for more than 15 years to probe novel fold and function space. All CATH data and tools are publicly available. The talk will present landmark developments and describe how the resource has benefitted from extensive collaborations with the wider community to handle the data expansions and to provide accurate data needed by the community. It will also draw on CATH experience to reflect on strategies for supporting open data and open source.
Dr. Christine Orengo is a Professor of Bioinformatics at University College London (UCL). Her research focuses on the development of algorithms to capture relationships between protein structures, sequences and functions. She has built one of the most comprehensive protein classifications, CATH. CATH structural and functional data for hundreds of millions of proteins have enabled studies that revealed essential universal proteins and their biological roles, and extended the characterisation of biological systems implicated in disease, e.g. in cell division, cancer and aging. The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 365 million domain assignments.
Dr. Orengo received her PhD from University College London. She is currently a Vice President of the International Society for Computational Biology (ISCB) and was previously the ISCB’s first female President. She is a Fellow of the Royal Society (FRS), an Elected Member of EMBO since 2014, and a Fellow of ISCB since 2016. Dr. Orengo is a strong supporter of FAIR and open data and data sharing practices.
BOSC keynote speaker selection process
BOSC usually includes two or three keynote talks given by prominent individuals or emerging leaders who are accomplished in areas relevant to the bioinformatics open source community and who represent a range of backgrounds and ideas. Please see our invited speaker rubric for more information about our keynote speaker selection process and criteria.