Use Cases
CZI’s Mission - Build, Fund, Do
Broadly, our scientific mission at the Chan Zuckerberg Initiative (CZI) is expressed as:
We build open source software tools to accelerate science and generate more accurate and biologically important sources of data. We fund scientific research worldwide to advance the frontiers of knowledge. And we launched a family of institutes to do research that can’t be done in conventional environments. Each aspect is essential to our approach to building for the long term.
We describe this coordinated activity as ‘Build / Fund / Do’. Our SciTech development work (the ‘Build’ component) is based on scientific challenges and use cases provided by the collaborative research networks that our funding supports (‘Fund’) and by direct research activities undertaken within CZI research organizations such as the CZI BioHubs and the Chan Zuckerberg Imaging Institute (‘Do’). In particular, our scientific work generates complex research landscaping requirements that we seek to address using the latest available AI technology.
Alhazen - Basic Functionality and Design Choices
The Alhazen system is intended to provide open-source infrastructure that supports landscape analysis of scientific knowledge (from both the literature and other online sources) using large language model technology, including LLM-enabled workflows and automated agent-based tool use (provided by the LangChain library). Its main design choices are:
- Prioritize the use of open source Large Language Models (while still supporting access to state-of-the-art commercial systems).
- Explore the competitive advantage of executing long-running processing chains on local GPUs, since the same chains would be prohibitively expensive to run on commercial systems.
- Emulate the working process of a diligent, hard-working scientist who rigorously reads documents in depth, makes notes, and then bases their conclusions on summaries of those notes, with full explanations of how they got there. This contrasts with QA systems designed to answer questions quickly based on lightweight indexes of source material.
- Use LangChain as the core library for this work.
At present, the use cases listed below are works in progress that use elements of the Alhazen system, expressed as Jupyter Notebooks in the nb/cookbook subdirectory.
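The "read in depth, make notes, then conclude from the notes" process listed among the design choices can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Alhazen's implementation: the names `Note`, `split_into_passages`, `take_notes` and `conclude` are hypothetical, the prompts are placeholders, and the language model is represented by any function that maps a prompt string to a completion string (a LangChain model could be wrapped to fit that shape).

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: the LLM is any callable from prompt text to completion text.
LLM = Callable[[str], str]


@dataclass
class Note:
    source: str  # identifier of the document the note came from
    text: str    # the note itself


def split_into_passages(document: str, max_chars: int = 2000) -> List[str]:
    """Pack blank-line-separated paragraphs into passages of roughly max_chars.

    A single paragraph longer than max_chars is kept whole.
    """
    passages: List[str] = []
    current = ""
    for para in (p.strip() for p in document.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            passages.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        passages.append(current)
    return passages


def take_notes(llm: LLM, source: str, document: str, question: str,
               max_chars: int = 2000) -> List[Note]:
    """Read every passage of the document and record one note per passage."""
    notes = []
    for passage in split_into_passages(document, max_chars):
        prompt = (
            f"Question: {question}\n\n"
            f"Passage:\n{passage}\n\n"
            "Write a short note recording anything in the passage that is "
            "relevant to the question. Say 'nothing relevant' otherwise."
        )
        notes.append(Note(source=source, text=llm(prompt).strip()))
    return notes


def conclude(llm: LLM, notes: List[Note], question: str) -> str:
    """Answer the question from the numbered notes only, citing note numbers."""
    listing = "\n".join(
        f"[{i}] ({n.source}) {n.text}" for i, n in enumerate(notes, start=1)
    )
    prompt = (
        f"Question: {question}\n\n"
        f"Notes:\n{listing}\n\n"
        "Answer using only these notes and cite the note numbers you rely on."
    )
    return llm(prompt).strip()
```

Because every passage is read and every note is kept with its source, the final answer can be traced back to specific notes. The cost is one model call per passage, which is the kind of long-running chain that local GPUs make affordable.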
Scientific Use Cases
These use cases include:
- Zero-shot extraction of metadata from the methods sections of full-text CryoET papers, in order to support curation of that data into the CZI CryoET Portal system (see below).
- Understanding the distribution of types of experimental studies in rare disease research (see Human Annotated Corpus for Disease Research State Classifications)
- Landscaping analysis of researchers working on Infectious Disease in Low- and Middle-Income Countries (LMICs)
- Zero-shot extractions of the names of advanced imaging methods from methods papers in a small subset of journals.
- Searches for key opinion leaders from Africa working in the field of microscopy
Key Use Case: Metadata Extraction for Curation of Data into a Repository
A key use case for this effort is the development of tools that can assist curation of complex datasets into a central repository (such as the Chan Zuckerberg Imaging Institute’s CryoET Data Portal).
We currently seek to improve this use case by adding ontology search and matching capabilities for the extracted text, and by generalizing the extraction process to other protocol types.
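As a rough illustration of what zero-shot metadata extraction from a methods section can look like, the sketch below builds a prompt from a list of field descriptions and parses a JSON reply. It is a sketch under stated assumptions rather than the project's actual pipeline: the field names in `FIELDS` are hypothetical and are not the CryoET Data Portal schema, the functions `build_prompt`, `parse_response` and `extract_metadata` are invented for this example, and the model is again any function from prompt text to completion text.

```python
import json
from typing import Callable, Dict, Optional

# Assumption: the LLM is any callable from prompt text to completion text.
LLM = Callable[[str], str]

# Hypothetical fields, for illustration only (not the portal's real schema).
FIELDS: Dict[str, str] = {
    "sample_type": "the biological sample that was imaged",
    "microscope": "the microscope used for data collection",
    "tilt_range": "the range of tilt angles used in the tilt series",
}


def build_prompt(methods_text: str, fields: Dict[str, str]) -> str:
    """Describe the fields and ask for a JSON object, with no worked examples."""
    field_lines = "\n".join(f"- {name}: {desc}" for name, desc in fields.items())
    return (
        "Extract the following fields from the methods section below.\n"
        f"{field_lines}\n\n"
        "Reply with a single JSON object using exactly these keys. "
        "Use null for any field the text does not state.\n\n"
        f"Methods section:\n{methods_text}"
    )


def parse_response(raw: str, fields: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Pull the first-to-last brace span out of the reply and keep known keys."""
    empty: Dict[str, Optional[str]] = {name: None for name in fields}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return empty
    result: Dict[str, Optional[str]] = {}
    for name in fields:
        value = data.get(name)
        # Treat missing, null and empty-string values as "not stated".
        result[name] = None if value in (None, "") else str(value).strip()
    return result


def extract_metadata(llm: LLM, methods_text: str,
                     fields: Dict[str, str] = FIELDS) -> Dict[str, Optional[str]]:
    """Run one zero-shot extraction pass over a methods section."""
    return parse_response(llm(build_prompt(methods_text, fields)), fields)
```

Keeping unparseable or missing values as `None` leaves them visibly empty for a human curator instead of guessing. An ontology matching step of the kind mentioned above would operate on the non-empty values returned here.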