Chan Zuckerberg Landscaping Toolkit
In scientific work, context comes from knowledge of prior work in the field. Traditionally, that knowledge has resided in the published scientific literature, but more recently other online sources may also play a role.
This project is concerned with the tools needed to build representations of contextual knowledge for CZI’s SciTech and Program efforts as ‘Landscaping’ work. The design goals of this work are to make our tools modular, tailored to the needs of our colleagues, lightweight, and effective. We rely on low-tech, low-lift pieces that we can build on to make more sophisticated systems. We also drive this work as open source development.
This project is under development and not yet stable. This is a library of components designed to support and facilitate ‘scientific knowledge landscaping’ within the Chan Zuckerberg Initiative’s Science Program. It consists of several utility libraries to help build and analyze corpora of scientific knowledge expressed both as natural language and structured data. This system is built on the excellent nbdev package, which uses notebooks as a vehicle for development.
Installation & Code of Ethics
pip install git+https://github.com/chanzuckerberg/czLandscapingTk.git
CZI adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.
Please note: If you believe you have found a security issue, please responsibly disclose by contacting us at security@chanzuckerberg.com.
High-level Design: The Surveying Knowledge Task
This project focuses on providing a suite of generalizable tools that knowledge analysts can use to implement solutions for surveying tasks. The basic structure of this class of data analysis can be described in the following way:
Goal
An analytic task, where we attempt to answer a question by (A) surveying existing data sources, (B) compiling an intermediate knowledge corpus drawn from those sources, and (C) analysing that corpus to yield an answer to the question.
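The three stages (A), (B), and (C) can be sketched as a small Python pipeline. This is an illustrative sketch only and not the toolkit's API: the function names `survey`, `compile_corpus`, and `analyse`, the `SOURCES` dictionary, and the record fields are all hypothetical, and the data sources are stand-in in-memory lists rather than real external services.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory "data sources"; a real task would query external services.
SOURCES: Dict[str, List[dict]] = {
    "source_a": [
        {"id": "a1", "title": "Gene therapy for a rare disease", "year": 2021},
        {"id": "a2", "title": "Unrelated cooking study", "year": 2020},
    ],
    "source_b": [
        {"id": "b1", "title": "Rare disease treatment review", "year": 2019},
    ],
}


def survey(sources: Dict[str, List[dict]], keyword: str) -> List[dict]:
    """(A) Survey each data source for records whose title mentions the keyword."""
    hits = []
    for name, records in sources.items():
        for rec in records:
            if keyword.lower() in rec["title"].lower():
                # Remember which source each record came from.
                hits.append({**rec, "source": name})
    return hits


def compile_corpus(hits: List[dict], include: Callable[[dict], bool]) -> List[dict]:
    """(B) Compile the intermediate corpus by applying an inclusion criterion."""
    return [h for h in hits if include(h)]


def analyse(corpus: List[dict]) -> Dict[str, int]:
    """(C) Analyse the corpus; here, simply count documents per source."""
    counts: Dict[str, int] = {}
    for doc in corpus:
        counts[doc["source"]] = counts.get(doc["source"], 0) + 1
    return counts


hits = survey(SOURCES, "rare disease")
corpus = compile_corpus(hits, include=lambda d: d["year"] >= 2020)
answer = analyse(corpus)
print(answer)
```

Keeping each stage as a separate function means that a data source, an inclusion rule, or an analysis can be swapped without touching the other two stages.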
Typical Examples
- Identifying a set of Key Opinion Leaders (KOLs) with specialized expertise in an understudied area.
- Performing a systematic review of available treatments for a specific rare disease.
- Developing (and using) reproducible impact metrics for a funded scientific program to study what is working and what is not.
Terminology + Implementation Design
- Question: A natural language expression of the research question that is the objective of the task.
- Study Data Sources: List of available information sources that can be interrogated by executors of the task.
- Information Retrieval Query (IR Query): A list of logically-defined queries that can be run over the data sources.
- Inclusion / Exclusion Criteria: Logical operators to determine if retrieved data should be included in the study.
- Intermediate Corpus: Schema and data of the collection of documents gathered from external information sources.
- Analysis: Workflow specification of analyses to be performed over the intermediate corpus to generate an Answer.
- Answer: The answer to the question, expressed in natural language with a full explanation of the provenance of how the answer was computed.
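One way to picture how these terms fit together is as a set of plain Python data structures. The sketch below is an assumption made for illustration and does not reflect the toolkit's actual classes: `IRQuery`, `SurveyTask`, `Answer`, `add_to_corpus`, and `run_analysis` are hypothetical names, as are the source name `example_source` and the record fields.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IRQuery:
    source: str       # name of one of the study data sources
    expression: str   # logically-defined query to run over that source


@dataclass
class Answer:
    text: str               # natural language answer to the question
    provenance: List[str]   # ids of the corpus documents behind the answer


@dataclass
class SurveyTask:
    question: str
    data_sources: List[str]
    ir_queries: List[IRQuery]
    # Each criterion returns True when a record should be included.
    inclusion_criteria: List[Callable[[Dict], bool]]
    corpus: List[Dict] = field(default_factory=list)  # intermediate corpus

    def add_to_corpus(self, record: Dict) -> bool:
        """Add a retrieved record only if it passes every inclusion criterion."""
        if all(criterion(record) for criterion in self.inclusion_criteria):
            self.corpus.append(record)
            return True
        return False


def run_analysis(task: SurveyTask) -> Answer:
    """Toy analysis: count corpus documents and record them as provenance."""
    ids = [doc["id"] for doc in task.corpus]
    return Answer(
        text=f"{len(ids)} document(s) address: {task.question}",
        provenance=ids,
    )


task = SurveyTask(
    question="Which treatments exist for disease X?",
    data_sources=["example_source"],
    ir_queries=[IRQuery("example_source", '"disease X" AND treatment')],
    inclusion_criteria=[lambda r: r.get("peer_reviewed", False)],
)
task.add_to_corpus({"id": "doc1", "peer_reviewed": True})
task.add_to_corpus({"id": "doc2", "peer_reviewed": False})  # excluded
answer = run_analysis(task)
print(answer.text)
print(answer.provenance)
```

Carrying the document ids through to `Answer.provenance` mirrors the requirement that an answer explain how it was computed.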
Organizational Model
Image source on LucidDraw: Link
Adopting the CommonKADS knowledge engineering design process, we consider the interplay between agents (swimlanes), processes, and items in the figure. In particular, we seek to characterize how knowledge is needed, used, or derived in the workflow.
The goal of this project is to provide an extensible set of executable computational tools that automate the processes described above.
Basic System Workflow
Image source on LucidDraw: Link