Chan Zuckerberg Landscaping Toolkit

Accelerating our understanding of the context of SciTech development

In scientific work, context comes from knowledge of prior work in the field. Traditionally, the published scientific literature has been the repository of that information, but more recently other online sources may also play a role.

This project is concerned with the tools needed to build representations of contextual knowledge for CZI’s SciTech and Program efforts as ‘Landscaping’ work. The design goals of this work are to make our tools modular, tailored to the needs of our colleagues, lightweight, and effective. We rely on low-tech, low-lift pieces that we can build on to make more sophisticated systems. We also develop this work as open source.

This project is under development and not yet stable. It is a library of components designed to support and facilitate ‘scientific knowledge landscaping’ within the Chan Zuckerberg Initiative’s Science Program. It consists of several utility libraries to help build and analyze corpora of scientific knowledge expressed both as natural language and structured data. This system is built on the excellent nbdev package, which uses notebooks as a vehicle for development.

Installation & Code of Ethics

pip install git+https://github.com/chanzuckerberg/czLandscapingTk.git

CZI adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.

Please note: If you believe you have found a security issue, please responsibly disclose by contacting us at security@chanzuckerberg.com.

High-level Design: The Surveying Knowledge Task

This project is focussed on providing a suite of generalizable tools that knowledge analysts can use to implement solutions for surveying tasks. The basic structure of this class of data analysis can be described in the following way:

Goal

An analytic task, where we attempt to answer a question by (A) surveying existing data sources, (B) compiling an intermediate knowledge corpus drawn from those sources, and (C) analysing that corpus to yield an answer to the question.
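The three steps (A), (B), and (C) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names `survey`, `compile_corpus`, and `analyse`, the `Document` and `Source` types, and the two in-memory sources are all hypothetical and are not part of the toolkit's API.

```python
from __future__ import annotations

from typing import Callable, Iterable

# Hypothetical types for illustration; none of these names come from the toolkit.
Document = dict  # one retrieved record, e.g. {"id": ..., "title": ..., "year": ...}
Source = Callable[[str], Iterable[Document]]  # takes a query string, yields documents


def survey(sources: list[Source], query: str) -> list[Document]:
    """(A) Run the same query over every data source and pool the results."""
    return [doc for source in sources for doc in source(query)]


def compile_corpus(docs: list[Document], include: Callable[[Document], bool]) -> list[Document]:
    """(B) Build the intermediate corpus: deduplicate by id, then filter."""
    unique = {doc["id"]: doc for doc in docs}
    return [doc for doc in unique.values() if include(doc)]


def analyse(corpus: list[Document]) -> str:
    """(C) A toy analysis: count documents per year and report the result as text."""
    counts: dict[int, int] = {}
    for doc in corpus:
        counts[doc["year"]] = counts.get(doc["year"], 0) + 1
    return ", ".join(f"{year}: {n}" for year, n in sorted(counts.items()))


# Two in-memory stand-ins for real data sources (assumed for the example).
def source_a(query: str) -> list[Document]:
    return [{"id": "a1", "title": f"{query} review", "year": 2020},
            {"id": "x9", "title": f"{query} trial", "year": 2021}]


def source_b(query: str) -> list[Document]:
    return [{"id": "x9", "title": f"{query} trial", "year": 2021},
            {"id": "b2", "title": f"{query} case report", "year": 2015}]


pooled = survey([source_a, source_b], "rare disease")
corpus = compile_corpus(pooled, include=lambda d: d["year"] >= 2018)
answer = analyse(corpus)
print(answer)
```

Keeping each step as a plain function means a source, a filter, or an analysis can be swapped without touching the other two, which is one way to read the modularity goal stated earlier.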

Typical Examples

Identifying a set of Key Opinion Leaders (KOLs) with specialized expertise in an understudied area.
Performing a systematic review of available treatments for a specific rare disease.
Developing (and using) reproducible impact metrics for a funded scientific program to study what is working and what is not.

Terminology + Implementation Design

Question - A natural language expression of the research question that is the objective of the task
Study Data Sources - List of available information sources that can be interrogated by executors of the task
Information Retrieval Query (IR Query) - A list of logically-defined queries that can be run over the data sources
Inclusion / Exclusion Criteria - Logical operators to determine if retrieved data should be included in the study
Intermediate Corpus - Schema and Data of the collection of documents gathered from external information sources
Analysis - Workflow specification of analyses to be performed over the intermediate corpus to generate an Answer
Answer - The answer to the question expressed in natural language with a full explanation of the provenance of how the answer was computed.
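One possible way to hold the terms above in code is a pair of dataclasses. This is a minimal sketch, not the toolkit's actual schema: the class names `SurveyTask` and `Answer`, their fields, and the methods `admit`, `add_records`, and `answer` are assumptions made for illustration, and criteria are modelled as simple predicates over dictionary records.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

Record = dict  # hypothetical: one document in the intermediate corpus


@dataclass
class Answer:
    text: str                                            # natural-language answer
    provenance: list[str] = field(default_factory=list)  # how it was computed


@dataclass
class SurveyTask:
    question: str            # the research question
    data_sources: list[str]  # names of the study data sources
    ir_queries: list[str]    # IR queries run over those sources
    inclusion: list[Callable[[Record], bool]] = field(default_factory=list)
    exclusion: list[Callable[[Record], bool]] = field(default_factory=list)
    corpus: list[Record] = field(default_factory=list)   # intermediate corpus

    def admit(self, record: Record) -> bool:
        """Keep a record if it meets every inclusion criterion and no exclusion criterion."""
        return (all(c(record) for c in self.inclusion)
                and not any(c(record) for c in self.exclusion))

    def add_records(self, records: list[Record]) -> None:
        """Append only the records that pass the criteria."""
        self.corpus.extend(r for r in records if self.admit(r))

    def answer(self, analysis: Callable[[list[Record]], str]) -> Answer:
        """Run an analysis over the corpus and attach a provenance trail."""
        return Answer(
            text=analysis(self.corpus),
            provenance=[
                f"sources: {', '.join(self.data_sources)}",
                f"queries: {'; '.join(self.ir_queries)}",
                f"corpus size: {len(self.corpus)}",
            ],
        )


task = SurveyTask(
    question="Who publishes most on the topic?",
    data_sources=["source_a"],
    ir_queries=['"rare disease" AND treatment'],
    inclusion=[lambda r: r["year"] >= 2018],
    exclusion=[lambda r: r["type"] == "editorial"],
)
task.add_records([
    {"author": "Lee", "year": 2020, "type": "article"},
    {"author": "Lee", "year": 2021, "type": "editorial"},
    {"author": "Ortiz", "year": 2012, "type": "article"},
])
result = task.answer(lambda corpus: f"{len(corpus)} record(s) retained")
print(result.text)
print(result.provenance)
```

Carrying the sources, queries, and corpus size alongside the answer text is a simple stand-in for the "full explanation of the provenance" that the Answer term calls for; a real implementation would likely record much more.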

Organizational Model

General proposed workflow for Landscaping systems

Image source on LucidDraw: Link

Adopting the CommonKADS knowledge engineering design process, we consider the interplay between agents (swimlanes), processes, and items in the figure. In particular, we seek to characterize how knowledge is needed, used, or derived in the workflow.

The goal of this project is to provide an extensible set of executable computational tools that automate the processes shown in the figure above.

Basic System Workflow

Image source on LucidDraw: Link