Data
What is the Data CLI Tool?
The Data CLI tool is a command-line interface for searching data, exploring metadata, and downloading data registered in the Virtual Cells Platform (“VCP”). It lets you search for data across multiple scientific domains without needing to write code or scripts.
Metadata Schemas
Registered data comes with rich metadata to streamline search. Learn about our data schemas, including the cross modality schema that specifies the key metadata available for all registered datasets.
Getting Started
Prerequisites
Your Virtual Cells Platform account credentials (register here)
Python version 3.10 or greater
The VCP CLI tool. See Installation for instructions.
Authentication
Some CLI commands require that you have a user account on the Virtual Cells Platform website and that you log in to your account using the CLI. If needed, you can create a new account on the Virtual Cells Platform website.
Login via Web Browser
To log in to your Virtual Cells Platform account using your browser:
vcp login
Once you log in, you can go back to the command line and continue.
Login via the Command Line
To log in to your Virtual Cells Platform account from your terminal, specify the --username option:
vcp login --username your.name@example.org
You will be prompted for a password. Use the same one you use on the Virtual Cell Models web page.
Get Help Using the CLI
The --help option provides additional documentation and tips. You can add it to the end of any of the available commands for more information.
For example, to learn what data commands are available for this tool, run:
vcp data --help
You can also add --help to an individual command to learn how to use it, for example:
vcp data describe --help
Overview of Data Commands
The CLI has 6 core data commands:
Command | Description |
---|---|
metadata-list | List available metadata fields for searching datasets. |
summary | Summarize counts of matched datasets against a specified FIELD. |
search | Search for datasets by TERM. Lucene-style queries are supported. |
describe | Describe a dataset with comprehensive metadata in tabular format. |
preview | Generate a preview URL to visualize zarr files in a dataset. |
download | Download dataset(s) by ID or search query. At least one of --id or --query is required. |
The CLI also has the following options that can be used to adjust commands:
Option | Purpose | Commands |
---|---|---|
--id | Specify the ID of a single dataset to download | download |
--query | Provide a search query to filter datasets for download or summary | download, summary |
-o, --outdir | Specify the output directory for downloaded files | download |
--exact | Use exact match for the search term | search |
--download | Download all datasets returned by a search | search |
--full | Show detailed metadata in pretty-printed JSON format | describe |
 | Show the raw returned record from the API | |
--open | Automatically open the Neuroglancer preview URL in your default browser | preview |
--help | Show usage information and exit | All subcommands |
Summary of Fields
To return the full list of searchable fields, run:
vcp data metadata-list
Below is a table of searchable fields which includes terms from the cross modality schema.
Field | Definition |
---|---|
assay | Defines the assay that was used to create the dataset. Human-readable label. |
assay_ontology_term_id | Defines the assay used to create the dataset. MUST be an Experimental Factor Ontology (EFO) term, e.g., “EFO:0022605”. |
tissue | Defines the tissues from which assayed biosamples were derived. Human-readable label. |
tissue_ontology_term_id | Defines the tissues from which assayed biosamples were derived. Allowed ontologies include CL, GO, UBERON, WBbt, ZFA, and FBbt. |
tissue_type | One of: ‘cell culture’, ‘cell line’, ‘organelle’, ‘organoid’, or ‘tissue’. |
cell_type | Cell Type as defined in Cellosaurus. |
organism | Defines the organism from which assayed biosamples were derived. Human-readable label. |
organism_ontology_term_id | Defines the organism from which assayed biosamples were derived. MUST be an NCBI organismal classification term, e.g., ‘NCBITaxon:9606’. |
disease | Defines the disease of the patients or organisms from which biosamples were derived. Human-readable label. |
disease_ontology_term_id | Defines the disease. MUST be a descendant of ‘MONDO:0000001’ if disease, or ‘PATO:0000461’ for normal/healthy. |
development_stage | Defines the development stage of the patients or organisms. Human-readable label. |
development_stage_ontology_term_id | Defines the development stage. Use ‘na’ for cell lines, ‘unknown’ if unknown, or otherwise an ontology term from an organism-specific ontology. |
name | Curator-provided name for the dataset. |
tags | List of tags associated with the dataset, including ‘namespace: |
creator | Stringified list of creators, usually a person (e.g., ‘John Doe’) or organization (e.g., ‘CZI’). |
For any field, vcp data summary returns a table of every term used in that field, along with the number of datasets for each term. For example:
vcp data summary tissue_type
Use the --query option with vcp data summary to apply a filter when summarizing a specific metadata field. For example, to filter for datasets that mention brain tissue and summarize the distribution of assay types, run:
vcp data summary assay --query brain
Example Queries
Search for Datasets
vcp data search cryoet
This will return an overall count of the datasets with cryoet in the dataset name or metadata and a paginated table of those datasets with their associated metadata. To automatically download the datasets returned by search, add the option --download to the end of your query. To search for an exact match of a term, use the option --exact at the end of your query.
The CLI supports Lucene-style search queries, so you can use:
Field-specific search, like "tissue:brain"
Quotation marks " " to group multiword terms and boolean expressions
AND, OR, NOT boolean operators to combine terms. To use boolean operators with multiword terms, use double quotation marks (" ") around the query and single quotation marks (' ') around the multiword terms within the query.
Wildcard terms with * and ?
Fuzzy search with ~
Examples using each type of Lucene query are below.
Field-specific Terms
Use field:value pairs to search for specific metadata.
vcp data search tissue:skeletal
To search for ontology terms, escape colons with a backslash (\:) and place the entire query within single quotes (' '). For example:
vcp data search 'assay_ontology_term_id:EFO\:0030062'
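When queries are generated from a script, the colon escaping described above can be applied programmatically. The following is a minimal sketch, not part of the VCP CLI: the helper name field_query is hypothetical, and it only builds the command text without running it.

```python
import shlex


def field_query(field: str, value: str) -> str:
    """Build a field:value query, escaping colons inside the value.

    Hypothetical helper for illustration; it is not provided by the VCP CLI.
    """
    escaped_value = value.replace(":", r"\:")
    return f"{field}:{escaped_value}"


query = field_query("assay_ontology_term_id", "EFO:0030062")

# shlex.join adds the single quotes a POSIX shell needs around the backslash.
command = ["vcp", "data", "search", query]
print(shlex.join(command))
# prints: vcp data search 'assay_ontology_term_id:EFO\:0030062'
```

Only colons inside the value are escaped; the colon that separates the field from the value is left alone.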
Multiword Terms
Use single quotes (' ') around multiword search terms and around queries that contain spaces.
vcp data search 'cryoet data portal'
Combine Terms with Boolean Operators
The following returns CellxGene datasets of kidney samples.
vcp data search 'tissue:kidney AND cellxgene'
To combine boolean operators with multiword search terms, use double quotes (" ") around the query, and single quotes (' ') around the multiword search terms. For example, to search for datasets on the CryoET Data Portal from the Chan Zuckerberg Imaging Institute (CZII), use:
vcp data search "'cryoet data portal' AND CZII"
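The quoting rule above can be sketched in Python for cases where a query is assembled from a list of terms. This is an illustrative sketch under stated assumptions: quote_term and combine_terms are hypothetical helper names, not part of the VCP CLI, and the result is meant to be passed as a single argument (for example in an argument list for subprocess.run), which takes the place of the outer double quotes a shell would need.

```python
def quote_term(term: str) -> str:
    """Wrap a multiword term in single quotes; leave single words unchanged."""
    return f"'{term}'" if " " in term else term


def combine_terms(terms: list[str], operator: str = "AND") -> str:
    """Join search terms with a boolean operator (hypothetical helper)."""
    if operator not in {"AND", "OR"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(quote_term(term) for term in terms)


query = combine_terms(["cryoet data portal", "CZII"])
print(query)
# prints: 'cryoet data portal' AND CZII

# As one list element, the query reaches the CLI intact without shell quoting.
command = ["vcp", "data", "search", query]
```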
Wildcard Terms
The * symbol can be used as a multicharacter wildcard and the ? as a single character wildcard on terms within quotation marks. For example, to search for data from any 10x assay, use:
vcp data search "assay:10x*"
To search for cryoET or cryoEM data, you could use:
vcp data search "cryoe?"
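To get a feel for what these two wildcards match, the same symbols can be tried locally with Python's fnmatch module. This is only an approximation for illustration: it assumes that * and ? behave like shell-style wildcards on a single term, and it does not reproduce how the platform's search engine tokenizes or ranks results. The sample terms are made up for the example.

```python
from fnmatch import fnmatchcase


def matching_terms(pattern: str, terms: list[str]) -> list[str]:
    """Return the terms that match a pattern using * and ? wildcards."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# ? stands for exactly one character.
print(matching_terms("cryoe?", ["cryoet", "cryoem", "cryo", "cryoet2"]))
# prints: ['cryoet', 'cryoem']

# * stands for any run of characters, including none.
print(matching_terms("10x*", ["10x 3' v3", "smart-seq2", "10x multiome"]))
# prints: ["10x 3' v3", '10x multiome']
```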
Fuzzy Search
To do a fuzzy search, use a ~ symbol at the end of a single word term. This type of search accounts for simple typos and formatting differences.
vcp data search "Hpylori~"
View Dataset Metadata
vcp data describe 688ab21b2f6b6186d8332644
This returns a table with additional metadata beyond what is displayed with the search command.
To show comprehensive metadata for a dataset, add the option --full to the end of your query, for example:
vcp data describe 688ab21b2f6b6186d8332644 --full
All of the metadata displayed can be used for field-specific search, for example:
vcp data search namespace:cellxgene
Preview an Imaging Zarr Dataset
For Imaging datasets with Zarr files, we support previewing the data in Neuroglancer. Check out this Neuroglancer quickstart and Neuroglancer documentation to familiarize yourself with the tool.
vcp data preview 681a6a61200cf05759b5bf91
This will return a clickable URL for opening the dataset in Neuroglancer. Use the --open option to automatically open the link in your browser.
vcp data preview 681a6a61200cf05759b5bf91 --open
Download a Dataset
vcp data download --id 688ab21b2f6b6186d8332644
This will return the size of the dataset in bytes and request confirmation for download. Typing Y will initiate the download in the current working directory. During the download, a progress bar is displayed.
Use the -o or --outdir option, followed by a path to a folder, to specify the output directory for the download. For example, to download a dataset to your Documents folder, run:
vcp data download -o ~/Documents --id 688ab21b2f6b6186d8332644
Or
vcp data download --outdir ~/Documents --id 688ab21b2f6b6186d8332644
Download All Datasets Based on Query
To download all datasets that match a query, use vcp data download --query $QUERY. For example, to download all CellxGene datasets with kidney samples, use:
vcp data download --query "tissue:kidney AND cellxgene"
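For scripted workflows, the download options described in this section can be assembled into an argument list before the CLI is invoked. The sketch below is illustrative only: download_args is a hypothetical helper, not part of the VCP CLI, and it builds the arguments without running the command. It enforces the rule that at least one of --id or --query is required, and it expands ~ itself because no shell is involved when an argument list is passed to subprocess.run.

```python
from pathlib import Path


def download_args(dataset_id=None, query=None, outdir=None):
    """Build the argument list for vcp data download (hypothetical helper)."""
    if dataset_id is None and query is None:
        raise ValueError("at least one of dataset_id or query is required")
    args = ["vcp", "data", "download"]
    if dataset_id is not None:
        args += ["--id", dataset_id]
    if query is not None:
        args += ["--query", query]
    if outdir is not None:
        # Expand ~ here, since it is not expanded without a shell.
        args += ["--outdir", str(Path(outdir).expanduser())]
    return args


args = download_args(query="tissue:kidney AND cellxgene", outdir="~/Documents")
print(args[:5])
# prints: ['vcp', 'data', 'download', '--query', 'tissue:kidney AND cellxgene']
```

The list could then be passed to subprocess.run(args). Because the CLI asks for confirmation before downloading, the terminal should stay attached so that the prompt can be answered.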
Tips
Put multiword terms in quotes: "stem cell" not stem cell.
Start simple: Try vcp data search cellxgene to get a feel for results.
Use --help often: Every command supports it!
For more information on the available commands and options, see Command Line Interface.