cellxgene_census.experimental.ml.huggingface.GeneformerTokenizer
- class cellxgene_census.experimental.ml.huggingface.GeneformerTokenizer(experiment: Experiment, *, obs_column_names: Sequence[str] | None = None, obs_attributes: Sequence[str] | None = None, max_input_tokens: int = 4096, special_token: bool = True, token_dictionary_file: str = '', gene_median_file: str = '', gene_mapping_file: str = '', **kwargs: Any)
Generate a Hugging Face Dataset containing Geneformer token sequences for each cell in CELLxGENE Census ExperimentAxisQuery results (human).
This class requires the Geneformer package, which must be installed separately with: pip install "transformers[torch]<4.50" git+https://huggingface.co/ctheodoris/Geneformer@ebc1e096
DEPRECATION NOTICE: this class is planned for removal from the cellxgene_census API and migration into git:cellxgene-census/tools/models/geneformer.
Example usage:
```
import cellxgene_census
import tiledbsoma
from cellxgene_census.experimental.ml.huggingface import GeneformerTokenizer

with cellxgene_census.open_soma(census_version="latest") as census:
    with GeneformerTokenizer(
        census["census_data"]["homo_sapiens"],
        # set obs_query to define some subset of Census cells:
        obs_query=tiledbsoma.AxisQuery(
            value_filter="is_primary_data == True and tissue_general == 'tongue'"
        ),
        obs_column_names=(
            "soma_joinid",
            "cell_type_ontology_term_id",
        ),
    ) as tokenizer:
        dataset = tokenizer.build()
```
Dataset item contents:
- input_ids: Geneformer token sequence for the cell
- length: Length of the token sequence
- and the specified obs_column_names (cell metadata from the experiment obs dataframe)
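As a rough illustration of working with items of this shape, the sketch below averages the token sequence length per value of a metadata column. It uses a plain list of dicts as a stand-in for the rows of the built dataset; the helper name `mean_length_by`, the token numbers, and the metadata values are all hypothetical placeholders, not real Census output.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def mean_length_by(items: Iterable[Dict[str, Any]], column: str) -> Dict[Any, float]:
    """Average the `length` field of items, grouped by a metadata column."""
    totals: Dict[Any, list] = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for item in items:
        entry = totals[item[column]]
        entry[0] += item["length"]
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Placeholder rows mimicking the item layout described above
# (input_ids, length, plus the requested obs columns).
items = [
    {"input_ids": [5, 9, 3], "length": 3, "soma_joinid": 0,
     "cell_type_ontology_term_id": "CL:placeholder_a"},
    {"input_ids": [7, 2], "length": 2, "soma_joinid": 1,
     "cell_type_ontology_term_id": "CL:placeholder_a"},
    {"input_ids": [4, 8, 6, 1], "length": 4, "soma_joinid": 2,
     "cell_type_ontology_term_id": "CL:placeholder_b"},
]

summary = mean_length_by(items, "cell_type_ontology_term_id")
print(summary)
```

The same function should work on any iterable of dict-like rows that carry a `length` field and the chosen column.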
- __init__(experiment: Experiment, *, obs_column_names: Sequence[str] | None = None, obs_attributes: Sequence[str] | None = None, max_input_tokens: int = 4096, special_token: bool = True, token_dictionary_file: str = '', gene_median_file: str = '', gene_mapping_file: str = '', **kwargs: Any) -> None
Initialize GeneformerTokenizer.
Args:
- experiment: Census Experiment to query
- obs_query: obs AxisQuery defining the set of Census cells to process (default all)
- obs_column_names: obs dataframe columns (cell metadata) to propagate into attributes of each Dataset item
- max_input_tokens: maximum length of Geneformer input token sequence (default 4096)
- special_token: whether to affix separator tokens to the sequence (default True)
- token_dictionary_file, gene_median_file: pickle files supplying the mapping of Ensembl human gene IDs onto Geneformer token numbers and median expression values. By default, these will be loaded from the Geneformer package.
- gene_mapping_file: optional pickle file with mapping for Census gene IDs to model’s
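
To make the roles of max_input_tokens, special_token, the token dictionary, and the gene medians more concrete, here is a minimal pure-Python sketch of a Geneformer-style rank value encoding for a single cell. It is not the package's implementation: the function name `rank_value_encode` and its parameters are hypothetical, any per-cell scaling is left out because it does not change the ordering within one cell, and the separator tokens are modeled as a leading CLS and trailing EOS token, an assumption suggested by the model_cls_token and model_eos_token attribute names.

```python
from typing import Dict, List, Optional


def rank_value_encode(
    counts: Dict[str, float],          # gene ID -> raw count for one cell
    gene_medians: Dict[str, float],    # gene ID -> median expression value
    gene_tokens: Dict[str, int],       # gene ID -> token number
    max_input_tokens: int = 4096,
    cls_token: Optional[int] = None,   # hypothetical separator tokens
    eos_token: Optional[int] = None,
) -> Dict[str, object]:
    """Order a cell's expressed genes by median-normalized expression."""
    # Keep only expressed genes known to both lookup tables.
    scored = [
        (count / gene_medians[gene], gene_tokens[gene])
        for gene, count in counts.items()
        if count > 0 and gene in gene_tokens and gene in gene_medians
    ]
    # Highest normalized expression first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    tokens: List[int] = [token for _, token in scored]

    if cls_token is not None and eos_token is not None:
        # Reserve two positions for the separator tokens.
        tokens = [cls_token] + tokens[: max_input_tokens - 2] + [eos_token]
    else:
        tokens = tokens[:max_input_tokens]
    return {"input_ids": tokens, "length": len(tokens)}


# Toy inputs with made-up gene IDs and token numbers.
counts = {"A": 10, "B": 4, "C": 0, "D": 3}
medians = {"A": 5.0, "B": 1.0, "C": 1.0, "D": 1.0}
tokens = {"A": 11, "B": 12, "C": 13, "D": 14}

plain = rank_value_encode(counts, medians, tokens)
wrapped = rank_value_encode(counts, medians, tokens, max_input_tokens=4,
                            cls_token=1, eos_token=2)
print(plain, wrapped)
```

In the toy data, gene A has the highest raw count but ranks last after division by its median, which is the point of normalizing before ranking.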
Methods
- X(layer_name, *[, batch_size, partitions, ...]): Returns an X layer as a sparse read.
- __init__(experiment, *[, obs_column_names, ...]): Initialize GeneformerTokenizer.
- build([from_generator_kwargs]): Build the dataset from query results.
- cell_item(cell_joinid, cell_Xrow): Given the expression vector for one cell, compute the Dataset item providing the Geneformer inputs (token sequence and metadata).
- close(): Releases resources associated with this query.
- obs(*[, column_names, batch_size, ...]): Returns obs as an Arrow table iterator.
- obs_joinids(): Returns obs soma_joinids as an Arrow array.
- obs_scene_ids(): Returns a pyarrow array with scene ids that contain obs from this query.
- obsm(layer): Returns an obsm layer as a sparse read.
- obsp(layer): Returns an obsp layer as a sparse read.
- to_anndata(X_name, *[, column_names, ...]): Exports the query to an in-memory AnnData object.
- to_spatialdata(X_name, *[, column_names, ...]): Returns a SpatialData object containing the query results.
- var(*[, column_names, batch_size, ...]): Returns var as an Arrow table iterator.
- var_joinids(): Returns var soma_joinids as an Arrow array.
- var_scene_ids(): Returns a pyarrow array with scene ids that contain var from this query.
- varm(layer): Returns a varm layer as a sparse read.
- varp(layer): Returns a varp layer as a sparse read.
Attributes
- indexer: A soma_joinid indexer for both obs and var axes.
- model_cls_token
- model_eos_token
- n_obs: The number of obs axis query results.
- n_vars: The number of var axis query results.
- obs_column_names
- max_input_tokens
- special_token
- model_gene_map
- model_gene_tokens
- model_gene_medians