cellxgene_census.experimental.ml.huggingface.GeneformerTokenizer
- class cellxgene_census.experimental.ml.huggingface.GeneformerTokenizer(experiment: Experiment, *, obs_column_names: Sequence[str] | None = None, obs_attributes: Sequence[str] | None = None, max_input_tokens: int = 2048, special_token: bool = False, token_dictionary_file: str = '', gene_median_file: str = '', gene_mapping_file: str = '', **kwargs: Any)
Generate a Hugging Face Dataset containing Geneformer token sequences for each cell in CELLxGENE Census ExperimentAxisQuery results (human).
This class requires the Geneformer package to be installed separately with: pip install git+https://huggingface.co/ctheodoris/Geneformer@eb038a6
Example usage:
```python
import cellxgene_census
import tiledbsoma
from cellxgene_census.experimental.ml.huggingface import GeneformerTokenizer

with cellxgene_census.open_soma(census_version="latest") as census:
    with GeneformerTokenizer(
        census["census_data"]["homo_sapiens"],
        # set obs_query to define some subset of Census cells:
        obs_query=tiledbsoma.AxisQuery(
            value_filter="is_primary_data == True and tissue_general == 'tongue'"
        ),
        obs_column_names=(
            "soma_joinid",
            "cell_type_ontology_term_id",
        ),
    ) as tokenizer:
        dataset = tokenizer.build()
```
Dataset item contents:
- `input_ids`: Geneformer token sequence for the cell
- `length`: length of the token sequence
- the specified `obs_column_names` (cell metadata from the experiment obs dataframe)
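The `input_ids` sequence follows Geneformer's rank-value encoding: each cell's nonzero expression values are normalized by per-gene median expression, genes are sorted by normalized value in descending order, and the sorted genes are mapped to token numbers. The sketch below illustrates the idea only; the toy gene-token and gene-median dictionaries are hypothetical stand-ins for the files shipped with the Geneformer package, and the real `cell_item` implementation differs in detail.

```python
# Hypothetical toy stand-ins for Geneformer's token dictionary and
# gene median files (the real ones ship with the Geneformer package).
gene_tokens = {"ENSG01": 5, "ENSG02": 6, "ENSG03": 7}   # gene -> token number
gene_medians = {"ENSG01": 2.0, "ENSG02": 1.0, "ENSG03": 4.0}

def rank_value_encode(expr: dict[str, float], max_input_tokens: int = 2048) -> list[int]:
    """Order genes by median-normalized expression (descending) and emit token numbers."""
    norm = {g: x / gene_medians[g] for g, x in expr.items() if x > 0 and g in gene_tokens}
    ranked = sorted(norm, key=norm.get, reverse=True)
    return [gene_tokens[g] for g in ranked[:max_input_tokens]]

# ENSG02 normalizes to 3/1=3.0, ENSG01 to 4/2=2.0, ENSG03 to 4/4=1.0
print(rank_value_encode({"ENSG01": 4.0, "ENSG02": 3.0, "ENSG03": 4.0}))  # [6, 5, 7]
```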
- __init__(experiment: Experiment, *, obs_column_names: Sequence[str] | None = None, obs_attributes: Sequence[str] | None = None, max_input_tokens: int = 2048, special_token: bool = False, token_dictionary_file: str = '', gene_median_file: str = '', gene_mapping_file: str = '', **kwargs: Any) None
Initialize GeneformerTokenizer.
Args:
- experiment: Census Experiment to query
- obs_query: obs AxisQuery defining the set of Census cells to process (default all)
- obs_column_names: obs dataframe columns (cell metadata) to propagate into attributes of each Dataset item
- max_input_tokens: maximum length of the Geneformer input token sequence (default 2048)
- special_token: whether to affix separator tokens to the sequence (default False)
- token_dictionary_file, gene_median_file: pickle files supplying the mapping of Ensembl human gene IDs onto Geneformer token numbers and median expression values. By default, these are loaded from the Geneformer package.
- gene_mapping_file: optional pickle file with a mapping of Census gene IDs to the model's gene IDs
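As a rough illustration of what `token_dictionary_file` and `gene_median_file` contain, the sketch below writes toy pickle files in the assumed format: each file is a pickled dict keyed by Ensembl human gene ID. The gene ID and values here are hypothetical examples; in practice you would normally rely on the defaults bundled with the Geneformer package.

```python
import os
import pickle
import tempfile

# Hypothetical toy values illustrating the assumed file format.
token_dict = {"<pad>": 0, "<mask>": 1, "ENSG00000141510": 2}  # gene -> token number
gene_medians = {"ENSG00000141510": 1.7}  # gene -> median expression value

tmpdir = tempfile.mkdtemp()
token_path = os.path.join(tmpdir, "token_dictionary.pkl")
median_path = os.path.join(tmpdir, "gene_median_dictionary.pkl")
with open(token_path, "wb") as f:
    pickle.dump(token_dict, f)
with open(median_path, "wb") as f:
    pickle.dump(gene_medians, f)

# These paths could then be passed to GeneformerTokenizer as
# token_dictionary_file=token_path and gene_median_file=median_path.
```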
Methods

- `X(layer_name, *[, batch_size, partitions, ...])`: Returns an X layer as a sparse read.
- `__init__(experiment, *[, obs_column_names, ...])`: Initialize GeneformerTokenizer.
- `build([from_generator_kwargs])`: Build the dataset from query results.
- `cell_item(cell_joinid, cell_Xrow)`: Given the expression vector for one cell, compute the Dataset item providing the Geneformer inputs (token sequence and metadata).
- `close()`: Releases resources associated with this query.
- `obs(*[, column_names, batch_size, ...])`: Returns obs as an Arrow table iterator.
- `obs_joinids()`: Returns obs soma_joinids as an Arrow array.
- `obsm(layer)`: Returns an obsm layer as a sparse read.
- `obsp(layer)`: Returns an obsp layer as a sparse read.
- `to_anndata(X_name, *[, column_names, ...])`: Executes the query and returns the result as an in-memory AnnData object.
- `var(*[, column_names, batch_size, ...])`: Returns var as an Arrow table iterator.
- `var_joinids()`: Returns var soma_joinids as an Arrow array.
- `varm(layer)`: Returns a varm layer as a sparse read.
- `varp(layer)`: Returns a varp layer as a sparse read.

Attributes
- `indexer`: A soma_joinid indexer for both obs and var axes.
- `model_cls_token`
- `model_eos_token`
- `n_obs`: The number of obs axis query results.
- `n_vars`: The number of var axis query results.
- `obs_column_names`
- `max_input_tokens`
- `special_token`
- `model_gene_map`
- `model_gene_tokens`
- `model_gene_medians`