cellxgene_census.experimental.ml.huggingface.CellDatasetBuilder

class cellxgene_census.experimental.ml.huggingface.CellDatasetBuilder(experiment: Experiment, measurement_name: str = 'RNA', layer_name: str = 'raw', *, block_size: int | None = None, **kwargs: Any)

Abstract base class for methods that process CELLxGENE Census ExperimentAxisQuery results into a Hugging Face Dataset in which each item represents one cell. Subclasses implement the cell_item() method to process each row of an X layer into a Dataset item, and may also override __init__() and the context manager's __enter__() to perform any necessary preprocessing.

The base class inherits from ExperimentAxisQuery, so typical usage would be:

```
import cellxgene_census
import tiledbsoma
from cellxgene_census.experimental.ml import GeneformerTokenizer

with cellxgene_census.open_soma() as census:
    with SubclassOfCellDatasetBuilder(
        census["census_data"]["homo_sapiens"],
        obs_query=tiledbsoma.AxisQuery(…),  # define some subset of Census cells
        …  # other ExperimentAxisQuery parameters e.g. var_query
    ) as builder:
        dataset = builder.build()
```
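To make the subclassing contract concrete, here is a minimal, self-contained sketch of the pattern described above. It does not import cellxgene_census: `SketchBuilder`, `TopGenesBuilder`, the `top_n` parameter, and the dict-based row representation are hypothetical stand-ins, and a plain list stands in for the Hugging Face Dataset. Only the cell_item(cell_joinid, Xrow) and build() method names and the context-manager usage mirror the documented API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchBuilder(ABC):
    """Hypothetical stand-in for CellDatasetBuilder (not the real class)."""

    def __init__(self, rows: Dict[int, Dict[int, float]]) -> None:
        # rows maps a cell joinid to its sparse X row {var index: value};
        # the real class obtains rows from a Census query instead.
        self._rows = rows

    def __enter__(self) -> "SketchBuilder":
        # Subclasses may override this to do preprocessing before build().
        return self

    def __exit__(self, *exc_info: Any) -> None:
        # The real class releases query resources here; nothing to do.
        return None

    @abstractmethod
    def cell_item(self, cell_joinid: int, Xrow: Dict[int, float]) -> Dict[str, Any]:
        """Turn one cell's X row into one dataset item."""

    def build(self) -> List[Dict[str, Any]]:
        # One item per cell, in joinid order (a list stands in for a Dataset).
        return [self.cell_item(j, self._rows[j]) for j in sorted(self._rows)]


class TopGenesBuilder(SketchBuilder):
    """Example subclass: keep the var indices with the highest values."""

    def __init__(self, rows: Dict[int, Dict[int, float]], top_n: int = 2) -> None:
        super().__init__(rows)
        self.top_n = top_n

    def cell_item(self, cell_joinid: int, Xrow: Dict[int, float]) -> Dict[str, Any]:
        # Sort by descending value, breaking ties by var index.
        ranked = sorted(Xrow, key=lambda v: (-Xrow[v], v))
        return {"soma_joinid": cell_joinid, "top_vars": ranked[: self.top_n]}


toy_rows = {7: {0: 1.0, 3: 5.0, 4: 2.0}, 2: {1: 4.0, 2: 4.0}}
with TopGenesBuilder(toy_rows, top_n=2) as builder:
    dataset = builder.build()
```

The base class owns iteration and assembly, while the subclass only decides what one cell becomes. A real subclass would follow the same shape, with per-cell logic in cell_item() and any one-time setup in __init__() or __enter__().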

__init__(experiment: Experiment, measurement_name: str = 'RNA', layer_name: str = 'raw', *, block_size: int | None = None, **kwargs: Any)

Initialize the CellDatasetBuilder to process the results of a Census ExperimentAxisQuery.

  • experiment: Census Experiment to be queried.

  • measurement_name: Measurement in the experiment, default "RNA".

  • layer_name: Name of the X layer to process, default "raw".

  • block_size: Number of cells to process in-memory at once. If unspecified, tiledbsoma.SparseNDArrayRead.blockwise() will select a default.

  • kwargs: passed through to ExperimentAxisQuery(), especially obs_query and var_query.
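As a rough illustration of what block_size controls, the sketch below walks a list of cell joinids in fixed-size blocks, so that only one block needs to be held in memory at a time. `iter_blocks` is a hypothetical helper written for this example. It is not part of cellxgene_census or tiledbsoma, whose blockwise reads operate on sparse arrays rather than Python lists.

```python
from typing import Iterator, List, Sequence


def iter_blocks(cell_joinids: Sequence[int], block_size: int) -> Iterator[List[int]]:
    """Yield consecutive blocks of at most block_size cell joinids."""
    if block_size < 1:
        raise ValueError("block_size must be a positive integer")
    for start in range(0, len(cell_joinids), block_size):
        yield list(cell_joinids[start : start + block_size])


joinids = [10, 11, 12, 13, 14]
blocks = list(iter_blocks(joinids, block_size=2))
```

A smaller block typically lowers peak memory use at the cost of more read operations, which is the trade-off to weigh when overriding the default.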

Methods

X(layer_name, *[, batch_size, partitions, ...])

Returns an X layer as a sparse read.

__init__(experiment[, measurement_name, ...])

Initialize the CellDatasetBuilder to process the results of a Census ExperimentAxisQuery.

build([from_generator_kwargs])

Build the dataset from query results.

cell_item(cell_joinid, Xrow)

Abstract method to process the X row for one cell into a Dataset item.

close()

Releases resources associated with this query.

obs(*[, column_names, batch_size, ...])

Returns obs as an Arrow table iterator.

obs_joinids()

Returns obs soma_joinids as an Arrow array.

obsm(layer)

Returns an obsm layer as a sparse read.

obsp(layer)

Returns an obsp layer as a sparse read.

to_anndata(X_name, *[, column_names, ...])

Executes the query and returns the result as an in-memory AnnData object.

var(*[, column_names, batch_size, ...])

Returns var as an Arrow table iterator.

var_joinids()

Returns var soma_joinids as an Arrow array.

varm(layer)

Returns a varm layer as a sparse read.

varp(layer)

Returns a varp layer as a sparse read.

Attributes

indexer

A soma_joinid indexer for both obs and var axes.

n_obs

The number of obs axis query results.

n_vars

The number of var axis query results.