cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe

class cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe(experiment: Experiment, measurement_name: str = 'RNA', X_name: str = 'raw', obs_query: AxisQuery | None = None, var_query: AxisQuery | None = None, obs_column_names: Sequence[str] = (), batch_size: int = 1, shuffle: bool = True, seed: int | None = None, return_sparse_X: bool = False, soma_chunk_size: int | None = 64, use_eager_fetch: bool = True, shuffle_chunk_count: int | None = 2000, encoders: list[cellxgene_census.experimental.ml.encoders.Encoder] | None = None)

A torchdata.datapipes.iter.IterDataPipe that reads obs and X data from a tiledbsoma.Experiment, based upon the specified queries along the obs and var axes. Provides an iterator over these data when the object is passed to Python’s built-in iter function.

>>> for batch in iter(ExperimentDataPipe(...)):
        X_batch, y_batch = batch
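As a fuller illustration of the constructor call, the following is a minimal sketch rather than a canonical recipe. The helper name build_datapipe, the census["census_data"]["homo_sapiens"] lookup, and the value_filter string are assumptions chosen for illustration; only the ExperimentDataPipe parameters come from the signature above. Imports are deferred into the function so that defining it requires neither the packages nor network access.

```python
def build_datapipe():
    """Sketch: build an ExperimentDataPipe over a small slice of the Census."""
    # Deferred imports: these packages are only needed when the function runs.
    import cellxgene_census
    import tiledbsoma as soma
    from cellxgene_census.experimental.ml.pytorch import ExperimentDataPipe

    # Assumption: the human experiment lives under this path in the Census.
    census = cellxgene_census.open_soma()
    experiment = census["census_data"]["homo_sapiens"]

    return ExperimentDataPipe(
        experiment,
        measurement_name="RNA",
        X_name="raw",
        # Assumption: an illustrative filter that keeps the query small.
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'tongue' and is_primary_data == True"
        ),
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )
```

Calling build_datapipe() and iterating over the result would then yield (X, obs) pairs as in the loop above. Restricting obs_query is advisable, since an unfiltered query covers all obs and X data.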

The batch_size parameter controls the number of rows of obs and X data that are returned in each iteration. If the batch_size is 1, then each Tensor will have rank 1:

>>> (tensor([0., 0., 0., 0., 0., 1., 0., 0., 0.]),  # X data
     tensor([2415,    0,    0], dtype=torch.int64)) # obs data, encoded

For larger batch_size values, the returned Tensors will have rank 2:

>>> DataLoader(..., batch_size=3, ...):
    (tensor([[0., 0., 0., 0., 0., 1., 0., 0., 0.],     # X batch
             [0., 0., 0., 0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
     tensor([[2415,    0,    0],                       # obs batch
             [2416,    0,    4],
             [2417,    0,    3]], dtype=torch.int64))
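The relationship between batch_size and rank can be illustrated without PyTorch. The following is a minimal sketch that uses nested Python lists in place of Tensors; iter_batches is a hypothetical helper and not part of cellxgene_census.

```python
from typing import Iterator, List, Sequence, Union

Row = List[float]


def iter_batches(
    rows: Sequence[Row], batch_size: int
) -> Iterator[Union[Row, List[Row]]]:
    """Yield one row (rank 1) when batch_size == 1, else a list of rows (rank 2)."""
    for start in range(0, len(rows), batch_size):
        block = list(rows[start:start + batch_size])
        # A batch of size 1 is unwrapped to the row itself.
        yield block[0] if batch_size == 1 else block


rows = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
single = list(iter_batches(rows, batch_size=1))  # three rank-1 items
paired = list(iter_batches(rows, batch_size=2))  # two rank-2 items
```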

The return_sparse_X parameter controls whether the X data is returned as a dense or sparse torch.Tensor. If the model supports use of sparse torch.Tensors, this will reduce memory usage.

The obs_column_names parameter determines the data columns that are returned in the obs Tensor. String-typed columns are encoded as integer values. If needed, these values can be decoded by obtaining the encoder for a given obs column name and calling its inverse_transform method:

>>> exp_data_pipe.obs_encoders["<obs_attr_name>"].inverse_transform(encoded_values)
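To make the encode/decode round trip concrete, here is a small sketch that uses a pure-Python stand-in for sklearn.preprocessing.LabelEncoder. TinyLabelEncoder and the sample cell types are hypothetical; the stand-in mirrors LabelEncoder in assigning integer codes by the sorted order of the distinct values.

```python
from typing import Dict, List, Sequence


class TinyLabelEncoder:
    """Hypothetical stand-in for sklearn.preprocessing.LabelEncoder."""

    def fit(self, values: Sequence[str]) -> "TinyLabelEncoder":
        # Codes follow the sorted order of the distinct labels.
        self.classes_: List[str] = sorted(set(values))
        self._index: Dict[str, int] = {
            label: code for code, label in enumerate(self.classes_)
        }
        return self

    def transform(self, values: Sequence[str]) -> List[int]:
        return [self._index[value] for value in values]

    def inverse_transform(self, codes: Sequence[int]) -> List[str]:
        return [self.classes_[code] for code in codes]


cell_types = ["T cell", "B cell", "neuron", "B cell"]
encoder = TinyLabelEncoder().fit(cell_types)
encoded = encoder.transform(cell_types)       # integer codes, as in the obs Tensor
decoded = encoder.inverse_transform(encoded)  # back to the original strings
```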

Lifecycle

experimental

__init__(experiment: Experiment, measurement_name: str = 'RNA', X_name: str = 'raw', obs_query: AxisQuery | None = None, var_query: AxisQuery | None = None, obs_column_names: Sequence[str] = (), batch_size: int = 1, shuffle: bool = True, seed: int | None = None, return_sparse_X: bool = False, soma_chunk_size: int | None = 64, use_eager_fetch: bool = True, shuffle_chunk_count: int | None = 2000, encoders: list[cellxgene_census.experimental.ml.encoders.Encoder] | None = None) → None

Construct a new ExperimentDataPipe.

Parameters:
  • experiment – The tiledbsoma.Experiment from which to read data.

  • measurement_name – The name of the tiledbsoma.Measurement to read. Defaults to "RNA".

  • X_name – The name of the X layer to read. Defaults to "raw".

  • obs_query – The query used to filter along the obs axis. If not specified, all obs and X data will be returned, which can be very large.

  • var_query – The query used to filter along the var axis. If not specified, all var columns (genes/features) will be returned.

  • obs_column_names – The names of the obs columns to return. If custom encoders are passed, this parameter must not be used, since the columns will be inferred automatically from the encoders.

  • batch_size – The number of rows of obs and X data to return in each iteration. Defaults to 1. A value of 1 will result in a torch.Tensor of rank 1 being returned (a single row); larger values will result in torch.Tensors of rank 2 (multiple rows).

  • shuffle – Whether to shuffle the obs and X data being returned. Defaults to True. For performance reasons, shuffling is not performed globally across all rows, but rather in chunks. More specifically, shuffle_chunk_count non-contiguous chunks are selected from across all the observations in the query, the chunks are concatenated, and the associated observations are shuffled. The randomness of the shuffling is therefore determined by the (soma_chunk_size, shuffle_chunk_count) selection. The default values have been determined to yield a good trade-off between randomness and performance. Further tuning may be required for different types of models. Note that memory usage grows with the product soma_chunk_size * shuffle_chunk_count.

  • seed – The random seed used for shuffling. Defaults to None (no seed). This must be specified when using torch.nn.parallel.DistributedDataParallel to ensure data partitions are disjoint across worker processes.

  • return_sparse_X – Controls whether the X data is returned as a dense or sparse torch.Tensor. As X data is very sparse, setting this to True will reduce memory usage, if the model supports use of sparse torch.Tensors. Defaults to False, since sparse torch.Tensors are still experimental in PyTorch.

  • soma_chunk_size – The number of obs/X rows to retrieve when reading data from SOMA. This impacts two aspects of this class’s behavior: 1) The maximum memory utilization, with larger values providing better read performance, but also requiring more memory; 2) The granularity of the global shuffling step (see shuffle parameter for details). The default value of 64 works well in conjunction with the default shuffle_chunk_count value.

  • use_eager_fetch – Fetch the next SOMA chunk of obs and X data immediately after a previously fetched SOMA chunk is made available for processing via the iterator. This allows network (or filesystem) requests to be made in parallel with client-side processing of the SOMA data, potentially improving overall performance at the cost of doubling memory utilization. Defaults to True.

  • shuffle_chunk_count – The number of chunks (blocks of contiguous rows) that are sampled, concatenated, and then shuffled together. Larger numbers correspond to more randomness per training batch. If shuffle == False, this parameter is ignored. Defaults to 2000.

  • encoders – Specify custom encoders to be used. If not specified, a LabelEncoder will be created and used for each column in obs_column_names. If specified, only columns for which an encoder has been registered will be returned in the obs tensor. Each encoder needs to have a unique name. If this parameter is specified, the obs_column_names parameter must not be used, since the columns will be inferred automatically from the encoders.
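The chunked shuffling described under shuffle, soma_chunk_size, and shuffle_chunk_count can be sketched in plain Python. This is an illustration of the described strategy under simplifying assumptions, not the library's implementation: chunked_shuffle is a hypothetical helper, rows are represented by integer ids, and the standard random module stands in for whatever random number generator the library uses.

```python
import random
from typing import Iterator, List, Optional


def chunked_shuffle(
    n_rows: int,
    soma_chunk_size: int,
    shuffle_chunk_count: int,
    seed: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield blocks of row ids, shuffled within groups of chunks."""
    rng = random.Random(seed)
    # Split the row range into chunks of contiguous rows.
    chunks = [
        list(range(start, min(start + soma_chunk_size, n_rows)))
        for start in range(0, n_rows, soma_chunk_size)
    ]
    # Randomize chunk order so each group draws from non-contiguous regions.
    rng.shuffle(chunks)
    for i in range(0, len(chunks), shuffle_chunk_count):
        group = chunks[i:i + shuffle_chunk_count]
        # Concatenate the group's chunks, then shuffle the rows together.
        block = [row for chunk in group for row in chunk]
        rng.shuffle(block)
        yield block


blocks = list(
    chunked_shuffle(n_rows=1_000, soma_chunk_size=8, shuffle_chunk_count=5, seed=0)
)
# With the documented defaults, one shuffle block spans this many rows.
rows_per_block_default = 64 * 2000
```

Every row appears exactly once across the blocks, and a fixed seed reproduces the same order. In this sketch each block holds at most soma_chunk_size * shuffle_chunk_count rows, which is why memory usage tracks that product: 128,000 rows with the defaults.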

Lifecycle

experimental

Methods

__init__(experiment[, measurement_name, ...])

Construct a new ExperimentDataPipe.

register_datapipe_as_function(function_name, ...)

register_function(function_name, function)

reset()

Reset the IterDataPipe to the initial state.

set_getstate_hook(hook_fn)

set_reduce_ex_hook(hook_fn)

stats()

Get data loading stats for this cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe.

Attributes

functions

getstate_hook

obs_encoders

Returns a dictionary of sklearn.preprocessing.LabelEncoder objects, keyed on obs column names, which were used to encode the obs column values.

reduce_ex_hook

repr_hook

shape

Get the shape of the data that will be returned by this cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe.

str_hook