cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe
- class cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe(experiment: Experiment, measurement_name: str = 'RNA', X_name: str = 'raw', obs_query: AxisQuery | None = None, var_query: AxisQuery | None = None, obs_column_names: Sequence[str] = (), batch_size: int = 1, shuffle: bool = True, seed: int | None = None, return_sparse_X: bool = False, soma_chunk_size: int | None = 64, use_eager_fetch: bool = True, shuffle_chunk_count: int | None = 2000, encoders: list[cellxgene_census.experimental.ml.encoders.Encoder] | None = None)
A torchdata.datapipes.iter.IterDataPipe that reads obs and X data from a tiledbsoma.Experiment, based upon the specified queries along the obs and var axes. Provides an iterator over these data when the object is passed to Python’s built-in iter function.

>>> for batch in iter(ExperimentDataPipe(...)):
...     X_batch, y_batch = batch
The batch_size parameter controls the number of rows of obs and X data that are returned in each iteration. If the batch_size is 1, then each Tensor will have rank 1:

>>> (tensor([0., 0., 0., 0., 0., 1., 0., 0., 0.]),  # X data
     tensor([2415,    0,    0], dtype=torch.int64)) # obs data, encoded

For larger batch_size values, the returned Tensors will have rank 2:

>>> DataLoader(..., batch_size=3, ...):
    (tensor([[0., 0., 0., 0., 0., 1., 0., 0., 0.],     # X batch
             [0., 0., 0., 0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
     tensor([[2415,    0,    0],                       # obs batch
             [2416,    0,    4],
             [2417,    0,    3]], dtype=torch.int64))
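The rank convention described above can be illustrated without PyTorch. The sketch below uses plain Python lists in place of tensors. `make_batches` is a hypothetical helper written for this illustration; it is not part of cellxgene_census and only mirrors the documented shape rule.

```python
from typing import Iterator, List, Sequence, Union

Row = List[float]
Batch = Union[Row, List[Row]]


def make_batches(rows: Sequence[Row], batch_size: int) -> Iterator[Batch]:
    """Group rows into batches, mirroring the documented rank convention.

    Hypothetical helper for illustration only: a batch_size of 1 yields a
    single row (rank 1); larger values yield a list of rows (rank 2).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(rows), batch_size):
        chunk = [list(row) for row in rows[start:start + batch_size]]
        yield chunk[0] if batch_size == 1 else chunk


# Four toy "X" rows with two features each.
rows = [[0.0, 1.0], [0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
rank1 = list(make_batches(rows, batch_size=1))  # each item is one flat row
rank2 = list(make_batches(rows, batch_size=2))  # each item is a list of rows
```

Code that consumes batches may need to handle both shapes, or wrap a rank 1 batch in an outer dimension before passing it to a model that expects rank 2 input.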
The return_sparse_X parameter controls whether the X data is returned as a dense or sparse torch.Tensor. If the model supports use of sparse torch.Tensors, this will reduce memory usage.

The obs_column_names parameter determines the data columns that are returned in the obs Tensor. String-typed columns are encoded as integer values. If needed, these values can be decoded by obtaining the encoder for a given obs column name and calling its inverse_transform method:

>>> exp_data_pipe.obs_encoders["<obs_attr_name>"].inverse_transform(encoded_values)
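To make the encode/decode round trip concrete, here is a minimal sketch of what a label encoder does. `TinyLabelEncoder` is a hypothetical stand-in written for this illustration, not the encoder class used by the library. It follows the same convention as sklearn.preprocessing.LabelEncoder, which assigns each distinct label its index in the sorted list of labels.

```python
from typing import Dict, List, Sequence


class TinyLabelEncoder:
    """Minimal stand-in for a label encoder (illustration only).

    Each distinct label is mapped to its index in the sorted list of labels.
    """

    def fit(self, labels: Sequence[str]) -> "TinyLabelEncoder":
        self.classes_: List[str] = sorted(set(labels))
        self._index: Dict[str, int] = {c: i for i, c in enumerate(self.classes_)}
        return self

    def transform(self, labels: Sequence[str]) -> List[int]:
        # String labels -> integer codes, as stored in the obs tensor.
        return [self._index[label] for label in labels]

    def inverse_transform(self, codes: Sequence[int]) -> List[str]:
        # Integer codes -> original string labels.
        return [self.classes_[code] for code in codes]


cell_types = ["neuron", "T cell", "neuron", "B cell"]
encoder = TinyLabelEncoder().fit(cell_types)
encoded = encoder.transform(cell_types)
decoded = encoder.inverse_transform(encoded)
```

The integer codes carry no meaning on their own, so keep the encoder (or the whole obs_encoders dictionary) around for as long as model outputs need to be mapped back to labels.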
Lifecycle
experimental
- __init__(experiment: Experiment, measurement_name: str = 'RNA', X_name: str = 'raw', obs_query: AxisQuery | None = None, var_query: AxisQuery | None = None, obs_column_names: Sequence[str] = (), batch_size: int = 1, shuffle: bool = True, seed: int | None = None, return_sparse_X: bool = False, soma_chunk_size: int | None = 64, use_eager_fetch: bool = True, shuffle_chunk_count: int | None = 2000, encoders: list[cellxgene_census.experimental.ml.encoders.Encoder] | None = None) -> None
Construct a new ExperimentDataPipe.

Parameters:
- experiment – The tiledbsoma.Experiment from which to read data.
- measurement_name – The name of the tiledbsoma.Measurement to read. Defaults to "RNA".
- X_name – The name of the X layer to read. Defaults to "raw".
- obs_query – The query used to filter along the obs axis. If not specified, all obs and X data will be returned, which can be very large.
- var_query – The query used to filter along the var axis. If not specified, all var columns (genes/features) will be returned.
- obs_column_names – The names of the obs columns to return. If custom encoders are passed, this parameter must not be used, since the columns will be inferred automatically from the encoders.
- batch_size – The number of rows of obs and X data to return in each iteration. Defaults to 1. A value of 1 will result in a torch.Tensor of rank 1 being returned (a single row); larger values will result in torch.Tensors of rank 2 (multiple rows).
- shuffle – Whether to shuffle the obs and X data being returned. Defaults to True. For performance reasons, shuffling is not performed globally across all rows, but rather in chunks. More specifically, we select shuffle_chunk_count non-contiguous chunks across all the observations in the query, concatenate the chunks and shuffle the associated observations. The randomness of the shuffling is therefore determined by the (soma_chunk_size, shuffle_chunk_count) selection. The default values have been determined to yield a good trade-off between randomness and performance. Further tuning may be required for different types of models. Note that memory usage is correlated to the product soma_chunk_size * shuffle_chunk_count.
- seed – The random seed used for shuffling. Defaults to None (no seed). This must be specified when using torch.nn.parallel.DistributedDataParallel to ensure data partitions are disjoint across worker processes.
- return_sparse_X – Controls whether the X data is returned as a dense or sparse torch.Tensor. As X data is very sparse, setting this to True will reduce memory usage, if the model supports use of sparse torch.Tensors. Defaults to False, since sparse torch.Tensors are still experimental in PyTorch.
- soma_chunk_size – The number of obs/X rows to retrieve when reading data from SOMA. This impacts two aspects of this class’s behavior: 1) The maximum memory utilization, with larger values providing better read performance, but also requiring more memory; 2) The granularity of the global shuffling step (see the shuffle parameter for details). The default value of 64 works well in conjunction with the default shuffle_chunk_count value.
- use_eager_fetch – Fetch the next SOMA chunk of obs and X data immediately after a previously fetched SOMA chunk is made available for processing via the iterator. This allows network (or filesystem) requests to be made in parallel with client-side processing of the SOMA data, potentially improving overall performance at the cost of doubling memory utilization. Defaults to True.
- shuffle_chunk_count – The number of contiguous blocks (chunks) of rows sampled to then concatenate and shuffle. Larger numbers correspond to more randomness per training batch. If shuffle == False, this parameter is ignored. Defaults to 2000.
- encoders – Specify custom encoders to be used. If not specified, a LabelEncoder will be created and used for each column in obs_column_names. If specified, only columns for which an encoder has been registered will be returned in the obs tensor. Each encoder needs to have a unique name. If this parameter is specified, the obs_column_names parameter must not be used, since the columns will be inferred automatically from the encoders.
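The chunked shuffling described under the shuffle, soma_chunk_size and shuffle_chunk_count parameters can be sketched in a few lines of standard-library Python. This is an illustrative sketch of the documented idea only, not the library's implementation. `chunked_shuffle` is a hypothetical function name, and it works on plain row indices rather than SOMA data.

```python
import random
from typing import Iterator, List, Optional


def chunked_shuffle(
    n_rows: int,
    soma_chunk_size: int = 64,
    shuffle_chunk_count: int = 2000,
    seed: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield shuffled blocks of row indices (illustrative sketch only).

    1. Cut the row range into contiguous chunks of soma_chunk_size rows.
    2. Shuffle the order of the chunks.
    3. Concatenate shuffle_chunk_count chunks at a time and shuffle the rows
       inside each concatenated block.
    """
    rng = random.Random(seed)
    chunks = [
        list(range(start, min(start + soma_chunk_size, n_rows)))
        for start in range(0, n_rows, soma_chunk_size)
    ]
    rng.shuffle(chunks)  # each block now draws chunks from across the row range
    for i in range(0, len(chunks), shuffle_chunk_count):
        block = [row for chunk in chunks[i:i + shuffle_chunk_count] for row in chunk]
        rng.shuffle(block)  # mix rows within the concatenated block
        yield block


# Small numbers so the result is easy to inspect.
blocks = list(
    chunked_shuffle(n_rows=1000, soma_chunk_size=10, shuffle_chunk_count=5, seed=0)
)

# Rows held in one shuffle block with the documented defaults.
rows_per_block_default = 64 * 2000
```

With the defaults, one block spans 128,000 rows, which is why memory usage tracks the product soma_chunk_size * shuffle_chunk_count. Raising shuffle_chunk_count mixes rows from more regions of the data into each block, at the cost of holding more rows in memory at once.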
Lifecycle
experimental
Methods
- __init__(experiment[, measurement_name, ...]) – Construct a new ExperimentDataPipe.
- register_datapipe_as_function(function_name, ...)
- register_function(function_name, function)
- reset() – Reset the IterDataPipe to the initial state.
- set_getstate_hook(hook_fn)
- set_reduce_ex_hook(hook_fn)
- stats() – Get data loading stats for this cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe.

Attributes
- functions
- getstate_hook
- obs_encoders – Returns a dictionary of sklearn.preprocessing.LabelEncoder objects, keyed on obs column names, which were used to encode the obs column values.
- reduce_ex_hook
- repr_hook
- shape – Get the shape of the data that will be returned by this cellxgene_census.experimental.ml.pytorch.ExperimentDataPipe.
- str_hook