Starting a run and analyzing results using the Python API

Accessing ExpressionMatrix2 functionality from Python code

Users access ExpressionMatrix2 functionality via shared library ExpressionMatrix2.so, which can by imported in a Python script as a Python package. The build assumes that the library will be used with Python Version 3, but could be modified (by changing some include file paths and library names) to work with Python 2 instead. You can import ExpressionMatrix2 in Python 3 code just like any other Python package, for example adding the following near the beginning of your Python script or after starting a Python shell for interactive use:

import ExpressionMatrix2

For this to work, the ExpressionMatrix2 library must be made available to the Python interpreter in one of the standard ways - for example:

Set environment variable PYTHONPATH to point to the directory containing the library.
Make a copy of the library in the directory which is the current directory at the time the Python import command executes. A symbolic link will also work.

All the code examples in the next sections assume that the import command before was executed successfully.

Creating and accessing expression matrix objects

To begin working with a new data set, you must first create a new Python object to represent the expression matrix for the new data set. This can be done as shown in the following example:

e = ExpressionMatrix.ExpressionMatrix(directoryName = 'data')

The parameter is the name of a directory that will be used to hold various binary files containing expression matrix data structures. The directory should not already exist, and will be created.

Since most data structures are stored in files, the expression matrix object is persistent: it can be reaccessed in the future, even after the Python object used for the original creation disappears. This can be done by simply specifying the name of the existing directory containing the binary data, just as before:

e = ExpressionMatrix.ExpressionMatrix(directoryName = 'data')

The code examples in the next sections assume that an expression matrix object named "e" was accessed in one of the two ways shown in this section.

Adding cells

After the expression matrix object is first created, the first thing you will want to do is add cells to it. To provide flexibility in adding cells, four paths to add cells are provided:

Function addCells can be used to add cells and meta data from files in comma separated format (csv) or similar delimited formats.
Function addCellsFromHdf5 can be used to add cells from expression matrix data in HDF5 format created by the 10x Genomics pipeline.
Function addCell can be used to add a single cell with meta data and expression counts passed in as Python lists..
Function addCellFromJson can be used to add a single cell with meta data and expression counts specified using a simple JSON format.

Each of these methods is described in more details in one of the sections below. You can use any combination of the methods, in any sequence, to add different batches of cells to a single expression matrix object. In that case, each batch can use a different set of cell meta data names.

Note that the last two methods to add cells can be used to add cells in any input format. If you want to add cells stored in a custom format, you will need to write Python code to process your input data and prepare the inputs necessary to call addCell or addCellFromJson.

Adding cells in comma separated format or similar delimited formats

You can use this path if you have a text file containing a dense representation of the expression matrix. This file can be a standard comma separated file (csv), which follows the following rules:

Fields in each row are separated by commas or any other separator, which is specified when calling addCells.
A field cannot contain a separator, unless the entire field is enclosed quotes or the separator is escaped (preceeded by an extra "\" character).
If a field contains quotes, the quotes in the field must be escaped (preceeded by an extra "\" character).

The expression matrix file has a row for each gene and a column for each cell, plus an initial row containing cell names and an initial column containing gene names. The first field of the first row is optional, and is ignored if present. Except for the initial row and column, each entry in the file contains a number which is the expression count for the gene and cell corresponding to the row and column that the field belongs to. These expression counts do not have to be integers (floating-point values are accepted), so prenormalization or recalibration of the data can be done, if desired (for example to convert reads to reads per Megabase, or to apply a logarithm or other non-linear function to alter the weighing of low expression counts).

As an example of an expression matrix in this format, here is the one used for the small test case in tests/ToyTest1:

Gene,Cell0,Cell1,Cell2
Gene0,10,0,0
Gene1,10,10,0
Gene3,0,10,20

Some pipelines create expression matrices that contain rows, near the beginning or the end of the file, which contain extraneous information other than gene expression data. Such information must be removed before the file is used as input to the ExpressionMatrix2 code. Also, expression counts for artificial ERCC sequence should be removed from the expression file, if present.

An input file containing cell meta data must also be provided. The cell meta data file is formatted similarly to the expression matrix file, and contains a row for each cell and a column for each meta data field, plus an initial column containing cell names and an initial row containing meta data field names. Like for the expression matrix file, the first field of the first row is optional, and is ignored if present.

Here is an example of a cell meta data file in this format, also taken from the small test case in tests/ToyTest1:

Cell,Organ,Tumor
Cell0,Brain,Yes
Cell1,Liver,No
Cell2,Pancreas,No

The cell names in the first row of the expression count file and in the first column of the meta data file don't have to be in the same order. If a cell name is present in only one of the files, that cell is ignored. This allows using pipelines in which one or both of the files contain information for additional cells that are to be ignored.

If either the expression file or the meta data file were generated by a Windows system, you will need to use the dos2unix command to convert the line ends. If dos2unix is not available and you are on ubuntu 16.04, install it using command "apt install dos2unix" (this requires root privileges).

You can use Python code like this to add to an expression matrix a set of cells described in this format:

e.addCells(
    expressionCountsFileName = 'ExpressionMatrix.csv',
    expressionCountsFileSeparators = ',',
    metaDataFileName = 'MetaData.csv',
    metaDataFileSeparators = ',')

Here is the meaning of the arguments to the addCells call:

The name of the expression matrix file.
The separator or separators used in the expression matrix file.
The name of the cell meta data file.
The separator or separators used in the cell meta data file.

Note that, when using this format, the expression matrix read on input is represented in a dense format - that is, all of its entries are explicitly present, including the ones that are zero. Therefore this format becomes unpractical for large runs. Because information for each cell is stored in columns of the expression matrix file, the entire file has to be read into memory. As a result, the input process requires much more memory than if using more space efficient formats like the ones described in the next sections.

Adding cells from files created by the BioHub pipeline

As the BioHub single-cell RNA pipeline evolves, the file formats it creates change. Functionality to add cells from this pipeline in one of these formats is described below.

For Illumina data, July 2017

Function ExpressionMatrix.addCellsFromBioHub1 can be used to add to the system cells created by the BioHub pipeline as of July 2017 from Illumina data. For each plate, three files are used:

A csv file containing expression counts by cell, with one row per cell and one column per gene plus a header line containing gene names. The name of this file is specified as the first parameter to addCellsFromBioHub1. The first column contains the cell name. Additional columns are also allowed before and after the last gene. These columns are treated as per-cell meta data. The number of these initial and final columns is specified in the second and third parameter to addCellsFromBioHub1.
A csv file containing plate meta data. The first row contains plate names. The plate name is obtained from the name of the expression counts file (it is the portion in the file name that preceded the first period). The row in this file corresponding to that plate name is used to assign meta data to all the cells.
A csv file containing cell meta data, with one row for each cell and one column for each meta data field. The first column contains the cell name (which matches the cell name used in the expression count file). The cells in this file are not required to be in the same order as the cells in the expression counts file.

The names of the first two files are passed as arguments to addCellsFromBioHub1. The additional cell meta data in the third file can then be added using a call to

addCellMetaData.



For 10X Genomics data, September 2017

Function 
ExpressionMatrix.addCellsFromBioHub2
can be used to
add to the system cells created by the BioHub pipeline
as of September 2017 from 10X genomics data. 
It uses a plate file which is a comma separated file containing
a line for each plate to be processed plus a header line
containing column names.
This function is only available in a version of the
ExpressionMatrix2 software built with HDF5 support.


The following columns are required to be present in the plate file:

PlateName specifies the name of each plate. The name of each cell is created
by concatenating the plate name with the 10X barcode within the plate.
aws-s3-hdf5 specifies the AWS S3 path to the hdf5 file containing expression 
data for each plate. A valid AWS S3 path begins with "s3://" followed by the S3 bucket name. 
This uses the AWS CLI (ubuntu package name awscli) to make a temporary local copy of each file. 
The AWS CLI must be installed amd configured for access to the buckets where the data are stored.



All the fields in each row of the plate file are used as cell meta data for the
cells found in the plate.


The second argument to addCellsFromBioHub2 specifies a threshold on the total expression
count for a cell to be added. Barcodes for which the total expression count (total UMI)
is less than this value are not used to generate a cell.



Adding cells from expression matrix data in HDF5 format created by the 10x Genomics pipeline


The 10X Genomics RNA pipeline creates expression matrix data in HDF5 format. This format is very space efficient because the expression matrix is stored in sparse format (only the non-zero entries are stored), and in addition it is also compressed. It is also memory efficient because the data in the file are stored by cell, not by gene, and as a result it is not necessary to load the entire file in memory for processing.


If you want use one of these files as input to the ExpressionMatrix2 code and you are on ubuntu, it is suggested that you install package hdf5-tools from the standard ubuntu repositories. This provides a variety of commands that are useful to manipulate HDF5 files. For example, you can use the following command to list the data sets contained in an HDF5 file:

h5ls -r file.h5 
/                        Group
/GRCh38                  Group
/GRCh38/barcodes         Dataset {8083}
/GRCh38/data             Dataset {8863417/Inf}
/GRCh38/gene_names       Dataset {33694}
/GRCh38/genes            Dataset {33694}
/GRCh38/indices          Dataset {8863417/Inf}
/GRCh38/indptr           Dataset {8084/8192}
/GRCh38/shape            Dataset {2/16384}



The ExpressionMatrix2 code requires the following HDF5 data sets to be present, below the top level HDF5 group:


Data set genes, which contains Ensembl gene ids. Data set gene_names is not used, even though it contains more friendly gene names, because the 10X pipeline does not guarantee these names to be unique.
Data set barcodes, which contains barcodes which are used as cell names.
Data sets data, indices, indptr, which contain the non-zero elements of the expression matrix in column-compressed sparse format.



You can use Python code like this to add to an expression matrix a set of cells described in this format:

e.addCellsFromHdf5(
    fileName = 'MyData.h5', 
    cellNamePrefix = 'MyExperiment-', 
    cellMetaData = [], 
    totalExpressionCountThreshold = 1000.)



Cell names are constructed by concatenating the specified cellNamePrefix with the barcodes in the hdf5 file.
The specified cell meta data is added to all the cells. 
Cells for which the total expression count is less than the specified
totalExpressionCountThreshold are skipped. 


The argument specifies the name of the HDF5 file. Note then, when using this format, cells have no meta data except for their name.



Adding cells given meta data and expression counts in Python lists


If your expression data are in a custom format, you can write Python code to create simple Python lists for each cell you want to add, then call function addCell. A simple example follows:

metaData = [('cellName', 'abc'), ('key1', 'value1')}, 
expressionCounts = [('gene1', 10), ('gene2', 20)]
e.addCell(metaData=metaData, expressionCounts=expressionCounts)



Each cell is required ot have the cellName meta data field, but other than that the set of meta data for each cell is completely arbitrary. Each cell can have a different set of meta data fields.


Your code will need a loop over all cells to be added. For each cell, the code would read information for that cell in your custom format, create the required JSON, then call addCellFromJson as shown above. This provides complete flexibility on the input format, but requires some Python coding.



Adding cells in simple JSON format


If your expression data are in a custom format, you can also write Python code to create simple JSON for each cell you want to add, then call function addCellFromJson passing the JSON as the argument. This simple example shows the format of the expected JSON and the Python code required to add the cell to an expression matrix object:

import json
cell = {'metaData': {'cellName': 'abc', 'key1': 'value1'}, 'expressionCounts': {'gene1': 10,'gene2': 20}}
e.addCellFromJson(jsonString = json.dumps(jSonString))



Each cell is required ot have the cellName meta data field, but other than that the set of meta data for each cell is completely arbitrary. Each cell can have a different set of meta data fields.


Your code will need a loop over all cells to be added. For each cell, the code would read information for that cell in your custom format, create the required JSON, then call addCellFromJson as shown above. This provides complete flexibility on the input format, but requires some Python coding.



Finding pairs of similar cells


In order to be able to create a cell similarity graph, it is necessary to first find pairs of similar cells. This is the most computationally intense portion of the ExpressionMatrix2 code, and is done in Python code.


A set of pairs of similar pairs can be created in Python code using a choice of methods. On creation, it is also given a name. That name can later be used, from the http server or in other Python code, to create a cell similarity graph.


The following two parameters are common to all methods to find pairs of similar cells:


k is the maximum number cells similar to a given cell that will be stored. For each cell, only the k most similar cells will be stored.
similarityThreshold is the minimum similarity required to store a pair of similar cells.



Note that, if cells A and B are found to have similarity greater than the threshold, we will normally store cell B in the list of cells similar to cell A, and cell A in the list of cells similar to cell B. However, one or both of these stores may have to be omitted to enforce the k parameter. As a result, the cell similarity described by a given set of cell similarities is not necessarily symmetric: if cell A is stored as being similar to cell B, it is not necessarily the case that cell B is stored as being similar to cell A. 


The next sections describe the currently supported methods of finsding sets os similar pairs of cells.



Function findSimilarPairs0


Function findSimilarPairs0 loops over pairs of cells, and explicitly computes the similarity of the two cells in each pair, defined as the Pearson correlation coefficient of the expression counts of the two cells. It is called as follows:

e.findSimilarPairs0(geneSetName, cellSetName, similarPairsName, k, similarityThreshold)


 The first three parameters specify the set of genes to be used in the computation, the set of cells for which the computation is to be performed, and the name of the object containing the similar pairs that will be created. The remaining two parameters were described above.

Keep in mind that the time to perform the computation is proportional to the number of cell pairs and therefore grows with the square of the number of cells. Typically, this computation becomes prohibitive when the number of cells is in the several thousands.



Function findSimilarPairs3 (preliminary documentation)


Until more complete documentation on the Locality Sensitive Hashing methods used in ExpressionMatrix2 code is available, this will serve as preliminary documentation for function findSimilarPairs3 .


Function findSimilarPairs3 loops over pairs of cells, and approximately computes the similarity of the two cells in each pair, defined as the Pearson correlation coefficient of the expression counts of the two cells, using Locality Sensitive Hashing (LSH). It is called as follows:

seed = 231
lshCount = 1024
e.findSimilarPairs3(cellSetName, similarPairsName, k, similarityThreshold, lshCount, seed)



The first two parameters specify the set of cells for which the computation is to be performed, and the name of the object containing the similar pairs that will be created. The next two parameters were described above. Until complete documentation is available, the above parameter values for the last two parameters should be used. This is much faster than even approximate computation using findSimilarPairs0 described above, however it still grows with the square of the number of cells. It typically becomes computationally prohibitive when the number of cells is in the hundreds of thousands. With the parameter values above, it computes similarity values that typically have a standard deviation 0.03 through 0.05 from the exact value.


Note that, for this function to perform at high speed, use of the POPCOUNT instruction must be enabled. For the Ubuntu build under Eclipse, this is done using compiler option -msse4.2. Equivalent means would have to be used by a port to a different platform to ensure high performance of function findSimilarPairs3.



Function findSimilarPairs2 (preliminary documentation)


Function findSimilarPairs2, under development, uses Locality Sensitive Hashing (LSH) to find cell pairs with high similarity without looping over all pairs of cells. This provides scaling to large number of cells with compute cost increasing in a slower fashion than proportionally to the square of the number of cells.



Starting the http server

 See here for a description of functionality offered by the http server. Before you can use that functionality, you need to start the server using Python code like the one shown below:

e.explore(port = 17100, docDirectory = pathToDocumentationDirectory)



Specifying the port number is optional. If not specified, the port number defaults to 17100. It is best to use high port numbers (between 1025 and 65535). If the specified port number is already in use in the system, the next port number will be tried. If you want to have http servers running for multiple runs, each run will use a different port number. 


Occasionally, even after a server is interrupted, the port the server was using remains in use for a short period of time (a few seconds or up to a few minutes). This behavior is dictated by the TCP/IP stack and cannot be overridden.


Specifying serverParameters.docDirectory is also optional. This should be set to the local path of the ExpressionMatrix2/doc directory and is necessary for the server to be able to locate the documentation. If not specified, the server will still start, but will be unable to display the documentation.


The server can be stopped at any time using Ctrl^C. However, if this is done, this causes a hard termination of the Python script that was running the server. Python destructors and other Python clean up code will not be executed.



Interrupting Python code


You can interrupt running Python code using Ctrl^C as usual. This sends a signal, and when the Python interpreter sees the signal it calls the necessary cleanup code and terminates. However, if the Ctrl^C is sent while ExpressionMatrix2 code is running, the Python interpreter will only see the signal when the ExpressionMatrix2 code returns, which means that it may take a long time for termination to occur, depending on what the code was doing. A practical way to obtain immediate termination is to first use Ctrl^Z, followed by a "kill %" command to kill the process.


The http server is an exception to this behavior, because it sets up its own signal handler. As explained above, the server can be stopped at any time using Ctrl^C. However, if this is done, this causes a hard termination of the Python script that was running the server. Python destructors and other Python clean up code will not be executed.