Case study 1 script input.py

This script initializes the expression matrix from the data in the input files.

#!/usr/bin/python3

This line allows the script to be invoked directly from the shell as "./input.py", provided the file is executable. This assumes that you are on a platform that uses Python 3. If you are on a platform that uses Python 2, you instead need to invoke the script as "python input.py".

from ExpressionMatrix2 import *

This makes the ExpressionMatrix2 code accessible from Python, without the need to prefix it with a module name. This is not necessarily a good idea, particularly for a large script, but it does simplify the code a bit.

For this to work, ExpressionMatrix2.so must be located in a directory where the Python interpreter can find it. There are several ways to do that, the simplest of which is to set the environment variable PYTHONPATH to the name of the directory that contains ExpressionMatrix2.so.
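
As an alternative to setting PYTHONPATH in the shell, the module search path can also be extended from inside the script, before the import. The following is a minimal sketch using only the standard library; the helper name add_module_dir and the path in the usage comment are hypothetical and are not part of ExpressionMatrix2.

```python
import os
import sys


def add_module_dir(directory):
    """Prepend directory to the module search path if it is not already there.

    Returns the absolute path that was added (or was already present).
    """
    path = os.path.abspath(directory)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# Hypothetical usage: replace the path with the directory that contains
# ExpressionMatrix2.so, then import as usual.
# add_module_dir('/path/to/directory/containing/so')
# from ExpressionMatrix2 import *
```

Setting PYTHONPATH keeps the script free of machine-specific paths, so this approach is mainly useful when the environment cannot easily be changed.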

# Create a new, empty expression matrix.
# The data directory must not exist.
e = ExpressionMatrix(
    directoryName = 'data', 
    geneCapacity = 100000,
    cellCapacity = 10000,
    cellMetaDataNameCapacity = 10000,
    cellMetaDataValueCapacity = 1000000
    )

This creates the new ExpressionMatrix object which, at this point, is empty (that is, it does not contain any genes or cells). The specified directory must not exist; it is created and used to store all subsequent data structures needed for this ExpressionMatrix object.
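
Because the directory must not already exist, a script that may be run more than once can check for it before calling the constructor. The following is a minimal sketch using only the standard library; the helper name ensure_absent is hypothetical, and how the constructor itself reacts to an existing directory is not covered here.

```python
import os


def ensure_absent(directory_name):
    """Raise FileExistsError if directory_name already exists on disk."""
    if os.path.exists(directory_name):
        raise FileExistsError(
            "'%s' already exists; remove it or choose another name "
            "before creating the ExpressionMatrix." % directory_name)


# Hypothetical usage, just before the constructor call shown above:
# ensure_absent('data')
```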

The four capacity arguments control the capacity of various hash tables used to store gene names, cell names, and cell meta data names and values. To avoid performance degradation in the hash tables, make sure to set each capacity to at least twice what you think you will need. There is currently no automatic rehashing of the tables, so if one of the capacities is exceeded the run will have to be restarted from scratch with larger capacities. See here for reference information on the ExpressionMatrix constructors.
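
The factor-of-two rule can be applied mechanically when the expected counts are known in advance. The following is a minimal sketch; the helper name suggested_capacity and the example counts are hypothetical, and the safety factor of 2 is simply the lower bound recommended in the text above.

```python
import math


def suggested_capacity(expected_count, safety_factor=2.0):
    """Return a capacity of at least safety_factor times expected_count."""
    if expected_count < 0:
        raise ValueError('expected_count must be non-negative')
    return math.ceil(expected_count * safety_factor)


# Hypothetical example: about 25000 genes expected in the input.
gene_capacity = suggested_capacity(25000)
print(gene_capacity)  # 50000
```

Since an exceeded capacity forces a restart from scratch, rounding the result up further is a reasonable choice when the expected counts are uncertain.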

# Add the cells.
e.addCells(
    expressionCountsFileName = 'GBM_raw_gene_counts.csv',
    expressionCountsFileSeparators = ' ', 
    cellMetaDataFileName = 'GBM_metadata.csv',
    cellMetaDataFileSeparators = ' ' 
    )

This causes the expression data and the cell meta data contained in the input files to be stored in the ExpressionMatrix object (that is, in binary files in the data directory). See here for more information on the addCells call.

print('Input completed.')

Self-explanatory.