Case study 1 script compute1.py

This script finds pairs of cells that have similar gene expressions, with similarities computed taking all genes into account.

#!/usr/bin/python3

This line allows the script to be invoked directly from the shell as "./compute1.py", provided the file is executable. It assumes a platform that uses Python 3. On a platform that uses Python 2, invoke the script as "python compute1.py" instead.

from ExpressionMatrix2 import *

This makes the ExpressionMatrix2 code accessible from Python, without the need to prefix it with a module name. This is not necessarily a good idea, particularly for a large script, but it does simplify the code a bit.

For this to work, ExpressionMatrix2.so must be located in a directory where the Python interpreter can find it. There are several ways to do that. The simplest is to set the environment variable PYTHONPATH to the name of the directory that contains ExpressionMatrix2.so.
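One of the other ways is to extend the module search path from inside the script, before the import statement. The following is a minimal sketch of that approach using only the Python standard library. The helper names add_module_directory and is_importable, and the directory shown, are hypothetical and not part of ExpressionMatrix2.

```python
import importlib.util
import os
import sys


def add_module_directory(directory):
    """Prepend a directory to Python's module search path (hypothetical helper)."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


def is_importable(module_name):
    """Return True if the interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: ExpressionMatrix2.so lives in this (made-up) directory.
    add_module_directory("/path/to/ExpressionMatrix2/lib")
    print("ExpressionMatrix2 importable:", is_importable("ExpressionMatrix2"))
```

Setting PYTHONPATH keeps the script itself free of installation-specific paths, which is usually preferable when the script is shared.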

# Access our existing expression matrix.
e = ExpressionMatrix(directoryName = 'data')

This creates an ExpressionMatrix object using the existing binary files in the data directory. The binary files are memory mapped rather than explicitly read, so construction of the ExpressionMatrix object is very fast (almost instantaneous).
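To illustrate why memory mapping makes construction fast, here is a small sketch using Python's standard mmap module on a throwaway file. It is not the ExpressionMatrix2 implementation, and the file contents and sizes are made up for the example. Mapping a file does not read it. The operating system loads pages only when they are accessed.

```python
import mmap
import os
import tempfile


def read_slice_mapped(path, offset, length):
    """Return `length` bytes starting at `offset`, via a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; no file data is read at this point.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Only the pages backing this slice need to be loaded.
            return m[offset:offset + length]


# Create a throwaway 16 MiB file to stand in for a large binary data file.
size = 16 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * (size - 4) + b"DATA")
    path = tmp.name

# Read only the last four bytes of the file through the mapping.
tail = read_slice_mapped(path, size - 4, 4)
print(tail)
os.remove(path)
```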

See the reference documentation for information on the ExpressionMatrix constructors.

# Find pairs of similar cells (approximate computation using LSH, 
# while still looping over all pairs). 
print("Finding pairs of similar cells.")
e.findSimilarPairs4(similarPairsName = 'Lsh')

This finds pairs of similar cells using default parameters. The pairs are saved in the data directory as a named SimilarPairs object. We chose the name Lsh as a reminder that these similar pairs were found using findSimilarPairs4, which uses Locality Sensitive Hashing (LSH). See the reference documentation for information on findSimilarPairs4.

We used default parameters for findSimilarPairs4. This means that the computation is done for all cells, that similarities are computed taking all genes into account, and that only pairs with a similarity of at least 0.2 are stored. The number of similar cells stored for each cell is limited to the default value of 100. If a given cell has more than 100 cells with similarity of at least 0.2, only the 100 with the greatest similarity are stored.
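The selection rule described above (a similarity threshold plus a per-cell cap) can be sketched in a few lines of plain Python. This only illustrates the rule and is not the findSimilarPairs4 implementation. The function name select_similar and the parameter names threshold and max_count are hypothetical, as are the cell names and similarity values.

```python
import heapq


def select_similar(similarities, threshold=0.2, max_count=100):
    """Keep (cell, similarity) pairs with similarity >= threshold,
    limited to the max_count pairs with the greatest similarity."""
    candidates = [(cell, s) for cell, s in similarities.items() if s >= threshold]
    # nlargest returns the pairs sorted by decreasing similarity.
    return heapq.nlargest(max_count, candidates, key=lambda pair: pair[1])


# Made-up similarities between one cell and five other cells.
similarities = {"cellA": 0.91, "cellB": 0.15, "cellC": 0.40, "cellD": 0.20, "cellE": 0.73}

# With a cap of 3, cellB fails the threshold and cellD is dropped by the cap.
print(select_similar(similarities, threshold=0.2, max_count=3))
# [('cellA', 0.91), ('cellE', 0.73), ('cellC', 0.4)]
```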

Using the default parameters also means that 1024 LSH hyperplanes are used. With 1024 hyperplanes, the similarity computed using LSH differs from the exact similarity value (regression coefficient of expression vectors) with a standard deviation of 0.05 or less.
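The 0.05 figure is consistent with a generic random-hyperplane LSH scheme, sketched below in plain Python. This is an illustration under stated assumptions, not the ExpressionMatrix2 code: it assumes the similarity is estimated as the cosine of the angle between two vectors, with the angle inferred from the fraction of hyperplanes that separate them. All function names and test vectors are hypothetical. Under that assumption the mismatch fraction is binomial, so its standard deviation is at most 0.5/sqrt(n); multiplying by pi gives the angle, and taking the cosine cannot increase the spread.

```python
import math
import random


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the vector falls on."""
    return [sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0.0 for h in hyperplanes]


def lsh_similarity(sig_a, sig_b):
    """Estimate the cosine of the angle between two vectors from their signatures."""
    mismatches = sum(a != b for a, b in zip(sig_a, sig_b))
    angle = math.pi * mismatches / len(sig_a)
    return math.cos(angle)


def worst_case_std(bit_count):
    """Upper bound on the standard deviation of the cosine estimate."""
    # Mismatch fraction: std <= 0.5 / sqrt(n). Angle = pi * fraction.
    # The cosine changes by at most the change in angle.
    return math.pi * 0.5 / math.sqrt(bit_count)


def cosine(u, v):
    """Exact cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))


random.seed(0)
dimension, bit_count = 50, 1024
# Random hyperplanes through the origin, defined by Gaussian normal vectors.
hyperplanes = [[random.gauss(0.0, 1.0) for _ in range(dimension)] for _ in range(bit_count)]

# Two made-up, correlated test vectors.
a = [random.gauss(0.0, 1.0) for _ in range(dimension)]
b = [x + 0.5 * random.gauss(0.0, 1.0) for x in a]

exact = cosine(a, b)
estimate = lsh_similarity(lsh_signature(a, hyperplanes), lsh_signature(b, hyperplanes))
print("exact:", round(exact, 3), "LSH estimate:", round(estimate, 3))
print("bound on std for 1024 bits:", round(worst_case_std(1024), 4))
```

For 1024 bits the bound evaluates to pi/64, or about 0.049, which matches the "0.05 or less" statement. Using more hyperplanes tightens the estimate at the cost of more computation and storage per cell.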