In this case study I illustrate the functionality of
ExpressionMatrix2
software
on expression data from
Darmanis et al., 2017, "Single-Cell RNA-Seq Analysis of Infiltrating Neoplastic Cells at the Migrating Front of Human Glioblastoma", Cell Reports 21, 1399-1410, October 31, 2017.
Special thanks to the paper's authors for providing early access to their data.
This is a small data set with under 4,000 cells.
The ExpressionMatrix2
software is designed to comfortably run
on much larger numbers of cells (it was actually run on a data set with over 1 million cells without any downsampling),
so the compute time required to run the case study is minimal,
and all phases of the run can be performed interactively.
This page contains detailed information sufficient to run the ExpressionMatrix2
software on
these data. It contains the following sections:
Getting started
Setting up the run
Downloading input files
The input phase of the case study
Finding pairs of similar cells
A first look at the data
Limiting the gene set used to compute cell similarity
Clustering
Conclusion
Running this case study can serve as a useful introduction
to performing a run on your own data.
To begin, use these directions to make sure you have access
to the software and an environment in which it can run.
When this step is completed, your environment variable PYTHONPATH
will be set to a directory containing a working version of ExpressionMatrix2.so
appropriate to your platform.
To run this case study you will need to use ExpressionMatrix2
release 0.3.0 or later.
Make a copy of directory tests/CaseStudy1
from the
ExpressionMatrix2
release you are using.
To avoid possible errors or problems due to
incompatibilities between releases,
make sure to use the same release from which
your ExpressionMatrix2.so
originates.
In the rest of this page, I will call this copy the
CaseStudy1
directory, and assume that all commands are
run from here.
At this point, your CaseStudy1
directory
only contains some Python scripts and a README
file.
The run requires the following input files, which can be downloaded from here:
GBM_raw_gene_counts.csv
GBM_metadata.csv
CaseStudy1
directory.
These files will be accessed with read-only access.
You may want to protect them by changing their access
permissions accordingly, for example using the following command:
chmod ugo-w *.csv
The expression matrix in file GBM_raw_gene_counts.csv
consists of raw read counts for each gene,
which are integer numbers.
However, the ExpressionMatrix2
software
accepts as input expression counts which are floating point numbers,
for generality. This allows applying, if desired, pre-normalization steps
and/or non-linear transformations of the raw read counts
(for exampe, logarithmic transformations to emphasize the importance
of low expression genes).
You can begin the run using the following command:
./input.py
This causes the ExpressionMatrix2
software to read the input files and create a binary
representation of the expression matrix
in GBM_raw_gene_counts.csv
and of the cell meta data in GBM_metadata.csv
.
This will run quickly (under one minute),
and the run time will be dominated by the
time to read and parse input file GBM_raw_gene_counts.csv
containing the expression matrix.
After this, the two input files will no longer be needed.
Instead, all required information is stored in binary files in the
data
directory that was created in your
CaseStudy1
directory.
Additional commands to be run later will create
additional binary files in the data
directory that will contain various types of
data structures used by the ExpressionMatrix2
software.
These binary files provide fast sequential or random access to
the required data structures and are quite compact.
At this point the size of the data
directory
is about half the size of the input csv files, and
is dominated by the space necessary to store the expression
matrix in a compact, sparse format
(file data/CellExpressionCounts.data
).
Note that, if for any reason you want to restart the run,
you need to remove the data
directory manally before running input.py
again.
This is a safety feature to prevent inadvertent
data loss.
The scripts in the CaseStudy1
directory
assume you are using a platform that uses Python 3.
If you use Python 2, you need to use the following instead of the
above command:
python input.py
This also applies to all invocations of Python scripts for the rest of the case study.
See here for details of input.py
.
With the expression data and cell meta data now stored
in the binary formats required by the ExpressionMatrix2
software,, the next step is to find pairs of cells that have similar
gene expressions, with similarities computed taking all genes
into account. This is done using command
./compute1.pyThe computation is fast (only a few seconds) as it is done using Locality Sensitive Hashing (LSH). See here for details of
compute1.py
.
At this point we are ready to take a first look at the data
using the http server functionality of the ExpressionMatrix2
software.
runServer.py
to start the ExpressionMatrix2
in
http server mode:
./runServer.pyThe code in
runServer.py
is very simple. It simply
creates the ExpressionMatrix
object just like the
compute1.py
script did, then invokes function
explore
to start the http server.
When the server starts, it writes a message containing the port number
on which it is listening for connections - usually port 17100.
The server will continue to run until you interrupt it using Ctrl-C
or by killing the process. However, most of the time
it will be doing nothing - it will just be waiting
for http requests from browsers.
With the server running, we can use a browser to explore the data. The browser communicates with the server using the http protocol, and can run on the same machine as the server, or on a different machine, as long as the two machines can communicate over a network or via the Internet.
In the simplest mode of operations, the browser runs on the same machine as the server. Start the browser and point it to this URL:
http://localhost:17100(Make sure to change the port number from 17100 to whatever port number the server is listening on, which you found out when starting the server).
If the browser is running on a different machine, enter the same URL but replace localhost with the name or ip address of the machine where the server is running. For example, the URL could look like this:
http://server.company.com:17100or
http://192.51.100.57:17100For this to work, however, it may be necessary to change the configuration of firewall software or firmware on the server or on one of the routers in between by "opening" TCP port 17100, and preferrably the range 17100-17110 to allow some flexibility, as the server will try higher numbered ports if port 17100 is not available.
Security warning: depending on your network and firewall configuration, it may be possible for anybody with network access to the server to look at your data and manipulate them. Use appropriate security precautions to prevent unwanted access to the data. Your IT department will be able to provide security advice.
The ExpressionMatrix2
software was developed
mainly using the Firefox and Chrome browsers.
However most browsers will work, including browsers available
on cell phones and other mobile devices.
When you connect your browser to the server,
the browser window will look like this:
The page summarizes the number of cells genes in the expression matrix
we are looking at, and then displays the README file in the
CaseStudy
directory. At the top of the page are
several buttons that can be clicked to explore the data set in a variety of ways.
For example, you can display information about genes, cells and cell meta data.
You can also create and manipulate gene sets and cell sets, which are
subsets of the genes and cells in yiour expression matrix.
Initially only the AllGenes
and AllCells
sets exists,
but you are free to create additional sets in a variety of ways.
As an example of using this functionality to explore the data, click on
Cell Sets
and then, in the table of cell sets, click
on the hyperlink for cell set AllCells
.
This will display a table of all the cells in this cell set
(which of course includes all 3589 cells in our expression matrix),
and also gives you the ability to select meta data fields to be displayed.
For example, in the field Select cell metadata to display:
,
select Selection
, Location
, and Sample.name
(in most browsers
you will need to press the Ctrl key while selecting the second and third fields).
Then click on Redisplay table
to obtain a table of
cells including the meta data fields you selected:
You can then click on any cell to get detailed information about it:
This also includes gene expression for the cell, which can be seen
by scrolling down:
As another example of using the http server functionality to explore the data,
click on the Cell meta data
button on the top, then
under Histogram a meta data field
select the AllCells
cell set and select meta data name Sample.name
,
then click on Create histogram
.
We can also create a contingency table showing the
combined frequency of two meta data fields.
Click the Cell meta data
button on the top, then
under Contingency table for two meta data fields
select the AllCells
cell set and select meta data names Sample.name
and Location
,
then click on Create contingency table
.
You can familiarize yourself with the rest of the functionality to explore the data. Most of it should be self-explanatory.
We computed above pairs of similar cells
using script compute1.py
.
We can now use these pairs of similar cells to create a
cell similarity graph - an undirected graph in which
each vertex represents a cell, and two vertices
are joined by an edge if the corresponding edges have
similarity greater than a chosen similarity threshold.
To avoid regions in the graph with very high connectivity,
we use a k
parameter that causes only the
best k
neighbors to be kept for each vertex (k-Nearest
Neighbor or k-NN graph).
To create such a cell similarity graph (or simply cell graph)
you can click on button Cell graphs
at the top of the browser window.
Then, you can select graph creation parameters:
Cell graph name
: for this graph we will use all of our 3589 cells,
and our similar cell pairs were computed taking all genes into account.
Therefore, we will choose All
as the graph name.
Cell set name
:
we pick the only choice available, AllCells
, as we have not yet
created any additional cell sets.
Similar pairs name
:
we pick the only choice available, Lsh
, which is the only
set of similar cell pairs we created. As a reminder,
when doing that we took into account all genes
when computing cell similarities.
Similarity threshold
:
we will keep the default value of 0.5. Only cell pairs with similarity at least
0.5 will generate a graph edge. The similar cell pairs were computed
using a threshold of 0.2 (the default value for
similarPairs3
), so we will not lose
any edges due to insufficient information being stored in our similar cell pairs.
Maximum connectivity
:
this is the k
for the k-NN graph.
We will keep the default value of 20.
The similar cell pairs were computed using k
=100
(the default value for
similarPairs3
), so we will not lose
any edges due to insufficient information being stored in our similar cell pairs.
Create a new cell graph
.
This shows a page with a message stating the the graph had 84 isolated vertices, which were removed.
This means that there were 84 cells that no other cell with similarity at least 0.5.
These cells will not be present in the cell similarity graph.
Clicking on Continue
gets you back to the Cell graphs
page, which now shows the cell graph you just created, named All
.
You can display the graph by clicking on its name.
After a short wait to compute the graph layout using the Graphviz package,
the graph is displayed:
Hide graph edges
option and clicking on Redraw graph
.
Note that the Graphviz package is used exclusively to compute the two-dimensional layout of graphs for display. This layout is not used for any other purpose, including clustering (more information on clustering follows). It is just a visual aid.
You can experiment using the pink buttons to make the graphics larger, zoom in or out to visualize specific regions of the graph, and change the display size of vertices and the thickness of edges.
You can also color the graph in a variety of ways that should be self-explanatory. This includes coloring by expression of a specific gene, coloring by similarity to a specific cell, or coloring by categorical or numerical meta data. You can also click on a vertex to display the page for the corresponding cell. This complements the coloring options to allow you to get an idea of the characteristics of cells present in different regions of the graph.
The graph shows clumps that correspond to groups of cells with similar expression vectors.
In an overly simplified view of cell populations, each of these groups corresponds
to a "cell type". The reality, of course, is much more complex.
The ExpressionMatrix2
software provides clustering functionality
which finds these clusters automatically, and I will cover this functionality below.
The Darmanis paper used a commonly used clustering approach,
t-SNE,
to classify the cells in groups, and each cell contains two meta data fields
that describe the clustering computed by t-SNE:
Cluster_2d
contains the id of the cluster each cell was assigned to.
Cluster_2d_color
contains the color assigned to
the cluster each cell was assigned to.
These colors are the same colors used in Figure 2(A) of the Darmanis paper.
The colors are stored as strings in HTML
hex triplet
format, that, is in the form #RRGGBB, where RR, GG, BB are hexadecimal bytes
representing the red, green, and blue intensities.
To see where the t-SNE clusters are located in our cell similarity graph,
we can redisplay the graph, coloring each cell in the color given by its
Cluster_2d_color
.
To do this, in the form above the graph display check the Color by meta data field
button,
and in the selections to the right of that button
select meta data field Cluster_2d_color
and ask that the field be interpreted as a color
.
The graph display will no look like this:
We see that the clumps in the graph display correspond roughly to the t-SNE clusters.
The ExpressionMatrix2
graph was constructed
taking into account expression counts for all genes,
which means that the similarity of two cells is defined
as the regression coefficient between the expression counts
of the two cells, taking all genes into account.
In contrast, the Darmanis paper
used only 500 genes selected based on the variability of their
expression.
In the ExpressionMatrix2
software,
we can use gene sets to restrict the set of genes used.
Clicking on the Gene sets
button at the top of the browser page
leads to a page that offers a variety of ways to create a new gene set,
including by pasting a set of gene names.
This option could be used to define a gene set identical to the one
of 500 genes used in the Darmanis paper.
Another option is to use the dialog near the bottom of the page,
Create a gene set using gene information content
.
This creates a gene set based on an information content criterion
(e. g. a gene that is equally expressed in all cells has zero information content,
a gene expressed only in one of N cells has information content log2N bits).
We can select AllGenes
as the starting gene set,
AllCells
as the cell set used to compute information content,
request an information content of at least two bits,
and name the resulting gene set HighInformationGenes
.
The same operation could also be performed in Python code
using function
createGeneSetUsingInformationContent.
The resulting gene set has 22442 genes out of the 23368 total.
You can use the Gene sets
page to look at the new gene set in detail.
To see what genes were removed, you could create a new gene set
consisting of the set difference betwene gene sets AllGenes
and HighInformationGenes
.
You can also show the computed information content of each gene by using the dialog under
Show gene information content
near the bottom of the Gene sets
page.
Now that we have a new gene set, we can redo the computation of the
pairs of similar cells, this time taking into account only the
gene set we just created.
This is done by compute2.py
, which is very similar to
compute1.py
, except that it requests using the gene set we just created.
To run it, make sure ot first interrupt the http server (using Ctrl+C),
then run command
./compute2.py
Like compute1.py
, this takes just a few seconds for this small data set,
after which you can restart the http server:
./runServer.py
In the browser, click on the Cell graphs
button.
You will notice that cell graph All
we previously created
disappeared. This happens because graphs (contrary to gene sets
and cell sets) are currently not persistent.
In addition to recreating it, you can also create a new graph,
which you can name
AllHighInfo
,
which still uses cell set AllCells
,
but now uses the similar cell pairs LshHighInfo
just computed by compute2.py
.
With coloring by t-SNE color as before and displayed with edges hidden, the graph looks like this:
The clumps are generally more separated. However the table shows that 1596 vertices (cells) were isolated and were removed. This means that for each of the 1596 removed cells there was no other cell with similarity greater than 0.5, the default similarity threshold used to construct the graph. This means that almost half of the cells were removed. This is not surprising, because we discarded genes that were expressed in many cells, and were therefore generally increasing the similarity between cells.
To obtain a more informative graph,
we can create a new graph
in the same way, but choosing a similarity threshold of 0.2 instead of 0.5.
Call this new graph AllHighInfo2
.
Now the number of isolated cells that were removed decreases to 22.
With coloring by t-SNE color as before, the new graph looks like this:
Clearly, it can be useful to experiment with the graph creation parameters. Different choices can reveal different aspects of the data set.
The http server offers various means to investigate the details of this graph - for example, what type of cells are present in each of the clumps. To do this, you can use one or more of the following:
HighInformationGenes
), then redisplay the graph, coloring
by expression of those genes.
Selection
,
Location
, and
Sample.name
,
is particularly informative.
For example, hovering the mouse in the red region on the right
of the display shows that one of the cells there is cell 2235.
Clicking on it shows the page for cell 2235, which reveals that the
most expressed gene is CCL3
.
Here is the display we get if we color the graph by similarity to cell 2235:
And here is graph colored by expression of gene CCL3
:
This functionality can be used to explore gene expressions in the various "clumps"
visible in the graph.
The ExpressionMatrix2
software also provides clustering functionality
which can be used to automatically find clumps or clusters in a cell similarity graph.
It uses the
Label Propagation Algorithm to assign vertices to clusters.
Additional clustering algorithms may also be added in the future.
After clusters are computed, ExpressionMatrix2
also computes
similarities between the average expression vectors of each cluster,
and creates a Cluster Graph in which each vertex represents a cluster,
and two vertices are joined by an undirected edge if the similarity
between then average expression vectors of the corresponding clusters exceeds a
given threshold.
To compute the clustering, click on the Clustering
button near the top of the browser window,
then click on "Run clustering and create a new cluster graph".
A table appears, requesting various parameters for the clustering computation.
Here, specify AllHighInfo2
as the cell graph.
Leave all parameters at their default values except for setting
the minimum number of cells for each cluster to 50
and the similarity threshold for graph edges to 0.2
(the same value we used when creating the cell similarity graph).
Use the same name AllHighInfo2
as the name of the cluster graph to be created (cell graphs and cluster graphs
use different name spaces, so it's ok to use the same name).
Finally, click on Run clustering
.
A page showing log output for the clustering appears.
Here, click on Continue
to get back to ther Cluster graphs
page. This page will now contain a link to the newly created cluster graph
AllHighInfo2
.
Click on that link to get a display of the cluster graph:
Here, each circle corresponds to a cluster. You can hover on the circle to find out how many cells are in each cluster, and also to see the most expressed genes in each cluster. You can also get more detailed information on a cluster, including the average gene expression in the cluster and the list of cells that belong to it, by clicking on it. Each edge in the display is labeled by the similarity between the average gene expressions of the two clusters it joins.
You can also compare the gene expression of two clusters or sets of clusters.
To do this, in the page that displays the cluster graph, click on
Compare gene expression between clusters
.
In the dialog that appears, specify a "first set" consisting of cluster 0
(the cluster displayed in green) and a "second set" consisting of cluster 3
(the cluster displayed in magenta/fuchsia color).
According to the display of the cluster graph, the similarity
of the average expressions of these two clusters is 0.51.
Then click on Compare
.
You get the following display - a scatter plot of the average expression counts
in the second sets versus those in the first set.
Here, each point corresponds to a gene, and you can hover on a point to get the gene name.
This allows you to easily locate genes tha are over-expressed or under-expressed
in one of the two clusters relative to the other.
For example, the rightmost point in the display corresponds to gene CCL3
,
and from its position we can see that this gene is roughly 5 times more
expressed in the first set (cluster 0) then in the second set (cluster 3).
You can do the same comparison for arbitrary sets of clusters, not just individual clusters.
To see how the cluster correspond to clumps in the cell graph,
we can store the result of the clustering operation in a new cell meta data field.
To do this, click on the button labeled Store the cluster ids in this cluster graph
in cell meta data name
after specifying Cluster
as the name
of the cell meta data to be used.
Now go back to the display of cell graph AllHighInfo2
,
and color it using the meta data field we just created, Cluster
,
interpreted as a category.
The display of the cell graph will now look like this:
The color in this display correspond to the colors in the display of the cluster
graph, which allows you to locate each of the computed clusters in the cell graph.
You can then see that the general connectivity of the cel graph
is consistent with the graph connectivity of the cluster graph.
For example, we see from the cell graph that the green and bright purple
clusters (clusters 0 and 4) are close to each other,
and in fact the cluster graph shows a similarity 0.51 between those clusters.
You can also see that the the relative placement of clusters in the two graphs
are similar, even though the two-dimensional layouts,
which are used for display purposes only, are not necessarily consistent.
The cluster graph therefore describes in a synthethic way the clumps in the cell graph and their relative similarities. Of course, several parameters were used to construct the cell similarity graph and the cluster graph, and the actual clusters will depend on the choice of these parameters. This is to be expected, as the type and amount of desired clustering depends on what questions one wants to answer. Therefore, changing the creation parameters for the cell and cluster graphs can be used in the process of exploring and understanding the data.
Despite the above considerations, it is interesting to
compare the results of this clustering with
the clustering obtained via t-SNE in the
Darmanis paper.
To do this, click on the button labeled
oCell meta data
, then, under
Contingency table for two meta data fields
,
select cell set name AllCells
and meta data names Cluster
(the clustering we just computed)
and cluster_2d
(the t-SNE clustering from the Darmanis paper).
This displays a contingency table of the two clusterings -
that is, a table containing the number of cells
with each combination of the two clusterings.
For example, this shows that cluster 0 in our clustering (900 cells) is split into clusters
7 and 8 in the t-SNE clustering.
Conversely, t-SNE cluster 8 (1329 cells) is split between clusters 0, 3, 4, 7, and 10
in our clustering.
We see from our cluster graph display that, for example, the similarity
between the average expressions of clusters 0 and 3 is only 0.51,
so it probably makes sense to keep these are two separate clusters.
For other clusters, there is more agreement between the two clustering.
For example, t-SNE cluster 9 largely corresponds to our cluster 2,
and t-SNE cluster 2 largely corresponds to our cluster 11.
As discussed above, the relative merits of each clustering
can be dependent on the type of information we are trying to extract.
The ExpressionMatrix2
software allows quick exploration
of clustering alternatives, by making it fast and easy to generate
multiple cell similarity graphs and cluster graphs with different
creation parameters.
I used this case study to illustrate most of the functionality of
the ExpressionMatrix2
software currently available, but not all of it.
Most notably, this case study did not illustrate
the functionality to define cell sets - arbitrary subsets of
cells.
Once a cell set is defined, it can be used to create a cell
similarity graph, which will include only cells in the set.
This way you can restrict your investigation to a subset of cells of interest.
This can facilitate data interpretation by reducing distractions
due to cell populations unrelated to the topic of interest.