Running an assembly

Refer the quick start guide to download an executable or build one from the source code.

Selecting assembly options

You can invoke the Shasta executable without options or with --help to get a description of the options. A list of available options can also be found here.

The only mandatory option is --input which must be used to specify the input FASTA or FASTQ files containing reads to be used for the assembly. If there is more than one file, the names should be specified separated by white space, entering --input only once, like this:

--input a.fasta b.fasta c.fasta

The Shasta assembly process is controlled by many parameters whose values can be specified by using command line options or a configuration file. Configuration files for some common situations are provided in shasta/conf or shasta-install/conf (download a tar file from a Shasta release if you don't have access to these directories). Creating options for a new type of data or genome can require some work, experimentation, and some knowledge of Shasta computational methods. If you are unable to get a satisfactory assembly for your data, feel free to file an issue in the Shasta GitHub repository and we will help. You can also use an experimental script, GenerateConfig.py, that will help you create a starting configuration file. This script is available in shasta/scripts or shasta-install/bin (download a tar file from a Shasta release if you don't have access to these directories).

There is also another experimental script, GenerateFeedback.py, that can be used after an assembly is complete to assess some (limited) aspects of assembly quality. If you decide to file an issue on the Shasta GitHub repository, including the output of this script would be helpful.

Memory requirements

For best performance, the Shasta assembler uses a single large machine rather than a cluster of smaller machines, and operates in memory, with no access to data on disk except during initial input of the reads, the final output of the assembly, and for small output files containing summary assembly information. As a result, for optimal performance, Shasta memory requirements are higher than comparable tools that keep some or most of their data on disk.

Memory requirements for optimal performance are roughly proportional to genome size and coverage and are around 4 to 6 bytes per input base. This only counts input bases that are used in the assembly - that is, excluding reads that were discarded because they were too short or for other reasons. For a human-size genome (≈3 Gb) at coverage 60x, this works out to around 1 TB of memory.

Machines with 1 to 4 TB of memory have become available at reasonable prices in the last few years, and they are widely available on cloud computing platforms. For example, for human assemblies using Shasta, we have routinely been using AWS x1.32xlarge instances with 1952 TB of memory and 128 virtual processors (64 cores with hyperthreading). These machines are available at around $13/hour as on demand instances and they complete a typical human assembly at coverage 60x in about 4 hours, at a compute cost of around $50 per assembly, much lower than current sequencing costs to produce the input reads. Much lower prices (2x to 3x lower) are also available for reserved instances and on the AWS spot market, which means that a production facility could achieve a compute cost of around $20 per genome.

Running with less than optimal memory

Shasta also supports a mode of operation with data structures physically on disk or other storage systems, mapped to virtual memory. In this mode of operation, the operating system dynamically moves pages to disk from memory and back as needed, which can result in a huge performance penalty. For this reason, this mode of operation is not suggested, except in the following conditions:

For very small genomes.
If the amount of memory available is not much smaller than the amount required for optimal performance.
If your machine has SSD or, better, NVMe storage. HDD storage (hard disk) is not suitable because of its high latency for random access.

Under these conditions, this mode of operation can be selected using the following Shasta command line options:
--memoryMode filesystem --memoryBacking disk

See the next section for more information on these options. In addition to specifying these options, make sure the assembly directory (command line option --assemblyDirectory) is on a filesystem backed by the storage type (SSD or NVMe) you want to use. If you are using a local or departmental machine, likely that a filesystem is already mounted on that storage. If you are using an AWS instance, you can use the directions here and here to make your SSD or NVMe storage accessible.

To illustrate the effect of using less than optimal memory, the table below summarizes benchmark results for a human assembly at coverage 45x. This assembly requires about 650 GB to run in memory - see the first row in the table. The remaining rows show alternative scenarios on machines with less memory. These data show that Shasta still runs in reasonable amounts of time with less than optimal memory, as long as NVMe or SSD storage is available. Note the competitive assembly time and AWS cost of the ARM Graviton2 assembly. Some notes on the table:

The benchmarks were run on AWS EC2 - details are provided in the right section of the table.
All assemblies ran with the default number of threads, equal to the number of virtual processors.
Costs are shown as on demand AWS prices in the AWS us-west-2 region (Oregon).
For Intel machines, there is one core for each two virtual processors, while for ARM processors there is one core for every virtual processor.
A Shasta executable for ARM is currently available starting with Shasta Release 0.7.0.

Architecture	Virtual processors	Memory (GiB)	Data location	Elapsed Time (h)	AWS benchmark information
Architecture	Virtual processors	Memory (GiB)	Data location	Elapsed Time (h)	Instance type	Data location	Hourly cost ($)	Assembly cost ($)
Intel Skylake-SP	96	768	Memory (2 MB pages)	2.5	r5.24xlarge	Memory (2 MB pages)	6.05	15.12
Intel Skylake-SP	96	384	NVMe	4.5	m5d.24xlarge	Local NVMe volume	5.42	24.51
Intel Skylake-SP	96	384	SSD	8.0	m5.24xlarge	EBS gp2 volume	4.61	36.82
Intel Skylake-SP	96	384	Disk	Too slow	m5.24xlarge	EBS st1 volume
Intel Skylake-SP	48	192	NVMe	6.6	m5d.12xlarge	Local NVMe volume	2.71	18.00
ARM Graviton2	48	384	NVMe	4.1	r6gd.12xlarge	Local NVMe volume	2.76	11.39

Memory modes

(This section does not apply to macOS).

For performance, the Shasta executable operates in memory, with no access to data on disk except during initial input of the reads, final output of the assembly, and for small output files containing summary assembly information. All large memory areas are allocated using mmap calls in one of several different modes of operation described below. The choice of the optimal mode of operation is dependent on many factors and decribed below. The default mode of operation works reasonably well in most cases and does not require root privilege. However, it does not deliver the best possible performance.

The memory modes are controlled by two command line options:

--memoryMode controls whether mmap allocates anonymous memory or memory mapped to a filesystem. It can take one of the following values:
- anonymous (the default value)
- filesystem
--memoryBacking specifies the physical backing of pages allocated via mmap and can take one of the following values:
- disk: mmap uses standard 4 KB pages mapped to the existing filesystem on disk in the current directory.
- 4K (the default value): mmap uses standard 4 KB pages (anonymous or on a tmpfs filesystem, depending on the setting of --memoryBacking).
- 2M: mmap uses large 2 MB pages (anonymous or on a hugetlbfs filesystem, depending on the setting of --memoryBacking). The 2MB pages are often referred to as "huge pages".

There are a total six possible combinations of these two options, summarized in the table below.

		`--memoryMode`
		`anonymous` (default)	`filesystem`
`--memoryBacking`	`disk`	Not allowed	Memory allocated by `mmap` uses 4 KB pages on a the filesystem on disk that the run output directory is in. This mode of operation can incur severe performance degradation and is therefore not generally suggested. After the run terminates, binary data are permanently available on disk and you can use the http server or the Python API to inspect assembly results.
	`4K` (default)	The default option. Memory allocated by `mmap` uses anonymous 4 KB pages. After the run terminates, binary data are destroyed, which means you cannot use the http server or the Python API to inspect assembly results. Performance is less than optimal (typically 30% degradation on a large run).	Memory allocated by `mmap` uses 4 KB pages on a `tmpfs` filesystem which is created and mounted on the `Data` directory under the run output directory. After the run terminates, binary data are available (until the next reboot) and you can use the http server or the Python API to inspect assembly results. Performance is less than optimal. When done using the binary data, you can free the memory using the following command: `shasta --command cleanupBinaryData`.
	`2M`	Memory allocated by `mmap` uses anonymous 2 MB pages. After the run terminates, binary data are destroyed, which means you cannot use the http server or the Python API to inspect assembly results. Performance is less than optimal (typically 30% degradation on a large run).	Memory allocated by `mmap` uses 2 MB pages on a `hugetlbfs` filesystem which is created and mounted on the `Data` directory under the run output directory. After the run terminates, binary data are available (until the next reboot) and you can use the http server or the Python API to inspect assembly results. This mode of operation delivers optimal performance. When done using the binary data, you can free the memory using the following command: `shasta --command cleanupBinaryData`.

The table is color coded with the following meaning:

This combination is not allowed.

This combination requires root privilege to be acquired via sudo. Depending on sudo settings, this may fail or ask for a user password.

This combination is allowed and does not require root privilege.

In summary:

For large assemblies it is best to make sure to have root privilege and use --memoryMode filesystem --memoryBacking 2M. Remember to use shasta --command cleanupBinaryData to free up the memory when done using the binary data!
For small assemblies for which performance is not important use the default mode --memoryMode anonymous --memoryBacking 4K. However, if access to binary data is required after the assembly completes to inspect assembly results using the http server or the Python API, use --memoryMode filesystem --memoryBacking disk.

See here for interactive help to select these options.

Scripted approaches to running an assembly

The Shasta assembler provides a Python API that can be used for scripting. This makes it possible to write Python scripts to run assemblies that have more flexibility or functionality than allowed by the Shasta executable. For example, these scripts allow rerunning only a portion of an assembly, which can be useful for development of new assembly functionality.

These scripts use a Python import to import the Shasta shared library, shasta.so, which provides Python bindings to Shasta functionality. The scripts and the library are in shasta-install/bin, but they will normally not be needed during basic operation of the Shasta assembler. They are more likely to be needed for debugging, testing, or development.

The Python API also allows you to write your own assembly scripts. For such a script to work, and under the assumption that the script is not located at shasta-install/bin, it will be necessary to set environment variable PYTHONPATH to (actual path)/shasta-install/bin, so the Python interpreter can locate the Shasta shared library during import.

Input files

The Shasta assembler uses as input one or more files containing reads in one of the supported formats listed below. Shasta uses the file extension to deduce the file format, as described below.

FASTA: input files in this format must be named with a file extension .fasta, .fa, .FASTA, or .FA. Older versions of Shasta had strict restrictions on Fasta files, but these restrictions were eliminated. In particular, multiline reads are now supported. Reads containing no-called or invalid bases are also allowed but are discarded on input. The file can have either Unix-style (LF) or Windows-style (CR+LF) line ends.
FASTQ: input files in this format must be named with a file extension .fastq, .fq, .FASTQ, or .FQ. Shasta requires the stricter FASTQ format with exactly four lines per read. The third line must consist of just the "+" sign. The file must have Unix-style (LF) line ends. Windows-style (CR+LF) line ends are not supported.

Input from compressed files is not supported. if your input files are compressed, you have to decompress them first using the appropriate decompression utilities.

Any reads shorter than Reads.minReadLength bases (default 10000) present in the input files are discarded on input. Reads that contain bases with repeat counts greater than 255 are also discarded. This is a consequence of the fact that repeat counts are stored using one byte, and therefore there would be no way to store such reads. Reads with such long repeat counts are extremely rare, however, and when they occur they are of suspicious quality.

Output files

The Shasta executable creates output in a directory with a name specified by the --assemblyDirectory option (default ShastaRun). The directory is created automatically at the beginning of the run. The run stops if the directory already exists. This reduces the possibility of unwanted deletion of data.

Contents of the output directory after a successful run include the following:

Assembly.fasta: The assembly results in FASTA format. The id of each assembled segment is the same as the edge id of the corresponding edge in the assembly graph.
Assembly.gfa: The assembly results in GFA 1.0 format. This contains the same sequences in the FASTA file (as GFA segments), plus their connectivity information (as GFA links). A convenient tool to inspect and study these files is Bandage. Segment ids in the GFA file correspond to FASTA ids in the FASTA file and also to assembly graph edge ids.
Assembly-BothStrands.gfa: An alternative GFA output file for the assembly. This contains both strands of the assembly. This can be useful in some cases to clarify connectivity of assembled segments. See AssemblySummary.csv to find the id of the reverse complement of each assembled segment.
AssemblySummary.html: An html file summarizing many assembly metrics. It can be viewed using an Internet browser. It has the same content shown by the summary page of the Shasta http server (--command explore).
shasta.conf: a configuration file containing the values of all assembly parameters used. This file uses a format that can also be used as input for a subsequent Shasta run using option --config.
ReadLengthHistogram.csv: A spreadsheet file containing statistics of the read length distribution. This only includes reads that were used by the assembler. The assembler discards reads shorter than Reads.minReadLength bases (default 10000) and reads that contain bases repeated more than 255 times. The fifth field of the last line of this file contains the total number of input bases used by the assembler in this run.
Binned-ReadLengthHistogram.csv: Similar to ReadLengthHistogram.csv, but using 1 Kb bins of read lengths.
Data: A directory containing binary data that can later be used by the Shasta http server or with the Shasta Python API. This is only created if option --memoryMode filesystem was used for the run. Keep in mind that, unless you used option --memoryBacking disk, these data are in memory, not on disk, and will disappear at next reboot. If you want to save them permanently, you can use script shasta-install/bin/SaveRun.py to create a copy on disk of the binary data directory named DataOnDisk. (You can also make the copy yourself using the cp command). To free up the memory used without rebooting, you can invoke the Shasta executable again using the following options:
```
--command cleanupBinaryData --assemblyDirectory outputDirectoryName
```
Here, outputDirectoryName should specify the same value used in the assembly run (that is, the name of the directory containing the Data directory). Equivalently, you can just umount the Data directory, then remove it.

Running an assembly using Docker

It is also possible to run Shasta using Docker. This provides additional portability and flexibility, as well as the ability to incorporate Shasta into workflows and computational pipelines. See here for more information.

Dealing with errors

sh: 1: cannot create /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages: Read-only file system: Shasta needs root privilege to use the `hugetlbfs` filesystem. If you're running this in a Docker container, you need to use the Docker option --privileged.

Table of contents