Assembly configurations
Shasta provides a number of command line options that can be used to set computational parameters and thresholds for assemblies. All of these options have default values, but the default values are not necessarily optimal for any particular combination of a number of factors:
- The technology used to generate the reads. Technologies currently available to generate the long reads supported by Shasta are Oxford Nanopore (ONT) and Pacific Biosciences (HiFi and others).
- The amount of coverage available (average number of reads overlapping each genome region).
- The characteristics of the genome being sequenced, including heterozygosity, ploidy, and repeat content.
Shasta provides a set of configuration files in shasta/conf.
The applicability of each file is described in comments
embedded in that file.
The Shasta command line option --config
specifies the configuration to use, as described
in detail below. This option is mandatory
when running an assembly.
If an option is specified both in a configuration
and explicitly on the command line, the value
on the command line takes precedence.
This allows you to use a configuration as a
set of defaults, while still overriding some of its
options as desired.
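The precedence rule can be pictured as a simple merge in which command line values overwrite configuration values. The following is a minimal sketch of that idea, not Shasta's actual implementation; the function name `effective_options` and the option values shown are hypothetical and chosen only for illustration.

```python
def effective_options(config_options, command_line_options):
    """Return the options in effect when both sources are given.

    Values from the command line replace values from the configuration;
    options present only in the configuration are kept as defaults.
    """
    merged = dict(config_options)          # start from the configuration
    merged.update(command_line_options)    # command line takes precedence
    return merged


# Hypothetical values, for illustration only.
from_configuration = {
    "MarkerGraph.minCoverage": "0",
    "Assembly.storeCoverageData": "False",
}
from_command_line = {"MarkerGraph.minCoverage": "6"}

print(effective_options(from_configuration, from_command_line))
```

Here the overridden option takes the command line value, while the option that was not overridden keeps the value from the configuration.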
In addition to configuration files, Shasta also provides
a set of built-in configurations that are compiled
in the Shasta executable. These built-in configurations
can be used without the need for a configuration file.
Each built-in configuration has a corresponding configuration
file with the same name in shasta/conf, with
the extension .conf.
For example, configuration Nanopore-May2022
can be specified in one of two ways:
shasta --config Nanopore-May2022
or
shasta --config .../shasta/conf/Nanopore-May2022.conf
When using the second form, the file must be available, and the
...
should be replaced depending on the
location of the shasta
directory.
To obtain a list of available built-in configurations,
use Shasta command listConfigurations
as follows:
shasta --command listConfigurations
At the time of writing (May 2022), this outputs the following list of built-in configurations:
Nanopore-Dec2019
Nanopore-UL-Dec2019
Nanopore-Jun2020
Nanopore-UL-Jun2020
Nanopore-Sep2020
Nanopore-UL-Sep2020
Nanopore-UL-iterative-Sep2020
Nanopore-OldGuppy-Sep2020
Nanopore-Plants-Apr2021
Nanopore-Oct2021
Nanopore-UL-Oct2021
HiFi-Oct2021
Nanopore-UL-Jan2022
Nanopore-Phased-Jan2022
Nanopore-UL-Phased-Jan2022
Nanopore-May2022
Nanopore-Phased-May2022
Nanopore-UL-May2022
Nanopore-UL-Phased-May2022
Nanopore-Human-SingleFlowcell-May2022
Nanopore-Human-SingleFlowcell-Phased-May2022
The following table summarizes configurations recommended at the time of writing (May 2022) under the following conditions:
- Human assemblies.
- Oxford Nanopore reads.
- Guppy 5.0.7 with "super" accuracy.
Read type | Coverage | Haploid assembly | Phased assembly |
---|---|---|---|
Standard reads | 40x to 80x | Nanopore-May2022 | Nanopore-Phased-May2022 |
Ultra-Long (UL) reads (N50 ≳ 60 Kb) | 40x to 80x | Nanopore-UL-May2022 | Nanopore-UL-Phased-May2022 |
Standard reads | Human genome with a single flowcell (low coverage, around 30x) | Nanopore-Human-SingleFlowcell-May2022 | Nanopore-Human-SingleFlowcell-Phased-May2022 |
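The table above can also be written as a small lookup, which may be convenient when scripting assemblies. This is only a sketch: the function `recommended_configuration` and the read type keys `"standard"`, `"UL"`, and `"single-flowcell"` are hypothetical names introduced here, and the mapping reflects the May 2022 recommendations for human ONT reads only.

```python
# (read type, phased?) -> built-in configuration name, copied from the table.
RECOMMENDED = {
    ("standard", False): "Nanopore-May2022",
    ("standard", True): "Nanopore-Phased-May2022",
    ("UL", False): "Nanopore-UL-May2022",
    ("UL", True): "Nanopore-UL-Phased-May2022",
    ("single-flowcell", False): "Nanopore-Human-SingleFlowcell-May2022",
    ("single-flowcell", True): "Nanopore-Human-SingleFlowcell-Phased-May2022",
}


def recommended_configuration(read_type, phased=False):
    """Return the configuration name to pass to --config.

    read_type is one of the hypothetical keys "standard", "UL",
    or "single-flowcell"; phased selects the phased assembly column.
    """
    try:
        return RECOMMENDED[(read_type, phased)]
    except KeyError:
        raise ValueError(f"No recommendation for read type {read_type!r}") from None


print(recommended_configuration("UL", phased=True))  # Nanopore-UL-Phased-May2022
```

The returned name can be passed to --config as is, because each entry is a built-in configuration.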
To get details of a specific built-in configuration,
use Shasta command listConfiguration
as follows,
specifying the built-in configuration of interest after --config:
shasta --command listConfiguration --config Nanopore-May2022
This output includes comments that describe the applicability of the selected configuration. Details of the configuration are written out in the configuration file format defined below. This allows you to create your own configuration file using a built-in configuration as a starting point.
Shasta command line option --config
must be used
to specify the configuration to be used for an assembly.
The option must specify either a built-in configuration
or a path to a configuration file.
Configuration file
Some options are only allowed on the command line, but most of them can also optionally be specified using a configuration file. Values specified on the command line take precedence over values specified in the configuration file. This makes it easy to override specific values in a configuration file.
Options that can be specified both on the command line
and in a configuration file are of the form
--SectionName.optionName. The format of the configuration file
is as follows:
[SectionA]
option1 = valueA1
option2 = valueA2
[SectionB]
option1 = valueB1
option2 = valueB2
The above is equivalent to using the following command line options:
--SectionA.option1 valueA1 --SectionA.option2 valueA2 --SectionB.option1 valueB1 --SectionB.option2 valueB2
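The correspondence between the two forms can be illustrated with a short script that reads configuration text and emits the equivalent command line options. This is a sketch that uses Python's standard `configparser` module as a stand-in for Shasta's own parser, which may differ in details; the function name `config_to_command_line` is hypothetical.

```python
import configparser


def config_to_command_line(text):
    """Convert configuration-file text to a list of command line arguments."""
    parser = configparser.ConfigParser(
        interpolation=None,        # treat values literally
        delimiters=("=",),         # options are written as name = value
        comment_prefixes=("#",),   # lines beginning with # are comments
    )
    parser.optionxform = str       # keep option names case-sensitive
    parser.read_string(text)

    arguments = []
    for section in parser.sections():
        for option, value in parser.items(section):
            arguments += [f"--{section}.{option}", value]
    return arguments


sample = """
[SectionA]
option1 = valueA1
option2 = valueA2

# Comment lines and blank lines are skipped.
[SectionB]
option1 = valueB1
option2 = valueB2
"""

print(" ".join(config_to_command_line(sample)))
```

Setting `optionxform` matters here because `configparser` lowercases option names by default, which would turn a name such as minCoverage into mincoverage.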
For example, the value for option MarkerGraph.minCoverage
can be specified in the [MarkerGraph]
section of the configuration file as follows:
[MarkerGraph]
minCoverage = 0
In the configuration file, blank lines and lines beginning with #
are ignored and can be used to add comments and to improve readability
of the configuration file.
Boolean switches
Some command line options are boolean switches, that is, control options that can be turned on or off rather than be given a value.
To turn on one of these switches on the command line,
just add it to the command line without any value, for
example --Assembly.storeCoverageData.
To turn it off, just omit it from the command line
(switches are turned off by default).
To turn on one of these switches in a configuration file, you can either enter it without a value
storeCoverageData =
or assign to it one of the following values:
1, true, True, yes, Yes.
To turn off one of these switches in a
configuration file, assign to it one of the following values:
0, false, False, no, No.
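The accepted spellings can be summarized as a small parsing function. This is a sketch of the rules as stated above, not Shasta's actual code; the name `parse_switch` is hypothetical, and rejecting any other spelling (for example TRUE) is an assumption made here, since the text does not say how such values are treated.

```python
# Values listed above as turning a switch on; the empty string
# corresponds to entering the switch without a value.
ON_VALUES = {"", "1", "true", "True", "yes", "Yes"}

# Values listed above as turning a switch off.
OFF_VALUES = {"0", "false", "False", "no", "No"}


def parse_switch(value):
    """Interpret the configuration-file value of a boolean switch."""
    value = value.strip()
    if value in ON_VALUES:
        return True
    if value in OFF_VALUES:
        return False
    # Assumption: any spelling not listed above is treated as an error.
    raise ValueError(f"Not a recognized boolean switch value: {value!r}")


print(parse_switch(""))      # switch entered without a value
print(parse_switch("Yes"))
print(parse_switch("no"))
```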
Boolean switches are indicated as such in the Description column in the tables below.