This document provides example recipes on how to carry out particular tasks using the SeqAn functionalities in C++. Please note that these recipes are not ordered. You can use the links in the table of contents or the search function of your browser to navigate them.
It will take some time, but we hope to expand this document into containing numerous great examples. If you have suggestions for how to improve the Cookbook and/or examples you would like included, please feel free to contact us.
Read sequence files
int main ()
{
for (auto & [seq, id, qual] : file_in)
{
}
return 0;
}
Provides seqan3::debug_stream and related types.
@ seq
The "sequence", usually a range of nucleotides or amino acids.
@ qual
The qualities, usually in Phred score notation.
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:42
T temp_directory_path(T... args)
Write a custom validator
This recipe implements a validator that checks if a numeric argument is an integral square (i.e. 0, 1, 4, 9...). Invalid values throw a seqan3::validation_error.
struct custom_validator
{
using option_value_type = double;
void operator() (option_value_type const & val) const
{
{
}
}
{
return "Value must be the square of an integral number.";
}
};
Argument parser exception thrown when an argument could not be casted to the according type.
Definition: exceptions.hpp:141
Construction and assignment of alphabet symbols
using seqan3::operator""_dna4;
int main ()
{
Meta-header for the alphabet module.
constexpr derived_type & assign_char(char_type const c) noexcept
Assign from a character, implicitly converts invalid characters.
Definition: alphabet_base.hpp:159
constexpr derived_type & assign_rank(rank_type const c) noexcept
Assign from a numeric value.
Definition: alphabet_base.hpp:186
The four letter DNA alphabet of A,C,G,T.
Definition: dna4.hpp:51
constexpr rank_type to_rank() const noexcept
Return the letter's numeric value (rank in the alphabet).
Definition: alphabet_base.hpp:133
decltype(seqan3::to_rank(std::declval< semi_alphabet_type >())) alphabet_rank_t
The rank_type of the semi-alphabet; defined as the return type of seqan3::to_rank.
Definition: concept.hpp:155
Reverse complement and the six-frame translation of a string using views
This recipe creates a small program that
- reads a string from the command line (first argument to the program)
- "converts" the string to a range of seqan3::dna5 (Bonus: throws an exception if loss of information occurs)
- prints the string and it's reverse complement
- prints the six-frame translation of the string
int main(int argc, char** argv)
{
myparser.add_positional_option(s, "Please specify the DNA string.");
try
{
myparser.parse();
}
{
return 0;
}
auto s_as_dna = s | seqan3::views::char_to<seqan3::dna5>;
}
Meta-Header for the argument parser module.
Argument parser exception that is thrown whenever there is an error while parsing the command line ar...
Definition: exceptions.hpp:49
The SeqAn command line parser.
Definition: argument_parser.hpp:154
constexpr auto translate
A view that translates nucleotide into aminoacid alphabet with 1, 2, 3 or 6 frames.
Definition: translate.hpp:798
auto const complement
A view that converts a range of nucleotides to their complement.
Definition: complement.hpp:69
Meta-header for the views submodule .
Adaptations of concepts from the Ranges TS.
Reading records
After construction, you can now read the sequence records. Our file object behaves like a range so you can use a range-based for loop to conveniently iterate over the file:
auto input = R"(> TEST1
ACGT
> Test2
AGGCTGA
> Test3
GGAGTATAATATATATATATATAT)";
int main()
{
for (auto & rec : fin)
{
}
}
- Attention
- An input file is a single input range, which means you can only iterate over it once!
- Note
- It is important to write
auto &
and not just auto
, otherwise you will copy the record on every iteration.
You can also use structured binding, i.e. for (auto & [seq, id, qual] : fin)
But beware: with structured bindings you do need to get the order of elements correct!
You can also read a file in chunks:
Reading records in chunks
for (auto && records : fin | ranges::views::chunk(10))
{
}
The example above will iterate over the file by reading 10 records at a time. If no 10 records are available anymore, it will just print the remaining records.
Applying a filter to a file
On some occasions you are only interested in sequence records that fulfill a certain criterion, e.g. having a minimum sequence length or a minimum average quality.
This recipe can be used to filter the sequences in your file by a custom criterion.
auto minimum_quality_filter = std::views::filter([] (auto const & rec)
{
});
for (auto & rec : fin | minimum_quality_filter)
{
}
decltype(detail::transform< trait_t >(list_t{})) transform
Apply a transformation trait to every type in the list and return a seqan3::type_list of the results.
Definition: traits.hpp:434
constexpr size_t size
The size of a type pack.
Definition: traits.hpp:150
Reading paired-end reads
In modern Next Generation Sequencing experiments you often have paired-end read data which is split into two files. The read pairs are identified by their identical name/id and position in the two files.
This recipe can be used to handle one pair of reads at a time.
for (auto && [rec1, rec2] : seqan3::views::zip(fin1, fin2))
{
if (rec1.id() != rec2.id())
}
Storing records in a std::vector
This recipe creates a small program that reads in a FASTA file and stores all the records in a std::vector.
int main()
{
using record_type = decltype(fin)::record_type;
std::ranges::copy(fin, std::cpp20::back_inserter(records));
}
This header includes C++17 filesystem support and imports it into namespace std::filesystem (independ...
Meta-include for the sequence IO submodule.
Note that you can move the record out of the file if you want to store it somewhere without copying.
auto const move
A view that turns lvalue-references into rvalue-references.
Definition: move.hpp:70
Writing records
The easiest way to write to a sequence file is to use the seqan3::sequence_file_output::push_back() or seqan3::sequence_file_output::emplace_back() member functions. These work similarly to how they work on a std::vector.
int main()
{
using seqan3::operator""_dna5;
for (int i = 0; i < 5; ++i)
{
seqan3::dna5_vector
seq{
"ACGT"_dna5};
fout.emplace_back(seq, id);
}
}
A class for writing sequence files, e.g. FASTA, FASTQ ...
Definition: output.hpp:170
Provides seqan3::dna5, container aliases and string literals.
Provides seqan3::sequence_file_output and corresponding traits classes.
File conversion
Define a custom scoring scheme
Provides seqan3::aminoacid_scoring_scheme.
Provides seqan3::nucleotide_scoring_scheme.
using seqan3::operator""_dna4;
using seqan3::operator""_aa27;
auto sc_nc = nc_scheme.score('A'_dna4, 'C'_dna4);
auto sc_aa = aa_scheme.score('M'_aa27, 'K'_aa27);
A data structure for managing and computing the score of two amino acids.
Definition: aminoacid_scoring_scheme.hpp:75
constexpr void set_similarity_matrix(aminoacid_similarity_matrix const matrix_id)
Set the similarity matrix scheme (e.g. BLOSUM62).
Definition: aminoacid_scoring_scheme.hpp:122
A data structure for managing and computing the score of two nucleotides.
Definition: nucleotide_scoring_scheme.hpp:38
@ BLOSUM30
The BLOSUM30 matrix for very distantly related proteins.
A strong type of underlying type score_type that represents the score of two matching characters.
Definition: scoring_scheme_base.hpp:41
A strong type of underlying type score_type that represents the score two different characters.
Definition: scoring_scheme_base.hpp:66
- Attention
- SeqAn's alignment algorithm computes the maximal similarity score, thus the match score must be set to a positive value and the scores for mismatch and gap must be negative in order to maximize over the matching letters.
Calculate edit distance for a set of sequences
This recipe can be used to calculate the edit distance for all six pairwise combinations. Here we only allow at most 7 errors and filter all alignments with 6 or less errors.
using seqan3::operator""_dna4;
int main()
{
"ACGAAGACCGAT"_dna4,
"ACGTGACTGACT"_dna4,
"AGGTACGAGCGACACT"_dna4};
auto filter_v = std::views::filter([](auto && res) { return res.score() >= -6;});
{
}
}
Provides pairwise alignment function.
Sets the global alignment method.
Definition: align_config_method.hpp:107
Sets the minimal score (maximal errors) allowed during an distance computation e.g....
Definition: align_config_min_score.hpp:37
Configures the alignment result to output the score.
Definition: align_config_output.hpp:43
Provides seqan3::dna4, container aliases and string literals.
constexpr configuration edit_scheme
Shortcut for edit distance configuration.
Definition: align_config_edit.hpp:51
constexpr auto align_pairwise(sequence_t &&seq, alignment_config_t const &config)
Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.
Definition: align_pairwise.hpp:138
constexpr auto persist
A view adaptor that wraps rvalue references of non-views.
Definition: persist.hpp:233
constexpr auto pairwise_combine
A view adaptor that generates all pairwise combinations of the elements of the underlying range.
Definition: pairwise_combine.hpp:709
Provides seqan3::views::pairwise_combine.
Searching for matches
This recipe can be used to search for all occurrences of a substring and print the number of hits and the positions in an ascending ordering.
using seqan3::operator""_dna4;
void run_text_single()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
<< "The following hits were found:\n";
for (
auto && result :
search(
"GCT"_dna4, index))
}
void run_text_collection()
{
"ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
"GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
<< "The following hits were found:\n";
for (
auto && result :
search(
"GCT"_dna4, index))
}
int main()
{
run_text_single();
run_text_collection();
}
The SeqAn FM Index.
Definition: fm_index.hpp:194
Provides the unidirectional seqan3::fm_index.
auto search(queries_t &&queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration)
Search a query or a range of queries in an index.
Definition: search.hpp:108
Provides the public interface for search algorithms.
If you want to allow errors in your query, you need to configure the approximate search with the following search configuration objects:
To search for either 1 insertion or 1 deletion you can use the seqan3::search_cfg::error_count:
std::string text{
"Garfield the fat cat without a hat."};
Collection of elements to configure an algorithm.
Definition: configuration.hpp:46
Configuration element that represents the number or rate of deletion errors.
Definition: max_error.hpp:169
Configuration element that represents the number or rate of insertion errors.
Definition: max_error.hpp:124
Configuration element that represents the number or rate of substitution errors.
Definition: max_error.hpp:80
Configuration element that represents the number or rate of total errors.
Definition: max_error.hpp:36
A strong type of underlying type uint8_t that represents the number of errors.
Definition: max_error_common.hpp:31
Reading the CIGAR information into an actual alignment
In SeqAn, the conversion from a CIGAR string to an alignment (two aligned_sequences) is done automatically for you. You can access it by querying seqan3::field::alignment from the record:
for (auto & [ id, alignment ] : fin)
{
}
A class template that holds a choice of seqan3::field.
Definition: record.hpp:163
Combining sequence and alignment files
This recipe can be used to:
- Read in a FASTA file with the reference and a SAM file with the alignment
- Filter the alignment records and only take those with a mapping quality >= 30.
- For the resulting alignments, print which read was mapped against with reference id and the number of seqan3::gap's involved in the alignment (either in aligned reference or in read sequence).
int main()
{
for (auto && record : reference_file)
{
ref_seqs.push_back(
std::move(record.sequence()));
}
#if !SEQAN3_WORKAROUND_GCC_93983
auto mapq_filter = std::views::filter([] (auto & rec) { return rec.mapping_quality() >= 30; });
#endif
#if SEQAN3_WORKAROUND_GCC_93983
for (auto & [id, ref_id, mapq, alignment] : mapping_file )
#else
for (auto & [id, ref_id, mapq, alignment] : mapping_file | mapq_filter)
#endif
{
size_t sum_ref{};
for (auto const & char_ref : get<0>(alignment))
++sum_ref;
size_t sum_read{};
for (auto const & char_read : get<1>(alignment))
++sum_read;
<< sum_read << " gaps in the read sequence and "
<< sum_ref << " gaps in the reference sequence.\n";
}
}
The alphabet of a gap character '-'.
Definition: gap.hpp:37
@ alignment
The (pairwise) alignment stored in an object that models seqan3::detail::pairwise_alignment.
@ mapq
The mapping quality of the seqan3::field::seq alignment, usually a Phred-scaled score.
@ ref_id
The identifier of the (reference) sequence that seqan3::field::seq was aligned to.
@ id
The identifier, usually a string.
constexpr auto const & get(configuration< configs_t... > const &config) noexcept
This is an overloaded member function, provided for convenience. It differs from the above function o...
Definition: configuration.hpp:434
Provides the seqan3::record template and the seqan3::field enum.
Map reads ans write output to SAM file
For a full recipe on creating your own readmapper, see the very end of the tutorial Implementing your own read mapper with SeqAn.
reference_storage_t & storage,
uint8_t const errors)
{
{
cereal::BinaryInputArchive iarchive{is};
iarchive(index);
}
for (auto && record : query_file_in)
{
auto & query = record.sequence();
for (
auto && result :
search(query, index, search_config))
{
size_t start = result.reference_begin_position() ? result.reference_begin_position() - 1 : 0;
std::span text_view{
std::data(storage.seqs[result.reference_id()]) + start, query.size() + 1};
{
sam_out.emplace_back(query,
record.id(),
storage.ids[result.reference_id()],
ref_offset,
aligned_seq,
record.base_qualities(),
map_qual);
}
}
}
}
Configures the alignment result to output the alignment.
Definition: align_config_output.hpp:171
Configures the alignment result to output the begin positions.
Definition: align_config_output.hpp:131
The SeqAn Bidirectional FM Index.
Definition: bi_fm_index.hpp:65
A class for writing alignment files, e.g. SAM, BAL, BLAST, ...
Definition: output.hpp:164
Configuration element to receive all hits with the lowest number of errors within the error bounds.
Definition: hit.hpp:57
@ ref_offset
Sequence (seqan3::field::ref_seq) relative start position (0-based), unsigned value.
A strong type representing free_end_gaps_sequence1_leading of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:66
A strong type representing free_end_gaps_sequence1_trailing of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:86
A strong type representing free_end_gaps_sequence2_leading of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:76
A strong type representing free_end_gaps_sequence2_trailing of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:96
Constructing a basic argument parser
{
}
struct cmd_arguments
{
};
{
parser.
add_option(args.reference_path,
'r',
"reference",
"The path to the reference.",
parser.
add_option(args.index_path,
'o',
"output",
"The output index file path.",
}
int main(int argc, char const ** argv)
{
cmd_arguments args{};
initialise_argument_parser(parser, args);
try
{
parser.parse();
}
{
return -1;
}
run_program(args.reference_path, args.index_path);
return 0;
}
argument_parser_meta_data info
Aggregates all parser related meta data (see seqan3::argument_parser_meta_data struct).
Definition: argument_parser.hpp:626
void add_option(option_type &value, char const short_id, std::string const &long_id, std::string const &desc, option_spec const spec=option_spec::standard, validator_type option_validator=validator_type{})
Adds an option to the seqan3::argument_parser.
Definition: argument_parser.hpp:267
A validator that checks if a given path is a valid output file.
Definition: validators.hpp:634
@ standard
The default were no checking or special displaying is happening.
Definition: auxiliary.hpp:234
@ required
Definition: auxiliary.hpp:235
Constructing a subcommand argument parser
struct pull_arguments
{
bool progress{false};
};
{
pull_arguments args{};
try
{
}
{
return -1;
}
seqan3::debug_stream <<
"Git pull with repository " << args.repository <<
" and branch " << args.branch <<
'\n';
return 0;
}
struct push_arguments
{
bool push_all{false};
};
{
push_arguments args{};
parser.
add_positional_option(args.branches,
"The branch names to push (if none are given, push current).");
try
{
}
{
return -1;
}
seqan3::debug_stream <<
"Git push with repository " << args.repository <<
" and branches " << args.branches <<
'\n';
return 0;
}
int main(int argc, char const ** argv)
{
argc,
argv,
{"push", "pull"}};
top_level_parser.info.description.push_back("You can push or pull from a remote repository.");
top_level_parser.add_flag(flag, 'f', "flag", "some flag");
try
{
top_level_parser.parse();
}
{
return -1;
}
return run_git_pull(sub_parser);
return run_git_push(sub_parser);
else
return 0;
}
void add_positional_option(option_type &value, std::string const &desc, validator_type option_validator=validator_type{})
Adds a positional option to the seqan3::argument_parser.
Definition: argument_parser.hpp:330
void parse()
Initiates the actual command line parsing.
Definition: argument_parser.hpp:417
argument_parser & get_sub_parser()
Returns a reference to the sub-parser instance if subcommand parsing was enabled.
Definition: argument_parser.hpp:445
@ flag
The alignment flag (bit information), uint16_t value.
@ on
Automatic update notifications should be enabled.
Serialise data structures with cereal
#include <cereal/archives/binary.hpp>
#include <cereal/types/vector.hpp>
#include <seqan3/test/tmp_filename.hpp>
{
cereal::BinaryInputArchive archive(is);
archive(data);
}
{
cereal::BinaryOutputArchive archive(os);
archive(data);
}
int main()
{
seqan3::test::tmp_filename tmp_file{"data.out"};
store(vec, tmp_file);
return 0;
}
constexpr simd_t load(void const *mem_addr)
Load simd_t size bits of integral data from memory.
Definition: algorithm.hpp:316
A custom dna4 alphabet that converts all unknown characters to <tt>A</tt>
When assigning from char
or converting from a larger nucleotide alphabet to a smaller one, loss of information can occur since obviously some bases are not available. When converting to seqan3::dna5 or seqan3::rna5, non-canonical bases (letters other than A, C, G, T, U) are converted to ‘'N’to preserve ambiguity at that position. For seqan3::dna4 and seqan3::rna4 there is no letter
'N'to represent ambiguity, so the conversion from
char` for IUPAC characters tries to choose the best fitting alternative (see seqan3::dna4 for more details).
If you would like to always convert unknown characters to A
instead, you can create your own alphabet with a respective char conversion table very easily like this:
{
public:
using nucleotide_base<my_dna4, 4>::nucleotide_base;
private:
static constexpr char_type rank_to_char[
alphabet_size] {
'A',
'C',
'G',
'T'};
{
[] () constexpr
{
conversion_table['C'] = conversion_table['c'] = 1;
conversion_table['G'] = conversion_table['g'] = 2;
conversion_table['T'] = conversion_table['t'] = 3;
conversion_table['U'] = conversion_table['T'];
conversion_table['u'] = conversion_table['t'];
return conversion_table;
}()
};
friend nucleotide_base<my_dna4, 4>;
friend nucleotide_base<my_dna4, 4>::base_t;
};
constexpr my_dna4 operator""_my_dna4(char const c) noexcept
{
return my_dna4{}.assign_char(c);
}
{
'T'_my_dna4,
'G'_my_dna4,
'C'_my_dna4,
'A'_my_dna4
};
int main()
{
my_dna4 my_letter{'C'_my_dna4};
my_letter.assign_char('S');
}
static constexpr detail::min_viable_uint_t< size > alphabet_size
The size of the alphabet, i.e. the number of different values it can take.
Definition: alphabet_base.hpp:198
A CRTP-base that refines seqan3::alphabet_base and is used by the nucleotides.
Definition: nucleotide_base.hpp:41
Provides seqan3::nucleotide_base.
If you are interested in custom alphabets, also take a look at our tutorial How to write your own alphabet.