Data structures and algorithms for the search of query sequences in a large collection of text. More...
Modules | |
Configuration | |
Data structures and utility functions for configuring search algorithm. | |
DREAM Index | |
Provides seqan3:interleaved_bloom_filter. | |
FM Index | |
Provides seqan3:fm_index and seqan3:bi_fm_index as well as respective cursors. | |
k-mer Index | |
Implementation of shapes for a k-mer Index. | |
Classes | |
struct | seqan3::detail::fm_index_validator |
Class used to validate the requirements on the input text of the fm_index. More... | |
struct | seqan3::detail::policy_max_error |
Provides the function max_error_counts if inherited by a search algorithm. More... | |
struct | seqan3::detail::policy_search_result_builder< search_configuration_t > |
Provides the function make_results if inherited by a search algorithm. More... | |
struct | seqan3::detail::search< nbr_blocks > |
Object storing information for a search (of a search scheme). More... | |
struct | seqan3::detail::search_configuration_validator |
Class used to validate the search configuration. More... | |
class | seqan3::detail::search_configurator |
Class used to update the search configuration, e.g. add defaults. More... | |
struct | seqan3::detail::search_dyn |
Object storing information for a search (of a search scheme). More... | |
class | seqan3::search_result< query_id_type, cursor_type, reference_id_type, reference_begin_position_type > |
The result class generated by the seqan3::seach algorithm. More... | |
class | seqan3::detail::search_scheme_algorithm< configuration_t, index_t, policies_t > |
The algorithm that performs a bidirectional search on a bidirectional FM index using (optimal) search schemes. More... | |
struct | seqan3::detail::search_traits< search_configuration_t > |
A collection of traits extracted from the search configuration. More... | |
class | seqan3::detail::unidirectional_search_algorithm< configuration_t, index_t, policies_t > |
The algorithm that performs a unidirectional search on an FM index using trivial backtracking. More... | |
Typedefs | |
using | seqan3::detail::search_scheme_dyn_type = std::vector< search_dyn > |
Type for storing search schemes. Number of blocks do not have to be known at compile time. | |
template<uint8_t nbr_searches, uint8_t nbr_blocks> | |
using | seqan3::detail::search_scheme_type = std::array< search< nbr_blocks >, nbr_searches > |
Type for storing search schemes. Number of blocks have to be known at compile time. | |
Functions | |
std::vector< search_dyn > | seqan3::detail::compute_ss (uint8_t const min_error, uint8_t const max_error) |
Computes a (non-optimal) search scheme. Currently the generated search scheme represents trivial backtracking. More... | |
template<typename index_t , typename configuration_t = decltype(search_cfg::default_configuration)> | |
auto | seqan3::search (char const *const queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
template<typename index_t , std::ranges::forward_range queries_t, typename configuration_t = decltype(search_cfg::default_configuration)> | |
auto | seqan3::search (queries_t &&queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration) |
Search a query or a range of queries in an index. More... | |
template<typename index_t , std::ranges::forward_range query_t, typename configuration_t = decltype(search_cfg::default_configuration)> | |
auto | seqan3::search (query_t &&query, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
template<typename index_t , typename configuration_t = decltype(search_cfg::default_configuration)> | |
auto | seqan3::search (std::initializer_list< char const *const > const &queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
template<bool abort_on_hit, typename query_t , typename delegate_t > | |
void | seqan3::detail::search_scheme_algorithm< configuration_t, index_t, policies_t >::search_algo_bi (query_t &query, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index. More... | |
template<typename search_scheme_t > | |
auto | seqan3::detail::search_scheme_block_info (search_scheme_t const &search_scheme, size_t const query_length) |
Returns for each search the cumulative length of blocks in the order of blocks in each search and the starting position of the first block in the query sequence. More... | |
template<bool abort_on_hit, typename cursor_t , typename query_t , typename search_t , typename blocks_length_t , typename delegate_t > | |
bool | seqan3::detail::search_ss (cursor_t cur, query_t &query, typename cursor_t::size_type const lb, typename cursor_t::size_type const rb, uint8_t const errors_spent, uint8_t const block_id, bool const go_right, search_t const &search, blocks_length_t const &blocks_length, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using a single search of a search schemes. More... | |
template<bool abort_on_hit, typename index_t , typename query_t , typename search_scheme_t , typename delegate_t > | |
void | seqan3::detail::search_ss (index_t const &index, query_t &query, search_param const error_left, search_scheme_t const &search_scheme, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using search schemes. More... | |
template<bool abort_on_hit, typename cursor_t , typename query_t , typename search_t , typename blocks_length_t , typename delegate_t > | |
bool | seqan3::detail::search_ss_children (cursor_t cur, query_t &query, typename cursor_t::size_type const lb, typename cursor_t::size_type const rb, uint8_t const errors_spent, uint8_t const block_id, bool const go_right, uint8_t const min_error_left_in_block, search_t const &search, blocks_length_t const &blocks_length, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using a single search of a search schemes. Sub-function for approximate search step (iterating over all children in a conceptual suffix tree). More... | |
template<bool abort_on_hit, typename cursor_t , typename query_t , typename search_t , typename blocks_length_t , typename delegate_t > | |
bool | seqan3::detail::search_ss_deletion (cursor_t cur, query_t &query, typename cursor_t::size_type const lb, typename cursor_t::size_type const rb, uint8_t const errors_spent, uint8_t const block_id, bool const go_right, search_t const &search, blocks_length_t const &blocks_length, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using a single search of a search schemes. Sub-function for deletions at the end of a block. More... | |
template<bool abort_on_hit, typename cursor_t , typename query_t , typename search_t , typename blocks_length_t , typename delegate_t > | |
bool | seqan3::detail::search_ss_exact (cursor_t cur, query_t &query, typename cursor_t::size_type const lb, typename cursor_t::size_type const rb, uint8_t const errors_spent, uint8_t const block_id, bool const go_right, search_t const &search, blocks_length_t const &blocks_length, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using a single search of a search scheme. Sub-function for searching the remaining part of the current block without any errors. More... | |
template<bool abort_on_hit, typename query_t > | |
bool | seqan3::detail::unidirectional_search_algorithm< configuration_t, index_t, policies_t >::search_trivial (typename index_t::cursor_type cur, query_t &query, typename index_t::cursor_type::size_type const query_pos, search_param const error_left, error_type const prev_error) |
Searches a query sequence in an index using trivial backtracking. More... | |
Variables | |
template<uint8_t min_error, uint8_t max_error> | |
constexpr int | seqan3::detail::optimum_search_scheme |
Search scheme that is optimal in the running time for the specified lower and upper error bound. More... | |
Data structures and algorithms for the search of query sequences in a large collection of text.
Searching is a key component in many sequence analysis tools. The search module is a powerful and easy way to search sequences in a large text or an arbitrary nested collection of texts. When it comes to searching, indices are a core component for searching large amounts of data and are used for tools such as read mappers, assemblers or protein search tools.
SeqAn currently implements only the FM index and a k-mer index is planned. The FM index works with arbitrary pattern lengths and error numbers.
Elaborate on that (space consumption for growing k, maybe a rule of thumb).
Add a description of the DREAM index here.
The Search module offers a simple unified interface for searching a query in a large indexed text. The algorithm chooses the best search method based on the provided index.
The search algorithms for FM indices implement either a trivial backtracking approach or an optimum search scheme. The latter are currently only available for searches with up to three errors using bidirectional indices. In the future we plan to improve the optimum search schemes to handle higher error counts.
A search scheme is a collection of searches, where each search s
specifies the order in which the blocks are searched (s.pi
), the lower error bounds (s.l
) and the upper error bounds (s.u
). If the number of blocks that the query sequences are split into are known at compile time, the data structure seqan3::detail::search is recommended, otherwise one has to use seqan3::detail::search_dyn. The first one implements its member variables as an an std::array
of integers, the latter as an std::vector
of integers. Search search schemes are defined similiar. They are either implemented as an std::array
of searches if the number of searches is known at compile time, or as an std::vector
if not.
Precomputed optimum search schemes are represented as an std::array
of seqan3::detail::search since both the number of searches and the number of blocks are known at compile time. Search schemes computed at run time are represented as std::vector
of seqan3::detail::search_dyn.
All optimum search schemes are disjoint, i.e. no error configuration is covered by more than one search. Thus there is no need for a filtration phase after searching to remove duplicate hits when searching with hamming distance. Allowing insertions and deletions will, however, lead to redundant reports of hits regardless of search schemes.
Reference:
Kianfar, K., Pockrandt, C., Torkamandi, B., Luo, H., & Reinert, K. (2018).
Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index. bioRxiv, 301085. https://doi.org/10.1101/301085
A k-mer index can be used to efficiently retrieve all occurrences of a certain k-mer in the text. The k-mer can be either an exact string of length k or it can contain one or more wildcards, which denote positions of arbitrary characters.
An exact k-mer is represented as seqan3::ungapped, and wildcards can be defined with seqan3::shape. Please check the respective documentation for details and examples.
|
inline |
Computes a (non-optimal) search scheme. Currently the generated search scheme represents trivial backtracking.
[in] | min_error | Minimum number of errors allowed. |
[in] | max_error | Maximum number of errors allowed. |
Constant.
Strong exception guarantee.
|
inline |
Search a query or a range of queries in an index.
index_t | Type of the index. |
queries_t | Must model std::ranges::random_access_range over the index's alphabet and std::ranges::sized_range. A range of queries must additionally model std::ranges::forward_range and std::ranges::sized_range. |
[in] | queries | A single query or a range of queries. |
[in] | index | String index to be searched. |
[in] | cfg | A configuration object specifying the search parameters (e.g. number of errors, error types, output format, etc.). |
void
if an on_hit delegate has been specified. Header File
#include <seqan3/search/search.hpp>
Each query with errors takes where is the maximum number of errors.
Strong exception guarantee if iterating the query does not change its state and if invoking a possible delegate specified in cfg
also has a strong exception guarantee; basic exception guarantee otherwise.
|
inlineprivate |
Searches a query sequence in a bidirectional index.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
delegate_t | Takes typename index_t::cursor_type as argument. |
[in] | query | Query sequence to be searched in the index. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
where is the total number of maximum errors.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inline |
Returns for each search the cumulative length of blocks in the order of blocks in each search and the starting position of the first block in the query sequence.
search_scheme_t | Is of type seqan3::detail::search_scheme_type or seqan3::detail::search_scheme_dyn_type . |
[in] | search_scheme | Search scheme that will be used for searching. |
[in] | query_length | Length of the query that will be searched in an index. |
Constant.
Strong exception guarantee.
|
inline |
Searches a query sequence in a bidirectional index using a single search of a search schemes.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
cursor_t | Must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_t | Is of type seqan3::detail::search<> or seqan3::detail::search_dyn<> . |
blocks_length_t | Is of type std::array or std::vector of unsigned integers. |
delegate_t | Takes cursor_t as argument. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched. |
[in] | lb | Left bound of the infix of query already searched (exclusive). |
[in] | rb | Right bound of the infix of query already searched (exclusive). |
[in] | errors_spent | Number of errors spent while searching the infix of query . |
[in] | block_id | Id of the block that the infix is extended to next. |
[in] | go_right | The infix will be extended to the right if the flag is set to true. |
[in] | search | Search of a search scheme to be used for searching. |
[in] | blocks_length | Cumulative block lengths of the search. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
True
if and only if abort_on_hit
is true and a hit has been found. where is the total number of errors allowed by search
.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inline |
Searches a query sequence in a bidirectional index using search schemes.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
index_t | index_t::cursor_type must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_scheme_t | Is of type seqan3::detail::search_scheme_type or seqan3::detail::search_scheme_dyn_type . |
delegate_t | Takes typename index_t::cursor_type as argument. |
[in] | index | String index built on the text that will be searched. |
[in] | query | Query sequence to be searched in the index. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | search_scheme | Search scheme to be used for searching. |
[in] | delegate | Function that is called on every hit. |
where is the total number of maximum errors.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inline |
Searches a query sequence in a bidirectional index using a single search of a search schemes. Sub-function for approximate search step (iterating over all children in a conceptual suffix tree).
abort_on_hit | If the flag is set, the search aborts on the first hit. |
cursor_t | Must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_t | Is of type seqan3::detail::search<> or seqan3::detail::search_dyn<> . |
blocks_length_t | Is of type std::array or std::vector of unsigned integers. |
delegate_t | Takes cursor_t as argument. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched. |
[in] | lb | Left bound of the infix of query already searched (exclusive). |
[in] | rb | Right bound of the infix of query already searched (exclusive). |
[in] | errors_spent | Number of errors spent while searching the infix of query . |
[in] | block_id | Id of the block that the infix is extended to next. |
[in] | go_right | The infix will be extended to the right if the flag is set to true. |
[in] | search | Search of a search scheme to be used for searching. |
[in] | blocks_length | Cumulative block lengths of the search. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
True
if and only if abort_on_hit
is true and a hit has been found. where is the total number of errors allowed by search
.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
[in] | min_error_left_in_block | Number of remaining errors that need to be spent in the current block. |
|
inline |
Searches a query sequence in a bidirectional index using a single search of a search schemes. Sub-function for deletions at the end of a block.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
cursor_t | Must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_t | Is of type seqan3::detail::search<> or seqan3::detail::search_dyn<> . |
blocks_length_t | Is of type std::array or std::vector of unsigned integers. |
delegate_t | Takes cursor_t as argument. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched. |
[in] | lb | Left bound of the infix of query already searched (exclusive). |
[in] | rb | Right bound of the infix of query already searched (exclusive). |
[in] | errors_spent | Number of errors spent while searching the infix of query . |
[in] | block_id | Id of the block that the infix is extended to next. |
[in] | go_right | The infix will be extended to the right if the flag is set to true. |
[in] | search | Search of a search scheme to be used for searching. |
[in] | blocks_length | Cumulative block lengths of the search. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
True
if and only if abort_on_hit
is true and a hit has been found. where is the total number of errors allowed by search
.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inline |
Searches a query sequence in a bidirectional index using a single search of a search scheme. Sub-function for searching the remaining part of the current block without any errors.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
cursor_t | Must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_t | Is of type seqan3::detail::search<> or seqan3::detail::search_dyn<> . |
blocks_length_t | Is of type std::array or std::vector of unsigned integers. |
delegate_t | Takes cursor_t as argument. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched. |
[in] | lb | Left bound of the infix of query already searched (exclusive). |
[in] | rb | Right bound of the infix of query already searched (exclusive). |
[in] | errors_spent | Number of errors spent while searching the infix of query . |
[in] | block_id | Id of the block that the infix is extended to next. |
[in] | go_right | The infix will be extended to the right if the flag is set to true. |
[in] | search | Search of a search scheme to be used for searching. |
[in] | blocks_length | Cumulative block lengths of the search. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
True
if and only if abort_on_hit
is true and a hit has been found. where is the total number of errors allowed by search
.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inlineprivate |
Searches a query sequence in an index using trivial backtracking.
abort_on_hit | If the flag is set, the search algorithm aborts on the first hit. |
query_t | Must model std::ranges::input_range over the index's alphabet. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched with the cursor. |
[in] | query_pos | Position in the query sequence indicating the prefix that has already been searched. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | prev_error | Previous scenario of search, i.e. error or match. |
True
if and only if abort_on_hit
is true
and a hit has been found.where is the maximum number of errors.
No-throw guarantee if invoking the delegate also guarantees no-throw.
|
inlineconstexpr |
Search scheme that is optimal in the running time for the specified lower and upper error bound.
min_error | Lower bound of errors. |
max_error | Upper bound of errors. |
Please note that the searches within each search scheme are sorted by their asymptotical run time (i.e. upper error bound string), s.t. easy to compute searches come first. This improves the run time of algorithms that abort after the first hit (e.g. search mode: best). Even though it is not guaranteed, this seems to be a good greedy approach.