Skip to main content
LexicMap: efficient sequence alignment against millions of prokaryotic genomes​
GitHub Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Back to homepage

Step 2. Searching

Table of contents

TL;DR

  1. Build a LexicMap index.

  2. Run:

    • For short queries like genes or long reads, returning top N hits.

      lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
           --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 10000
      
    • For longer queries like plasmids, returning all hits.

      lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
          --min-qcov-per-hsp 0   --min-qcov-per-genome 0  --top-n-genomes 0
      

Input

Note
Query length
LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 100 bp by default.

Input should be (gzipped) FASTA or FASTQ records from files or STDIN.

Hardware requirements

See benchmark of index building.

LexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.

  • CPU:
    • No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported.
    • More is better as LexicMap is a CPU-intensive software. It uses all CPUs by default (-j/--threads).
  • RAM
    • More RAM (> 16 GB) is preferred. The memory usage in searching is mainly related to:
      • The number and length of query sequences.
      • The number of matched genomes and sequences.
      • Similarities between query and target sequences.
      • The number of threads. It uses all CPUs by default (-j/--threads).
  • Disk
    • SSD disks are preferred to store the index size, while HDD disks are also fast enough.
    • No temporary files are generated during searching.

Algorithm

  1. Masking: Query sequence is masked by the masks of the index. In other words, each mask captures the most similar k-mer which shares the longest prefix with the mask, and stores its position and strand information.
  2. Seeding: For each mask, the captured k-mer is used to search seeds (captured k-mers in reference genomes) sharing prefixes or suffixes of at least p bases.
    1. Prefix matching
      1. Setting the search range: Since the seeded k-mers are stored in lexicographic order, the k-mer matching turns into a range query. For example, for a query CATGCT requiring matching at least 4-bp prefix is equal to extract k-mers ranging from CATGAA, CATGAC, CATGAG, …, to CATGTT.
      2. Retrieving search start point: The index file of each seed data file stores some k-mers’ offsets in the data file, and the index is loaded in RAM.
      3. Retrieving seed data: Seed k-mers are read from the file and checked one by one, and k-mers in the search range are returned, along with the k-mer information (genome batch, genome number, location, and strand).
    2. Suffix matching
      1. Reversing the query k-mer and performing prefix matching, returning seeds of reversed k-mers (see indexing algorithm).
  3. Chaining:
    1. Seeding results, i.e., anchors (matched k-mers from the query and subject sequence), are summarized by genome, and deduplicated.
    2. Performing chaining (see the paper).
  4. Alignment for each chain.
    1. Extending the anchor region. for extracting sequences from the query and reference genome. For example, extending 1 kb in upstream and downstream of anchor region.
    2. Performing pseudo-alignment with extended query and subject sequences, for find similar regions.
      • For these similar regions that accross more than one reference sequences, splitting them into multiple ones.
    3. Fast alignment of query and subject sequence regions with our implementation of Wavefront alignment algorithm.
    4. Filtering alignments based on user options.

Parameters

Flags in bold text are important and frequently used.

Flag Value Function Comment
-j/--threads Default: all available cpus Number of CPU cores to use. The value should be >= the number of seed chunk files (“chunks” in info.toml, set by -c/--chunks in lexicmap index).
-w/--load-whole-seeds Load the whole seed data into memory for faster search Use this if the index is not big and many queries are needed to search.
-n/--top-n-genomes Default 0, 0 for all Keep top N genome matches for a query in the chaining phase Value 1 is not recommended as the best chaining result does not always bring the best alignment, so it better be >= 5. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step.
-a/--all Output more columns, e.g., matched sequences. Use this if you want to output blast-style format with “lexicmap utils 2blast”
-J/--max-query-conc Default 12, 0 for all Maximum number of concurrent queries Bigger values do not improve the batch searching speed and consume much memory.
--max-open-files Default: 1024 Maximum number of open files It mainly affects candidate subsequence extraction. Increase this value if you have hundreds of genome batches or have multiple queries, and do not forgot to set a bigger ulimit -n in shell if the value is > 1024.
Flag Value Function Comment
-p, --seed-min-prefix Default 15 Minimum (prefix) length of matched seeds. Smaller values produce more results at the cost of slow speed.
-P, --seed-min-single-prefix Default 17 Minimum (prefix) length of matched seeds if there’s only one pair of seeds matched. Smaller values produce more results at the cost of slow speed.
--seed-max-dist Default 1000 Max distance between seeds in seed chaining. It should be <= contig interval length in database.
--seed-max-gap Default 200 Max gap in seed chaining.
Flag Value Function Comment
-Q/--min-qcov-per-genome Default 0 Minimum query coverage (percentage) per genome.
-q/--min-qcov-per-hsp Default 0 Minimum query coverage (percentage) per HSP.
-l/--align-min-match-len Default 50 Minimum aligned length in a HSP segment.
-i/--align-min-match-pident Default 70 Minimum base identity (percentage) in a HSP segment.
--align-band Default 100 Band size in backtracking the score matrix.
--align-ext-len Default 1000 Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. It should be <= contig interval length in database.
--align-max-gap Default 20 Maximum gap in a HSP segment.

Improving searching speed

LexicMap’s searching speed is related to many factors:

  • The number of similar sequences in the index/database. More genome hits cost more time, e.g., 16S rRNA gene.
  • The I/O performance and load. LexicMap is I/O bound, because seeds matching (serial access) and extracting candidate subsequences for alignment (random access) require a large number of file readings in parallel.
  • Similarity between query and subject sequences. Alignment of diverse sequences is slightly slower than that of highly similar sequences.
  • The length of query sequence. Longer queries run with more time.
  • CPU frequency and the number of threads. Faster CPUs and more threads cost less time.

Here are some tips to improve the search speed.

  • Returning less results
    • Bigger -p/--seed-min-prefix (default 15) and -P/--seed-min-single-prefix (default 17) increase the search speed at the cost of decreased sensitivity for distant matches (similarity < 90%). Don’t worry if you only search highly similar matches.
    • Setting -n/--top-n-genomes to keep top N genome matches for a query (0 for all) in chaining phase. For queries with a large number of genome hits, a resonable value such as 1000 would significantly reduce the computation time.
    • Note that: alignment result filtering is performed in the final phase, so stricter filtering criteria, including -q/--min-qcov-per-hsp, -Q/--min-qcov-per-genome, and -i/--align-min-match-pident, do not significantly accelerate the search speed. Hence, you can search with default parameters and then filter the result with tools like awk or csvtk.
  • Increasing the concurrency number
    • Make sure that the value of -j/--threads (default: all available CPUs) is ≥ than the number of seed chunk file (default: all available CPUs in the indexing step), which can be found in info.toml file, e.g,
      # Seeds (k-mer-value data) files
      chunks = 48
      
    • Increasing the value of --max-open-files (default 1024). You might also need to change the open files limit.
    • (If you have many queries) Increase the value of -J/--max-query-conc (default 12), it will increase the memory.
  • Loading the entire seed data into memoy (If you have many queries and the index is not very big. It’s unnecessary if the index is stored in SSD)
    • Setting -w/--load-whole-seeds to load the whole seed data into memory for faster seed matching. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.

Steps

  • For short queries like genes or long reads, returning top N hits.

    lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
        --min-match-pident 70 \
        --min-qcov-per-hsp 70 \
        --min-qcov-per-genome 70 \
        --top-n-genomes 10000
    
  • For longer queries like plasmids, returning all hits.

    lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
        --min-match-pident 70 \
        --min-qcov-per-hsp 0 \
        --min-qcov-per-genome 0  \
        --top-n-genomes 0
    
    $ lexicmap search -d demo.lmi/  q.gene.fasta -o q.gene.fasta.lexicmap.tsv
    00:12:58.547 [INFO] LexicMap v0.6.0 (19844cc)
    00:12:58.547 [INFO]   https://github.com/shenwei356/LexicMap
    00:12:58.547 [INFO] 
    00:12:58.547 [INFO] checking input files ...
    00:12:58.547 [INFO]   1 input file given: q.gene.fasta
    00:12:58.547 [INFO] 
    00:12:58.547 [INFO] loading index: demo.lmi/
    00:12:58.556 [INFO]   reading masks...
    00:12:58.563 [INFO]   reading indexes of seeds (k-mer-value) data...
    00:13:00.573 [INFO]   creating reader pools for 1 genome batches, each with 16 readers...
    00:13:00.574 [INFO] index loaded in 2.027094688s
    00:13:00.574 [INFO] 
    00:13:00.574 [INFO] searching with 16 threads...
    00:13:00.575 [DEBU] NC_000913.3:4166659-4168200 (1542 bp): start to search
    00:13:00.589 [DEBU] NC_000913.3:4166659-4168200 (1542 bp): finished seed-matching (15 genome hits) in 14.622755ms
    00:13:00.593 [DEBU] NC_000913.3:4166659-4168200 (1542 bp): finished chaining (15 genome hits) in 3.286315ms
    checked genomes:  15 / 15 [======================================] ETA: 0s. done
    00:13:00.651 [DEBU] NC_000913.3:4166659-4168200 (1542 bp): finished alignment (15 genome hits) in 57.784923ms
    00:13:00.651 [DEBU] NC_000913.3:4166659-4168200 (1542 bp): finished sorting alignment results (15 genome hits) in 17.669µs

    00:13:00.651 [INFO] 
    00:13:00.651 [INFO] processed queries: 1, speed: 774.101 queries per minute
    00:13:00.651 [INFO] 100.0000% (1/1) queries matched
    00:13:00.651 [INFO] done searching
    00:13:00.651 [INFO] search results saved to: q.gene.fasta.lexicmap.tsv
    00:13:00.652 [INFO] 
    00:13:00.652 [INFO] elapsed time: 2.105031915s
    00:13:00.652 [INFO]
  • Extracting similar sequences for a query gene.

    # search matches with query coverage >= 90%
    lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta --min-qcov-per-hsp 90 --all -o results.tsv
    
    # extract matched sequences as FASTA format
    sed 1d results.tsv | awk -F'\t' '{print ">"$5":"$15"-"$16":"$17"\n"$23;}' | seqkit seq -g > results.fasta
    
    seqkit head -n 1 results.fasta | head -n 3
    >NZ_JALSCK010000007.1:39224-40522:-
    TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT
    ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT
    
  • Exporting blast-like alignment text.

    From file:

    lexicmap utils 2blast results.tsv -o results.txt
    

    Add genome annotation

    lexicmap utils 2blast results.tsv -o results.txt --kv-file-genome ass2species.map
    

    From stdin:

    # here, we only align <=200 bp queries and show one low-similarity result.
    
    $ seqkit seq -g -M 200 q.long-reads.fasta.gz \
        | lexicmap search -d demo.lmi/ -a \
        | csvtk filter2 -t -f '$pident >80 && $pident < 90' \
        | csvtk head -t -n 1 \
        | lexicmap utils 2blast --kv-file-genome ass2species.map
    
    Query = GCF_003697165.2_r40
    Length = 186
    
    [Subject genome #1/2] = GCF_002950215.1 Shigella flexneri
    Query coverage per genome = 93.548%
    
    >NZ_CP026788.1 
    Length = 4659463
    
     HSP cluster #1, HSP #1
     Score = 279 bits, Expect = 9.66e-75
     Query coverage per seq = 93.548%, Aligned length = 177, Identities = 88.701%, Gaps = 6
     Query range = 13-186, Subject range = 1124816-1124989, Strand = Plus/Plus
    
    Query  13       CGGAAACTGAAACA-CCAGATTCTACGATGATTATGATGATTTA-TGCTTTCTTTACTAA  70
                    |||||||||||||| |||||||||| | |||||||||||||||| |||||||||| ||||
    Sbjct  1124816  CGGAAACTGAAACAACCAGATTCTATGTTGATTATGATGATTTAATGCTTTCTTTGCTAA  1124875
    
    Query  71       AAAGTAAGCGGCCAAAAAAATGAT-AACACCTGTAATGAGTATCAGAAAAGACACGGTAA  129
                    ||    |||||||||||||||||| |||||||||||||||||||||||||||||||||||
    Sbjct  1124876  AA--GCAGCGGCCAAAAAAATGATTAACACCTGTAATGAGTATCAGAAAAGACACGGTAA  1124933
    
    Query  130      GAAAACACTCTTTTGGATACCTAGAGTCTGATAAGCGATTATTCTCTCTATGTTACT  186
                     || |||||||||    |||||  |||||||||||||||||||||||| |||| |||
    Sbjct  1124934  AAAGACACTCTTTGAAGTACCTGAAGTCTGATAAGCGATTATTCTCTCCATGT-ACT  1124989
    

Output

Alignment result relationship

Query
├── Subject genome
    ├── Subject sequence
        ├── HSP cluster (a cluster of neighboring HSPs)
            ├── High-Scoring segment Pair (HSP)

Here, the defination of HSP is similar with that in BLAST. Actually there are small gaps in HSPs.

A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/

Output format

Tab-delimited format with 20+ columns, with 1-based positions.

1.  query,    Query sequence ID.
2.  qlen,     Query sequence length.
3.  hits,     Number of subject genomes.
4.  sgenome,  Subject genome ID.
5.  sseqid,   Subject sequence ID.
6.  qcovGnm,  Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen.
7.  cls,      Nth HSP cluster in the genome. (just for improving readability)
8.  hsp,      Nth HSP in the genome.         (just for improving readability)
9.  qcovHSP   Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen.
10. alenHSP,  Aligned length in the current HSP.
11. pident,   Percentage of identical matches in the current HSP.
12. gaps,     Gaps in the current HSP.
13. qstart,   Start of alignment in query sequence.
14. qend,     End of alignment in query sequence.
15. sstart,   Start of alignment in subject sequence.
16. send,     End of alignment in subject sequence.
17. sstr,     Subject strand.
18. slen,     Subject sequence length.
19. evalue,   Expect value.
20. bitscore, Bit score.
21. cigar,    CIGAR string of the alignment.                      (optional with -a/--all)
22. qseq,     Aligned part of query sequence.                     (optional with -a/--all)
23. sseq,     Aligned part of subject sequence.                   (optional with -a/--all)
24. align,    Alignment text ("|" and " ") between qseq and sseq. (optional with -a/--all)

Result ordering:

For a HSP cluster, SimilarityScore = aligned_bases * weighted_pident.

  1. Within each HSP cluster, HSPs are sorted by sstart.
  2. Within each subject genome, HSP clusters are sorted in descending order by SimilarityScore.
  3. Results of multiple subject genomes are sorted by the highest SimilarityScore of HSP clusters.

Examples

query                                      qlen   hits   sgenome           sseqid                 qcovGnm   cls   hsp   qcovHSP   alenHSP   pident    gaps   qstart   qend   sstart    send      sstr   slen      evalue     bitscore
----------------------------------------   ----   ----   ---------------   --------------------   -------   ---   ---   -------   -------   -------   ----   ------   ----   -------   -------   ----   -------   --------   --------
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCF_017641985.1   NZ_SIYB01000008.1      100.000   1     1     100.000   1299      100.000   0      1        1299   38843     40141     -      140750    0.00e+00   2343    
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCF_020405635.1   NZ_JAIZEZ010000006.1   100.000   1     1     100.000   1299      100.000   0      1        1299   217200    218498    +      257915    0.00e+00   2343    
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCF_902163655.1   NZ_CABHAB010000005.1   100.000   1     1     100.000   1299      100.000   0      1        1299   39259     40557     -      173125    0.00e+00   2343    
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCF_020881975.1   NZ_CP086411.1          100.000   1     1     100.000   1299      100.000   0      1        1299   212962    214260    +      2995874   0.00e+00   2343    
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCF_900148695.1   NZ_FRXS01000009.1      100.000   1     1     100.000   1299      100.000   0      1        1299   39230     40528     -      96692     0.00e+00   2343    
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCF_009735055.1   NZ_WMGL01000019.1      100.000   1     1     100.000   1299      100.000   0      1        1299   38954     40252     -      58628     0.00e+00   2343    
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCA_021838685.1   JAKCDA010000012.1      100.000   1     1     100.000   1299      100.000   0      1        1299   44127     45425     +      84305     0.00e+00   2343    
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCF_000742975.1   NZ_CP008816.1          100.000   1     1     100.000   1299      100.000   0      1        1299   2311184   2312482   +      2939973   0.00e+00   2343    
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCF_925298485.1   NZ_CAKMCI010000007.1   100.000   1     1     100.000   1299      100.000   0      1        1299   56108     57406     +      96359     0.00e+00   2343    
lcl|NZ_CP064374.1_cds_WP_002359350.1_906   1299   6255   GCF_021120765.1   NZ_JADPWI010000009.1   100.000   1     1     100.000   1299      100.000   0      1        1299   38881     40179     -      84306     0.00e+00   2343
query                         qlen   hits     sgenome           sseqid              qcovGnm   cls   hsp   qcovHSP   alenHSP   pident    gaps   qstart   qend   sstart    send      sstr   slen      evalue     bitscore
---------------------------   ----   ------   ---------------   -----------------   -------   ---   ---   -------   -------   -------   ----   ------   ----   -------   -------   ----   -------   --------   --------
NC_000913.3:4166659-4168200   1542   306060   GCF_000468515.1   NC_022364.1         100.000   1     1     100.000   1542      100.000   0      1        1542   224243    225784    +      4835601   0.00e+00   2782    
NC_000913.3:4166659-4168200   1542   306060   GCF_000468515.1   NC_022364.1         100.000   2     2     100.000   1542      100.000   0      1        1542   2804586   2806127   -      4835601   0.00e+00   2782    
NC_000913.3:4166659-4168200   1542   306060   GCF_000468515.1   NC_022364.1         100.000   3     3     100.000   1542      100.000   0      1        1542   4350745   4352286   +      4835601   0.00e+00   2782    
NC_000913.3:4166659-4168200   1542   306060   GCF_000468515.1   NC_022364.1         100.000   4     4     100.000   1542      99.676    0      1        1542   4391290   4392831   +      4835601   0.00e+00   2758    
NC_000913.3:4166659-4168200   1542   306060   GCF_000468515.1   NC_022364.1         100.000   5     5     100.000   1542      99.611    0      1        1542   4083001   4084542   +      4835601   0.00e+00   2755    
NC_000913.3:4166659-4168200   1542   306060   GCF_000468515.1   NC_022364.1         100.000   6     6     100.000   1542      99.481    0      1        1542   4177970   4179511   +      4835601   0.00e+00   2746    
NC_000913.3:4166659-4168200   1542   306060   GCF_000468515.1   NC_022364.1         100.000   7     7     100.000   1542      99.157    0      1        1542   3548937   3550478   -      4835601   0.00e+00   2722    
NC_000913.3:4166659-4168200   1542   306060   GCA_020564915.1   JAJBZO010000001.1   100.000   1     1     100.000   1542      100.000   0      1        1542   507976    509517    +      4637373   0.00e+00   2782    
NC_000913.3:4166659-4168200   1542   306060   GCA_020564915.1   JAJBZO010000001.1   100.000   2     2     100.000   1542      100.000   0      1        1542   1180104   1181645   +      4637373   0.00e+00   2782    
NC_000913.3:4166659-4168200   1542   306060   GCA_020564915.1   JAJBZO010000001.1   100.000   3     3     100.000   1542      100.000   0      1        1542   3674442   3675983   -      4637373   0.00e+00   2782
query        qlen    hits    sgenome           sseqid                 qcovGnm   cls   hsp   qcovHSP   alenHSP   pident    gaps   qstart   qend    sstart    send      sstr   slen      evalue      bitscore
----------   -----   -----   ---------------   --------------------   -------   ---   ---   -------   -------   -------   ----   ------   -----   -------   -------   ----   -------   ---------   --------
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000005.1   100.000   1     1     77.157    40762     99.995    0      12069    52830   5194      45955     +      51479     0.00e+00    73501   
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000005.1   100.000   2     2     10.456    5524      100.000   0      1        5524    45956     51479     +      51479     0.00e+00    9963    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000005.1   100.000   3     3     9.860     5209      100.000   0      5525     10733   1         5209      +      51479     0.00e+00    9395    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000005.1   100.000   4     4     1.562     827       99.758    2      9044     9868    16840     17666     +      51479     0.00e+00    1496    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000005.1   100.000   5     5     1.565     827       99.758    2      23715    24541   3520      4344      +      51479     0.00e+00    1496    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000005.1   100.000   6     6     0.208     110       98.182    0      29827    29936   29056     29165     -      51479     6.76e-44    190     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   7     7     0.661     349       74.212    0      10092    10440   15029     15377     +      203534    1.04e-53    224     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   7     8     2.531     1337      94.091    0      10733    12069   14193     15529     +      203534    0.00e+00    2055    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   8     9     1.554     821       99.878    0      9049     9869    80484     81304     -      203534    0.00e+00    1476    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   9     10    1.552     820       99.878    0      23722    24541   80485     81304     -      203534    0.00e+00    1474    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   10    11    1.105     584       99.829    0      7767     8350    84382     84965     +      203534    1.41e-301   1049    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   10    12    0.320     169       100.000   0      9133     9301    83857     84025     +      203534    1.87e-78    306     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   11    13    0.439     232       79.310    0      23403    23634   84597     84828     +      203534    2.26e-47    203     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   11    14    0.996     526       100.000   0      23806    24331   83857     84382     +      203534    9.16e-272   949     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   12    15    0.509     269       99.628    0      8082     8350    84697     84965     +      203534    6.55e-131   480     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   12    16    0.996     526       100.000   0      9133     9658    83857     84382     +      203534    9.16e-272   949     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   13    17    0.502     265       100.000   0      19788    20052   88528     88792     +      203534    2.25e-130   479     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000002.1   100.000   14    18    0.409     224       71.429    8      8827     9042    45434     45657     +      203534    5.28e-36    165     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000001.1   100.000   15    19    1.554     821       100.000   0      9049     9869    2597473   2598293   +      5332228   0.00e+00    1481    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000001.1   100.000   16    20    1.554     821       100.000   0      23721    24541   231535    232355    +      5332228   0.00e+00    1481    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000001.1   100.000   17    21    1.554     821       100.000   0      23721    24541   2597472   2598292   +      5332228   0.00e+00    1481    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000001.1   100.000   18    22    1.552     820       100.000   0      9049     9868    231536    232355    +      5332228   0.00e+00    1480    
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000001.1   100.000   19    23    0.502     265       100.000   0      19788    20052   4258498   4258762   +      5332228   2.25e-130   479     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000003.1   100.000   20    24    0.507     268       99.627    0      19785    20052   58030     58297     -      141533    2.28e-130   479     
CP115019.1   52830   65311   GCF_021502915.1   NZ_JAJAAP010000003.1   100.000   21    25    0.424     224       100.000   0      19788    20011   109732    109955    +      141533    3.45e-108   405     
CP115019.1   52830   65311   GCF_016803855.1   NZ_CP048009.1          98.158    1     1     77.157    40762     99.995    0      12069    52830   7768      48529     -      51479     0.00e+00    73501   
CP115019.1   52830   65311   GCF_016803855.1   NZ_CP048009.1          98.158    2     2     14.702    7767      100.000   0      1        7767    1         7767      -      51479     0.00e+00    14008   
CP115019.1   52830   65311   GCF_016803855.1   NZ_CP048009.1          98.158    3     3     5.614     2966      100.000   0      7768     10733   48514     51479     -      51479     0.00e+00    5350    
CP115019.1   52830   65311   GCF_016803855.1   NZ_CP048009.1          98.158    4     4     1.565     827       99.758    2      23715    24541   49379     50203     -      51479     0.00e+00    1496

Queries are a few Nanopore Q20 reads from a mock metagenomic community.

query                qlen   hits   sgenome           sseqid              qcovGnm   cls   hsp   qcovHSP   alenHSP   pident   gaps   qstart   qend   sstart    send      sstr   slen      evalue      bitscore
------------------   ----   ----   ---------------   -----------------   -------   ---   ---   -------   -------   ------   ----   ------   ----   -------   -------   ----   -------   ---------   --------
ERR5396170.1000004   190    1      GCF_000227465.1   NC_016047.1         84.211    1     1     84.211    165       89.091   5      14       173    4189372   4189536   -      4207222   1.93e-63    253     
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    1     1     99.623    801       97.628   9      4        796    1138907   1139706   +      1887974   0.00e+00    1431    
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    2     2     99.623    801       97.628   9      4        796    32607     33406     +      1887974   0.00e+00    1431    
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    3     3     99.623    801       97.628   9      4        796    134468    135267    -      1887974   0.00e+00    1431    
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    4     4     99.623    801       97.503   9      4        796    1768896   1769695   +      1887974   0.00e+00    1427    
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    5     5     99.623    801       97.378   9      4        796    242012    242811    -      1887974   0.00e+00    1422    
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    6     6     99.623    801       96.879   12     4        796    154380    155176    -      1887974   0.00e+00    1431    
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    7     7     57.915    469       95.736   9      4        464    1280313   1280780   +      1887974   3.71e-236   829     
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    8     8     42.839    341       99.120   0      456      796    1282477   1282817   +      1887974   6.91e-168   601     
ERR5396170.1000006   796    3      GCF_009663775.1   NZ_RDBR01000008.1   99.623    1     1     99.623    801       93.383   9      4        796    21391     22190     -      52610     0.00e+00    1278    
ERR5396170.1000006   796    3      GCF_003344625.1   NZ_QPKJ02000188.1   97.362    1     1     87.437    700       98.143   5      22       717    1         699       -      826       0.00e+00    1249    
ERR5396170.1000006   796    3      GCF_003344625.1   NZ_QPKJ02000423.1   97.362    2     2     27.889    222       99.550   0      575      796    1         222       +      510       3.47e-106   396     
ERR5396170.1000000   698    2      GCF_001457615.1   NZ_LN831024.1       92.264    1     1     92.264    656       96.341   13     53       696    4452083   4452737   +      6316979   0.00e+00    1169    
ERR5396170.1000000   698    2      GCF_000949385.2   NZ_JYKO02000001.1   91.977    1     1     91.977    654       78.135   13     55       696    5638788   5639440   -      5912440   2.68e-176   630     
ERR5396170.1000001   2505   3      GCF_000307025.1   NC_018584.1         67.066    1     1     67.066    1690      97.633   16     47       1726   1905511   1907194   -      2951805   0.00e+00    2985    
ERR5396170.1000001   2505   3      GCF_900187225.1   NZ_LT906436.1       65.070    1     1     65.070    1641      93.723   20     95       1724   1869503   1871134   -      2864663   0.00e+00    2626    
ERR5396170.1000001   2505   3      GCF_013394085.1   NZ_CP040910.1       30.858    1     1     30.858    780       97.692   9      1726     2498   183873    184650    +      1887974   0.00e+00    1384    
ERR5396170.1000001   2505   3      GCF_013394085.1   NZ_CP040910.1       30.858    2     2     5.030     127       87.402   1      2233     2358   1236170   1236296   +      1887974   1.73e-37    167     
ERR5396170.1000001   2505   3      GCF_013394085.1   NZ_CP040910.1       30.858    3     3     5.150     130       80.769   12     2233     2361   930381    930499    -      1887974   6.61e-43    185     
ERR5396170.1000001   2505   3      GCF_013394085.1   NZ_CP040910.1       30.858    4     4     3.713     93        93.548   0      2257     2349   1104581   1104673   -      1887974   5.09e-30    141

Search results (TSV format) above are formatted with csvtk pretty.

Summarizing results

If you would like to summarize alignment results, e.g., the number of species, here’s the method.

  1. Prepare a two-column tab-delimited file for mapping reference (genome) or sequence IDs to any information (such as species name).

     # for GTDB/GenBank/RefSeq genomes downloaded with genome_updater
     cut -f 1,8 assembly_summary.txt > ass2species.tsv
    
     head -n 3 ass2species.tsv
     GCF_002287175.1 Methanobacterium bryantii
     GCF_000762265.1 Methanobacterium formicicum
     GCF_029601605.1 Methanobacterium formicicum
    
  2. Add information to the alignment result with csvtk or other tools.

     # add species
     cat b.gene_E_coli_16S.fasta.lexicmap.tsv \
         | csvtk mutate -t --after slen -n species -f sgenome \
         | csvtk replace -t -f species -p "(.+)" -r "{kv}" -k ass2species.tsv \
         > result.with_species.tsv
    
     # filter result with query coverage >= 80 and count the species
     cat result.with_species.tsv \
         | csvtk uniq -t -f sgenome \
         | csvtk filter2 -t -f "\$qcovHSP >= 80" \
         | csvtk freq -t -f species -nr \
         > result.with_species.tsv.stats.tsv
    
     csvtk head -t -n 5 result.with_species.tsv.stats.tsv \
         | csvtk pretty -t
    
     species                    frequency
     ------------------------   ---------
     Salmonella enterica        135065   
     Escherichia coli           128071   
     Streptococcus pneumoniae   51971    
     Staphylococcus aureus      44215    
     Pseudomonas aeruginosa     34254