Step 2. Searching
-
Build a LexicMap index.
-
Run:
-
For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000
-
For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0
-
Query length
LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 200 bp by default. However, short queries can also be aligned.
Input should be (gzipped) FASTA or FASTQ records from files or STDIN.
See benchmark of index building.
LexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.
- CPU:
- No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported.
- More is better as LexicMap is a CPU-intensive software. It uses all CPUs by default (
-j/--threads
).
- RAM
- More RAM (> 16 GB) is preferred. The memory usage in searching is mainly related to:
- The number of matched genomes and sequences.
- The length of query sequences.
- Similarities between query and target sequences.
- The number of threads. It uses all CPUs by default (
-j/--threads
).
- More RAM (> 16 GB) is preferred. The memory usage in searching is mainly related to:
- Disk
- Sufficient space is required to store the index size.
- No temporary files are generated during searching.
Flags in bold text are important and frequently used.
Flag | Value | Function | Comment |
---|---|---|---|
-j/--threads |
Default: all available cpus | Number of CPU cores to use. | The value should be >= the number of seed chunk files (“chunks” in info.toml, set by -c/--chunks in lexicmap index ). |
-w/--load-whole-seeds |
Load the whole seed data into memory for faster search | Use this if the index is not big and many queries are needed to search. | |
-n/--top-n-genomes |
Default 0, 0 for all | Keep top N genome matches for a query in the chaining phase | Value 1 is not recommended as the best chaining result does not always bring the best alignment, so it better be >= 5. The final number of genome hits might be smaller than this number as some chaining results might fail to pass the criteria in the alignment step. |
-a/--all |
Output more columns, e.g., matched sequences. | Use this if you want to output blast-style format with “lexicmap utils 2blast” | |
-J/–max-query-conc | Default 12, 0 for all | Maximum number of concurrent queries | Bigger values do not improve the batch searching speed and consume much memory. |
Flag | Value | Function | Comment |
---|---|---|---|
-p, --seed-min-prefix |
Default 15 | Minimum (prefix) length of matched seeds. | Smaller values produce more results at the cost of slow speed. |
-P, --seed-min-single-prefix |
Default 17 | Minimum (prefix) length of matched seeds if there’s only one pair of seeds matched. | Smaller values produce more results at the cost of slow speed. |
--seed-max-dist |
Default 1000 | Max distance between seeds in seed chaining. It should be <= contig interval length in database. | |
--seed-max-gap |
Default 200 | Max gap in seed chaining. | |
Flag | Value | Function | Comment |
---|---|---|---|
-Q/--min-qcov-per-genome |
Default 0 | Minimum query coverage (percentage) per genome. | |
-q/--min-qcov-per-hsp |
Default 0 | Minimum query coverage (percentage) per HSP. | |
-l/--align-min-match-len |
Default 50 | Minimum aligned length in a HSP segment. | |
-i/--align-min-match-pident |
Default 70 | Minimum base identity (percentage) in a HSP segment. | |
--align-band |
Default 50 | Band size in backtracking the score matrix. | |
--align-ext-len |
Default 1000 | Extend length of upstream and downstream of seed regions, for extracting query and target sequences for alignment. It should be <= contig interval length in database. | |
--align-max-gap |
Default 20 | Maximum gap in a HSP segment. |
LexicMap’s searching speed is related to many factors:
- The number of similar sequences in the index/database. More genome hits cost more time, e.g., 16S rRNA gene.
- Similarity between query and subject sequences. Alignment of diverse sequences is slower than that of highly similar sequences.
- The length of query sequence. Longer queries run with more time.
- The I/O performance and load. LexicMap is I/O bound, because seeds matching and extracting candidate subsequences for alignment require a large number of file readings in parallel.
- CPU frequency and the number of threads. Faster CPUs and more threads cost less time.
Here are some tips to improve the search speed.
- Increasing the concurrency number
-
Make sure that the value of
-j/--threads
(default: all available CPUs) is ≥ than the number of seed chunk file (default: all available CPUs in the indexing step), which can be found ininfo.toml
file, e.g,# Seeds (k-mer-value data) files chunks = 48
-
Increasing the value of
--max-open-files
(default 512). You might also need to change the open files limit. -
(If you have many queries) Increase the value of
-J/--max-query-conc
(default 12), it will increase the memory.
-
- (If you have many queries) Loading the entire seed data into memoy (It’s unnecessary if the index is stored in SSD)
- Setting
-w/--load-whole-seeds
to load the whole seed data into memory for faster search. For example, for ~85,000 GTDB representative genomes, the memory would be ~260 GB with default parameters.
- Setting
- Returning less results
- Setting
-n/--top-n-genomes
to keep top N genome matches for a query (0 for all) in chaining phase. For queries with a large number of genome hits, a resonable value such as 1000 would reduce the computation time.
- Setting
-
For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-match-pident 70 \ --min-qcov-per-hsp 70 \ --min-qcov-per-genome 70 \ --top-n-genomes 1000
-
For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \ --min-match-pident 70 \ --min-qcov-per-hsp 0 \ --min-qcov-per-genome 0 \ --top-n-genomes 0
-
Extracting similar sequences for a query gene.
# search matches with query coverage >= 90% lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta --min-qcov-per-hsp 90 --all -o results.tsv # extract matched sequences as FASTA format sed 1d results.tsv | awk -F'\t' '{print ">"$5":"$14"-"$15":"$16"\n"$20;}' | seqkit seq -g > results.fasta seqkit head -n 1 results.fasta | head -n 3 >NZ_JALSCK010000007.1:39224-40522:- TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT
-
Exporting blast-like alignment text.
From file:
lexicmap utils 2blast results.tsv -o results.txt
Add genome annotation
lexicmap utils 2blast results.tsv -o results.txt --kv-file-genome ass2species.map
From stdin:
# align only one long-read <= 500 bp $ seqkit seq -M 500 q.long-reads.fasta.gz \ | seqkit head -n 1 \ | lexicmap search -d demo.lmi/ -a \ | lexicmap utils 2blast --kv-file-genome ass2species.map Query = GCF_006742205.1_r100 Length = 431 [Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis Query coverage per genome = 92.575% >NZ_AP019721.1 Length = 2422602 HSP #1 Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4 Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018 Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958 Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211 |||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||| Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899 Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839 Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330 |||||||||||||||||||||||||||| ||||||||||||||||||||||||| ||||| Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779 Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390 |||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||| Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719 Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431 |||||||||| ||||||||||||||||||||||||||||||| Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677
Query
├── Subject genome # A query might have one or more genome hits,
├── Subject sequence # in different sequences.
├── High-Scoring segment Pair (HSP) # HSP is an alignment segment.
Here, the defination of HSP is similar with that in BLAST. Actually there are small gaps in HSPs.
A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the highest alignment scores in a given search. https://www.ncbi.nlm.nih.gov/books/NBK62051/
Tab-delimited format with 17+ columns, with 1-based positions.
1. query, Query sequence ID.
2. qlen, Query sequence length.
3. hits, Number of subject genomes.
4. sgenome, Subject genome ID.
5. sseqid, Subject sequence ID.
6. qcovGnm, Query coverage (percentage) per genome: $(aligned bases in the genome)/$qlen.
7. hsp, Nth HSP in the genome. (just for improving readability)
8. qcovHSP Query coverage (percentage) per HSP: $(aligned bases in a HSP)/$qlen.
9. alenHSP, Aligned length in the current HSP.
10. pident, Percentage of identical matches in the current HSP.
11. gaps, Gaps in the current HSP.
12. qstart, Start of alignment in query sequence.
13. qend, End of alignment in query sequence.
14. sstart, Start of alignment in subject sequence.
15. send, End of alignment in subject sequence.
16. sstr, Subject strand.
17. slen, Subject sequence length.
18. cigar, CIGAR string of the alignment. (optional with -a/--all)
19. qseq, Aligned part of query sequence. (optional with -a/--all)
20. sseq, Aligned part of subject sequence. (optional with -a/--all)
21. align, Alignment text ("|" and " ") between qseq and sseq. (optional with -a/--all)
Result ordering:
- Within each subject genome, alignments (HSP) are sorted by qcovHSP*pident.
- Results of multiple subject genomes are sorted by qcovHSP*pident of the best alignment.
query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen
---------------------------------------- ---- ---- --------------- -------------------- ------- --- ------- ------- ------- ---- ------ ---- ------ ------ ---- -------
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_000395405.1 NZ_KB947497.1 100.000 1 100.000 1299 100.000 0 1 1299 232279 233577 + 274511
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_019731615.1 NZ_JAASJA010000010.1 100.000 1 100.000 1299 100.000 0 1 1299 2798 4096 + 42998
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCA_004103085.1 RPCL01000012.1 100.000 1 100.000 1299 100.000 0 1 1299 44095 45393 + 84242
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_023571745.1 NZ_JAMKBS010000014.1 100.000 1 100.000 1299 100.000 0 1 1299 44077 45375 + 84206
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_013248625.1 NZ_JABTDK010000002.1 100.000 1 100.000 1299 100.000 0 1 1299 9609 10907 + 49787
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900092155.1 NZ_FLUS01000006.1 100.000 1 100.000 1299 100.000 0 1 1299 63161 64459 + 77366
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902165815.1 NZ_CABHHZ010000005.1 100.000 1 100.000 1299 100.000 0 1 1299 39386 40684 - 200163
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_014243495.1 NZ_SJAV01000002.1 100.000 1 100.000 1299 100.000 0 1 1299 39085 40383 - 256772
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_900148695.1 NZ_FRXS01000009.1 100.000 1 100.000 1299 100.000 0 1 1299 39230 40528 - 96692
lcl|NZ_CP064374.1_cds_WP_002359350.1_906 1299 3580 GCF_902164645.1 NZ_LR607334.1 100.000 1 100.000 1299 100.000 0 1 1299 236677 237975 + 3380663
query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen
--------------------------- ---- ------ --------------- ----------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- -------
NC_000913.3:4166659-4168200 1542 293398 GCF_002248685.1 NZ_NQBE01000079.1 100.000 1 100.000 1542 100.000 0 1 1542 40 1581 - 99259
NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 1 100.000 1542 100.000 0 1 1542 1270211 1271752 + 5483624
NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 2 100.000 1542 100.000 0 1 1542 5466287 5467828 - 5483624
NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 3 100.000 1543 99.546 2 1 1542 557008 558549 + 5483624
NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 4 100.000 1543 99.482 2 1 1542 4473658 4475199 - 5483624
NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 5 100.000 1543 99.482 2 1 1542 5154150 5155691 - 5483624
NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 6 100.000 1543 99.482 2 1 1542 5195176 5196717 - 5483624
NC_000913.3:4166659-4168200 1542 293398 GCF_017164795.1 NZ_CP062702.1 100.000 7 100.000 1543 99.482 2 1 1542 5369865 5371406 - 5483624
NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701684.1 100.000 1 100.000 1542 100.000 0 1 1542 1108651 1110192 - 1914390
NC_000913.3:4166659-4168200 1542 293398 GCF_000460355.1 NZ_KE701686.1 100.000 2 100.000 1542 99.741 0 1 1542 100680 102221 + 102235
query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen
---------- ----- ----- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ----- ------- ------- ---- -------
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 4 0.916 484 91.116 0 51686 52169 27192 27675 - 34058
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086535.1 97.473 5 0.829 438 90.868 1 52342 52779 26583 27019 - 34058
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086534.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185
CP115019.1 52830 58744 GCF_022759845.1 NZ_CP086533.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479
CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 1 75.792 40041 99.995 0 12069 52109 11439 51479 + 51479
CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 2 20.316 10733 100.000 0 1 10733 722 11454 + 51479
CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 3 1.365 721 100.000 0 52110 52830 1 721 + 51479
CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086547.1 97.473 4 0.916 484 91.116 0 51686 52169 3843 4326 + 34058
CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086547.1 97.473 5 0.829 438 90.868 1 52342 52779 4499 4935 + 34058
CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 6 1.552 820 100.000 0 9049 9868 23092 23911 + 51479
CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086546.1 97.473 7 0.502 265 100.000 0 19788 20052 29842 30106 + 47185
CP115019.1 52830 58744 GCF_022759905.1 NZ_CP086545.1 97.473 8 0.159 84 97.619 0 8348 8431 19574 19657 + 51479
CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 1 77.157 40762 99.993 0 12069 52830 9513 50274 + 51480
CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 2 18.033 9528 99.990 1 1207 10733 1 9528 + 51480
CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058621.1 97.473 3 2.283 1206 100.000 0 1 1206 50275 51480 + 51480
CP115019.1 52830 58744 GCF_014826015.1 NZ_CP058618.1 97.473 4 2.497 1319 100.000 0 25153 26471 3019498 3020816 - 4718403
Queries are a few Nanopore Q20 reads from a mock metagenomic community.
query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen
------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- -------
ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974
ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979
ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974
ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974
ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551
ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551
ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 99.729 1 32 1137 1400097 1401203 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974
Search results (TSV format) above are formatted with csvtk pretty.
If you would like to summarize alignment results, e.g., the number of species, here’s the method.
-
Prepare a two-column tab-delimited file for mapping reference (genome) or sequence IDs to any information (such as species name).
# for GTDB/GenBank/RefSeq genomes downloaded with genome_updater cut -f 1,8 assembly_summary.txt > ass2species.tsv head -n 3 ass2species.tsv GCF_002287175.1 Methanobacterium bryantii GCF_000762265.1 Methanobacterium formicicum GCF_029601605.1 Methanobacterium formicicum
-
Add information to the alignment result with csvtk or other tools.
# add species cat b.gene_E_coli_16S.fasta.lexicmap.tsv \ | csvtk mutate -t --after slen -n species -f sgenome \ | csvtk replace -t -f species -p "(.+)" -r "{kv}" -k ass2species.tsv \ > result.with_species.tsv # filter result with query coverage >= 80 and count the species cat result.with_species.tsv \ | csvtk uniq -t -f sgenome \ | csvtk filter2 -t -f "\$qcovHSP >= 80" \ | csvtk freq -t -f species -nr \ > result.with_species.tsv.stats.tsv csvtk head -t -n 5 result.with_species.tsv.stats.tsv \ | csvtk pretty -t species frequency ------------------------ --------- Salmonella enterica 135065 Escherichia coli 128071 Streptococcus pneumoniae 51971 Staphylococcus aureus 44215 Pseudomonas aeruginosa 34254