Introduction
LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences against up to millions of prokaryotic genomes.
Preprint:
Wei Shen and Zamin Iqbal. (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes. bioRxiv. https://doi.org/10.1101/2024.08.30.610459
- LexicMap is scalable to up to millions of prokaryotic genomes.
- The sensitivity of LexicMap is comparable with Blastn.
- The alignment is fast and memory-efficient.
- LexicMap is easy to install, we provide binary files with no dependencies for Linux, Windows, MacOS (x86 and arm CPUs).
- LexicMap is easy to use (tutorials and usages). Both tabular and Blast-style output formats are available.
- Besides, we provide several commands to explore the index data and extract indexed subsequences.
Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale.
- Existing full alignment tools face challenges of high memory consumption and slow speeds.
- Alignment-free large-scale sequence searching tools only return the matched genomes, without the vital positional information for downstream analysis.
- Prefilter+Align strategies have the sensitivity issue in the prefiltering step.
Methods: (algorithm overview)
- An improved version of the sequence sketching method LexicHash is adopted to compute alignment seeds accurately and efficiently.
- We solved the sketching deserts problem of LexicHash seeds to provide a window guarantee.
- We added the support of suffix matching of seeds, making seeds much more tolerant to mutations. Any 31-bp seed with a common ≥15 bp prefix or suffix can be matched, which means seeds are immune to any single SNP.
- A hierarchical index enables fast and low-memory variable-length seed matching (prefix + suffix matching).
- A pseudo alignment algorithm is used to find similar sequence regions from chaining results for alignment.
- A reimplemented Wavefront alignment algorithm is used for base-level alignment.
Results:
-
LexicMap enables efficient indexing and searching of both RefSeq+GenBank and the AllTheBacteria datasets (2.3 and 1.9 million prokaryotic assemblies respectively). Running at this scale has previously only been achieved by Phylign (previously called mof-search), which compresses genomes with phylogenetic information and provides searching (prefiltering with COBS and alignment with minimap2).
-
For searching in all 2,340,672 Genbank+Refseq prokaryotic genomes, Blastn is unable to run with this dataset on common servers as it requires >2000 GB RAM. (see performance).
With LexicMap v0.4.0 (48 CPUs),
Query Genome hits Time RAM A 1.3-kb marker gene 37,164 36 s 4.1 GB A 1.5-kb 16S rRNA 1,949,496 10 m 41 s 14.1 GB A 52.8-kb plasmid 544,619 19 m 20 s 19.3 GB 1003 AMR genes 25,702,419 187 m 40 s 55.4 GB
Building an index (see the tutorial of building an index).
# From a directory with multiple genome files
lexicmap index -I genomes/ -O db.lmi
# From a file list with one file per line
lexicmap index -X files.txt -O db.lmi
Querying (see the tutorial of searching).
# For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
--min-qcov-per-hsp 70 --min-qcov-per-genome 70 --top-n-genomes 1000
# For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
--min-qcov-per-hsp 0 --min-qcov-per-genome 0 --top-n-genomes 0
Sample output (queries are a few Nanopore Q20 reads). See output format details.
query qlen hits sgenome sseqid qcovGnm hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen
------------------ ---- ---- --------------- ------------- ------- --- ------- ------- ------- ---- ------ ---- ------- ------- ---- -------
ERR5396170.1000016 740 1 GCF_013394085.1 NZ_CP040910.1 89.595 1 89.595 663 99.246 0 71 733 13515 14177 + 1887974
ERR5396170.1000000 698 1 GCF_001457615.1 NZ_LN831024.1 85.673 1 85.673 603 98.010 5 53 650 4452083 4452685 + 6316979
ERR5396170.1000017 516 1 GCF_013394085.1 NZ_CP040910.1 94.574 1 94.574 489 99.591 2 27 514 293509 293996 + 1887974
ERR5396170.1000012 848 1 GCF_013394085.1 NZ_CP040910.1 95.165 1 95.165 811 97.411 7 22 828 190329 191136 - 1887974
ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 1 60.000 973 95.889 13 365 1333 88793 89756 - 2884551
ERR5396170.1000038 1615 1 GCA_000183865.1 CM001047.1 64.706 2 4.706 76 98.684 0 266 341 89817 89892 - 2884551
ERR5396170.1000036 1159 1 GCF_013394085.1 NZ_CP040910.1 95.427 1 95.427 1107 99.729 1 32 1137 1400097 1401203 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 1 86.486 707 99.151 3 104 807 242235 242941 - 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 2 86.486 707 98.444 3 104 807 1138777 1139483 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 3 84.152 688 98.983 4 104 788 154620 155306 - 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 4 84.029 687 99.127 3 104 787 32477 33163 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 5 72.727 595 98.992 3 104 695 1280183 1280777 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 6 11.671 95 100.000 0 693 787 1282480 1282574 + 1887974
ERR5396170.1000031 814 4 GCF_013394085.1 NZ_CP040910.1 86.486 7 82.064 671 99.106 3 120 787 1768782 1769452 + 1887974
CIGAR string, aligned query and subject sequences can be outputted as extra columns via the flag -a/--all
.
# Extracting similar sequences for a query gene.
# search matches with query coverage >= 90%
lexicmap search -d gtdb_complete.lmi/ b.gene_E_faecalis_SecY.fasta -o results.tsv \
--min-qcov-per-hsp 90 --all
# extract matched sequences as FASTA format
sed 1d results.tsv | awk -F'\t' '{print ">"$5":"$14"-"$15":"$16"\n"$20;}' \
| seqkit seq -g > results.fasta
seqkit head -n 1 results.fasta | head -n 3
>NZ_JALSCK010000007.1:39224-40522:-
TTGTTCAAGCTATTAAAGAACGCCTTTAAAGTCAAAGACATTAGATCAAAAATCTTATTT
ACAGTTTTAATCTTGTTTGTATTTCGCCTAGGTGCGCACATTACTGTGCCCGGGGTGAAT
Export blast-style format:
seqkit seq -M 500 q.long-reads.fasta.gz \
| seqkit head -n 1 \
| lexicmap search -d demo.lmi/ -a \
| lexicmap utils 2blast --kv-file-genome ass2species.map
Query = GCF_006742205.1_r100
Length = 431
[Subject genome #1/1] = GCF_006742205.1 Staphylococcus epidermidis
Query coverage per genome = 92.575%
>NZ_AP019721.1
Length = 2422602
HSP #1
Query coverage per seq = 92.575%, Aligned length = 402, Identities = 98.507%, Gaps = 4
Query range = 33-431, Subject range = 1321677-1322077, Strand = Plus/Minus
Query 33 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 92
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1322077 TAAAACGATTGCTAATGAGTCACGTATTTCATCTGGTTCGGTAACTATACCGTCTACTAT 1322018
Query 93 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTG-TACATTTGC 151
|||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||
Sbjct 1322017 GGACTCAGTGTAACCCTGTAATAAAGAGATTGGCGTACGTAATTCATGTGATACATTTGC 1321958
Query 152 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCACAGGATGACCA 211
|||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||
Sbjct 1321957 TATAAAATCTTTTTTCATTTGATCAAGATTATGTTCATTTGTCATATCAC-GGATGACCA 1321899
Query 212 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 271
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1321898 TGACAATACCACTTCTACCATTTGTTTGAATTCTATCTATATAACTGGAGATAAATACAT 1321839
Query 272 AGTACCTTGTATTAATTTCTAATTCTAA-TACTCATTCTGTTGTGATTCAAATGGTGCTT 330
|||||||||||||||||||||||||||| ||||||||||||||||||||||||| |||||
Sbjct 1321838 AGTACCTTGTATTAATTTCTAATTCTAAATACTCATTCTGTTGTGATTCAAATGTTGCTT 1321779
Query 331 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATAATCAG 390
|||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||
Sbjct 1321778 CAATTTGCTGTTCAATAGATTCTTTTGAAAAATCATCAATGTGACGCATAATATCATCAG 1321719
Query 391 CCATCTTGTT-GACAATATGATTTCACGTTGATTATTAATGC 431
|||||||||| |||||||||||||||||||||||||||||||
Sbjct 1321718 CCATCTTGTTTGACAATATGATTTCACGTTGATTATTAATGC 1321677
Learn more tutorials and usages.
dataset | genomes | gzip_size | tool | db_size | time | RAM |
---|---|---|---|---|---|---|
GTDB complete | 402,538 | 443 GB | LexicMap | 973 GB | 10 h 36 m | 63.3 GB |
Blastn | 387 GB | 3 h 11 m | 718 MB | |||
AllTheBacteria HQ | 1,858,610 | 2.5 TB | LexicMap | 4.26 TB | 48 h 08 m | 88.6 GB |
Blastn | 1.93 TB | 14 h 03 m | 2.9 GB | |||
Phylign | 248 GB | / | / | |||
Genbank+RefSeq | 2,340,672 | 2.7 TB | LexicMap | 5.43 TB | 54 h 33 m | 178.3 GB |
Blastn | 2.37 TB | 14 h 04 m | 4.3 GB |
Notes:
- All files are stored on a server with HDD disks. No files are cached in memory.
- Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz).
- LexicMap index building parameters:
-k 31 -m 40000
. Genome batch size:-b 5000
for GTDB datasets,-b 25000
for others.
Blastn failed to run as it requires >2000GB RAM for Genbank+RefSeq and AllTheBacteria datasets. Phylign only has the index for AllTheBacteria HQ dataset.
GTDB complete (402,538 genomes):
query | query_len | tool | genome_hits | genome_hits(qcov>50) | time | RAM |
---|---|---|---|---|---|---|
a marker gene | 1,299 bp | LexicMap | 5,170 | 5,143 | 17 s | 1.4 GB |
Blastn | 7,121 | 6,177 | 2,171 s | 351.2 GB | ||
a 16S rRNA gene | 1,542 bp | LexicMap | 303,925 | 278,141 | 235 s | 4.4 GB |
Blastn | 301,197 | 277,042 | 2,353 s | 378.4 GB | ||
a plasmid | 52,830 bp | LexicMap | 63,108 | 1,190 | 499 s | 4.6 GB |
Blastn | 69,311 | 2,308 | 2,262 s | 364.7 GB | ||
1033 AMR genes | 1 kb (median) | LexicMap | 3,867,003 | 2,228,339 | 4,350 s | 16.3 GB |
Blastn | 5,357,772 | 2,240,766 | 4,686 s | 442.1 GB |
AllTheBacteria HQ (1,858,610 genomes):
query | query_len | tool | genome_hits | genome_hits(qcov>50) | time | RAM |
---|---|---|---|---|---|---|
a marker gene | 1,299 bp | LexicMap | 27,963 | 27,953 | 31 s | 3.4 GB |
Phylign_local | 7,936 | 30 m 48 s | 77.6 GB | |||
Phylign_cluster | 7,936 | 28 m 33 s | ||||
a 16S rRNA gene | 1,542 bp | LexicMap | 1,857,761 | 1,740,000 | 9 m 36 s | 14.9 GB |
Phylign_local | 1,017,765 | 130 m 33 s | 77.0 GB | |||
Phylign_cluster | 1,017,765 | 86 m 41 s | ||||
a plasmid | 52,830 bp | LexicMap | 468,821 | 3,618 | 15 m 55 s | 15.7 GB |
Phylign_local | 46,822 | 47 m 33 s | 82.6 GB | |||
Phylign_cluster | 46,822 | 39 m 34 s | ||||
1033 AMR genes | 1 kb (median) | LexicMap | 21,288,000 | 12,148,642 | 138 m 55 s | 49.9 GB |
Phylign_local | 1,135,215 | 156 m 08 s | 85.9 GB | |||
Phylign_cluster | 1,135,215 | 133 m 49 s |
Genbank+RefSeq (2,340,672 genomes):
query | query_len | tool | genome_hits | genome_hits(qcov>50) | time | RAM |
---|---|---|---|---|---|---|
a marker gene | 1,299 bp | LexicMap | 37,164 | 37,082 | 36 s | 4.1 GB |
a 16S rRNA gene | 1,542 bp | LexicMap | 1,949,496 | 1,381,974 | 10 m 41 s | 14.1 GB |
a plasmid | 52,830 bp | LexicMap | 544,619 | 6,563 | 19 m 20 s | 19.3 GB |
1033 AMR genes | 1 kb (median) | LexicMap | 25,702,419 | 14,692,624 | 187 m 40 s | 55.4 GB |
Notes:
- All files are stored on a server with HDD disks. No files are cached in memory.
- Tests are performed in a single cluster node with 48 CPU cores (Intel Xeon Gold 6336Y CPU @ 2.40 GHz).
- Main searching parameters:
- LexicMap v0.4.0:
--threads 48 --top-n-genomes 0 --min-qcov-per-genome 0 --min-qcov-per-hsp 0 --min-match-pident 70
. - Blastn v2.15.0+:
-num_threads 48 -max_target_seqs 10000000
. - Phylign (AllTheBacteria fork 9fc65e6):
threads: 48, cobs_kmer_thres: 0.33, minimap_preset: "asm20", nb_best_hits: 5000000, max_ram_gb: 100
; For cluster, maximum number of slurm jobs is100
.
- LexicMap v0.4.0:
LexicMap is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.
Or install with conda
:
conda install -c bioconda lexicmap
Wei Shen and Zamin Iqbal. (2024) LexicMap: efficient sequence alignment against millions of prokaryotic genomes. bioRxiv. https://doi.org/10.1101/2024.08.30.610459
Please open an issue to report bugs, propose new functions or ask for help.
- High-performance LexicHash computation in Go.
- Wavefront alignment algorithm (WFA) in Golang.