Step 1. Building a database
NoteTerminology differences:
- On this page and in the LexicMap command line options, the term “mask” is used, following the terminology in the LexicHash paper.
- In the LexicMap manuscript, however, we use “probe” as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology.
- Prepare input files:
- Sequences of each reference genome should be saved in separate FASTA files, with identifiers (no tab symbols) in the file names.
E.g., GCF_000006945.2.fna.gz
- A regular expression is also available to extract reference id from the file name.
E.g.,
--ref-name-regexp '^(\w{3}_\d{9}\.\d+)'
extractsGCF_000006945.2
from GenBank assembly fileGCF_000006945.2_ASM694v2_genomic.fna.gz
- A regular expression is also available to extract reference id from the file name.
E.g.,
- While if you save a few small (viral) complete genomes (one sequence per genome) in each file, it’s feasible as sequence IDs in search result can help to distinguish target genomes.
- Sequences of each reference genome should be saved in separate FASTA files, with identifiers (no tab symbols) in the file names.
E.g., GCF_000006945.2.fna.gz
- Run:
-
From a directory with multiple genome files:
lexicmap index -I genomes/ -O db.lmi
-
From a file list with one file per line:
lexicmap index --skip-file-check -X files.txt -O db.lmi
-
NoteGenome size
LexicMap is mainly suitable for small genomes like Archaea, Bacteria, Viruses, fungi, and plasmids.Maximum genome size: 268 Mb (268,435,456). More precisely:
$total_bases + ($num_contigs - 1) * 1000 <= 268,435,456
as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.
Sequences of each reference genome should be saved in separate FASTA files, with identifiers in the file names. While if you save a few small (viral) complete genomes (one sequence per genome) in each file, it’s feasible as sequence IDs in search result can help to distinguish target genomes.
- File type: FASTA/Q files, in plain text or gzip/xz/zstd/bzip2 compressed formats.
- File name: “Genome ID” + “File extention”. E.g.,
GCF_000006945.2.fna.gz
.- Genome ID: they must not contain tab ("\t") symbols, and should be distinct for accurate result interpretation, which will be shown in the search result.
- A regular expression is also available to extract reference id from the file name.
E.g.,
--ref-name-regexp '^(\w{3}_\d{9}\.\d+)'
extractsGCF_000006945.2
from GenBank assembly fileGCF_000006945.2_ASM694v2_genomic.fna.gz
- A regular expression is also available to extract reference id from the file name.
E.g.,
- Genome ID: they must not contain tab ("\t") symbols, and should be distinct for accurate result interpretation, which will be shown in the search result.
- File extention: a regular expression set by the flag
-r/--file-regexp
is used to match input files. The default value supports common sequence file extentions, e.g.,.fa
,.fasta
,.fna
,.fa.gz
,.fasta.gz
,.fna.gz
,fasta.xz
,fasta.zst
, andfasta.bz2
. - Sequences:
- Only DNA or RNA sequences are supported.
- Sequence IDs should be distinct for accurate result interpretation, which will be shown in the search result.
- Sequence description (text behind sequence ID) is not saved. If you do need it, you can create a mapping file
(
seqkit seq -n ref.fa.gz | sed -E 's/\s+/\t/' > id2desc.tsv
) and use it to add description in search result. - One or more sequences (contigs) in each file are allowed.
- Unwanted sequences can be filtered out by regular expressions from the flag
-B/--seq-name-filter
.
- Unwanted sequences can be filtered out by regular expressions from the flag
- Genome size limit. Some none-isolate assemblies might have extremely large genomes, e.g., GCA_000765055.1 has >150 Mb.
The flag
-g/--max-genome
(default 15 Mb) is used to skip these input files, and the file list would be written to a file via the flag-G/--big-genomes
.- Changes since v0.5.0:
- Genomes with any single contig larger than the threshold will be skipped as before.
- However, fragmented (with many contigs) genomes with the total bases larger than the threshold will
be split into chunks and alignments from these chunks will be merged in
lexicmap search
.
- For fungi genomes, please increase the value of
-g/--max-genome
.
- Changes since v0.5.0:
- Minimum sequence length. A flag
-l/--min-seq-len
can filter out sequences shorter than the threshold (default is thek
value).
- At most 17,179,869,184 (234) genomes are supported. For more genomes, please create a file list and split it into multiple parts, and build an index for each part.
Input files can be given via one of the following ways:
- Positional arguments. For a few input files.
- A file list via the flag
-X/--infile-list
with one file per line. It can be STDIN (-
), e.g., you can filter a file list and pass it tolexicmap index
.- The flag
-S/--skip-file-check
is optional for skiping input file checking if you believe these files do exist. Because, by default, LexicMap checks the existence of all input files, which would take tens of minutes for >1M files.
- The flag
- A directory containing input files via the flag
-I/--in-dir
.- Multiple-level directories are supported. So you don’t need to saved hundreds of thousand files into one directoy.
- Directory and file symlinks are followed.
See benchmark of index building.
LexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.
- CPU:
- No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported.
- More is better as LexicMap is a CPU-intensive software. It uses all CPUs by default (
-j/--threads
).
- RAM
- More RAM (> 200 GB) is preferred. The memory usage in index building is mainly related to:
- The number of masks (
-m/--masks
, default 20,000). Bigger values improve the search sensitivity slightly, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=5,000 is high enough. - The number of genomes. Generally, more genomes consume more memory, but we can control the genome batch size.
- The genome batch size (
-b/--batch-size
, default 5,000). This is the main parameter to adjust memory usage. Bigger values increase indexing memory occupation. - The divergence between genome sequences in each batch. Diverse genomes consume more memory.
- The maximum seed distance or the maximum sketching desert size (
-D/--seed-max-desert
, default 100), and the distance of k-mers to fill deserts (-d/--seed-in-desert-dist
, default 50). These are the main parameters to adjust search sensitivity. Bigger-D/--seed-max-desert
values decrease the search sensitivity, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected.
- The number of masks (
- If the RAM is not sufficient. Please:
- Use a smaller genome batch size. It decreases indexing memory occupation and has little affection on searching performance.
- More RAM (> 200 GB) is preferred. The memory usage in index building is mainly related to:
- Disk
- More is better. LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance. See some examples.
- Note that the index size is not linear with the number of genomes, it’s sublinear. Because the seed data are compressed with VARINT-GB algorithm, more genomes bring smaller compression rates.
- SSD disks are preferred, while HDD disks are also fast enough.
- More is better. LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance. See some examples.
NoteQuery length
LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 100 bp by default. However, short queries can also be aligned.If you just want to search long (>1kb) queries for highly similar (>95%) targets, you can build an index with a bigger
-D/--seed-max-desert
(default 50) and-d/--seed-in-desert-dist
(default 50), e.g.,--seed-max-desert 300 --seed-in-desert-dist 150
Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected.
Flags in bold text are important and frequently used.
Flag | Value | Function | Comment |
---|---|---|---|
-j/--threads |
Default: all available CPUs | Number of CPU cores to use. | ► If the value is smaller than the number of available CPUs, make sure set the same value to -c/--chunks . |
Flag | Value | Function | Comment |
---|---|---|---|
-b/--batch-size |
Max: 131072, default: 5000 | Maximum number of genomes in each batch | If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected. |
Flag | Value | Function | Comment |
---|---|---|---|
-M/--mask-file |
A file | File with custom masks | File with custom masks, which could be exported from an existing index or newly generated by “lexicmap utils masks”. This flag oversides -k/--kmer , -m/--masks , -s/--rand-seed , etc. |
-k/--kmer |
Max: 32, default: 31 | K-mer size | ■ Bigger values improve the search specificity and do not increase the index size. |
-m/--masks |
Default: 20,000 | Number of masks | ■ Bigger values improve the search sensitivity slightly, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=5,000 is high enough. |
Flag | Value | Function | Comment |
---|---|---|---|
--seed-max-desert |
Default: 100 | Maximum length of distances between seeds | The default value of 100 guarantees queries >=200 bp would match at least two seeds. ► Large regions with no seeds are called sketching deserts. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every –seed-in-desert-dist (50 by default) bases. ■ Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected. |
-c/--chunks |
Maximum: 128, default: value of -j/–threads | Number of seed file chunks | Bigger values accelerate the search speed at the cost of a high disk reading load. ► The value should not exceed the maximum number of open files set by the operating systems. ► Make sure the value of -j/--threads in lexicmap search is >= this value. |
-J/--seed-data-threads |
Maximum: -c/–chunks, default: 8 | Number of threads for writing seed data and merging seed chunks from all batches | The actual value is min(–seed-data-threads, max(1, –max-open-files/($batches_1_round + 2))), where $batches_1_round = min(int($input_files / –batch-size), –max-open-files). ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation. |
-p/--partitions |
Default: 4096 | Number of partitions for indexing each seed file | Bigger values bring a little higher memory occupation. ► After indexing, lexicmap utils reindex-seeds can be used to reindex the seeds data with another value of this flag. |
--max-open-files |
Default: 1024 | Maximum number of open files | It’s only used in merging indexes of multiple genome batches. If there are >100 batches, i.e., ($input_files / –batch-size), please increase this value and set a bigger ulimit -n in shell. |
Also see the usage of lexicmap index
.
If you have hundreds of thousands of input genomes or more, it’s better to control the number of genome batches, which can be calculated via
$num_input_files / --batch-size
E.g, for GenBank prokaryotic genomes: 2,340,672 / 5000 (default) = 468.
The number is too big, and it would slow down the seed-data merging step in lexicmap index
and candidate sequence extraction in lexicmap search
.
Therefore, if you have enough memory, you can set a bigger --batch-size
(e.g., 2,340,672 / 25000 = 93.6).
If the batch number is still big (e.g. 300), you can set bigger --max-open-files
(e.g., 4096
) and -J/--seed-data-threads
(e.g., 12
. 12 <= 4096/300 = 13.6)
to accelerate the merging step. Meanwhile, don’t forget to increase the maximum open files per process via ulimit -n 4096
.
If you forgot these setting, you can rerun the merging step for an unfinished index via lexicmap utils remerge (available since v0.5.0, also see FAQ: how to resume the indexing). Other cases to use this command:
- Only one thread is used for merging indexes, which happens when there are
a lot (>200 batches) of batches (
$inpu_files / --batch-size
) and the value of--max-open-files
is not big enough. - The Slurm/PBS job time limit is almost reached and the merging step won’t be finished before that.
- Disk quota is reached in the merging step.
We use a small dataset for demonstration.
-
Preparing the test genomes (15 bacterial genomes) in the
refs
directory.Note that the genome files contain the assembly accessions (ID) in the file names.
git clone https://github.com/shenwei356/LexicMap cd LexicMap/demo/ ls refs/ GCF_000006945.2.fa.gz GCF_000392875.1.fa.gz GCF_001096185.1.fa.gz GCF_002949675.1.fa.gz GCF_006742205.1.fa.gz GCF_000017205.1.fa.gz GCF_000742135.1.fa.gz GCF_001457655.1.fa.gz GCF_002950215.1.fa.gz GCF_009759685.1.fa.gz GCF_000148585.2.fa.gz GCF_001027105.1.fa.gz GCF_001544255.1.fa.gz GCF_003697165.2.fa.gz GCF_900638025.1.fa.gz
-
Building an index with genomes from a directory.
lexicmap index -I refs/ -O demo.lmi
It would take about 6 seconds and 3 GB RAM in a 16-CPU PC.
Optionally, we can also use a file list as the input.
$ head -n 3 files.txt refs/GCF_000006945.2.fa.gz refs/GCF_000017205.1.fa.gz refs/GCF_000148585.2.fa.gz lexicmap index -S -X files.txt -O demo.lmi
The LexicMap index is a directory with multiple files.
$ tree demo.lmi/
demo.lmi/ # the index directory
├── genomes # directory of genome data
│ ├── batch_0000 # genome data of one batch
│ │ ├── genomes.bin # genome data file, containing genome ID, size, sequence lengths, bit-packed sequences
│ │ └── genomes.bin.idx # index of genome data file, for fast subsequence extraction
│ └── batch_0001
│ │ ├── genomes.bin
│ │ └── genomes.bin.idx
│ ... ...
├── seeds # seed data: pairs of k-mer and its location information (genome batch, genome number, location, strand)
│ ├── chunk_000.bin # seed data file
│ ├── chunk_000.bin.idx # index of seed data file, for fast seed searching and data extraction
│ ... ...
│ ├── chunk_015.bin # the number of chunks is set by flag `-c/--chunks`, default: #cpus
│ └── chunk_015.bin.idx
├── genomes.chunks.bin # lists of genome chunks which belong to the same genome
├── genomes.map.bin # mapping genome ID to batch number and genome number in the batch
├── info.toml # summary of the index
└── masks.bin # mask data
LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance.
Note that the index size is not linear with the number of genomes, it’s sublinear. Because the seed data are compressed with VARINT-GB algorithm, more genome bring smaller compression rates (smaller is good).
# 15 genomes, 54 Mb (54,142,446 bp)
demo.lmi/: 78.36 MiB (82,165,269)
65.26 MiB seeds
12.94 MiB genomes
156.28 KiB masks.bin
600 B info.toml
375 B genomes.map.bin
0 B genomes.chunks.bin
# 85,205 genomes, 274 Gbp (273,848,490,566 bp)
gtdb_repr.lmi: 213.27 GiB (228,999,914,466)
146.49 GiB seeds
66.78 GiB genomes
2.03 MiB genomes.map.bin
156.28 KiB masks.bin
613 B info.toml
48 B genomes.chunks.bin
# 402,538 genomes, 1.5 Tbp (1,501,864,915,560 bp)
gtdb_complete.lmi: 905.34 GiB (972,098,200,328)
542.34 GiB seeds
362.99 GiB genomes
9.60 MiB genomes.map.bin
156.28 KiB masks.bin
616 B info.toml
168 B genomes.chunks.bin
# 2,340,672 genomes, 9.2 Tbp (9,192,651,650,196 bp)
genbank_refseq.lmi: 4.96 TiB (5,454,659,703,138)
2.79 TiB seeds
2.17 TiB genomes
55.81 MiB genomes.map.bin
156.28 KiB masks.bin
3.59 KiB genomes.chunks.bin
619 B info.toml
# 1,858,610 genomes, 7.5 Tbp (7,493,622,021,123 bp)
atb_hq.lmi: 3.91 TiB (4,304,515,140,156)
2.15 TiB seeds
1.77 TiB genomes
39.22 MiB genomes.map.bin
156.28 KiB masks.bin
619 B info.toml
24 B genomes.chunks.bin
- Directory/file sizes are counted with https://github.com/shenwei356/dirsize v1.2.1 (
dirsize $file
, base: 1024). - Index building parameters:
-k 31 -m 20000 -D 100 -d 50
. Genome batch size:-b 25000
for GenBank+RefSeq and AllTheBacteria datasets,-b 5000
(default) for others.
We provide several commands to explore the index data and extract indexed subsequences:
lexicmap utils genomes
can list genome IDs of indexed genomes, see the usage and example.lexicmap utils masks
can list masks of the index, see the usage and example.lexicmap utils kmers
can list details of all seeds (k-mers), including reference, location(s), the strand, and the k-mer direction. see the usage and example.lexicmap utils seed-pos
can help to explore the seed positions, see the usage and example. Before that, the flag--save-seed-pos
needs to be added tolexicmap index
.lexicmap utils subseq
can extract subsequences via genome ID, sequence ID and positions, see the usage and example.
Index version information is available in the info.toml
file of each LexicMap index.
LexicMap search and other utility commands check compatibility via the main version.
Index version | LexicMap version | Supported LexicMap versions | Date | Changes |
---|---|---|---|---|
3.4 | 0.7.0 | 0.6.0 + | 2025-04-11 | Fix filling the seed desert region behind the last seed of a genome. |
3.3 | 0.6.0 | 0.6.0 + | 2025-03-25 | Reduce index size for batches <= 512. Add the total bases of index to info.toml for computing the Evalue. Denser seeds. |
3.1 | 0.5.0 | 0.4.0 + | 2024-12-18 | Change the default partitions of seed data index. |
3.0 | 0.4.0 | 0.4.0 + | 2024-08-15 | Support suffix matching of seeds. Better seed desert filling for highly-repetitive regions. Denser seeds. |
1.1 | 0.3.0 | 0.3.0 | 2024-05-14 | Change the format of seed data index. Use longer contig intervals. |
0.1 | 0.1.0 | 0.1.0 - 0.2.0 | 2024-01-25 | First version. |
What’s next: