Step 1. Building a database

Note

Terminology differences:

On this page and in the LexicMap command line options, the term “mask” is used, following the terminology in the LexicHash paper.

In the LexicMap manuscript, however, we use “probe” as it is easier to understand. Because these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology.

TL;DR

Prepare input files:
- Sequences of each reference genome should be saved in separate FASTA files, with identifiers (no tab symbols) in the file names. E.g., GCF_000006945.2.fna.gz
  - A regular expression is also available to extract reference id from the file name. E.g., --ref-name-regexp '^(\w{3}_\d{9}\.\d+)' extracts GCF_000006945.2 from GenBank assembly file GCF_000006945.2_ASM694v2_genomic.fna.gz.
  - Even if you forgot to use -N/--ref-name-regexp, lexicmap utils edit-genome-ids (available since v0.8.0) can fix this without re-building the index.
- While if you save a few small (viral) complete genomes (one sequence per genome) in each file, it’s feasible as sequence IDs in search result can help to distinguish target genomes.

Run:

From a directory with multiple genome files:
```
lexicmap index -I genomes/ -O db.lmi
```

From a file list with one file per line:

lexicmap index --skip-file-check -X files.txt -O db.lmi

Input

Note
Genome size
LexicMap is mainly suitable for small genomes like Archaea, Bacteria, Viruses, fungi, and plasmids.

Maximum genome size: 268 Mb (268,435,456). More precisely:
$total_bases + ($num_contigs - 1) * 1000 <= 268,435,456
as we concatenate contigs with 1000-bp intervals of N’s to reduce the sequence scale to index.

Sequences of each reference genome should be saved in separate FASTA files, with identifiers in the file names. While if you save a few small (viral) complete genomes (one sequence per genome) in each file, it’s feasible as sequence IDs in search result can help to distinguish target genomes.

File type: FASTA/Q files, in plain text or gzip/xz/zstd/bzip2/lz4 compressed formats.
File name: “Genome ID” + “File extention”. E.g., GCF_000006945.2.fna.gz.
- Genome ID: they must not contain tab ("\t") symbols, and should be distinct for accurate result interpretation, which will be shown in the search result.
  - A regular expression is also available to extract reference id from the file name. E.g., --ref-name-regexp '^(\w{3}_\d{9}\.\d+)' extracts GCF_000006945.2 from GenBank assembly file GCF_000006945.2_ASM694v2_genomic.fna.gz.
  - If you forgot to use -N/--ref-name-regexp, lexicmap utils edit-genome-ids (available since v0.8.0) can fix this without re-building the index.
File extention: a regular expression set by the flag -r/--file-regexp is used to match input files. The default value supports common sequence file extentions, e.g., .fa, .fasta, .fna, .fa.gz, .fasta.gz, .fna.gz, fasta.xz, fasta.zst, and fasta.bz2.
Sequences:
- Only DNA or RNA sequences are supported.
  - Soft-masked sequences are supported since v0.9.0 with the flag --soft-masking. Lowercase bases in soft-masked low-complexity regions will be treated as A’s and won’t be seeded, while they will be saved for base-level alignment.
- Sequence IDs should be distinct for accurate result interpretation, which will be shown in the search result.
- Sequence description (text behind sequence ID) is not saved. If you do need it, you can create a mapping file (cat files.txt | seqkit seq -n -X - | sed -E 's/\s+/\t/' > id2desc.tsv) and use it to add description in search result.
- One or more sequences (contigs) in each file are allowed.
  - Unwanted sequences (such as plasmids) can be filtered out by regular expressions from the flag -B/--seq-name-filter.
- Genome size limit. Some none-isolate assemblies might have extremely large genomes, e.g., GCA_000765055.1 has >150 Mb. The flag -g/--max-genome (default 15 Mb) is used to skip these input files, and the file list would be written to a file via the flag -G/--big-genomes.
  - Changes since v0.5.0:
    - Genomes with any single contig larger than the threshold will be skipped as before.
    - However, fragmented (with many contigs) genomes with the total bases larger than the threshold will be split into chunks and alignments from these chunks will be merged in lexicmap search.
  - For fungi genomes, please increase the value of -g/--max-genome.
- Minimum sequence length. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value).
At most 17,179,869,184 (2³⁴) genomes are supported. For more genomes, please create a file list and split it into multiple parts, and build an index for each part.

Input files can be given via one of the following ways:

Positional arguments. For a few input files.
A file list via the flag -X/--infile-list with one file per line. It can be STDIN (-), e.g., you can filter a file list and pass it to lexicmap index.
- The flag -S/--skip-file-check is optional for skiping input file checking if you believe these files do exist. Because, by default, LexicMap checks the existence of all input files, which would take tens of minutes for >1M files.
A directory containing input files via the flag -I/--in-dir.
- Multiple-level directories are supported. So you don’t need to saved hundreds of thousand files into one directoy.
- Directory and file symlinks are followed.

Hardware requirements

See benchmark of index building.

LexicMap is designed to provide fast and low-memory sequence alignment against millions of prokaryotic genomes.

CPU:
- No specific requirements on CPU type and instruction sets. Both x86 and ARM chips are supported.
- More is better as LexicMap is a CPU-intensive software. It uses all CPUs by default (-j/--threads).
RAM
- More RAM (> 200 GB for >2 million genomes) is preferred. The memory usage in index building is mainly related to:
  - The number of masks (-m/--masks, default 20,000). Bigger values improve the search sensitivity slightly, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=5,000 is high enough.
  - The number of genomes. Generally, more genomes consume more memory, but we can reduce it by using smaller genome batch sizes (see below).
  - The genome batch size (-b/--batch-size, default 5,000). This is the main parameter to adjust memory usage. Bigger values increase indexing memory occupation.
  - The divergence between genome sequences in each batch. Diverse genomes consume more memory.
  - The maximum seed distance or the maximum sketching desert size (-D/--seed-max-desert, default 100), and the distance of k-mers to fill deserts (-d/--seed-in-desert-dist, default 50). These are the main parameters to adjust search sensitivity. Bigger -D/--seed-max-desert values decrease the search sensitivity, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected.
- If the RAM is not sufficient. Please:
  - Use a smaller genome batch size. It decreases indexing memory occupation and has little affection on searching performance.
Disk
- More is better. LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance. See some examples.
  - Note that the index size is not linear with the number of genomes, it’s sublinear. Because the seed data are compressed with VARINT-GB algorithm, more genomes bring smaller compression rates.
- SSD disks are preferred, while HDD disks are also fast enough.

Algorithm

See the paper for details.

Click to show details. ...

Generating m LexicHash masks.
1. Generate m prefixes.
  1. Generating all permutations of p-bp prefixes that can cover all possible k-mers, p is the biggest value for 4^p <= m (desired number of masks), e.g., p=7 for 20,000 masks. (4⁷ = 16384)
  2. Duplicating these prefixes to m prefixes.
2. For each prefix,
  1. Randomly generating left k-p bases.
  2. If the mask is duplicated, re-generating.
Building an index for each genome batch (-b/--batch-size, default 5,000, max 131,072).
1. For each genome file in a genome batch.
  1. Optionally discarding sequences via regular expression (-B/--seq-name-filter).
  2. Skipping genomes bigger than the value of -g/--max-genome.
  3. Concatenating all sequences, with intervals of 1000-bp N’s.
  4. Capturing the most similar k-mer (in non-gap and non-interval regions) for each mask and recording the k-mer and its location(s) and strand information. Base N is treated as A.
  5. Filling sketching deserts (genome regions longer than --seed-max-desert [default 100] without any captured k-mers/seeds). In a sketching desert, not a single k-mer is captured because there’s another k-mer in another place which shares a longer prefix with the mask. As a result, for a query similar to seqs in this region, all captured k-mers can’t match the correct seeds.
    1. For a desert region (start, end), masking the extended region (start-1000, end+1000) with the masks.
    2. Starting from start, every around --seed-in-desert-dist (default 50) bp, finding a k-mer which is captured by some mask, and adding the k-mer and its position information into the index of that mask.
  6. Saving the concatenated genome sequence (bit-packed, 2 bits for one base, N is treated as A) and genome information (genome ID, size, and lengths of all sequences) into the genome data file, and creating an index file for the genome data file for fast random subsequence extraction.
2. Duplicate and reverse all k-mers, and save each reversed k-mer along with the duplicated position information in the seed data of the closest (sharing the longgest prefix) mask. This is for suffix matching of seeds.
3. Compressing k-mers and the corresponding data (k-mer-data, or seeds data, including genome batch, genome number, location, and strand) into chunks of files, and creating an index file for each k-mer-data file for fast seeding.
4. Writing summary information into info.toml file.
Merging indexes of multiple batches.
1. For each k-mer-data chunk file (belonging to a list of masks), serially reading data of each mask from all batches, merging them and writting to a new file.
2. For genome data files, just moving them.
3. Concatenating genomes.map.bin, which maps each genome ID to its batch ID and index in the batch.
4. Update the index summary file.

Parameters

Note

Query length

LexicMap is mainly designed for sequence alignment with a small number of queries (gene/plasmid/virus/phage sequences) longer than 150 bp by default.

If you want to search some short reads, you need to build the index with small -D/--seed-max-desert (default 100) and -d/--seed-in-desert-dist (default 50), e.g., -D 60 -d 30 for 125bp reads, or -D 50 -D 25 for 100bp reads. It will increase the indexing time and increase the index size. Don’t worry, if you have a small scale of genomes, like < 10,000.

If you just want to search long (>1kb) queries for highly similar (>95%) targets, you can build an index with a bigger -D/--seed-max-desert (default 100) and -d/--seed-in-desert-dist (default 50), e.g., -D 300 -d 150. Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected.

Note that LexicMap is slow for ultra-long (>1Mb) queries, and the alignment might be fragmented.

Flags in bold text are important and frequently used.

General

Flag	Value	Function	Comment
`-j/--threads`	Default: all available CPUs	Number of CPU cores to use.	► If the value is smaller than the number of available CPUs, make sure set the same value to `-c/--chunks`.
`--soft-masking`	Default: false	Support soft-masked sequences	► Lowercase bases in soft-masked low-complexity regions will be treated as A’s and won’t be seeded, while they will be saved for base-level alignment.

Genome batches

Flag	Value	Function	Comment
`-b/--batch-size`	Max: 131072, default: 5000	Maximum number of genomes in each batch	If the number of input files exceeds this number, input files are split into multiple batches and indexes are built for all batches. In the end, seed files are merged, while genome data files are kept unchanged and collected. ■ Bigger values increase indexing memory occupation and increase batch searching speed, while single query searching speed is not affected.

LexicHash mask generation

Flag	Value	Function	Comment
`-M/--mask-file`	A file	File with custom masks	File with custom masks, which could be exported from an existing index or newly generated by “lexicmap utils masks”. This flag oversides `-k/--kmer`, `-m/--masks`, `-s/--rand-seed`, etc.
`-k/--kmer`	Max: 32, default: 31	K-mer size	■ Bigger values improve the search specificity and do not increase the index size.
`-m/--masks`	Default: 20,000	Number of masks	■ Bigger values improve the search sensitivity slightly, increase the index size, and slow down the search speed. For smaller genomes like phages/viruses, m=5,000 is high enough.

Seeds (k-mer-value) data

Flag	Value	Function	Comment
`--seed-max-desert`	Default: 100	Maximum length of distances between seeds	The default value of 100 guarantees queries >=200 bp would match at least two seeds. ► Large regions with no seeds are called sketching deserts. Deserts with seed distance larger than this value will be filled by choosing k-mers roughly every –seed-in-desert-dist (50 by default) bases. ■ Bigger values decrease the search sensitivity for distant targets, speed up the indexing speed, decrease the indexing memory occupation and decrease the index size. While the alignment speed is almost not affected.
`-c/--chunks`	Maximum: 128, default: value of -j/–threads	Number of seed file chunks	Bigger values accelerate the search speed at the cost of a high disk reading load. ► The value should not exceed the maximum number of open files set by the operating systems. ► Make sure the value of `-j/--threads` in `lexicmap search` is >= this value.
`-J/--seed-data-threads`	Maximum: -c/–chunks, default: 8	Number of threads for writing seed data and merging seed chunks from all batches	The actual value is min(–seed-data-threads, max(1, –max-open-files/($batches_1_round + 2))), where $batches_1_round = min(int($input_files / –batch-size), –max-open-files). ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation.
`-p/--partitions`	Default: 4096	Number of partitions for indexing each seed file	Bigger values bring a little higher memory occupation. ► After indexing, `lexicmap utils reindex-seeds` can be used to reindex the seeds data with another value of this flag.
`--max-open-files`	Default: 1024	Maximum number of open files	It’s only used in merging indexes of multiple genome batches. If there are >100 batches, i.e., ($input_files / –batch-size), please increase this value and set a bigger `ulimit -n` in shell.

Also see the usage of lexicmap index.

Notes for indexing with large datasets

If you have hundreds of thousands of input genomes or more, it’s better to control the number of genome batches, which can be calculated via

$num_input_files / --batch-size

E.g, for GenBank prokaryotic genomes: 2,340,672 / 5000 (default) = 468. The number is too big, and it would slow down the seed-data merging step in lexicmap index and candidate sequence extraction in lexicmap search.

Therefore, if you have enough memory, you can set a bigger --batch-size (e.g., 2,340,672 / 25000 = 93.6).

If the batch number is still big (e.g. 300), you can set bigger --max-open-files (e.g., 4096) and -J/--seed-data-threads (e.g., 12. 12 <= 4096/300 = 13.6) to accelerate the merging step. Meanwhile, don’t forget to increase the maximum open files per process via ulimit -n 4096.

If you forgot these setting, you can rerun the merging step for an unfinished index via lexicmap utils remerge (available since v0.5.0, also see FAQ: how to resume the indexing). Other cases to use this command:

Only one thread is used for merging indexes, which happens when there are a lot (>200 batches) of batches ($inpu_files / --batch-size) and the value of --max-open-files is not big enough.
The Slurm/PBS job time limit is almost reached and the merging step won’t be finished before that.
Disk quota is reached in the merging step.

Steps

We use a small dataset for demonstration.

Preparing the test genomes (15 bacterial genomes) in the refs directory.

Note that the genome files contain the assembly accessions (ID) in the file names.

 git clone https://github.com/shenwei356/LexicMap
 cd LexicMap/demo/

 ls refs/
 GCF_000006945.2.fa.gz  GCF_000392875.1.fa.gz  GCF_001096185.1.fa.gz  GCF_002949675.1.fa.gz  GCF_006742205.1.fa.gz
 GCF_000017205.1.fa.gz  GCF_000742135.1.fa.gz  GCF_001457655.1.fa.gz  GCF_002950215.1.fa.gz  GCF_009759685.1.fa.gz
 GCF_000148585.2.fa.gz  GCF_001027105.1.fa.gz  GCF_001544255.1.fa.gz  GCF_003697165.2.fa.gz  GCF_900638025.1.fa.gz

Building an index with genomes from a directory.

 lexicmap index -I refs/ -O demo.lmi

It would take about 5 seconds and 1.5 GB RAM in a 16-CPU PC.

Optionally, we can also use a file list as the input.

 $ head -n 3 files.txt
 refs/GCF_000006945.2.fa.gz
 refs/GCF_000017205.1.fa.gz
 refs/GCF_000148585.2.fa.gz

 lexicmap index -S -X files.txt -O demo.lmi

Click to show the log of a demo run. ...

# here we set a small --batch-size 5
# memusg is used to measure time and memory usage.
#    https://github.com/shenwei356/memusg/

$ memusg -t -s "lexicmap index -I refs/ -O demo.lmi --batch-size 5"
10:02:25.535 [INFO] LexicMap v0.9.0
10:02:25.535 [INFO]   https://github.com/shenwei356/LexicMap
10:02:25.535 [INFO] 
10:02:25.535 [INFO] checking input files ...
10:02:25.535 [INFO]   scanning files from directory: refs/
10:02:25.536 [INFO]   15 input file(s) given
10:02:25.536 [INFO] 
10:02:25.536 [INFO] --------------------- [ main parameters ] ---------------------
10:02:25.536 [INFO] 
10:02:25.536 [INFO] input and output:
10:02:25.536 [INFO]   input directory: refs/
10:02:25.536 [INFO]     regular expression of input files: (?i)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
10:02:25.536 [INFO]     *regular expression for extracting reference name from file name: (?i)(.+)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$
10:02:25.536 [INFO]     *regular expressions for filtering out sequences: []
10:02:25.536 [INFO]   min sequence length: 31
10:02:25.536 [INFO]   max genome size: 15000000
10:02:25.536 [INFO]   output directory: demo.lmi
10:02:25.536 [INFO] 
10:02:25.536 [INFO] mask generation:
10:02:25.536 [INFO]   k-mer size: 31
10:02:25.536 [INFO]   number of masks: 20000
10:02:25.536 [INFO]   rand seed: 1
10:02:25.536 [INFO] 
10:02:25.536 [INFO] seed data:
10:02:25.536 [INFO]   maximum sketching desert length: 100
10:02:25.536 [INFO]   distance of k-mers to fill deserts: 50
10:02:25.536 [INFO]   seeds data chunks: 16
10:02:25.536 [INFO]   seeds data indexing partitions: 4096
10:02:25.536 [INFO] 
10:02:25.536 [INFO] general:
10:02:25.536 [INFO]   genome batch size: 5
10:02:25.536 [INFO]   threads: 16
10:02:25.536 [INFO]   batch merge threads: 8
10:02:25.536 [INFO] 
10:02:25.536 [INFO] 
10:02:25.536 [INFO] --------------------- [ generating masks ] ---------------------
10:02:25.542 [INFO] 
10:02:25.542 [INFO] --------------------- [ building index ] ---------------------
10:02:25.701 [INFO] 
10:02:25.701 [INFO]   ------------------------[ batch 1/3 ]------------------------
10:02:25.701 [INFO]   building index for batch 1 with 5 files...
processed files:  5 / 5 [======================================] ETA: 0s. done
10:02:28.025 [INFO]   writing seeds...
10:02:28.109 [INFO]   finished writing seeds in 83.753228ms
10:02:28.109 [INFO]   finished building index for batch 1 in: 2.408534515s
10:02:28.136 [INFO] 
10:02:28.136 [INFO]   ------------------------[ batch 2/3 ]------------------------
10:02:28.136 [INFO]   building index for batch 2 with 5 files...
processed files:  5 / 5 [======================================] ETA: 0s. done
10:02:30.651 [INFO]   writing seeds...
10:02:30.731 [INFO]   finished writing seeds in 79.893444ms
10:02:30.731 [INFO]   finished building index for batch 2 in: 2.59562701s
10:02:30.760 [INFO] 
10:02:30.760 [INFO]   ------------------------[ batch 3/3 ]------------------------
10:02:30.760 [INFO]   building index for batch 3 with 5 files...
processed files:  5 / 5 [======================================] ETA: 0s. done
10:02:33.087 [INFO]   writing seeds...
10:02:33.162 [INFO]   finished writing seeds in 75.00336ms
10:02:33.162 [INFO]   finished building index for batch 3 in: 2.401368929s
10:02:33.217 [INFO] 
10:02:33.217 [INFO] merging 3 indexes...
10:02:33.217 [INFO]   [round 1]
10:02:33.217 [INFO]     batch 1/1, merging 3 indexes to demo.lmi.tmp/r1_b1 with 8 threads...
10:02:33.367 [INFO]   [round 1] finished in 150.175526ms
10:02:33.367 [INFO] rename demo.lmi.tmp/r1_b1 to demo.lmi
10:02:33.382 [INFO] 
10:02:33.382 [INFO] finished building LexicMap index from 15 files with 20000 masks in 7.855422805s
10:02:33.382 [INFO] LexicMap index saved: demo.lmi
10:02:33.382 [INFO] 
10:02:33.382 [INFO] elapsed time: 7.855461916s
10:02:33.382 [INFO] 

elapsed time: 7.981s
peak rss: 1.3 GB

Output

The LexicMap index is a directory with multiple files.

File structure

$ tree demo.lmi/
demo.lmi/                    # the index directory
├── genomes                  # directory of genome data
│   ├── batch_0000           # genome data of one batch
│   │   ├── genomes.bin      # genome data file, containing genome ID, size, sequence lengths, bit-packed sequences
│   │   └── genomes.bin.idx  # index of genome data file, for fast subsequence extraction
│   └── batch_0001
│   │   ├── genomes.bin
│   │   └── genomes.bin.idx
│   ... ...
├── seeds                    # seed data: pairs of k-mer and its location information (genome batch, genome number, location, strand)
│   ├── chunk_000.bin        # seed data file
│   ├── chunk_000.bin.idx    # index of seed data file, for fast seed searching and data extraction
│   ... ...
│   ├── chunk_015.bin        # the number of chunks is set by flag `-c/--chunks`, default: #cpus
│   └── chunk_015.bin.idx
├── genomes.chunks.bin       # lists of genome chunks which belong to the same genome
├── genomes.map.bin          # mapping genome ID to batch number and genome number in the batch
├── info.toml                # summary of the index
└── masks.bin                # mask data

Index size

LexicMap index size is related to the number of input genomes, the divergence between genome sequences, the number of masks, and the maximum seed distance.

Note that the index size is not linear with the number of genomes, it’s sublinear. Because the seed data are compressed with VARINT-GB algorithm, more genome bring smaller compression rates (smaller is good).

Demo data

# 15 genomes, 54 Mb (54,142,446 bp)

demo.lmi/: 78.36 MiB (82,165,269)
 65.26 MiB      seeds
 12.94 MiB      genomes
156.28 KiB      masks.bin
     600 B      info.toml
     375 B      genomes.map.bin
       0 B      genomes.chunks.bin

GTDB repr

# 85,205 genomes, 274 Gbp (273,848,490,566 bp)

gtdb_repr.lmi: 213.27 GiB (228,999,914,466)
146.49 GiB      seeds
 66.78 GiB      genomes
  2.03 MiB      genomes.map.bin
156.28 KiB      masks.bin
     613 B      info.toml
      48 B      genomes.chunks.bin

GTDB complete

# 402,538 genomes, 1.5 Tbp (1,501,864,915,560 bp)

gtdb_complete.lmi: 905.34 GiB (972,098,200,328)
542.34 GiB      seeds
362.99 GiB      genomes
  9.60 MiB      genomes.map.bin
156.28 KiB      masks.bin
     616 B      info.toml
     168 B      genomes.chunks.bin

GenBank+RefSeq

# 2,340,672 genomes, 9.2 Tbp (9,192,651,650,196 bp)
     
genbank_refseq.lmi: 4.96 TiB (5,454,659,703,138)
  2.79 TiB      seeds
  2.17 TiB      genomes
 55.81 MiB      genomes.map.bin
156.28 KiB      masks.bin
  3.59 KiB      genomes.chunks.bin
     619 B      info.toml

AllTheBacteria HQ

# 1,858,610 genomes, 7.5 Tbp (7,493,622,021,123 bp)

atb_hq.lmi: 3.91 TiB (4,304,515,140,156)
  2.15 TiB      seeds
  1.77 TiB      genomes
 39.22 MiB      genomes.map.bin
156.28 KiB      masks.bin
     619 B      info.toml
      24 B      genomes.chunks.bin

Directory/file sizes are counted with https://github.com/shenwei356/dirsize v1.2.1 (dirsize $file, base: 1024).
Index building parameters: -k 31 -m 20000 -D 100 -d 50. Genome batch size: -b 25000 for GenBank+RefSeq and AllTheBacteria datasets, -b 5000 (default) for others.

Explore the index

We provide several commands to explore the index data and extract indexed subsequences:

lexicmap utils genomes can list genome IDs of indexed genomes, see the usage and example.
lexicmap utils masks can list masks of the index, see the usage and example.
lexicmap utils kmers can list details of all seeds (k-mers), including reference, location(s), the strand, and the k-mer direction. see the usage and example.
lexicmap utils seed-pos can help to explore the seed positions, see the usage and example. Before that, the flag --save-seed-pos needs to be added to lexicmap index.
lexicmap utils subseq can extract subsequence via 1) reference name, sequence ID, position and strand, or 2) search result, see the usage and example.

Index format changelog

Index version information is available in the info.toml file of each LexicMap index. LexicMap search and other utility commands check compatibility via the main version.

Index version	LexicMap version	Supported LexicMap versions	Date	Changes
3.4	0.7.0	0.6.0 +	2025-04-11	Fix filling the seed desert region behind the last seed of a genome.
3.3	0.6.0	0.6.0 +	2025-03-25	Reduce index size for batches <= 512. Add the total bases of index to info.toml for computing the Evalue. Denser seeds.
3.1	0.5.0	0.4.0 +	2024-12-18	Change the default partitions of seed data index.
3.0	0.4.0	0.4.0 +	2024-08-15	Support suffix matching of seeds. Better seed desert filling for highly-repetitive regions. Denser seeds.
1.1	0.3.0	0.3.0	2024-05-14	Change the format of seed data index. Use longer contig intervals.
0.1	0.1.0	0.1.0 - 0.2.0	2024-01-25	First version.

What’s next: Searching