Releases
Note
- Please run lexicmap version to check for updates!
- Please run lexicmap autocompletion to update the shell autocompletion script!
No changes to the index format (see Index format changelog).
- New commands:
  - lexicmap utils merge-search-results: Merge a query’s search results from multiple indexes.
  - lexicmap utils edit-genome-ids: Edit genome IDs in the index via a regular expression. This is helpful when users forgot to use the flag -N/--ref-name-regexp to extract the genome ID from the sequence file during indexing. This command helps to fix that without rebuilding the index.
lexicmap index:
- Significantly reduce memory usage (by up to 25%) in the merge step. Memory usage is also reduced somewhat for huge data, such as long reads or contigs in the Logan project.
lexicmap search:
- Reduce memory usage, particularly for batch searching (by up to 50%).
- Improve search speed, mainly for batch searching.
- Support limiting the search by TaxId(s) via -t/--taxids or --taxid-file. Only genomes whose TaxIds are the specified ones or their descendants are searched, similar to BLAST+ 2.15.0 or later versions. Negative values are allowed as a blacklist. For example, -t 543,-561 searches the non-Escherichia (561) genera of the Enterobacteriaceae (543) family. Users only need to provide NCBI-format taxdump files (-T/--taxdump; these can also be created from any taxonomy data with TaxonKit) and a genome-ID-to-TaxId mapping file (-G/--genome2taxid). There’s no need to rebuild the index.
- Check if the output file and the log file are the same.
- Reduce the time of seed matching when using -w.
- Change the default value of --max-query-conc from 12 to 8.
- New flag --gc-interval (default 64, 0 to disable) for forcing garbage collection every N queries. This decreases memory usage substantially.
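The TaxId filter semantics described for -t/--taxids (descendants are included, negative values act as a blacklist) can be illustrated with a small Python sketch. This is not LexicMap's implementation: the PARENT table is a simplified toy stand-in for taxdump data, the function names and genome IDs are hypothetical, and treating a list with only negative values as "everything except" is an assumption.

```python
# Toy child -> parent table standing in for NCBI taxdump data.
# Simplified for illustration: intermediate ranks are omitted.
PARENT = {
    562: 561,  # Escherichia coli -> Escherichia
    561: 543,  # Escherichia -> Enterobacteriaceae
    573: 570,  # Klebsiella pneumoniae -> Klebsiella
    570: 543,  # Klebsiella -> Enterobacteriaceae
    543: 1,    # Enterobacteriaceae -> root
}


def lineage(taxid):
    """Return the TaxId itself followed by all of its ancestors."""
    path = [taxid]
    while taxid in PARENT:
        taxid = PARENT[taxid]
        path.append(taxid)
    return path


def is_selected(taxid, taxids):
    """Positive values form an allow list, negative values a blacklist."""
    allowed = {t for t in taxids if t > 0}
    blocked = {-t for t in taxids if t < 0}
    ancestors = set(lineage(taxid))
    if ancestors & blocked:
        return False
    # Assumption: with no positive values, everything not blocked is kept.
    return not allowed or bool(ancestors & allowed)


# Hypothetical genome-ID-to-TaxId mapping (what -G/--genome2taxid provides).
genome2taxid = {"genome_A": 562, "genome_B": 573, "genome_C": 543}

# Equivalent in spirit to "-t 543,-561".
selected = [g for g, t in genome2taxid.items() if is_selected(t, [543, -561])]
print(selected)
```

In this toy data, the Escherichia coli genome is excluded because 561 is in its lineage, while the other two genomes fall under 543 and are kept.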
lexicmap utils subseq:
- Accept the output file of lexicmap search as the input, so one can extract matched sequences (including flanking regions) from the index after alignment with lexicmap search, with or without the flag -a/--all.
- Support extending aligned regions with -U/--upstream and/or -D/--downstream.
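As a rough illustration of extending an aligned region with flanking bases, the following Python sketch widens a region and clamps it to the sequence bounds. It assumes 1-based inclusive coordinates on the forward strand and ignores strand handling; the function name and parameters are hypothetical and do not describe how lexicmap utils subseq is implemented.

```python
def extend_region(start, end, seq_len, upstream=0, downstream=0):
    """Extend a 1-based inclusive region and clamp it to [1, seq_len]."""
    if not 1 <= start <= end:
        raise ValueError("expected 1 <= start <= end")
    new_start = max(1, start - upstream)      # do not run past the first base
    new_end = min(seq_len, end + downstream)  # do not run past the last base
    return new_start, new_end


seq = "ACGTACGTACGTACGTACGT"  # toy 20-bp target sequence
s, e = extend_region(8, 12, len(seq), upstream=10, downstream=3)
print(s, e, seq[s - 1:e])
```

Clamping matters near contig ends, where the requested flank can be longer than the sequence that is actually available.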
v0.7.0 - 2025-04-10
Please rebuild the index, as some seeds in the genome end regions were missed during computation.
lexicmap index:
- Fix a minor bug in seed desert filling: the region (a few hundred bases) behind the last seed was not filled.
lexicmap search:
- Improve seed chaining: more accurate for complex anchors.
- Improve pseudoalignment in repetitive regions.
- Change the default value of --seed-max-gap from 200 to 50.
v0.6.1 - 2025-03-31
lexicmap search:
- Fix the program hang in debug mode when no chaining result is returned.
lexicmap version:
- Do not show the commit hash by default.
v0.6.0 - 2025-03-25
This version is compatible with indexes created by previous versions (requires a one-time, automatic preprocessing), but rebuilding the index is recommended for more accurate results on short queries (<500bp). However, indexes created by this version are not compatible with previous versions when the number of batches is <= 512.
lexicmap index:
- Change default option values to bring higher sensitivity for short (<=500 bp, especially <=250 bp) queries, faster indexing speed, and faster seed-matching speed, at the cost of a slightly larger index.
  - -m/--masks: 40,000 -> 20,000. 40k is unnecessary, especially for small genomes, where seeds would be very crowded, with a large proportion of seed distances falling between 0 and 50 bp.
  - -D/--seed-max-desert: 200 -> 100. This provides a smaller seed window guarantee.
- Reduce index size by using 3 bytes rather than 4 for saving seed data when the number of batches is <= 512, which requires only 9 (17 minus 8) bits to store the batch index. We also recommend controlling the number of batches for better performance.
- Fix seed desert filling near gap regions.
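The arithmetic behind the 512-batch threshold can be checked with a short Python sketch. It only models the batch-index field: the 17-bit and 9-bit widths come from the note above, while zero-based batch indexes and the function name are assumptions, and the layout of the remaining bits in the seed record is not modelled.

```python
BATCH_BITS_4_BYTES = 17                      # batch-index bits in the 4-byte form
BATCH_BITS_3_BYTES = BATCH_BITS_4_BYTES - 8  # dropping one byte leaves 9 bits


def seed_record_bytes(n_batches):
    """Pick the seed record width (in bytes) from the number of batches."""
    if n_batches < 1:
        raise ValueError("n_batches must be positive")
    # Assumption: batch indexes run from 0 to n_batches - 1.
    bits_needed = max(1, (n_batches - 1).bit_length())
    return 3 if bits_needed <= BATCH_BITS_3_BYTES else 4


for n in (1, 512, 513):
    print(n, seed_record_bytes(n))
```

With 9 bits, 2^9 = 512 distinct batch indexes fit, which is why the smaller record applies only up to 512 batches.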
lexicmap search:
- Improve pseudoalignment to produce longer alignment regions.
- Add 3 extra columns (cls, evalue, and bitscore) and a new option -e/--max-evalue.
- Reduce memory usage.
- Remove the flag --pseudo-align.
- Add a progress bar for --debug.
lexicmap utils seed-pos:
- Change default option values of the sliding window.
v0.5.0 - 2024-12-18
This version is compatible with indexes created by LexicMap v0.4.0, but rebuilding the index is recommended for more accurate results.
- New commands:
  - lexicmap utils remerge: Rerun the merging step for an unfinished index.
lexicmap index:
- Big genomes with thousands of contigs (big yet fragmented assemblies) are automatically split into multiple chunks, and alignments from these chunks will be merged.
- Change the default value of --partitions from 1024 to 4096, which increases the seed-matching speed at the cost of 2 GiB more memory occupation. For existing LexicMap indexes, just run lexicmap utils reindex-seeds --partitions 4096 to re-create seed indexes.
- Do not save low-complexity seeds.
- Fix high memory usage in writing seed data.
- Change the default value of -c/--chunks from all available CPUs to the value of -j/--threads.
- Change the default value of --max-open-files from 512 to 1024.
- Add a new flag --debug.
lexicmap search:
- Improve chaining, pseudoalignment, and alignment for highly repetitive sequences.
- Compute a more accurate chaining score with better chaining of overlapped anchors, which produces more accurate results with -n/--top-n-genomes:
  - Merge two overlapped non-gapped anchors into a longer one.
  - For those with gaps, only the non-overlapped part of the second anchor is used to compute the weight.
  - Use the score of the best chain (rather than the sum) for sorting genomes when using -n.
- Fix positions and alignment texts for queries with highly repetitive sequences in end regions. #9
- Skip low-complexity seeds.
- Change the default value of --max-open-files from 512 to 1024.
- Change the default value of --align-band from 50 to 100.
- Improve the speed of anchor deduplication, genome information extraction, and result ordering.
- Improve the speed of chaining for long queries.
- Improve the speed of seed matching when using -w/--load-whole-seeds.
- Improve the speed of alignment, and reduce the memory usage.
- Remain compatible after the change of lexicmap index.
- Add a new flag --debug.
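Two of the chaining changes listed for lexicmap search in this release (merging overlapped non-gapped anchors, and sorting genomes by the best chain rather than the sum) can be sketched in Python as follows. This is a toy model under stated assumptions, not LexicMap's code: the Anchor type, the function names, and the scores are hypothetical, "non-gapped" is interpreted as lying on the same diagonal, and the weighting of gapped overlaps is not modelled.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    qstart: int  # 0-based start in the query
    tstart: int  # 0-based start in the target
    length: int


def merge_ungapped(a, b):
    """Merge two overlapping anchors on the same diagonal; None otherwise."""
    if a.qstart > b.qstart:
        a, b = b, a
    same_diagonal = (a.tstart - a.qstart) == (b.tstart - b.qstart)
    overlaps = b.qstart < a.qstart + a.length
    if not (same_diagonal and overlaps):
        return None
    end = max(a.qstart + a.length, b.qstart + b.length)
    return Anchor(a.qstart, a.tstart, end - a.qstart)


def rank_genomes(chain_scores, top_n):
    """Sort genomes by their best chain score instead of the sum of all chains."""
    best = {g: max(scores) for g, scores in chain_scores.items() if scores}
    return sorted(best, key=lambda g: (-best[g], g))[:top_n]


merged = merge_ungapped(Anchor(10, 110, 31), Anchor(25, 125, 31))
print(merged)

# Hypothetical chain scores per genome.
scores = {"g1": [40, 40, 40], "g2": [90], "g3": [60, 10]}
print(rank_genomes(scores, top_n=2))
```

In the toy scores, g1 has the largest sum (120) from three weak chains, yet ranking by the best chain puts g2 and g3 ahead of it.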
lexicmap utils genomes:
- Do not sort genome IDs.
- Add a header line and another column to show whether the reference genome is chunked.
lexicmap utils subseq:
- Remain compatible after the change of lexicmap index.
lexicmap utils seed-pos:
- Remain compatible after the change of lexicmap index, while histograms are plotted separately for multiple genome chunks.
lexicmap utils reindex-seeds:
- Change the default value of --partitions from 1024 to 4096.
v0.4.0 - 2024-08-15
- New commands:
  - lexicmap utils 2blast: Convert the default search output to blast-style format.
lexicmap index:
- Support suffix matching of seeds. Seeds are now immune to any single SNP, at the cost of doubled seed data.
- Better sketching desert filling for highly repetitive regions.
- Change the default value of --seed-max-desert from 900 to 200 to increase alignment sensitivity.
- Mask gap regions (N’s).
- Fix skipping interval regions by further including the last k-1 bases of contigs.
- Fix a bug in indexing small genomes.
- Change the default value of -b/--batch-size from 10,000 to 5,000.
- Improve the lexichash data structure.
- Write and merge seed data in parallel; new flag -J/--seed-data-threads.
- Improve the log.
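Why suffix matching makes seeds robust to a single SNP can be seen in a toy Python example. A substitution at position i of a k-mer leaves i matching bases before it and k-1-i after it, so matching from both ends always keeps at least k // 2 bases, whereas prefix-only matching can drop to zero. The sketch assumes plain string reversal for the suffix case and compares each mutated k-mer against the unmutated one; the helper names and the 31-mer are illustrative and this is not LexicMap's implementation.

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_match_len(kmer, probe):
    """Longest match from either end: plain prefix, or prefix of the reversed strings."""
    prefix = common_prefix_len(kmer, probe)
    suffix = common_prefix_len(kmer[::-1], probe[::-1])
    return max(prefix, suffix)


def mutate(kmer, pos):
    """Introduce a substitution at pos (a toy SNP)."""
    alt = "A" if kmer[pos] != "A" else "C"
    return kmer[:pos] + alt + kmer[pos + 1:]


kmer = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATG"  # toy k-mer
k = len(kmer)

# Worst case over every possible SNP position.
worst_prefix_only = min(common_prefix_len(mutate(kmer, i), kmer) for i in range(k))
worst_both_ends = min(best_match_len(mutate(kmer, i), kmer) for i in range(k))
print(k, worst_prefix_only, worst_both_ends)
```

Storing the reversed k-mers as well is what doubles the seed data, as noted in the changelog entry.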
lexicmap search:
- Fix chaining for highly repetitive regions.
- Perform more accurate alignment with WFA.
- Use a buffered reader for seeds file reading.
- Fix object recycling and reduce memory usage.
- Fix alignment against genomes with many short contigs.
- Fix early quit when meeting a sequence shorter than k.
- Add a new option -J/--max-query-conc to limit the maximum number of concurrent queries, with a default value of 12 instead of the number of CPUs, which reduces the memory usage in batch searching.
- Result format:
  - Cluster alignments of each target sequence.
  - Remove the column seeds.
  - Add columns gaps, cigar, and align, which can be reformatted with lexicmap utils 2blast.
lexicmap utils kmers:
- Fix the progress bar.
- Fix a bug where some masks do not have any k-mer.
- Add a new column prefix to show the length of the common prefix between the seed and the probe.
- Add a new column reversed to indicate if the k-mer is reversed for suffix matching.
lexicmap utils masks:
- Add support for outputting only a specific mask.
lexicmap utils seed-pos:
- New columns: sseqid and pos_seq.
- More accurate seed distance.
- Add histograms of the number of seeds in sliding windows.
lexicmap utils subseq:
- Fix a bug when the given end position is larger than the sequence length.
- Add the strand ("+" or "-") in the sequence header.
v0.3.0 - 2024-05-14
lexicmap index:
- Better seed coverage by filling sketching deserts.
- Use longer intervals (1000 bp of N’s; previously k-1) between contigs.
- Fix a concurrency bug between genome data writing and k-mer-value data collecting.
- Change the format of the k-mer-value index file, and fix the computation of index partitions.
- Optionally save seed positions, which can be output by lexicmap utils seed-pos.
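The idea of separating contigs with 1000-bp runs of N's can be illustrated with a small Python sketch that joins contigs and maps a position in the joined sequence back to a contig. It assumes contigs are concatenated into a single sequence with one spacer between neighbours; the function names and the toy contigs are hypothetical, and the actual genome storage in LexicMap may differ.

```python
from bisect import bisect_right

SPACER = "N" * 1000  # interval length stated in the changelog entry


def concatenate(contigs):
    """Join contigs with N spacers; return the joined sequence and contig start offsets."""
    starts, parts, offset = [], [], 0
    for i, seq in enumerate(contigs):
        if i > 0:
            parts.append(SPACER)
            offset += len(SPACER)
        starts.append(offset)
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), starts


def locate(pos, contigs, starts):
    """Map a 0-based joined position to (contig index, offset), or None inside a spacer.

    Assumes 0 <= pos < length of the joined sequence.
    """
    i = bisect_right(starts, pos) - 1
    off = pos - starts[i]
    return (i, off) if off < len(contigs[i]) else None


contigs = ["ACGT" * 5, "GGCC" * 3]  # toy contigs of 20 and 12 bp
joined, starts = concatenate(contigs)
print(len(joined), starts)
print(locate(1021, contigs, starts), locate(500, contigs, starts))
```

A spacer much longer than a seed keeps any match from spanning two contigs, and positions that fall inside a spacer map to no contig at all.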
lexicmap search:
- Improve the seed-chaining algorithm.
- Better support for long queries.
- Add a new flag -w/--load-whole-seeds for loading the whole seed data into memory for faster search.
- Parallelize alignment in each query, so it’s faster for a single query.
- Optionally output matched query and subject sequences.
- 2-5X searching speed with a faster masking method.
- Change the output format.
- Add output of query start and end positions.
- Fix a target sequence extracting bug.
- Keep indexes of genome data in memory.
lexicmap utils kmers:
- Fix a minor bug: wrong number of k-mers for the second k-mer in each k-mer pair.
- New commands:
  - lexicmap utils gen-masks for generating masks from the top N largest genomes.
  - lexicmap utils seed-pos for extracting seed positions via reference names.
  - lexicmap utils reindex-seeds for recreating indexes of k-mer-value (seeds) data.
  - lexicmap utils genomes for listing genome IDs in the index.
v0.2.0 - 2024-02-02
- Software architecture and index formats are redesigned to reduce searching memory occupation.
- Indexing: genomes are processed in batches to reduce RAM usage, then indexes of all batches are merged.
- Searching: seed matching is performed on disk, yet it is ultra-fast.
v0.1.0 - 2024-01-15
- The first release.
- Seed indexing and querying are performed in RAM.
- GTDB r214 with 10k masks: index size 75GB, RAM: 130GB.