index
- In the LexicMap source code and command line options, the term “mask” is used, following the terminology in the LexicHash paper.
- In the LexicMap manuscript, however, we use “probe” because it is easier to understand: these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology.
$ lexicmap index -h
Generate an index from FASTA/Q sequences
Input:
*1. Sequences of each reference genome should be saved in separate FASTA/Q files, with reference identifiers
in the file names.
2. Input plain or gzip/xz/zstd/bzip2/lz4 compressed FASTA/Q files can be given via positional arguments or
the flag -X/--infile-list with a list of input files.
Flag -S/--skip-file-check is optional for skipping file checking if you trust the file list.
3. Input can also be a directory containing sequence files via the flag -I/--in-dir, with multiple-level
sub-directories allowed. A regular expression for matching sequence files is available via the flag
-r/--file-regexp.
4. Some non-isolate assemblies might have extremely large genomes (e.g., GCA_000765055.1, >150 Mb).
The flag -g/--max-genome is used to skip these input files, and the list of skipped files is written to a file
(-G/--big-genomes).
Changes since v0.5.0:
- Genomes with any single contig larger than the threshold will be skipped as before.
- However, fragmented (with many contigs) genomes with the total bases larger than the threshold will
be split into chunks and alignments from these chunks will be merged in "lexicmap search".
You need to increase the value for indexing fungal genomes.
5. Maximum genome size: 268,435,456.
More precisely: $total_bases + ($num_contigs - 1) * 1000 <= 268,435,456, as we concatenate contigs with
1000-bp intervals of N’s to reduce the sequence scale to index.
6. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value).
7. Soft-masked sequences are supported with --soft-masking.
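
The size bound in item 5 can be checked before indexing. The following is a minimal Python sketch of that arithmetic only, not LexicMap's own code; the helper names `concatenated_length` and `fits_max_genome_size` are hypothetical, and the interval of 1000 is the documented default of --contig-interval.

```python
# Limits quoted from the help text above.
MAX_GENOME_SIZE = 268_435_456  # equals 2**28
CONTIG_INTERVAL = 1000         # documented default of --contig-interval


def concatenated_length(contig_lengths, contig_interval=CONTIG_INTERVAL):
    """Total bases plus one interval of N's between each pair of adjacent contigs."""
    if not contig_lengths:
        return 0
    return sum(contig_lengths) + (len(contig_lengths) - 1) * contig_interval


def fits_max_genome_size(contig_lengths, contig_interval=CONTIG_INTERVAL):
    """Hypothetical helper: True if the genome satisfies the documented bound."""
    return concatenated_length(contig_lengths, contig_interval) <= MAX_GENOME_SIZE


if __name__ == "__main__":
    lengths = [5_000_000, 200_000, 50_000]  # made-up contig lengths
    print(concatenated_length(lengths))
    print(fits_max_genome_size(lengths))
```

Note that this bound is separate from -g/--max-genome, which is a user-set threshold below this hard limit.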
Attention:
*1) ► You can rename the sequence files for convenience, e.g., GCF_000017205.1.fa.gz, because the genome
identifiers in the index and search result would be: the basenames of files with common FASTA/Q file
extensions removed, which are extracted via the flag -N/--ref-name-regexp.
► The extracted genome identifiers should be distinct; they are shown in search results
and are used to extract subsequences in the command "lexicmap utils subseq".
2) ► Unwanted sequences like plasmids can be filtered out by content in FASTA/Q header via regular
expressions (-B/--seq-name-filter).
3) All degenerate bases are converted to their lexicographically first bases. E.g., N is converted to A.
code    bases      saved
A       A          A
C       C          C
G       G          G
T/U     T          T
M       A/C        A
R       A/G        A
W       A/T        A
S       C/G        C
Y       C/T        C
K       G/T        G
V       A/C/G      A
H       A/C/T      A
D       A/G/T      A
B       C/G/T      C
N       A/C/G/T    A
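
The table above follows one rule: each code is saved as the smallest base, in alphabetical order, among the bases it can stand for. Below is a minimal Python sketch of that rule, not LexicMap's implementation. The names `IUPAC_BASES`, `SAVED_BASE`, and `to_saved_bases` are hypothetical, and the sketch assumes uppercase input (it does not model --soft-masking).

```python
# IUPAC codes and the bases they stand for, copied from the table above.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# The saved base is the lexicographically first base of each set.
SAVED_BASE = {code: min(bases) for code, bases in IUPAC_BASES.items()}

# Translation table for str.translate (single-character keys are allowed).
_TRANSLATION = str.maketrans(SAVED_BASE)


def to_saved_bases(seq):
    """Replace every IUPAC code in an uppercase sequence with its saved base."""
    return seq.translate(_TRANSLATION)


if __name__ == "__main__":
    print(to_saved_bases("ACGTUMRWSYKVHDBN"))
```

Characters that are not in the table are left unchanged by this sketch; how the tool treats them is not stated in the help text.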
Important parameters:
--- Genome data ---
*1. -b/--batch-size, ► Maximum number of genomes in each batch (maximum: 131072, default: 5000).
► If the number of input files exceeds this number, input files are split into multiple
batches and indexes are built for all batches. In the end, seed files are merged, while
genome data files are kept unchanged and collected.
■ Bigger values increase indexing memory occupation and increase batch searching speed,
while single query searching speed is not affected.
--- LexicHash mask generation ---
0. -M/--mask-file, ► File with custom masks, which could be exported from an existing index or newly
generated by "lexicmap utils masks".
This flag overrides -k/--kmer, -m/--masks, -s/--rand-seed, etc.
*1. -k/--kmer, ► K-mer size (maximum: 32, default: 31).
■ Bigger values improve the search specificity and do not increase the index size.
*2. -m/--masks, ► Number of LexicHash masks (default: 20000).
■ Bigger values improve the search sensitivity slightly, increase the index size,
and slow down the search (seed matching) speed.
--- Seeds data (k-mer-value data) ---
*1. --seed-max-desert ► Maximum distance between seeds (default: 100).
The default value of 100 guarantees queries >=200 bp would match at least two seeds.
► Large regions with no seeds are called sketching deserts. Deserts with seed distance
larger than this value will be filled by choosing k-mers roughly every
--seed-in-desert-dist (50 by default) bases.
■ Big values decrease the search sensitivity for distant targets, speed up indexing,
and decrease the indexing memory occupation and the index size, while the
alignment speed is almost unaffected.
2. -c/--chunks, ► Number of seed file chunks (maximum: 128, default: value of -j/--threads).
► Bigger values accelerate the search speed at the cost of a high disk reading load.
The maximum number should not exceed the maximum number of open files set by the
operating system.
► Make sure the value of '-j/--threads' in 'lexicmap search' is >= this value.
*3. -J/--seed-data-threads ► Number of threads for writing seed data and merging seed chunks from all batches
(maximum: -c/--chunks, default: 8).
■ The actual value is min(--seed-data-threads, max(1, --max-open-files/($batches_1_round + 2))),
where $batches_1_round = min(int($input_files / --batch-size), --max-open-files).
■ Bigger values increase indexing speed at the cost of slightly higher memory occupation.
4. --partitions, ► Number of partitions for indexing each seed file (default: 4096).
► Bigger values slightly increase memory occupation.
► After indexing, "lexicmap utils reindex-seeds" can be used to reindex the seeds data
with another value of this flag.
*5. --max-open-files, ► Maximum number of open files (default: 1024).
► It's only used in merging indexes of multiple genome batches. If there are >100 batches
($input_files / --batch-size), please increase this value and set a bigger "ulimit -n" in the shell.
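
The formula given for the actual value of -J/--seed-data-threads can be worked through with concrete numbers. This is a minimal Python sketch of the stated formula, not LexicMap's code; the function name is hypothetical, and treating both divisions as integer (floor) division is an assumption, since the help text only writes "/" and "int(...)".

```python
def effective_seed_data_threads(seed_data_threads, max_open_files,
                                input_files, batch_size):
    """Evaluate the documented formula for the actual number of seed-data threads.

    Assumption: both divisions are integer (floor) divisions.
    """
    # $batches_1_round = min(int($input_files / --batch-size), --max-open-files)
    batches_1_round = min(input_files // batch_size, max_open_files)
    # min(--seed-data-threads, max(1, --max-open-files / ($batches_1_round + 2)))
    return min(seed_data_threads,
               max(1, max_open_files // (batches_1_round + 2)))


if __name__ == "__main__":
    # 1,000,000 input files in batches of 5000 gives 200 batches.
    print(effective_seed_data_threads(8, 1024, 1_000_000, 5000))
    # Same input with a larger --max-open-files.
    print(effective_seed_data_threads(8, 4096, 1_000_000, 5000))
```

Under these assumptions, 200 batches with the default --max-open-files of 1024 reduce the requested 8 threads to 5, while 4096 open files keep all 8, which illustrates why the help text suggests raising --max-open-files when there are many batches.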
Usage:
lexicmap index [flags] [-k <k>] [-m <masks>] {-I <seqs dir> | [-S] -X <file list>} -O <index.lmi>
Flags:
-b, --batch-size int ► Maximum number of genomes in each batch (maximum value: 131072)
(default 5000)
-G, --big-genomes string ► Out file of skipped files with $total_bases + ($num_contigs - 1) *
$contig_interval >= -g/--max-genome. The second column is one of the
skip types: no_valid_seqs, too_large_genome, too_many_seqs.
-c, --chunks int ► Number of chunks for storing seeds (k-mer-value data) files. Max:
128. Default: the value of -j/--threads. (default 16)
--contig-interval int ► Length of interval (N's) between contigs in a genome. It can't be
too small (<1000) or some alignments might be fragmented (default 1000)
--debug ► Print debug information.
-r, --file-regexp string ► Regular expression for matching sequence files in -I/--in-dir,
case ignored. Attention: use double quotation marks for patterns
containing commas, e.g., -r '"A{2,}"'. (default
"\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$")
--force ► Overwrite existing output directory.
-h, --help help for index
-I, --in-dir string ► Input directory containing FASTA/Q files. Directory and file
symlinks are followed.
-k, --kmer int ► K-mer size. K needs to be <= 32. (default 31)
-M, --mask-file string ► File of custom masks. This flag overrides -k/--kmer, -m/--masks,
-s/--rand-seed, etc.
-m, --masks int ► Number of LexicHash masks. (default 20000)
-g, --max-genome int ► Maximum genome size. Genomes with any single contig larger than
the threshold will be skipped, while fragmented (with many contigs)
genomes larger than the threshold will be split into chunks and
alignments from these chunks will be merged in "lexicmap search". The
value needs to be smaller than the maximum supported genome size:
268435456. (default 15000000)
--max-open-files int ► Maximum number of open files, used in merging indexes. If there are >100
batches, please increase this value and set a bigger "ulimit -n" in the
shell. (default 1024)
-l, --min-seq-len int ► Minimum sequence length to index. The value would be k for values
<= 0. (default -1)
--no-desert-filling ► Disable sketching desert filling (only for debug).
-O, --out-dir string ► Output LexicMap index directory.
--partitions int ► Number of partitions for indexing seeds (k-mer-value data) files.
The value needs to be a power of 4. (default 4096)
-s, --rand-seed int ► Rand seed for generating random masks. (default 1)
-N, --ref-name-regexp string ► Regular expression (must contain "(" and ")") for extracting the
reference name from the filename. Attention: use double quotation
marks for patterns containing commas, e.g., -N '"A{2,}"'. (default
"(?i)(.+)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$")
--save-seed-pos ► Save seed positions, which can be inspected with "lexicmap utils
seed-pos".
-J, --seed-data-threads int ► Number of threads for writing seed data and merging seed chunks
from all batches, the value should be in range of [1, -c/--chunks]. If
there are >100 batches, please also increase the value of
--max-open-files and set a bigger "ulimit -n" in shell. (default 8)
-d, --seed-in-desert-dist int ► Distance of k-mers to fill deserts. (default 50)
-D, --seed-max-desert int ► Maximum length of sketching deserts, or maximum seed distance.
Deserts with seed distance larger than this value will be filled by
choosing k-mers roughly every --seed-in-desert-dist bases. (default 100)
-B, --seq-name-filter strings ► List of regular expressions for filtering out sequences by
contents in FASTA/Q header/name, case ignored.
-S, --skip-file-check ► Skip input file checking when given files or a file list.
--soft-masking ► Support soft-masked genomes. Lowercase bases in soft-masked
low-complexity regions will be treated as A's, and won't be seeded.
Global Flags:
-X, --infile-list string ► File of input file list (one file per line). If given, they are
appended to files from CLI arguments.
--log string ► Log file.
--quiet ► Do not print any verbose information. But you can write it to a file
with --log.
-j, --threads int ► Number of CPU cores to use. By default, it uses all available cores.
(default 16)
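
To preview which genome identifier the default -N/--ref-name-regexp would extract from a file name, the pattern can be tried outside the tool. This is a minimal Python sketch under stated assumptions: the "\\." shown in the help text is read as an escaped "\." in the actual pattern, Python's `re` module is used in place of the tool's own regular expression engine, and the helper name `ref_name` and its `None` return for non-matching names are hypothetical.

```python
import os
import re

# Default pattern from the help text, with "\\." read as a literal dot.
REF_NAME_RE = re.compile(
    r"(?i)(.+)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$"
)


def ref_name(path):
    """Return the first capture group for the file's basename, or None."""
    match = REF_NAME_RE.search(os.path.basename(path))
    return match.group(1) if match else None


if __name__ == "__main__":
    for path in ["genomes/GCF_000017205.1.fa.gz", "sample.FASTA", "notes.txt"]:
        print(path, "->", ref_name(path))
```

Running such a check over a file list before indexing is one way to confirm that the extracted identifiers are distinct.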
See