index

Terminology differences

In the LexicMap source code and command line options, the term “mask” is used, following the terminology in the LexicHash paper.
In the LexicMap manuscript, however, we use “probe” because it is easier to understand: these masks, which consist of thousands of k-mers and capture k-mers from sequences through prefix matching, function similarly to DNA probes in molecular biology.
Usage

$ lexicmap index -h
Generate an index from FASTA/Q sequences

Input:
 *1. Sequences of each reference genome should be saved in separate FASTA/Q files, with reference identifiers
     in the file names.
  2. Input plain or gzip/xz/zstd/bzip2/lz4 compressed FASTA/Q files can be given via positional arguments or
     the flag -X/--infile-list with a list of input files.
     Flag -S/--skip-file-check is optional for skipping file checking if you trust the file list.
  3. Input can also be a directory containing sequence files via the flag -I/--in-dir, with multiple-level
     sub-directories allowed. A regular expression for matching sequence files is available via the flag
     -r/--file-regexp.
  4. Some non-isolate assemblies might have extremely large genomes (e.g., GCA_000765055.1, >150 Mb).
     The flag -g/--max-genome is used to skip these input files, and the list of skipped files is written
     to a file (-G/--big-genomes).
     Changes since v0.5.0:
       - Genomes with any single contig larger than the threshold are still skipped, as before.
       - However, fragmented genomes (with many contigs) whose total bases exceed the threshold are
         split into chunks, and alignments from these chunks are merged in "lexicmap search".
     You need to increase the value for indexing fungal genomes.
  5. Maximum genome size: 268,435,456.
     More precisely: $total_bases + ($num_contigs - 1) * 1000 <= 268,435,456, as we concatenate contigs with
     1000-bp intervals of N’s to reduce the sequence scale to index.
  6. A flag -l/--min-seq-len can filter out sequences shorter than the threshold (default is the k value).
  7. Soft-masked sequences are supported with --soft-masking.

  Attention:
   *1) ► You can rename the sequence files for convenience, e.g., GCF_000017205.1.fa.gz, because the genome
       identifiers in the index and search result would be: the basenames of files with common FASTA/Q file
       extensions removed, which are extracted via the flag -N/--ref-name-regexp.
       ► The extracted genome identifiers should be distinct, as they are shown in search results
       and are used to extract subsequences in the command "lexicmap utils subseq".
    2) ► Unwanted sequences like plasmids can be filtered out by content in FASTA/Q header via regular
       expressions (-B/--seq-name-filter).
    3) All degenerate bases are converted to their lexicographically first bases. E.g., N is converted to A.
        code  bases    saved
        A     A        A
        C     C        C
        G     G        G
        T/U   T        T

        M     A/C      A
        R     A/G      A
        W     A/T      A
        S     C/G      C
        Y     C/T      C
        K     G/T      G

        V     A/C/G    A
        H     A/C/T    A
        D     A/G/T    A
        B     C/G/T    C

        N     A/C/G/T  A

Important parameters:

  --- Genome data ---
 *1. -b/--batch-size,       ► Maximum number of genomes in each batch (maximum: 131072, default: 5000).
                            ► If the number of input files exceeds this number, input files are split into multiple
                            batches and indexes are built for all batches. In the end, seed files are merged, while
                            genome data files are kept unchanged and collected.
                            ■ Bigger values increase indexing memory occupation and increase batch searching speed,
                            while single query searching speed is not affected.

  --- LexicHash mask generation ---
  0. -M/--mask-file,        ► File with custom masks, which could be exported from an existing index or newly
                            generated by "lexicmap utils masks".
                            This flag overrides -k/--kmer, -m/--masks, -s/--rand-seed, etc.
 *1. -k/--kmer,             ► K-mer size (maximum: 32, default: 31).
                            ■ Bigger values improve the search specificity and do not increase the index size.
 *2. -m/--masks,            ► Number of LexicHash masks (default: 20000).
                            ■ Bigger values improve the search sensitivity slightly, increase the index size,
                            and slow down the search (seed matching) speed.

  --- Seeds data (k-mer-value data) ---
 *1. --seed-max-desert      ► Maximum length of distances between seeds (default: 100).
                            The default value of 100 guarantees queries >=200 bp would match at least two seeds.
                            ► Large regions with no seeds are called sketching deserts. Deserts with seed distance
                            larger than this value will be filled by choosing k-mers roughly every
                            --seed-in-desert-dist (50 by default) bases.
                            ■ Bigger values decrease the search sensitivity for distant targets, but speed up
                            indexing, reduce the indexing memory occupation, and decrease the index size, while
                            the alignment speed is almost unaffected.
  2. -c/--chunks,           ► Number of seed file chunks (maximum: 128, default: value of -j/--threads).
                            ► Bigger values accelerate the search speed at the cost of a high disk reading load.
                            The maximum number should not exceed the maximum number of open files set by the
                            operating systems.
                            ► Make sure the value of '-j/--threads' in 'lexicmap search' is >= this value.
 *3. -J/--seed-data-threads ► Number of threads for writing seed data and merging seed chunks from all batches
                            (maximum: -c/--chunks, default: 8).
                            ■ The actual value is min(--seed-data-threads, max(1, --max-open-files/($batches_1_round + 2))),
                            where $batches_1_round = min(int($input_files / --batch-size), --max-open-files).
                            ■ Bigger values increase indexing speed at the cost of slightly higher memory occupation.
  4. --partitions,          ► Number of partitions for indexing each seed file (default: 4096).
                            ► Bigger values bring a little higher memory occupation.
                            ► After indexing, "lexicmap utils reindex-seeds" can be used to reindex the seeds data
                            with another value of this flag.
 *5. --max-open-files,      ► Maximum number of open files (default: 1024).
                            ► It's only used in merging indexes of multiple genome batches. If there are >100 batches
                            ($input_files / --batch-size), please increase this value and set a bigger "ulimit -n" in shell.

Usage:
  lexicmap index [flags] [-k <k>] [-m <masks>] {-I <seqs dir> | [-S] -X <file list>} -O <index.lmi>

Flags:
  -b, --batch-size int            ► Maximum number of genomes in each batch (maximum value: 131072)
                                  (default 5000)
  -G, --big-genomes string        ► Out file of skipped files with $total_bases + ($num_contigs - 1) *
                                  $contig_interval >= -g/--max-genome. The second column is one of the
                                  skip types: no_valid_seqs, too_large_genome, too_many_seqs.
  -c, --chunks int                ► Number of chunks for storing seeds (k-mer-value data) files. Max:
                                  128. Default: the value of -j/--threads. (default 16)
      --contig-interval int       ► Length of interval (N's) between contigs in a genome. It can't be
                                  too small (<1000) or some alignments might be fragmented (default 1000)
      --debug                     ► Print debug information.
  -r, --file-regexp string        ► Regular expression for matching sequence files in -I/--in-dir,
                                  case ignored. Attention: use double quotation marks for patterns
                                  containing commas, e.g., -r '"A{2,}"'. (default
                                  "\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$")
      --force                     ► Overwrite existing output directory.
  -h, --help                      help for index
  -I, --in-dir string             ► Input directory containing FASTA/Q files. Directory and file
                                  symlinks are followed.
  -k, --kmer int                  ► Maximum k-mer size. K needs to be <= 32. (default 31)
  -M, --mask-file string          ► File of custom masks. This flag overrides -k/--kmer, -m/--masks,
                                  -s/--rand-seed etc.
  -m, --masks int                 ► Number of LexicHash masks. (default 20000)
  -g, --max-genome int            ► Maximum genome size. Genomes with any single contig larger than
                                  the threshold will be skipped, while fragmented (with many contigs)
                                  genomes larger than the threshold will be split into chunks and
                                  alignments from these chunks will be merged in "lexicmap search". The
                                  value needs to be smaller than the maximum supported genome size:
                                  268435456. (default 15000000)
      --max-kmer-freq int         ► If a mask captures the same k-mer at more than N positions of a
                                  genome, only the first N positions will be retained. This option may
                                  reduce search sensitivity, but it's useful when simply checking
                                  whether a query matches any position in a genome that contains many
                                  tandem repeat sequences. (0 for no filtering)
      --max-open-files int        ► Maximum number of open files, used in merging indexes. If there are >100
                                  batches, please increase this value and set a bigger "ulimit -n" in
                                  shell. (default 1024)
  -l, --min-seq-len int           ► Minimum sequence length to index. The value would be k for values
                                  <= 0. (default -1)
      --no-desert-filling         ► Disable sketching desert filling (only for debug).
  -O, --out-dir string            ► Output LexicMap index directory.
      --partitions int            ► Number of partitions for indexing seeds (k-mer-value data) files.
                                  The value needs to be a power of 4. (default 4096)
  -s, --rand-seed int             ► Rand seed for generating random masks. (default 1)
  -N, --ref-name-regexp string    ► Regular expression (must contain "(" and ")") for extracting the
                                  reference name from the filename. Attention: use double quotation
                                  marks for patterns containing commas, e.g., -N '"A{2,}"'. (default
                                  "(?i)(.+)\\.(f[aq](st[aq])?|fna)(\\.gz|\\.xz|\\.zst|\\.bz2)?$")
      --save-seed-pos             ► Save seed positions, which can be inspected with "lexicmap utils
                                  seed-pos".
  -J, --seed-data-threads int     ► Number of threads for writing seed data and merging seed chunks
                                  from all batches, the value should be in range of [1, -c/--chunks]. If
                                  there are >100 batches, please also increase the value of
                                  --max-open-files and set a bigger "ulimit -n" in shell. (default 8)
  -d, --seed-in-desert-dist int   ► Distance of k-mers to fill deserts. (default 50)
  -D, --seed-max-desert int       ► Maximum length of sketching deserts, or maximum seed distance.
                                  Deserts with seed distance larger than this value will be filled by
                                  choosing k-mers roughly every --seed-in-desert-dist bases. (default 100)
  -B, --seq-name-filter strings   ► List of regular expressions for filtering out sequences by
                                  contents in FASTA/Q header/name, case ignored.
  -S, --skip-file-check           ► Skip input file checking when given files or a file list.
      --soft-masking              ► Support soft-masked genomes. Lowercase bases in soft-masked
                                  low-complexity regions will be treated as A's, and won't be seeded.

Global Flags:
  -X, --infile-list string   ► File of input file list (one file per line). If given, they are
                             appended to files from CLI arguments.
      --log string           ► Log file.
      --quiet                ► Do not print any verbose information. You can still write it to a file
                             with --log.
  -j, --threads int          ► Number of CPU cores to use. By default, it uses all available cores.
                             (default 16)
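
The genome size limit described in the help text can be checked before indexing. The following is a minimal sketch, not LexicMap's own code: the function names and the example numbers are hypothetical, and only the formula, the default contig interval of 1000, and the limit of 268,435,456 are taken from the help text.

```python
MAX_GENOME_SIZE = 268_435_456    # maximum supported genome size, from the help text
DEFAULT_CONTIG_INTERVAL = 1000   # default value of --contig-interval


def effective_genome_size(total_bases, num_contigs,
                          contig_interval=DEFAULT_CONTIG_INTERVAL):
    """Return the total bases plus the N intervals placed between contigs."""
    if num_contigs < 1:
        raise ValueError("a genome needs at least one contig")
    return total_bases + (num_contigs - 1) * contig_interval


def fits_in_index(total_bases, num_contigs,
                  contig_interval=DEFAULT_CONTIG_INTERVAL):
    """Check the concatenated size against the maximum supported genome size."""
    size = effective_genome_size(total_bases, num_contigs, contig_interval)
    return size <= MAX_GENOME_SIZE


if __name__ == "__main__":
    # Hypothetical assembly: 40,000,000 bases in 1,200 contigs.
    print(effective_genome_size(40_000_000, 1_200))
    print(fits_in_index(40_000_000, 1_200))
```

Note that this only checks the hard upper limit. Whether a genome is skipped or split into chunks depends on the separate threshold given by -g/--max-genome.
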
Examples

See Building an index
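
To preview which genome identifiers a set of file names would produce, the default pattern of -N/--ref-name-regexp can be tried outside LexicMap. This is a rough sketch under two assumptions: that the identifier is the first capture group of the pattern applied to the file basename, and that Python's `re` module treats this particular pattern the same way LexicMap does. The helper name `genome_id` and the paths are hypothetical.

```python
import os
import re

# Default value of -N/--ref-name-regexp from the help text. The doubled
# backslashes shown there are string escaping for a single backslash.
REF_NAME_REGEXP = re.compile(
    r"(?i)(.+)\.(f[aq](st[aq])?|fna)(\.gz|\.xz|\.zst|\.bz2)?$"
)


def genome_id(path):
    """Return the extracted genome identifier, or None if the name does not match."""
    match = REF_NAME_REGEXP.search(os.path.basename(path))
    # Assumption: the first capture group holds the reference name.
    return match.group(1) if match else None


if __name__ == "__main__":
    for path in ["refs/GCF_000017205.1.fa.gz", "sample.FASTQ", "notes.txt"]:
        print(path, "->", genome_id(path))
```

Running such a check over a file list before indexing is one way to confirm that the extracted identifiers are distinct.
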