Skip to main content
LexicMap: efficient sequence alignment against millions of prokaryotic genomes​
GitHub Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Back to homepage

masks

$ lexicmap utils masks -h
View masks of the index or generate new masks randomly

Usage:
  lexicmap utils masks [flags] { -d <index path> | [-k <k>] [-n <masks>] [-s <seed>] } [-o out.tsv.gz]

Flags:
  -h, --help              help for masks
  -d, --index string      ► Index directory created by "lexicmap index".
  -k, --kmer int          ► Maximum k-mer size. K needs to be <= 32. (default 31)
  -m, --masks int         ► Number of masks. (default 40000)
  -o, --out-file string   ► Out file, supports and recommends a ".gz" suffix ("-" for stdout).
                          (default "-")
  -p, --prefix int        ► Length of mask k-mer prefix for checking low-complexity (0 for no
                          checking). (default 15)
  -s, --seed int          ► The seed for generating random masks. (default 1)

Global Flags:
  -X, --infile-list string   ► File of input file list (one file per line). If given, they are
                             appended to files from CLI arguments.
      --log string           ► Log file.
      --quiet                ► Do not print any verbose information. But you can write them to a file
                             with --log.
  -j, --threads int          ► Number of CPU cores to use. By default, it uses all available cores.
                             (default 16)

Examples

$ lexicmap utils masks --quiet -d demo.lmi/ | head -n 10
1       AAAAAAATTCTCGGCGGTGTTTCCAGGCGCA
2       AAAAAACGTGGCGTCCCCTGTATAACGGCTA
3       AAAAAAGAGGGGAAGCAAGCTGAAGGATATG
4       AAAAAATACAGGCTGGCATCTTTAACCCACC
5       AAAAAATCCAGGGTTCCGTTAAGGATCTGTC
6       AAAAACATTCATGCTAGCATACCTTGGCAAC
7       AAAAACCACAATGTGGAAGCACGAGAGGATT
8       AAAAACCTGTACCCACCCGACGTGGATCCTC
9       AAAAACGTAGGCGTACCTCTCATAGCTTGTA
10      AAAAACTATGGATACTTGCCGTAAATCACCT

$ lexicmap utils masks --quiet -d demo.lmi/ | tail -n 10
19991   TTTTTGAACTTGTGAAAAAGGCAGATGTGTG
19992   TTTTTGCGTTTATGCTGCCCTCAAACCATCT
19993   TTTTTGGATCCACTGTACGAGCACACTACCC
19994   TTTTTGTGGCTCATCGGGATCGGGAGCAGTC
19995   TTTTTTACATGTTGGGCTAGGGGCGGTTCAC
19996   TTTTTTATCGGACGCCAAGTTTGTAATCGTC
19997   TTTTTTCTTGCATCGTATTCAGCACGTTCCT
19998   TTTTTTGCCGAGTGACCCCGAAAAGCTCACA
19999   TTTTTTTATCGAGGCATGGTTGAAGACGGGT
20000   TTTTTTTCCGTAACTAGGTTCTGGCGATTCC

# check a specific mask

$ lexicmap utils masks --quiet -d demo.lmi/ -m 12345
12345   GCTGCACACGCAAAGACTCACGTCTTCAACG

Freqency of prefixes.

$ lexicmap utils masks --quiet -d demo.lmi/ \
  | csvtk mutate -Ht -f 2 -p '^(.{7})' \
  | csvtk freq -Ht -f 3 -nr \
  | head -n 10
AAAAAAT 2
AAAAACC 2
AAAAACT 2
AAAAAGG 2
AAAAAGT 2
AAAAATT 2
AAAACCA 2
AAAACCC 2
AAAACGA 2
AAAACTA 2

$ lexicmap utils masks --quiet -d demo.lmi/ \
  | csvtk mutate -Ht -f 2 -p '^(.{7})' \
  | csvtk freq -Ht -f 3 -n \
  | head -n 10
AAAAAAA 1
AAAAAAC 1
AAAAAAG 1
AAAAACA 1
AAAAACG 1
AAAAAGA 1
AAAAAGC 1
AAAAATA 1
AAAAATC 1
AAAAATG 1

Frequency of frequencies. i.e., for 20,000 masks, 47 = 16384. In them, 3,616 of them are duplicated 2 times. 12768 + 2 * 3616 = 20000.

$ lexicmap utils masks --quiet -d demo.lmi/ \
  | csvtk mutate -Ht -f 2 -p '^(.{7})' \
  | csvtk freq -Ht -f 3 -n \
  | csvtk freq -Ht -f 2 -k
1       12768
2       3616