Usage

KMCP is a command-line tool consisting of several subcommands.

Subcommand	Function
compute	Generate k-mers (sketch) from FASTA/Q sequences
index	Construct adatabase from k-mer files
search	Search sequences against a database
merge	Merge search results from multiple databases
profile	Generate the taxonomic profile from search results
utils split-genomes	Split genomes into chunks
utils unik-info	Print information of .unik files
utils index-info	Print information of index files
utils index-density	Plot the element density of bloom filters for an index file
utils ref-info	Print information of reference chunks in a database
utils cov2simi	Convert k-mer coverage to sequence similarity
utils query-fpr	Compute the false positive rate of a query
utils filter	Filter search results and find species/assembly-specific queries
utils merge-regions	Merge species/assembly-specific regions

kmcp


    Program: kmcp (K-mer-based Metagenomic Classification and Profiling)
    Version: v0.9.4
  Documents: https://bioinf.shenwei.me/kmcp
Source code: https://github.com/shenwei356/kmcp

KMCP is a tool for metagenomic classification and profiling.

KMCP can also be used for:
  1. Fast sequence search against large scales of genomic datasets
     as BIGSI and COBS do.
  2. Fast assembly/genome similarity estimation as Mash and sourmash do,
     by utilizing Minimizer, FracMinHash (Scaled MinHash), or Closed Syncmers.

Usage:
  kmcp [command] 

Available Commands:
  autocompletion Generate shell autocompletion script
  compute        Generate k-mers (sketches) from FASTA/Q sequences
  index          Construct a database from k-mer files
  merge          Merge search results from multiple databases
  profile        Generate the taxonomic profile from search results
  search         Search sequences against a database
  utils          Some utilities
  version        Print version information and check for update

Flags:
  -h, --help                 help for kmcp
  -i, --infile-list string   ► File of input files list (one file per line). If given, they are
                             appended to files from CLI arguments.
      --log string           ► Log file.
  -q, --quiet                ► Do not print any verbose information. But you can write them to file
                             with --log.
  -j, --threads int          ► Number of CPUs cores to use. (default 16)

Use "kmcp [command] --help" for more information about a command.

compute

Generate k-mers (sketches) from FASTA/Q sequences

Input:
  1. Input plain or gzipped FASTA/Q files can be given via positional
     arguments or the flag -i/--infile-list with the list of input files,
  2. Or a directory containing sequence files via the flag -I/--in-dir,
     with multiple-level sub-directories allowed. A regular expression
     for matching sequencing files is available via the flag -r/--file-regexp.
 *3. For taxonomic profiling, the sequences of each reference genome should be
     saved in a separate file, with the reference identifier in the file name.

  Attention:
    You may rename the sequence files for convenience because the 
  sequence/genome identifier in the index and search results would be:
    1). For the default mode (computing k-mers for the whole file):
          the basename of file with common FASTA/Q file extension removed,
          captured via the flag -N/--ref-name-regexp.  
    2). For splitting sequence mode (see details below):
          same to 1).
    3). For computing k-mers for each sequence:
          the sequence identifier.

Attentions:
  1. Unwanted sequences like plasmid can be filtered out by
     the name via regular expressions (-B/--seq-name-filter).
  2. By default, kmcp computes k-mers (sketches) of every file,
     you can also use --by-seq to compute for every sequence,
     where sequence IDs in all input files are better to be distinct.
  3. It also supports splitting sequences into chunks, this
     could increase the specificity in search results at the cost
     of a slower searching speed.
  4. Multiple sizes of k-mers are supported.

Supported k-mer (sketches) types:
  1. K-mer:
     1). ntHash of k-mer (-k)
  2. K-mer sketchs (all using ntHash):
     1). FracMinHash    (-k -D), previously named Scaled MinHash
     2). Minimizer      (-k -W), optionally scaling/down-sampling (-D)
     3). Closed Syncmer (-k -S), optionally scaling/down-sampling (-D)

Splitting sequences:
  1. Sequences can be splitted into chunks by a chunk size 
     (-s/--split-size) or number of chunks (-n/--split-number)
     with overlap (-l/--split-overlap).
     In this mode, the sequences of each genome should be saved in an
     individual file.
  2. When splitting by number of chunks, all sequences (except for
     these matching any regular expression given by -B/--seq-name-filter)
     in a sequence file are concatenated with k-1 N's before splitting.
  3. Both sequence IDs and chunks indices are saved for later use,
     in form of meta/description data in .unik files.

Metadata:
  1. Every outputted .unik file contains the sequence/reference ID,
     chunk index, number of chunks, and genome size of reference.
  2. When parsing whole sequence files or splitting by the number of chunks,
     the identifier of a reference is the basename of the input file
     by default. It can also be extracted from the input file name via
     -N/--ref-name-regexp, e.g., "^(\w{3}_\d{9}\.\d+)" for RefSeq records.

Output:
  1. All outputted .unik files are saved in ${outdir}, with path
     ${outdir}/xxx/yyy/zzz/${infile}-id_${seqID}.unik
     where dirctory tree '/xxx/yyy/zzz/' is built for > 1000 output files.
  2. For splitting sequence mode (--split-size > 0 or --split-number > 0),
     output files are:
     ${outdir}//xxx/yyy/zzz/${infile}/{seqID}-chunk_${chunkIdx}.unik
  3. A summary file ("${outdir}/_info.txt") is generated for later use.
     Users need to check if the reference IDs (column "name") are what
     supposed to be.

Performance tips:
  1. Decrease the value of -j/--threads for data in hard disk drives to
     reduce I/O pressure.

Next step:
  1. Check the summary file (${outdir}/_info.txt) to see if the reference
     IDs (column "name") are what supposed to be.
  2. Run "kmcp index" with the output directory.

Examples:
  1. From few sequence files:

        kmcp compute -k 21 -n 10 -l 150 -O tmp-k21-n10-l150 NC_045512.2.fna.gz

  2. From a list file:

        kmcp compute -k 21 -n 10 -l 150 -O tmp-k21-210-l150 -i list.txt

  3. From a directory containing many sequence files:

        kmcp compute -k 21 -n 10 -l 150 -B plasmid \
            -O gtdb-k21-n10-l150 -I gtdb-genomes/

Usage:
  kmcp compute [flags] [-k <k>] [-n <chunks>] [-l <overlap>] {[-I <seqs dir>] | <seq files>} -O <out dir>

Flags:
      --by-seq                    ► Compute k-mers (sketches) for each sequence, instead of the whole file.
      --circular                  ► Input sequences are circular. Note that it only applies to genomes
                                  with a single chromosome.
  -c, --compress                  ► Output gzipped .unik files, it's slower and can save little space.
  -r, --file-regexp string        ► Regular expression for matching sequence files in -I/--in-dir,
                                  case ignored. (default "\\.(f[aq](st[aq])?|fna)(.gz)?$")
      --force                     ► Overwrite existed output directory.
  -h, --help                      help for compute
  -I, --in-dir string             ► Directory containing FASTA/Q files. Directory symlinks are followed.
  -k, --kmer ints                 ► K-mer size(s). K needs to be <=64. Multiple values are supported,
                                  e.g., "-k 21,31" or "-k 21 -k 31" (default [21])
  -W, --minimizer-w int           ► Minimizer window size.
  -O, --out-dir string            ► Output directory.
  -N, --ref-name-regexp string    ► Regular expression (must contains "(" and ")") for extracting
                                  reference name from filename. (default
                                  "(?i)(.+)\\.(f[aq](st[aq])?|fna)(.gz)?$")
  -D, --scale int                 ► Scale of the FracMinHash (Scaled MinHash), or down-sample factor
                                  for Syncmers and Minimizer. (default 1)
  -B, --seq-name-filter strings   ► List of regular expressions for filtering out sequences by
                                  header/name, case ignored.
  -m, --split-min-ref int         ► Only splitting sequences >= X bp. (default 1000)
  -n, --split-number int          ► Chunk number for splitting sequences, incompatible with
                                  -s/--split-size.
  -l, --split-overlap int         ► Chunk overlap for splitting sequences. The default value will be
                                  set to k-1 unless you change it.
  -s, --split-size int            ► Chunk size for splitting sequences, incompatible with
                                  -n/--split-number.
  -S, --syncmer-s int             ► Length of the s-mer in Closed Syncmers.

index

Construct a database from k-mer files

We build the index for k-mers (sketches) with a modified compact bit-sliced
signature index (COBS). We totally rewrite the algorithms, data structure,
and file format, and have improved the indexing and searching speed.

Input:
  The output directory generated by "kmcp compute".

Database size and searching accuracy:
  0. Use --dry-run to adjust parameters and check the final number of 
     index files (#index-files) and the total file size.
  1. -f/--false-positive-rate: the default value 0.3 is enough for a
     query with tens of k-mers (see BIGSI/COBS paper).
     Small values could largely increase the size of the database.
  2. -n/--num-hash: large values could reduce the database size,
     at the cost of a slower searching speed. Values <=4 are recommended.
  3. The value of block size -b/--block-size better to be multiple of 64.
     The default value is:  (#unikFiles/#threads + 7) / 8 * 8
  4. Use flag -x/--block-sizeX-kmers-t, -8/--block-size8-kmers-t,
     and -1/--block-size1-kmers-t to separately create indexes for
     inputs with a huge number of k-mers, for precise control of
     database size.

References:
  1. COBS: https://arxiv.org/abs/1905.09624

Taxonomy data:
  1. No taxonomy data are included in the database.
  2. Taxonomy information are only needed in "profile" command.

Performance tips:
  1. The number of blocks (.uniki files) is better be smaller than
     or equal to the number of CPU cores for faster searching speed. 
     We can set the flag -j/--threads to control the blocks number.
     When more threads (>= 1.3 * #blocks) are given, extra workers are
     automatically created.
  2. #threads files are simultaneously opened, and the max number
     of opened files is limited by the flag -F/--max-open-files.
     You may use a small value of -F/--max-open-files for 
     hard disk drive storages.
  3. When the database is used in a new computer with more CPU cores,
     'kmcp search' could automatically scale to utilize as many cores
     as possible.

Next step:
  1. Use 'kmcp search' for searching.
  2. Use 'kmcp utils ref-info' to check the number of k-mers and FPR
     of each genome chunk.

Examples:
  1. For bacterial genomes:

       kmcp index -f 0.3 -n 1 -j 32 -I gtdb-k21-n10-l150/ -O gtdb.kmcp

  2. For viruses, use -x and -8 to control index size of the largest chunks:

       kmcp index -f 0.05 -n 1 -j 32 -x 100K -8 1M \
           -I genbank-viral-k21-n10-l150/ -O genbank-viral.kmcp

Usage:
  kmcp index [flags] [-f <fpr>] [-n <hashes>] [-j <blocks>] -I <compute output> -O <kmcp db>

Flags:
  -a, --alias string                 ► Database alias/name. (default: basename of --out-dir). You can
                                     also manually edit it in info file: ${outdir}/__db.yml.
  -b, --block-size int               ► Block size, better be multiple of 64 for large number of input
                                     files. (default: min(#.files/#theads, 8))
  -1, --block-size1-kmers-t string   ► If k-mers of single .unik file exceeds this threshold, an
                                     individual index is created for this file. Supported units: K, M,
                                     G. (default "200M")
  -8, --block-size8-kmers-t string   ► If k-mers of single .unik file exceeds this threshold, block
                                     size is changed to 8. Supported units: K, M, G. (default "20M")
  -X, --block-sizeX int              ► If k-mers of single .unik file exceeds --block-sizeX-kmers-t,
                                     block size is changed to this value. (default 256)
  -x, --block-sizeX-kmers-t string   ► If k-mers of single .unik file exceeds this threshold, block
                                     size is changed to --block-sizeX. Supported units: K, M, G.
                                     (default "10M")
      --dry-run                      ► Dry run, useful for adjusting parameters (highly recommended).
  -f, --false-positive-rate float    ► False positive rate of the bloom filters, range: (0, 1).
                                     (default 0.3)
      --file-regexp string           ► Regular expression for matching files in -I/--in-dir, case
                                     ignored. (default ".unik$")
      --force                        ► Overwrite existed output directory.
  -h, --help                         help for index
  -I, --in-dir string                ► Directory containing .unik files. Directory symlinks are followed.
  -F, --max-open-files int           ► Maximum number of opened files, please use a small value for
                                     hard disk drive storage. (default 256)
  -n, --num-hash int                 ► Number of hash functions in bloom filters. (default 1)
  -O, --out-dir string               ► Output directory. (default: ${indir}.kmcp-db)

search

Search sequences against a database

Attentions:
  1. Input format should be (gzipped) FASTA or FASTQ from files or stdin.
     - Paired-end files should be given via -1/--read1 and -2/--read2.
        kmcp search -d db -1 read_1.fq.gz -2 read_2.fq.gz -o read.tsv.gz
     - Single-end can be given as positional arguments or -1/-2.
        kmcp search -d db file1.fq.gz file2.fq.gz -o result.tsv.gz
    **Single-end mode is recommended for paired-end reads, for higher sensitivity**.
  2. A long query sequence may contain duplicated k-mers, which are
     not removed for short sequences by default. You may modify the
     value of -u/--kmer-dedup-threshold to remove duplicates.
  3. For long reads or contigs, you should split them into short reads
     using "seqkit sliding", e.g.,
         seqkit sliding -s 100 -W 300

Shared flags between "search" and "profile":
  1. -t/--min-query-cov
  2. -f/--max-fpr

Index files loading modes:
  1. Using memory-mapped index files with mmap (default):
      - Faster startup speed when index files are buffered in memory.
      - Multiple KMCP processes can share the memory.
  2. Loading the whole index files into memory (-w/--load-whole-db):
      - This mode occupies a little more memory.
        And multiple KMCP processes can not share the database in memory.
      - It's slightly faster due to the use of physically contiguous memory.
        The speedup is more significant for smaller databases.
      - **Please switch on this flag when searching on computer clusters,
        where the default mmap mode would be very slow for network-attached
        storage (NAS)**.
  3. Low memory mode (--low-mem):
      - Do not load all index files into memory nor use mmap, using file seeking.
      - It's much slower, >4X slower on SSD and would be much slower on HDD disks.
      - Only use this mode for small number of queries or a huge database that
        can't be loaded into memory.

Output format:
  Tab-delimited format with 15 columns:

     1. query,    Identifier of the query sequence
     2. qLen,     Query length
     3. qKmers,   K-mer number of the query sequence
     4. FPR,      False positive rate of the match
     5. hits,     Number of matches
     6. target,   Identifier of the target sequence
     7. chunkIdx, Index of reference chunk
     8. chunks,   Number of reference chunks
     9. tLen,     Reference length
    10. kSize,    K-mer size
    11. mKmers,   Number of matched k-mers
    12. qCov,     Query coverage,  equals to: mKmers / qKmers
    13. tCov,     Target coverage, equals to: mKmers / K-mer number of reference chunk
    14. jacc,     Jaccard index
    15. queryIdx, Index of query sequence, only for merging

  The values of tCov and jacc in results only apply to databases built
  with a single size of k-mer.

Performance tips:
  1. Increase the value of -j/--threads for acceleratation, but values larger
     than the number of CPU cores won't bring extra speedup.
  2. When more threads (>= 1.3 * #blocks) are given, extra workers are
     automatically created.

Examples:
  1. Single-end mode (recommended)

       kmcp search -d gtdb.kmcp -o sample.kmcp@gtdb.kmcp.tsv.gz \
           sample_1.fq.gz sample_2.fq.gz sample_1_unpaired.fq.gz sample_2_unpaired.fq.gz

  2. Paired-end mode

       kmcp search -d gtdb.kmcp -o sample.kmcp@gtdb.kmcp.tsv.gz \
           -1 sample_1.fq.gz -2 sample_2.fq.gz

  3. In computer clusters, where databases are saved in NAS storages.

       kmcp search -w -d gtdb.n16-00.kmcp -o sample.kmcp@gtdb.n16-00.kmcp.tsv.gz \
           sample_1.fq.gz sample_2.fq.gz

Usage:
  kmcp search [flags] [-w] -d <kmcp db> [-t <min-query-cov>] [read1.fq.gz] [read2.fq.gz] [unpaired.fq.gz] [-o read.tsv.gz]

Flags:
  -d, --db-dir string              ► Database directory created by "kmcp index". Please add
                                   -w/--load-whole-db for databases on network-attached storages (NAS),
                                   e.g., a computer cluster environment.
  -D, --default-name-map           ► Load ${db}/__name_mapping.tsv for mapping name first.
  -S, --do-not-sort                ► Do not sort matches of a query.
  -h, --help                       help for search
  -n, --keep-top-scores int        ► Keep matches with the top N scores for a query, 0 for all.
  -K, --keep-unmatched             ► Keep unmatched query sequence information.
  -u, --kmer-dedup-threshold int   ► Remove duplicated kmers for a query with >= X k-mers. (default 256)
  -w, --load-whole-db              ► Load all index files into memory, it's faster for small databases
                                   but needs more memory. Use this for databases on network-attached
                                   storages (NAS). Please read "Index files loading modes" in "kmcp
                                   search -h".
      --low-mem                    ► Do not load all index files into memory nor use mmap, the
                                   searching would be very very slow for a large number of queries.
                                   Please read "Index files loading modes" in "kmcp search -h".
  -f, --max-fpr float              ► Maximum false positive rate of a query. (default 0.01)
  -c, --min-kmers int              ► Minimum number of matched k-mers (sketches). (default 10)
  -t, --min-query-cov float        ► Minimum query coverage, i.e., proportion of matched k-mers and
                                   unique k-mers of a query. (default 0.55)
  -m, --min-query-len int          ► Minimum query length. (default 30)
  -T, --min-target-cov float       ► Minimum target coverage, i.e., proportion of matched k-mers and
                                   unique k-mers of a target.
  -N, --name-map strings           ► Tabular two-column file(s) mapping reference IDs to user-defined
                                   values. Don't use this if you will use the result for metagenomic
                                   profiling which needs the original reference IDs.
  -H, --no-header-row              ► Do not print header row.
  -o, --out-file string            ► Out file, supports and recommends a ".gz" suffix ("-" for
                                   stdout). (default "-")
      --query-id string            ► Custom query Id when using the whole file as a query.
  -g, --query-whole-file           ► Use the whole file as a query, e.g., for genome similarity
                                   estimation against k-mer sketch database.
  -1, --read1 string               ► (Gzipped) read1 file.
  -2, --read2 string               ► (Gzipped) read2 file.
  -s, --sort-by string             ► Sort hits by "qcov", "tcov" or "jacc" (Jaccard Index). (default
                                   "qcov")
      --try-se                     ► If paired-end reads have no hits, re-search with read1, if still
                                   fails, try read2.
  -G, --use-filename               ► Use file name as query ID when using the whole file as a query.

merge

Merge search results from multiple databases

Input:
  *. Searching results of the same reads in different databases.
  *. The order of multiple input reads files should be the same during searching.
  *. When only one input given, we just copy and write to the input file.
     This is friendly to workflows which assume multiple inputs are given.

Example:
    kmcp merge -o search.kmcp.tsv.gz search.kmcp@*.kmcp.tsv.gz

Usage:
  kmcp merge [flags] [-o read.tsv.gz] [<search results> ...]

Flags:
  -n, --field-hits int       ► Field of hits. (default 5)
  -f, --field-queryIdx int   ► Field of queryIdx. (default 15)
  -h, --help                 help for merge
  -H, --no-header-row        ► Do not print header row.
  -o, --out-file string      ► Out file, supports and recommends a ".gz" suffix ("-" for stdout).
                             (default "-")
  -s, --sort-by string       ► Sort hits by "qcov", "tcov" or "jacc" (Jaccard Index). (default "qcov")

profile

Generate the taxonomic profile from search results

Methods:
  1. Reference genomes can be split into chunks when computing
     k-mers (sketches), which could help to increase the specificity
     via a threshold, i.e., the minimum proportion of matched chunks
     (-p/--min-chunks-fraction). (***highly recommended***)
     Another flag -d/--max-chunks-depth-stdev further reduces false positives.
  2. We require a part of the uniquely matched reads of a reference
     having high similarity, i.e., with high confidence for decreasing
     the false positive rate.
  3. We also use the two-stage taxonomy assignment algorithm in MegaPath
     to reduce the false positives of ambiguous matches.
     You can also disable this step by the flag --no-amb-corr.
     If stage 1/4 produces thousands of candidates, you can use
     the flag --no-amb-corr to reduce analysis time, which has very little
     effect on the results.
  4. Abundance are estimated using an Expectation-Maximization (EM) algorithm.
  5. Input files are parsed for multiple times, therefore STDIN is not supported.

Reference:
  1. MegaPath: https://doi.org/10.1186/s12864-020-06875-6

Accuracy notes:
  *. Smaller -t/--min-qcov increase sensitivity at the cost of higher false
     positive rate (-f/--max-fpr) of a query.
  *. We require a part of the uniquely matched reads of a reference
     having high similarity, i.e., with high confidence for decreasing
     the false positive rate.
     E.g., -H >= 0.8 and -P >= 0.1 equals to 90th percentile >= 0.8
     *. -U/--min-hic-ureads,      minimum number, >= 1
     *. -H/--min-hic-ureads-qcov, minimum query coverage, >= -t/--min-qcov
     *. -P/--min-hic-ureads-prop, minimum proportion, higher values
        increase precision at the cost of sensitivity.
  *. -R/--max-mismatch-err and -D/--min-dreads-prop is for determing
     the right reference for ambiguous reads with the algorithm in MegaPath.
  *. --keep-perfect-matches is not recommended, which decreases sensitivity. 
  *. --keep-main-matches is not recommended, which affects accuracy of
     abundance estimation.
  *. -n/--keep-top-qcovs is not recommended, which affects accuracy of
     abundance estimation.

Profiling modes:
  We preset six profiling modes, available with the flag -m/--mode:
    - 0 (for pathogen detection)
    - 1 (higher recall)
    - 2 (high recall)
    - 3 (default)
    - 4 (high precision)
    - 5 (higher precision)

  You can still change the values of some options below as usual.

    options                       m=0    m=1   m=2   m=3    m=4   m=5
    ---------------------------   ----   ---   ---   ----   ---   ----
    -r/--min-chunks-reads         1      5     10    50     100   100
    -p/--min-chunks-fraction      0.2    0.6   0.7   0.8    1     1
    -d/--max-chunks-depth-stdev   10     2     2     2      2     1.5
    -u/--min-uniq-reads           1      2     5     20     50    50
    -U/--min-hic-ureads           1      1     2     5      10    10
    -H/--min-hic-ureads-qcov      0.7    0.7   0.7   0.75   0.8   0.8
    -P/--min-hic-ureads-prop      0.01   0.1   0.2   0.1    0.1   0.15
    --keep-main-matches           true                            
    --max-qcov-gap                0.4                          

Taxonomy data:
  1. Mapping references IDs to TaxIds: -T/--taxid-map
  2. NCBI taxonomy dump files: -X/--taxdump
     For databases built with a custom genome collection, you can use
     "taxonkit create-taxdump" (https://github.com/shenwei356/taxonkit)
     to create NCBI-style taxdump files, which also generates a TaxId mapping file.

Performance notes:
  1. Searching results are parsed in parallel, and the number of
     lines proceeded by a thread can be set by the flag --line-chunk-size.
  2. However using a lot of threads does not always accelerate
     processing, 4 threads with a chunk size of 500-5000 is fast enough.
 *3. If stage 1/4 produces thousands of candidates, then stage 2/4
     would be very slow. You can use the flag --no-amb-corr to disable
     ambiguous reads correction which has very little effect on the results.

Profiling output formats:
  1. KMCP      (-o/--out-file)
     Note that: abundances are only computed for target references rather than
     each taxon at all taxonomic ranks, so please also output CAMI or MetaPhlAn format.
  2. CAMI      (-M/--metaphlan-report, --metaphlan-report-version,
                -s/--sample-id, --taxonomy-id)
     Related tools (https://github.com/shenwei356/taxonkit):
       - taxonkit profile2cami: convert any metagenomic profile table with
         TaxIds to CAMI format. Use this if you forget to output CAMI format.
       - taxonkit cami-filter: remove taxa of given TaxIds and their
         descendants in a CAMI metagenomic profile.
  3. MetaPhlAn (-C/--cami-report, -s/--sample-id)

KMCP format:
  Tab-delimited format with 17 columns:

     1. ref,                Identifier of the reference genome
     2. percentage,         Relative abundance of the reference
     3. coverage,           Average coverage of the reference
     4. score,              The 90th percentile of qCov of uniquely matched reads
     5. chunksFrac,         Genome chunks fraction
     6. chunksRelDepth,     Relative depths of reference chunks
     7. chunksRelDepthStd,  The standard deviation of chunksRelDepth
     8. reads,              Total number of matched reads of this reference
     9. ureads,             Number of uniquely matched reads
    10. hicureads,          Number of uniquely matched reads with high-confidence
    11. refsize,            Reference size
    12. refname,            Reference name, optional via name mapping file
    13. taxid,              TaxId of the reference
    14. rank,               Taxonomic rank
    15. taxname,            Taxonomic name
    16. taxpath,            Complete lineage
    17. taxpathsn,          Corresponding TaxIds of taxa in the complete lineage

Taxonomic binning formats:
  1. CAMI      (-B/--binning-result)

Examples:
  1. Default mode:

       kmcp profile -X taxdump/ -T taxid.map -m 3 \
           sample.kmcp.tsv.gz -o sample.k.profile \
           -C sample.c.profile -s sample

  2. For pathogen detection (you may create databases with lower FPRs,
     e.g., kmcp index -f 0.1 -n 2 for bacteria and fungi genomes,
     and search with a low k-mer coverage threshold -t 0.4):

       kmcp profile -X taxdump/ -T taxid.map -m 3 -t 0.4 \
           sample.kmcp.tsv.gz -o sample.k.profile

Usage:
  kmcp profile [flags] [-X <taxdump dir>] [-T <taxid.map>] [-m <mode>] [-o <kmcp profile>] <search results>

Flags:
  -I, --abund-max-iters int               ► Miximal iteration of abundance estimation. (default 10)
      --abund-pct-threshold float         ► If the percentage change of the predominant target is
                                          smaller than this threshold, stop the iteration. (default 0.01)
  -B, --binning-result string             ► Save extra binning result in CAMI report.
  -C, --cami-report string                ► Save extra CAMI-like report.
      --debug string                      ► Debug output file.
  -F, --filter-low-pct float              ► Filter out predictions with the smallest relative
                                          abundances summing up X%. Range: [0,100).
  -h, --help                              help for profile
      --keep-main-matches                 ► Only keep main matches, abandon matches with sharply
                                          decreased qcov (> --max-qcov-gap).
      --keep-perfect-matches              ► Only keep the perfect matches (qcov == 1) if there are.
  -n, --keep-top-qcovs int                ► Keep matches with the top N qcovs for a query, 0 for all.
      --level string                      ► Level to estimate abundance at. Available values: species,
                                          strain/assembly. (default "species")
      --line-chunk-size int               ► Number of lines to process for each thread, and 4 threads
                                          is fast enough. Type "kmcp profile -h" for details. (default 5000)
  -d, --max-chunks-depth-stdev float      ► Maximum standard deviation of relative depths of all
                                          chunks. (default 2)
  -f, --max-fpr float                     ► Maximum false positive rate of a read in search result.
                                          (default 0.01)
  -R, --max-mismatch-err float            ► Maximum error rate of a read being matched to a wrong
                                          reference, for determing the right reference for ambiguous
                                          reads. Range: (0, 1). (default 0.05)
      --max-qcov-gap float                ► Max qcov gap between adjacent matches. (default 0.4)
  -M, --metaphlan-report string           ► Save extra metaphlan-like report.
      --metaphlan-report-version string   ► Metaphlan report version (2 or 3) (default "3")
  -p, --min-chunks-fraction float         ► Minimum fraction of matched reference chunks with reads >=
                                          -r/--min-chunks-reads. (default 0.8)
  -r, --min-chunks-reads int              ► Minimum number of reads for a reference chunk. (default 50)
  -D, --min-dreads-prop float             ► Minimum proportion of distinct reads, for determing the
                                          right reference for ambiguous reads. Range: (0, 1). (default 0.05)
  -U, --min-hic-ureads int                ► Minimum number of high-confidence uniquely matched reads
                                          for a reference. (default 5)
  -P, --min-hic-ureads-prop float         ► Minimum proportion of high-confidence uniquely matched
                                          reads. (default 0.1)
  -H, --min-hic-ureads-qcov float         ► Minimum query coverage of high-confidence uniquely matched
                                          reads. (default 0.75)
  -t, --min-query-cov float               ► Minimum query coverage of a read in search result.
                                          (default 0.55)
  -u, --min-uniq-reads int                ► Minimum number of uniquely matched reads for a reference.
                                          (default 20)
  -m, --mode int                          ► Profiling mode, type "kmcp profile -h" for details.
                                          available values: 0 (for pathogen detection), 1
                                          (higherrecall), 2 (high recall), 3 (default), 4 (high
                                          precision), 5 (higher precision). (default 3)
  -N, --name-map strings                  ► Tabular two-column file(s) mapping reference IDs to
                                          reference names.
      --no-amb-corr                       ► Do not correct ambiguous reads. Use this flag to reduce
                                          analysis time if the stage 1/4 produces thousands of candidates.
      --norm-abund string                 ► Method for normalize abundance of a reference by the
                                          mean/min/max abundance in all chunks, available values: mean,
                                          min, max. (default "mean")
  -o, --out-file string                   ► Out file, supports a ".gz" suffix ("-" for stdout).
                                          (default "-")
      --rank-prefix strings               ► Prefixes of taxon name in certain ranks, used with
                                          --metaphlan-report. (default [k__,p__,c__,o__,f__,g__,s__,t__])
  -s, --sample-id string                  ► Sample ID in result file.
  -S, --separator string                  ► Separator of TaxIds and taxonomy names. (default ";")
      --show-rank strings                 ► Only show TaxIds and names of these ranks. (default
                                          [superkingdom,phylum,class,order,family,genus,species,strain])
  -X, --taxdump string                    ► Directory of NCBI taxonomy dump files: names.dmp,
                                          nodes.dmp, optional with merged.dmp and delnodes.dmp.
  -T, --taxid-map strings                 ► Tabular two-column file(s) mapping reference IDs to TaxIds.
      --taxonomy-id string                ► Taxonomy ID in result file.

utils

Some utilities

Usage:
  kmcp utils [command] 

Available Commands:
  cov2simi      Convert k-mer coverage to sequence similarity
  filter        Filter search results and find species/assembly-specific queries
  index-density Plot the element density of bloom filters for an index file
  index-info    Print information of index files
  merge-regions Merge species/assembly-specific regions
  query-fpr     Compute the false positive rate of a query
  ref-info      Print information of reference chunks in a database
  split-genomes Split genomes into chunks
  unik-info     Print information of .unik files

ref-info

Print information of reference chunks in a database

Columns:

    file,     the base name of index file
    i,        the idx of a reference chunk in the index file, 1-based
    target,   reference name
    chunkIdx, the idx of the chunk, 0-based
    chunks,   the number of chunks of the reference
    kmers,    the number of k-mers of the chunk
    fpr,      the actual false-positive rate of the chunk

Usage:
  kmcp utils ref-info [flags] 

Flags:
  -d, --db-dir string     ► Database directory created by "kmcp index".
  -h, --help              help for ref-info
  -H, --no-header-row     ► Do not print header row.
  -o, --out-file string   ► Out file, supports and recommends a ".gz" suffix ("-" for stdout).
                          (default "-")

index-info

Print information of a index file

Usage:
  kmcp utils index-info [flags] 

Flags:
  -a, --all               ► Show all information.
  -b, --basename          ► Only output basenames of files.
  -h, --help              help for index-info
  -o, --out-file string   ► Out file, supports a ".gz" suffix ("-" for stdout). (default "-")

index-density

Plot the element density of bloom filters for an index file

Purposes:
  1. Checking whether elements (Ones) in bloom filters are uniformly distributed
     via an intuitive grayscale image.

Outputs:
  1. default output (a TSV file), columns:
      1) target:   reference id
      2) chunkIdx: the index of genome chunk
      3) bins:     the number of bins in bloom filters for counting 1s
      4) binSize:  the size/width of a bin
      5) counts:   comma-seperated counts in each bin
  2. the density image (a grayscale JPEG image):
      - X: bins. The width is the number of bins
      - Y: bloom filters, with each representing a genome (chunk).
        The height is the number of names (genome or genome chunks)
      - greyscale/darkness of a pixel: the density of a bin, calculated as:
            255 - 255 * ${the number of 1s in the bin} / ${bin-size}

Examples:
  1. common use:
      kmcp utils index-density gtdb.kmcp/R001/_block001.uniki \
          --bins 1024  --out-file t.tsv --out-img t.jpg
  2. export every bit of each position, the image could fail to create:
      kmcp utils index-density gtdb.kmcp/R001/_block001.uniki \
          --bin-size 1 --out-file t.tsv

Usage:
  kmcp utils index-density [flags]

Flags:
  -s, --bin-size int      ► bin size/width
  -b, --bins int          ► number of bins for counting the number of 1s. (default 1024)
  -h, --help              help for index-density
  -o, --out-file string   ► Out file, supports and recommends a ".gz" suffix ("-" for stdout).
                          (default "-")
      --out-img string    ► Out density image, in format of jpeg

Example: https://bioinf.shenwei.me/kmcp/faq/#are-the-elements-in-the-bloom-filters-uniformly-distributed

unik-info

Print information of .unik file

Tips:
  1. For lots of small files (especially on SDD), use big value of '-j' to
     parallelize counting.

Usage:
  kmcp utils unik-info [flags] 

Flags:
  -a, --all                   ► All information, including the number of k-mers.
  -b, --basename              ► Only output basename of files.
  -h, --help                  help for unik-info
  -o, --out-file string       ► Out file, supports a ".gz" suffix ("-" for stdout). (default "-")
  -e, --skip-err              ► Skip error, only show warning message.
      --symbol-false string   ► Smybol for false. (default "✕")
      --symbol-true string    ► Smybol for true. (default "✓")
  -T, --tabular               ► Output in machine-friendly tabular format.

merge-regions

Merge species/assembly-specific regions

Steps:
  # 1. Simulating reads and searching on one or more databases.
  seqkit sliding --step 10 --window 100 ref.fna.gz \
      | kmcp search -d db1.kmcp -o ref.fna.gz.kmcp@db1.tsv.gz
  seqkit sliding --step 10 --window 100 ref.fna.gz \
      | kmcp search -d db2.kmcp -o ref.fna.gz.kmcp@db2.tsv.gz

  # 2. Merging and filtering searching results
  kmcp merge ref.fna.gz.kmcp@*.tsv.gz \
      | kmcp utils filter -X taxdump -T taxid.map \
            -o ref.fna.gz.kmcp.uniq.tsv.gz

  # 3. Merging regions.
  # Here the value of --min-overlap should be k-1.
  kmcp utils merge-regions --min-overlap 20 ref.fna.gz.kmcp.uniq.tsv.gz \
      -o ref.fna.gz.kmcp.uniq.tsv.gz.bed

Output (BED6 format):
  1. chrom      - chromosome name
  2. chromStart - starting position (0-based)
  3. chromEnd   - ending position (0-based)
  4. name       - "species-specific" or "assembly-specific"
  5. score      - 0-1000, 1000 for "assembly-specific", others for ""species-specific"
  6. strand     - "."

Performance notes:
  1. Searching results are parsed in parallel, and the number of
     lines proceeded by a thread can be set by the flag --line-chunk-size.
  2. However using a lot of threads does not always accelerate
     processing, 4 threads with a chunk size of 500-5000 is fast enough.

Usage:
  kmcp utils merge-regions [flags]

Flags:
  -h, --help                   help for merge-regions
  -I, --ignore-type            ► Merge species and assembly-specific regions.
      --line-chunk-size int    ► Number of lines to process for each thread, and 4 threads is fast
                               enough. Type "kmcp utils merge-regions -h" for details. (default 5000)
  -f, --max-fpr float          ► Maximum false positive rate of a read in search result. (default 0.05)
  -g, --max-gap int            ► Maximum distance of starting positions of two adjacent regions, 0 for
                               no limitation, 1 for no merging.
  -l, --min-overlap int        ► Minimum overlap of two adjacent regions, recommend K-1. (default 1)
  -t, --min-query-cov float    ► Minimum query coverage of a read in search result. (default 0.55)
  -a, --name-assembly string   ► Name of assembly-specific regions. (default "assembly-specific")
  -s, --name-species string    ► Name of species-specific regions. (default "species-specific")
  -o, --out-file string        ► Out file, supports and recommends a ".gz" suffix ("-" for stdout).
                               (default "-")
  -r, --regexp string          ► Regular expression for extract reference name and query locations.
                               (default "^(.+)_sliding:(\\d+)\\-(\\d+)$")

filter

Filter search results and find species/assembly-specific queries

Taxonomy data:
  1. Mapping references IDs to TaxIds: -T/--taxid-map
  2. NCBI taxonomy dump files: -X/--taxdump

Performance notes:
  1. Searching results are parsed in parallel, and the number of
     lines proceeded by a thread can be set by the flag --line-chunk-size.
  2. However using a lot of threads does not always accelerate
     processing, 4 threads with a chunk size of 500-5000 is fast enough.

Usage:
  kmcp utils filter [flags]

Flags:
  -h, --help                  help for filter
      --level string          ► Level to filter. available values: species, strain/assembly. (default
                              "species")
      --line-chunk-size int   ► Number of lines to process for each thread, and 4 threads is fast
                              enough. Type "kmcp utils filter" for details. (default 5000)
  -f, --max-fpr float         ► Maximum false positive rate of a read in search result. (default 0.05)
  -t, --min-query-cov float   ► Minimum query coverage of a read in search result. (default 0.55)
  -H, --no-header-row         ► Do not print header row.
  -o, --out-file string       ► Out file, supports and recommends a ".gz" suffix ("-" for stdout).
                              (default "-")
  -X, --taxdump string        ► Directory of NCBI taxonomy dump files: names.dmp, nodes.dmp, optional
                              with merged.dmp and delnodes.dmp.
  -T, --taxid-map strings     ► Tabular two-column file(s) mapping reference IDs to TaxIds.

split-genomes

Split genomes into chunks

This command acts like 'kmcp compute' with many same options/flags shared,
but it only performs genome splitting and does not compute k-mers. Genome
chunks will be saved into the output directory with one file for a chunk.

One single input file or a directory with one single genome file is preferred.

Warning (experimental feature):
  If more than one genome files are given, the "reference genome" with the least
and longest sequence(s) will be chosen and split into chunks. Then other genomes
are fragmented and each genome fragment is assigned to the most similar genome
chunk of the reference genome.

Usage:
  kmcp utils split-genomes [flags] [-k <k>] [-n <chunks>] [-l <overlap>] {[-I <seqs dir>] | <seq files>} -O <out dir>

Flags:
      --circular                  ► Input sequences are circular. Note that it only applies to genomes
                                  with a single chromosome.
  -r, --file-regexp string        ► Regular expression for matching sequence files in -I/--in-dir,
                                  case ignored. (default "\\.(f[aq](st[aq])?|fna)(.gz)?$")
      --force                     ► Overwrite existed output directory.
  -f, --frag-size int             ► size of sequence fragments to be assigned to the reference genome
                                  chunks. (default 100)
  -h, --help                      help for split-genomes
  -I, --in-dir string             ► Directory containing FASTA files. Directory symlinks are followed.
      --info-file string          ► An extra output file to show which chunk(s) are assigned to for
                                  each genome fragment.
  -k, --kmer int                  ► K-mer size. (default 21)
  -O, --out-dir string            ► Output directory.
  -B, --seq-name-filter strings   ► List of regular expressions for filtering out sequences by
                                  header/name, case ignored.
  -m, --split-min-ref int         ► Only splitting sequences >= X bp. (default 1000)
  -n, --split-number int          ► Chunk number for splitting sequences, incompatible with
                                  -s/--split-size.
  -l, --split-overlap int         ► Chunk overlap for splitting sequences. The default value will be
                                  set to k-1 unless you change it.

cov2simi

Convert k-mer coverage to sequence similarity

    similarity = 87.456 + 26.410*qcov - 22.008*qcov*qcov + 7.325*qcov*qcov*qcov

Usage:
  kmcp utils cov2simi [flags]

Flags:
  -h, --help              help for cov2simi
  -o, --out-file string   ► Out file, supports a ".gz" suffix ("-" for stdout). (default "-")
  -t, --query-cov float   ► K-mer query coverage, i.e., proportion of matched k-mers and unique k-mers
                          of a query. range: [0, 1]

query-fpr

Compute the false positive rate of a query

When the flag '-a/--all' is given, the Chernoff bound (column 'cbound')
is also output along with input parameters.

> Given K ≥ p, Solomon and Kingsford also apply a Chernoff bound and
show that the false positive probability for a query to be detected
in a document is ≤ exp(−l(K − p)^2 /(2(1 − p)))

Reference:
  1. Theorem 2 in https://doi.org/10.1038/nbt.3442
  2. Theorem 1 in https://arxiv.org/abs/1905.09624v2

Usage:
  kmcp utils query-fpr [flags] 

Flags:
  -H, --add-header                  ► Add header line (column names
  -a, --all                         ► Also show the value of -f, -n, and -t
  -f, --false-positive-rate float   ► False positive rate of a single k-mer, i.e., FPR of the bloom
                                    filters in the database. range: (0, 1) (default 0.3)
  -h, --help                        help for query-fpr
  -m, --matched-kmers int           ► The number of matched k-mers of a query. (default 35)
  -n, --num-kmers int               ► Number of unique k-mers of the query. (default 70)
  -o, --out-file string             ► Out file, supports a ".gz" suffix ("-" for stdout). (default "-")

autocompletion

Generate shell autocompletion script

Supported shell: bash|zsh|fish|powershell

Bash:

    # generate completion shell
    kmcp autocompletion --shell bash

    # configure if never did.
    # install bash-completion if the "complete" command is not found.
    echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
    echo "source ~/.bash_completion" >> ~/.bashrc

Zsh:

    # generate completion shell
    kmcp autocompletion --shell zsh --file ~/.zfunc/_kmcp

    # configure if never did
    echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
    echo "autoload -U compinit; compinit" >> ~/.zshrc

fish:

    kmcp autocompletion --shell fish --file ~/.config/fish/completions/kmcp.fish

Usage:
  kmcp autocompletion [flags]

Flags:
      --file string    autocompletion file (default "/home/shenwei/.bash_completion.d/kmcp.sh")
  -h, --help           help for autocompletion
      --shell string   autocompletion type (bash|zsh|fish|powershell) (default "bash")