Frequently Asked Questions
General
Does KMCP support long read metagenomic profiling?
Note: This was asked here.
KMCP is only suitable for short-read metagenomic profiling; its sensitivity on long-read datasets is much lower. The initial plan was to support both short and long reads, but the read-matching strategy, i.e., keeping reads with enough (>= 55% by default) k-mers contained in a genome chunk, has low sensitivity for long reads, even for HiFi reads.
Several strategies were tried, but the results did not meet expectations.
- Setting a lower similarity threshold. For our probabilistic data structure, lower thresholds significantly increase the false-positive rate of a read match, though the FPR can be reduced at the cost of a bigger database.
- Using sketching algorithms. Scaled MinHash, Closed Syncmers, and Minimizers were all implemented (available in the current version) and tested, but they did not work well on error-prone long reads, showing lower sensitivity. Though tools like minimap2 benefit from minimizers with location information for seeding and chaining in sequence alignment, we failed to utilize them in taxonomic profiling.
- Using multiple k-mer sizes. K-mers of different lengths, e.g., 17, 21, and 31, did not perform better than a single value and doubled the database size.
- Using SimHash, which tolerates base substitutions better than k-mers. Unexpectedly, it was slower and less sensitive.
- Breaking long reads into short fragments. This only applies to HiFi reads, and the strength of long reads is wasted.
How can I run KMCP on a computer without enough main memory?
By default, `kmcp search` loads the whole database into main memory (RAM) for fast searching. Optionally, the flag `--low-mem` can be set to avoid loading the whole database, but searching is then much slower: more than 10X slower on SSDs, and presumably worse on HDDs.
Another way is to divide the reference genomes into several parts and build a smaller database for each part, so that the biggest database can be loaded into RAM. After searching, the results from all the small databases can be merged with `kmcp merge` for downstream analysis.
Building small databases can also accelerate searching on a computer cluster, where every node searches against one part of the database, as in the sketch below.
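For example, a minimal sketch with hypothetical database and file names, where the genomes were split into two parts and two small databases were built:

```
# search against each small database separately (e.g., on different cluster nodes)
kmcp search -d part1.kmcp reads.fq.gz -o reads.part1.kmcp.tsv.gz
kmcp search -d part2.kmcp reads.fq.gz -o reads.part2.kmcp.tsv.gz

# merge the search results for downstream profiling
kmcp merge reads.part1.kmcp.tsv.gz reads.part2.kmcp.tsv.gz -o reads.kmcp.tsv.gz
```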
Database building
Can I create a database with custom genome collections for profiling?
Yes, you can use `taxonkit create-taxdump` to create NCBI-style taxdump files for profiling, which also generates a `taxid.map` file.
What k-mer size should I use to build the database?
Multiple k-mer sizes are supported, but a single value is good enough. Note that k must be equal to or less than 64, due to a bug in older versions of ntHash.
Bigger k-mer sizes bring higher specificity at the cost of lower sensitivity. k = 21 is recommended for metagenomic profiling.
How to add new genomes to the database?
KMCP builds databases very fast. You can either rebuild the database after adding new genomes, or create a separate database with only the new genomes, search against both databases, and merge the results, as sketched below.
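A minimal sketch of the second route, with hypothetical directory and database names (parameter values are illustrative; see the database-building documentation for recommended ones):

```
# compute k-mers of the new genomes (k = 21, 10 chunks per genome, 150-bp overlap)
kmcp compute -I new-genomes/ -O new-genomes-k21-n10 -k 21 -n 10 -l 150

# index the k-mers into a separate database
kmcp index -I new-genomes-k21-n10 -O new.kmcp -n 1 -f 0.3

# search against the old and the new databases, then merge the results
kmcp search -d old.kmcp reads.fq.gz -o reads.old.kmcp.tsv.gz
kmcp search -d new.kmcp reads.fq.gz -o reads.new.kmcp.tsv.gz
kmcp merge reads.old.kmcp.tsv.gz reads.new.kmcp.tsv.gz -o reads.kmcp.tsv.gz
```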
Unexpected EOF error
Some files may have been corrupted during downloading. We recommend checking sequence-file integrity with seqkit (`gzip -t` failed for some files in my tests).
- List corrupted files:

    ```
    # corrupted files
    find $genomes -name "*.gz" \
        | rush 'seqkit seq -w 0 {} > /dev/null; if [ $? -ne 0 ]; then echo {}; fi' \
        > failed.txt

    # empty files
    find $genomes -name "*.gz" -size 0 >> failed.txt
    ```
- Delete these files:

    ```
    cat failed.txt | rush '/bin/rm {}'
    ```
- Redownload these files. For example, using the rush cache file `download.rush` (a list of successfully executed commands, one command per line):

    ```
    grep -f failed.txt download.rush \
        | sed 's/__CMD__//g' \
        | rush '{}'
    ```
For genome_updater, the URLs of the genomes can be found in files like `2021-09-30_13-32-30_url_downloaded.txt`. You can extract the URLs of the failed files with `grep` (or in some other way) and batch-redownload them using `parallel`, as sketched below.
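A minimal sketch (file names hypothetical; the grep step assumes failed.txt stores file paths whose base names appear in the URLs):

```
# strip directories so the patterns match the URLs
sed 's/.*\///' failed.txt > failed-names.txt

# extract the URLs of the failed files from genome_updater's log
cat *url_downloaded.txt | grep -f failed-names.txt > failed-urls.txt

# batch-redownload, 4 downloads at a time
cat failed-urls.txt | parallel -j 4 'wget -c {}'
```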
Are the elements in the bloom filters uniformly distributed?
Someone asked me this when I was giving a talk on KMCP. The answer is yes.
I created a new command, `kmcp utils index-density`, to plot the density of an index file.
In the image:

- X: bins. The width of the image is the number of bins.
- Y: bloom filters, each representing a genome (chunk). The height of the image is the number of names (genomes or genome chunks).
- Greyscale/darkness of a pixel: the density of a bin, calculated as: 255 - 255 * ${the number of 1s in the bin} / ${bin-size}.
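As a tiny worked example of the formula (numbers made up): a 1000-bit bin containing 300 one-bits gets a greyscale value of 255 - 255 * 300 / 1000 ≈ 179, i.e., denser bins appear darker:

```
# hypothetical numbers: a 1000-bit bin with 300 one-bits
bin_size=1000
ones=300
echo $(( 255 - 255 * ones / bin_size ))   # 179; 0 = black (all ones), 255 = white (all zeros)
```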
Here are some example outputs.
- v2023.05-genbank-viral-_block001.uniki (FPR of bloom filter: 0.3); only a part of the image is shown.

    Note that the genome sizes (in base pairs) in this block vary widely, so the preceding bloom filters are relatively sparse.
    Information of k-mers:

    ```
    kmcp utils ref-info -d genbank-viral.kmcp \
        | csvtk grep -t -p _block001.uniki \
        | csvtk cut -t -f -2 \
        | csvtk head -n 10 \
        | csvtk csv2md -t
    ```
    | file            | target          | chunkIdx | chunks | kmers | fpr      |
    |:----------------|:----------------|:---------|:-------|:------|:---------|
    | _block001.uniki | GCA_008766775.1 | 0        | 1      | 145   | 0.022419 |
    | _block001.uniki | GCA_003330005.1 | 8        | 10     | 150   | 0.023183 |
    | _block001.uniki | GCA_002814895.1 | 0        | 1      | 154   | 0.023794 |
    | _block001.uniki | GCA_001967255.1 | 9        | 10     | 165   | 0.025471 |
    | _block001.uniki | GCA_002219745.1 | 4        | 10     | 197   | 0.030336 |
    | _block001.uniki | GCA_000839085.1 | 0        | 1      | 200   | 0.030790 |
    | _block001.uniki | GCA_000919055.1 | 9        | 10     | 207   | 0.031851 |
    | _block001.uniki | GCA_000848385.1 | 6        | 10     | 207   | 0.031851 |
    | _block001.uniki | GCA_002820565.1 | 9        | 10     | 208   | 0.032002 |
    | _block001.uniki | GCA_002820505.1 | 9        | 10     | 208   | 0.032002 |

    Information of bloom filters:
    ```
    kmcp utils index-info -b genbank-viral.kmcp/R001/_block001.uniki | csvtk csv2md -t
    ```
    | file            | k  | canonical | num-hashes | num-sigs | num-names |
    |:----------------|:---|:----------|:-----------|:---------|:----------|
    | _block001.uniki | 21 | true      | 1          | 6395     | 10400     |

- v2023.05-refseq-fungi-_block001.uniki (FPR of bloom filter: 0.3)
    Information of k-mers:

    ```
    kmcp utils ref-info -d refseq-fungi.kmcp \
        | csvtk grep -t -p _block001.uniki \
        | csvtk cut -t -f -2 \
        | csvtk head -n 10 \
        | csvtk csv2md -t
    ```
    | file            | target          | chunkIdx | chunks | kmers  | fpr      |
    |:----------------|:----------------|:---------|:-------|:-------|:---------|
    | _block001.uniki | GCF_000277815.2 | 0        | 10     | 207084 | 0.094334 |
    | _block001.uniki | GCF_000146465.1 | 0        | 10     | 209422 | 0.095346 |
    | _block001.uniki | GCF_000091225.2 | 0        | 10     | 209787 | 0.095504 |
    | _block001.uniki | GCF_000280035.1 | 8        | 10     | 216972 | 0.098608 |
    | _block001.uniki | GCF_000280035.1 | 9        | 10     | 217529 | 0.098849 |
    | _block001.uniki | GCF_000280035.1 | 3        | 10     | 217631 | 0.098892 |
    | _block001.uniki | GCF_000280035.1 | 6        | 10     | 218560 | 0.099293 |
    | _block001.uniki | GCF_000280035.1 | 7        | 10     | 218763 | 0.099380 |
    | _block001.uniki | GCF_000280035.1 | 4        | 10     | 218830 | 0.099409 |
    | _block001.uniki | GCF_000280035.1 | 0        | 10     | 218860 | 0.099422 |

    Information of bloom filters:
    ```
    kmcp utils index-info -b refseq-fungi.kmcp/R001/_block001.uniki | csvtk csv2md -t
    ```
    | file            | k  | canonical | num-hashes | num-sigs | num-names |
    |:----------------|:---|:----------|:-----------|:---------|:----------|
    | _block001.uniki | 21 | true      | 1          | 2089979  | 160       |

- v2023.05-refseq-fungi-_block002.uniki (FPR of bloom filter: 0.3)
    Information of k-mers:

    | file            | target          | chunkIdx | chunks | kmers  | fpr      |
    |:----------------|:----------------|:---------|:-------|:-------|:---------|
    | _block002.uniki | GCF_000349005.2 | 5        | 10     | 745638 | 0.249355 |
    | _block002.uniki | GCF_000349005.2 | 2        | 10     | 746237 | 0.249528 |
    | _block002.uniki | GCF_000349005.2 | 9        | 10     | 747586 | 0.249917 |
    | _block002.uniki | GCF_000349005.2 | 0        | 10     | 748331 | 0.250132 |
    | _block002.uniki | GCF_000349005.2 | 7        | 10     | 749497 | 0.250469 |
    | _block002.uniki | GCF_001477545.1 | 5        | 10     | 751685 | 0.251099 |
    | _block002.uniki | GCF_002251995.1 | 3        | 10     | 753439 | 0.251604 |
    | _block002.uniki | GCF_001477545.1 | 4        | 10     | 755317 | 0.252145 |
    | _block002.uniki | GCF_001477545.1 | 0        | 10     | 758418 | 0.253036 |
    | _block002.uniki | GCF_001477545.1 | 8        | 10     | 760460 | 0.253623 |

    Information of bloom filters:

    | file            | k  | canonical | num-hashes | num-sigs | num-names |
    |:----------------|:---|:----------|:-----------|:---------|:----------|
    | _block002.uniki | 21 | true      | 1          | 2599648  | 160       |
Some other useful genome catalogs
Searching
Why is the CPU usage very low, not 100%?
Please check the log (in the terminal, not the log file written via the option `--log`); it reports the current search speed, like:
```
21:07:42.949 [INFO] reading sequence file: t_t3.fa.gz
processed queries: 1253376, speed: 8.127 million queries per minute
```
If the speed is very slow, check the following:
- Are you running KMCP on a computer cluster (HPC)?
    - If yes, please switch on `-w/--load-whole-db`, because the default database loading mode can be very slow on network-attached storage (NAS).
- Are the reference genomes highly similar? E.g., tens of thousands of genomes of the same species?
    - If yes, check the search result to see whether there are thousands of matches for a read (see the sketch after this list). You may want to choose a graph-based sequence searching tool instead.
- Are you searching with metatranscriptomic data?
    - If yes, the search results would show that a huge number of reads from 16S rRNA genes have thousands of matches; writing these results slows down the search. Such reads should be filtered out before searching, using tools like https://github.com/hzi-bifo/RiboDetector.
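A quick way to spot such reads is to count the matches per read in the search result, a sketch with coreutils (the file name is hypothetical, and query IDs are assumed to be in the first column):

```
# list the reads with the most matches
zcat reads.kmcp.tsv.gz \
    | grep -v '^#' \
    | cut -f 1 \
    | sort \
    | uniq -c \
    | sort -k 1,1nr \
    | head
```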
Can I run multiple KMCP processes on a machine?
Yes, you can. But note that `kmcp search` is CPU- and RAM-intensive, so please limit the number of CPU cores used by each process with the flag `-j/--threads`, whose default value is the number of available CPU cores of the machine.
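For example, on a 64-core machine, two searches could run concurrently with 32 threads each (database and file names hypothetical; also make sure both databases fit in RAM together):

```
kmcp search -j 32 -d db1.kmcp reads.fq.gz -o reads.db1.kmcp.tsv.gz &
kmcp search -j 32 -d db2.kmcp reads.fq.gz -o reads.db2.kmcp.tsv.gz &
wait
```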
Profiling
Where is the taxid.map?
Each prebuilt database contains a `taxid.map` file in its directory. You can concatenate them into a big one:
```
$ cat gtdb.kmcp/taxid.map refseq-viral.kmcp/taxid.map refseq-fungi.kmcp/taxid.map > taxid.map

$ head -n 5 taxid.map
GCA_004025655.1 10243
GCA_004025395.1 10243
GCA_004025355.1 10243
GCA_003971405.1 10243
GCA_003971385.1 10243
```
Or set the option `-T/--taxid-map` multiple times:

```
kmcp profile -T gtdb.kmcp/taxid.map -T refseq-viral.kmcp/taxid.map -T refseq-fungi.kmcp/taxid.map ...
```
For other custom genome collections, you can use `taxonkit create-taxdump` to create NCBI-style taxdump files for a custom taxonomy, e.g., GTDB and ICTV; it also generates a `taxid.map` file. A sketch is shown below.
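For example, for GTDB taxonomy files (flags as documented for taxonkit create-taxdump; check `taxonkit create-taxdump -h`, as details may differ between versions):

```
# create NCBI-style taxdump files (and taxid.map) from GTDB taxonomy files
taxonkit create-taxdump --gtdb \
    ar53_taxonomy.tsv.gz bac120_taxonomy.tsv.gz \
    --out-dir gtdb-taxdump/
```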
Unknown taxid?
```
19:54:54.632 [ERRO] unknown taxid for NZ_CP028116.1, please check taxid mapping file(s)
```
If `kmcp profile` reports this, you need to check whether the taxid mapping file(s) contain all the reference IDs, and make sure the reference IDs match those in the database; the latter are listed in:
```
$ head -n 5 $kmcp_db_dir/R001/__name_mapping.tsv
NC_013654.1     NC_013654.1
NC_000913.3     NC_000913.3
NC_010655.1     NC_010655.1
NC_012971.2     NC_012971.2
NC_011750.1     NC_011750.1
```
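A quick sketch with coreutils to list reference IDs that are present in the database but missing from the taxid mapping file (paths hypothetical):

```
# IDs in the database
cut -f 1 $kmcp_db_dir/R001/__name_mapping.tsv | sort -u > db-ids.txt

# IDs covered by the taxid mapping file
cut -f 1 taxid.map | sort -u > map-ids.txt

# IDs in the database but absent from taxid.map
comm -23 db-ids.txt map-ids.txt
```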
There's another case: you used `--name-map` in `kmcp search`. Please don't do this if you will use the search result for metagenomic profiling, which needs the original reference IDs. A note has been added to the help message:
```
-N, --name-map strings   ► Tabular two-column file(s) mapping reference IDs to user-defined
                           values. Don't use this if you will use the result for metagenomic
                           profiling which needs the original reference IDs.
```
How to tune individual options when using preset profiling modes?
In older versions, this was not supported due to a limitation of the command-line argument parser, and you had to explicitly set all relevant options of a mode. It has been available since v0.8.2: set a preset mode and simply override the options you want to change, as in the example below.
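For example (a sketch with hypothetical file names; `--min-query-cov` is assumed to be one of the options preset by the modes):

```
# use preset mode 3 but override a single option (v0.8.2 or later)
kmcp profile -m 3 --min-query-cov 0.6 \
    -X taxdump/ -T taxid.map \
    reads.kmcp.tsv.gz -o reads.k.profile
```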