Frequently Asked Questions
General
Does KMCP support long read metagenomic profiling?
Note: This was asked here.
KMCP is only suitable for short-read metagenomic profiling; its sensitivity on long-read datasets is much lower. The initial plan was to support both short and long reads, but the read-matching strategy, i.e., keeping reads with enough (>= 55% by default) k-mers contained in a genome chunk, has low sensitivity for long reads, even for HiFi reads.
Several strategies were tried, but the results did not meet expectations.
- Setting a lower similarity threshold. For our probabilistic data structure, lower thresholds significantly increase the false-positive rate of a read match, though the FPR can be reduced at the cost of a bigger database.
- Using sketching algorithms. Scaled MinHash, Closed Syncmers, and Minimizers were all implemented (available in the current version) and tested, but they did not work well on error-prone long reads, showing lower sensitivity. Though tools like minimap2 benefit from minimizers with location information for seeding and chaining in sequence alignment, we failed to utilize them in taxonomic profiling.
- Using multiple k-mer sizes. K-mers of different lengths, e.g., 17, 21, and 31, did not perform better than a single value and doubled the database size.
- Using SimHash, which tolerates base substitutions better than k-mers. Unexpectedly, it was slower and less sensitive.
- Breaking long reads into short fragments. This only applies to HiFi reads, and the strength of long reads is wasted.
How can I run KMCP on a computer without enough main memory?
By default, `kmcp search` loads the whole database into main memory (RAM) for fast searching. Optionally, the flag `--low-mem` can be set to avoid loading the whole database, but searching is then much slower: more than 10X slower on SSDs, and presumably worse on HDDs.
Another way is to divide the reference genomes into several parts and build a smaller database for each part, so that the biggest database can be loaded into RAM. After searching, the results from all the small databases can be merged with `kmcp merge` for downstream analysis.
Building small databases can also accelerate searching on a computer cluster, where every node searches against one part of the database, as in the sketch below.
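For example, a minimal sketch with hypothetical database and file names, where the genomes were split into two parts and two small databases were built:

```
# search against each small database separately (e.g., on different cluster nodes)
kmcp search -d part1.kmcp reads.fq.gz -o reads.part1.kmcp.tsv.gz
kmcp search -d part2.kmcp reads.fq.gz -o reads.part2.kmcp.tsv.gz

# merge the search results for downstream profiling
kmcp merge reads.part1.kmcp.tsv.gz reads.part2.kmcp.tsv.gz -o reads.kmcp.tsv.gz
```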
Database building
Can I create a database with custom genome collections for profiling?
Yes, you can use `taxonkit create-taxdump` to create NCBI-style taxdump files for profiling, which also generates a `taxid.map` file.
What k-mer size should I use to build the database?
Multiple k-mer sizes are supported, but a single value is good enough. Note that k must be equal to or less than 64, due to a bug in older versions of ntHash.
Bigger k-mer sizes bring higher specificity at the cost of lower sensitivity. k = 21 is recommended for metagenomic profiling.
How to add new genomes to the database?
KMCP builds databases very fast. You can either rebuild the database after adding new genomes, or create a separate database with only the new genomes, search against both databases, and merge the results, as sketched below.
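A minimal sketch of the second route, with hypothetical directory and database names (parameter values are illustrative; see the database-building documentation for recommended ones):

```
# compute k-mers of the new genomes (k = 21, 10 chunks per genome, 150-bp overlap)
kmcp compute -I new-genomes/ -O new-genomes-k21-n10 -k 21 -n 10 -l 150

# index the k-mers into a separate database
kmcp index -I new-genomes-k21-n10 -O new.kmcp -n 1 -f 0.3

# search against the old and the new databases, then merge the results
kmcp search -d old.kmcp reads.fq.gz -o reads.old.kmcp.tsv.gz
kmcp search -d new.kmcp reads.fq.gz -o reads.new.kmcp.tsv.gz
kmcp merge reads.old.kmcp.tsv.gz reads.new.kmcp.tsv.gz -o reads.kmcp.tsv.gz
```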
Unexpected EOF error
Some files may have been corrupted during downloading. We recommend checking sequence-file integrity with seqkit (`gzip -t` failed for some files in my tests).
- List corrupted files:

    ```
    # corrupted files
    find $genomes -name "*.gz" \
        | rush 'seqkit seq -w 0 {} > /dev/null; if [ $? -ne 0 ]; then echo {}; fi' \
        > failed.txt

    # empty files
    find $genomes -name "*.gz" -size 0 >> failed.txt
    ```
- Delete these files:

    ```
    cat failed.txt | rush '/bin/rm {}'
    ```
- Redownload these files. For example, using the rush cache file `download.rush` (a list of successfully executed commands, one command per line):

    ```
    grep -f failed.txt download.rush \
        | sed 's/__CMD__//g' \
        | rush '{}'
    ```
For genome_updater, the URLs of the genomes can be found in files like `2021-09-30_13-32-30_url_downloaded.txt`. You can extract the URLs of the failed files with `grep` (or in some other way) and batch-redownload them using `parallel`, as sketched below.
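A minimal sketch (file names hypothetical; the grep step assumes failed.txt stores file paths whose base names appear in the URLs):

```
# strip directories so the patterns match the URLs
sed 's/.*\///' failed.txt > failed-names.txt

# extract the URLs of the failed files from genome_updater's log
cat *url_downloaded.txt | grep -f failed-names.txt > failed-urls.txt

# batch-redownload, 4 downloads at a time
cat failed-urls.txt | parallel -j 4 'wget -c {}'
```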
Are the elements in the bloom filters uniformly distributed?
Someone asked me this when I was giving a talk on KMCP. The answer is yes.
I created a new command, `kmcp utils index-density`, to plot the density of an index file.
In the image:

- X: bins. The width of the image is the number of bins.
- Y: bloom filters, each representing a genome (chunk). The height of the image is the number of names (genomes or genome chunks).
- Greyscale/darkness of a pixel: the density of a bin, calculated as: 255 - 255 * ${the number of 1s in the bin} / ${bin-size}.
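As a tiny worked example of the formula (numbers made up): a 1000-bit bin containing 300 one-bits gets a greyscale value of 255 - 255 * 300 / 1000 ≈ 179, i.e., denser bins appear darker:

```
# hypothetical numbers: a 1000-bit bin with 300 one-bits
bin_size=1000
ones=300
echo $(( 255 - 255 * ones / bin_size ))   # 179; 0 = black (all ones), 255 = white (all zeros)
```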
Here are some example outputs.
- v2023.05-genbank-viral-_block001.uniki (FPR of bloom filter: 0.3); only a part of the image is shown.

    Note that the genome sizes (in base pairs) in this block vary widely, so the preceding bloom filters are relatively sparse.
    Information of k-mers:

    ```
    kmcp utils ref-info -d genbank-viral.kmcp \
        | csvtk grep -t -p _block001.uniki \
        | csvtk cut -t -f -2 \
        | csvtk head -n 10 \
        | csvtk csv2md -t
    ```
    | file            | target          | chunkIdx | chunks | kmers | fpr      |
    |:----------------|:----------------|:---------|:-------|:------|:---------|
    | _block001.uniki | GCA_008766775.1 | 0        | 1      | 145   | 0.022419 |
    | _block001.uniki | GCA_003330005.1 | 8        | 10     | 150   | 0.023183 |
    | _block001.uniki | GCA_002814895.1 | 0        | 1      | 154   | 0.023794 |
    | _block001.uniki | GCA_001967255.1 | 9        | 10     | 165   | 0.025471 |
    | _block001.uniki | GCA_002219745.1 | 4        | 10     | 197   | 0.030336 |
    | _block001.uniki | GCA_000839085.1 | 0        | 1      | 200   | 0.030790 |
    | _block001.uniki | GCA_000919055.1 | 9        | 10     | 207   | 0.031851 |
    | _block001.uniki | GCA_000848385.1 | 6        | 10     | 207   | 0.031851 |
    | _block001.uniki | GCA_002820565.1 | 9        | 10     | 208   | 0.032002 |
    | _block001.uniki | GCA_002820505.1 | 9        | 10     | 208   | 0.032002 |

    Information of bloom filters:
    ```
    kmcp utils index-info -b genbank-viral.kmcp/R001/_block001.uniki | csvtk csv2md -t
    ```
    | file            | k  | canonical | num-hashes | num-sigs | num-names |
    |:----------------|:---|:----------|:-----------|:---------|:----------|
    | _block001.uniki | 21 | true      | 1          | 6395     | 10400     |

- v2023.05-refseq-fungi-_block001.uniki (FPR of bloom filter: 0.3)
    Information of k-mers:

    ```
    kmcp utils ref-info -d refseq-fungi.kmcp \
        | csvtk grep -t -p _block001.uniki \
        | csvtk cut -t -f -2 \
        | csvtk head -n 10 \
        | csvtk csv2md -t
    ```
    | file            | target          | chunkIdx | chunks | kmers  | fpr      |
    |:----------------|:----------------|:---------|:-------|:-------|:---------|
    | _block001.uniki | GCF_000277815.2 | 0        | 10     | 207084 | 0.094334 |
    | _block001.uniki | GCF_000146465.1 | 0        | 10     | 209422 | 0.095346 |
    | _block001.uniki | GCF_000091225.2 | 0        | 10     | 209787 | 0.095504 |
    | _block001.uniki | GCF_000280035.1 | 8        | 10     | 216972 | 0.098608 |
    | _block001.uniki | GCF_000280035.1 | 9        | 10     | 217529 | 0.098849 |
    | _block001.uniki | GCF_000280035.1 | 3        | 10     | 217631 | 0.098892 |
    | _block001.uniki | GCF_000280035.1 | 6        | 10     | 218560 | 0.099293 |
    | _block001.uniki | GCF_000280035.1 | 7        | 10     | 218763 | 0.099380 |
    | _block001.uniki | GCF_000280035.1 | 4        | 10     | 218830 | 0.099409 |
    | _block001.uniki | GCF_000280035.1 | 0        | 10     | 218860 | 0.099422 |

    Information of bloom filters:
    ```
    kmcp utils index-info -b refseq-fungi.kmcp/R001/_block001.uniki | csvtk csv2md -t
    ```
    | file            | k  | canonical | num-hashes | num-sigs | num-names |
    |:----------------|:---|:----------|:-----------|:---------|:----------|
    | _block001.uniki | 21 | true      | 1          | 2089979  | 160       |

- v2023.05-refseq-fungi-_block002.uniki (FPR of bloom filter: 0.3)
    Information of k-mers:

    | file            | target          | chunkIdx | chunks | kmers  | fpr      |
    |:----------------|:----------------|:---------|:-------|:-------|:---------|
    | _block002.uniki | GCF_000349005.2 | 5        | 10     | 745638 | 0.249355 |
    | _block002.uniki | GCF_000349005.2 | 2        | 10     | 746237 | 0.249528 |
    | _block002.uniki | GCF_000349005.2 | 9        | 10     | 747586 | 0.249917 |
    | _block002.uniki | GCF_000349005.2 | 0        | 10     | 748331 | 0.250132 |
    | _block002.uniki | GCF_000349005.2 | 7        | 10     | 749497 | 0.250469 |
    | _block002.uniki | GCF_001477545.1 | 5        | 10     | 751685 | 0.251099 |
    | _block002.uniki | GCF_002251995.1 | 3        | 10     | 753439 | 0.251604 |
    | _block002.uniki | GCF_001477545.1 | 4        | 10     | 755317 | 0.252145 |
    | _block002.uniki | GCF_001477545.1 | 0        | 10     | 758418 | 0.253036 |
    | _block002.uniki | GCF_001477545.1 | 8        | 10     | 760460 | 0.253623 |

    Information of bloom filters:

    | file            | k  | canonical | num-hashes | num-sigs | num-names |
    |:----------------|:---|:----------|:-----------|:---------|:----------|
    | _block002.uniki | 21 | true      | 1          | 2599648  | 160       |
Some other useful genome catalogs
Searching
Why is the CPU usage very low, not 100%?
Please check the log (in the terminal, not the log file written via the option `--log`); it reports the current search speed, like:
```
21:07:42.949 [INFO] reading sequence file: t_t3.fa.gz
processed queries: 1253376, speed: 8.127 million queries per minute
```
If the speed is very slow, check the following:
- Are you running KMCP on a computer cluster (HPC)?
    - If yes, please switch on `-w/--load-whole-db`, because the default database loading mode can be very slow on network-attached storage (NAS).
- Are the reference genomes highly similar? E.g., tens of thousands of genomes of the same species?
    - If yes, check the search result to see whether there are thousands of matches for a read (see the sketch after this list). You may want to choose a graph-based sequence searching tool instead.
- Are you searching with metatranscriptomic data?
    - If yes, the search results would show that a huge number of reads from 16S rRNA genes have thousands of matches; writing these results slows down the search. Such reads should be filtered out before searching, using tools like https://github.com/hzi-bifo/RiboDetector.
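A quick way to spot such reads is to count the matches per read in the search result, a sketch with coreutils (the file name is hypothetical, and query IDs are assumed to be in the first column):

```
# list the reads with the most matches
zcat reads.kmcp.tsv.gz \
    | grep -v '^#' \
    | cut -f 1 \
    | sort \
    | uniq -c \
    | sort -k 1,1nr \
    | head
```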
Can I run multiple KMCP processes on a machine?
Yes, you can. But note that `kmcp search` is CPU- and RAM-intensive, so please limit the number of CPU cores used by each process with the flag `-j/--threads`, whose default value is the number of available CPU cores of the machine.
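For example, on a 64-core machine, two searches could run concurrently with 32 threads each (database and file names hypothetical; also make sure both databases fit in RAM together):

```
kmcp search -j 32 -d db1.kmcp reads.fq.gz -o reads.db1.kmcp.tsv.gz &
kmcp search -j 32 -d db2.kmcp reads.fq.gz -o reads.db2.kmcp.tsv.gz &
wait
```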
Profiling
Where is the taxid.map?
Each prebuilt database contains a `taxid.map` file in its directory. You can concatenate them into a big one:
```
$ cat gtdb.kmcp/taxid.map refseq-viral.kmcp/taxid.map refseq-fungi.kmcp/taxid.map > taxid.map

$ head -n 5 taxid.map
GCA_004025655.1 10243
GCA_004025395.1 10243
GCA_004025355.1 10243
GCA_003971405.1 10243
GCA_003971385.1 10243
```
Or set the option `-T/--taxid-map` multiple times:

```
kmcp profile -T gtdb.kmcp/taxid.map -T refseq-viral.kmcp/taxid.map -T refseq-fungi.kmcp/taxid.map ...
```
For other custom genome collections, you can use `taxonkit create-taxdump` to create NCBI-style taxdump files for a custom taxonomy, e.g., GTDB and ICTV; it also generates a `taxid.map` file. A sketch is shown below.
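For example, for GTDB taxonomy files (flags as documented for taxonkit create-taxdump; check `taxonkit create-taxdump -h`, as details may differ between versions):

```
# create NCBI-style taxdump files (and taxid.map) from GTDB taxonomy files
taxonkit create-taxdump --gtdb \
    ar53_taxonomy.tsv.gz bac120_taxonomy.tsv.gz \
    --out-dir gtdb-taxdump/
```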
Unknown taxid?
```
19:54:54.632 [ERRO] unknown taxid for NZ_CP028116.1, please check taxid mapping file(s)
```
If `kmcp profile` reports this, you need to check whether the taxid mapping file(s) contain all the reference IDs, and make sure the reference IDs match those in the database; the latter are listed in:
```
$ head -n 5 $kmcp_db_dir/R001/__name_mapping.tsv
NC_013654.1     NC_013654.1
NC_000913.3     NC_000913.3
NC_010655.1     NC_010655.1
NC_012971.2     NC_012971.2
NC_011750.1     NC_011750.1
```
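A quick sketch with coreutils to list reference IDs that are present in the database but missing from the taxid mapping file (paths hypothetical):

```
# IDs in the database
cut -f 1 $kmcp_db_dir/R001/__name_mapping.tsv | sort -u > db-ids.txt

# IDs covered by the taxid mapping file
cut -f 1 taxid.map | sort -u > map-ids.txt

# IDs in the database but absent from taxid.map
comm -23 db-ids.txt map-ids.txt
```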
There's another case: you used `--name-map` in `kmcp search`. Please don't do this if you will use the search result for metagenomic profiling, which needs the original reference IDs. A note has been added to the help message:
```
-N, --name-map strings   ► Tabular two-column file(s) mapping reference IDs to user-defined
                           values. Don't use this if you will use the result for metagenomic
                           profiling which needs the original reference IDs.
```
How to tune individual options when using preset profiling modes?
In older versions, this was not supported due to a limitation of the command-line argument parser, and you had to explicitly set all relevant options of a mode. It has been available since v0.8.2: set a preset mode and simply override the options you want to change, as in the example below.
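For example (a sketch with hypothetical file names; `--min-query-cov` is assumed to be one of the options preset by the modes):

```
# use preset mode 3 but override a single option (v0.8.2 or later)
kmcp profile -m 3 --min-query-cov 0.6 \
    -X taxdump/ -T taxid.map \
    reads.kmcp.tsv.gz -o reads.k.profile
```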