Frequently Asked Questions
General
How can I run KMCP on a computer without enough main memory?
By default, kmcp search
loads the whole database into main memory (RAM) for fast searching.
Optionally, the flag --low-mem
can be set to avoid loading the whole database,
while it's much slower, >10X slower on SSD and should be much slower on HDD disks.
Another way is dividing the reference genomes into several parts
and building smaller databases for all parts, so that the biggest
database can be loaded into RAM. After performing database searching,
search results on all small databases can be merged with kmcp merge
for downstream analysis.
Buiding small databases can also accelerate searching on a computer cluster, where every node searches a part of the database.
Database building
Can I create a database with custom genome collections for profiling?
Yes, you can use taxonkit create-taxdump
to create NCBI-style taxdump files for profiling, which also generates a taxid.map
file.
What k-mer size should I use to build the database?
Multiple k-mer sizes are supported, but one value is good enough.
Bigger k-mer sizes bring high specificity at the cost of decrease
of sensitivity. k = 21
is recommended for metagenomic profiling.
How to add new genomes to the database?
KMCP builds database very fast, you can either rebuilt the database after adding new genomes, or create a separate database with the new genomes, search against these databases, and merge the results.
Unexpected EOF error
Some files could corrupt during downloading, we recommend checking
sequence file integrity using seqkit (gzip -t
failed for some files in
my tests).
-
List corrupted files
# corrupted files find $genomes -name "*.gz" \ | rush 'seqkit seq -w 0 {} > /dev/null; if [ $? -ne 0 ]; then echo {}; fi' \ > failed.txt # empty files find $genomes -name "*.gz" -size 0 >> failed.txt
-
Delete these files:
cat failed.txt | rush '/bin/rm {}'
-
Redownload these files. For example, from
rush
cache filedownload.rush
(list of succeed commands), where the commands are in one line.grep -f failed.txt download.rush \ | sed 's/__CMD__//g' \ | rush '{}'
For
genome_updater
, URLs of genomes can be found in files like2021-09-30_13-32-30_url_downloaded.txt
, you can extract URLs usinggrep -f failed.txt -v *url_downloaded.txt
or some other ways, and batch redownload them usingparallel
.
Searching
Why are the CPU usages are very low, not 100%?
Please check the log (in terminal not the log file via the option --log
),
it should report current searching speed, like:
21:07:42.949 [INFO] reading sequence file: t_t3.fa.gz
processed queries: 1253376, speed: 8.127 million queries per minute
If the speed is very slow.
-
Are you running KMCP in a computer cluster (HPC)?
- If yes, please switch on
-w/--load-whole-db
. Because the default database loading mode would be very slow for network-attached storage (NAS).
- If yes, please switch on
-
Are the reference genomes are highly similar? E.g., tens of thousands of genomes of a same species?
- If yes, check the search result to see if there are thousands of matches for a read. You may choose another graph-based sequence searching tool.
Can I run multiple KMCP processes in a machine?
Yes you can. But note that KMCP search is CPU- and RAM-intense. So please to limit the number of CPUs cores to use for each process with the falg -j/--threads
, of which the default value is the available CPUs cores of the machine.
Profiling
Where is the taxid.map?
Each prebuilt database contains a taxid.map
file in its directory.
You can concatenate them into a big one:
$ cat gtdb.kmcp/taxid.map refseq-viral.kmcp/taxid.map refseq-fungi.kmcp/taxid.map > taxid.map
$ head -n 5 taxid.map
GCA_004025655.1 10243
GCA_004025395.1 10243
GCA_004025355.1 10243
GCA_003971405.1 10243
GCA_003971385.1 10243
Or set the options -T/--taxid-map
multiple times:
kmcp profile -T gtdb.kmcp/taxid.map -T refseq-viral.kmcp/taxid.map -T refseq-fungi.kmcp/taxid.map ...
For other custom genome collections, you can use
taxonkit create-taxdump
to create NCBI-style taxdump files for custom taxonomy, e.g.,
GTDB and
ICTV, which also generates a taxid.map
file.
Unknown taxid?
19:54:54.632 [ERRO] unknown taxid for NZ_CP028116.1, please check taxid mapping file(s)
If the kmcp profile
reports this, you may need to check if the taxid mapping file contain all the reference IDs.
And make sure the reference IDs match these in the database, the later ones are listed in:
$ head -n 5 $kmcp_db_dir/R001/__name_mapping.tsv
NC_013654.1 NC_013654.1
NC_000913.3 NC_000913.3
NC_010655.1 NC_010655.1
NC_012971.2 NC_012971.2
NC_011750.1 NC_011750.1
There's another case: you used --name-map
in kmcp search
.
Please don't do this if you will use the search result for metagenomic
profiling which needs the original reference IDs. We add a note now:
-N, --name-map strings ► Tabular two-column file(s) mapping reference IDs to user-defined
values. Don't use this if you will use the result for metagenomic
profiling which needs the original reference IDs.
How to tune parts of options when using preset profiling modes?
Sorry it's not supported due to the limitation of the command-line argument parsers.
You need explicitly set all relevant options of the mode.
It's available since v0.8.2.