Database

Prebuilt databases

All prebuilt databases and the used reference genomes are available at:

OneDrive for global users.
CowTransfer for Chinese users and global users.
Please click the "kmcp" icon on the left to browse directories of the current version, and choose each individual file to download..

Please check file integrity with `md5sum` after downloading the files:

md5sum -c *.kmcp.tar.gz.md5.txt

A). Databases for metagenomic profiling

These databases are created following steps below. Users can also build custom databases, it's simple and fast.

Current version: v2023.05 (OneDrive, CowTransfer)

kingdoms	source	#NCBI-species	#assemblies	db-parameters	size
Bacteria and Archaea	GTDB r214	34395+	85205	k=21, chunks=10; fpr=0.3, hashes=1	50+49 GB*
Fungi	Refseq r217	491	496	k=21, chunks=10; fpr=0.3, hashes=1	5.5 GB
Viruses	GenBank r255	26680	33479	k=21, chunks=5; fpr=0.05, hashes=1	5.9 GB

Notes: *GTDB representative genomes were split into 2 parts to build relatively small databases which can be filled into workstations with small RAM (around 64GB). Users need to search reads on all these small databases and merge the results before running kmcp profile.

Taxonomy information: Either NCBI taxonomy or a combination of GTDB and NCBI are available:

NCBI: taxdmp_2023-05-01
GTDB+NCBI: GTDB r214 + NCBI taxdmp_2023-05-01

Hardware requirements

Prebuilt databases were built for computers with >= 32 CPU cores in consideration of better parallelization, and computers should have at least 64 GB.
By default, kmcp search loads the whole database into main memory (via mmap by default) for fast searching. Optionally, the flag --low-mem can be set to avoid loading the whole database, while it's much slower, >10X slower on SSD and should be much slower on HDD disks.
To reduce memory requirements on computers without enough memory, users can split the reference genomes into partitions and build a smaller database for each partition, so that the biggest database can be loaded into RAM. This can also accelerate searching on a computer cluster, where every node searches against a small database. After performing database searching, search results from all small databases can be merged with kmcp merge for downstream analysis.

B). Databases for genome similarity estimation

Check the tutorial.

FracMinHash (Scaled MinHash), links: OneDrive, CowTransfer

kingdoms	source	#NCBI-species	#assemblies	parameters	size
Bacteria and Archaea	GTDB r214	34395+	85205	k=31, scale=1000	2.7 GB
Fungi	Refseq r217	491	496	k=31, scale=1000	131 MB
Viruses	GenBank r255	26680	33479	k=31, scale=10	2.0 GB

Building databases

GTDB

Tools:

brename for batching renaming files.
rush for executing jobs in parallel.
seqkit for FASTA file processing.
csvtk for tsv/csv data manipulations.
taxonkit for NCBI taxonomy data manipulations.
kmcp for metagenomic profiling.

Files (https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/)

representative genomes: gtdb_genomes_reps.tar.gz
metagdata: ar53_metadata.tar.gz
metagdata: bac120_metadata.tar.gz
taxdump file: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/

Uncompressing and renaming:

# uncompress
mkdir -p gtdb
tar -zxvf gtdb_genomes_reps.tar.gz -C gtdb

# rename
brename -R -p '^(\w{3}_\d{9}\.\d+).+' -r '$1.fna.gz' gtdb

Mapping file:

# assembly accession -> full head
find gtdb/ -name "*.fna.gz" \
    | rush -k 'echo -ne "{%@(.+).fna}\t$(seqkit sort --quiet -lr {} | head -n 1 | seqkit seq -n)\n" ' \
    > name.map

tar -zxvf ar53_metadata.tar.gz
tar -zxvf bac120_metadata.tar.gz

# assembly accession -> taxid
cat *_metadata*.tsv \
    | csvtk cut -t -f accession,ncbi_taxid \
    | csvtk replace -t -p '^.._' \
    | csvtk grep -t -P <(cut -f 1 name.map) \
    | csvtk del-header \
    > taxid.map


# stats (optional)
# taxdump is a directory containing NCBI taxdump files, including names.dmp, nodes.dmp, delnodes.dmp and merged.dmp
cat taxid.map  \
    | csvtk freq -Ht -f 2 -nr \
    | taxonkit lineage -r -n -L --data-dir taxdump/ \
    | taxonkit reformat -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' --data-dir taxdump/ \
    | csvtk add-header -t -n 'taxid,count,name,rank,superkindom,phylum,class,order,family,genus,species' \
    > gtdb.taxid.map.stats.tsv


# number of unique species/strains
cat taxid.map \
    | csvtk uniq -Ht -f 2 \
    | taxonkit lineage --data-dir taxdump/ -i 2 -r -n -L \
    | csvtk freq -Ht -f 4 -nr \
    | csvtk pretty -H -t \
    | tee taxid.map.uniq.stats
species           31165
strain            4073
subspecies        104
forma specialis   59
no rank           31
isolate           26

Masking prophage regions and removing plasmid sequences with genomad (optional):

conda activate genomad

# https://github.com/shenwei356/cluster_files
cluster_files -p '^(.+).fna.gz$' gtdb -o gtdb.genomad

# indentify prophages and plasmids sequences
# > 10 days
db=~/ws/db/genomad/genomad_db
find gtdb.genomad/ -name "*.fna.gz" \
    | rush -j 6 -v db=$db \
        'genomad end-to-end --relaxed --cleanup --threads 25 {} {/}/genomad {db}' \
        -c -C genomad.rush --eta

# mask proviruses and remove plasmid sequences
find gtdb.genomad/ -name "*.fna.gz" \
    | grep -v masked \
    | rush -j 10 -v 'p={/}/genomad/{/%}_summary/{/%}' \
        'csvtk grep -t -f coordinates -p NA {p}_virus_summary.tsv \
            | sed 1d | cut -f 1 \
            > {p}.v.id; \
         csvtk grep -t -f coordinates -p NA {p}_virus_summary.tsv -v \
            | sed 1d | cut -f 1 \
            | perl -ne "/^(.+)\|.+_(\d+)_(\d+)$/; \$s=\$2-1; print qq/\$1\t\$s\t\$3\n/;" \
            > {p}.v.bed; \
        cut -f 1 {p}_plasmid_summary.tsv | sed 1d > {p}.p.id; \
        bedtools maskfasta -fi <(seqkit grep --quiet -v -f <(cat {p}.p.id {p}.v.id) {}) \
                            -bed {p}.v.bed -fo {}.masked.fna; \
        gzip -f {}.masked.fna' \
        -c -C genomad.b.rush --eta

# replace masked sequence with k-Ns, for more accurate genome size.
find gtdb.genomad/ -name "*.fna.gz.masked.fna.gz" \
    | rush -j 20 'seqkit replace -s -i -p N+ -r NNNNNNNNNNNNNNNNNNNNN {} -o {...}.masked2.fna.gz' \
        -c -C genomad.c.rush --eta

# stats
time find gtdb.genomad/ -name "*.fna.gz" \
    | seqkit stats -j 20 --infile-list - -T -b -o gtdb.genomad.stats.tsv

# create symbol links in another directory, and rename
mkdir -p gtdb_masked
cluster_files  -p '^(.+).fna.gz.masked2.fna.gz$' gtdb.genomad -o gtdb_masked
brename -R -p .masked.fna.gz$ gtdb_masked -d

Building database (all k-mers, for profiling on short-reads):

input=gtdb

# compute k-mers
#   sequence containing "plasmid" in name are ignored,
#   reference genomes are split into 10 chunks with 100bp overlap
#   k = 21
kmcp compute -I $input -O gtdb-k21-n10 -k 21 -n 10 -l 150 -B plasmid \
    --log gtdb-k21-n10.log -j 32 --force

# build database
#   number of index files: 32, for server with >= 32 CPU cores
#   bloom filter parameter:
#     number of hash function: 1
#     false positive rate: 0.3
kmcp index -j 32 -I gtdb-k21-n10 -O gtdb.kmcp -n 1 -f 0.3 \
    --log gtdb.kmcp.log --force

# cp taxid and name mapping file to database directory
cp taxid.map name.map gtdb.kmcp/

# clean up
rm -rf gtdb-k21-n10

Building small databases (all k-mers, for profiling with a computer cluster or computer with limited RAM):

input=gtdb

find $input -name "*.fna.gz" > $input.files.txt

# sort files by genome size, so we can split them into chunks with similar genome sizes
cp $input.files.txt $input.files0.txt
cat $input.files0.txt \
    | rush --eta -k 'echo -e {}"\t"$(seqkit stats -T {} | sed 1d | cut -f 5)' \
    > $input.files0.size.txt
cat $input.files0.size.txt \
    | csvtk sort -Ht -k 2:nr \
    | csvtk cut -t -f 1 \
    > $input.files.txt

# number of databases
n=2

# split into $n chunks using round robin distribution
split -n r/$n -d  $input.files.txt $input.n$n-

# create database for every chunks
for f in $input.n$n-*; do 
    echo $f

    # compute k-mers
    #   sequence containing "plasmid" in name are ignored,
    #   reference genomes are split into 10 chunks with 100bp overlap
    #   k = 21
    kmcp compute -i $f -O $f-k21-n10 -k 21 -n 10 -l 150 -B plasmid \
        --log $f-k21-n10.log --force

    # build database
    #   number of index files: 32, for server with >= 32 CPU cores
    #   bloom filter parameter:
    #     number of hash function: 1
    #     false positive rate: 0.3
    kmcp index -j 32 -I $f-k21-n10 -O $f.kmcp -n 1 -f 0.3 \
        --log $f.kmcp.log --force

    # cp taxid and name mapping file to database directory
    cp taxid.map name.map $f.kmcp/
done

# rename small databases
# for example:
#   [INFO] checking: [ ok ] 'gtdb.n2-00.kmcp' -> 'gtdb.part_1.kmcp'
#   [INFO] checking: [ ok ] 'gtdb.n2-01.kmcp' -> 'gtdb.part_2.kmcp'
#   [INFO] 2 path(s) to be renamed

brename -D --only-dir -p "n$n-\d+" -r "part_{nr}" -d


# clean up
for f in $input.n$n-*; do
    rm -rf $f-k21-n10
done

RefSeq viral or fungi

Tools

genome_updater (0.5.1) for downloading genomes from NCBI.

Downloading viral and fungi sequences:

name=fungi
# name=viral

# -k for dry-run
# -i for fix
time genome_updater.sh \
    -d "refseq"\
    -g $name \
    -c "" \
    -l "" \
    -f "genomic.fna.gz" \
    -o "refseq-$name" \
    -t 12 \
    -m -a -p

# cd to 2021-09-30_19-35-19

# taxdump
mkdir -p taxdump
tar -zxvf taxdump.tar.gz -C taxdump

# assembly accession -> taxid
cut -f 1,6 assembly_summary.txt > taxid.map    
# assembly accession -> name
cut -f 1,8 assembly_summary.txt > name.map


# stats (optional)
cat taxid.map  \
    | csvtk freq -Ht -f 2 -nr \
    | taxonkit lineage -r -n -L --data-dir taxdump/ \
    | taxonkit reformat -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' --data-dir taxdump/ \
    | csvtk add-header -t -n 'taxid,count,name,rank,superkindom,phylum,class,order,family,genus,species' \
    > refseq-$name.taxid.map.stats.tsv

seqkit stats -T -j 12 --infile-list <(find files -name "*.fna.gz") > files.stats.tsv



# ------------------------------------------------------ 
# create another directory linking to genome files

input=files.renamed

mkdir -p $input
# create soft links
cd $input
find ../files -name "*.fna.gz" \
    | rush 'ln -s {}'
cd ..    
# rename
brename -R -p '^(\w{3}_\d{9}\.\d+).+' -r '$1.fna.gz' $input

Building database (all k-mers, for profiling on short-reads):

# -----------------------------------------------------------------
# for viral, only splitting into 10 chunks
name=viral

input=files.renamed

# -------------
# all kmers

kmcp compute -I $input -O refseq-$name-k21-n10 \
    -k 21 --seq-name-filter plasmid \
    --split-number 10 --split-overlap 150 \
    --log refseq-$name-k21-n10.log -j 32 --force

# viral genomes are small:
#   using small false positive rate: 0.001
#   using more hash functions: 3
kmcp index -I refseq-$name-k21-n10/ -O refseq-viral.kmcp \
    -j 32 -f 0.001 -n 3 -x 100K \
    --log refseq-viral.kmcp.log --force

# cp taxid and name mapping file to database directory
cp taxid.map name.map refseq-viral.kmcp/


# -----------------------------------------------------------------
# for fungi
name=fungi

input=files.renamed


# -------------
# all kmers

kmcp compute -I $input -O refseq-$name-k21-n10 \
    -k 21 --seq-name-filter plasmid \
    --split-number 10 --split-overlap 150 \
    --log refseq-$name-k21-n10.log --force

kmcp index -I refseq-$name-k21-n10/ -O refseq-fungi.kmcp \
    -j 32 -f 0.3 -n 1 \
    --log refseq-fungi.kmcp.log --force

# cp taxid and name mapping file to database directory
cp taxid.map name.map refseq-fungi.kmcp/

Genbank viral

Tools

genome_updater for downloading genomes from NCBI.

Downloading viral sequences:

name=viral

# -k for dry-run
# -i for fix
# keep at most 5 genomes for a taxid:
#    genome_updater v0.5.1: -A 5 -c "" -l ""
#    genome_updater v0.2.5: -j taxids:5 -c "all" -l "all"
time genome_updater.sh \
    -d "genbank"\
    -A 5 \
    -g $name \
    -c "" \
    -l "" \
    -f "genomic.fna.gz" \
    -o "genbank-$name" \
    -t 12 \
    -m -a -p

# cd genbank-viral/2021-12-06_15-27-37/

# taxdump
mkdir -p taxdump
tar -zxvf taxdump.tar.gz -C taxdump

# assembly accession -> taxid
cut -f 1,6 assembly_summary.txt > taxid.map    
# assembly accession -> name
cut -f 1,8 assembly_summary.txt > name.map


# stats (optional)
cat taxid.map  \
    | csvtk freq -Ht -f 2 -nr \
    | taxonkit lineage -r -n -L --data-dir taxdump/ \
    | taxonkit reformat -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' --data-dir taxdump/ \
    | csvtk add-header -t -n 'taxid,count,name,rank,superkindom,phylum,class,order,family,genus,species' \
    > genbank-viral.taxid.map.stats.tsv

seqkit stats -T -j 12 --infile-list <(find files -name "*.fna.gz") > files.stats.tsv



# ------------------------------------------------------ 
# create another directory linking to genome files

input=files.renamed

mkdir -p $input
# create soft links
cd $input
find ../files -name "*.fna.gz" \
    | rush 'ln -s {}'
cd ..    
# rename
brename -R -p '^(\w{3}_\d{9}\.\d+).+' -r '$1.fna.gz' $input

Keep at most 5 genomes for a taxid (optional)

# -----------------------------------------------------------------
# keep at most 5 genomes for a taxid

# backup
cat taxid.map > taxid.all.map
cat name.map > name.all.map

# keep at most 5 ids for a taxid
cat taxid.all.map \
    | csvtk sort -Ht -k 2:n -k 1:n \
    | csvtk uniq -Ht -f 2 -n 5 \
    > taxid.map

cat name.all.map \
    | csvtk grep -Ht -P <(cut -f 1 taxid.map) \
    > name.map

# organize files
input=files.renamed
output=files.renamed.slim
mkdir -p $output 
cd $output
find ../$input -name "*.fna.gz" \
    | csvtk mutate -Ht -p '/([^\/]+).fna.gz' \
    | csvtk grep -Ht -f 2 -P <(cut -f 1 ../taxid.map) \
    | rush 'ln -s {1}'
cd ..

# check number of genomes
ls $output | wc -l
wc -l taxid.map

Building database (all k-mers, for profiling on short-reads):

# -----------------------------------------------------------------
name=viral

# input=files.renamed.slim
input=files.renamed


# ----------------
# all kmers

kmcp compute -I $input -O genbank-$name-k21-n10 \
    -k 21 --seq-name-filter plasmid \
    --split-number 10 --split-overlap 150 \
    --log genbank-$name-k21-n10.log --force

# viral genomes are small:
#   using small false positive rate: 0.05
#   still using one hash function: 1
kmcp index -I genbank-$name-k21-n10/ -O genbank-viral.kmcp \
    -j 32 -f 0.05 -n 1 -x 100K -8 1M \
    --log genbank-viral.kmcp.log --force

# cp taxid and name mapping file to database directory
cp taxid.map name.map genbank-viral.kmcp/

Building small databases (all k-mers, for profiling with a computer cluster or computer with limited RAM):

name=genbank-viral

input=files.renamed.slim

find $input -name "*.fna.gz" > $input.files.txt

# sort files by genome size, so we can split them into chunks with similar genome sizes
cp $input.files.txt $input.files0.txt
cat $input.files0.txt \
    | rush -k 'echo -e {}"\t"$(seqkit stats -T {} | sed 1d | cut -f 5)' \
    > $input.files0.size.txt
cat $input.files0.size.txt \
    | csvtk sort -Ht -k 2:nr \
    | csvtk cut -t -f 1 \
    > $input.files.txt

# number of databases
n=4

# split into $n chunks using round robin distribution
split -n r/$n -d  $input.files.txt $name.n$n-

# create database for every chunks
for f in $name.n$n-*; do 
    echo $f

    kmcp compute -i $f -O $f-k21-n10 \
        -k 21 --seq-name-filter plasmid \
        --split-number 10 --split-overlap 150 \
        --log $f-k21-n10.log -j 24 --force

    # viral genomes are small:
    #   using small false positive rate: 0.001
    #   using more hash functions: 3
    kmcp index -I $f-k21-n10/ -O $f.kmcp \
        -j 24 -f 0.05 -n 1 -x 100K -8 1M \
        --log $f.kmcp.log --force

    # cp taxid and name mapping file to database directory
    cp taxid.map name.map $f.kmcp/
done

Human genome

Downloading human genome file from CHM13:

# v1.1: wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.3_CHM13_T2T_v1.1/GCA_009914755.3_CHM13_T2T_v1.1_genomic.fna.gz

# v2.0
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_genomic.fna.gz

Building database (all k-mers, < 10min):

# v1.1: input=GCA_009914755.3_CHM13_T2T_v1.1_genomic.fna.gz

# v2.0
input=GCA_009914755.4_T2T-CHM13v2.0_genomic.fna.gz

# splitting human genome into 3072 chunks, each with ~1Mb
# The regular expression '^(\w{3}_\d{9}\.\d+).+' is for extracting 'GCA_009914755.4' from the file name.
kmcp compute $input -O human-chm13-k21-n3072 \
    --ref-name-regexp '^(\w{3}_\d{9}\.\d+).+' \
    -k 21 \
    --split-number 3072 --split-overlap 150 \
    --log human-chm13-k21-n3072.log -j 32 --force

# using a small false positive rate: 0.05
# using one hash function: 1
kmcp index -I human-chm13-k21-n3072/ -O human-chm13.kmcp \
    -j 8 -f 0.05 -n 1 \
    --log human-chm13.kmcp.log --force

# taxid.map
echo -ne "GCA_009914755.4\t9606\n" > taxid.map

# name.map
echo -ne "GCA_009914755.4\tHomo sapiens isolate CHM13\n" > name.map

# cp name mapping file to database directory
cp taxid.map name.map human-chm13.kmcp/

# database size
$ du -sh human-chm13.kmcp
6.7G    human-chm13.kmcp

Refseq plasmid

Downloading plasmid sequences:

wget https://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/
cat index.html \
    | perl -ne 'next unless /"(plasmid.*genomic.fna.gz)"/; print "$1\n"' > files.txt

baseurl=https://ftp.ncbi.nlm.nih.gov/refseq/release/plasmid/
outdir=archives; mkdir -p $outdir
cat files.txt \
    | rush -j 12 -v b=$baseurl -v out=$outdir \
        'wget -c {b}/{} -o /dev/null -O {out}/{}' -c -C download.rush


# name mapping
seqkit seq -n archives/*.fna.gz \
    | sed -E 's/\s+/\t/' \
    > name.map

# length stats
seqkit stats -b -a -j 12 -T archives/*.fna.gz > archives.stats.tsv


# split to individual files
for f in archives/*.fna.gz; do \
    seqkit split2 -s 1 --quiet $f -O refseq-plasmid; \
done
# rename files with sequence ID
find refseq-plasmid -name "*.fna.gz" \
    | rush 'mv {} {/}/$(seqkit seq -ni {}).fna.gz'

Building database (all k-mers):

name=plasmid

input=refseq-plasmid

kmcp compute -I $input -O refseq-$name-k21-n10 \
    -k 21 --circular \
    --split-number 10 --split-overlap 150 \
    --log refseq-$name-k21-n10.log -j 32 --force

#   using small false positive rate: 0.01
#   using more hash functions: 3
kmcp index -I refseq-$name-k21-n10/ -O refseq-$name.kmcp \
    -j 32 -f 0.01 -n 3 -x 200K -X 1024 \
    --log refseq-$name.kmcp.log --force

# cp name mapping file to database directory
cp name.map refseq-$name.kmcp/

Building database (FracMinHash/Scaled MinHash):

name=plasmid

input=refseq-plasmid

kmcp compute -I $input -O refseq-$name-k31-D10 \
    -k 31 --circular --scale 10 \
    --log refseq-$name-k31-D10.log -j 32 --force

#   using small false positive rate: 0.01
#   using more hash functions: 3
kmcp index -I refseq-$name-k31-D10/ -O refseq-$name.minhash.kmcp \
    -j 8 -f 0.01 -n 3 -x 100K -X 1024 \
    --log refseq-$name.minhash.kmcp.log --force

# cp name mapping file to database directory
cp name.map refseq-$name.minhash.kmcp/

Building database (Closed Syncmer):

name=plasmid

input=refseq-plasmid

kmcp compute -I $input -O refseq-$name-k31-S21 \
    -k 31 --circular --syncmer-s 21 \
    --log refseq-$name-k31-S21.log -j 32 --force

#   using small false positive rate: 0.01
#   using more hash functions: 3
kmcp index -I refseq-$name-k31-S21/ -O refseq-$name.syncmer.kmcp \
    -j 8 -f 0.01 -n 3 -x 100K -X 1024 \
    --log refseq-$name.syncmer.kmcp.log --force

# cp name mapping file to database directory
cp name.map refseq-$name.syncmer.kmcp/

Building databases (prokaryotic genome collections)

HumGut (30,691 clusters)

HumGut is a comprehensive Human Gut prokaryotic genomes collection filtered by metagenome data.

In this work, we aimed to create a collection of the most prevalent healthy human gut prokaryotic genomes, to be used as a reference database, including both MAGs from the human gut and ordinary RefSeq genomes.

Dataset

Genomes: HumGut.tar.gz
Metadata: HumGut.tsv
Taxdump files:
- NCBI taxonomy: ncbi_names.dump and ncbi_nodes.dump
- GTDB taxomomy: gtdb_names.dump and gtdb_nodes.dump

Uncompressing and renaming:

# uncompress
tar -zxvf HumGut.tar.gz

# taxdump files
mkdir taxdump
mv ncbi_names.dmp taxdump/names.dmp
mv ncbi_nodes.dmp taxdump/nodes.dmp

Mapping file:

# assembly accession -> taxid
cat HumGut.tsv \
    | csvtk cut -t -f genome_file,ncbi_tax_id \
    | csvtk replace -f genome_file -t -p '\.fna\.gz$' \
    | csvtk del-header \
    > taxid.map

# assembly accession -> name
cat HumGut.tsv \
    | csvtk cut -t -f genome_file,ncbi_organism_name \
    | csvtk replace -f genome_file -t -p '\.fna\.gz$' \
    | csvtk del-header \
    > name.map

Stats:

# -------------------------------------------------------------
# NCBI taxonomy

# number of species/strains
cat taxid.map \
    | taxonkit lineage --data-dir taxdump-ncbi -i 2 -r -n -L \
    | csvtk freq -Ht -f 4 -nr \
    | csvtk pretty -H -t \
    | tee taxid.map.stats
species      23604
strain       6816
subspecies   161
no rank      69
serotype     38
genus        3

# number of unique species/strains
cat taxid.map \
    | csvtk uniq -Ht -f 2 \
    | taxonkit lineage --data-dir taxdump-ncbi -i 2 -r -n -L \
    | csvtk freq -Ht -f 4 -nr \
    | csvtk pretty -H -t \
    | tee taxid.map.uniq.stats;
species      1240
strain       458
subspecies   15
genus        1
no rank      1
serotype     1

# -------------------------------------------------------------
# GTDB taxonomy

cat HumGut.tsv \
    | csvtk cut -t -f genome_file,gtdbtk_tax_id \
    | csvtk replace -f genome_file -t -p '\.fna\.gz$' \
    | csvtk del-header \
    > taxid-gtdb.map

cat taxid-gtdb.map \
    | csvtk uniq -Ht -f 2 \
    | taxonkit lineage --data-dir taxdump-gtdb -i 2 -r -n -L \
    | csvtk freq -Ht -f 4 -nr \
    | csvtk pretty -H -t \
    | tee taxid-gtdb.map.uniq.stats
species   2810
genus     434
family    52
order     12
class     2

Building database:

# compute k-mers
#   sequence containing "plasmid" in name are ignored,
#   reference genomes are split into 10 chunks
#   k = 21
kmcp compute -I fna/ -k 21 -n 10 -B plasmid -O humgut-k21-n10 -j 32 --force

# build database
#   number of index files: 32, for server with >= 32 CPU cores
#   bloom filter parameter:
#     number of hash function: 1
#     false positive rate: 0.3
kmcp index -j 32 -I humgut-k21-n10 -O humgut.kmcp -n 1 -f 0.3

cp taxid.map taxid-gtdb.map humgut.kmcp/

proGenomes

proGenomes v2.1 provides 84,096 consistently annotated bacterial and archaeal genomes from over 12000 species.

Dataset:

Representative genomes: contigs.representatives.fasta.gz
NCBI taxonomy: proGenomes2.1_specI_lineageNCBI.tab

Organize sequences:

# download
wget https://progenomes.embl.de/data/repGenomes/freeze12.contigs.representatives.fasta.gz

# unzip for splitting by genome ID
time seqkit seq freeze12.contigs.representatives.fasta.gz -o freeze12.contigs.representatives.fasta

# split by genome ID
seqkit split --by-id --two-pass --id-regexp '(^\d+\.\w+)\.' freeze12.contigs.representatives.fasta --out-dir genomes

# batch rename
brename -p '^.+id_' genomes

# compress genomes for saving space
find genomes/ -name "*.fasta" | rush 'gzip {}'

Mapping file:

# download
wget https://progenomes.embl.de/data/proGenomes2.1_specI_lineageNCBI.tab

ls genomes/ | sed 's/.fasta.gz//' > id.txt

# id -> taxid
cut -f 1 proGenomes2.1_specI_lineageNCBI.tab \
    | awk -F . '{print $0"\t"$1}' \
    | csvtk grep -Ht -P id.txt \
    > taxid.map

cat taxid.map \
    | csvtk uniq -Ht -f 2 \
    | taxonkit lineage --data-dir taxdump -i 2 -r -n -L \
    | csvtk freq -Ht -f 4 -nr \
    | csvtk pretty -H -t \
    | tee taxid.map.uniq.stats
species           8088
strain            3262
subspecies        37
isolate           25
no rank           16
forma specialis   14
biotype           1


# use the taxonomy verion of the refseq-virus: 2021-10-01
# id -> name
cat taxid.map \
    | taxonkit lineage --data-dir taxdump -i 2 -n -L \
    | cut -f 1,3 \
    > name.map

Building database:

# compute k-mers
#   sequence containing "plasmid" in name are ignored,
#   reference genomes are split into 10 chunks
#   k = 21
kmcp compute -I genomes/ -k 21 -n 10 -B plasmid -O progenomes-k21-n10 --force

# build database
#   number of index files: 32, for server with >= 32 CPU cores
#   bloom filter parameter:
#     number of hash function: 1
#     false positive rate: 0.3
kmcp index -j 32 -I progenomes-k21-n10 -O progenomes.kmcp -n 1 -f 0.3

cp taxid.map name.map progenomes.kmcp/

Building databases (viral genome collections)

MGV

Bacteriophages have important roles in the ecology of the human gut microbiome but are under-represented in reference data- bases. To address this problem, we assembled the Metagenomic Gut Virus catalogue that comprises 189,680 viral genomes from 11,810 publicly available human stool metagenomes. Over 75% of genomes represent double-stranded DNA phages that infect members of the Bacteroidia and Clostridia classes. Based on sequence clustering we identified 54,118 candidate viral spe- cies, 92% of which were not found in existing databases. The Metagenomic Gut Virus catalogue improves detection of viruses in stool metagenomes and accounts for nearly 40% of CRISPR spacers found in human gut Bacteria and Archaea. We also pro- duced a catalogue of 459,375 viral protein clusters to explore the functional potential of the gut virome. This revealed tens of thousands of diversity-generating retroelements, which use error-prone reverse transcription to mutate target genes and may be involved in the molecular arms race between phages and their bacterial hosts.

https://doi.org/10.1038/s41564-021-00928-6 https://portal.nersc.gov/MGV/

Basic information (optional)

$ seqkit  stats mgv_contigs.fna 
file             format  type  num_seqs        sum_len  min_len   avg_len  max_len
mgv_contigs.fna  FASTA   DNA    189,680  8,803,222,510    1,244  46,410.9  553,716

# Genome completeness
Complete: n=26,030
>90% complete: n=53,220
50-90% complete: n=110,430

$ seqkit  seq -n mgv_contigs.fna | head -n 3
MGV-GENOME-0364295
MGV-GENOME-0364296
MGV-GENOME-0364303

Stats (optional)

cat mgv_contig_info.tsv \
    | csvtk cut -t -f completeness \
    | csvtk plot hist -o completeness.hist.png

# Complete
$ cat mgv_contig_info.tsv \
    | csvtk filter2 -t -f '$completeness == 100' \
    | csvtk nrows 
32577

# >90% complete
$ cat mgv_contig_info.tsv \
    | csvtk filter2 -t -f '$completeness >= 90' \
    | csvtk nrows 
78814  # < 79250

# checkv_quality
$ cat mgv_contig_info.tsv \
    | csvtk cut -t -f checkv_quality \
    | csvtk freq -t -nr | more
checkv_quality  frequency
Medium-quality  110430
High-quality    53220
Complete        26030

# >90% complete && checkv_quality == High-quality/Complete
$ cat mgv_contig_info.tsv \
    | csvtk filter2 -t -f '$completeness >= 90 && $checkv_quality != "Medium-quality"' \
    | csvtk nrows 
78813

Stats of high-quality genomes (optional)

# high-quality genomes
cat mgv_contig_info.tsv \
    | csvtk filter2 -t -f '$completeness >= 90 && $checkv_quality != "Medium-quality"' \
    > mgv.hq.tsv

# number of families
cat mgv.hq.tsv \
    | csvtk freq -t -f ictv_family -nr \
    | csvtk nrow -t
19

# number of species
cat mgv.hq.tsv \
    | csvtk freq -t -f votu_id -nr \
    | csvtk nrow -t
26779

# baltimore
cat mgv.hq.tsv \
    | csvtk freq -t -f baltimore -nr \
    | csvtk pretty -t
baltimore   frequency
---------   ---------
dsDNA       76527
ssDNA       2215
NULL        62
DNA         7
dsDNA-RT    1
ssRNA-RT    1

# prophage?
cat mgv.hq.tsv \
    | csvtk freq -t -f prophage -nr \
    | csvtk pretty -t
prophage   frequency
--------   ---------
No         58366
Yes        20447

# species with both prophage and lytic phages
cat mgv.hq.tsv \
    | csvtk freq -t -f votu_id,prophage \
    | csvtk freq -t -f votu_id \
    | csvtk filter2 -t -f '$frequency > 1' \
    | csvtk nrow -t
2745

Prepare genomes:

# ID
cat mgv_contig_info.tsv \
    | csvtk filter2 -t -f '$completeness >= 90 && $checkv_quality != "Medium-quality"' \
    | csvtk cut -t -f contig_id \
    | csvtk del-header \
    > mgv.hq.tsv.id

# extract sequences
seqkit grep -f mgv.hq.tsv.id mgv_contigs.fna -o mgv.hq.fasta.gz

# split into files with one genome
seqkit split2 -s 1 mgv.hq.fasta.gz -O mgv

# rename with 
find mgv/ -name "*.fasta.gz" \
    | rush -j 20 'mv {} {/}/$(seqkit seq -ni {}).fa.gz'

Create taxdump files and taxid.map with taxonkit (version >= v0.12.0):

cat mgv_contig_info.tsv \
        | csvtk cut -t -f ictv_order,ictv_family,ictv_genus,votu_id,contig_id \
    | csvtk del-header \
    > mgv.taxonomy.tsv

taxonkit create-taxdump mgv.taxonomy.tsv --out-dir mgv-taxdump \
    --force -A 5 -R order,family,genus,species

cp mgv-taxdump/taxid.map .

# name.map
cat mgv-taxdump/taxid.map \
    | taxonkit lineage --data-dir mgv-taxdump/ -i 2 \
    | cut -f 1,3 \
    > name.map

head -n 5 name.map 
MGV-GENOME-0364295      Caudovirales;crAss-phage;OTU-61123
MGV-GENOME-0364296      Caudovirales;crAss-phage;OTU-61123
MGV-GENOME-0364303      Caudovirales;crAss-phage;OTU-05782
MGV-GENOME-0364311      Caudovirales;crAss-phage;OTU-01114
MGV-GENOME-0364312      Caudovirales;crAss-phage;OTU-23935

Building database:

# compute k-mers
#   reference genomes are split into 10 chunks
#   k = 21
kmcp compute -I mgv/ -k 21 -n 10 -l 150 -O mgv-k21-n10 --force

# viral genomes are small:
#   using small false positive rate: 0.05
#   still using one hash function: 1
kmcp index -j 32 -I mgv-k21-n10 -O mgv.kmcp \
    -j 32 -f 0.05 -n 1 -x 100K -8 1M \
    --log mgv.kmcp.log

cp taxid.map name.map mgv.kmcp/

Building custom databases

Files:

Genome files
- (Gzip-compressed) FASTA/Q format.
- One genome per file with the reference identifier in the file name.
TaxId mapping file (for metagenomic profiling)
- Two-column (reference identifier and TaxId) tab-delimited.
NCBI taxonomy dump files (for metagenomic profiling). You can use taxonkit create-taxdump to create NCBI-style taxdump files for custom genome collections, which also generates a TaxId mapping file.
- names.dmp
- nodes.dmp
- merged.dmp (optional)
- delnodes.dmp (optional)

Tools:

kmcp for metagenomic profiling.
rush for executing jobs in parallel.
brename for batching renaming files (optional)

Memory notes:

By default, kmcp search loads the whole database into main memory (RAM) for fast searching. Optionally, the flag --low-mem can be set to avoid loading the whole database, while it's much slower, >10X slower on SSD and should be much slower on HDD disks.

Another way is dividing the reference genomes into several parts and building smaller databases for all parts, so that the biggest database can be loaded into RAM. After performing database searching, search results on all small databases can be merged with kmcp merge for downstream analysis.

Buiding small databases can also accelerate searching on a computer cluster, where every node searches a part of the database.

Step 1. Computing k-mers

Input:

Input plain or gzipped FASTA/Q files can be given via positional arguments or the flag -i/--infile-list with the list of input files,
Or a directory containing sequence files via the flag -I/--in-dir, with multiple-level sub-directories allowed. A regular expression for matching sequencing files is available via the flag -r/--file-regexp. The default pattern matches files with extension of fa, fasta, fna, fq, fastq, fa.gz, fasta.gz, fna.gz, fq.gz or fastq.gz.

You may rename the sequence files for convenience using brename. because the sequence/genome identifier in the index and search results would be:

For the default mode (computing k-mers for the whole file):

the basename of file with common FASTA/Q file extension removed,
captured via the flag -N/--ref-name-regexp.

For splitting sequence mode (see details below):
```
same to 1).
```
For computing k-mers for each sequence:
```
the sequence identifier.
```

Unwanted sequences like plasmid sequences can be filtered out by the name via regular expression(s) (-B/--seq-name-filter).

How to compute k-mers: By default, kmcp computes k-mers (sketches) of every file, you can also use --by-seq to compute for every sequence, where sequence IDs in all input files are better to be distinct. It also supports splitting sequences into chunks, this could increase the specificity in profiling results at the cost of a slower searching speed.

Splitting sequences:

Sequences can be split into chunks by a chunk size (-s/--split-size) or number of chunks (-n/--split-number) with overlap (-l/--split-overlap). In this mode, the sequences of each genome should be saved in an individual file.
When splitting by number of chunks, all sequences (except for these matching any regular expression given by -B/--seq-name-filter) in a sequence file are concatenated with k-1 Ns before splitting.
Both sequence/reference IDs and chunks indices are saved for later use, in form of meta/description data in .unik files, and will be reported in kmcp search results.

Metadata:

Every outputted .unik file contains the sequence/reference ID, chunk index, number of chunks, and genome size of reference.
When parsing whole sequence files or splitting by number of chunks, the identifier of a reference is the basename of the input file by default. It can also be extracted from the input file name via -N/--ref-name-regexp, e.g., ^(\w{3}_\d{9}\.\d+) for RefSeq assembly accessions.

Multiple sizes of k-mers are supported, but a single k-mer size is good enough.

Supported k-mer (sketches) types:

K-mer:
- ntHash of k-mer (-k)
K-mer sketchs (all using ntHash):
- Scaled MinHash (-k -D), previously named Scaled MinHash
- Minimizer (-k -W), optionally scaling/down-sampling (-D)
- Closed Syncmer (-k -S), optionally scaling/down-sampling (-D)

Output:

All outputted .unik files are saved in ${outdir}, with path
```
${outdir}/xxx/yyy/zzz/${infile}-id_${seqID}.unik
```
where dirctory tree /xxx/yyy/zzz/ is built for > 1000 output files.
For splitting sequence mode (--split-size > 0 or --split-number > 0), output files are:
```
${outdir}//xxx/yyy/zzz/${infile}/{seqID}-chunk_${chunkIdx}.unik
```
A summary file (${outdir}/_info.txt) is generated for later use. Users need to check if the reference IDs (column name) are what supposed to be.

Performance tips:

Decrease value of -j/--threads for data in hard disk drives (HDD) to reduce I/O pressure.

Commands (usage):

# compute k-mers
#   sequence containing "plasmid" in name are ignored,
#   reference genomes are split into 10 chunks,
#   k = 21
kmcp compute --in-dir refs/ \
    --kmer 21 \
    --split-number 10 \
    --split-overlap 150 \
    --seq-name-filter plasmid \
    --ref-name-regexp '(.+).fasta.gz' \
    --out-dir refs-k21-n10

# demo output
13:11:13.397 [INFO] kmcp v0.9.0
13:11:13.397 [INFO]   https://github.com/shenwei356/kmcp
13:11:13.397 [INFO] 
13:11:13.397 [INFO] checking input files ...
13:11:13.398 [INFO]   9 input file(s) given
13:11:13.398 [INFO] 
13:11:13.398 [INFO] -------------------- [main parameters] --------------------
13:11:13.398 [INFO] input and output:
13:11:13.398 [INFO]   input directory: refs/
13:11:13.398 [INFO]     regular expression of input files: (?i)\.(f[aq](st[aq])?|fna)(.gz)?$
13:11:13.398 [INFO]     *regular expression for extracting reference name from file name: (?i)(.+).fasta.gz
13:11:13.398 [INFO]     *regular expressions for filtering out sequences: [plasmid]
13:11:13.398 [INFO]   output directory: refs-k21-n10
13:11:13.398 [INFO] 
13:11:13.398 [INFO] sequences splitting: true
13:11:13.398 [INFO]   split parts: 10, overlap: 100 bp
13:11:13.398 [INFO] 
13:11:13.398 [INFO] k-mer (sketches) computing:
13:11:13.398 [INFO]   k-mer size(s): 21
13:11:13.398 [INFO]   circular genome: false
13:11:13.398 [INFO]   saving exact number of k-mers: true
13:11:13.398 [INFO] 
13:11:13.398 [INFO] -------------------- [main parameters] --------------------
13:11:13.398 [INFO] 
13:11:13.398 [INFO] computing ...
processed files:  9 / 9 [======================================] ETA: 0s. done
13:11:13.845 [INFO] 
13:11:13.845 [INFO] elapsed time: 453.870411ms
13:11:13.845 [INFO]

A summary file (_info.txt) is generated for later use. Users need to check if the reference IDs (column name) are what supposed to be.

#path	name	chunkIdx	idxNum	genomeSize	kmers
refs-k21-n10/672/060/672/NC_010655.1.fasta.gz/NC_010655.1-chunk_0.unik	NC_010655.1	0	10	2664102	264247
refs-k21-n10/672/060/672/NC_010655.1.fasta.gz/NC_010655.1-chunk_1.unik	NC_010655.1	1	10	2664102	266237
refs-k21-n10/595/162/595/NC_012971.2.fasta.gz/NC_012971.2-chunk_0.unik	NC_012971.2	0	10	4558953	450494
refs-k21-n10/292/039/292/NC_000913.3.fasta.gz/NC_000913.3-chunk_0.unik	NC_000913.3	0	10	4641652	459277
refs-k21-n10/934/859/934/NC_013654.1.fasta.gz/NC_013654.1-chunk_0.unik	NC_013654.1	0	10	4717338	470575

Meta data in the .unik file can be showed using kmcp utils unik-info:

kmcp utils unik-info refs-k21-n10/072/380/072/NZ_CP028116.1.fasta.gz/NZ_CP028116.1-chunk_0.unik -a

Step 2. Building databases

KMCP builds index for k-mers (sketches) with a modified Compact Bit-sliced Signature Index (COBS). We completely rewrite the algorithms, data structure, and file format, and have improved the indexing and searching speed (check benchmark).

Input:

The output directory generated by kmcp compute.

Database size and searching accuracy:

Use --dry-run to adjust parameters and check the final number of index files (#index-files) and the total file size.
-f/--false-positive-rate: the default value 0.3 is enough for a query with tens of k-mers (see BIGSI/COBS paper). Small values could largely increase the size of the database.
-n/--num-hash: large values could reduce the database size, at the cost of a slower searching speed. Values <=4 are recommended.
The value of block size -b/--block-size is better to be multiple of 64. The default value is:
```
(#unikFiles/#threads + 7) / 8 * 8
```
Use flag -x/--block-sizeX-kmers-t, -8/--block-size8-kmers-t, and -1/--block-size1-kmers-t to separately create indexes for inputs with huge number of k-mers, for precise control of database size.

Taxonomy data:

No taxonomy data are included in the database.
Taxonomy information are only needed in kmcp profile.

Performance tips:

The umber of blocks (.uniki files) is better to be smaller than or equal to the number of CPU cores for faster searching speed. We can set -j/--threads to control the blocks number. When more threads (>= 1.3 * #blocks) are given, extra workers are automatically created.
#threads files are simultaneously opened, and the max number of opened files is limited by the flag -F/--max-open-files. You may use a small value of -F/--max-open-files for hard disk drive storages.
When the database is used in a new computer with more CPU cores, kmcp search could automatically scale to utilize as many cores as possible.

Commands (usage):

# build database
#   number of index files: 32, for server with >= 32 CPU cores
#   bloom filter parameter:
#     number of hash function: 1
#     false positive rate: 0.3
kmcp index -I refs-k21-n10 \
    --threads 32 \
    --num-hash 1 \
    --false-positive-rate 0.3 \
    --out-dir refs.kmcp

# demo output
22:44:17.710 [INFO] kmcp v0.9.0
22:44:17.710 [INFO]   https://github.com/shenwei356/kmcp
22:44:17.710 [INFO] 
22:44:17.710 [INFO] loading .unik file infos from file: refs-k21-n10/_info.txt
22:44:17.716 [INFO]   90 cached file infos loaded
22:44:17.716 [INFO] 
22:44:17.716 [INFO] -------------------- [main parameters] --------------------
22:44:17.716 [INFO]   number of hashes: 1
22:44:17.716 [INFO]   false positive rate: 0.300000
22:44:17.717 [INFO]   k-mer size(s): 21
22:44:17.717 [INFO]   split seqequence size: 0, overlap: 0
22:44:17.717 [INFO]   block-sizeX-kmers-t: 10.00 M
22:44:17.717 [INFO]   block-sizeX        : 256.00
22:44:17.717 [INFO]   block-size8-kmers-t: 20.00 M
22:44:17.717 [INFO]   block-size1-kmers-t: 200.00 M
22:44:17.717 [INFO] -------------------- [main parameters] --------------------
22:44:17.717 [INFO] 
22:44:17.717 [INFO] building index ...
22:44:17.726 [WARN] ignore -X/--block-size (256) which is >= -b/--block-size (8)
22:44:17.726 [INFO] 
22:44:17.726 [INFO]   block size: 8
22:44:17.726 [INFO]   number of index files: 32 (may be more)
22:44:17.726 [INFO] 
22:44:17.726 [block #001]   1 / 1  100 %
22:44:17.726 [block #002]   1 / 1  100 %
22:44:17.726 [block #003]   1 / 1  100 %
22:44:17.726 [block #004]   1 / 1  100 %
22:44:17.726 [block #005]   1 / 1  100 %
22:44:17.726 [block #006]   1 / 1  100 %
22:44:17.726 [block #007]   1 / 1  100 %
22:44:17.726 [block #008]   1 / 1  100 %
22:44:17.726 [block #009]   1 / 1  100 %
22:44:17.727 [block #010]   1 / 1  100 %
22:44:17.727 [block #011]   1 / 1  100 %
22:44:17.727 [block #012]   1 / 1  100 %
[saved index files] 12 / 12  ETA: 0s. done
22:44:17.933 [INFO] 
22:44:17.933 [INFO] kmcp database with 42713316 k-mers saved to refs.kmcp
22:44:17.933 [INFO] total file size: 15.66 MB
22:44:17.933 [INFO] total index files: 12
22:44:17.933 [INFO] 
22:44:17.933 [INFO] elapsed time: 223.524128ms
22:44:17.933 [INFO]

Output:

refs.kmcp/
└── R001
    ├── _block001.uniki
    ├── _block002.uniki
    ├── _block003.uniki
    ├── _block004.uniki
    ├── _block005.uniki
    ├── _block006.uniki
    ├── _block007.uniki
    ├── _block008.uniki
    ├── _block009.uniki
    ├── _block010.uniki
    ├── _block011.uniki
    ├── _block012.uniki
    ├── __db.yml
    └── __name_mapping.tsv

__db.yml contains configuration of the database, and _blockXXX.uniki are index files. kmcp utils index-info could show the basic information of index files:

kmcp utils index-info  refs.kmcp/R001/_block001.uniki

file	k	canonical	num-hashes	num-sigs	num-names
refs.kmcp/R001/_block001.uniki	21	true	1	746442	8

Where num-sigs is the size of the bloom filters, and num-names is the number of genome (chunks).

What's next? Check the tutorials.

Merging GTDB and NCBI taxonomy

By default, we use NCBI taxonomy database for reference genomes from GTDB and Refseq. Alternately, we can also use the GTDB taxonomy.

The idea is to export lineages from both GTDB and NCBI using taxonkit reformat, and then create taxdump files from them with taxonkit create-taxdump.

Mapping genome assembly accessions to GTDB lineages using existing GTDB-taxdump, which provids GTDB taxonomy taxdump files for each GTDB version and a taxid.map file mapping accessions to TaxIds of taxa at the subspecies rank. Here, we only output lineages down to the species rank and leave name of taxa at the subspecies/strain rank empty.
```
taxonkit reformat \
    --data-dir gtdb-taxdump/R214/ \
    --taxid-field 2 \
    --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t" \
    gtdb-taxdump/R214/taxid.map \
| csvtk cut -H -t -f -2 -o gtdb-r214.tsv
```

Exporting taxonomic lineages of viral and fungal taxa with rank equal to or lower than species from NCBI taxdump. For taxa whose rank is "no rank" below the species, we treat them as tax of strain rank (--pseudo-strain, taxonkit v0.14.1 or later versionsneeded).

# for viral data
taxonkit reformat \
    --data-dir taxdump/ \
    --taxid-field 2 \
    --pseudo-strain \
    --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
    taxid.map \
| csvtk cut -H -t -f -2 -o genbank-viral.tsv

# for fungal data
taxonkit reformat \
    --data-dir taxdump/ \
    --taxid-field 2 \
    --pseudo-strain \
    --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
    taxid.map \
| csvtk cut -H -t -f -2 -o refseq-fungi.tsv

Creating taxdump from lineages above.

# move the lineage files to current directory
# cp gtdb/gtdb-r214.tsv genbank-viral/genbank-viral.tsv refseq-fungi/refseq-fungi.tsv .

cat gtdb-r214.tsv genbank-viral.tsv refseq-fungi.tsv \
    | taxonkit create-taxdump \
        --field-accession 1 \
        -R "superkingdom,phylum,class,order,family,genus,species,strain" \
        -O taxdump.gtdb+ncbi

NCBI taxonomy of Viruses changed rapidly, of which some TaxIds might be deleted. You may added them to the the taxid.map file for compatibility.

cat genbank-viral.tsv refseq-fungi.tsv gtdb-r214.tsv \
    | csvtk grep -Ht -f 2 -r -p "^$" \
    | cut -f 1 \
    | awk '{print $1"\t0"}' \
    >> taxdump.gtdb+ncbi/taxid.map

Some tests:

# SARS-COV-2 in NCBI taxonomy
$ echo 2697049 \
    | taxonkit lineage -t --data-dir taxdump/ \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L --data-dir taxdump/ \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
10239     superkingdom   Viruses
2559587   clade          Riboviria
2732396   kingdom        Orthornavirae
2732408   phylum         Pisuviricota
2732506   class          Pisoniviricetes
76804     order          Nidovirales
2499399   suborder       Cornidovirineae
11118     family         Coronaviridae
2501931   subfamily      Orthocoronavirinae
694002    genus          Betacoronavirus
2509511   subgenus       Sarbecovirus
694009    species        Severe acute respiratory syndrome-related coronavirus
2697049   no rank        Severe acute respiratory syndrome coronavirus 2

$ echo "Severe acute respiratory syndrome coronavirus 2" | taxonkit name2taxid --data-dir taxdump.gtdb+ncbi/
Severe acute respiratory syndrome coronavirus 2 216305222

$ echo 216305222 \
    | taxonkit lineage -t --data-dir taxdump.gtdb+ncbi/ \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L --data-dir taxdump.gtdb+ncbi/ \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
1287770734   superkingdom   Viruses
1506901452   phylum         Pisuviricota
1091693597   class          Pisoniviricetes
37745009     order          Nidovirales
738421640    family         Coronaviridae
906833049    genus          Betacoronavirus
1015862491   species        Severe acute respiratory syndrome-related coronavirus
216305222    strain         Severe acute respiratory syndrome coronavirus 2



$ echo "Escherichia coli"  | taxonkit name2taxid --data-dir taxdump.gtdb+ncbi/
Escherichia coli        1945799576

$ echo 1945799576 \
    | taxonkit lineage -t --data-dir taxdump.gtdb+ncbi/ \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L --data-dir taxdump.gtdb+ncbi/ \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
609216830    superkingdom   Bacteria
1641076285   phylum         Proteobacteria
329474883    class          Gammaproteobacteria
1012954932   order          Enterobacterales
87250111     family         Enterobacteriaceae
1187493883   genus          Escherichia
1945799576   species        Escherichia coli