Tutorial

Table of Contents

Formatting lineage
Parsing kraken/bracken result
Making nr blastdb for specific taxids
Summaries of taxonomy data
Merging GTDB and NCBI taxonomy
Filtering or subsetting taxdmp files to make a custom taxdmp with given TaxIDs

Formatting lineage

Show the lineage details of a TaxId. The command below works on Windows with the help of csvtk.

$ echo 2697049 \
    | taxonkit lineage -t \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht

10239     acellular root   Viruses                                        
2559587   realm            Riboviria                                      
2732396   kingdom          Orthornavirae                                  
2732408   phylum           Pisuviricota                                   
2732506   class            Pisoniviricetes                                
76804     order            Nidovirales                                    
2499399   suborder         Cornidovirineae                                
11118     family           Coronaviridae                                  
2501931   subfamily        Orthocoronavirinae                             
694002    genus            Betacoronavirus                                
2509511   subgenus         Sarbecovirus                                   
3418604   species          Betacoronavirus pandemicum                     
2697049   no rank          Severe acute respiratory syndrome coronavirus 2
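The unfold step in the pipeline above turns one semicolon-separated list of TaxIds into one TaxId per line. As a minimal sketch for readers without csvtk, the same effect can be approximated with standard tools. `unfold_lineage` is a hypothetical helper name, not part of TaxonKit or csvtk, and it assumes single-column input.

```shell
# Hypothetical helper: split a ";"-separated list into one item per line.
# It approximates `csvtk unfold -Ht -f 1 -s ";"` for single-column input only.
unfold_lineage() {
    tr ';' '\n'
}

# The last four TaxIds of the lineage shown above.
printf '%s\n' '694002;2509511;3418604;2697049' | unfold_lineage
```

Each resulting line can then be piped to `taxonkit lineage -r -n -L` exactly as in the original command.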

Example data.

$ cat taxids3.txt
376619
349741
239935
314101
11932
1327037
83333
1408252
2605619
2697049

Format to 7-level ranks ("superkingdom phylum class order family genus species").

$ cat taxids3.txt \
    | taxonkit reformat2 -I 1

376619  Bacteria;Pseudomonadota;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis
349741  Bacteria;Verrucomicrobiota;Verrucomicrobiia;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
239935  Bacteria;Verrucomicrobiota;Verrucomicrobiia;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
314101  Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B
11932   Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
1327037 Viruses;Uroviricota;Caudoviricetes;;;;Croceibacter phage P2559Y
83333   Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
1408252 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
2605619 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Betacoronavirus pandemicum

Format to 8-level ranks ("superkingdom phylum class order family genus species subspecies/rank").

$ cat taxids3.txt \
    | taxonkit reformat2 -I 1 -f "{domain|acellular root|superkingdom};{phylum};{class};{order};{family};{genus};{species};{strain|subspecies|no rank}"

376619  Bacteria;Pseudomonadota;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS
349741  Bacteria;Verrucomicrobiota;Verrucomicrobiia;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935  Bacteria;Verrucomicrobiota;Verrucomicrobiia;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;
314101  Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B;environmental samples
11932   Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle;unclassified Retroviridae
1327037 Viruses;Uroviricota;Caudoviricetes;;;;Croceibacter phage P2559Y;unclassified Caudoviricetes
83333   Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12
1408252 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178
2605619 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli O16:H48
2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Betacoronavirus pandemicum;Severe acute respiratory syndrome coronavirus 2

Replace missing ranks with Unassigned and output tab-delimited format. (Warning: for NCBI taxonomy data since March 2025, reformat can't handle Bacteria's rank domain and Viruses' rank acellular root simultaneously.)

$ cat taxids3.txt \
    | taxonkit reformat -I 1 -r "Unassigned" -f "{d}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
    | csvtk pretty -H -t

376619    Bacteria     Pseudomonadota      Gammaproteobacteria   Thiotrichales        Francisellaceae      Francisella                  Francisella tularensis                            Francisella tularensis subsp. holarctica LVS
349741    Bacteria     Verrucomicrobiota   Verrucomicrobiia      Verrucomicrobiales   Akkermansiaceae      Akkermansia                  Akkermansia muciniphila                           Akkermansia muciniphila ATCC BAA-835        
239935    Bacteria     Verrucomicrobiota   Verrucomicrobiia      Verrucomicrobiales   Akkermansiaceae      Akkermansia                  Akkermansia muciniphila                           Unassigned                                  
314101    Bacteria     Unassigned          Unassigned            Unassigned           Unassigned           Unassigned                   uncultured murine large bowel bacterium BAC 54B   Unassigned                                  
11932     Unassigned   Artverviricota      Revtraviricetes       Ortervirales         Retroviridae         Intracisternal A-particles   Mouse Intracisternal A-particle                   Unassigned                                  
1327037   Unassigned   Uroviricota         Caudoviricetes        Unassigned           Unassigned           Unassigned                   Croceibacter phage P2559Y                         Unassigned                                  
83333     Bacteria     Pseudomonadota      Gammaproteobacteria   Enterobacterales     Enterobacteriaceae   Escherichia                  Escherichia coli                                  Escherichia coli K-12                       
1408252   Bacteria     Pseudomonadota      Gammaproteobacteria   Enterobacterales     Enterobacteriaceae   Escherichia                  Escherichia coli                                  Escherichia coli R178                       
2605619   Bacteria     Pseudomonadota      Gammaproteobacteria   Enterobacterales     Enterobacteriaceae   Escherichia                  Escherichia coli                                  Unassigned                                  
2697049   Unassigned   Pisuviricota        Pisoniviricetes       Nidovirales          Coronaviridae        Betacoronavirus              Betacoronavirus pandemicum                        Unassigned
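If a semicolon-separated lineage with empty ranks has already been saved (such as the 7-level output shown earlier), the placeholder can also be filled in afterwards. The following is a rough sketch with awk, not a TaxonKit feature; `fill_missing` is a hypothetical helper name, and the input is assumed to be the lineage string only, without the TaxId column.

```shell
# Hypothetical helper: read ";"-separated lineages, replace empty ranks with a
# placeholder (default "Unassigned"), and print them tab-delimited.
fill_missing() {
    awk -F';' -v OFS='\t' -v repl="${1:-Unassigned}" '{
        for (i = 1; i <= NF; i++) if ($i == "") $i = repl
        $1 = $1      # force awk to rebuild the record with OFS
        print
    }'
}

# Lineage of TaxId 314101 from the 7-level example.
printf '%s\n' 'Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B' \
    | fill_missing
```

Passing an argument changes the placeholder, e.g. `fill_missing NA`.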

Fill missing ranks and add prefixes. (Warning: for NCBI taxonomy data since March 2025, reformat can't handle Bacteria's rank domain and Viruses' rank acellular root simultaneously.)

$ cat taxids3.txt \
    | taxonkit reformat -I 1 -F -P -f "{d}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
    | csvtk pretty -H -t

376619    d__Bacteria                      p__Pseudomonadota                 c__Gammaproteobacteria           o__Thiotrichales                       f__Francisellaceae                      g__Francisella                         s__Francisella tularensis                            t__Francisella tularensis subsp. holarctica LVS                                  
349741    d__Bacteria                      p__Verrucomicrobiota              c__Verrucomicrobiia              o__Verrucomicrobiales                  f__Akkermansiaceae                      g__Akkermansia                         s__Akkermansia muciniphila                           t__Akkermansia muciniphila ATCC BAA-835                                          
239935    d__Bacteria                      p__Verrucomicrobiota              c__Verrucomicrobiia              o__Verrucomicrobiales                  f__Akkermansiaceae                      g__Akkermansia                         s__Akkermansia muciniphila                           t__unclassified Akkermansia muciniphila subspecies/strain                        
314101    d__Bacteria                      p__unclassified Bacteria phylum   c__unclassified Bacteria class   o__unclassified Bacteria order         f__unclassified Bacteria family         g__unclassified Bacteria genus         s__uncultured murine large bowel bacterium BAC 54B   t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain
11932     d__unclassified Viruses domain   p__Artverviricota                 c__Revtraviricetes               o__Ortervirales                        f__Retroviridae                         g__Intracisternal A-particles          s__Mouse Intracisternal A-particle                   t__unclassified Mouse Intracisternal A-particle subspecies/strain                
1327037   d__unclassified Viruses domain   p__Uroviricota                    c__Caudoviricetes                o__unclassified Caudoviricetes order   f__unclassified Caudoviricetes family   g__unclassified Caudoviricetes genus   s__Croceibacter phage P2559Y                         t__unclassified Croceibacter phage P2559Y subspecies/strain                      
83333     d__Bacteria                      p__Pseudomonadota                 c__Gammaproteobacteria           o__Enterobacterales                    f__Enterobacteriaceae                   g__Escherichia                         s__Escherichia coli                                  t__Escherichia coli K-12                                                         
1408252   d__Bacteria                      p__Pseudomonadota                 c__Gammaproteobacteria           o__Enterobacterales                    f__Enterobacteriaceae                   g__Escherichia                         s__Escherichia coli                                  t__Escherichia coli R178                                                         
2605619   d__Bacteria                      p__Pseudomonadota                 c__Gammaproteobacteria           o__Enterobacterales                    f__Enterobacteriaceae                   g__Escherichia                         s__Escherichia coli                                  t__unclassified Escherichia coli subspecies/strain                               
2697049   d__unclassified Viruses domain   p__Pisuviricota                   c__Pisoniviricetes               o__Nidovirales                         f__Coronaviridae                        g__Betacoronavirus                     s__Betacoronavirus pandemicum                        t__unclassified Betacoronavirus pandemicum subspecies/strain

When there are no nodes of rank "subspecies" or "strain", we can switch on -S/--pseudo-strain to use the node with the lowest rank as the subspecies/strain name, provided that its rank is lower than "species". (Warning: for NCBI taxonomy data since March 2025, reformat can't handle Bacteria's rank domain and Viruses' rank acellular root simultaneously.)

$ cat taxids3.txt \
    | taxonkit lineage -r -L \
    | taxonkit reformat -I 1 -F -S -f "{d}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
    | cut -f 1,2,9,10 \
    | csvtk add-header -t -n "taxid,rank,species,strain" \
    | csvtk pretty -t

taxid     rank         species                                           strain                                                                        
-------   ----------   -----------------------------------------------   ------------------------------------------------------------------------------
376619    strain       Francisella tularensis                            Francisella tularensis subsp. holarctica LVS                                  
349741    strain       Akkermansia muciniphila                           Akkermansia muciniphila ATCC BAA-835                                          
239935    species      Akkermansia muciniphila                           unclassified Akkermansia muciniphila subspecies/strain                        
314101    species      uncultured murine large bowel bacterium BAC 54B   unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain
11932     species      Mouse Intracisternal A-particle                   unclassified Mouse Intracisternal A-particle subspecies/strain                
1327037   species      Croceibacter phage P2559Y                         unclassified Croceibacter phage P2559Y subspecies/strain                      
83333     strain       Escherichia coli                                  Escherichia coli K-12                                                         
1408252   subspecies   Escherichia coli                                  Escherichia coli R178                                                         
2605619   no rank      Escherichia coli                                  Escherichia coli O16:H48                                                      
2697049   no rank      Betacoronavirus pandemicum                        Severe acute respiratory syndrome coronavirus 2

List the eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with "no rank". When filtering with -L/--lower-than, you can use -n/--save-predictable-norank to keep some special ranks that have no order, where the rank of the closest higher node is still lower than the rank cutoff.

$ time taxonkit list --ids 1 -I "" \
    | taxonkit filter -L species -E species -R -N -n \
    | taxonkit lineage -n -r \
    | taxonkit reformat2 -I 1 -f "{domain|acellular root|superkingdom}\t{phylum}\t{class}\t{order}\t{family}\t{genus}\t{species}\t{strain|subspecies|no rank}" \
    | csvtk cut -Ht -l -f 1,4,3,2,5-12 \
    | csvtk add-header -t -n "taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain" \
    | pigz -c > result.tsv.gz

real    0m9.778s
user    1m22.211s
sys     0m8.489s

$ pigz -cd result.tsv.gz \
    | csvtk grep -t -f taxid -p 2697049 \
    | csvtk transpose -t \
    | csvtk pretty -H -t -W 70 -x ';' -S round

╭---------┬------------------------------------------------------------------------╮
| taxid   | 2697049                                                                |
├---------┼------------------------------------------------------------------------┤
| rank    | no rank                                                                |
├---------┼------------------------------------------------------------------------┤
| name    | Severe acute respiratory syndrome coronavirus 2                        |
├---------┼------------------------------------------------------------------------┤
| lineage | Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;          |
|         | Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;          |
|         | Betacoronavirus;Sarbecovirus;Betacoronavirus pandemicum;               |
|         | Severe acute respiratory syndrome coronavirus 2                        |
├---------┼------------------------------------------------------------------------┤
| kingdom | Viruses                                                                |
├---------┼------------------------------------------------------------------------┤
| phylum  | Pisuviricota                                                           |
├---------┼------------------------------------------------------------------------┤
| class   | Pisoniviricetes                                                        |
├---------┼------------------------------------------------------------------------┤
| order   | Nidovirales                                                            |
├---------┼------------------------------------------------------------------------┤
| family  | Coronaviridae                                                          |
├---------┼------------------------------------------------------------------------┤
| genus   | Betacoronavirus                                                        |
├---------┼------------------------------------------------------------------------┤
| species | Betacoronavirus pandemicum                                             |
├---------┼------------------------------------------------------------------------┤
| strain  | Severe acute respiratory syndrome coronavirus 2                        |
╰---------┴------------------------------------------------------------------------╯

Mapping old species names to new ones

Some species names in papers or on websites might have changed. We can try querying their TaxIds via the old names and then retrieve the new ones.

cat example/changed_species_names.txt
Lactobacillus fermentum
Mycoplasma gallinaceum

#  TaxonKit >= v0.15.1
cat example/changed_species_names.txt \
    | taxonkit name2taxid \
    | taxonkit lineage -i 2 -n \
    | cut -f 1,4

Lactobacillus fermentum Limosilactobacillus fermentum
Mycoplasma gallinaceum

Whoops, there's no information for Mycoplasma gallinaceum. So we check the taxid-changelog.

zcat taxonkit/taxid-changelog.csv.gz \
    | csvtk grep -f name -P example/changed_species_names.txt \
    | csvtk cut -f taxid,version,change,name,rank \
    | csvtk pretty

taxid   version      change           name                      rank
-----   ----------   --------------   -----------------------   -------
1613    2013-02-21   NEW              Lactobacillus fermentum   species
1613    2016-03-01   ABSORB           Lactobacillus fermentum   species
1613    2016-03-01   CHANGE_LIN_LEN   Lactobacillus fermentum   species
29556   2013-02-21   NEW              Mycoplasma gallinaceum    species
29556   2016-03-01   CHANGE_LIN_LEN   Mycoplasma gallinaceum    species
29556   2021-01-01   CHANGE_NAME      Mycoplasma gallinaceum    species
29556   2021-01-01   CHANGE_LIN_LIN   Mycoplasma gallinaceum    species

We can see that the names have changed. The full change history can be queried with the TaxId, e.g.,

taxid   version      change           change-value   name                        rank
-----   ----------   --------------   ------------   -------------------------   -------
29556   2013-02-21   NEW                             Mycoplasma gallinaceum      species
29556   2016-03-01   CHANGE_LIN_LEN                  Mycoplasma gallinaceum      species
29556   2020-09-01   CHANGE_NAME                     Mycoplasmopsis gallinacea   species
29556   2020-09-01   CHANGE_LIN_TAX                  Mycoplasmopsis gallinacea   species
29556   2021-01-01   CHANGE_NAME                     Mycoplasma gallinaceum      species
29556   2021-01-01   CHANGE_LIN_LIN                  Mycoplasma gallinaceum      species
29556   2021-09-01   CHANGE_NAME                     Mycoplasmopsis gallinacea   species
29556   2021-09-01   CHANGE_LIN_LIN                  Mycoplasmopsis gallinacea   species
29556   2023-03-01   CHANGE_LIN_LIN                  Mycoplasmopsis gallinacea   species

Then we just use their TaxIds to retrieve the new names. The final commands are:

zcat taxonkit/taxid-changelog.csv.gz \
    | csvtk grep -f name -P example/changed_species_names.txt \
    | csvtk uniq -f taxid \
    | csvtk cut -f name,taxid \
    | csvtk del-header \
    | csvtk csv2tab \
    | taxonkit lineage -i 2 -n \
    | cut -f 1,4

Lactobacillus fermentum Limosilactobacillus fermentum
Mycoplasma gallinaceum  Mycoplasmopsis gallinacea
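The changelog itself already records the most recent name of each TaxId, so the lookup can also be sketched without a local taxdump. The snippet below is an illustration under stated assumptions, not the method used above: `latest_name` is a hypothetical helper, the input is assumed to have a header and the columns taxid,version,change,change-value,name,rank, and fields are assumed to contain no quoted commas (real changelog rows may need a proper CSV parser such as csvtk). It reports the name at the latest version found in the changelog, which may lag behind the current NCBI taxonomy.

```shell
# Hypothetical helper: print "old_name<TAB>latest_name" for one old name,
# reading a simplified changelog CSV from stdin.
latest_name() {
    awk -F',' -v old="$1" '
        NR == 1 { next }                                # skip the header line
        $5 == old { hit[$1] = 1 }                       # TaxIds that ever carried the old name
        $2 >= ver[$1] { ver[$1] = $2; name[$1] = $5 }   # ISO dates compare correctly as strings
        END { for (t in hit) print old "\t" name[t] }
    '
}

# A few rows taken from the change history of TaxId 29556 shown above.
latest_name "Mycoplasma gallinaceum" <<'EOF'
taxid,version,change,change-value,name,rank
29556,2013-02-21,NEW,,Mycoplasma gallinaceum,species
29556,2020-09-01,CHANGE_NAME,,Mycoplasmopsis gallinacea,species
29556,2021-01-01,CHANGE_NAME,,Mycoplasma gallinaceum,species
29556,2021-09-01,CHANGE_NAME,,Mycoplasmopsis gallinacea,species
EOF
```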

Add taxonomy information to BLAST result

A BLAST result file blast_result.txt, where the second column is the accession of the matched sequence.

head -n 5 blast_result.txt | csvtk pretty -Ht

xxxxxxxxxxxxxxxxxxxxx/2/ccs    XM_013496560.1   78.745    494   99    3    6361    6851    895        1385       6.53e-83    326 
xxxxxxxxxxxxxxxxxxxxx/2/ccs    XM_013496560.1   78.543    494   100   3    17168   17658   895        1385       3.04e-81    320 
xxxxxxxxxxxxxxxxxxxxx/76/ccs   LR699760.1       100.000   37    0     0    8139    8175    14507874   14507910   4.27e-06    69.4
xxxxxxxxxxxxxxxxxxxxx/80/ccs   HG994975.1       80.556    540   81    16   8269    8798    3821290    3820765    8.65e-104   394 
xxxxxxxxxxxxxxxxxxxxx/80/ccs   HG994975.1       77.805    410   89    2    9590    9998    3819858    3819450    5.51e-61    252

Prepare the acc2taxid.tsv.gz file from nucl_gb.accession2taxid.gz. Here we use the accession column instead of the accession.version column, in case the versions of some accessions do not match.

zcat nucl_gb.accession2taxid.gz | cut -f 1,3 | gzip -c > acc2taxid.tsv.gz

Extract needed acc2taxid subset to reduce memory usage.

# extract accession and deduplicate and remove versions
cut -f 2 blast_result.txt | csvtk uniq -Ht | csvtk replace -Ht -p '\.\d+$' > acc.txt

# grep from acc2taxid.tsv.gz
zcat acc2taxid.tsv.gz | grep -w -f acc.txt >  hit.acc2taxid.tsv
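Note that `grep -w -f` treats every accession as a pattern that may match anywhere on a line. A possible alternative, sketched here with awk, is an exact lookup on the first column only. `filter_acc2taxid` is a hypothetical helper name; the input is assumed to be the two-column accession/taxid stream produced above, and the row `AB000001`/`9606` in the sample is made up for illustration.

```shell
# Hypothetical helper: keep rows of an accession<TAB>taxid stream (stdin)
# whose accession (column 1) is listed in the file given as $1.
filter_acc2taxid() {
    awk -F'\t' 'NR == FNR { want[$1] = 1; next } ($1 in want)' "$1" -
}

# Small self-contained demonstration with a temporary accession list.
acc_list=$(mktemp)
printf '%s\n' XM_013496560 LR699760 > "$acc_list"
printf 'XM_013496560\t44415\nAB000001\t9606\nLR699760\t3702\n' \
    | filter_acc2taxid "$acc_list"
rm -f "$acc_list"
```

With the files from this tutorial, the call would look like `zcat acc2taxid.tsv.gz | filter_acc2taxid acc.txt > hit.acc2taxid.tsv`.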

Prepare hit.taxid2name.tsv: species names are retrieved for the TaxIds.

cut -f 2 hit.acc2taxid.tsv | taxonkit reformat -f '{s}' -I 1 > hit.taxid2name.tsv

Append taxids according to the accessions, and append species names for the taxids.

csvtk add-header -t --names "qseqid,sseqid,pident,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore"  blast_result.txt \
    | csvtk mutate -t -f sseqid -n taxid \
    | csvtk replace -t -k hit.acc2taxid.tsv -f taxid -p '(.+)\.\d+' -r '{kv}' \
    | csvtk mutate -t -f taxid -n species \
    | csvtk replace -t -k hit.taxid2name.tsv -f species -p '(.+)' -r '{kv}' \
    | head -n 5 | csvtk pretty -t

qseqid                         sseqid           pident    length   mismatch   gapopen   qstart   qend    sstart     send       evalue      bitscore   taxid   species             
----------------------------   --------------   -------   ------   --------   -------   ------   -----   --------   --------   ---------   --------   -----   --------------------
xxxxxxxxxxxxxxxxxxxxx/2/ccs    XM_013496560.1   78.745    494      99         3         6361     6851    895        1385       6.53e-83    326        44415   Eimeria mitis       
xxxxxxxxxxxxxxxxxxxxx/2/ccs    XM_013496560.1   78.543    494      100        3         17168    17658   895        1385       3.04e-81    320        44415   Eimeria mitis       
xxxxxxxxxxxxxxxxxxxxx/76/ccs   LR699760.1       100.000   37       0          0         8139     8175    14507874   14507910   4.27e-06    69.4       3702    Arabidopsis thaliana
xxxxxxxxxxxxxxxxxxxxx/80/ccs   HG994975.1       80.556    540      81         16        8269     8798    3821290    3820765    8.65e-104   394        5802    Eimeria tenella

Parsing kraken/bracken result

Example Data

SRS014459-Stool.fasta.gz

Run Kraken2 and Bracken

KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf
THREADS=16

CLASSIFICATION_LVL=S
THRESHOLD=10

READ_LEN=100
SAMPLE=SRS014459-Stool.fasta.gz

BRACKEN_OUTPUT_FILE=$SAMPLE

kraken2 --db ${KRAKEN_DB} --threads ${THREADS} --report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken

est_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \
    -l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken

Original format

$ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport
100.00  9491    0       R       1       root
99.85   9477    0       R1      131567    cellular organisms
99.85   9477    0       D       2           Bacteria
66.08   6271    0       D1      1783270       FCB group
66.08   6271    0       D2      68336           Bacteroidetes/Chlorobi group
66.08   6271    0       P       976               Bacteroidetes
66.08   6271    0       C       200643              Bacteroidia
66.08   6271    0       O       171549                Bacteroidales
34.45   3270    0       F       815                     Bacteroidaceae
34.45   3270    0       G       816                       Bacteroides
10.43   990     990     S       246787                      Bacteroides cellulosilyticus
7.98    757     757     S       28116                       Bacteroides ovatus
3.10    293     0       G1      2646097                     unclassified Bacteroides
1.06    100     100     S       2755405                       Bacteroides sp. CACC 737
0.49    46      46      S       2650157                       Bacteroides sp. HF-5287

Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py)

$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
    | csvtk cut -Ht -f 5,1 \
    | taxonkit reformat2 -I 1 -f "k__{domain|acellular root|superkingdom}|p__{phylum}|c__{class}|o__{order}|f__{family}|g__{genus}|s__{species}" \
    | csvtk cut -Ht -f 3,2 \
    | csvtk replace -Ht -p "(\|[kpcofgs]__)+$" \
    | csvtk replace -Ht -p "\|([kpcofgs]__\|)+" -r "|" \
    | csvtk uniq -Ht \
    | csvtk grep -Ht -p k__ -v \
    | tee SRS014459-Stool.fasta.gz_bracken_species.kreport.format \
    | head -n 10

k__Bacteria     99.85
k__Bacteria|p__Bacteroidota     66.08
k__Bacteria|p__Bacteroidota|c__Bacteroidia      66.08
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales     66.08
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae   34.45
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides    34.45
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus    10.43
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus      7.98
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737        1.06
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides zhangwenhongii      0.49
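The two `csvtk replace` steps above remove rank prefixes that were left without a name. To see what each regular expression does in isolation, here is a small sketch using sed on the lineage string alone (no abundance column). `trim_empty_ranks` is a hypothetical helper name, and sed with `-E` is assumed to be available.

```shell
# Hypothetical helper: drop empty rank prefixes from a MetaPhlAn-style lineage.
#   1st expression: strip trailing empty ranks, e.g. "...|g__|s__"
#   2nd expression: collapse empty ranks in the middle, e.g. "|o__|f__|g__|"
trim_empty_ranks() {
    sed -E -e 's/(\|[kpcofgs]__)+$//' -e 's/\|([kpcofgs]__\|)+/|/'
}

# A phylum-level node: everything below the phylum is empty.
printf '%s\n' 'k__Bacteria|p__Bacteroidota|c__|o__|f__|g__|s__' | trim_empty_ranks

# A species whose order, family and genus are missing.
printf '%s\n' 'k__Viruses|p__Uroviricota|c__Caudoviricetes|o__|f__|g__|s__Croceibacter phage P2559Y' \
    | trim_empty_ranks
```

The first call should print `k__Bacteria|p__Bacteroidota`, and the second should keep the species directly after the class.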

Converting to Qiime format

$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
    | csvtk cut -Ht -f 5,1 \
    | taxonkit reformat2 -I 1 -f "k__{domain|acellular root|superkingdom}; p__{phylum}; c__{class}; o__{order}; f__{family}; g__{genus}; s__{species}" \
    | csvtk cut -Ht -f 3,2 \
    | csvtk replace -Ht -p "(; [kpcofgs]__)+$" \
    | csvtk replace -Ht -p "; ([kpcofgs]__; )+" -r "; " \
    | csvtk uniq -Ht \
    | csvtk grep -Ht -p k__ -v \
    | head -n 10

k__Bacteria     99.85
k__Bacteria; p__Bacteroidota    66.08
k__Bacteria; p__Bacteroidota; c__Bacteroidia    66.08
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales  66.08
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae       34.45
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides       34.45
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus      10.43
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus        7.98
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737  1.06
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides zhangwenhongii        0.49

Save taxon proportion and taxid, and get lineage, name and rank.

$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
    | csvtk cut -Ht -f 1,5 \
    | taxonkit lineage -i 2 -n -r \
    | csvtk cut -Ht -f 1,2,5,4,3 \
    | head -n 10 \
    | csvtk pretty -Ht

100.00   1         no rank         root                             root                                                                                                                                                 
99.85    131567    cellular root   cellular organisms               cellular organisms                                                                                                                                   
99.85    2         domain          Bacteria                         cellular organisms;Bacteria                                                                                                                          
66.08    1783270   clade           FCB group                        cellular organisms;Bacteria;Pseudomonadati;FCB group                                                                                                 
66.08    68336     clade           Bacteroidota/Chlorobiota group   cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group                                                                  
66.08    976       phylum          Bacteroidota                     cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota                                                     
66.08    200643    class           Bacteroidia                      cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota;Bacteroidia                                         
66.08    171549    order           Bacteroidales                    cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota;Bacteroidia;Bacteroidales                           
34.45    815       family          Bacteroidaceae                   cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae            
34.45    816       genus           Bacteroides                      cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides

Keep only taxa at species level or lower, and get the lineage in the format "superkingdom phylum class order family genus species".

$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
    | csvtk cut -Ht -f 1,5 \
    | taxonkit filter -N -E species -L species -i 2 \
    | taxonkit lineage -i 2 -n -r \
    | taxonkit reformat2 -I 2  \
    | csvtk cut -Ht -f 1,2,5,4,6 \
    | csvtk add-header -t -n abundance,taxid,rank,name,lineage \
    | head -n 10 \
    | csvtk pretty -t

abundance   taxid     rank      name                           lineage                                                                                                
---------   -------   -------   ----------------------------   -------------------------------------------------------------------------------------------------------
10.43       246787    species   Bacteroides cellulosilyticus   Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus
7.98        28116     species   Bacteroides ovatus             Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus          
1.06        2755405   species   Bacteroides sp. CACC 737       Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737    
0.49        2650157   species   Bacteroides zhangwenhongii     Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides zhangwenhongii  
0.99        2528203   species   Bacteroides sp. A1C1           Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1        
0.28        2763022   species   Bacteroides sp. M10            Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10         
0.16        2650158   species   Bacteroides luhongzhouii       Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides luhongzhouii    
0.12        2715212   species   Bacteroides faecium            Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides faecium         
5.10        817       species   Bacteroides fragilis           Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis

Making nr blastdb for specific taxidsLink

Attention:

(2023-11-27) BLAST+ 2.15.0 supports limiting a search to a group of organisms without first using a custom script to get all species-level Taxonomy IDs (taxids) for the group. Details.

E.g., Search of the nr BLAST database limited to Bacteria (taxID 2).
```
blastp -db nr -taxids 2 -query ...
```
(2019) BLAST+ 2.8.1 was released with new databases that let you limit a search by taxonomy using information built into the BLAST databases, so you no longer need to build a blastdb for specific taxids.

Changes:

2018-09-13 rewritten
2018-12-22 providing faster method for step 3.1
2019-01-07 add note of new blastdb version
2020-10-14 update steps for huge numbers of accessions belonging to a high-level taxon such as bacteria.

Data:

pre-formatted blastdb (09/10/2018)
prot.accession2taxid.gz (09/07/2018) (optional, but recommended)

Hardware in this tutorial

CPU: AMD 8-cores/16-threads 3.7 GHz
RAM: 64GB
DISK:
- Taxonomy files stored on NVMe SSD
- blastdb files stored on 7200rpm HDD

Tools:

blast+
pigz (recommended, faster than gzip)
taxonkit
seqkit (recommended), version >= 0.14.0
rush (optional, for parallelizing sequence filtering)

Steps:

Listing all taxids below $id using taxonkit.

id=6656

# 6656 is the phylum Arthropoda
# echo 6656 | taxonkit lineage | taxonkit reformat
# 6656    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda    Eukaryota;Arthropoda;;;;;

# 2     bacteria
# 2157  archaea
# 4751  fungi
# 10239 virus

# time: 2s
taxonkit list --ids $id --indent "" > $id.taxid.txt

# taxonkit list --ids 2,4751,10239 --indent "" > microbe.taxid.txt

wc -l $id.taxid.txt
# 518373 6656.taxid.txt
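
To make explicit what this step returns, the sketch below collects a subtree from a tiny mock nodes.dmp with awk. The taxids (1, 10, 11, 12, 20, 21) and the file are invented for illustration, and the parent-climbing loop is only a conceptual model, not a description of how taxonkit list is implemented.

```shell
#!/bin/sh
# Mock nodes.dmp (hypothetical taxids): taxid | parent taxid | rank
nodes=$(mktemp)
printf '%s\t|\t%s\t|\t%s\t|\n' \
    1 1 'no rank' \
    10 1 phylum \
    11 10 class \
    12 11 order \
    20 1 phylum \
    21 20 class > "$nodes"

id=10

# Keep every taxid whose chain of parents reaches $id (including $id itself).
# The root node is its own parent, which ends the climb.
subtree_ids=$(awk -F '\t[|]\t' -v root="$id" '
    { parent[$1] = $2 }
    END {
        for (t in parent) {
            p = t
            while (p != root && (p in parent) && parent[p] != p) p = parent[p]
            if (p == root) print t
        }
    }' "$nodes" | sort -n)

echo "$subtree_ids"
rm -f "$nodes"
```

In this mock tree, taxids 20 and 21 are dropped because their parent chain reaches the root without passing through 10.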

Retrieving target accessions. There are two options:

From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in nr.

# time: 4min
pigz -dc prot.accession2taxid.gz \
    | csvtk grep -t -f taxid -P $id.taxid.txt \
    | csvtk cut -t -f accession.version,taxid \
    | sed 1d \
    > $id.acc2taxid.txt

cut -f 1 $id.acc2taxid.txt > $id.acc.txt

wc -l $id.acc.txt
# 8174609 6656.acc.txt
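
As a rough illustration of what the csvtk grep and csvtk cut steps above do, here is the same selection written with plain awk on mock data. The accessions, taxids, and file names are all invented; only the four column names follow prot.accession2taxid.

```shell
#!/bin/sh
# Mock inputs (all accessions and taxids below are invented).
work=$(mktemp -d)
printf '%s\n' 10 11 12 > "$work/10.taxid.txt"
printf '%s\t%s\t%s\t%s\n' \
    accession accession.version taxid gi \
    AAA00001 AAA00001.1 11 0 \
    AAA00002 AAA00002.2 20 0 \
    AAA00003 AAA00003.1 12 0 > "$work/prot.accession2taxid"

# First file: remember the wanted taxids.
# Second file: skip the header, then print accession.version and taxid
# for rows whose taxid (column 3) is wanted.
acc2taxid=$(awk -F '\t' 'NR == FNR { keep[$1] = 1; next }
    FNR > 1 && ($3 in keep) { print $2 "\t" $3 }' \
    "$work/10.taxid.txt" "$work/prot.accession2taxid")

echo "$acc2taxid"
rm -rf "$work"
```

The awk version holds the taxid list in memory, much like a pattern file, so it is meant to explain the logic rather than to replace the csvtk pipeline.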

From pre-formatted nr blastdb

# time: 40min
blastdbcmd -db nr -entry all -outfmt "%a %T" | pigz -c > nr.acc2taxid.txt.gz

pigz -dc nr.acc2taxid.txt.gz | wc -l
# 555220892

# time: 3min
pigz -dc nr.acc2taxid.txt.gz \
    | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \
    | cut -d ' '  -f 1 \
    > $id.acc.txt

wc -l $id.acc.txt
# 6928021 6656.acc.txt

Retrieving FASTA sequences from pre-formatted blastdb. There are two options:

From nr.fa exported from pre-formatted blastdb (faster, smaller output file, recommended). DO NOT download nr.gz directly from the NCBI FTP site: its FASTA headers are not well formatted.

# 1. exporting nr.fa from pre-formatted blastdb

# time: 117min (run only once)
blastdbcmd -db nr -dbtype prot -entry all -outfmt "%f" -out - | pigz -c > nr.fa.gz

# =====================================================================

# 2. filtering sequences belonging to $id

# ---------------------------------------------------------------------

# method 1) (for cases where $id.acc.txt is not very large)
# time: 80min
# the perl one-liner unfolds records that have multiple accessions
time cat <(echo) <(pigz -dc nr.fa.gz) \
    | perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//;  $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } ' \
    | seqkit grep -f $id.acc.txt -o nr.$id.fa.gz

# ---------------------------------------------------------------------

# method 2) (**faster**)

# 33min (run only once)
# (1). split nr.fa.gz. # Note: I have 16 cpus.
$ time seqkit split2 -p 15 nr.fa.gz

# (2). parallelize unfolding
$ cat _unfold_blastdb_fa.sh
#!/bin/sh
perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//;  $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } '

# 10 min
time ls nr.fa.gz.split/nr.part_*.fa.gz \
    | rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \
        | ./_unfold_blastdb_fa.sh \
        | seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\.(.+)$} '

# (3). merge result
cat nr.$id.part*.fa.gz > nr.$id.fa.gz
rm nr.$id.part*.fa.gz

# ---------------------------------------------------------------------

# method 3) (for huge $id.acc.txt file, e.g., bacteria)

# (1). split ${id}.acc.txt into several parts. chunk size depends on lines and RAM (64G for me).
split -d -l 300000000 $id.acc.txt $id.acc.txt.part_

# (2). filter
time ls $id.acc.txt.part_* \
    | rush -j 1 --immediate-output -v id=$id \
        'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \
        | ./_unfold_blastdb_fa.sh \
        | seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz '

# (3). merge
cat nr.$id.part*.fa.gz > nr.$id.fa.gz

# clean
rm nr.$id.part*.fa.gz
rm $id.acc.txt.part_*

# (4). optionally adding taxid, you may edit replacement (-r) below
# split
time split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_

ln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz         
i=0
for f in $id.acc2taxid.txt.part_* ; do
    echo $f
    time pigz -cd nr.$id.with-taxid.part$i.fa.gz \
        | seqkit replace -k $f -p "^([^\-]+?) " -r "{kv}-\$1 " -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz;
    /bin/rm nr.$id.with-taxid.part$i.fa.gz
    i=$(($i+1));
done
mv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz

# =====================================================================

# 3. counting sequences
#
# ls -lh nr.$id.fa.gz
# -rw-r--r-- 1 shenwei shenwei 902M 9月  13 01:42 nr.6656.fa.gz
#
pigz -dc nr.$id.fa.gz | grep '^>' -c

# 6928017
# Here 6928017 ~=  6928021 ($id.acc.txt)
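
The unfold-then-filter idea used in the methods above can be tried on a toy file. The sketch below swaps the perl one-liner and seqkit grep for two small awk programs, so it is a simplified stand-in: it splits headers on every " >" and would mis-split a description that itself contains that sequence. The accessions and sequences are invented.

```shell
#!/bin/sh
# Mock FASTA with one folded header (two accessions share one sequence).
work=$(mktemp -d)
cat > "$work/nr.fa" <<'EOF'
>ACC1.1 protein X [species one] >ACC2.1 protein X [species two]
MKVLA
GGT
>ACC3.1 protein Y [species three]
MAAQ
EOF
printf '%s\n' ACC2.1 ACC3.1 > "$work/acc.txt"

# Step 1: emit one record per accession, repeating the sequence.
# Step 2: keep records whose ID (first word of the header) is listed.
filtered=$(awk '
    function emit(   i) {
        for (i = 1; i <= n; i++) printf ">%s\n%s", hdr[i], seq
        n = 0
    }
    /^>/ { emit(); n = split(substr($0, 2), hdr, / >/); seq = ""; next }
    { seq = seq $0 "\n" }
    END { emit() }' "$work/nr.fa" \
    | awk 'NR == FNR { keep[$1] = 1; next }
        /^>/ { id = substr($1, 2); show = (id in keep) }
        show' "$work/acc.txt" -)

echo "$filtered"
rm -rf "$work"
```

Here ACC1.1 is dropped while ACC2.1 keeps its copy of the shared sequence, which is why the unfolded output can be filtered per accession.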

Directly from pre-formatted blastdb

# time: 5h20min
blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz

# counting sequences
#
# Note that the headers of the FASTA output by blastdbcmd are "folded"
# for accessions from different species with the same sequence, so the
# number may be smaller than $(wc -l $id.acc.txt).
pigz -dc nr.$id.fa.gz | grep '^>' -c
# 1577383

# counting accessions
#
# ls -lh nr.$id.fa.gz
# -rw-r--r-- 1 shenwei shenwei 2.1G 9月  13 03:38 nr.6656.fa.gz
#
pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\n>/g' | grep '^>' -c
# 288415413
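
The difference between the two counts is easier to see on a toy file. This sketch uses invented accessions and counts ">" characters with awk instead of the sed/grep pair above, which avoids relying on GNU sed's handling of "\n" in the replacement.

```shell
#!/bin/sh
# Mock blastdbcmd-style output: the first record carries two accessions.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>ACC1.1 protein X [species one] >ACC2.1 protein X [species two]
MKVLA
>ACC3.1 protein Y [species three]
MAAQ
EOF

# Number of FASTA records (header lines).
n_seqs=$(grep -c '^>' "$fasta")

# Number of accessions: count every ">" on the header lines.
n_accs=$(grep '^>' "$fasta" | awk '{ n += gsub(/>/, ">") } END { print n + 0 }')

echo "sequences: $n_seqs, accessions: $n_accs"
rm -f "$fasta"
```

Like the original command, this assumes that ">" never appears inside a description.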

makeblastdb

pigz -dc nr.$id.fa.gz > nr.$id.fa

# time: 3min (nr.$id.fa from step 3 option 1)
#
# building nr.$id.fa from step 3 option 2 with -parse_seqids would produce an error:
#
#     BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1
#
makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id

# rm nr.$id.fa

blastp (optional)

# blastdb nr.$id is built from sequences in step 3 option 1
#
blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast
# real    0m20.866s

# $ cat t4.fa.blast | grep Query= -A 10
# Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a
#
# Length=35
#                                                                    Score     E
# Sequences producing significant alignments:                          (Bits)  Value

# 2MPQ_A  Chain A, Solution structure of the sodium channel toxin Hd1a  72.4    2e-17
# A0A0J9X1W9.2  RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-...  72.4    2e-17
# ADB56726.1  HNTX-IV.2 precursor [Haplopelma hainanum]                 66.6    9e-15
# D2Y233.1  RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H...  66.6    9e-15
# ADB56830.1  HNTX-IV.3 precursor [Haplopelma hainanum]                 66.6    9e-15

Summaries of taxonomy dataLink

You can change the TaxId of interest.

Rank counts of common categories.

$ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \
    | rush -D ' ' -T b \
        'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \
            | sed 1d \
            | taxonkit filter -E genus -L genus \
            | taxonkit lineage -L -r \
            | csvtk freq -H -t -f 2 -nr \
            > stats.{}.tsv '

$ csvtk -t join --outer-join stats.*.tsv \
    | csvtk add-header -t -n "rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )" \
    | csvtk csv2md -t

Similar data on NCBI Taxonomy

rank	Archaea	Bacteria	Eukaryota	Fungi	Metazoa	Viridiplantae
species	12482	460940	1349648	156908	957297	191026
strain	354	40643	3486	2352	33	50
genus	205	4112	90882	6844	64148	16202
isolate	7	503	809	76	17	3
species group	2	77	251	22	214	5
serotype		218
serogroup		136
subsection			21			21
subspecies		632	24523	158	17043	7212
forma specialis		521	220	179	33	1
species subgroup		23	101		101
biotype		7	10
morph			12	3	4	5
section			437	37	2	398
genotype			12			12
series			9		5	4
varietas		25	8499	1100	2	7188
forma		4	560	185	6	315
subgenus		1	1558	10	1414	112
pathogroup		5
subvariety			5			5

Count of all ranks

$ time taxonkit list --ids 1 \
    | taxonkit lineage -L -r \
    | csvtk freq -H -t -f 2 -nr \
    | csvtk pretty -H -t

species            1879659
no rank            222743
genus              96625
strain             44483
subspecies         25174
family             9492
varietas           8524
subfamily          3050
tribe              2213
order              1660
subgenus           1618
isolate            1319
serotype           1216
clade              886
superfamily        865
forma specialis    741
forma              564
subtribe           508
section            437
class              429
suborder           372
species group      330
phylum             272
subclass           156
serogroup          138
infraorder         130
species subgroup   124
superorder         55
subphylum          33
parvorder          26
subsection         21
genotype           20
infraclass         18
biotype            17
morph              12
kingdom            11
series             9
superclass         6
cohort             5
pathogroup         5
subvariety         5
superkingdom       4
subcohort          3
subkingdom         1
superphylum        1

real    0m3.663s
user    0m15.897s
sys     0m1.010s
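
For readers without csvtk, the frequency step can be approximated with awk and sort. The sketch below assumes two tab-separated columns (taxid, rank), which is the shape the pipeline above feeds into csvtk freq -f 2; the taxids are invented.

```shell
#!/bin/sh
# Mock "taxid<TAB>rank" rows (the taxids are invented).
ranks=$(mktemp)
printf '%s\t%s\n' \
    101 species 102 species 103 species \
    201 genus 202 genus \
    301 family > "$ranks"

# Count rows per rank (column 2), then sort by count, largest first.
rank_counts=$(awk -F '\t' '{ c[$2]++ } END { for (r in c) print r "\t" c[r] }' "$ranks" \
    | sort -t "$(printf '\t')" -k2,2nr)

echo "$rank_counts"
rm -f "$ranks"
```

The order of ranks with equal counts is not defined in this sketch, so the mock data uses distinct counts.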

Ranks of taxa at or below species.

$ taxonkit list --ids 1 \
    | taxonkit filter --lower-than species --equal-to species \
    | taxonkit lineage -L -r  \
    | csvtk freq -Ht -nr -f 2 \
    | csvtk add-header -t -n rank,count \
    | csvtk pretty -t

rank              count
---------------   -------
species           1880044
no rank           222756
strain            44483
subspecies        25171
varietas          8524
isolate           1319
serotype          1216
clade             885
forma specialis   741
forma             564
serogroup         138
genotype          20
biotype           17
morph             12
pathogroup        5
subvariety        5

Merging GTDB and NCBI taxonomyLink

Sometimes (1) one needs to build a database that combines bacteria and archaea from GTDB with viruses from NCBI. The idea is to export lineages from both GTDB and NCBI using taxonkit reformat2, and then create taxdump files from them with taxonkit create-taxdump.

Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump.

taxonkit list --data-dir gtdb-taxdump/R226/ --ids 1 --indent "" \
    | taxonkit filter --data-dir gtdb-taxdump/R226/ --equal-to species \
    | taxonkit reformat2 --data-dir gtdb-taxdump/R226/ --taxid-field 1 \
        --format "{domain|acellular root|superkingdom}\t{phylum}\t{class}\t{order}\t{family}\t{genus}\t{species}\t{strain|subspecies|no rank}" \
        -o gtdb.tsv

Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. Taxa below species whose rank is "no rank" are treated as taxa of strain rank (--pseudo-strain, taxonkit v0.14.1 needed).

# taxid of Viruses: 10239
taxonkit list --data-dir ~/.taxonkit --ids 10239 --indent "" \
    | taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \
    | taxonkit reformat2 --data-dir ~/.taxonkit --taxid-field 1 \
        --format "{domain|acellular root|superkingdom}\t{phylum}\t{class}\t{order}\t{family}\t{genus}\t{species}\t{strain|subspecies|no rank}" \
        -o ncbi-viral.tsv

Creating taxdump from lineages above.

cat gtdb.tsv ncbi-viral.tsv \
    | taxonkit create-taxdump \
        --field-accession 1 \
        -R "superkingdom,phylum,class,order,family,genus,species,strain" \
        -O taxdump

# we use --field-accession  1 to output the mapping file between old taxids and new ones.
$ grep 2697049  taxdump/taxid.map  # SARS-COV-2
2697049 21630522
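
One possible use of the mapping file is translating a list of old taxids into the new ones. The sketch below assumes taxid.map has two tab-separated columns (old identifier, new taxid); the first row repeats the grep result above, while 9999999 and the file old.taxids.txt are invented to show an unmapped entry.

```shell
#!/bin/sh
# Assumed layout of taxid.map: old identifier <TAB> new taxid.
# 9999999 is an invented taxid with no entry in the map.
work=$(mktemp -d)
printf '2697049\t21630522\n' > "$work/taxid.map"
printf '%s\n' 2697049 9999999 > "$work/old.taxids.txt"

# Look up each old taxid; print "NA" when it has no entry in the map.
translated=$(awk -F '\t' 'NR == FNR { new[$1] = $2; next }
    { print $1 "\t" (($1 in new) ? new[$1] : "NA") }' \
    "$work/taxid.map" "$work/old.taxids.txt")

echo "$translated"
rm -rf "$work"
```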

Some tests:

# SARS-COV-2 in NCBI taxonomy
$ echo 2697049 \
    | taxonkit lineage -t --data-dir ~/.taxonkit \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
10239     superkingdom   Viruses
2559587   clade          Riboviria
2732396   kingdom        Orthornavirae
2732408   phylum         Pisuviricota
2732506   class          Pisoniviricetes
76804     order          Nidovirales
2499399   suborder       Cornidovirineae
11118     family         Coronaviridae
2501931   subfamily      Orthocoronavirinae
694002    genus          Betacoronavirus
2509511   subgenus       Sarbecovirus
694009    species        Severe acute respiratory syndrome-related coronavirus
2697049   no rank        Severe acute respiratory syndrome coronavirus 2

$ echo "Severe acute respiratory syndrome coronavirus 2" | taxonkit name2taxid --data-dir taxdump/
Severe acute respiratory syndrome coronavirus 2 192491219

$ echo 192491219 \
    | taxonkit lineage -t --data-dir taxdump/ \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L --data-dir taxdump/ \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
1088277216   superkingdom   Viruses                                        
38781089     phylum         Pisuviricota                                   
1832208221   class          Pisoniviricetes                                
1393610206   order          Nidovirales                                    
779314330    family         Coronaviridae                                  
68549826     genus          Betacoronavirus                                
341128742    species        Betacoronavirus pandemicum                     
192491219    strain         Severe acute respiratory syndrome coronavirus 2



$ echo "Escherichia coli"  | taxonkit name2taxid --data-dir taxdump/
Escherichia coli        599451526

$ echo 599451526 \
    | taxonkit lineage -t --data-dir taxdump/ \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L --data-dir taxdump/ \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
81602897     superkingdom   Bacteria           
1712663402   phylum         Pseudomonadota     
1969409366   class          Gammaproteobacteria
1851777887   order          Enterobacterales   
1691888815   family         Enterobacteriaceae 
1028471294   genus          Escherichia        
599451526    species        Escherichia coli

Filtering or subsetting taxdmp files to make a custom taxdmp with given TaxIDsLink

You want to create a smaller version of the official NCBI taxonomy taxdmp filtered or subset to just the lineages of certain species, for purposes such as creating small test data for testing of tools using taxdmp files.

https://github.com/shenwei356/taxonkit/issues/112

Step 1: preparing taxids in the subset tree

# here, only keep nodes at the rank of species
taxonkit list --ids 707,9606 -I "" \
    | taxonkit filter -E species \
    | taxonkit lineage -t \
    | cut -f 3 \
    | sed -s 's/;/\n/g' \
    > taxids.txt

# the root node
echo 1 >> taxids.txt

Step 2: extracting data of needed nodes

mkdir subset

grep -w -f <(awk '{print "^"$1}' taxids.txt) ~/.taxonkit/nodes.dmp > subset/nodes.dmp
grep -w -f <(awk '{print "^"$1}' taxids.txt) ~/.taxonkit/names.dmp > subset/names.dmp

touch subset/delnodes.dmp subset/merged.dmp
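
The grep call relies on two details: the "^" prefix ties each pattern to the first field, and -w rejects longer taxids that merely start with the same digits. The sketch below shows this on mock rows; every taxid except 9606 and its parent 9605 is invented, and a temporary pattern file replaces the process substitution so that it runs under plain sh.

```shell
#!/bin/sh
# Mock nodes.dmp rows; only 9606 is wanted. The other rows are near-misses:
# a longer taxid, a taxid ending in 9606, and a row with 9606 as parent.
work=$(mktemp -d)
printf '%s\t|\t%s\t|\t%s\t|\n' \
    9606 9605 species \
    96061 9605 species \
    19606 9605 species \
    777 9606 subspecies > "$work/nodes.dmp"
echo 9606 > "$work/taxids.txt"

# "^9606" anchors the match to the start of the line; -w rejects "96061".
awk '{ print "^" $1 }' "$work/taxids.txt" > "$work/patterns.txt"
kept=$(grep -w -f "$work/patterns.txt" "$work/nodes.dmp")

echo "$kept"
rm -rf "$work"
```

Only the row whose first field is exactly 9606 survives, which is the behaviour Step 2 needs for both nodes.dmp and names.dmp.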

Checking it. Since there are only two leaves here, we just dump the whole tree

$ wc -l subset/*.dmp
   0 subset/delnodes.dmp
   0 subset/merged.dmp
 146 subset/names.dmp
  40 subset/nodes.dmp
 186 total

$ taxonkit list --ids 1 --data-dir subset/ -nr
1 [no rank] root
  131567 [cellular root] cellular organisms
    2 [domain] Bacteria
      3379134 [kingdom] Pseudomonadati
        1224 [phylum] Pseudomonadota
          1236 [class] Gammaproteobacteria
            135623 [order] Vibrionales
              641 [family] Vibrionaceae
                662 [genus] Vibrio
                  28174 [species] Vibrio ordalii
    2759 [domain] Eukaryota
      33154 [clade] Opisthokonta
        33208 [kingdom] Metazoa
          6072 [clade] Eumetazoa
            33213 [clade] Bilateria
              33511 [clade] Deuterostomia
                7711 [phylum] Chordata
                  89593 [subphylum] Craniata
                    7742 [clade] Vertebrata
                      7776 [clade] Gnathostomata
                        117570 [clade] Teleostomi
                          117571 [clade] Euteleostomi
                            8287 [superclass] Sarcopterygii
                              1338369 [clade] Dipnotetrapodomorpha
                                32523 [clade] Tetrapoda
                                  32524 [clade] Amniota
                                    40674 [class] Mammalia
                                      32525 [clade] Theria
                                        9347 [clade] Eutheria
                                          1437010 [clade] Boreoeutheria
                                            314146 [superorder] Euarchontoglires
                                              9443 [order] Primates
                                                376913 [suborder] Haplorrhini
                                                  314293 [infraorder] Simiiformes
                                                    9526 [parvorder] Catarrhini
                                                      314295 [superfamily] Hominoidea
                                                        9604 [family] Hominidae
                                                          207598 [subfamily] Homininae
                                                            9605 [genus] Homo
                                                              9606 [species] Homo sapiens


$ echo 28174 | taxonkit lineage -nr --data-dir subset/
28174   cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Vibrionales;Vibrionaceae;Vibrio;Vibrio ordalii       Vibrio ordalii  species