Tutorial
Table of Contents
- Formatting lineage
- Parsing kraken/bracken result
- Making nr blastdb for specific taxids
- Summaries of taxonomy data
- Merging GTDB and NCBI taxonomy
- Filtering or subsetting taxdmp files to make a custom taxdmp with given TaxIDs
Formatting lineage
Show the lineage detail of a TaxId. With the help of csvtk, the command below also works on Windows.
$ echo 2697049 \
| taxonkit lineage -t \
| csvtk cut -Ht -f 3 \
| csvtk unfold -Ht -f 1 -s ";" \
| taxonkit lineage -r -n -L \
| csvtk cut -Ht -f 1,3,2 \
| csvtk pretty -Ht
10239 acellular root Viruses
2559587 realm Riboviria
2732396 kingdom Orthornavirae
2732408 phylum Pisuviricota
2732506 class Pisoniviricetes
76804 order Nidovirales
2499399 suborder Cornidovirineae
11118 family Coronaviridae
2501931 subfamily Orthocoronavirinae
694002 genus Betacoronavirus
2509511 subgenus Sarbecovirus
3418604 species Betacoronavirus pandemicum
2697049 no rank Severe acute respiratory syndrome coronavirus 2
Example data.
$ cat taxids3.txt
376619
349741
239935
314101
11932
1327037
83333
1408252
2605619
2697049
Format to 7-level ranks ("superkingdom phylum class order family genus species").
$ cat taxids3.txt \
| taxonkit reformat2 -I 1
376619 Bacteria;Pseudomonadota;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis
349741 Bacteria;Verrucomicrobiota;Verrucomicrobiia;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
239935 Bacteria;Verrucomicrobiota;Verrucomicrobiia;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B
11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle
1327037 Viruses;Uroviricota;Caudoviricetes;;;;Croceibacter phage P2559Y
83333 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
1408252 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
2605619 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli
2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Betacoronavirus pandemicum
Format to 8-level ranks ("superkingdom phylum class order family genus species subspecies/rank").
$ cat taxids3.txt \
| taxonkit reformat2 -I 1 -f "{domain|acellular root|superkingdom};{phylum};{class};{order};{family};{genus};{species};{strain|subspecies|no rank}"
376619 Bacteria;Pseudomonadota;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS
349741 Bacteria;Verrucomicrobiota;Verrucomicrobiia;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
239935 Bacteria;Verrucomicrobiota;Verrucomicrobiia;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;
314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B;environmental samples
11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle;unclassified Retroviridae
1327037 Viruses;Uroviricota;Caudoviricetes;;;;Croceibacter phage P2559Y;unclassified Caudoviricetes
83333 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12
1408252 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178
2605619 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli O16:H48
2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Betacoronavirus pandemicum;Severe acute respiratory syndrome coronavirus 2
Replace missing ranks with Unassigned and output in tab-delimited format.
(Warning: for NCBI taxonomy data since March 2025, reformat can't handle Bacteria's rank domain and Viruses' rank acellular root simultaneously.)
$ cat taxids3.txt \
| taxonkit reformat2 -I 1 -r "Unassigned" -f "{d}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
| csvtk pretty -H -t
376619 Bacteria Pseudomonadota Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis Francisella tularensis subsp. holarctica LVS
349741 Bacteria Verrucomicrobiota Verrucomicrobiia Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835
239935 Bacteria Verrucomicrobiota Verrucomicrobiia Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Unassigned
314101 Bacteria Unassigned Unassigned Unassigned Unassigned Unassigned uncultured murine large bowel bacterium BAC 54B Unassigned
11932 Unassigned Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle Unassigned
1327037 Unassigned Uroviricota Caudoviricetes Unassigned Unassigned Unassigned Croceibacter phage P2559Y Unassigned
83333 Bacteria Pseudomonadota Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli K-12
1408252 Bacteria Pseudomonadota Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli R178
2605619 Bacteria Pseudomonadota Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Unassigned
2697049 Unassigned Pisuviricota Pisoniviricetes Nidovirales Coronaviridae Betacoronavirus Betacoronavirus pandemicum Unassigned
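The placeholder behaviour can be illustrated without a taxonomy database. The following is a minimal sketch, assuming the input has a TaxId in column 1 and a semicolon-delimited lineage in column 2 (as in the 7-level output shown earlier); `fill_missing_ranks` is a hypothetical helper written for this illustration, not a taxonkit command.

```shell
# Hypothetical helper (not part of taxonkit): replace empty ranks in a
# semicolon-delimited lineage (column 2) with a placeholder and emit tabs.
fill_missing_ranks() {
    awk -F '\t' -v OFS='\t' -v fill="${1:-Unassigned}" '{
        n = split($2, ranks, ";")          # empty ranks become empty strings
        out = $1
        for (i = 1; i <= n; i++) {
            out = out OFS (ranks[i] == "" ? fill : ranks[i])
        }
        print out
    }'
}

# One line taken from the 7-level example above.
printf '314101\tBacteria;;;;;;uncultured murine large bowel bacterium BAC 54B\n' \
    | fill_missing_ranks Unassigned
```

The helper only post-processes text, so it cannot tell a rank that is truly absent from one that the format string failed to match.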
Fill missing ranks and add prefixes.
(Warning: for NCBI taxonomy data since March 2025, reformat can't handle Bacteria's rank domain and Viruses' rank acellular root simultaneously.)
$ cat taxids3.txt \
| taxonkit reformat -I 1 -F -P -f "{d}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
| csvtk pretty -H -t
376619 d__Bacteria p__Pseudomonadota c__Gammaproteobacteria o__Thiotrichales f__Francisellaceae g__Francisella s__Francisella tularensis t__Francisella tularensis subsp. holarctica LVS
349741 d__Bacteria p__Verrucomicrobiota c__Verrucomicrobiia o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__Akkermansia muciniphila ATCC BAA-835
239935 d__Bacteria p__Verrucomicrobiota c__Verrucomicrobiia o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__unclassified Akkermansia muciniphila subspecies/strain
314101 d__Bacteria p__unclassified Bacteria phylum c__unclassified Bacteria class o__unclassified Bacteria order f__unclassified Bacteria family g__unclassified Bacteria genus s__uncultured murine large bowel bacterium BAC 54B t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain
11932 d__unclassified Viruses domain p__Artverviricota c__Revtraviricetes o__Ortervirales f__Retroviridae g__Intracisternal A-particles s__Mouse Intracisternal A-particle t__unclassified Mouse Intracisternal A-particle subspecies/strain
1327037 d__unclassified Viruses domain p__Uroviricota c__Caudoviricetes o__unclassified Caudoviricetes order f__unclassified Caudoviricetes family g__unclassified Caudoviricetes genus s__Croceibacter phage P2559Y t__unclassified Croceibacter phage P2559Y subspecies/strain
83333 d__Bacteria p__Pseudomonadota c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli K-12
1408252 d__Bacteria p__Pseudomonadota c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli R178
2605619 d__Bacteria p__Pseudomonadota c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__unclassified Escherichia coli subspecies/strain
2697049 d__unclassified Viruses domain p__Pisuviricota c__Pisoniviricetes o__Nidovirales f__Coronaviridae g__Betacoronavirus s__Betacoronavirus pandemicum t__unclassified Betacoronavirus pandemicum subspecies/strain
When there are no nodes of rank "subspecies" or "strain", we can switch on -S/--pseudo-strain to use the node with the lowest rank as the subspecies/strain name, provided its rank is lower than "species".
(Warning: for NCBI taxonomy data since March 2025, reformat can't handle Bacteria's rank domain and Viruses' rank acellular root simultaneously.)
$ cat taxids3.txt \
| taxonkit lineage -r -L \
| taxonkit reformat -I 1 -F -S -f "{d}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
| cut -f 1,2,9,10 \
| csvtk add-header -t -n "taxid,rank,species,strain" \
| csvtk pretty -t
taxid rank species strain
------- ---------- ----------------------------------------------- ------------------------------------------------------------------------------
376619 strain Francisella tularensis Francisella tularensis subsp. holarctica LVS
349741 strain Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835
239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain
314101 species uncultured murine large bowel bacterium BAC 54B unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain
11932 species Mouse Intracisternal A-particle unclassified Mouse Intracisternal A-particle subspecies/strain
1327037 species Croceibacter phage P2559Y unclassified Croceibacter phage P2559Y subspecies/strain
83333 strain Escherichia coli Escherichia coli K-12
1408252 subspecies Escherichia coli Escherichia coli R178
2605619 no rank Escherichia coli Escherichia coli O16:H48
2697049 no rank Betacoronavirus pandemicum Severe acute respiratory syndrome coronavirus 2
List the eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with "no rank".
When filtering with -L/--lower-than, you can use -n/--save-predictable-norank to keep some special ranks without order, for which the rank of the closest higher node is still lower than the rank cutoff.
$ time taxonkit list --ids 1 -I "" \
| taxonkit filter -L species -E species -R -N -n \
| taxonkit lineage -n -r \
| taxonkit reformat2 -I 1 -f "{domain|acellular root|superkingdom}\t{phylum}\t{class}\t{order}\t{family}\t{genus}\t{species}\t{strain|subspecies|no rank}" \
| csvtk cut -Ht -l -f 1,4,3,2,5-12 \
| csvtk add-header -t -n "taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain" \
| pigz -c > result.tsv.gz
real 0m9.778s
user 1m22.211s
sys 0m8.489s
$ pigz -cd result.tsv.gz \
| csvtk grep -t -f taxid -p 2697049 \
| csvtk transpose -t \
| csvtk pretty -H -t -W 70 -x ';' -S round
╭---------┬------------------------------------------------------------------------╮
| taxid | 2697049 |
├---------┼------------------------------------------------------------------------┤
| rank | no rank |
├---------┼------------------------------------------------------------------------┤
| name | Severe acute respiratory syndrome coronavirus 2 |
├---------┼------------------------------------------------------------------------┤
| lineage | Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes; |
| | Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae; |
| | Betacoronavirus;Sarbecovirus;Betacoronavirus pandemicum; |
| | Severe acute respiratory syndrome coronavirus 2 |
├---------┼------------------------------------------------------------------------┤
| kingdom | Viruses |
├---------┼------------------------------------------------------------------------┤
| phylum | Pisuviricota |
├---------┼------------------------------------------------------------------------┤
| class | Pisoniviricetes |
├---------┼------------------------------------------------------------------------┤
| order | Nidovirales |
├---------┼------------------------------------------------------------------------┤
| family | Coronaviridae |
├---------┼------------------------------------------------------------------------┤
| genus | Betacoronavirus |
├---------┼------------------------------------------------------------------------┤
| species | Betacoronavirus pandemicum |
├---------┼------------------------------------------------------------------------┤
| strain | Severe acute respiratory syndrome coronavirus 2 |
╰---------┴------------------------------------------------------------------------╯
Mapping old species names to new ones
Some species names in papers or on websites might have changed. We can try querying their TaxIds via the old names and then retrieving the new ones.
cat example/changed_species_names.txt
Lactobacillus fermentum
Mycoplasma gallinaceum
# TaxonKit >= v0.15.1
cat example/changed_species_names.txt \
| taxonkit name2taxid \
| taxonkit lineage -i 2 -n \
| cut -f 1,4
Lactobacillus fermentum Limosilactobacillus fermentum
Mycoplasma gallinaceum
Whoops, there's no information for Mycoplasma gallinaceum.
Then we check the taxid-changelog.
zcat taxonkit/taxid-changelog.csv.gz \
| csvtk grep -f name -P example/changed_species_names.txt \
| csvtk cut -f taxid,version,change,name,rank \
| csvtk pretty
taxid version change name rank
----- ---------- -------------- ----------------------- -------
1613 2013-02-21 NEW Lactobacillus fermentum species
1613 2016-03-01 ABSORB Lactobacillus fermentum species
1613 2016-03-01 CHANGE_LIN_LEN Lactobacillus fermentum species
29556 2013-02-21 NEW Mycoplasma gallinaceum species
29556 2016-03-01 CHANGE_LIN_LEN Mycoplasma gallinaceum species
29556 2021-01-01 CHANGE_NAME Mycoplasma gallinaceum species
29556 2021-01-01 CHANGE_LIN_LIN Mycoplasma gallinaceum species
We can see that the names were changed. The full change history can be queried with the TaxId, e.g.,
taxid version change change-value name rank
----- ---------- -------------- ------------ ------------------------- -------
29556 2013-02-21 NEW Mycoplasma gallinaceum species
29556 2016-03-01 CHANGE_LIN_LEN Mycoplasma gallinaceum species
29556 2020-09-01 CHANGE_NAME Mycoplasmopsis gallinacea species
29556 2020-09-01 CHANGE_LIN_TAX Mycoplasmopsis gallinacea species
29556 2021-01-01 CHANGE_NAME Mycoplasma gallinaceum species
29556 2021-01-01 CHANGE_LIN_LIN Mycoplasma gallinaceum species
29556 2021-09-01 CHANGE_NAME Mycoplasmopsis gallinacea species
29556 2021-09-01 CHANGE_LIN_LIN Mycoplasmopsis gallinacea species
29556 2023-03-01 CHANGE_LIN_LIN Mycoplasmopsis gallinacea species
Then we just use their TaxIds to retrieve the new names. The final commands are:
zcat taxonkit/taxid-changelog.csv.gz \
| csvtk grep -f name -P example/changed_species_names.txt \
| csvtk uniq -f taxid \
| csvtk cut -f name,taxid \
| csvtk del-header \
| csvtk csv2tab \
| taxonkit lineage -i 2 -n \
| cut -f 1,4
Lactobacillus fermentum Limosilactobacillus fermentum
Mycoplasma gallinaceum Mycoplasmopsis gallinacea
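Since the changelog records a name at every version, the most recent name of a TaxId can also be read off the changelog rows themselves. Below is a minimal sketch under strong assumptions: the input is a simplified three-column CSV (`taxid,version,name`) without a header, names contain no commas, and versions are ISO dates. `latest_names` is a hypothetical helper, and the sample rows are copied from the table above.

```shell
# Hypothetical helper: from "taxid,version,name" rows on stdin, print the
# name recorded at the latest version of each TaxId (taxid<TAB>name).
latest_names() {
    LC_ALL=C sort -t ',' -k1,1 -k2,2 \
        | awk -F ',' -v OFS='\t' '
            { last[$1] = $3 }              # later versions overwrite earlier ones
            END { for (t in last) print t, last[t] }' \
        | LC_ALL=C sort
}

printf '%s\n' \
    '29556,2013-02-21,Mycoplasma gallinaceum' \
    '29556,2023-03-01,Mycoplasmopsis gallinacea' \
    '29556,2021-01-01,Mycoplasma gallinaceum' \
    | latest_names
```

Querying the current taxonomy with taxonkit lineage, as in the final commands above, remains the more robust route, because the real changelog has more columns and quoted fields.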
Add taxonomy information to BLAST result
A BLAST result file, blast_result.txt, where the second column is the accession of matched sequences.
head -n 5 blast_result.txt | csvtk pretty -Ht
xxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.745 494 99 3 6361 6851 895 1385 6.53e-83 326
xxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.543 494 100 3 17168 17658 895 1385 3.04e-81 320
xxxxxxxxxxxxxxxxxxxxx/76/ccs LR699760.1 100.000 37 0 0 8139 8175 14507874 14507910 4.27e-06 69.4
xxxxxxxxxxxxxxxxxxxxx/80/ccs HG994975.1 80.556 540 81 16 8269 8798 3821290 3820765 8.65e-104 394
xxxxxxxxxxxxxxxxxxxxx/80/ccs HG994975.1 77.805 410 89 2 9590 9998 3819858 3819450 5.51e-61 252
Prepare the acc2taxid.tsv.gz file from the nucl_gb.accession2taxid.gz file.
Here we use the accession column instead of the accession.version column, in case of unmatched versions for some accessions.
zcat nucl_gb.accession2taxid.gz | cut -f 1,3 | gzip -c > acc2taxid.tsv.gz
Extract needed acc2taxid subset to reduce memory usage.
# extract accession and deduplicate and remove versions
cut -f 2 blast_result.txt | csvtk uniq -Ht | csvtk replace -Ht -p '\.\d+$' > acc.txt
# grep from acc2taxid.tsv.gz
zcat acc2taxid.tsv.gz | grep -w -f acc.txt > hit.acc2taxid.tsv
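The same subsetting idea can be sketched with sed and awk only. This is an alternative illustration, not the method used above: it swaps `grep -w -f` for an exact lookup on the first column. Both function names are hypothetical, and the row for `AB000001` is made-up toy data.

```shell
# Hypothetical helper: strip a trailing ".<version>" from accessions on stdin.
strip_version() {
    sed -E 's/\.[0-9]+$//'
}

# Hypothetical helper: from accession<TAB>taxid rows on stdin, keep the rows
# whose accession (column 1) is listed in the file given as $1.
subset_acc2taxid() {
    awk -F '\t' 'NR == FNR { want[$1] = 1; next } ($1 in want)' "$1" -
}

tmp=$(mktemp)
# Accessions as they appear in column 2 of the BLAST result.
printf 'XM_013496560.1\nLR699760.1\nXM_013496560.1\n' \
    | strip_version | sort -u > "$tmp"
# Toy accession-to-taxid table; only listed accessions are printed.
printf 'XM_013496560\t44415\nAB000001\t9606\nLR699760\t3702\n' \
    | subset_acc2taxid "$tmp"
rm -f "$tmp"
```

An exact key lookup avoids treating accessions as regular expressions, at the cost of holding the wanted accessions in memory.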
Prepare hit.taxid2name.tsv, in which species names are retrieved for the TaxIds.
cut -f 2 hit.acc2taxid.tsv | taxonkit reformat -f '{s}' -I 1 > hit.taxid2name.tsv
Append taxids according to the accessions, and append species names for the taxids.
csvtk add-header -t --names "qseqid,sseqid,pident,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore" blast_result.txt \
| csvtk mutate -t -f sseqid -n taxid \
| csvtk replace -t -k hit.acc2taxid.tsv -f taxid -p '(.+)\.\d+' -r '{kv}' \
| csvtk mutate -t -f taxid -n species \
| csvtk replace -t -k hit.taxid2name.tsv -f species -p '(.+)' -r '{kv}' \
| head -n 5 | csvtk pretty -t
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore taxid species
---------------------------- -------------- ------- ------ -------- ------- ------ ----- -------- -------- --------- -------- ----- --------------------
xxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.745 494 99 3 6361 6851 895 1385 6.53e-83 326 44415 Eimeria mitis
xxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.543 494 100 3 17168 17658 895 1385 3.04e-81 320 44415 Eimeria mitis
xxxxxxxxxxxxxxxxxxxxx/76/ccs LR699760.1 100.000 37 0 0 8139 8175 14507874 14507910 4.27e-06 69.4 3702 Arabidopsis thaliana
xxxxxxxxxxxxxxxxxxxxx/80/ccs HG994975.1 80.556 540 81 16 8269 8798 3821290 3820765 8.65e-104 394 5802 Eimeria tenella
Parsing kraken/bracken result
Example Data
Run Kraken2 and Bracken
KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf
THREADS=16
CLASSIFICATION_LVL=S
THRESHOLD=10
READ_LEN=100
SAMPLE=SRS014459-Stool.fasta.gz
BRACKEN_OUTPUT_FILE=$SAMPLE
kraken2 --db ${KRAKEN_DB} --threads ${THREADS} --report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken
est_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \
-l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken
Original format
$ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport
100.00 9491 0 R 1 root
99.85 9477 0 R1 131567 cellular organisms
99.85 9477 0 D 2 Bacteria
66.08 6271 0 D1 1783270 FCB group
66.08 6271 0 D2 68336 Bacteroidetes/Chlorobi group
66.08 6271 0 P 976 Bacteroidetes
66.08 6271 0 C 200643 Bacteroidia
66.08 6271 0 O 171549 Bacteroidales
34.45 3270 0 F 815 Bacteroidaceae
34.45 3270 0 G 816 Bacteroides
10.43 990 990 S 246787 Bacteroides cellulosilyticus
7.98 757 757 S 28116 Bacteroides ovatus
3.10 293 0 G1 2646097 unclassified Bacteroides
1.06 100 100 S 2755405 Bacteroides sp. CACC 737
0.49 46 46 S 2650157 Bacteroides sp. HF-5287
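To make the column layout concrete, here is a minimal sketch that pulls the species-level rows out of such a report. It assumes the report is tab-delimited with six columns (percentage, clade reads, taxon reads, rank code, TaxId, indented name), as in the excerpt above; `kreport_species` is a hypothetical helper name.

```shell
# Hypothetical helper: list species-level rows (rank code "S") of a
# Kraken-style report as taxid<TAB>percentage<TAB>name.
kreport_species() {
    awk -F '\t' -v OFS='\t' '$4 == "S" {
        name = $6
        sub(/^ +/, "", name)    # names are indented by depth in the report
        pct = $1
        gsub(/ /, "", pct)      # percentages may be padded with spaces
        print $5, pct, name
    }'
}

# Two rows modelled on the excerpt above: one genus, one species.
printf ' 34.45\t3270\t0\tG\t816\t          Bacteroides\n 10.43\t990\t990\tS\t246787\t            Bacteroides cellulosilyticus\n' \
    | kreport_species
```

Only the exact code `S` is selected, so sub-ranks such as `S1` are skipped; adjust the condition if those rows are needed.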
Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py)
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
| csvtk cut -Ht -f 5,1 \
| taxonkit reformat2 -I 1 -f "k__{domain|acellular root|superkingdom}|p__{phylum}|c__{class}|o__{order}|f__{family}|g__{genus}|s__{species}" \
| csvtk cut -Ht -f 3,2 \
| csvtk replace -Ht -p "(\|[kpcofgs]__)+$" \
| csvtk replace -Ht -p "\|([kpcofgs]__\|)+" -r "|" \
| csvtk uniq -Ht \
| csvtk grep -Ht -p k__ -v \
| tee SRS014459-Stool.fasta.gz_bracken_species.kreport.format \
| head -n 10
k__Bacteria 99.85
k__Bacteria|p__Bacteroidota 66.08
k__Bacteria|p__Bacteroidota|c__Bacteroidia 66.08
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales 66.08
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae 34.45
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides 34.45
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus 10.43
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus 7.98
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737 1.06
k__Bacteria|p__Bacteroidota|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides zhangwenhongii 0.49
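The two `csvtk replace` steps are what remove empty rank placeholders: the first trims empty ranks at the end of the lineage, the second collapses empty ranks in the middle. A minimal sed sketch of the same two patterns follows, assuming each input line holds only the lineage string (no abundance column); `drop_empty_ranks` is a hypothetical helper name.

```shell
# Hypothetical helper: drop empty rank placeholders from a MetaPhlAn-style
# lineage string, mirroring the two csvtk replace steps with sed.
drop_empty_ranks() {
    sed -E \
        -e 's/([|][kpcofgs]__)+$//' \
        -e 's/[|]([kpcofgs]__[|])+/|/g'
}

# First line: only the kingdom is known. Second line: phylum is missing
# but class is present, so the gap in the middle is closed.
printf '%s\n' \
    'k__Bacteria|p__|c__|o__|f__|g__|s__' \
    'k__Bacteria|p__|c__Bacteroidia|o__|f__|g__|s__' \
    | drop_empty_ranks
```

Because trimming makes many rows identical to their parent rank, the pipeline above follows these steps with `csvtk uniq`.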
Converting to Qiime format
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
| csvtk cut -Ht -f 5,1 \
| taxonkit reformat2 -I 1 -f "k__{domain|acellular root|superkingdom}; p__{phylum}; c__{class}; o__{order}; f__{family}; g__{genus}; s__{species}" \
| csvtk cut -Ht -f 3,2 \
| csvtk replace -Ht -p "(; [kpcofgs]__)+$" \
| csvtk replace -Ht -p "; ([kpcofgs]__; )+" -r "; " \
| csvtk uniq -Ht \
| csvtk grep -Ht -p k__ -v \
| head -n 10
k__Bacteria 99.85
k__Bacteria; p__Bacteroidota 66.08
k__Bacteria; p__Bacteroidota; c__Bacteroidia 66.08
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales 66.08
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae 34.45
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides 34.45
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus 10.43
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus 7.98
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737 1.06
k__Bacteria; p__Bacteroidota; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides zhangwenhongii 0.49
Save taxon proportion and taxid, and get lineage, name and rank.
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
| csvtk cut -Ht -f 1,5 \
| taxonkit lineage -i 2 -n -r \
| csvtk cut -Ht -f 1,2,5,4,3 \
| head -n 10 \
| csvtk pretty -Ht
100.00 1 no rank root root
99.85 131567 cellular root cellular organisms cellular organisms
99.85 2 domain Bacteria cellular organisms;Bacteria
66.08 1783270 clade FCB group cellular organisms;Bacteria;Pseudomonadati;FCB group
66.08 68336 clade Bacteroidota/Chlorobiota group cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group
66.08 976 phylum Bacteroidota cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota
66.08 200643 class Bacteroidia cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota;Bacteroidia
66.08 171549 order Bacteroidales cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota;Bacteroidia;Bacteroidales
34.45 815 family Bacteroidaceae cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae
34.45 816 genus Bacteroides cellular organisms;Bacteria;Pseudomonadati;FCB group;Bacteroidota/Chlorobiota group;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides
Only save species or lower level and get lineage in format of "superkingdom phylum class order family genus species".
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \
| csvtk cut -Ht -f 1,5 \
| taxonkit filter -N -E species -L species -i 2 \
| taxonkit lineage -i 2 -n -r \
| taxonkit reformat2 -I 2 \
| csvtk cut -Ht -f 1,2,5,4,6 \
| csvtk add-header -t -n abundance,taxid,rank,name,lineage \
| head -n 10 \
| csvtk pretty -t
abundance taxid rank name lineage
--------- ------- ------- ---------------------------- -------------------------------------------------------------------------------------------------------
10.43 246787 species Bacteroides cellulosilyticus Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus
7.98 28116 species Bacteroides ovatus Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus
1.06 2755405 species Bacteroides sp. CACC 737 Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737
0.49 2650157 species Bacteroides zhangwenhongii Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides zhangwenhongii
0.99 2528203 species Bacteroides sp. A1C1 Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1
0.28 2763022 species Bacteroides sp. M10 Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10
0.16 2650158 species Bacteroides luhongzhouii Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides luhongzhouii
0.12 2715212 species Bacteroides faecium Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides faecium
5.10 817 species Bacteroides fragilis Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis
Making nr blastdb for specific taxids
Attention:
- (2023-11-27) BLAST+ 2.15.0 supports limiting a search to a group of organisms without first using a custom script to get all species-level Taxonomy IDs (taxids) for the group.
  E.g., a search of the nr BLAST database limited to Bacteria (taxID 2):
      blastp -db nr -taxids 2 -query ...
- (2019) BLAST+ 2.8.1 was released with new databases, which allow you to limit your search by taxonomy using information built into the BLAST databases. So you don't need to build a blastdb for specific taxids now.
Changes:
- 2018-09-13 rewritten
- 2018-12-22 providing faster method for step 3.1
- 2019-01-07 add note of new blastdb version
- 2020-10-14 updated steps for a huge number of accessions belonging to a high taxon level such as Bacteria.
Data:
- pre-formatted blastdb (09/10/2018)
- prot.accession2taxid.gz (09/07/2018) (optional, but recommended)
Hardware in this tutorial
- CPU: AMD 8-cores/16-threads 3.7 GHz
- RAM: 64 GB
- DISK:
- Taxonomy files stored on an NVMe SSD
- blastdb files stored on a 7200 rpm HDD
Tools:
- blast+
- pigz (recommended, faster than gzip)
- taxonkit
- seqkit (recommended), version >= 0.14.0
- rush (optional, for parallelizing sequence filtering)
Steps:
- Listing all taxids below $id using taxonkit.

    id=6656

    # 6656 is the phylum Arthropoda
    # echo 6656 | taxonkit lineage | taxonkit reformat
    # 6656 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda Eukaryota;Arthropoda;;;;;

    # 2 bacteria
    # 2157 archaea
    # 4751 fungi
    # 10239 virus

    # time: 2s
    taxonkit list --ids $id --indent "" > $id.taxid.txt

    # taxonkit list --ids 2,4751,10239 --indent "" > microbe.taxid.txt

    wc -l $id.taxid.txt
    # 518373 6656.taxid.txt
- Retrieving target accessions. There are two options:

    - From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in nr.

        # time: 4min
        pigz -dc prot.accession2taxid.gz \
            | csvtk grep -t -f taxid -P $id.taxid.txt \
            | csvtk cut -t -f accession.version,taxid \
            | sed 1d \
            > $id.acc2taxid.txt

        cut -f 1 $id.acc2taxid.txt > $id.acc.txt

        wc -l $id.acc.txt
        # 8174609 6656.acc.txt

    - From pre-formatted nr blastdb

        # time: 40min
        blastdbcmd -db nr -entry all -outfmt "%a %T" | pigz -c > nr.acc2taxid.txt.gz

        pigz -dc nr.acc2taxid.txt.gz | wc -l
        # 555220892

        # time: 3min
        pigz -dc nr.acc2taxid.txt.gz \
            | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \
            | cut -d ' ' -f 1 \
            > $id.acc.txt

        wc -l $id.acc.txt
        # 6928021 6656.acc.txt
- Retrieving FASTA sequences from pre-formatted blastdb. There are two options:

    - From nr.fa exported from pre-formatted blastdb (faster, smaller output file, recommended). DO NOT directly download nr.gz from the NCBI FTP site, in which the FASTA headers are not well formatted.

        # 1. exporting nr.fa from pre-formatted blastdb

        # time: 117min (run only once)
        blastdbcmd -db nr -dbtype prot -entry all -outfmt "%f" -out - | pigz -c > nr.fa.gz

        # =====================================================================

        # 2. filtering sequences belonging to $id

        # ---------------------------------------------------------------------

        # method 1) (for cases where $id.acc.txt is not very huge)
        # time: 80min
        # perl one-liner is used to unfold records having multiple accessions
        time cat <(echo) <(pigz -dc nr.fa.gz) \
            | perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//; $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } ' \
            | seqkit grep -f $id.acc.txt -o nr.$id.fa.gz

        # ---------------------------------------------------------------------

        # method 2) (**faster**)

        # 33min (run only once)
        # (1). split nr.fa.gz.
        # Note: I have 16 cpus.
        $ time seqkit split2 -p 15 nr.fa.gz

        # (2). parallelize unfolding
        $ cat _unfold_blastdb_fa.sh
        #!/bin/sh
        perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//; $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } '

        # 10 min
        time ls nr.fa.gz.split/nr.part_*.fa.gz \
            | rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \
                | ./_unfold_blastdb_fa.sh \
                | seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\.(.+)$} '

        # (3). merge result
        cat nr.$id.part*.fa.gz > nr.$id.fa.gz
        rm nr.$id.part*.fa.gz

        # ---------------------------------------------------------------------

        # method 3) (for huge $id.acc.txt file, e.g., bacteria)

        # (1). split ${id}.acc.txt into several parts. chunk size depends on lines and RAM (64G for me).
        split -d -l 300000000 $id.acc.txt $id.acc.txt.part_

        # (2). filter
        time ls $id.acc.txt.part_* \
            | rush -j 1 --immediate-output -v id=$id \
                'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \
                | ./_unfold_blastdb_fa.sh \
                | seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz '

        # (3). merge
        cat nr.$id.part*.fa.gz > nr.$id.fa.gz

        # clean
        rm nr.$id.part*.fa.gz
        rm $id.acc.txt.part_*

        # (4). optionally adding taxid, you may edit replacement (-r) below
        # split
        time split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_

        ln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz
        i=0
        for f in $id.acc2taxid.txt.part_* ; do
            echo $f
            time pigz -cd nr.$id.with-taxid.part$i.fa.gz \
                | seqkit replace -k $f -p "^([^\-]+?) " -r "{kv}-\$1 " -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz;
            /bin/rm nr.$id.with-taxid.part$i.fa.gz
            i=$(($i+1));
        done
        mv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz

        # =====================================================================

        # 3. counting sequences
        #
        # ls -lh nr.$id.fa.gz
        # -rw-r--r-- 1 shenwei shenwei 902M 9月 13 01:42 nr.6656.fa.gz
        #
        # pigz -dc nr.$id.fa.gz | grep '^>' -c
        # 6928017
        # Here 6928017 ~= 6928021 ($id.acc.txt)
    - Directly from pre-formatted blastdb

        # time: 5h20min
        blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz

        # counting sequences
        #
        # Note that the headers of the FASTA output by blastdbcmd are "folded"
        # for accessions from different species with the same sequence, so the
        # number may be smaller than $(wc -l $id.acc.txt).
        pigz -dc nr.$id.fa.gz | grep '^>' -c
        # 1577383

        # counting accessions
        #
        # ls -lh nr.$id.fa.gz
        # -rw-r--r-- 1 shenwei shenwei 2.1G 9月 13 03:38 nr.6656.fa.gz
        #
        # pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\n>/g' | grep '^>' -c
        # 288415413
- makeblastdb

    pigz -dc nr.$id.fa.gz > nr.$id.fa

    # time: 3min ($nr.$id.fa from step 3 option 1)
    #
    # building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error:
    #
    #     BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1
    #
    makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id

    # rm nr.$id.fa
- blastp (optional)

    # blastdb nr.$id is built from sequences in step 3 option 1
    #
    blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast
    # real 0m20.866s

    # $ cat t4.fa.blast | grep Query= -A 10
    # Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a
    #
    # Length=35
    # Score E
    # Sequences producing significant alignments: (Bits) Value
    # 2MPQ_A Chain A, Solution structure of the sodium channel toxin Hd1a 72.4 2e-17
    # A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-... 72.4 2e-17
    # ADB56726.1 HNTX-IV.2 precursor [Haplopelma hainanum] 66.6 9e-15
    # D2Y233.1 RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H... 66.6 9e-15
    # ADB56830.1 HNTX-IV.3 precursor [Haplopelma hainanum] 66.6 9e-15
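The "folded" headers mentioned in step 3 (option 2) can be counted without relying on the `\n` escape in sed replacements, which is not portable. The following is a minimal sketch, assuming every `>` on a header line starts one accession, as the counting command in that step does; `count_folded_accessions` is a hypothetical helper name and the FASTA records are toy data.

```shell
# Hypothetical helper: count accessions in FASTA headers where several
# deflines are folded onto one header line, each introduced by ">".
count_folded_accessions() {
    awk '/^>/ { n += gsub(/>/, ">") } END { print n + 0 }'
}

# Toy input: the first record carries two folded accessions, the second one.
printf '>A1 protein x >B2 protein x\nMKV\n>C3 protein y\nMAA\n' \
    | count_folded_accessions
```

A `>` character inside a description would be counted as well, which is the same limitation the original sed-based count has.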
Summaries of taxonomy dataLink
You can change the TaxId of interest.
- Rank counts of common categories.
$ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \
| rush -D ' ' -T b \
'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \
| sed 1d \
| taxonkit filter -i 2 -E genus -L genus \
| taxonkit lineage -L -r \
| csvtk freq -H -t -f 2 -nr \
> stats.{}.tsv '

$ csvtk -t join --outer-join stats.*.tsv \
| csvtk add-header -t -n "rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )" \
| csvtk csv2md -t
rank Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae
species 12482 460940 1349648 156908 957297 191026
strain 354 40643 3486 2352 33 50
genus 205 4112 90882 6844 64148 16202
isolate 7 503 809 76 17 3
species group 2 77 251 22 214 5
serotype 218
serogroup 136
subsection 21 21
subspecies 632 24523 158 17043 7212
forma specialis 521 220 179 33 1
species subgroup 23 101 101
biotype 7 10
morph 12 3 4 5
section 437 37 2 398
genotype 12 12
series 9 5 4
varietas 25 8499 1100 2 7188
forma 4 560 185 6 315
subgenus 1 1558 10 1414 112
pathogroup 5
subvariety 5 5
- Count of all ranks
$ time taxonkit list --ids 1 \
| taxonkit lineage -L -r \
| csvtk freq -H -t -f 2 -nr \
| csvtk pretty -H -t
species 1879659
no rank 222743
genus 96625
strain 44483
subspecies 25174
family 9492
varietas 8524
subfamily 3050
tribe 2213
order 1660
subgenus 1618
isolate 1319
serotype 1216
clade 886
superfamily 865
forma specialis 741
forma 564
subtribe 508
section 437
class 429
suborder 372
species group 330
phylum 272
subclass 156
serogroup 138
infraorder 130
species subgroup 124
superorder 55
subphylum 33
parvorder 26
subsection 21
genotype 20
infraclass 18
biotype 17
morph 12
kingdom 11
series 9
superclass 6
cohort 5
pathogroup 5
subvariety 5
superkingdom 4
subcohort 3
subkingdom 1
superphylum 1

real 0m3.663s
user 0m15.897s
sys 0m1.010s
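For the whole tree, a similar tally can be approximated straight from `nodes.dmp` without taxonkit. The sketch below assumes the standard NCBI layout, in which taxid, parent taxid and rank are the first three fields separated by tab, pipe, tab. It runs on a five-node toy file rather than the real dump, and it cannot restrict the count to a subtree the way `taxonkit list --ids` does.

```shell
tmpdir=$(mktemp -d)
# Toy nodes.dmp with the first three columns only: taxid, parent taxid, rank.
printf '%s\t|\t%s\t|\t%s\t|\n' \
    1    1    "no rank" \
    2    1    "superkingdom" \
    561  2    "genus" \
    562  561  "species" \
    564  561  "species" \
    > "$tmpdir/nodes.dmp"

# Split on the "<tab>|<tab>" delimiter (the trailing tab is optional so the
# last column is handled too), tally column 3, and sort by descending count.
rank_counts=$(awk -F '\t[|]\t?' '{ n[$3]++ } END { for (r in n) print n[r] "\t" r }' "$tmpdir/nodes.dmp" \
    | sort -k1,1nr -k2,2)

echo "$rank_counts"
rm -r "$tmpdir"
```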
- Ranks of taxa at or below species.
$ taxonkit list --ids 1 \
| taxonkit filter --lower-than species --equal-to species \
| taxonkit lineage -L -r \
| csvtk freq -Ht -nr -f 2 \
| csvtk add-header -t -n rank,count \
| csvtk pretty -t
rank count
--------------- -------
species 1880044
no rank 222756
strain 44483
subspecies 25171
varietas 8524
isolate 1319
serotype 1216
clade 885
forma specialis 741
forma 564
serogroup 138
genotype 20
biotype 17
morph 12
pathogroup 5
subvariety 5
Merging GTDB and NCBI taxonomy
Sometimes one needs to build a database that includes bacteria and archaea from GTDB together with viruses from NCBI. The idea is to export lineages from both GTDB and NCBI using taxonkit reformat2, and then create taxdump files from them with taxonkit create-taxdump.
- Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump.
taxonkit list --data-dir gtdb-taxdump/R226/ --ids 1 --indent "" \
| taxonkit filter --data-dir gtdb-taxdump/R226/ --equal-to species \
| taxonkit reformat2 --data-dir gtdb-taxdump/R226/ --taxid-field 1 \
--format "{domain|acellular root|superkingdom}\t{phylum}\t{class}\t{order}\t{family}\t{genus}\t{species}\t{strain|subspecies|no rank}" \
-o gtdb.tsv
- Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. For taxa below the species level whose rank is "no rank", we treat them as taxa of strain rank (--pseudo-strain, taxonkit v0.14.1 needed).
# taxid of Viruses: 10239
taxonkit list --data-dir ~/.taxonkit --ids 10239 --indent "" \
| taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \
| taxonkit reformat2 --data-dir ~/.taxonkit --taxid-field 1 \
--format "{domain|acellular root|superkingdom}\t{phylum}\t{class}\t{order}\t{family}\t{genus}\t{species}\t{strain|subspecies|no rank}" \
-o ncbi-viral.tsv
- Creating taxdump from lineages above.
cat gtdb.tsv ncbi-viral.tsv \
| taxonkit create-taxdump \
--field-accession 1 \
-R "superkingdom,phylum,class,order,family,genus,species,strain" \
-O taxdump

# we use --field-accession 1 to output the mapping file between old taxids and new ones.
$ grep 2697049 taxdump/taxid.map # SARS-COV-2
2697049 21630522
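If many old TaxIds need translating, the mapping file can be joined against a list instead of grepping one ID at a time. The sketch below assumes the map has two tab-separated columns (old TaxId, new TaxId), as the `grep` output above suggests. Apart from the SARS-CoV-2 row, the data and the `old_taxids.txt` file name are invented.

```shell
tmpdir=$(mktemp -d)
# Toy mapping file (old taxid <tab> new taxid); the second row is invented.
printf '%s\t%s\n' 2697049 21630522 562 11111111 > "$tmpdir/taxid.map"
# Old TaxIds to translate; 9606 is deliberately absent from the map.
printf '%s\n' 562 9606 2697049 > "$tmpdir/old_taxids.txt"

# Load the map first (NR == FNR), then print each query with its new taxid,
# or "NA" when the old taxid is not in the map.
translated=$(awk -F '\t' -v OFS='\t' '
    NR == FNR { new[$1] = $2; next }
    { print $1, (($1 in new) ? new[$1] : "NA") }
' "$tmpdir/taxid.map" "$tmpdir/old_taxids.txt")

echo "$translated"
rm -r "$tmpdir"
```

Rows marked `NA` are a quick way to spot TaxIds that did not make it into the merged taxonomy, for example taxa above the species level.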
Some tests:
# SARS-COV-2 in NCBI taxonomy
$ echo 2697049 \
| taxonkit lineage -t --data-dir ~/.taxonkit \
| csvtk cut -Ht -f 3 \
| csvtk unfold -Ht -f 1 -s ";" \
| taxonkit lineage -r -n -L --data-dir ~/.taxonkit \
| csvtk cut -Ht -f 1,3,2 \
| csvtk pretty -Ht
10239 superkingdom Viruses
2559587 clade Riboviria
2732396 kingdom Orthornavirae
2732408 phylum Pisuviricota
2732506 class Pisoniviricetes
76804 order Nidovirales
2499399 suborder Cornidovirineae
11118 family Coronaviridae
2501931 subfamily Orthocoronavirinae
694002 genus Betacoronavirus
2509511 subgenus Sarbecovirus
694009 species Severe acute respiratory syndrome-related coronavirus
2697049 no rank Severe acute respiratory syndrome coronavirus 2
$ echo "Severe acute respiratory syndrome coronavirus 2" | taxonkit name2taxid --data-dir taxdump/
Severe acute respiratory syndrome coronavirus 2 192491219
$ echo 192491219 \
| taxonkit lineage -t --data-dir taxdump/ \
| csvtk cut -Ht -f 3 \
| csvtk unfold -Ht -f 1 -s ";" \
| taxonkit lineage -r -n -L --data-dir taxdump/ \
| csvtk cut -Ht -f 1,3,2 \
| csvtk pretty -Ht
1088277216 superkingdom Viruses
38781089 phylum Pisuviricota
1832208221 class Pisoniviricetes
1393610206 order Nidovirales
779314330 family Coronaviridae
68549826 genus Betacoronavirus
341128742 species Betacoronavirus pandemicum
192491219 strain Severe acute respiratory syndrome coronavirus 2
$ echo "Escherichia coli" | taxonkit name2taxid --data-dir taxdump/
Escherichia coli 599451526
$ echo 599451526 \
| taxonkit lineage -t --data-dir taxdump/ \
| csvtk cut -Ht -f 3 \
| csvtk unfold -Ht -f 1 -s ";" \
| taxonkit lineage -r -n -L --data-dir taxdump/ \
| csvtk cut -Ht -f 1,3,2 \
| csvtk pretty -Ht
81602897 superkingdom Bacteria
1712663402 phylum Pseudomonadota
1969409366 class Gammaproteobacteria
1851777887 order Enterobacterales
1691888815 family Enterobacteriaceae
1028471294 genus Escherichia
599451526 species Escherichia coli
Filtering or subsetting taxdmp files to make a custom taxdmp with given TaxIDs
Suppose you want a smaller version of the official NCBI taxonomy taxdmp, subset to just the lineages of certain species, for example to create small test data for tools that read taxdmp files.
https://github.com/shenwei356/taxonkit/issues/112
Step 1: preparing taxids in the subset tree
# here, only keep nodes at the rank of species
taxonkit list --ids 707,9606 -I "" \
| taxonkit filter -E species \
| taxonkit lineage -t \
| cut -f 3 \
| sed -s 's/;/\n/g' \
> taxids.txt
# the root node
echo 1 >> taxids.txt
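Step 1 writes the ancestors of every kept species, so shared ancestors appear once per lineage. `grep -f` in the next step tolerates the repeats, but a de-duplicated list is easier to inspect. A minimal sketch on two abbreviated lineages (real TaxIds, with most intermediate nodes left out for brevity):

```shell
# Two abbreviated lineages in the "taxid;taxid;..." form printed by
# `taxonkit lineage -t` (most intermediate nodes omitted).
lineages='131567;2;1224;662;28174
131567;2759;9604;9605;9606'

# One TaxId per line, plus the root node, with shared ancestors collapsed.
uniq_taxids=$( { printf '%s\n' "$lineages" | tr ';' '\n'; echo 1; } | sort -n -u)

printf '%s\n' "$uniq_taxids" | wc -l
```

Here `131567` occurs in both lineages but is kept only once.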
Step 2: extracting data of needed nodes
mkdir subset
grep -w -f <(awk '{print "^"$1}' taxids.txt) ~/.taxonkit/nodes.dmp > subset/nodes.dmp
grep -w -f <(awk '{print "^"$1}' taxids.txt) ~/.taxonkit/names.dmp > subset/names.dmp
touch subset/delnodes.dmp subset/merged.dmp
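A subset is only usable if every node's parent is also present. The sketch below checks that on a toy `nodes.dmp`, assuming the standard layout (taxid and parent taxid as the first two fields, separated by tab, pipe, tab). The toy file deliberately omits one parent to show what a failure looks like.

```shell
tmpdir=$(mktemp -d)
# Toy subset nodes.dmp (taxid, parent taxid, rank); node 9606 points to a
# parent (9605) that was deliberately left out.
printf '%s\t|\t%s\t|\t%s\t|\n' \
    1     1     "no rank" \
    2     1     "domain" \
    9606  9605  "species" \
    > "$tmpdir/nodes.dmp"

# Pass 1 records every taxid; pass 2 reports nodes whose parent is unknown.
orphans=$(awk -F '\t[|]\t?' '
    NR == FNR { have[$1] = 1; next }
    !($2 in have) { print $1 " -> missing parent " $2 }
' "$tmpdir/nodes.dmp" "$tmpdir/nodes.dmp")

echo "${orphans:-all parents present}"
rm -r "$tmpdir"
```

Pointing the same awk command at `subset/nodes.dmp` (passed twice) should print nothing if Step 1 collected complete lineages.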
Checking the result. Since there are only two leaves here, we simply dump the whole tree.
$ wc -l subset/*.dmp
0 subset/delnodes.dmp
0 subset/merged.dmp
146 subset/names.dmp
40 subset/nodes.dmp
186 total
$ taxonkit list --ids 1 --data-dir subset/ -nr
1 [no rank] root
131567 [cellular root] cellular organisms
2 [domain] Bacteria
3379134 [kingdom] Pseudomonadati
1224 [phylum] Pseudomonadota
1236 [class] Gammaproteobacteria
135623 [order] Vibrionales
641 [family] Vibrionaceae
662 [genus] Vibrio
28174 [species] Vibrio ordalii
2759 [domain] Eukaryota
33154 [clade] Opisthokonta
33208 [kingdom] Metazoa
6072 [clade] Eumetazoa
33213 [clade] Bilateria
33511 [clade] Deuterostomia
7711 [phylum] Chordata
89593 [subphylum] Craniata
7742 [clade] Vertebrata
7776 [clade] Gnathostomata
117570 [clade] Teleostomi
117571 [clade] Euteleostomi
8287 [superclass] Sarcopterygii
1338369 [clade] Dipnotetrapodomorpha
32523 [clade] Tetrapoda
32524 [clade] Amniota
40674 [class] Mammalia
32525 [clade] Theria
9347 [clade] Eutheria
1437010 [clade] Boreoeutheria
314146 [superorder] Euarchontoglires
9443 [order] Primates
376913 [suborder] Haplorrhini
314293 [infraorder] Simiiformes
9526 [parvorder] Catarrhini
314295 [superfamily] Hominoidea
9604 [family] Hominidae
207598 [subfamily] Homininae
9605 [genus] Homo
9606 [species] Homo sapiens
$ echo 28174 | taxonkit lineage -nr --data-dir subset/
28174 cellular organisms;Bacteria;Pseudomonadota;Gammaproteobacteria;Vibrionales;Vibrionaceae;Vibrio;Vibrio ordalii Vibrio ordalii species