Metagenomic profiling

Requirements

Database
- Prebuilt databases are available.
- Or build custom databases.
Hardware
- CPU: ≥ 32 cores preferred.
- RAM: ≥ 64 GB, depending on the file size of the largest database.

Datasets

Short reads, single or paired end.

Steps

Step 1. Preprocessing reads

For example, removing adapters and trimming using fastp:

fastp -i in_1.fq.gz -I in_2.fq.gz  \
    -o out_1.fq.gz -O out_2.fq.gz \
    -l 75 -q 20 -W 4 -M 20 -3 20 --thread 32 \
    --trim_poly_g --poly_g_min_len 10 --low_complexity_filter \
    --html out.fastp.html

Step 2. Removing host reads

Tools:

bowtie2 is recommended for removing host reads.
samtools is used for processing the read mapping output.

Host reference genomes:

Human: CHM13. We also provide a database of CHM13 for fast removal of human reads.

Building the index (~60min):

bowtie2-build --threads 32 GCA_009914755.4_T2T-CHM13v2.0_genomic.fna.gz chm13v2.0

Mapping and removing mapped reads:

index=~/ws/db/bowtie2/chm13v2.0

# paired-end reads
bowtie2 --threads 32 -x $index -1 in_1.fq.gz -2 in_2.fq.gz \
    | samtools fastq -f 4 -o sample.fq.gz -

# unpaired reads
bowtie2 --threads 32 -x $index -U in.fq.gz \
    | samtools fastq -f 4 | pigz -c > sample.fq.gz
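After host removal it can be useful to check how many reads survived. The following is a minimal sketch, not part of KMCP: the helper name count_reads is hypothetical, and it assumes a well-formed gzipped FASTQ file with exactly four lines per record.

```shell
# Count reads in a gzipped FASTQ file, assuming 4 lines per record.
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Hypothetical usage with the unpaired file names used above:
#   before=$(count_reads in.fq.gz)
#   after=$(count_reads sample.fq.gz)
#   echo "reads kept after host removal: $after of $before"
```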

Step 3. Searching

Reads can be searched against multiple databases, which can be built with different parameters, and the results can be quickly merged for downstream analysis.

Notes

Input should be (gzipped) FASTA or FASTQ, read from files or stdin. Paired-end reads should be given via -1/--read1 and -2/--read2.
```
kmcp search -d db -1 read_1.fq.gz -2 read_2.fq.gz -o read.tsv.gz
```
Single-end reads can be given as positional arguments or via -1/-2.
```
kmcp search -d db file1.fq.gz file2.fq.gz -o result.tsv.gz
```
Single-end mode is recommended for paired-end reads, for higher sensitivity.
A long query sequence may contain duplicated k-mers, which are not removed for short sequences by default. You may modify the value of -u/--kmer-dedup-threshold (default 256) to remove duplicates.
For long reads or contigs, you should split them into short reads using seqkit sliding, e.g.,
```
seqkit sliding -s 100 -W 300
```
The values of tCov and jacc in results only apply to databases built with a single size of k-mer.
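
To make the seqkit sliding parameters in the notes above concrete, the sketch below lists the window coordinates produced by a window of 300 (-W) and a step of 100 (-s) on a 600 bp sequence. It only illustrates the arithmetic and is not how seqkit is implemented; the function name sliding_windows is hypothetical, and it assumes that trailing windows shorter than the window size are dropped.

```shell
# Print 1-based start-end coordinates of sliding windows over a sequence.
# usage: sliding_windows <sequence-length> <window> <step>
sliding_windows() {
    awk -v len="$1" -v w="$2" -v s="$3" 'BEGIN {
        # stop as soon as a full window no longer fits into the sequence
        for (start = 1; start + w - 1 <= len; start += s)
            printf "%d-%d\n", start, start + w - 1
    }'
}

sliding_windows 600 300 100
```

For these values the windows are 1-300, 101-400, 201-500 and 301-600, so every position is covered by up to three overlapping fragments.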

kmcp search and kmcp profile share some flags, therefore users can use stricter criteria in kmcp profile.

-t/--min-query-cov, minimum query coverage, i.e., the ratio of matched k-mers to unique k-mers of a query (default 0.55, corresponding to ~96.5% sequence similarity)
-f/--max-fpr, maximum false positive rate of a query. (default 0.05)

Index files loading modes

Using memory-mapped index files with mmap (default):
- Faster startup speed when index files are buffered in memory.
- Multiple KMCP processes can share the memory.
Loading the whole index files into memory (-w/--load-whole-db):
- This mode occupies a little more memory, and multiple KMCP processes cannot share the database in memory.
- It's slightly faster due to the use of physically contiguous memory. The speedup is more significant for smaller databases.
- Please switch on this flag (-w) when searching on computer clusters, where the default mmap mode would be very slow for network-attached storage (NAS).
Low memory mode (--low-mem):
- Index files are neither loaded into memory nor memory-mapped; file seeking is used instead.
- It's much slower: >4X slower on SSDs, and slower still on HDDs.
- Only use this mode for a small number of queries or for a huge database that can't be loaded into memory.
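
The choice between the three loading modes can be summarized as a small decision rule. This is only a sketch of the guidance above: the function name choose_load_flag is hypothetical, sizes are whole gigabytes supplied by the user, and comparing database size with free memory is a simplification.

```shell
# Suggest a kmcp search loading flag from the rules listed above.
# usage: choose_load_flag <db-size-GB> <free-RAM-GB> <on-NAS: yes|no>
# Prints "--low-mem", "--load-whole-db", or an empty line (default mmap mode).
choose_load_flag() {
    local db_gb=$1 ram_gb=$2 on_nas=$3
    if [ "$db_gb" -gt "$ram_gb" ]; then
        echo "--low-mem"          # database cannot be held in memory
    elif [ "$on_nas" = "yes" ]; then
        echo "--load-whole-db"    # mmap is slow on network-attached storage
    else
        echo ""                   # keep the default mmap mode
    fi
}

choose_load_flag 50 64 yes
```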

Performance tips:

Increase the value of -j/--threads for acceleration, but values larger than the number of CPU cores won't bring extra speedup.
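
One way to follow this tip is to cap the requested thread count at the number of available cores. The sketch below assumes getconf _NPROCESSORS_ONLN (widely available on Linux and macOS) or nproc (GNU coreutils) is present; the variable names are illustrative.

```shell
# Number of online CPU cores.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)

# Cap a requested thread count at the number of cores.
requested=32
if [ "$requested" -gt "$cores" ]; then
    threads=$cores
else
    threads=$requested
fi
echo "using $threads threads"

# The value can then be passed on, e.g.: kmcp search --threads $threads ...
```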

Commands

Single-end mode is recommended for paired-end reads, for higher sensitivity:

# ---------------------------------------------------
# single-end (recommended)

read1=sample_1.fq.gz
read2=sample_2.fq.gz
sample=sample

# 1. searching results against multiple databases
for db in refseq-fungi.kmcp genbank-viral.kmcp gtdb.kmcp ; do
    dbname=$(basename $db)

    kmcp search \
        --threads            32 \
        --db-dir            $db \
        --min-kmers          10 \
        --min-query-len      30 \
        --min-query-cov    0.55 \
        $read1                  \
        $read2                  \
        --out-file         $sample.kmcp@$dbname.tsv.gz \
        --log              $sample.kmcp@$dbname.tsv.gz.log
done

# 2. Merging search results against multiple databases
kmcp merge $sample.kmcp@*.tsv.gz --out-file $sample.kmcp.tsv.gz

Paired-end reads:

# ---------------------------------------------------
# paired-end

read1=sample_1.fq.gz
read2=sample_2.fq.gz
sample=sample

# 1. searching results against multiple databases
for db in refseq-fungi.kmcp genbank-viral.kmcp gtdb.kmcp ; do
    dbname=$(basename $db)

    kmcp search \
        --threads            32 \
        --db-dir            $db \
        --min-kmers          10 \
        --min-query-len      30 \
        --min-query-cov    0.55 \
        --read1          $read1 \
        --read2          $read2 \
        --out-file       $sample.kmcp@$dbname.tsv.gz \
        --log            $sample.kmcp@$dbname.tsv.gz.log
done

# 2. Merging search results against multiple databases
kmcp merge $sample.kmcp@*.tsv.gz --out-file $sample.kmcp.tsv.gz

Search result format

Tab-delimited format with 15 columns:

 1. query,    Identifier of the query sequence
 2. qLen,     Query length
 3. qKmers,   K-mer number of the query sequence
 4. FPR,      False positive rate of the match
 5. hits,     Number of matches
 6. target,   Identifier of the target sequence
 7. chunkIdx, Index of reference chunk
 8. chunks,   Number of reference chunks
 9. tLen,     Reference length
10. kSize,    K-mer size
11. mKmers,   Number of matched k-mers
12. qCov,     Query coverage,  equals to: mKmers / qKmers
13. tCov,     Target coverage, equals to: mKmers / K-mer number of reference chunk
14. jacc,     Jaccard index
15. queryIdx, Index of query sequence, only for merging

Note: The header line starts with #, so you need to assign another comment character when using csvtk for analysis, e.g.,

csvtk filter2 -C '$' -t -f '$qCov > 0.55' mock.kmcp.gz

Demo result:

#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
NC_003197.2-64416/1	150	130	7.4626e-15	1	GCF_000006945.2	9	10	4857450	21	90	0.6923	0.0002	0.0002	1
NC_003197.2-64414/1	150	130	7.4626e-15	1	GCF_000006945.2	6	10	4857450	21	130	1.0000	0.0003	0.0003	2
NC_003197.2-64412/1	150	130	7.4626e-15	1	GCF_000006945.2	6	10	4857450	21	121	0.9308	0.0002	0.0002	3
NC_003197.2-64410/1	150	130	7.4626e-15	1	GCF_000006945.2	1	10	4857450	21	101	0.7769	0.0002	0.0002	4
NC_003197.2-64408/1	150	130	7.8754e-15	1	GCF_000006945.2	9	10	4857450	21	83	0.6385	0.0002	0.0002	5
NC_003197.2-64406/1	150	130	7.4626e-15	1	GCF_000006945.2	2	10	4857450	21	103	0.7923	0.0002	0.0002	6
NC_003197.2-64404/1	150	130	7.4671e-15	1	GCF_000006945.2	5	10	4857450	21	86	0.6615	0.0002	0.0002	7
NC_003197.2-64402/1	150	130	7.5574e-15	1	GCF_000006945.2	3	10	4857450	21	84	0.6462	0.0002	0.0002	8
NC_003197.2-64400/1	150	130	7.4626e-15	1	GCF_000006945.2	1	10	4857450	21	89	0.6846	0.0002	0.0002	9
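
The qCov column can be recomputed from the qKmers and mKmers columns. The sketch below does this for three rows of the demo result, reduced to query, qKmers and mKmers; the variable names are illustrative.

```shell
# Three rows from the demo result, reduced to query, qKmers and mKmers.
demo='NC_003197.2-64416/1 130 90
NC_003197.2-64414/1 130 130
NC_003197.2-64412/1 130 121'

# qCov = mKmers / qKmers, printed with four decimals as in the search result.
qcov=$(printf '%s\n' "$demo" | awk '{ printf "%s\t%.4f\n", $1, $3 / $2 }')
printf '%s\n' "$qcov"
```

The printed values, 0.6923, 1.0000 and 0.9308, match the qCov column of the corresponding demo rows.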

Searching on a computer cluster

Update: We recommend analyzing one sample on one compute node, which is easier to set up.

Here, we split the GTDB genomes into 16 partitions and build a database for each partition, so that a computer cluster can accelerate the search. The genbank-viral genomes are likewise divided into 4 partitions.

A helper script easy_sbatch is used for batch submitting Slurm jobs via script templates.

# ---------------------------------------------------
# searching


j=32
reads=reads

# -----------------
# gtdb

dbprefix=~/ws/db/kmcp/gtdb.n16-

for file in $reads/*.left.fq.gz; do
    prefix=$(echo $file | sed 's/.left.fq.gz//')
    read1=$file
    read2=$(echo $file | sed 's/left.fq.gz/right.fq.gz/')

    ls -d $dbprefix*.kmcp \
        | easy_sbatch \
            -c $j -J $(basename $prefix) \
            "kmcp search         \
                --load-whole-db  \
                --threads   $j   \
                --db-dir    {}   \
                $read1 $read2    \
                --out-file  $prefix.kmcp@\$(basename {}).tsv.gz \
                --log       $prefix.kmcp@\$(basename {}).tsv.gz.log \
                --quiet "
done

# -----------------
# viral

dbprefix=~/ws/db/kmcp/genbank-viral.n4-

for file in $reads/*.left.fq.gz; do
    prefix=$(echo $file | sed 's/.left.fq.gz//')
    read1=$file
    read2=$(echo $file | sed 's/left.fq.gz/right.fq.gz/')

    ls -d $dbprefix*.kmcp \
        | easy_sbatch \
            -c $j -J $(basename $prefix) \
            "kmcp search         \
                --load-whole-db  \
                --threads   $j   \
                --db-dir    {}   \
                $read1 $read2    \
                --out-file  $prefix.kmcp@\$(basename {}).tsv.gz \
                --log       $prefix.kmcp@\$(basename {}).tsv.gz.log \
                --quiet "
done

# -----------------
# fungi

dbprefix=~/ws/db/kmcp/refseq-fungi

for file in $reads/*.left.fq.gz; do
    prefix=$(echo $file | sed 's/.left.fq.gz//')
    read1=$file
    read2=$(echo $file | sed 's/left.fq.gz/right.fq.gz/')

    ls -d $dbprefix*.kmcp \
        | easy_sbatch \
            -c $j -J $(basename $prefix) \
            "kmcp search         \
                --load-whole-db  \
                --threads   $j   \
                --db-dir    {}   \
                $read1 $read2    \
                --out-file  $prefix.kmcp@\$(basename {}).tsv.gz \
                --log       $prefix.kmcp@\$(basename {}).tsv.gz.log \
                --quiet "
done


# ---------------------------------------------------
# wait for all jobs to finish



# ---------------------------------------------------
# merge result and profiling

# merge results
# there's no need to submit this to Slurm, which could make it slower, because the bottleneck is file I/O
for file in $reads/*.left.fq.gz; do
    prefix=$(echo $file | sed 's/.left.fq.gz//')

    echo $prefix; date
    kmcp merge $prefix.kmcp@*.tsv.gz --out-file $prefix.kmcp.tsv.gz \
        --quiet --log $prefix.kmcp.tsv.gz.merge.log
done

# profiling
X=taxdump/
T=taxid.map

fd kmcp.tsv.gz$ $reads/ \
    | rush -v X=$X -v T=$T \
        'kmcp profile -X {X} -T {T} {} -o {}.k.profile -C {}.c.profile -s {%:} \
            --log {}.k.profile.log'
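
The script above leaves the waiting step between submission and merging open. One possible way to fill it is to poll squeue until no job with the sample's name remains. This is a sketch under assumptions, not part of KMCP or easy_sbatch: the helper name wait_for_jobs is hypothetical, and it relies on the jobs having been submitted with -J $(basename $prefix) as in the script.

```shell
# Block until the current user has no Slurm jobs with the given name.
# usage: wait_for_jobs <job-name> [poll-seconds]
wait_for_jobs() {
    local name=$1 interval=${2:-60}
    local user=${USER:-$(id -un)}
    # squeue -h: no header line; -u: filter by user; -n: filter by job name
    while [ -n "$(squeue -h -u "$user" -n "$name")" ]; do
        sleep "$interval"
    done
}

# Hypothetical usage before merging the results of one sample:
#   wait_for_jobs "$(basename $prefix)"
```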

Step 4. Profiling

Input

TaxId mapping file(s).
Taxdump files.
KMCP search results.

Methods

Reference genomes can be split into chunks when computing k-mers (sketches), which helps to increase specificity via a threshold, i.e., the minimum proportion of matched chunks (-p/--min-chunks-fraction) (highly recommended). Another flag, -d/--max-chunks-depth-stdev, further reduces false positives.
We require some of the uniquely matched reads of a reference to have high similarity, i.e., high confidence, to decrease the false positive rate.
We also use the two-stage taxonomy assignment algorithm in MegaPath to reduce false positives from ambiguous matches. You can disable this step with the flag --no-amb-corr. If stage 1/4 produces thousands of candidates, using --no-amb-corr reduces the analysis time and has very little effect on the results.
Abundances are estimated using an Expectation-Maximization (EM) algorithm.
Input files are parsed multiple times, therefore STDIN is not supported.

Accuracy notes:

Smaller values of -t/--min-query-cov increase sensitivity at the cost of a higher false positive rate (-f/--max-fpr) of a query.
We also require some of the uniquely matched reads of a reference to have high similarity, i.e., high confidence, to decrease false positives. E.g., -H >= 0.8 and -P >= 0.1 is equivalent to requiring that the 90th percentile of qCov be >= 0.8.
- -U/--min-hic-ureads, minimum number, >= 1
- -H/--min-hic-ureads-qcov, minimum query coverage, >= -t/--min-query-cov
- -P/--min-hic-ureads-prop, minimum proportion, higher values increase precision at the cost of sensitivity.
-R/--max-mismatch-err and -D/--min-dreads-prop are for determining the right reference for ambiguous reads with the algorithm in MegaPath.
--keep-perfect-match is not recommended, which decreases sensitivity.
-n/--keep-top-qcovs is not recommended, which affects accuracy of abundance estimation.
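
How -U, -H and -P interact can be illustrated with the nine qCov values of the demo search result. This is a simplified sketch, not KMCP's implementation: it treats the nine demo reads as the uniquely matched reads of a single reference, which is an assumption made for illustration.

```shell
# qCov values of the nine demo reads, treated here as uniquely matched reads
# of one reference (an assumption for illustration).
qcovs='0.6923 1.0000 0.9308 0.7769 0.6385 0.7923 0.6615 0.6462 0.6846'

H=0.8   # stands in for -H/--min-hic-ureads-qcov
P=0.1   # stands in for -P/--min-hic-ureads-prop
U=1     # stands in for -U/--min-hic-ureads

# $qcovs is left unquoted on purpose so that each value becomes one line.
verdict=$(printf '%s\n' $qcovs | awk -v H="$H" -v P="$P" -v U="$U" '
    $1 >= H { hic++ }    # count high-confidence reads
    END {
        prop = hic / NR
        printf "hic=%d prop=%.4f %s\n", hic, prop,
            (hic >= U && prop >= P) ? "keep" : "drop"
    }')
echo "$verdict"
```

Two of the nine reads reach qCov >= 0.8, a proportion of 0.2222, so this toy reference passes all three thresholds.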

Profiling modes

We preset six profiling modes, available with the flag -m/--mode.

0 (for pathogen detection)
1 (higher recall)
2 (high recall)
3 (default)
4 (high precision)
5 (higher precision)

You can still change the values of some options below as usual.

options                       m=0    m=1   m=2   m=3    m=4   m=5
---------------------------   ----   ---   ---   ----   ---   ----
-r/--min-chunks-reads         1      5     10    50     100   100
-p/--min-chunks-fraction      0.2    0.6   0.7   0.8    1     1
-d/--max-chunks-depth-stdev   10     2     2     2      2     1.5
-u/--min-uniq-reads           1      2     5     20     50    50
-U/--min-hic-ureads           1      1     2     5      10    10
-H/--min-hic-ureads-qcov      0.7    0.7   0.7   0.75   0.8   0.8
-P/--min-hic-ureads-prop      0.01   0.1   0.2   0.1    0.1   0.15
--keep-main-matches           true                            
--max-qcov-gap                0.4
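
For scripting, the presets in the table can be spelled out as explicit options. The helper below is a hypothetical convenience, not part of KMCP. Its values are copied from the table, and it omits --keep-main-matches and --max-qcov-gap, which the table lists only for mode 0.

```shell
# Print the explicit kmcp profile options matching a preset mode (0-5).
mode_flags() {
    case "$1" in
        0) set -- 1   0.2 10  1  1  0.7  0.01 ;;
        1) set -- 5   0.6 2   2  1  0.7  0.1  ;;
        2) set -- 10  0.7 2   5  2  0.7  0.2  ;;
        3) set -- 50  0.8 2   20 5  0.75 0.1  ;;
        4) set -- 100 1   2   50 10 0.8  0.1  ;;
        5) set -- 100 1   1.5 50 10 0.8  0.15 ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    # after "set --", $1..$7 hold the seven values of the chosen column
    echo "--min-chunks-reads $1 --min-chunks-fraction $2" \
         "--max-chunks-depth-stdev $3 --min-uniq-reads $4" \
         "--min-hic-ureads $5 --min-hic-ureads-qcov $6" \
         "--min-hic-ureads-prop $7"
}

mode_flags 3
```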

Taxonomy data:

Mapping reference IDs to TaxIds: -T/--taxid-map
NCBI taxonomy dump files: -X/--taxdump

For databases built with a custom genome collection, you can use taxonkit create-taxdump to create NCBI-style taxdump files, which also generates a TaxId mapping file.

Performance notes:

Search results are parsed in parallel, and the number of lines processed by a thread can be set with the flag --line-chunk-size.
However, using many threads does not always accelerate processing; 4 threads with a chunk size of 500-5000 is fast enough.
If stage 1/4 produces thousands of candidates, stage 2/4 will be very slow. You can use the flag --no-amb-corr to disable ambiguous read correction, which has very little effect on the results.

Commands

# taxid mapping files, multiple files supported.
taxid_map=gtdb.kmcp/taxid.map,refseq-viral.kmcp/taxid.map,refseq-fungi.kmcp/taxid.map

# or concatenate them into a big taxid.map
#    cat gtdb.kmcp/taxid.map refseq-viral.kmcp/taxid.map refseq-fungi.kmcp/taxid.map > taxid.map
# taxid_map=taxid.map

# taxdump directory
taxdump=taxdump

sfile=$file.kmcp.tsv.gz

kmcp profile \
    --taxid-map      $taxid_map \
    --taxdump         $taxdump/ \
    --level             species \
    --min-query-cov        0.55 \
    --min-chunks-reads       50 \
    --min-chunks-fraction   0.8 \
    --max-chunks-depth-stdev  2 \
    --min-uniq-reads         20 \
    --min-hic-ureads          5 \
    --min-hic-ureads-qcov  0.75 \
    --min-hic-ureads-prop   0.1 \
    $sfile                      \
    --out-file         $sfile.kmcp.profile \
    --metaphlan-report $sfile.metaphlan.profile \
    --cami-report      $sfile.cami.profile \
    --sample-id        "0" \
    --binning-result   $sfile.binning.gz

Profiling result formats

Taxonomic profiling output formats:

KMCP (-o/--out-file). Note that abundances are only computed for target references rather than for each taxon at all taxonomic ranks, so please output the CAMI or MetaPhlAn format as well.
CAMI (-C/--cami-report, sample name: -s/--sample-id, taxonomy data: --taxonomy-id)
MetaPhlAn (-M/--metaphlan-report, --metaphlan-report-version, sample name: -s/--sample-id)

Taxonomic binning formats:

CAMI (-B/--binning-result)

KMCP format (Tab-delimited format with 17 columns):

 1. ref,                Identifier of the reference genome
 2. percentage,         Relative abundance of the reference
 3. coverage,           Average coverage of the reference
 4. score,              The 90th percentile of qCov of uniquely matched reads
 5. chunksFrac,         Genome chunks fraction
 6. chunksRelDepth,     Relative depths of reference chunks
 7. chunksRelDepthStd,  The standard deviation of chunksRelDepth
 8. reads,              Total number of matched reads of this reference
 9. ureads,             Number of uniquely matched reads
10. hicureads,          Number of uniquely matched reads with high-confidence
11. refsize,            Reference size
12. refname,            Reference name, optional via name mapping file
13. taxid,              TaxId of the reference
14. rank,               Taxonomic rank
15. taxname,            Taxonomic name
16. taxpath,            Complete lineage
17. taxpathsn,          Corresponding TaxIds of taxa in the complete lineage

Demo output:

ref	percentage	coverage	score	chunksFrac	chunksRelDepth	chunksRelDepthStd	reads	ureads	hicureads	refsize	taxid	rank	taxname	taxpath	taxpathsn
GCF_003697165.2	18.663804	1.864553	100.00	1.00	1.04;0.90;1.03;1.00;0.90;1.00;1.00;1.02;1.11;0.99	0.06	60952	27831	15850	4903501	4093283224	species	Escherichia coli	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli	609216830;3788559933;329474883;3160438580;2234733759;3334977531;4093283224
GCF_002949675.1	18.201855	1.818404	97.69	1.00	1.04;0.93;1.02;1.03;1.04;0.97;1.04;1.02;0.95;0.98	0.04	53288	17152	8866	4395762	524994882	species	Shigella dysenteriae	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae	609216830;3788559933;329474883;3160438580;2234733759;2258433137;524994882
GCF_000006945.2	18.143627	1.812587	100.00	1.00	1.02;0.98;0.98;0.99;1.03;0.99;0.98;1.03;0.97;1.02	0.02	58697	57300	40690	4857450	1678121664	species	Salmonella enterica	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica	609216830;3788559933;329474883;3160438580;2234733759;794943543;1678121664
GCF_000742135.1	17.738253	1.772089	100.00	1.00	1.01;1.01;1.02;0.99;1.01;1.01;1.00;0.97;0.96;1.03	0.02	65518	63665	44088	5545864	3958205156	species	Klebsiella pneumoniae	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae	609216830;3788559933;329474883;3160438580;2234733759;2440106587;3958205156

CAMI format:

@SampleID:0
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@TaxonomyID:
@@TAXID RANK    TAXPATH TAXPATHSN   PERCENTAGE
609216830   superkingdom    609216830   Bacteria    100.000000
3788559933  phylum  609216830|3788559933    Bacteria|Proteobacteria 94.259786
3642462009  phylum  609216830|3642462009    Bacteria|Firmicutes 5.740214
329474883   class   609216830|3788559933|329474883  Bacteria|Proteobacteria|Gammaproteobacteria 94.259786
1845768359  class   609216830|3642462009|1845768359 Bacteria|Firmicutes|Bacilli 5.740214
3160438580  order   609216830|3788559933|329474883|3160438580   Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales    90.475599
185544332   order   609216830|3642462009|1845768359|185544332   Bacteria|Firmicutes|Bacilli|Lactobacillales 3.952422
1718572462  order   609216830|3788559933|329474883|1718572462   Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales  3.596852
813944714   order   609216830|3642462009|1845768359|813944714   Bacteria|Firmicutes|Bacilli|Bacillales  1.787792
2185117029  order   609216830|3788559933|329474883|2185117029   Bacteria|Proteobacteria|Gammaproteobacteria|Moraxellales    0.098158
86398254    order   609216830|3788559933|329474883|86398254 Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales 0.089177
2234733759  family  609216830|3788559933|329474883|3160438580|2234733759    Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae 90.475599
2871800275  family  609216830|3788559933|329474883|1718572462|2871800275    Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae  3.596852
3209851916  family  609216830|3642462009|1845768359|185544332|3209851916    Bacteria|Firmicutes|Bacilli|Lactobacillales|Enterococcaceae 3.595973
1997712377  family  609216830|3642462009|1845768359|813944714|1997712377    Bacteria|Firmicutes|Bacilli|Bacillales|Staphylococcaceae    1.787792
1255484345  family  609216830|3642462009|1845768359|185544332|1255484345    Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae    0.356449
943158193   family  609216830|3788559933|329474883|2185117029|943158193 Bacteria|Proteobacteria|Gammaproteobacteria|Moraxellales|Moraxellaceae  0.098158
1478401337  family  609216830|3788559933|329474883|86398254|1478401337  Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae    0.089177
2258433137  genus   609216830|3788559933|329474883|3160438580|2234733759|2258433137 Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Shigella    35.929915
3334977531  genus   609216830|3788559933|329474883|3160438580|2234733759|3334977531 Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Escherichia 18.663804
794943543   genus   609216830|3788559933|329474883|3160438580|2234733759|794943543  Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Salmonella  18.143627
2440106587  genus   609216830|3788559933|329474883|3160438580|2234733759|2440106587 Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Klebsiella  17.738253
2077617176  genus   609216830|3788559933|329474883|1718572462|2871800275|2077617176 Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus  3.596852
602175708   genus   609216830|3642462009|1845768359|185544332|3209851916|602175708  Bacteria|Firmicutes|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus    3.595973
1824050977  genus   609216830|3642462009|1845768359|813944714|1997712377|1824050977 Bacteria|Firmicutes|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus 1.787792
2394826844  genus   609216830|3642462009|1845768359|185544332|1255484345|2394826844 Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus  0.356449
568178587   genus   609216830|3788559933|329474883|2185117029|943158193|568178587   Bacteria|Proteobacteria|Gammaproteobacteria|Moraxellales|Moraxellaceae|Acinetobacter    0.098158
1616653803  genus   609216830|3788559933|329474883|86398254|1478401337|1616653803   Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas    0.089177
4093283224  species 609216830|3788559933|329474883|3160438580|2234733759|3334977531|4093283224  Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Escherichia|Escherichia coli    18.663804
524994882   species 609216830|3788559933|329474883|3160438580|2234733759|2258433137|524994882   Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Shigella|Shigella dysenteriae   18.201855
1678121664  species 609216830|3788559933|329474883|3160438580|2234733759|794943543|1678121664   Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Salmonella|Salmonella enterica  18.143627
3958205156  species 609216830|3788559933|329474883|3160438580|2234733759|2440106587|3958205156  Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Klebsiella|Klebsiella pneumoniae    17.738253
2695851945  species 609216830|3788559933|329474883|3160438580|2234733759|2258433137|2695851945  Bacteria|Proteobacteria|Gammaproteobacteria|Enterobacterales|Enterobacteriaceae|Shigella|Shigella flexneri  17.728060
1063930303  species 609216830|3788559933|329474883|1718572462|2871800275|2077617176|1063930303  Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus|Haemophilus parainfluenzae   1.809292
3809813362  species 609216830|3642462009|1845768359|185544332|3209851916|602175708|3809813362   Bacteria|Firmicutes|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecalis  1.800250
4145431389  species 609216830|3642462009|1845768359|185544332|3209851916|602175708|4145431389   Bacteria|Firmicutes|Bacilli|Lactobacillales|Enterococcaceae|Enterococcus|Enterococcus faecium   1.795723
328800344   species 609216830|3788559933|329474883|1718572462|2871800275|2077617176|328800344   Bacteria|Proteobacteria|Gammaproteobacteria|Pasteurellales|Pasteurellaceae|Haemophilus|Haemophilus influenzae   1.787560
1920251658  species 609216830|3642462009|1845768359|813944714|1997712377|1824050977|1920251658  Bacteria|Firmicutes|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus epidermidis  0.906778
1569132721  species 609216830|3642462009|1845768359|813944714|1997712377|1824050977|1569132721  Bacteria|Firmicutes|Bacilli|Bacillales|Staphylococcaceae|Staphylococcus|Staphylococcus aureus   0.881014
1527235303  species 609216830|3642462009|1845768359|185544332|1255484345|2394826844|1527235303  Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|Streptococcus mitis  0.178996
2983929374  species 609216830|3642462009|1845768359|185544332|1255484345|2394826844|2983929374  Bacteria|Firmicutes|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus|Streptococcus pneumoniae 0.177453
72054943    species 609216830|3788559933|329474883|2185117029|943158193|568178587|72054943  Bacteria|Proteobacteria|Gammaproteobacteria|Moraxellales|Moraxellaceae|Acinetobacter|Acinetobacter baumannii    0.098158
3843752343  species 609216830|3788559933|329474883|86398254|1478401337|1616653803|3843752343    Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas|Pseudomonas aeruginosa 0.089177

Related tools:

taxonkit profile2cami can convert any metagenomic profile table with TaxIds to CAMI format.
taxonkit cami-filter can remove taxa of given TaxIds and their descendants in CAMI metagenomic profile.
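
A CAMI profile can also be inspected with plain awk. The sketch below sums the PERCENTAGE column for one rank, using a few rows of the demo profile rebuilt with tab separators. The function name rank_total is hypothetical, and the column order is taken from the @@TAXID header line shown above.

```shell
# A few tab-separated lines in CAMI profile layout, taken from the demo above.
cami=$(printf '%b\n' \
    '@SampleID:0' \
    '@@TAXID\tRANK\tTAXPATH\tTAXPATHSN\tPERCENTAGE' \
    '609216830\tsuperkingdom\t609216830\tBacteria\t100.000000' \
    '3788559933\tphylum\t609216830|3788559933\tBacteria|Proteobacteria\t94.259786' \
    '3642462009\tphylum\t609216830|3642462009\tBacteria|Firmicutes\t5.740214')

# Sum the PERCENTAGE column (5th) over all rows of one rank, skipping
# header lines that start with "@".
rank_total() {
    printf '%s\n' "$cami" | awk -F '\t' -v rank="$1" '
        /^@/ { next }
        $2 == rank { sum += $5 }
        END { printf "%.6f\n", sum }'
}

rank_total phylum
```

In the demo, the two phylum rows add up to 100.000000, the same total as the superkingdom row.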

MetaPhlAn3 format (--metaphlan-report):

#SampleID   0
#clade_name NCBI_tax_id relative_abundance  additional_species
k__Bacteria 609216830   100.000000
k__Bacteria|p__Proteobacteria   609216830|3788559933    94.259786
k__Bacteria|p__Firmicutes   609216830|3642462009    5.740214
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria    609216830|3788559933|329474883  94.259786
k__Bacteria|p__Firmicutes|c__Bacilli    609216830|3642462009|1845768359 5.740214
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales    609216830|3788559933|329474883|3160438580   90.475599
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales 609216830|3642462009|1845768359|185544332   3.952422
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pasteurellales  609216830|3788559933|329474883|1718572462   3.596852
k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales  609216830|3642462009|1845768359|813944714   1.787792
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Moraxellales    609216830|3788559933|329474883|2185117029   0.098158
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales 609216830|3788559933|329474883|86398254 0.089177
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae  609216830|3788559933|329474883|3160438580|2234733759    90.475599
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pasteurellales|f__Pasteurellaceae   609216830|3788559933|329474883|1718572462|2871800275    3.596852
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae  609216830|3642462009|1845768359|185544332|3209851916    3.595973
k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Staphylococcaceae 609216830|3642462009|1845768359|813944714|1997712377    1.787792
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae 609216830|3642462009|1845768359|185544332|1255484345    0.356449
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Moraxellales|f__Moraxellaceae   609216830|3788559933|329474883|2185117029|943158193 0.098158
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Pseudomonadaceae 609216830|3788559933|329474883|86398254|1478401337  0.089177
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella  609216830|3788559933|329474883|3160438580|2234733759|2258433137 35.929915
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia   609216830|3788559933|329474883|3160438580|2234733759|3334977531 18.663804
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Salmonella    609216830|3788559933|329474883|3160438580|2234733759|794943543  18.143627
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Klebsiella    609216830|3788559933|329474883|3160438580|2234733759|2440106587 17.738253
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pasteurellales|f__Pasteurellaceae|g__Haemophilus    609216830|3788559933|329474883|1718572462|2871800275|2077617176 3.596852
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Enterococcus  609216830|3642462009|1845768359|185544332|3209851916|602175708  3.595973
k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Staphylococcaceae|g__Staphylococcus   609216830|3642462009|1845768359|813944714|1997712377|1824050977 1.787792
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus    609216830|3642462009|1845768359|185544332|1255484345|2394826844 0.356449
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Moraxellales|f__Moraxellaceae|g__Acinetobacter  609216830|3788559933|329474883|2185117029|943158193|568178587   0.098158
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Pseudomonadaceae|g__Pseudomonas  609216830|3788559933|329474883|86398254|1478401337|1616653803   0.089177
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli   609216830|3788559933|329474883|3160438580|2234733759|3334977531|4093283224  18.663804
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella dysenteriae  609216830|3788559933|329474883|3160438580|2234733759|2258433137|524994882   18.201855
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Salmonella|s__Salmonella enterica 609216830|3788559933|329474883|3160438580|2234733759|794943543|1678121664   18.143627
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Klebsiella|s__Klebsiella pneumoniae   609216830|3788559933|329474883|3160438580|2234733759|2440106587|3958205156  17.738253
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Shigella|s__Shigella flexneri 609216830|3788559933|329474883|3160438580|2234733759|2258433137|2695851945  17.728060
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pasteurellales|f__Pasteurellaceae|g__Haemophilus|s__Haemophilus parainfluenzae  609216830|3788559933|329474883|1718572462|2871800275|2077617176|1063930303  1.809292
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Enterococcus|s__Enterococcus faecalis 609216830|3642462009|1845768359|185544332|3209851916|602175708|3809813362   1.800250
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Enterococcaceae|g__Enterococcus|s__Enterococcus faecium  609216830|3642462009|1845768359|185544332|3209851916|602175708|4145431389   1.795723
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pasteurellales|f__Pasteurellaceae|g__Haemophilus|s__Haemophilus influenzae  609216830|3788559933|329474883|1718572462|2871800275|2077617176|328800344   1.787560
k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Staphylococcaceae|g__Staphylococcus|s__Staphylococcus epidermidis 609216830|3642462009|1845768359|813944714|1997712377|1824050977|1920251658  0.906778
k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Staphylococcaceae|g__Staphylococcus|s__Staphylococcus aureus  609216830|3642462009|1845768359|813944714|1997712377|1824050977|1569132721  0.881014
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus mitis 609216830|3642462009|1845768359|185544332|1255484345|2394826844|1527235303  0.178996
k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus pneumoniae    609216830|3642462009|1845768359|185544332|1255484345|2394826844|2983929374  0.177453
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Moraxellales|f__Moraxellaceae|g__Acinetobacter|s__Acinetobacter baumannii   609216830|3788559933|329474883|2185117029|943158193|568178587|72054943  0.098158
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Pseudomonadaceae|g__Pseudomonas|s__Pseudomonas aeruginosa    609216830|3788559933|329474883|86398254|1478401337|1616653803|3843752343    0.089177

MetaPhlAn2 format (--metaphlan-report-version 2 --metaphlan-report):

#SampleID   
k__Bacteria 100.000000
k__Bacteria|p__Proteobacteria   99.530189
k__Bacteria|p__Verrucomicrobia  0.469811
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria    99.530189
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae  0.469811
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales    99.530189
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales    0.469811
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae  99.530189
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales|f__Akkermansiaceae 0.469811
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia   99.530189
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales|f__Akkermansiaceae|g__Akkermansia  0.469811
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli   99.530189
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales|f__Akkermansiaceae|g__Akkermansia|s__Akkermansia muciniphila   0.469811
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli|t__Escherichia coli SE15  48.321535
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli|t__Escherichia coli K-12  46.194629
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli|t__Escherichia coli O157:H7 str. Sakai    5.014025
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales|f__Akkermansiaceae|g__Akkermansia|s__Akkermansia muciniphila|t__Akkermansia muciniphila ATCC BAA-835   0.469811

Binning result:

# This is the bioboxes.org binning output format at
# https://github.com/bioboxes/rfc/tree/master/data-format
@Version:0.10.0
@SampleID:
@@SEQUENCEID    TAXID
NC_000913.3_sliding:1244941-1245090     511145
NC_013654.1_sliding:344871-345020       562
NC_000913.3_sliding:3801041-3801190     511145
NC_013654.1_sliding:752751-752900       562
NC_000913.3_sliding:4080871-4081020     562
NC_000913.3_sliding:3588091-3588240     511145
NC_000913.3_sliding:2249621-2249770     562
NC_013654.1_sliding:2080171-2080320     431946
NC_000913.3_sliding:2354841-2354990     511145
NC_013654.1_sliding:437671-437820       431946
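
A quick summary of a binning result is the number of reads assigned to each TaxId. The sketch below counts them with awk on six records from the demo above, rebuilt with tab separators; the variable names are illustrative.

```shell
# Header lines and six records from the demo binning result.
binning=$(printf '%b\n' \
    '@Version:0.10.0' \
    '@SampleID:' \
    '@@SEQUENCEID\tTAXID' \
    'NC_000913.3_sliding:1244941-1245090\t511145' \
    'NC_013654.1_sliding:344871-345020\t562' \
    'NC_000913.3_sliding:3801041-3801190\t511145' \
    'NC_013654.1_sliding:752751-752900\t562' \
    'NC_000913.3_sliding:4080871-4081020\t562' \
    'NC_013654.1_sliding:2080171-2080320\t431946')

# Count reads per TaxId, skipping comment (#) and header (@) lines,
# then sort numerically by TaxId for a stable order.
counts=$(printf '%s\n' "$binning" | awk -F '\t' '
    /^[#@]/ || NF < 2 { next }
    { n[$2]++ }
    END { for (t in n) print t "\t" n[t] }' | sort -k1,1n)
printf '%s\n' "$counts"
```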