Detecting specific pathogens in sequencing data
KMCP v0.9.3 or later versions is needed, which fixed a bug in chunk computation when splitting circular genomes.
Data
-
Reference genomes (taken HBV virus for example).
wget https://hbvdb.lyon.inserm.fr/data/nucleic/fasta/all_Genomes.fas # removing duplicated sequences seqkit rmdup -s all_Genomes.fas -o refs.fa seqkit stats refs.fa file format type num_seqs sum_len min_len avg_len max_len refs.fa FASTA DNA 9,248 29,673,289 3,182 3,208.6 3,275 # --- only for HBV ref genomes ----- # only keep at most 10 genomes for each genotype seqkit fx2tab -i refs.fa \ | csvtk mutate -Ht -p '_([A-Z\-]+)$' \ | csvtk uniq -Ht -f 4 -n 10 \ | seqkit tab2fx -o refs.slim.fa file format type num_seqs sum_len min_len avg_len max_len refs.slim.fa FASTA DNA 221 709,855 3,182 3,212 3,257
-
NCBI taxdump files, only needed for multiple reference genomes.
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz mkdir -p taxdump tar -zxvf taxdump.tar.gz -C taxdump/
Preprocessing reference genomes
Saving genomes in separated files, with the genome ID as the file name.
# 1. split into separated files, each with one genome
seqkit split2 -s 1 refs.slim.fa -O refs --force
# 2. rename files
# rush: https://github.com/shenwei356/rush
find refs/ -name "*.fa" \
| rush 'id=$(seqkit seq -ni {} | head -n 1); mv {} "refs/$id.fa"'
tree refs/ | head -n 8
refs/
├── gnl|hbvnuc|AB010290_FT00000_P-B.fa
├── gnl|hbvnuc|AB014392_FT00000_P-C.fa
├── gnl|hbvnuc|AB033558_FT00000_P-RF-DE.fa
├── gnl|hbvnuc|AB059660_FT00000_P-H.fa
├── gnl|hbvnuc|AB059661_FT00000_P-H.fa
├── gnl|hbvnuc|AB064310_FT00000_P-G.fa
├── gnl|hbvnuc|AB064311_FT00000_P-G.fa
Creating a KMCP database
Spliting each genome into 10 chunks with 150-bp overlaps, and computing k-mers.
# the default k-mer size is 21.
# but for many highly similar reference genomes, you may need a larger k value.
kmcp compute --circular -k 31 -n 10 -l 150 -I refs/ -O refs.tmp --force
Indexing the k-mers, using a small false-positive rate for small genomes.
# you can add the flag --dry-run to see the size of database in advance
kmcp index -f 0.001 -I refs.tmp/ -O refs.kmcp --force
Creating the taxid mapping file (only needed for multiple reference genomes).
# 10407 is the taxid of Hepatitis B virus species
cut -f 1 refs.kmcp/R001/__name_mapping.tsv \
| awk '{print $0"\t10407"}' \
> taxid.map
cp taxid.map refs.kmcp/
Searching reads against the KMCP database
kmcp search -w -d refs.kmcp/ sample_1.fq.gz sample_2.fq.gz -o sample.kmcp.tsv.gz
Profiling
-
For the database with a single reference genome.
# the preset profiling mode 1 with a higher sensitivity is used here. kmcp profile --level strain -m 1 \ sample.kmcp.tsv.gz -o sample.k.profile -C sample.c.profile -M sample.m.profile
-
For the database with multiple reference genomes.
# the preset profiling mode 1 with a higher sensitivity is used here. kmcp profile -X taxdump/ -T taxid.map -m 1 \ sample.kmcp.tsv.gz -o sample.k.profile -C sample.c.profile -M sample.m.profile