Sequence and genome searching

Using cases

Searching sequences against raws reads.
- For checking existence:
  - Building database with raws reads and searching with query sequences, optionally using k-mer sketches for long query sequences.
- For abundance estimation:
  - For long sequences (similar to taxonomic profiling):
    - Building database with query sequences, searching with raws reads, and profiling.
  - For short (shorter than reads length) sequences:
    - Not capable.
Searching sequences against assemblies/genomes.
- For checking existence:
  - For short sequences (short reads, e.g., detecting host contamination):
    - Building database with assemblies/genomes and searching with raws reads.
  - For long sequences:
    - Building database with assemblies/genomes using k-mer sketches.
- For genome similarity estimation:
  - Building database with assemblies/genomes using k-mer sketches.
- For taxonomic profiling:
  - For short reads:
    - Building database with assemblies/genomes, searching with raws reads, and profiling.
  - For long reads:
    - Split long reads into short reads before searching.

Sequence search

KMCP can be used for fast sequence search against large scales of genomic datasets as BIGSI and COBS do.

KMCP reimplemented and modified the Compact Bit-Sliced Signature index (COBS) algorithm, bringing a small database size and much faster searching speed (check the benchmark).

Step 1. Building databases

The database building process is similar to that of metagenomic profiling, with one difference:

Reference genomes are not splitted into chunks.

Taken GTDB for example:

# mask low-complexity region (optional)
mkdir -p gtdb.masked
find gtdb/ -name "*.fna.gz" \
    | rush 'dustmasker -in <(zcat {}) -outfmt fasta \
        | sed -e "/^>/!s/[a-z]/n/g" \
        | gzip -c > gtdb.masked/{%}'

# compute k-mers
#   sequences containing "plasmid" in name are ignored,
#   k = 21
kmcp compute -I gtdb.masked/ -k 21 -B plasmid -O gtdb-r202-k21

# build database
#   number of index files: 32, for server with >= 32 CPU cores
#   bloom filter parameter:
#     number of hash function: 1
#     false positive rate: 0.3
kmcp index -j 32 -I gtdb-r202-k21 -O gtdb-r202-k21 -n 1 -f 0.3

# cp name mapping file to database directory
cp name.map gtdb-r202-k21.kmcp/

The size of database is 56GB.

Step 2. Searching

By default, kmcp search loads the whole database into main memory (RAM) for fast searching. Optionally, the flag --low-mem can be set to avoid loading the whole database, while it's much slower, >10X slower on SSD and should be much slower on HDD disks. Please switch on this flag when searching on computer clusters, where the default mmap mode would be very slow for network-attached storage (NAS)

kmcp search supports FASTA/Q format from STDIN or files (usage).

kmcp search -d gtdb.kmcp/ test.fq.gz -o result.tsv.gz

22:21:26.017 [INFO] kmcp v0.8.0
22:21:26.017 [INFO]   https://github.com/shenwei356/kmcp
22:21:26.017 [INFO] 
22:21:26.017 [INFO] checking input files ...
22:21:26.017 [INFO]   1 input file(s) given
22:21:26.018 [INFO] loading database with mmap enabled ...
22:21:26.018 [INFO]   number of extra workers for every index file: 4
22:21:26.328 [INFO] database loaded: gtdb.kmcp/
22:21:26.328 [INFO] 
22:21:26.328 [INFO] -------------------- [main parameters] --------------------
22:21:26.328 [INFO]   minimum    query length: 30
22:21:26.328 [INFO]   minimum  matched k-mers: 10
22:21:26.328 [INFO]   minimum  query coverage: 0.550000
22:21:26.328 [INFO]   minimum target coverage: 0.000000
22:21:26.328 [INFO]   minimum target coverage: 0.000000
22:21:26.328 [INFO] -------------------- [main parameters] --------------------
22:21:26.328 [INFO] 
22:21:26.328 [INFO] searching ...
22:21:26.328 [INFO] reading sequence file: test.fq.gz

22:21:26.376 [INFO] 
22:21:26.376 [INFO] processed queries: 10, speed: 0.012 million queries per minute
22:21:26.376 [INFO] 90.0000% (9/10) queries matched
22:21:26.376 [INFO] done searching
22:21:26.423 [INFO] 
22:21:26.423 [INFO] elapsed time: 405.849288ms
22:21:26.423 [INFO]

The result is tab-delimited.

#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
S0R0/1	150	130	3.1777e-06	2	GCF_002872255.1	5	10	2583551	21	87	0.6692	0.0003	0.0003	0
S0R0/1	150	130	1.1936e-03	2	GCF_001434585.1	6	10	2221511	21	74	0.5692	0.0003	0.0003	0
S0R0/2	150	130	6.2145e-07	4	GCF_002872255.1	5	10	2583551	21	90	0.6923	0.0004	0.0004	1
S0R0/2	150	130	3.1777e-06	4	GCF_001434585.1	6	10	2221511	21	87	0.6692	0.0004	0.0004	1
S0R0/2	150	130	2.4002e-05	4	GCF_001438655.1	5	10	2038820	21	83	0.6385	0.0004	0.0004	1

Reference IDs (column target) can be optionally mapped to their names during searching:

kmcp search -d gtdb-r202-k21.kmcp/ -N name.map test.fq.gz -o result.tsv.gz

Or after searching using csvtk:

csvtk replace -t -C $ -f target -k name.map -p '(.+)' -r '{kv}' result.tsv.gz

Print the main columns only:

csvtk head -t -C $ -n 5 result.tsv.gz \
    | csvtk rename -t -C $ -f 1 -n query \
    | csvtk cut -t -f query,FPR,qCov,target \
    | csvtk csv2md -t

query	FPR	qCov	target
S0R0/1	3.1777e-06	0.6692	GCF_002872255.1
S0R0/1	1.1936e-03	0.5692	GCF_001434585.1
S0R0/2	6.2145e-07	0.6923	GCF_002872255.1
S0R0/2	3.1777e-06	0.6692	GCF_001434585.1
S0R0/2	2.4002e-05	0.6385	GCF_001438655.1

Genome similarity estimation

KMCP can be used for fast similarity estimation of newly assembled genome against known reference genomes. BLAST is an option while it focuses on local similarity. Besides, the database is big and the searching speed is relative slow.

Genome sketching is a method of utilizing small and approximate summaries of genomic data for fast searching and comparison. Mash and Sourmash provide fast genome distance estimation using MinHash (Mash) or FracMinHash (Scaled MinHash) (Sourmash). Here KMCP utilizes multiple sketches (Minimizer, FracMinHash (previously named Scaled MinHash) and Syncmers) for genome similarity estimation.

Prebuilt databases are available and users can also build custom databases following steps below.

Step 1. Building databases

The database building process is similar to that of metagenomic profiling, with few differences:

K-mer sketches instead of all k-mers are computed.
Reference genomes are not splitted into chunks.
Smaller false positive rate (-f) and more than one hash functions (-n) are used to improve accuracy.
Using a bigger k-mer size: 31.

Supported k-mer (sketches) types:

K-mer:
- ntHash of k-mer (-k)
K-mer sketchs (all using ntHash):
- FracMinHash (-k -D), previously named Scaled MinHash
- Minimizer (-k -W), optionally scaling/down-sampling (-D)
- Closed Syncmer (-k -S), optionally scaling/down-sampling (-D)

Taken GTDB for example:

# compute FracMinHash (Scaled MinHash) with scale 1000
#   sequence containing "plasmid" in name are ignored,
#   k = 31
kmcp compute -I gtdb/ -k 31 -D 1000 -B plasmid -O gtdb-r202-minhash

# build database
#   number of index files: 8, for server with >= 8 CPU cores
#   bloom filter parameter:
#     number of hash function: 3
#     false positive rate: 0.001
kmcp index -j 8 -I gtdb-r202-minhash -O gtdb.minhash.kmcp -n 3 -f 0.001

# cp name mapping file to database directory
cp taxid.map name.map gtdb.minhash.kmcp/

Attention:

For small genomes like viruses, sketching parameters should be adjusted. For examples, setting a smaller scale like 10.

Step 2. Searching

The searching process is simple and very fast (<1 second).

kmcp search --query-whole-file -d gtdb.minhash.kmcp/ \
    --query-whole-file --sort-by jacc --min-query-cov 0.2 \
    --query-id genome1 contigs.fasta -o result.tsv

The output is in tab-delimited format:

 1. query,    Identifier of the query sequence
 2. qLen,     Query length
 3. qKmers,   K-mer number of the query sequence
 4. FPR,      False positive rate of the match
 5. hits,     Number of matches
 6. target,   Identifier of the target sequence
 7. chunkIdx, Index of reference chunk
 8. chunks,   Number of reference chunks
 9. tLen,     Reference length
10. kSize,    K-mer size
11. mKmers,   Number of matched k-mers
12. qCov,     Query coverage,  equals to: mKmers / qKmers
13. tCov,     Target coverage, equals to: mKmers / K-mer number of reference chunk
14. jacc,     Jaccard index
15. queryIdx, Index of query sequence, only for merging

A full search result:

#query	qLen	qKmers	FPR	hits	target	chunkIdx	chunks	tLen	kSize	mKmers	qCov	tCov	jacc	queryIdx
genome1	9488952	18737	0.0000e+00	2	GCF_000742135.1	0	1	5545784	31	8037	0.4289	0.7365	0.3719	0
genome1	9488952	18737	3.1964e-183	2	GCF_000392875.1	0	1	2881400	31	3985	0.2127	0.7062	0.1954	0

Reference IDs can be optionally mapped to their names, let's print the main columns only:

kmcp search --query-whole-file -d gtdb.minhash.kmcp/ \
    --name-map name.map \
    --query-whole-file --sort-by jacc --min-query-cov 0.2 \
    --query-id genome1 contigs.fasta \
    | csvtk rename -t -C $ -f 1 -n query \
    | csvtk cut -t -f query,jacc,target \
    > result.tsv

query	jacc	target
genome1	0.3719	NZ_KN046818.1 Klebsiella pneumoniae strain ATCC 13883 scaffold1, whole genome shotgun sequence
genome1	0.1954	NZ_KB944588.1 Enterococcus faecalis ATCC 19433 acAqW-supercont1.1, whole genome shotgun sequence

Using closed syncmer:

kmcp search --query-whole-file -d gtdb.syncmer.kmcp/ \
    --name-map name.map \
    --query-whole-file --sort-by jacc --min-query-cov 0.2 \
    --query-id genome1 contigs.fasta \
    | csvtk rename -t -C $ -f 1 -n query \
    | csvtk cut -t -f query,jacc,target

query	jacc	target
genome1	0.3712	NZ_KN046818.1 Klebsiella pneumoniae strain ATCC 13883 scaffold1, whole genome shotgun sequence
genome1	0.1974	NZ_KB944588.1 Enterococcus faecalis ATCC 19433 acAqW-supercont1.1, whole genome shotgun sequence