subseq
$ lexicmap utils subseq -h
Extract subsequence via 1) reference name, sequence ID, position and strand, or 2) search result
Input:
- Manually specify reference name, sequence ID, region, and strand.
- Or give the search result file generated by "lexicmap search".
- If the search result data has no header row, after filtering using tools like awk,
please switch on the flag -H/--no-header-row.
Tips:
- The aligned region can be extended with -U/--upstream and/or -D/--downstream.
- If the search results are merged from multiple indexes, it would report error
"reference name not found". You can switch on -e/--ignore-err to ignore this.
Output:
- FASTA format, with a sequence ID in the format of "seqid:begin-end:strand".
The begin and end are positions in the indexed sequence, and begin is <= end.
And the sequence is reverse-complementary when the strand is "-".
Attention:
1. When manually specifying the genome and region, the option -s/--seq-id is optional.
1) If given, the positions are these in the original sequence.
2) If not given, the positions are these in the concatenated sequence.
2. All degenerate bases in reference genomes were converted to the lexicographic first bases.
E.g., N was converted to A. Therefore, consecutive A's in output might be N's in the genomes.
Usage:
lexicmap utils subseq [flags]
Flags:
-b, --buffer-size string ► Size of buffer, supported unit: K, M, G. You need increase the value
when "bufio.Scanner: token too long" error reported (default "20M")
-D, --downstream int ► Extract extra N bp on the downstream of the aligned/specified region.
-h, --help help for subseq
-e, --ignore-err ► Ignore errors such as 'reference name not found' or 'failed to
extract subsequence'. Switch on this flag if search results are merged
from multiple indexes.
-d, --index string ► Index directory created by "lexicmap index".
-w, --line-width int ► Line width of sequence (0 for no wrap). (default 60)
--max-open-files int ► Maximum opened files. It mainly affects candidate subsequence
extraction. Increase this value if you have hundreds of genome batches or
have multiple queries, and do not forgot to set a bigger "ulimit -n" in
shell if the value is > 1024. (default 1024)
-H, --no-header-row ► The search result file has no header row, this happens when using
tools like awk to filter the file.
-o, --out-file string ► Out file, supports the ".gz" suffix ("-" for stdout). (default "-")
-n, --ref-name string ► Reference name.
-r, --region string ► Region of the subsequence (1-based).
-R, --revcom ► Extract subsequence on the negative strand.
-f, --search-result string ► Use search result file from "lexicmap search" as input. It can be "-"
to accept filtered result from stdin
-s, --seq-id string ► Sequence ID. If the value is empty, the positions in the region are
treated as that in the concatenated sequence.
-U, --upstream int ► Extract extra N bp on the upstream of the aligned/specified region.
Global Flags:
-X, --infile-list string ► File of input file list (one file per line). If given, they are
appended to files from CLI arguments.
--log string ► Log file.
--quiet ► Do not print any verbose information. But you can write them to a file
with --log.
-j, --threads int ► Number of CPU cores to use. By default, it uses all available cores.
(default 16)
-
Extracting aligned sequences with the output file from
lexicmap search
.# search $ lexicmap search -d demo.lmi/ q.gene.fasta -o t.txt # extract $ lexicmap utils subseq -d demo.lmi/ -f t.txt -o t.txt.aligned.fasta 17:45:50.134 [INFO] loading index: demo.lmi/ 17:45:50.135 [INFO] creating reader pools for 1 genome batches, each with 16 readers... 17:45:50.135 [INFO] extracting subsequences... 84 subsequences extracted from 84 records 17:45:50.139 [INFO] 84 subsequences extracted from 84 records 17:45:50.139 [INFO] 17:45:50.139 [INFO] elapsed time: 13.923238ms 17:45:50.139 [INFO] # extract with filtered search result $ csvtk filter2 -t -f '$qcovHSP > 90 && $pident > 90' t.txt \ | lexicmap utils subseq -d demo.lmi/ -f - -o t.txt.aligned.fasta 17:46:15.216 [INFO] loading index: demo.lmi/ 17:46:15.216 [INFO] creating reader pools for 1 genome batches, each with 16 readers... 17:46:15.216 [INFO] extracting subsequences... 33 subsequences extracted from 33 records 17:46:15.219 [INFO] 33 subsequences extracted from 33 records 17:46:15.219 [INFO] 17:46:15.219 [INFO] elapsed time: 21.785783ms 17:46:15.219 [INFO] # extend the region with -U/--upstream and/or -D/--downstream. $ csvtk filter2 -t -f '$qcovHSP > 90 && $pident > 90' t.txt \ | lexicmap utils subseq -d demo.lmi/ -f - -o t.txt.aligned.extended.fasta -U 1000 -D 1000
Sequence header
$ seqkit head -n 1 t.txt.aligned.fasta | seqkit seq -n NZ_CP033092.2:458559-460100:+ query=NC_000913.3:4166659-4168200 sgenome=GCF_003697165.2 sseqid=NZ_CP033092.2 qcovGnm=100.000 cls=1 hsp=1 qcovHSP=100.000 alenHSP=1542 pident=99.805 gaps=0 qstart=1 qend=1542 sstart=458559 send=460100 sstr=+ slen=4903501 evalue=0.00e+00 bitscore=2767 $ seqkit head -n 1 t.txt.aligned.extended.fasta | seqkit seq -n NZ_CP033092.2:457559-461100:+ query=NC_000913.3:4166659-4168200 sgenome=GCF_003697165.2 sseqid=NZ_CP033092.2 qcovGnm=100.000 cls=1 hsp=1 qcovHSP=100.000 alenHSP=1542 pident=99.805 gaps=0 qstart=1 qend=1542 sstart=458559 send=460100 sstr=+ slen=4903501 evalue=0.00e+00 bitscore=2767
-
Extracting subsequence with genome ID, sequence ID, position range and strand information.
$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4591684:4593225 -R >NZ_CP033092.2:4591684-4593225:- AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAA GTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAA TGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCAT AACGTCGCAAGACCAAAGAGGGGGACCTTAGGGCCTCTTGCCATCGGATGTGCCCAGATG GGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGG GGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCT TCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATT GACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAG GGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTC GTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACC GGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCA AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCC CTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAAT TCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAG AATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGA AATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGC CGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTC ATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCG ACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAAC TCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGT TCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGT AGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAA CAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
-
If the sequence ID (
-s/--seq-id
) is not given, the positions are these in the concatenated sequence.Checking sequence lengths of a genome with seqkit.
$ seqkit fx2tab -nil refs/GCF_003697165.2.fa.gz NZ_CP033092.2 4903501 NZ_CP033091.2 131333
Extracting the 1000-bp interval sequence inserted by
lexicmap index
.$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501 >GCF_003697165.2:4903502-4904501:+ AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Extend the region with
-U
/--upstream
and/or-D
/--downstream
.$ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501 -U 5 -D 5 >GCF_003697165.2:4903497-0:+ CCGCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTGGA
-
It detects if the end position is larger than the sequence length.
# the length of NZ_CP033092.2 is 4903501 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903501:1000000000 >NZ_CP033092.2:4903501-4903501:+ C $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903502:1000000000 >NZ_CP033092.2:4903502-4903501:+