subseq

Usage

$ lexicmap utils subseq -h
Extract subsequence via 1) reference name, sequence ID, position and strand, or 2) search result

Input:
  - Manually specify reference name, sequence ID, region, and strand.
  - Or give the search result file generated by "lexicmap search".
    - If the search result data has no header row, after filtering using tools like awk,
      please switch on the flag -H/--no-header-row.

Tips:
  - The aligned region can be extended with -U/--upstream and/or -D/--downstream.
  - If the search results are merged from multiple indexes, it would report error
    "reference name not found". You can switch on -e/--ignore-err to ignore this.

Output:
  - FASTA format, with a sequence ID in the format of "seqid:begin-end:strand".
    The begin and end are positions in the indexed sequence, and begin is <= end.
    And the sequence is reverse-complementary when the strand is "-".

Attention:
  1. When manually specifying the genome and region, the option -s/--seq-id is optional.
     1) If given, the positions are these in the original sequence.
     2) If not given, the positions are these in the concatenated sequence.
  2. All degenerate bases in reference genomes were converted to the lexicographic first bases.
     E.g., N was converted to A. Therefore, consecutive A's in output might be N's in the genomes.

Usage:
  lexicmap utils subseq [flags] 

Flags:
  -b, --buffer-size string     ► Size of buffer, supported unit: K, M, G. You need increase the value
                               when "bufio.Scanner: token too long" error reported (default "20M")
  -D, --downstream int         ► Extract extra N bp on the downstream of the aligned/specified region.
  -h, --help                   help for subseq
  -e, --ignore-err             ► Ignore errors such as 'reference name not found' or 'failed to
                               extract subsequence'. Switch on this flag if search results are merged
                               from multiple indexes.
  -d, --index string           ► Index directory created by "lexicmap index".
  -w, --line-width int         ► Line width of sequence (0 for no wrap). (default 60)
      --max-open-files int     ► Maximum opened files. It mainly affects candidate subsequence
                               extraction. Increase this value if you have hundreds of genome batches or
                               have multiple queries, and do not forgot to set a bigger "ulimit -n" in
                               shell if the value is > 1024. (default 1024)
  -H, --no-header-row          ► The search result file has no header row, this happens when using
                               tools like awk to filter the file.
  -o, --out-file string        ► Out file, supports the ".gz" suffix ("-" for stdout). (default "-")
  -n, --ref-name string        ► Reference name.
  -r, --region string          ► Region of the subsequence (1-based).
  -R, --revcom                 ► Extract subsequence on the negative strand.
  -f, --search-result string   ► Use search result file from "lexicmap search" as input. It can be "-"
                               to accept filtered result from stdin
  -s, --seq-id string          ► Sequence ID. If the value is empty, the positions in the region are
                               treated as that in the concatenated sequence.
  -U, --upstream int           ► Extract extra N bp on the upstream of the aligned/specified region.

Global Flags:
  -X, --infile-list string   ► File of input file list (one file per line). If given, they are
                             appended to files from CLI arguments.
      --log string           ► Log file.
      --quiet                ► Do not print any verbose information. But you can write them to a file
                             with --log.
  -j, --threads int          ► Number of CPU cores to use. By default, it uses all available cores.
                             (default 16)

Examples

Extracting aligned sequences with the output file from lexicmap search.

 # search
 $ lexicmap search -d demo.lmi/ q.gene.fasta -o t.txt

 # extract
 $ lexicmap utils subseq -d demo.lmi/ -f t.txt -o t.txt.aligned.fasta
 17:45:50.134 [INFO] loading index: demo.lmi/
 17:45:50.135 [INFO]   creating reader pools for 1 genome batches, each with 16 readers...
 17:45:50.135 [INFO] extracting subsequences...
 84 subsequences extracted from 84 records
 17:45:50.139 [INFO] 84 subsequences extracted from 84 records
 17:45:50.139 [INFO] 
 17:45:50.139 [INFO] elapsed time: 13.923238ms
 17:45:50.139 [INFO] 

 # extract with filtered search result
 $ csvtk filter2 -t -f '$qcovHSP > 90 && $pident > 90' t.txt \
     | lexicmap utils subseq -d demo.lmi/ -f - -o t.txt.aligned.fasta 
 17:46:15.216 [INFO] loading index: demo.lmi/
 17:46:15.216 [INFO]   creating reader pools for 1 genome batches, each with 16 readers...
 17:46:15.216 [INFO] extracting subsequences...
 33 subsequences extracted from 33 records
 17:46:15.219 [INFO] 33 subsequences extracted from 33 records
 17:46:15.219 [INFO] 
 17:46:15.219 [INFO] elapsed time: 21.785783ms
 17:46:15.219 [INFO]

 # extend the region with -U/--upstream and/or -D/--downstream.
 $ csvtk filter2 -t -f '$qcovHSP > 90 && $pident > 90' t.txt \
     | lexicmap utils subseq -d demo.lmi/ -f - -o t.txt.aligned.extended.fasta -U 1000 -D 1000

Sequence header

$ seqkit head -n 1 t.txt.aligned.fasta | seqkit seq -n
NZ_CP033092.2:458559-460100:+ query=NC_000913.3:4166659-4168200 sgenome=GCF_003697165.2 sseqid=NZ_CP033092.2 qcovGnm=100.000 cls=1 hsp=1 qcovHSP=100.000 alenHSP=1542 pident=99.805 gaps=0 qstart=1 qend=1542 sstart=458559 send=460100 sstr=+ slen=4903501 evalue=0.00e+00 bitscore=2767

$ seqkit head -n 1 t.txt.aligned.extended.fasta | seqkit seq -n
NZ_CP033092.2:457559-461100:+ query=NC_000913.3:4166659-4168200 sgenome=GCF_003697165.2 sseqid=NZ_CP033092.2 qcovGnm=100.000 cls=1 hsp=1 qcovHSP=100.000 alenHSP=1542 pident=99.805 gaps=0 qstart=1 qend=1542 sstart=458559 send=460100 sstr=+ slen=4903501 evalue=0.00e+00 bitscore=2767

Extracting subsequence with genome ID, sequence ID, position range and strand information.

 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4591684:4593225 -R
 >NZ_CP033092.2:4591684-4593225:-
 AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAA
 GTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAA
 TGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCAT
 AACGTCGCAAGACCAAAGAGGGGGACCTTAGGGCCTCTTGCCATCGGATGTGCCCAGATG
 GGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA
 GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGG
 GGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCT
 TCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATT
 GACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAG
 GGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
 GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTC
 GTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACC
 GGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCA
 AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCC
 CTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
 AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAAT
 TCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAG
 AATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGA
 AATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGC
 CGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTC
 ATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCG
 ACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAAC
 TCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGT
 TCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGT
 AGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAA
 CAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA

If the sequence ID (-s/--seq-id) is not given, the positions are these in the concatenated sequence.

Checking sequence lengths of a genome with seqkit.

 $ seqkit fx2tab -nil refs/GCF_003697165.2.fa.gz
 NZ_CP033092.2   4903501
 NZ_CP033091.2   131333

Extracting the 1000-bp interval sequence inserted by lexicmap index.

 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501
 >GCF_003697165.2:4903502-4904501:+
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Extend the region with -U/--upstream and/or -D/--downstream.

 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501 -U 5 -D 5
 >GCF_003697165.2:4903497-0:+
 CCGCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTGGA

It detects if the end position is larger than the sequence length.

 # the length of NZ_CP033092.2 is 4903501

 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903501:1000000000
 >NZ_CP033092.2:4903501-4903501:+
 C


 $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903502:1000000000
 >NZ_CP033092.2:4903502-4903501:+