Skip to main content
LexicMap: efficient sequence alignment against millions of prokaryotic genomes​
GitHub Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Back to homepage

subseq

Usage

$ lexicmap utils subseq -h
Extract subsequence via 1) reference name, sequence ID, position and strand, or 2) search result

Input:
  - Manually specify reference name, sequence ID, region, and strand.
  - Or give the search result file generated by "lexicmap search".
    - If the search result data has no header row, after filtering using tools like awk,
      please switch on the flag -H/--no-header-row.

Tips:
  - The aligned region can be extended with -U/--upstream and/or -D/--downstream.
  - If the search results are merged from multiple indexes, it would report error
    "reference name not found". You can switch on -e/--ignore-err to ignore this.

Output:
  - FASTA format, with a sequence ID in the format of "seqid:begin-end:strand".
    The begin and end are positions in the indexed sequence, and begin is <= end.
    And the sequence is reverse-complementary when the strand is "-".

Attention:
  1. When manually specifying the genome and region, the option -s/--seq-id is optional.
     1) If given, the positions are these in the original sequence.
     2) If not given, the positions are these in the concatenated sequence.
  2. All degenerate bases in reference genomes were converted to the lexicographic first bases.
     E.g., N was converted to A. Therefore, consecutive A's in output might be N's in the genomes.

Usage:
  lexicmap utils subseq [flags] 

Flags:
  -b, --buffer-size string     ► Size of buffer, supported unit: K, M, G. You need increase the value
                               when "bufio.Scanner: token too long" error reported (default "20M")
  -D, --downstream int         ► Extract extra N bp on the downstream of the aligned/specified region.
  -h, --help                   help for subseq
  -e, --ignore-err             ► Ignore errors such as 'reference name not found' or 'failed to
                               extract subsequence'. Switch on this flag if search results are merged
                               from multiple indexes.
  -d, --index string           ► Index directory created by "lexicmap index".
  -w, --line-width int         ► Line width of sequence (0 for no wrap). (default 60)
      --max-open-files int     ► Maximum opened files. It mainly affects candidate subsequence
                               extraction. Increase this value if you have hundreds of genome batches or
                               have multiple queries, and do not forgot to set a bigger "ulimit -n" in
                               shell if the value is > 1024. (default 1024)
  -H, --no-header-row          ► The search result file has no header row, this happens when using
                               tools like awk to filter the file.
  -o, --out-file string        ► Out file, supports the ".gz" suffix ("-" for stdout). (default "-")
  -n, --ref-name string        ► Reference name.
  -r, --region string          ► Region of the subsequence (1-based).
  -R, --revcom                 ► Extract subsequence on the negative strand.
  -f, --search-result string   ► Use search result file from "lexicmap search" as input. It can be "-"
                               to accept filtered result from stdin
  -s, --seq-id string          ► Sequence ID. If the value is empty, the positions in the region are
                               treated as that in the concatenated sequence.
  -U, --upstream int           ► Extract extra N bp on the upstream of the aligned/specified region.

Global Flags:
  -X, --infile-list string   ► File of input file list (one file per line). If given, they are
                             appended to files from CLI arguments.
      --log string           ► Log file.
      --quiet                ► Do not print any verbose information. But you can write them to a file
                             with --log.
  -j, --threads int          ► Number of CPU cores to use. By default, it uses all available cores.
                             (default 16)

Examples

  1. Extracting aligned sequences with the output file from lexicmap search.

     # search
     $ lexicmap search -d demo.lmi/ q.gene.fasta -o t.txt
    
     # extract
     $ lexicmap utils subseq -d demo.lmi/ -f t.txt -o t.txt.aligned.fasta
     17:45:50.134 [INFO] loading index: demo.lmi/
     17:45:50.135 [INFO]   creating reader pools for 1 genome batches, each with 16 readers...
     17:45:50.135 [INFO] extracting subsequences...
     84 subsequences extracted from 84 records
     17:45:50.139 [INFO] 84 subsequences extracted from 84 records
     17:45:50.139 [INFO] 
     17:45:50.139 [INFO] elapsed time: 13.923238ms
     17:45:50.139 [INFO] 
    
     # extract with filtered search result
     $ csvtk filter2 -t -f '$qcovHSP > 90 && $pident > 90' t.txt \
         | lexicmap utils subseq -d demo.lmi/ -f - -o t.txt.aligned.fasta 
     17:46:15.216 [INFO] loading index: demo.lmi/
     17:46:15.216 [INFO]   creating reader pools for 1 genome batches, each with 16 readers...
     17:46:15.216 [INFO] extracting subsequences...
     33 subsequences extracted from 33 records
     17:46:15.219 [INFO] 33 subsequences extracted from 33 records
     17:46:15.219 [INFO] 
     17:46:15.219 [INFO] elapsed time: 21.785783ms
     17:46:15.219 [INFO]
    
     # extend the region with -U/--upstream and/or -D/--downstream.
     $ csvtk filter2 -t -f '$qcovHSP > 90 && $pident > 90' t.txt \
         | lexicmap utils subseq -d demo.lmi/ -f - -o t.txt.aligned.extended.fasta -U 1000 -D 1000
    

    Sequence header

    $ seqkit head -n 1 t.txt.aligned.fasta | seqkit seq -n
    NZ_CP033092.2:458559-460100:+ query=NC_000913.3:4166659-4168200 sgenome=GCF_003697165.2 sseqid=NZ_CP033092.2 qcovGnm=100.000 cls=1 hsp=1 qcovHSP=100.000 alenHSP=1542 pident=99.805 gaps=0 qstart=1 qend=1542 sstart=458559 send=460100 sstr=+ slen=4903501 evalue=0.00e+00 bitscore=2767
    
    $ seqkit head -n 1 t.txt.aligned.extended.fasta | seqkit seq -n
    NZ_CP033092.2:457559-461100:+ query=NC_000913.3:4166659-4168200 sgenome=GCF_003697165.2 sseqid=NZ_CP033092.2 qcovGnm=100.000 cls=1 hsp=1 qcovHSP=100.000 alenHSP=1542 pident=99.805 gaps=0 qstart=1 qend=1542 sstart=458559 send=460100 sstr=+ slen=4903501 evalue=0.00e+00 bitscore=2767
    
  2. Extracting subsequence with genome ID, sequence ID, position range and strand information.

     $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4591684:4593225 -R
     >NZ_CP033092.2:4591684-4593225:-
     AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAA
     GTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAA
     TGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCAT
     AACGTCGCAAGACCAAAGAGGGGGACCTTAGGGCCTCTTGCCATCGGATGTGCCCAGATG
     GGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA
     GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGG
     GGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCT
     TCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATT
     GACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAG
     GGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
     GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTC
     GTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACC
     GGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCA
     AACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCC
     CTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
     AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAAT
     TCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAG
     AATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGA
     AATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGC
     CGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTC
     ATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCG
     ACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAAC
     TCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGT
     TCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGT
     AGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAA
     CAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
    
  3. If the sequence ID (-s/--seq-id) is not given, the positions are these in the concatenated sequence.

    Checking sequence lengths of a genome with seqkit.

     $ seqkit fx2tab -nil refs/GCF_003697165.2.fa.gz
     NZ_CP033092.2   4903501
     NZ_CP033091.2   131333
    

    Extracting the 1000-bp interval sequence inserted by lexicmap index.

     $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501
     >GCF_003697165.2:4903502-4904501:+
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    

    Extend the region with -U/--upstream and/or -D/--downstream.

     $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -r 4903502:4904501 -U 5 -D 5
     >GCF_003697165.2:4903497-0:+
     CCGCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTGGA
    
  4. It detects if the end position is larger than the sequence length.

     # the length of NZ_CP033092.2 is 4903501
    
     $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903501:1000000000
     >NZ_CP033092.2:4903501-4903501:+
     C
    
    
     $ lexicmap utils subseq -d demo.lmi/ -n GCF_003697165.2 -s NZ_CP033092.2 -r 4903502:1000000000
     >NZ_CP033092.2:4903502-4903501:+