FAQs on FASTA/Q manipulations

This page was originally written as the section FASTA/Q manipulations of The Biostar Handbook: A Beginner's Guide to Bioinformatics (discussion on Biostars.org).

This page illustrates common FASTA/Q manipulations using SeqKit. Some other utilities, including csvtk (CSV/TSV toolkit) and shell commands were also used.

Note: SeqKit seamlessly support FASTA and FASTQ formats both in their original form or in stored in gzipped compressed format. We list FASTA or FASTQ depending on the more common usage but you can always use it on the other type as well.

Example data
How to produce an overview of FASTQ files?
How to get GC content of every sequence in FASTA/Q file?
How to extract sequences subset from FASTA/Q file with name/ID list file?
How to find FASTA/Q sequences containing degenerate bases and locate them?
How to remove duplicated FASTA/Q records with same sequences?
How to locate motif/subsequence/enzyme digest site in FASTA/Q sequence?
How to sort huge number of FASTA sequences by length?
How to split FASTA sequences according to information in header?
How to search and replace FASTA header with known character strings from a text file?
How to extract paired reads from two paired-end reads file?
How to concatenate two FASTA sequences in to one?

Example data

One FASTQ file (sample reads, 1M) and two FASTA files (Virus DNA and protein sequences from NCBI RefSeq database, 60+40M) are used.

wget http://data.biostarhandbook.com/reads/duplicated-reads.fq.gz
wget ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
wget ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.1.protein.faa.gz

How to produce an overview of FASTQ files?

Sequence format and type are automatically detected.

$ seqkit stat *.gz
file                      format  type     num_seqs      sum_len  min_len   avg_len    max_len
duplicated-reads.fq.gz    FASTQ   DNA        15,000    1,515,000      101       101        101
viral.1.1.genomic.fna.gz  FASTA   DNA         7,048  203,325,120      200  28,848.6  2,473,870
viral.1.protein.faa.gz    FASTA   Protein   252,611   62,024,702        5     245.5      8,960

How to get GC content of every sequence in FASTA/Q file?

seqkit fx2tab converts FASTA/Q to 3-column tabular format (1th: name/ID, 2nd: sequence, 3rd: quality), and can also provides various information in new columns, including sequence length, GC content/GC skew, alphabet.

GC content:

$ seqkit fx2tab --name --only-id --gc viral*.fna.gz
gi|526245010|ref|NC_021865.1|                   40.94
gi|460042095|ref|NC_020479.1|                   41.82
gi|526244636|ref|NC_021857.1|                   49.17

Custom bases (A, C and A+C):

$ seqkit fx2tab -H -n -i -B a -B c -B ac viral.1.1.genomic.fna.gz
#name   seq     qual    a       c       ac
gi|526245010|ref|NC_021865.1|                   33.20   18.24   51.44
gi|460042095|ref|NC_020479.1|                   32.57   19.63   52.20
gi|526244636|ref|NC_021857.1|                   25.52   23.06   48.59

How to extract sequences subset from FASTA/Q file with name/ID list file?

This is a frequently used manipulation. Let's create a sample ID list file, which may also come from other way like mapping result.

$ seqkit sample --proportion 0.001  duplicated-reads.fq.gz \
    | seqkit seq --name --only-id > id.txt

ID list file:

$ head id.txt
SRR1972739.2996
SRR1972739.3044
SRR1972739.3562

Searching by ID list file:

$ seqkit grep --pattern-file id.txt duplicated-reads.fq.gz \
    > duplicated-reads.subset.fq.gz

How to find FASTA/Q sequences containing degenerate bases and locate them?

seqkit fx2tab converts FASTA/Q to tabular format and can output the sequence alphabet in a new column. And then text searching tools can be used to filter the table.

$ seqkit fx2tab -n -i -a viral.1.1.genomic.fna.gz \
    | csvtk -H -t grep -f 4 -r -i -p "[^ACGT]"
gi|446730228|ref|NC_019782.1|                   ACGNT
gi|557940284|ref|NC_022800.1|                   ACGKT
gi|564292828|ref|NC_023009.1|                   ACGNT

Long-option version of the command:

$ seqkit fx2tab --name --only-id --alphabet  viral.1.1.genomic.fna.gz \
    | csvtk --no-header-row --tabs grep --fields 4 --use-regexp --ignore-case --pattern "[^ACGT]"

You can then exclude these sequences with seqkit grep:

# save the sequenece IDs.
$ seqkit fx2tab -n -i -a viral.1.1.genomic.fna.gz \
    | csvtk -H -t grep -f 4 -r -i -p "[^ACGT]" | csvtk -H -t cut -f 1 > id2.txt

# search and exclude.
$ seqkit grep --pattern-file id2.txt --invert-match viral.1.1.genomic.fna.gz > clean.fa

Or locate the degenerate bases, e.g, N and K

$ seqkit grep --pattern-file id2.txt viral.1.1.genomic.fna.gz \
    | seqkit locate --ignore-case --only-positive-strand --pattern K+ --pattern N+
seqID   patternName     pattern strand  start   end     matched
gi|564292828|ref|NC_023009.1|   N+      N+      +       87972   87972   N
gi|564292828|ref|NC_023009.1|   N+      N+      +       100983  100983  N
gi|557307918|ref|NC_022755.1|   K+      K+      +       1788    1788    K
gi|557307918|ref|NC_022755.1|   K+      K+      +       4044    4044    K
gi|589287065|ref|NC_023585.1|   K+      K+      +       28296   28296   K
gi|590911929|ref|NC_023639.1|   N+      N+      +       741654  741753  NNNNNNNNNNNNNNNNNNNNNNNNNNN

How to remove duplicated FASTA/Q records with same sequences?

$ seqkit rmdup --by-seq --ignore-case duplicated-reads.fq.gz > duplicated-reads.uniq.fq.gz

If the FASTA/Q file is very large, please switch on flag -m/--md5, which use MD5 instead of original seqs to reduce memory usage when comparing by sequences.

You can also deduplicate according to sequence ID (default) or full name (--by-name).

How to locate motif/subsequence/enzyme digest site in FASTA/Q sequence?

Assuming a list of motifs (enzyme digest sites) in FASTA format to be located:

$ cat enzymes.fa
>EcoRI
GAATTC
>MmeI
TCCRAC
>SacI
GAGCTC
>XcmI
CCANNNNNNNNNTGG

Flag --degenerate is on because patterns contain degenerate bases. Command:

$ seqkit locate --degenerate --ignore-case --pattern-file enzymes.fa viral.1.1.genomic.fna.gz

Sample output (simplified and reformated by csvtk -t uniq -f 3 | csvtk -t pretty)

seqID                           patternName   pattern           strand   start   end     matched
gi|526245010|ref|NC_021865.1|   MmeI          TCCRAC            +        1816    1821    TCCGAC
gi|526245010|ref|NC_021865.1|   SacI          GAGCTC            +        19506   19511   GAGCTC
gi|526245010|ref|NC_021865.1|   XcmI          CCANNNNNNNNNTGG   +        2221    2235    CCATATTTAGTGTGG

How to sort huge number of FASTA sequences by length?

Sorting FASTA file in order of sequence size (small to large).

$ seqkit sort --by-length viral.1.1.genomic.fna.gz > viral.1.1.genomic.sorted.fa

If the files are too big, use flag --two-pass which consumes lesser memory.

$ seqkit sort --by-length --two-pass viral.1.1.genomic.fna.gz > viral.1.1.genomic.sorted.fa

You can also sort by sequence ID (default), full header (--by-name) or sequence content (--by-seq).

How to split FASTA sequences according to information in header?

For example, FASTA header line of viral.1.protein.faa.gz contain species name in square brackets.

Overview of FASTA Headers:

$ seqkit head -n 3 viral.1.protein.faa.gz | seqkit seq --name
gi|526245011|ref|YP_008320337.1| terminase small subunit [Paenibacillus phage phiIBB_Pl23]
gi|526245012|ref|YP_008320338.1| terminase large subunit [Paenibacillus phage phiIBB_Pl23]
gi|526245013|ref|YP_008320339.1| portal protein [Paenibacillus phage phiIBB_Pl23]

seqkit split can split FASTA/Q files according to ID, number of parts, size of every parts, and sequence region. In this case, we'll split according to sequence ID (species names) which can be specified by flag --id-regexp.

Default ID:

$ seqkit head -n 3 viral.1.protein.faa.gz \
    | seqkit seq --name --only-id
gi|526245011|ref|YP_008320337.1|
gi|526245012|ref|YP_008320338.1|
gi|526245013|ref|YP_008320339.1|

New ID:

$ seqkit head -n 3 viral.1.protein.faa.gz \
    | seqkit seq --name --only-id --id-regexp "\[(.+)\]"
Paenibacillus phage phiIBB_Pl23
Paenibacillus phage phiIBB_Pl23
Paenibacillus phage phiIBB_Pl23

Split:

$ seqkit split --by-id --id-regexp "\[(.+)\]" viral.1.protein.faa.gz

How to search and replace FASTA header with known character strings from a text file?

seqKit replace can find substrings in FASTA/Q header with regular expression and replace them with strings or corresponding values of found substrings provided by the tab-delimited key-value file.

For example, to unify names of protein with unknown functions, we want to rename "hypothetical" to "putative" in lower case. The replacing rules are listed below in tab-delimited file:

$ cat changes.tsv
Hypothetical    putative
hypothetical    putative
Putative        putative

Overview the FASTA headers containing "hypothetical":

$ seqkit grep --by-name --use-regexp --ignore-case --pattern hypothetical viral.1.protein.faa.gz \
    | seqkit head -n 3 | seqkit seq --name
gi|526245016|ref|YP_008320342.1| hypothetical protein IBBPl23_06 [Paenibacillus phage phiIBB_Pl23]
gi|526245019|ref|YP_008320345.1| hypothetical protein IBBPl23_09 [Paenibacillus phage phiIBB_Pl23]
gi|526245020|ref|YP_008320346.1| hypothetical protein IBBPl23_10 [Paenibacillus phage phiIBB_Pl23]

A regular expression, ^([^ ]+ )(\w+), was used to specify the key to be replaced, which is the first word after sequence ID in this case. Note that we also capture the ID (^([^ ]+ )) so we can restore it in "replacement" with capture variable ${1} (robuster than $1). And flag -I/--key-capt-idx (default: 1) is set to 2 because the key (\w+) is the second captured match. Command:

$ seqkit replace --kv-file changes.tsv --pattern "^([^ ]+ )(\w+) " \
    --replacement "\${1}{kv} " --key-capt-idx 2 --keep-key viral.1.protein.faa.gz > renamed.fa

How to extract paired reads from two paired-end reads file?

Use seqkit pair.

How to concatenate two FASTA sequences in to one?

Data (not in same order):

$ cat 1.fa
>seq1
aaaaa
>seq2
ccccc
>seq3
ggggg

$ cat 2.fa
>seq3
TTTTT
>seq2
GGGGG
>seq1
CCCCC

Just one command:

$ seqkit concat 1.fa 2.fa
>seq1
aaaaaCCCCC
>seq2
cccccGGGGG
>seq3
gggggTTTTT