Download
SeqKit is implemented in the Go programming language, and statically linked executable binaries are freely available.
Please cite:
Wei Shen*, Botond Sipos, and Liuyang Zhao. 2024. SeqKit2: A Swiss Army Knife for Sequence and Alignment Processing. iMeta e191. doi:10.1002/imt2.191.
Current Version
- SeqKit v2.9.0 - 2024-11-01
  - seqkit:
    - Fix sequence ID parsing with the default regular expression (in this case, we actually use bytes.Index instead) for a rare case: "xxx\tyyy zzz" was wrongly parsed as "xxx\tyyy". #486
  - seqkit locate:
    - Fix -G/--non-greedy for tandem repeats, e.g., ATTCGATTCGATTCG (ATTCGx3).
  - seqkit grep/subseq:
    - Fix negative regions longer than sequence length. #479
  - seqkit stats:
    - Add an extra column sum_n to count the number of ambiguous characters. #490
Links
OS | Arch | File, China mirror |
---|---|---|
Linux | 32-bit | seqkit_linux_386.tar.gz, China mirror |
Linux | 64-bit | seqkit_linux_amd64.tar.gz, China mirror |
Linux | arm64 | seqkit_linux_arm64.tar.gz, China mirror |
macOS | 64-bit | seqkit_darwin_amd64.tar.gz, China mirror |
macOS | arm64 | seqkit_darwin_arm64.tar.gz, China mirror |
Windows | 32-bit | seqkit_windows_386.exe.tar.gz, China mirror |
Windows | 64-bit | seqkit_windows_amd64.exe.tar.gz, China mirror |
Notes
- Please open an issue to request binaries for other platforms.
- Run seqkit version to check for updates!
- Run seqkit genautocomplete to update the shell autocompletion script!
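Choosing the right archive from the list of files can be scripted. The following is a minimal sketch for Linux and macOS only, assuming the file names listed in the Links section; the function name seqkit_asset is hypothetical and not part of SeqKit.

```shell
# Hypothetical helper (not part of SeqKit): map an OS/architecture pair,
# as reported by uname, to the archive name listed in the Links section.
# Returns a non-zero status for platforms that are not listed there.
seqkit_asset() {
    local os arch
    os=${1:-$(uname -s)}
    arch=${2:-$(uname -m)}
    case "$os" in
        Linux)  os=linux ;;
        Darwin) os=darwin ;;
        *)      return 1 ;;
    esac
    case "$arch" in
        x86_64|amd64)  arch=amd64 ;;
        aarch64|arm64) arch=arm64 ;;
        i386|i686)     arch=386 ;;
        *)             return 1 ;;
    esac
    # No 32-bit macOS build is listed.
    if [ "$os" = darwin ] && [ "$arch" = 386 ]; then
        return 1
    fi
    echo "seqkit_${os}_${arch}.tar.gz"
}

# With no arguments, the current machine is inspected.
seqkit_asset || echo "no prebuilt binary listed for this platform"
```

Passing the two values explicitly, e.g. seqkit_asset Linux aarch64, is handy when preparing a download for another machine.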
Installation
Method 1: Download binaries (latest stable version)
Just download the compressed executable file for your operating system,
and decompress it with the tar -zxvf *.tar.gz command or other tools.
And then:
- For Linux-like systems:
  - If you have root privileges, simply copy it to /usr/local/bin:
    sudo cp seqkit /usr/local/bin/
  - Or copy it to any directory listed in the PATH environment variable:
    mkdir -p $HOME/bin/; cp seqkit $HOME/bin/
- For Windows, just copy seqkit.exe to C:\WINDOWS\system32.
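Copying the binary to $HOME/bin only helps if that directory is listed in PATH. A minimal sketch of such a check follows; the helper name on_path is hypothetical and is not something SeqKit provides.

```shell
# Hypothetical helper: succeed if DIR ($1) is an entry of a colon-separated
# path list ($2, defaulting to the current $PATH).
on_path() {
    case ":${2-$PATH}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if on_path "$HOME/bin"; then
    echo "$HOME/bin is already on PATH"
else
    echo "add this line to your shell profile: export PATH=\"\$PATH:\$HOME/bin\""
fi
```

Wrapping both sides in colons makes the match exact, so /usr/bin does not accidentally match /usr/bin2.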
Method 2: Install via conda (latest stable version)
conda install -c bioconda seqkit
Method 3: Install via homebrew (might not be latest stable version)
brew install seqkit
Method 4: For Go developer (latest stable/dev version)
go get -u github.com/shenwei356/seqkit/v2/seqkit/
Method 5: Docker based installation (might not be latest stable version)
git clone this repo:
git clone https://github.com/shenwei356/seqkit
Run the following commands:
cd seqkit
docker build -t shenwei356/seqkit .
docker run -it shenwei356/seqkit:latest
Method 6: Compiling from source (latest stable/dev version)
# ------------------- install golang -----------------
# download Go from https://go.dev/dl
wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz
tar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/
# or
# echo "export PATH=$PATH:$HOME/go/bin" >> ~/.bashrc
# source ~/.bashrc
export PATH=$PATH:$HOME/go/bin
# ------------- the latest stable version -------------
go get -v -u github.com/shenwei356/seqkit/v2/seqkit
# The executable binary file is located in:
# ~/go/bin/seqkit
# You can also move it to anywhere in the $PATH
mkdir -p $HOME/bin
cp ~/go/bin/seqkit $HOME/bin/
# --------------- the development version --------------
git clone https://github.com/shenwei356/seqkit
cd seqkit/seqkit/
go build
# The executable binary file is located in:
# ./seqkit
# You can also move it to anywhere in the $PATH
mkdir -p $HOME/bin
cp ./seqkit $HOME/bin/
Shell-completion
Supported shell: bash|zsh|fish|powershell
Bash:
# generate completion shell
seqkit genautocomplete --shell bash
# configure this once, if you have never done so.
# install bash-completion if the "complete" command is not found.
echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
echo "source ~/.bash_completion" >> ~/.bashrc
Zsh:
# generate completion shell
seqkit genautocomplete --shell zsh --file ~/.zfunc/_seqkit
# configure this once, if you have never done so
echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
echo "autoload -U compinit; compinit" >> ~/.zshrc
fish:
seqkit genautocomplete --shell fish --file ~/.config/fish/completions/seqkit.fish
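The Bash and Zsh snippets above append lines to startup files and are meant to be run only once. A minimal sketch of making such an append safe to repeat is shown here; append_once is a hypothetical helper name, and the demonstration writes to a temporary file rather than a real startup file.

```shell
# Hypothetical helper: append LINE ($1) to FILE ($2) unless an identical
# line is already present. A missing file is simply created.
append_once() {
    local line file
    line=$1
    file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a temporary file: the second call changes nothing.
rc=$(mktemp)
append_once 'autoload -U compinit; compinit' "$rc"
append_once 'autoload -U compinit; compinit' "$rc"
cat "$rc"
rm -f "$rc"
```

In real use the second argument would be ~/.zshrc or ~/.bashrc, with the lines taken from the snippets above.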
Release history
- SeqKit v2.8.2 - 2024-05-17
  - seqkit amplicon:
    - Fix a bug introduced in v2.7.0. When more than one pair of primers is given, only the last one is used. #457
  - seqkit translate:
    - Add option -e/--skip-translate-errors to skip translation errors and output empty sequences. #458
  - seqkit split:
    - Add flag -I/--ignore-case for -i/--by-id. #462
- SeqKit v2.8.1 - 2024-04-07
- SeqKit v2.8.0 - 2024-03-11
  - seqkit stats:
    - Add column N50_num, an alias of L50. #15
  - seqkit seq/locate/fish/watch:
    - Remove the flag -V/--validate-seq-length. Now the whole sequence will be checked if -v/--validate-seq is given.
  - seqkit amplicon:
    - Fix the speed problem introduced in v2.7.0. #439
    - Slightly faster by reusing objects.
  - seqkit seq:
    - Change the threshold sequence length for parallelizing complement sequence computation, 1kb->1Mb.
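For readers unfamiliar with the two statistics mentioned for seqkit stats in v2.8.0: under the usual definition, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of all bases, and L50 (also reported as N50_num) is the number of sequences in that set. The following is a minimal shell sketch of that definition, not SeqKit's implementation; the function name n50_l50 is made up for illustration, and it expects one sequence length per line on standard input.

```shell
# Hypothetical helper: read one sequence length per line and print
# "N50 L50" following the usual definition (not SeqKit's own code).
n50_l50() {
    sort -nr | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                # Stop at the first sequence that brings the running sum
                # to at least half of the total length.
                if (cum * 2 >= total) { print len[i], i; exit }
            }
        }'
}

# Lengths 6 and 5 already cover 11 of 20 bases, so N50 = 5 and L50 = 2.
printf '%s\n' 2 3 4 5 6 | n50_l50
```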
- SeqKit v2.7.0 - 2024-01-31
  - seqkit:
    - Group subcommands in the help message, which is more intuitive for beginners.
  - seqkit grep:
    - New flag: -D/--allow-duplicated-patterns for outputting records multiple times when duplicated patterns are given. #427
  - seqkit subseq:
    - Use the ID regular expression from the option --id-regexp to create the FASTA index file. This fixes the panic that happened for sequences containing tabs in the headers. #432
  - seqkit split/sort/shuffle:
    - When using the two-pass mode (-2/--two-pass), replace possible tabs in the sequence header.
  - seqkit rmdup:
    - Write an empty file of duplicate numbers and lists of IDs even if there are no duplicates when using -D/--dup-num-file. #436
  - seqkit stats:
    - New flag -S/--skip-file-check to skip input file checking when given files or a file list. It's very useful if you run it with millions of files.
- SeqKit v2.6.1 - 2023-11-18
  - seqkit:
    - fix panic of nil pointer introduced in v2.6.0, which happens when handling multiple input files and some of them have file sizes of zero.
  - seqkit seq:
    - fix panic (close of closed channel) when using -v to check sequences.
- SeqKit v2.6.0 - 2023-11-09
  - seqkit:
    - add the shortcut -X for the flag --infile-list.
  - seqkit common:
    - add a new flag -e/--check-embedded-seqs for detecting embedded sequences.
    - for matching by sequences: reduced the memory occupation and corrected numbers in the log. #416
  - seqkit stat:
    - add a new column AvgQual for average quality score. #411
  - seqkit split2:
    - fix the panic for invalid input.
  - seqkit subseq:
    - add a new flag -R/--region-coord for appending coordinates to sequence ID for -r/--region. #413
  - seqkit locate:
    - add a new flag -s/--max-len-to-show to show at most X characters for the search pattern or matched sequences.
  - seqkit seq:
    - change the nucleotide color theme. #412
- SeqKit v2.5.1 - 2023-08-09
- SeqKit v2.5.0 - 2023-07-16
  - new command seqkit merge-slides: merge sliding windows generated from seqkit sliding. #390
  - seqkit stats:
    - added a new flag -N/--N for appending other N50-like stats as new columns. #393
    - added a progress bar for > 1 input files.
    - write the result of each file immediately (no output buffer) when using -T/--tabular.
  - seqkit translate:
    - add options -s/--out-subseqs and -m/--min-len to write ORFs longer than x amino acids as individual records. #389
  - seqkit sum:
    - do not remove possible '*' by default and delete confusing warnings. Thanks to @photocyte. #399
    - added a progress bar for > 1 input files.
  - seqkit pair:
    - remove the restriction of requiring FASTQ format, i.e., FASTA files are also supported.
  - seqkit seq:
    - update help messages. #387
  - seqkit fx2tab:
    - faster alphabet computation (-a/--alphabet) with a new data structure. Thanks to @elliotwutingfeng #388
  - seqkit subseq:
    - accept reverse coordinates in BED/GTF. #392
- SeqKit v2.4.0 - 2023-03-17
  - seqkit:
  - seqkit locate:
    - do not remove embedded regions when searching with regular expressions. #368
  - seqkit amplicon:
    - fix BED coordinates for amplicons found in the minus strand. #367
  - seqkit split:
    - fix forgetting to add extension for --two-pass. #332
  - seqkit stats:
    - fix computing Q1 and Q3 of sequence length for one record. #353
  - seqkit grep:
    - fix count number (-C) for matching with mismatch (-m > 0). #370
  - seqkit replace:
    - add some flags to select a subset of records to edit; these flags are ported from seqkit grep. #348
  - seqkit faidx:
    - allow empty lines at the end of sequences.
  - seqkit faidx/sort/shuffle/split/subseq:
  - seqkit seq:
    - allow filtering sequences of length zero. Thanks to @penglbio.
  - seqkit rename:
    - new flag -s/--separator for setting separator between original ID/name and the counter (default "_"). #360
    - new flag -N/--start-num for setting starting count number for duplicated IDs/names (default 2). #360
    - new flag -1/--rename-1st-rec for renaming the first record as well. #360
    - do not append space if there's no description after the sequence ID.
  - seqkit sliding:
    - new flag -S/--suffix for changing the suffix added to the sequence ID (default: "_sliding").
- SeqKit v2.3.1 - 2022-09-22
- SeqKit v2.3.0 - 2022-08-12
- SeqKit v2.2.0 - 2022-03-14
  - seqkit:
    - add support of xz and zstd input/output formats. #274
    - fix panic when reading records with header of ID + blanks.
  - new command seqkit sum: computing message digest for all sequences in FASTA/Q files. The idea comes from @photocyte and the format borrows from seqhash. #262
  - new command seqkit fa2fq: retrieving corresponding FASTQ records by a FASTA file.
  - seqkit split2:
  - seqkit concat:
  - seqkit locate:
    - parallelizing -F/--use-fmi and -m for large number of search patterns.
  - seqkit amplicon:
    - new flag -M/--output-mismatches to append the total mismatches and mismatches of 5' end and 3' end. #286
  - seqkit grep:
    - detect FASTA/Q symbol @ and > in the searching patterns and show warnings.
    - add new flag -C/--count, like grep -c in GNU grep. #267
  - seqkit range:
    - support removing leading 100 seqs (seqkit range -r 101:-1 == tail -n +101). #279
  - seqkit subseq:
    - report error when no options were given.
  - update doc:
- SeqKit v2.1.0 - 2021-11-15
  - seqkit seq:
    - fix filtering by average quality -Q/-R. #257
  - seqkit convert:
  - seqkit split:
    - fix writing an extra empty file when using --two-pass. #244
  - seqkit subseq:
    - fix --bed which fails to recognize strand.
  - seqkit fq2fa:
    - faster, and do not wrap sequences.
  - seqkit grep/locate/mutate:
    - detect unquoted comma and show warning message, e.g., -p 'A{2,}'. #250
- SeqKit v2.0.0 - 2021-08-27
  - Performance improvements
    - seqkit:
      - faster FASTA/Q reading and writing, especially on FASTQ, see the benchmark.
        - reading (plain text): 4X faster. seqkit stats dataset_C.fq
        - reading (gzip files): 45% faster. seqkit stats dataset_C.fq.gz
        - reading + writing (plain text): 3.5X faster. seqkit grep -p . -v dataset_C.fq -o t
        - reading + writing (gzip files): 2.2X faster. seqkit grep -p . -v dataset_C.fq.gz -o t.gz
      - change default value of -j/--threads from 2 to 4, which is faster for writing gzip files.
    - seqkit seq:
      - fix writing speed, which was slowed down in v0.12.1.
  - Breaking changes
    - seqkit grep/rmdup/common:
      - consider reverse complement sequence by default for comparing by sequence, add flag -P/--only-positive-strand. #215
    - seqkit rename:
      - rename ID only, do not append original header to new ID. #236
    - seqkit fx2tab:
      - for -s/--seq-hash: outputting MD5 instead of hash value (integers) of xxhash. #219
  - Bugfixes
  - New features/enhancements
    - seqkit grep:
      - allow empty pattern files.
    - seqkit faidx:
      - support region with begin > end, i.e., returning reverse complement sequence
      - add new flag -l/--region-file: file containing a list of regions.
    - seqkit fx2tab:
      - new flag -Q/--no-qual for disabling outputting quality even for FASTQ file. #221
    - seqkit amplicon:
      - new flag -u/--save-unmatched for saving records that do not match any primer.
    - seqkit sort:
      - new flag -b/--by-bases for sorting by non-gap bases, for multiple sequence alignment files. #216
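The v2.0.0 breaking change for seqkit grep/rmdup/common means that a sequence and its reverse complement are treated as the same sequence when comparing by sequence, unless -P/--only-positive-strand is given. The following is a minimal shell sketch of what a reverse complement is, for plain A/C/G/T strings only; revcomp is a hypothetical helper for illustration, not SeqKit code.

```shell
# Hypothetical helper: print the reverse complement of a DNA string.
# Only A, C, G and T (either case) are complemented; other characters
# are reversed but left unchanged.
revcomp() {
    printf '%s\n' "$1" |
        tr 'ACGTacgt' 'TGCAtgca' |
        awk '{
            out = ""
            for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
            print out
        }'
}

# ACGTT and AACGT are reverse complements of each other.
revcomp ACGTT
```

Applying the function twice returns the original string, which is why the comparison is symmetric.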
- SeqKit v0.16.1 - 2021-05-20
- SeqKit v0.16.0 - 2021-04-16
  - new command seqkit head-genome:
    - print sequences of the first genome with common prefixes in name
  - seqkit grep/locate/amplicon -m
    - much faster (300-400x) searching with mismatch allowed by optimizing FM-indexing and parallelization.
    - new flag -I/--immediate-output.
  - seqkit grep/locate:
  - seqkit locate:
    - removing debug info for -r introduced in a0f6b6e. #180
  - seqkit amplicon:
    - fix bug of -m, when mismatch is allowed.
  - seqkit fx2tab:
    - new flag -C/--base-count for counting bases. #183
  - seqkit tab2fx:
    - fix a rare bug. #197
  - seqkit subseq:
    - fix bug for BED with empty columns. #195
  - seqkit genautocomplete:
    - support bash|zsh|fish|powershell.
- SeqKit v0.15.0 - 2021-01-12
  - seqkit grep/locate: update help message.
  - seqkit grep: search on both strands when searching by sequence.
  - seqkit split2: fix redundant log when using -s.
  - seqkit bam: new field RightSoftClipSeq. #172
  - seqkit sample -2: remove extra \n. #173
  - seqkit split2 -l: fix bug for splitting by accumulative length; this bug occurs when the first record is longer than -l. No sequences are lost.
- SeqKit v0.14.0 - 2020-10-30
  - new command seqkit pair: match up paired-end reads from two fastq files, faster than fastq-pair.
  - seqkit translate: new flag -F/--append-frame for optionally adding frame info to ID. #159
  - seqkit stats: reduce memory usage when using -a for calculating N50. #153
  - seqkit mutate: fix inserting sequence with -i/--insertion; this bug occurs when the insert site is big in some cases, don't worry if no error was reported.
  - seqkit replace:
    - new flag -U/--keep-untouched: do not change anything when no value found for the key (only for sequence name).
    - do not support editing FASTQ sequence.
  - seqkit grep/locate: new flag --circular for supporting circular genome. #158
- SeqKit v0.13.2 - 2020-07-13
  - seqkit sana: fix bug causing hanging on empty files. #149
- SeqKit v0.13.1 - 2020-07-09
  - seqkit sana: fix bug causing hanging on empty files. #148
- SeqKit v0.13.0 - 2020-07-07
  - seqkit: fix a rare FASTA/Q parser bug. #127
  - seqkit seq: output sequence or quality in single line when -s/--seq or -q/--qual is on. #132
  - seqkit translate: delete debug info, #133, and fix typo. #134
  - seqkit split2: tiny performance improvement. #137
  - seqkit stats: new flag -i/--stdin-label for replacing default "-" for stdin. #139
  - seqkit fx2tab: new flag -s/--seq-hash for printing hash of sequence (case sensitive). #144
  - seqkit amplicon:
  - New features and improvements by @bsipos. #130, #147
    - new command seqkit scat, for real-time robust concatenation of fastx files.
    - Rewrote the parser behind the sana subcommand, now it supports robust parsing of fasta file as well.
    - Added a "toolbox" feature to the bam subcommand (-T), which is a collection of filters acting on streams of BAM records configured through a YAML string (see the docs for more).
    - Added the SEQKIT_THREADS environmental variable to override the default number of threads.
- SeqKit v0.12.1 - 2020-04-21
  - seqkit bam: add colorised and pretty printed output, by @bsipos. #110
  - seqkit locate/grep: fix bug of -m, when query contains letters not in subject sequences. #124
  - seqkit split2: new flag -l/--by-length for splitting into chunks of N bases.
  - seqkit fx2tab:
  - seqkit seq: new flag -k/--color: colorize sequences.
- SeqKit v0.12.0 - 2020-02-18
  - seqkit:
    - fix checking input file existence.
    - new global flag --infile-list for long list of input files, if given, they are appended to files from cli arguments.
  - seqkit faidx: supporting "truncated" (no ending newline character) file.
  - seqkit seq:
    - do not force switching on -g when using -m/-M.
    - show recommendation if flag -t/--seq-type is not DNA/RNA when computing complement sequence. #103
  - seqkit translate: supporting multiple frames. #96
  - seqkit grep/locate:
    - add detection and warning for space existing in search pattern/sequence.
    - speed improvement (2X) for -m/--max-mismatch. shenwei356/bwt/issues/3
  - seqkit locate:
    - new flag -M/--hide-matched for hiding matched sequences. #98
    - new flag -r/--use-regexp for explicitly using regular expression, so improve speed of default index operation. And you have to switch this on if using regexp now. #101
    - new flag -F/--use-fmi for improving search speed for lots of sequence patterns.
  - seqkit rename: making IDs unique across multiple files, and can write into multiple files. #100
  - seqkit sample: fix stdin checking for flag -2. #102
  - seqkit rename/split/split2: fix detection of existed outdir.
  - seqkit split: fix bug of seqkit split -i -2 and parallelize it.
  - seqkit version: checking update is optional (-u).
- SeqKit v0.11.0 - 2019-09-25
  - seqkit: fix hanging when reading from truncated gzip file.
  - new commands:
    - seqkit amplicon: retrieve amplicon (or specific region around it) via primer(s).
  - new commands by @bsipos:
    - seqkit watch: monitoring and online histograms of sequence features.
    - seqkit sana: sanitize broken single line fastq files.
    - seqkit fish: look for short sequences in larger sequences using local alignment.
    - seqkit bam: monitoring and online histograms of BAM record features.
  - seqkit grep/locate: reduce memory occupation when using flag -m/--max-mismatch.
  - seqkit seq: fix panic of computing complement sequence for long sequences containing illegal letters without flag -v on. #84
- SeqKit v0.10.2 - 2019-07-30
  - seqkit: fix bug of parsing sequence ID delimited by tab (\t). #78
  - seqkit grep: better logic of --delete-matched.
  - seqkit common/rmdup/split: use xxhash to replace MD5 when comparing with sequence, discard flag -m/--md5.
  - seqkit stats: new flag -b/--basename for outputting basename instead of full path.
- SeqKit v0.10.1 - 2019-02-27
  - seqkit fx2tab: new option -q/--avg-qual for outputting average read quality. #60
  - seqkit grep/locate: fix support of X when using -d/--degenerate. #61
  - seqkit translate:
    - new flag -M/--init-codon-as-M to translate initial codon at beginning to 'M'. #62
    - translates --- to - for aligned DNA/RNA, flag -X needed. #63
    - supports codons containing ambiguous bases, e.g., GGN->G, ATH->I. #64
    - new flag -l/--list-transl-table to show details of translate table N
    - new flag -L/--list-transl-table-with-amb-codons to show details of translate table N (including ambiguous codons)
  - seqkit split/split2: fix bug of ignoring -O when reading from stdin.
- SeqKit v0.10.0 - 2018-12-24
  - seqkit: report error when input is directory.
  - new command seqkit mutate: edit sequence (point mutation, insertion, deletion).
- SeqKit v0.9.3 - 2018-12-02
  - seqkit stats: fix panic for empty file. #57
  - seqkit translate: add flag -x/--allow-unknown-codon to translate unknown codon to X.
- SeqKit v0.9.2 - 2018-11-16
  - seqkit: stricter checking for value of global flag -t/--seq-type.
  - seqkit sliding: fix bug for flag -g/--greedy. #54
  - seqkit translate: fix bug for frame < 0. #55
  - seqkit seq: add TAB to default blank characters (flag -G/--gap-letters), and fix filter result when using flag -g/--remove-gaps along with -m/--min-len or -M/--max-len
- SeqKit v0.9.1 - 2018-10-12
- SeqKit v0.9.0 - 2018-09-26
  - seqkit: better handling of empty file, no error message shown. #36
  - new subcommand seqkit split2: split sequences into files by size/parts (FASTA, PE/SE FASTQ). #35
  - new subcommand seqkit translate: translate DNA/RNA to protein sequence. #28
  - seqkit sort: fix bug when using -2 -i, and add support for sorting in natural order. #39
  - seqkit grep and seqkit locate: add experimental support of mismatch when searching subsequences. #14
  - seqkit stats: add stats of Q20 and Q30 for FASTQ. #45
- SeqKit v0.8.1 - 2018-06-29
  - seqkit: do not call pigz or gzip for decompressing gzipped file any more. But you can still utilize pigz or gzip by pigz -d -c seqs.fq.gz | seqkit xxx.
  - seqkit subseq: fix bug of missing quality when using --gtf or --bed
  - seqkit stats: parallelize counting files, it's much faster for lots of small files, especially for files on SSD
- SeqKit v0.8.0 - 2018-03-22
  - seqkit, stricter FASTA/Q format requirement, i.e., must start with > or @.
  - seqkit, fix output format for FASTQ files containing zero-length records, yes this happens.
  - seqkit, add amino acid code O (pyrrolysine) and U (selenocysteine).
  - seqkit replace, add flag --nr-width to fill leading 0s for {nr}, useful for preparing sequence submission (">strain_00001 XX", ">strain_00002 XX").
  - seqkit subseq, require BED file to be tab-delimited.
- SeqKit v0.7.2 - 2017-12-03
  - seqkit tab2fx: fix a concurrency bug that occurs with low probability when only 1-column data is provided.
  - seqkit stats: add quartiles of sequence length
  - seqkit faidx: add support for retrieving subsequence using seq ID and region, which is similar to "samtools faidx" but has some extra features
- SeqKit v0.7.1 - 2017-09-22
  - seqkit convert: fix bug of read quality containing only 3 or fewer values. shenwei356/bio/issues/3
  - seqkit stats: add option -T/--tabular to output in machine-friendly tabular format. #23
  - seqkit common: increase speed and decrease memory occupation, and add some notes.
  - fix some typos. #22
  - suggestion: please install pigz to gain better parsing performance for gzipped data.
- SeqKit v0.7.0 - 2017-08-12
  - add new command convert for converting FASTQ quality encoding between Sanger, Solexa and Illumina. Thanks for the suggestion from @cviner (#18).
  - add new command range for printing FASTA/Q records in a range (start:end). #19
  - add new command concate for concatenating sequences with same ID from multiple files.
- SeqKit v0.6.0 - 2017-06-21
- SeqKit v0.5.5 - 2017-05-10
  - Increasing speed of reading .gz file by utilizing gzip (1.3X), it would be much faster if you installed pigz (2X).
  - Fixing colorful output in Windows
  - seqkit locate: add flag --gtf and --bed to output GTF/BED6 format, so the result can be used in seqkit subseq.
  - seqkit subseq: fix bug of --bed, add checking coordinate.
- SeqKit v0.5.4 - 2017-04-11
  - seqkit subseq --gtf, add flag --gtf-tag to set tag that's outputted as sequence comment
  - fix seqkit split and seqkit sample: forget not to wrap sequence and quality in output for FASTQ format
  - compile with go1.8.1
- SeqKit v0.5.3 - 2017-04-01
  - seqkit grep: fix bug when using seqkit grep -r -f patternfile: all records will be retrieved due to failing to discard the blank pattern (""). #11
- SeqKit v0.5.2 - 2017-03-24
  - seqkit stats -a and seqkit seq -g -G: change default gap letters from '- ' to '- .'
  - seqkit subseq: fix bug of range overflow when using -d/--down-stream or -u/--up-stream for retrieving subseq using BED (--bed) or GTF (--gtf) file.
  - seqkit locate: add flag -G/--non-greedy, non-greedy mode, faster but may miss motifs overlapping with others.
- SeqKit v0.5.1 - 2017-03-12
  - seqkit restart: fix bug of flag parsing
- SeqKit v0.5.0 - 2017-03-11
  - new command seqkit restart, for resetting start position for circular genome.
  - seqkit sliding: add flag -g/--greedy, exporting last subsequences even shorter than windows size.
  - seqkit seq:
    - add flag -m/--min-len and -M/--max-len to filter sequences by length.
    - rename flag -G/--gap-letter to -G/--gap-letters.
  - seqkit stat:
    - renamed to seqkit stats, don't worry, old name is still available as an alias.
    - add new flag -a/--all, for all statistics, including sum_gap, N50, and L50.
- SeqKit v0.4.5 - 2017-02-26
  - seqkit seq: fix bug of failing to reverse quality of FASTQ sequence
- SeqKit v0.4.4 - 2017-02-17
  - seqkit locate: fix bug of missing regular-expression motifs containing non-DNA characters (e.g., ACT.{6,7}CGG) from motif file (-f).
  - compiled with go v1.8.
- SeqKit v0.4.3 - 2016-12-22
  - fix bug of seqkit stat: min_len was always 0 in versions v0.4.0, v0.4.1, v0.4.2
- SeqKit v0.4.2 - 2016-12-21
  - fix header information of seqkit subseq when retrieving up- and down-stream sequences using GTF/BED file.
- SeqKit v0.4.1 - 2016-12-16
  - enhancement: remove redundant regions for seqkit locate.
- SeqKit v0.4.0 - 2016-12-07
  - fix bug of seqkit locate, e.g., only find two locations (1-4, 7-10, missing 4-7) of ACGA in ACGACGACGA.
  - better output of seqkit stat for empty file.
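The v0.4.0 seqkit locate fix concerns overlapping matches: a scan that resumes after the end of each hit skips occurrences that start inside the previous one. The following is a minimal shell sketch of an exhaustive scan that advances one position at a time, using the ACGA example from the entry above; find_overlapping is a hypothetical helper and does not reflect how SeqKit searches.

```shell
# Hypothetical helper: report every 1-based start-end position at which
# PATTERN ($1) occurs in SEQUENCE ($2), including overlapping occurrences.
find_overlapping() {
    awk -v pat="$1" -v seq="$2" 'BEGIN {
        n = length(pat)
        # Advance by one base after each comparison, so a hit that begins
        # inside the previous hit is still reported.
        for (i = 1; i + n - 1 <= length(seq); i++)
            if (substr(seq, i, n) == pat) print i "-" (i + n - 1)
    }'
}

# Prints 1-4, 4-7 and 7-10, one per line.
find_overlapping ACGA ACGACGACGA
```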
- SeqKit v0.3.9 - 2016-12-04
  - fix bug of region selection for blank sequences. Affected commands include seqkit subseq --region, seqkit grep --region, seqkit split --by-region.
  - compile with go1.8beta1.
- SeqKit v0.3.8.1 - 2016-11-25
  - enhancement and bugfix of seqkit common: two or more same files allowed, fix log information of number of extracted sequences in the first file.
- SeqKit v0.3.8 - 2016-12-24
  - enhancement of seqkit common: better handling of files containing replicated sequences
- SeqKit v0.3.7 - 2016-12-23
  - fix bug in seqkit split --by-id when sequence ID contains invalid characters for system path.
  - add more flags validation for seqkit replace.
  - enhancement: raise error when key pattern matches multiple targets in cases of replacing with key-value files and more controls are added.
  - changes: do not wrap sequence and quality in output for FASTQ format.
- SeqKit v0.3.6 - 2016-11-03
  - add new feature for seqkit grep: new flag -R (--region) for specifying sequence region for searching.
- SeqKit v0.3.5 - 2016-10-30
  - fix bug of seqkit grep: flag -i (--ignore-case) did not work when not using regular expression
- SeqKit v0.3.4.1 - 2016-09-21
  - improve performance of reading (~10%) and writing (100%) gzip-compressed file by using github.com/klauspost/pgzip package
  - add citation
- SeqKit v0.3.4 - 2016-09-17
  - bugfix: seq wrongly handles only the first sequence file when multiple files are given
  - new feature: fx2tab can output alphabet letters of a sequence by flag -a (--alphabet)
  - new feature: new flag -K (--keep-key) for replace, when replacing with key-value file, one can choose keeping the key as value or not.
- SeqKit v0.3.3 - 2016-08-18
  - fix bug of seqkit replace, wrongly starting from 2 when using {nr} in -r (--replacement)
  - new feature: seqkit replace supports replacement symbols {nr} (record number) and {kv} (corresponding value of the key ($1) by key-value file)
- SeqKit v0.3.2 - 2016-08-13
  - fix bug of seqkit split, error when target file is in a directory.
  - improve performance of seqkit sliding for big sequences, and output last part even if it's shorter than window size, output of FASTQ is also supported.
- SeqKit v0.3.1.1 - 2016-08-07
- compile with go1.7rc5, with higher performance and smaller size of binary file
- SeqKit v0.3.1 - 2016-08-02
  - improve speed of seqkit locate
- SeqKit v0.3.0 - 2016-07-28
  - use fork of github.com/brentp/xopen, using zcat for speedup of .gz file reading on *nix systems.
  - improve speed of parsing sequence ID when creating FASTA index
  - reduce memory usage of seqkit subseq --gtf
  - fix bug of seqkit subseq when using flag --id-ncbi
  - fix bug of seqkit split, outdir error
  - fix bug of seqkit seq -p, last base is wrongly failed to convert when sequence length is odd.
  - add "sum_len" result for output of seqkit stat
- seqkit v0.2.9 - 2016-07-24
  - fix minor bug of seqkit split and seqkit shuffle, header name error due to improper use of pointer
  - add option -O (--out-dir) to seqkit split
- seqkit v0.2.8 - 2016-07-19
  - improve speed of parsing sequence ID, not using regular expression for default --id-regexp
  - improve speed of record outputting for small-size sequences
  - fix minor bug: seqkit seq for blank record
  - update benchmark result
- seqkit v0.2.7 - 2016-07-18
  - reduce memory usage by optimizing the outputting of sequences. Detail: using BufferedByteSliceWrapper to reuse bytes.Buffer.
  - reduce memory usage and improve speed by using custom buffered reading mechanism, instead of using standard library bufio, which is slow for large genome sequence.
  - discard strategy of "buffer" and "chunk" of FASTA/Q records, just parse records one by one.
  - delete global flags -c (--chunk-size) and -b (--buffer-size).
  - add function testing scripts
- seqkit v0.2.6 - 2016-07-01
  - fix bug of seqkit subseq: Inplace subseq method led to wrong result
- seqkit v0.2.5.1
  - fix a bug of seqkit subseq: chromosome name was not converted to lower case when using --gtf or --bed
- seqkit v0.2.5 - 2016-07-01
  - fix a serious bug brought in v0.2.3, using unsafe method to convert string to []byte
  - add awk-like built-in variable of record number ({NR}) for seqkit replace
- seqkit v0.2.4.1 - 2016-06-12
  - fix several bugs from library bio, affected situations:
    - Locating patterns in sequences by pattern FASTA file: seqkit locate -f
    - Reading FASTQ file with record of which the quality starts with +
  - add command version
- seqkit v0.2.4 - 2016-05-31
  - add subcommand head
- seqkit v0.2.3 - 2016-05-08
  - reduce memory occupation by avoiding copying data when converting string to []byte
  - speedup reverse-complement by avoiding repeatedly calling functions
- seqkit v0.2.2 - 2016-05-06
  - reduce memory occupation of subcommands that use FASTA index
- seqkit v0.2.1 - 2016-05-02
  - improve performance of outputting.
  - fix bug of seqkit seq -g for FASTA format
  - some other minor fix of code and docs
  - update benchmark results
- seqkit v0.2.0 - 2016-04-29
  - reduce memory usage of writing output
  - fix bug of subseq, shuffle, sort when reading from stdin
  - reduce memory usage of faidx
  - make validating sequences an optional option in seq command, it saves some time.
- seqkit v0.1.9 - 2016-04-26
  - using custom FASTA index file extension: .seqkit.fai
  - reducing memory usage of sample --number --two-pass
  - change default CPU number to 2 for multi-CPU computers, and 1 for single-CPU computers
- seqkit v0.1.8 - 2016-04-24
  - add subcommand rename to rename duplicated IDs
  - add subcommand faidx to create FASTA index file
  - utilize faidx to improve performance of subseq
  - shuffle, sort and split support two-pass mode (by flag -2) with faidx to reduce memory usage.
  - document update
- seqkit v0.1.7 - 2016-04-21
  - add support for (multi-line) FASTQ format
  - update document, add technical details
  - rename subcommands fa2tab and tab2fa to fx2tab and tab2fx
  - add subcommand fq2fa
  - add column "seq_format" to stat
  - add global flag -b (--buffer-size)
  - little change of flag in subseq and some other commands
- seqkit v0.1.6 - 2016-04-07
  - add subcommand replace
- seqkit v0.1.5.2 - 2016-04-06
  - fix bug of grep, when not using flag -r, flag -i will not take effect.
- seqkit v0.1.5.1
  - fix result of seqkit sample -n
  - fix benchmark script
- seqkit v0.1.5 - 2016-03-29
  - add global flag --id-ncbi
  - add flag -d (--dup-seqs-file) and -D (--dup-num-file) for subcommand rmdup
  - make using MD5 as an optional flag -m (--md5) in subcommand rmdup and common
  - fix file name suffix of seqkit split result
  - minor modification of sliding output
- seqkit v0.1.4.1 - 2016-03-27
  - change alignment of stat output
  - more precise control of the number of CPUs
- seqkit v0.1.4 - 2016-03-25
  - add subcommand sort
  - improve subcommand subseq: supporting of getting subsequences by GTF and BED files
  - change name format of sliding result
  - prettier output of stat
- seqkit v0.1.3.1 - 2016-03-16
  - Performance improvement by reducing time of cleaning spaces
  - Document update
- seqkit v0.1.3 - 2016-03-15
  - Further performance improvement
  - Rename sub command extract to grep
  - Change default value of flag --threads back to the CPU number of current device, change default value of flag --chunk-size back to 10000 sequences.
  - Update benchmark
- seqkit v0.1.2 - 2016-03-14
  - Add flag --dna2rna and --rna2dna to subcommand seq.
- seqkit v0.1.1 - 2016-03-13
  - 5.5X speedup of FASTA file parsing by avoiding using regular expression to remove spaces (detail) and using slice indexing instead of map to validate letters (detail)
  - Change default value of global flag --threads to 1, since most of the subcommands are I/O intensive. For computation-intensive jobs, like extract and locate, you may set a bigger value.
  - Change default value of global flag --chunk-size to 100.
  - Add subcommand stat
  - Fix bug of failing to automatically detect alphabet when only one record in file.
- seqkit v0.1 - 2016-03-11
  - first release of seqkit