LexicMap: efficient sequence alignment against millions of prokaryotic genomes​
GitHub Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Toggle Dark/Light/Auto mode Back to homepage

Indexing GenBank+RefSeq

Make sure you have enough disk space, >10 TB is preferred.

Tools:

Data:

time genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" \
    -f "genomic.fna.gz" -o "genbank" -M "ncbi" -t 12 -m -L curl

cd genbank/2024-02-15_11-00-51/


# ----------------- check the file integrity -----------------

genomes=files

# corrupted files
# find $genomes -name "*.gz" \
fd ".gz$" $genomes \
    | rush --eta 'seqkit seq -w 0 {} > /dev/null; if [ $? -ne 0 ]; then echo {}; fi' \
    > failed.txt

# empty files
find $genomes -name "*.gz" -size 0 >> failed.txt

# delete these files
cat failed.txt | rush '/bin/rm {}'

# redownload them:
# run the genome_updater command again, with the flag -i

Indexing. On a 48-CPU machine, time: 54 h, ram: 178 GB, index size: 4.94 TB. If you don’t have enough memory, please decrease the value of -b.

lexicmap index \
    -I files/ \
    --ref-name-regexp '^(\w{3}_\d{9}\.\d+)' \
    -O genbank_refseq.lmi --log genbank_refseq.lmi.log \
    -b 25000