Indexing AllTheBacteria
- Launch an EC2 instance in the Europe (London) region (eu-west-2), where the index is located.
  - OS: Amazon Linux 2023 64-bit (Arm)
  - Instance type (you might need to request a higher vCPU limit):
    - c7g.8xlarge (32 vCPUs, 64 GiB memory, 15 Gigabit network, 1.3738 USD per hour)
    - c6gn.12xlarge (48 vCPUs, 96 GiB memory, 75 Gigabit network, 2.46 USD per hour) (recommended)
  - Storage: 20 GiB General Purpose SSD (gp3), only for storing queries and results.
- Mount the LexicMap index with mount-s3 (it's fast, but still slower than local disks):

```
# install mount-s3. You might need to replace arm64 with x86_64 for other architectures
wget https://s3.amazonaws.com/mountpoint-s3-release/latest/arm64/mount-s3.rpm
sudo yum install -y ./mount-s3.rpm
rm ./mount-s3.rpm

# mount
# (optional debugging flags: --log-directory log --debug --log-metrics)
mkdir -p atb.lmi log

UNSTABLE_MOUNTPOINT_MAX_PREFETCH_WINDOW_SIZE=65536 \
    mount-s3 --read-only --prefix 202408/ allthebacteria-lexicmap atb.lmi --no-sign-request
```
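To confirm the index is mounted, list the mount point; the entries should match the S3 listing shown further down this page (a quick check, not part of the original steps):

```
# the top-level index files should be visible through the mount
ls atb.lmi
# genomes/  seeds/  genomes.chunks.bin  genomes.map.bin  info.toml  masks.bin
```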
- Install LexicMap.

```
# The binary's path depends on the architecture of the CPUs: amd64 or arm64.
# Please check the latest version here: https://github.com/shenwei356/LexicMap/releases
# or the pre-release here: https://github.com/shenwei356/LexicMap/issues/10
wget https://github.com/shenwei356/LexicMap/releases/download/v0.7.0/lexicmap_linux_arm64.tar.gz
mkdir -p bin
tar -zxvf lexicmap_linux_arm64.tar.gz -C bin
rm lexicmap_linux_arm64.tar.gz
```
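The tarball only unpacks the binary into `bin/`; you may also want it on your `PATH`. A minimal sketch, assuming a default shell:

```
# make the binary available in the current session and verify it runs
export PATH="$PWD/bin:$PATH"
lexicmap version
```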
- Download a query file (a SecY gene sequence of *Enterococcus faecalis* from the LexicMap demo data).

```
wget https://github.com/shenwei356/LexicMap/raw/refs/heads/main/demo/bench/b.gene_E_faecalis_SecY.fasta
```
- Run LexicMap.

```
# create and enter a screen session
screen -S lexicmap

# run
# It takes 20 minutes with c7g.8xlarge and 12.5 minutes with c6gn.12xlarge.
# b.gene_E_coli_16S.fasta takes 1h54m with c6gn.12xlarge.
lexicmap search -d atb.lmi b.gene_E_faecalis_SecY.fasta -o t.txt --debug
```
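Since the search may run for minutes to hours, you can detach from the screen session and come back later; a small usage sketch (standard screen behavior, not part of the original steps):

```
# detach with Ctrl-A then D; reattach later with:
screen -r lexicmap

# peek at the results after the search finishes
head -n 5 t.txt
```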
- Unmount the index.

```
sudo umount atb.lmi
```
To download the index to local storage instead of mounting it:

- Install awscli.

```
conda install -c conda-forge awscli
```
- Test access.

```
aws s3 ls s3://allthebacteria-lexicmap/202408/ --no-sign-request

# output
                           PRE genomes/
                           PRE seeds/
2025-04-08 16:39:17      62488 genomes.chunks.bin
2025-04-08 16:39:17   54209660 genomes.map.bin
2025-04-08 22:32:35        619 info.toml
2025-04-08 22:32:36     160032 masks.bin
```
- Download the index (it's 5.24 TiB!!!). Make sure you have enough disk space: at least 8 TB, and >10 TB is preferred.

```
aws s3 cp s3://allthebacteria-lexicmap/202408/ atb.lmi --recursive --no-sign-request

# dirsize atb.lmi
atb.lmi: 5.24 TiB (5,758,875,365,595)
  2.87 TiB      seeds
  2.37 TiB      genomes
 51.70 MiB      genomes.map.bin
156.28 KiB      masks.bin
 61.02 KiB      genomes.chunks.bin
     619 B      info.toml
```
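A transfer of this size can easily be interrupted. `aws s3 sync` only copies files that are missing or changed, so rerunning it resumes the download; this is standard AWS CLI behavior, not a step from the original guide:

```
# resume an interrupted download; files already present in atb.lmi are skipped
aws s3 sync s3://allthebacteria-lexicmap/202408/ atb.lmi --no-sign-request
```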
Tools:
- https://github.com/shenwei356/rush, for running jobs
Info:
- AllTheBacteria: all WGS isolate bacterial INSDC data up to June 2023, uniformly assembled, QC-ed, annotated, and searchable.
- Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable
- Data on OSF: https://osf.io/xv7q9/
After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored on OSF.

- Downloading the list file of all assemblies in the latest version (v0.2 plus incremental versions).

```
mkdir -p atb; cd atb;

# attention, the URL might change, please check it in the browser.
wget https://osf.io/download/4yv85/ -O file_list.all.latest.tsv.gz
```

If you only need to add assemblies from an incremental version, please manually download the file list in the path `AllTheBacteria/Assembly/OSF Storage/File_lists`.
- Downloading assembly tarball files.

```
# tarball file names and their URLs
zcat file_list.all.latest.tsv.gz | awk -F'\t' 'NR>1 {print $5"\t"$6}' | uniq > tar2url.tsv

# download
cat tar2url.tsv | rush --eta -j 4 -c -C download.rush 'wget -O {1} {2}'
```
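As a quick sanity check (not in the original steps), the number of downloaded tarballs should match the number of lines in tar2url.tsv:

```
# compare the URL list against the downloaded tarballs
wc -l tar2url.tsv
ls *.tar.xz | wc -l
```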
- Decompressing all tarballs. The decompressed genomes are stored in plain text, so we use gzip (which can be replaced with the faster pigz; see the sketch after this step) to compress them to save disk space.

```
# {^.tar.xz} is for removing the suffix ".tar.xz"
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.tar.xz}/*.fa'

cd ..
```

After that, the assemblies directory will have multiple subdirectories. When you give the directory to `lexicmap index -I`, it recursively scans genome files (plain or gz/xz/zstd-compressed). You can also give a file list with selected assemblies.

```
$ tree atb | more
atb
├── atb.assembly.r0.2.batch.1
│   ├── SAMD00013333.fa.gz
│   ├── SAMD00049594.fa.gz
│   ├── SAMD00195911.fa.gz
│   ├── SAMD00195914.fa.gz
```
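The step above mentions pigz as a faster replacement for gzip; a sketch of the same command using it (assuming pigz is installed):

```
# pigz compresses each .fa file with multiple threads; -f matches gzip's behavior
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; pigz -f {^.tar.xz}/*.fa'
```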
- Prepare a file list of assemblies.

  - Just use `find` or `fd` (much faster).

```
# find
find atb/ -name "*.fa.gz" > files.txt

# fd
fd .fa.gz$ atb/ > files.txt
```

What it looks like:

```
$ head -n 2 files.txt
atb/atb.assembly.r0.2.batch.1/SAMD00013333.fa.gz
atb/atb.assembly.r0.2.batch.1/SAMD00049594.fa.gz
```
  - (Optional) Only keep high-quality assemblies. Please click this link to download the `hq_set.sample_list.txt.gz` file, or get it from this page.

```
find atb/ -name "*.fa.gz" | grep -w -f <(zcat hq_set.sample_list.txt.gz) > files.txt
```
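A quick count (not in the original steps) verifies the filtering: the number of retained files should not exceed the number of high-quality samples:

```
# retained assemblies vs. high-quality sample list
wc -l files.txt
zcat hq_set.sample_list.txt.gz | wc -l
```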
- Creating a LexicMap index. (More details: https://bioinf.shenwei.me/LexicMap/tutorials/index/)

```
lexicmap index -S -X files.txt -O atb.lmi -b 25000 --log atb.lmi.log

# dirsize atb.lmi
atb.lmi: 5.24 TiB (5,758,698,088,389)
  2.87 TiB      seeds
  2.37 TiB      genomes
 51.70 MiB      genomes.map.bin
156.28 KiB      masks.bin
 61.02 KiB      genomes.chunks.bin
     619 B      info.toml
```

It took 47h40m and 145 GB RAM with 48 CPUs for 2.44 million ATB genomes.
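After indexing finishes, a minimal functional check is to search the new index with the demo query used in the EC2 section above (a sketch, not an original step):

```
# the search should return hits for this conserved gene
lexicmap search -d atb.lmi b.gene_E_faecalis_SecY.fasta -o t.txt
```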
- (Optional) Prepare taxonomy data to limit TaxIds in `lexicmap search` (available since LexicMap v0.7.1).

```
# Download the species_calls.tsv.gz file in the directory (Latest_2024-08) of this page:
#   https://osf.io/h7wzy/files/osfstorage#
wget https://osf.io/download/7t9qd/ -O species_calls.tsv.gz

# Download gtdb-taxdump files of version r214, which was used in the
# taxonomic classification of AllTheBacteria v0.2 and incremental 202408,
# from here: https://github.com/shenwei356/gtdb-taxdump/releases/tag/v0.4.0
wget https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.4.0/gtdb-taxdump.tar.gz
tar -zxvf gtdb-taxdump.tar.gz
mv gtdb-taxdump/R214 taxdump

# Prepare a file mapping assembly accessions to TaxIds,
# using TaxonKit: https://github.com/shenwei356/taxonkit
zcat species_calls.tsv.gz | sed 1d | cut -f 1,2 \
    | taxonkit name2taxid --data-dir taxdump/ -i 2 \
    | cut -f 1,3 \
    > taxid.map
```
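taxonkit name2taxid leaves the TaxId column empty for names it cannot resolve, so it may be worth counting unmapped assemblies (a sketch, not in the original steps):

```
# count assemblies whose species name was not found in the GTDB taxdump
awk -F'\t' '$2==""' taxid.map | wc -l
```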
Alternatively, for AllTheBacteria v0.2 only:

- Downloading assembly tarballs (except those starting with `unknown__`) to a directory (like `atb`), from https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/

```
mkdir -p atb; cd atb;

# assembly file list, 650 files in total
wget https://bioinf.shenwei.me/LexicMap/AllTheBacteria-v0.2.url.txt

# download
#   rush is used: https://github.com/shenwei356/rush
#   The download.rush file stores finished jobs, which will be skipped in a second run for resuming jobs.
cat AllTheBacteria-v0.2.url.txt | rush --eta -j 2 -c -C download.rush 'wget {}'

# list of high-quality samples
wget https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/metadata/hq_set.sample_list.txt.gz
```
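Since the URL list contains 650 entries, a simple count (not in the original steps) confirms the download is complete:

```
# expect 650 tarballs
ls *.tar.xz | wc -l
```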
- Decompressing all tarballs. The decompressed genomes are stored in plain text, so we use gzip (which can be replaced with the faster pigz) to compress them to save disk space.

```
# {^.asm.tar.xz} is for removing the suffix ".asm.tar.xz"
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.asm.tar.xz}/*.fa'

cd ..
```

After that, the assemblies directory will have multiple subdirectories. When you give the directory to `lexicmap index -I`, it recursively scans genome files (plain or gz/xz/zstd-compressed). You can also give a file list with selected assemblies.

```
$ tree atb | more
atb
├── achromobacter_xylosoxidans__01
│   ├── SAMD00013333.fa.gz
│   ├── SAMD00049594.fa.gz
│   ├── SAMD00195911.fa.gz
│   ├── SAMD00195914.fa.gz

# disk usage
$ du -sh atb
2.9T    atb
$ du -sh atb --apparent-size
2.1T    atb
```
- Creating a LexicMap index. (More details: https://bioinf.shenwei.me/LexicMap/tutorials/index/)

```
# file paths of all samples
find atb/ -name "*.fa.gz" > atb_all.txt
# wc -l atb_all.txt
# 1876015 atb_all.txt

# file paths of high-quality samples
grep -w -f <(zcat atb/hq_set.sample_list.txt.gz) atb_all.txt > atb_hq.txt
# wc -l atb_hq.txt
# 1858610 atb_hq.txt

# index
lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log
```

For 1,858,610 HQ genomes on a 48-CPU machine: time 48 h, RAM 85 GB, index size 3.88 TiB. If you don't have enough memory, please decrease the value of `-b`.

```
# disk usage
$ du -sh atb_hq.lmi
4.6T    atb_hq.lmi
$ du -sh atb_hq.lmi --apparent-size
3.9T    atb_hq.lmi

$ dirsize atb_hq.lmi
atb_hq.lmi: 3.88 TiB (4,261,437,129,065)
  2.11 TiB      seeds
  1.77 TiB      genomes
 39.22 MiB      genomes.map.bin
312.53 KiB      masks.bin
     332 B      info.toml
```

Note that a tmp directory is created during indexing. In the tmp directory, the seed data are bigger than the final size of the `seeds` directory; the genome files, however, are simply moved into the final index.