Indexing AllTheBacteria
Make sure you have enough disk space: at least 8 TB, and more than 10 TB is preferred.
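For example, a quick check of the free space on the target filesystem before starting (run from the directory where the data and the index will live):

```
# show free space on the filesystem holding the current directory
df -h .
```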
Tools:
- https://github.com/shenwei356/rush, for running jobs
Info:
- AllTheBacteria: all WGS isolate bacterial INSDC data up to June 2023, uniformly assembled, QC-ed, annotated, and searchable.
- Preprint: AllTheBacteria - all bacterial genomes assembled, available and searchable
- Data on OSF: https://osf.io/xv7q9/
After v0.2, AllTheBacteria releases incremental datasets periodically, with all data stored at OSF.
1. Downloading the file list of all assemblies in the latest version (v0.2 plus incremental versions).

```
mkdir -p atb; cd atb;

# attention: the URL might change, please check it in the browser
wget https://osf.io/download/4yv85/ -O file_list.all.latest.tsv.gz
```
If you only need to add assemblies from an incremental version, please manually download the file list from the path `AllTheBacteria/Assembly/OSF Storage/File_lists` on OSF.

2. Downloading assembly tarball files.

```
# tarball file names and their URLs
zcat file_list.all.latest.tsv.gz | awk 'NR>1 {print $3"\t"$4}' | uniq > tar2url.tsv

# download
cat tar2url.tsv | rush --eta -j 2 -c -C download.rush 'wget -O {1} {2}'
```
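After the downloads finish, a quick sanity check is to compare the number of downloaded tarballs against the number of entries in `tar2url.tsv` (a minimal sketch; the two counts should match):

```
# number of expected tarballs
wc -l < tar2url.tsv

# number of downloaded tarballs
ls *.tar.xz | wc -l
```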
3. Decompressing all tarballs. The decompressed genomes are stored in plain text, so we use `gzip` (which can be replaced with the faster `pigz`) to compress them to save disk space.

```
# {^.tar.xz} is for removing the suffix ".tar.xz"
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^.tar.xz}/*.fa'

cd ..
```
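If `pigz` is installed, the compression can be parallelized; this sketch only swaps `gzip` for `pigz`, leaving the rest of the command unchanged:

```
# same extraction step, with parallel compression via pigz
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; pigz -f {^.tar.xz}/*.fa'
```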
After that, the assemblies directory would have multiple subdirectories. When you give the directory to `lexicmap index -I`, it can recursively scan (plain or gz/xz/zstd-compressed) genome files. You can also give a file list with selected assemblies.

```
$ tree atb | more
atb
├── atb.assembly.r0.2.batch.1
│   ├── SAMD00013333.fa.gz
│   ├── SAMD00049594.fa.gz
│   ├── SAMD00195911.fa.gz
│   ├── SAMD00195914.fa.gz
```
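Since the directory can be passed directly, an index can also be built without a file list. This variant is a sketch that reuses the flags from the indexing command in step 6 below, swapping `-X` for `-I`:

```
# recursively scan atb/ for genome files instead of reading a file list
lexicmap index -S -I atb/ -O atb.lmi -b 25000 --log atb.lmi.log
```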
4. Preparing a file list of assemblies. Just use `find` or `fd` (much faster).

```
# find
find atb/ -name "*.fa.gz" > files.txt

# fd
fd .fa.gz$ atb/ > files.txt
```

What it looks like:

```
$ head -n 2 files.txt
atb/atb.assembly.r0.2.batch.1/SAMD00013333.fa.gz
atb/atb.assembly.r0.2.batch.1/SAMD00049594.fa.gz
```
5. (Optional) Only keeping high-quality assemblies. Please manually download the `hq_set.sample_list.txt.gz` file from OSF, e.g., from the path `AllTheBacteria/Metadata/OSF Storage/Aggregated/Latest_2024-08/` (choose the latest date).

```
find atb/ -name "*.fa.gz" | grep -w -f <(zcat hq_set.sample_list.txt.gz) > files.txt
```
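To verify the filtering, compare the number of kept paths with the size of the HQ sample list (a minimal check; the counts may differ slightly if some HQ samples are absent from your download):

```
# number of samples in the HQ list
zcat hq_set.sample_list.txt.gz | wc -l

# number of assembly paths kept after filtering
wc -l < files.txt
```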
6. Creating a LexicMap index (more details: https://bioinf.shenwei.me/LexicMap/tutorials/index/).

```
lexicmap index -S -X files.txt -O atb.lmi -b 25000 --log atb.lmi.log
```
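Indexing at this scale runs for a long time; following the log file written via `--log` is a simple way to monitor progress (a sketch):

```
# follow the indexing log in another terminal
tail -f atb.lmi.log
```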
For the v0.2 release alone, the data are hosted on the EBI FTP instead:

1. Downloading assembly tarballs (except those starting with `unknown__`) to a directory (like `atb`) from https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/

```
mkdir -p atb; cd atb;

# assembly file list, 650 files in total
wget https://bioinf.shenwei.me/LexicMap/AllTheBacteria-v0.2.url.txt

# download
# rush is used: https://github.com/shenwei356/rush
# The download.rush file stores finished jobs, which will be skipped in a second run for resuming jobs.
cat AllTheBacteria-v0.2.url.txt | rush --eta -j 2 -c -C download.rush 'wget {}'

# list of high-quality samples
wget https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/metadata/hq_set.sample_list.txt.gz
```
2. Decompressing all tarballs. The decompressed genomes are stored in plain text, so we use `gzip` (which can be replaced with the faster `pigz`) to compress them to save disk space.

```
# {^asm.tar.xz} is for removing the suffix "asm.tar.xz"
ls *.tar.xz | rush --eta -c -C decompress.rush 'tar -Jxf {}; gzip -f {^asm.tar.xz}/*.fa'

cd ..
```
After that, the assemblies directory would have multiple subdirectories. When you give the directory to `lexicmap index -I`, it can recursively scan (plain or gz/xz/zstd-compressed) genome files. You can also give a file list with selected assemblies.

```
$ tree atb | more
atb
├── achromobacter_xylosoxidans__01
│   ├── SAMD00013333.fa.gz
│   ├── SAMD00049594.fa.gz
│   ├── SAMD00195911.fa.gz
│   ├── SAMD00195914.fa.gz

# disk usage
$ du -sh atb
2.9T    atb
$ du -sh atb --apparent-size
2.1T    atb
```
3. Creating a LexicMap index (more details: https://bioinf.shenwei.me/LexicMap/tutorials/index/).

```
# file paths of all samples
find atb/ -name "*.fa.gz" > atb_all.txt
# wc -l atb_all.txt
# 1876015 atb_all.txt

# file paths of high-quality samples
grep -w -f <(zcat atb/hq_set.sample_list.txt.gz) atb_all.txt > atb_hq.txt
# wc -l atb_hq.txt
# 1858610 atb_hq.txt

# index
lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 25000 --log atb_hq.lmi.log
```
For 1,858,610 HQ genomes on a 48-CPU machine: time 48 h, RAM 85 GB, index size 3.88 TB. If you don't have enough memory, please decrease the value of `-b`.

```
# disk usage
$ du -sh atb_hq.lmi
4.6T    atb_hq.lmi
$ du -sh atb_hq.lmi --apparent-size
3.9T    atb_hq.lmi

$ dirsize atb_hq.lmi
atb_hq.lmi: 3.88 TiB (4,261,437,129,065)
  2.11 TiB      seeds
  1.77 TiB      genomes
 39.22 MiB      genomes.map.bin
312.53 KiB      masks.bin
     332 B      info.toml
```
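For example, on a machine with less memory, the same command could be run with a smaller batch size (a sketch; 10,000 is an illustrative value, smaller batches lower peak RAM at the cost of more batches):

```
# smaller genome batches -> lower peak memory
lexicmap index -S -X atb_hq.txt -O atb_hq.lmi -b 10000 --log atb_hq.lmi.log
```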
Note that a tmp directory `atb_hq.lmi` is created during indexing. In the tmp directory, the seed data would be bigger than the final size of the `seeds` directory; the genome files, however, are simply moved to the final index.
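Since the temporary seed data can exceed the final size of `seeds`, it is worth keeping an eye on disk usage during the run (a sketch using standard tools):

```
# report the size of the growing index directory every 10 minutes
watch -n 600 du -sh atb_hq.lmi
```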