Indexing UHGG

Info:

Unified Human Gastrointestinal Genome (UHGG) v2.0.2
Paper: A unified catalog of 204,938 reference genomes from the human gut microbiome
Number of genomes in v2.0.2: 289,232

Tools:

https://github.com/shenwei356/seqkit, for checking sequence files
https://github.com/shenwei356/rush, for running jobs

Data:

# metadata
wget https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.2/genomes-all_metadata.tsv

# gff urls: take column 20, fix the version in the path, and switch ftp to https
sed 1d genomes-all_metadata.tsv | cut -f 20 | sed 's/v2.0/v2.0.2/' | sed -E 's/^ftp/https/' > url.txt
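
To see what this pipeline does to a single record, here is a minimal sketch on a made-up one-row table. The host, path, and accession in the toy URL and the `to_https_urls` helper name are hypothetical; only the column position (20) and the two substitutions are taken from the command above.

```shell
# Hypothetical helper wrapping the same steps as the command above:
# drop the header, keep column 20, fix the version, switch ftp -> https.
to_https_urls() {
    sed 1d "$1" | cut -f 20 | sed 's/v2.0/v2.0.2/' | sed -E 's/^ftp/https/'
}

# Build a toy table: a header line plus one row with 20 tab-separated columns.
tmp=$(mktemp)
{
    seq 1 20 | sed 's/^/col/' | paste -s -d '\t' -
    { seq 1 19 | sed 's/^/x/'; echo 'ftp://example.org/human-gut/v2.0/toy/GENOME1.gff.gz'; } \
        | paste -s -d '\t' -
} > "$tmp"

# Prints the rewritten URL of the single toy record.
to_https_urls "$tmp"
rm -f "$tmp"
```

The toy record comes out as `https://example.org/human-gut/v2.0.2/toy/GENOME1.gff.gz`. On the real table, `wc -l url.txt` should match the number of genomes listed above.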

# download gff files
mkdir -p files; cd files

time cat ../url.txt \
    | rush --eta -v 'dir={///%}/{//%}' \
        'mkdir -p {dir}; curl -s -o {dir}/{%} {}' \
        -c -C download.rush -j 12
cd ..
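
An interrupted transfer may leave a truncated .gz file behind, so it can be worth testing the downloads before extracting sequences. The sketch below is not part of the original workflow; it uses `gzip -t` on two toy files, and the `check_gz` helper name is hypothetical.

```shell
# Hypothetical helper: report whether a gzip file decompresses cleanly.
check_gz() {
    if gzip -t "$1" 2>/dev/null; then
        echo "OK $1"
    else
        echo "BAD $1"
    fi
}

# Toy demonstration: one intact file and one deliberately truncated copy.
demo=$(mktemp -d)
printf '##gff-version 3\n' | gzip > "$demo/good.gff.gz"
head -c 10 "$demo/good.gff.gz" > "$demo/bad.gff.gz"

check_gz "$demo/good.gff.gz"
check_gz "$demo/bad.gff.gz"
rm -rf "$demo"
```

On the real data, the same test could be run over every file, for example with `find files/ -name "*.gff.gz" -exec gzip -t {} \;`, which reports only the damaged files.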

# extract sequences from gff files (everything after the ##FASTA line)
find files/ -name "*.gff.gz" \
    | rush --eta \
        'zcat {} | perl -ne "print if \$s; \$s=1 if /^##FASTA/" | seqkit seq -w 0 -o {/}/{%:}.fna.gz' \
        -c -C extract.rush

Indexing. On a 48-CPU machine, this took 3 h, used 41 GB of RAM, and produced a 426 GB index. If you don't have enough memory, decrease the value of -b.

lexicmap index \
    -I files/ \
    -O uhgg.lmi --log uhgg.lmi.log \
    -b 5000
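
As a rough illustration of what lowering -b changes, the sketch below counts how many batches the 289,232 genomes would be split into. It assumes that -b sets the maximum number of genomes per batch, which should be checked against `lexicmap index --help`; the `batches` helper name is hypothetical.

```shell
# Hypothetical helper: ceiling division of the genome count by the batch size.
batches() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

batches 289232 5000   # batch size used above
batches 289232 2500   # a smaller batch size, e.g. for a machine with less RAM
```

Under that assumption, the two calls print 58 and 116, so halving -b doubles the number of batches.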

File sizes:

$ du -sh files/ uhgg.lmi
658G    files/
509G    uhgg.lmi

$ du -sh files/ uhgg.lmi --apparent-size
425G    files/
426G    uhgg.lmi

$ dirsize uhgg.lmi
uhgg.lmi: 425.15 GiB (456,497,171,291)
243.47 GiB      seeds
181.67 GiB      genomes
  6.34 MiB      genomes.map.bin
312.53 KiB      masks.bin
     330 B      info.toml
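
The listings above mix units: plain `du -sh` reports disk usage, `--apparent-size` reports the summed file sizes, and dirsize prints GiB next to the raw byte count. The following sketch converts that byte count to GiB (2^30 bytes) and GB (10^9 bytes); the `to_gib` and `to_gb` helper names are hypothetical.

```shell
# Hypothetical helpers: convert a byte count to GiB (2^30) or GB (10^9).
to_gib() { awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'; }
to_gb()  { awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1e9 }'; }

to_gib 456497171291
to_gb 456497171291
```

This prints 425.15 and 456.50. GNU du rounds up in -h mode, which likely explains why it shows 426G for the same directory.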