This repository contains a Bash script (script1.sh) for processing metagenomic paired-end sequencing data. The pipeline performs quality control, host decontamination, sequence assembly, gene prediction, and abundance quantification, generating organized results for each sample.
- Quality Control: Uses `fastp` to trim and filter low-quality reads.
- Host Decontamination: Removes host sequences using `bowtie2` with a mouse genome index.
- Sequence Assembly: Assembles contigs using `megahit`.
- Sequence Statistics and Filtering: Generates contig statistics and filters contigs (>500 bp) with `seqkit`.
- Gene Prediction: Predicts genes on the filtered contigs using `prodigal`.
- Abundance Quantification: Quantifies contig abundance using `salmon`.
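Under assumed defaults, the per-sample flow looks roughly like the following sketch. File names follow the output layout described below; the exact flags in script1.sh may differ, and the index path is a placeholder:

```bash
#!/usr/bin/env bash
set -euo pipefail

S=sample1                      # sample name (illustrative)
IDX=/path/to/bowtie2/mouse     # host Bowtie2 index prefix (placeholder)
OUT=${S}_results && mkdir -p "$OUT"

# 1. Quality control with fastp
fastp -i ${S}/${S}_1.fastq.gz -I ${S}/${S}_2.fastq.gz \
      -o $OUT/${S}_clean_1.fastq.gz -O $OUT/${S}_clean_2.fastq.gz \
      -h $OUT/${S}_fastp.html -j $OUT/${S}_fastp.json -w 4

# 2. Host decontamination: keep read pairs that do NOT align to the host;
#    the % in --un-conc is replaced by 1 and 2 for the two mates
bowtie2 -p 4 -x "$IDX" \
        -1 $OUT/${S}_clean_1.fastq.gz -2 $OUT/${S}_clean_2.fastq.gz \
        -S $OUT/${S}_host.sam \
        --un-conc $OUT/${S}_host_removed_%.fastq 2> $OUT/${S}_bowtie2.log

# 3. Assembly with megahit (the -o directory must not already exist)
megahit -t 8 -1 $OUT/${S}_host_removed_1.fastq -2 $OUT/${S}_host_removed_2.fastq \
        -o $OUT/${S}_megahit --out-prefix $S

# 4. Contig statistics and >500 bp filtering with seqkit
seqkit stats $OUT/${S}_megahit/${S}.contigs.fa > $OUT/${S}_megahit/contigs_stats.txt
seqkit seq -m 500 $OUT/${S}_megahit/${S}.contigs.fa > $OUT/${S}_megahit/contigs_500.fa

# 5. Gene prediction with prodigal in metagenome mode
prodigal -p meta -i $OUT/${S}_megahit/contigs_500.fa \
         -f gff -o $OUT/${S}_megahit/${S}.prodigal \
         -d $OUT/${S}_megahit/contigs_500.fna

# 6. Abundance quantification with salmon
salmon index -t $OUT/${S}_megahit/contigs_500.fa -i $OUT/${S}_megahit/salmon_index
salmon quant -i $OUT/${S}_megahit/salmon_index -l A --validateMappings -p 8 \
       -1 $OUT/${S}_host_removed_1.fastq -2 $OUT/${S}_host_removed_2.fastq \
       -o $OUT/${S}_megahit/salmon
```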
Ensure the following tools are installed and accessible in your PATH:
- fastp (>= 0.20.0)
- bowtie2 (>= 2.3.5)
- megahit (>= 1.2.9)
- seqkit (>= 0.12.0)
- prodigal (>= 2.6.3)
- salmon (>= 1.4.0)
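A quick way to confirm everything is on PATH before running; `check_tools` is a generic helper, not part of script1.sh:

```shell
#!/usr/bin/env bash
# Print any required tool missing from PATH; return non-zero if any is missing.
check_tools() {
    local missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    echo "$missing"
    [ -z "$missing" ]
}

check_tools fastp bowtie2 megahit seqkit prodigal salmon \
    || echo "Install the tools listed above before running script1.sh"
```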
Additionally, you need:
A Bowtie2 index for the host genome (e.g., mouse genome, specified in the script as /mnt/d/Datas/Metagenomics/Base/bowtie2/mouse).
To build a human index (GRCh38):
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/GCA_000001405.15_GRCh38_genomic.fna.gz
gunzip GCA_000001405.15_GRCh38_genomic.fna.gz
bowtie2-build GCA_000001405.15_GRCh38_genomic.fna human
To build a mouse index (GRCm39):
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39/GCA_000001635.9_GRCm39_genomic.fna.gz
gunzip GCA_000001635.9_GRCm39_genomic.fna.gz
bowtie2-build GCA_000001635.9_GRCm39_genomic.fna mouse
# From UCSC
wget -c https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
gunzip mm39.fa.gz
# From Ensembl
wget -c https://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
# After downloading from either mirror, build the index with bowtie2-build as above
See the Bowtie2 manual (https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) for more details on building and using indexes.
Input data in paired-end FASTQ format (gzip-compressed).
The script expects input data organized in a directory (total_folder) with subdirectories for each sample:
total_folder/
├── sample1/
│ ├── sample1_1.fastq.gz
│ ├── sample1_2.fastq.gz
├── sample2/
│ ├── sample2_1.fastq.gz
│ ├── sample2_2.fastq.gz
...
Each sample directory (sampleX) contains paired-end FASTQ files named sampleX_1.fastq.gz and sampleX_2.fastq.gz.
The script automatically detects sample directories matching the pattern sample*.
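The detection logic is essentially a glob over the input directory. A sketch of the pattern (`find_samples` is an illustrative helper, not a function from script1.sh):

```shell
#!/usr/bin/env bash
# Enumerate sample directories and verify the expected paired FASTQ files.
find_samples() {
    local input_dir=$1
    local sample_dir sample r1 r2
    for sample_dir in "$input_dir"/sample*/; do
        [ -d "$sample_dir" ] || continue          # glob matched nothing
        sample=$(basename "$sample_dir")
        r1="$sample_dir${sample}_1.fastq.gz"
        r2="$sample_dir${sample}_2.fastq.gz"
        if [ ! -f "$r1" ] || [ ! -f "$r2" ]; then
            echo "WARNING: $sample is missing paired FASTQ files, skipping" >&2
            continue
        fi
        echo "Found sample: $sample"
    done
}

# Example: find_samples /path/to/total_folder
```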
For each sample, the script creates a result directory (sampleX_results) with the following structure:
sampleX_results/
├── sampleX_clean_1.fastq.gz # Quality-controlled FASTQ (R1)
├── sampleX_clean_2.fastq.gz # Quality-controlled FASTQ (R2)
├── sampleX_fastp.html # Fastp quality report (HTML)
├── sampleX_fastp.json # Fastp quality report (JSON)
├── sampleX_host.sam # Bowtie2 alignment (SAM)
├── sampleX_host_removed_1.fastq # Host-decontaminated FASTQ (R1)
├── sampleX_host_removed_2.fastq # Host-decontaminated FASTQ (R2)
├── sampleX_bowtie2.log # Bowtie2 log
├── sampleX_megahit/ # Megahit assembly results
│ ├── sampleX.contigs.fa # Assembled contigs
│ ├── contigs_stats.txt # Seqkit contig statistics
│ ├── contigs_500.fa # Filtered contigs (>500 bp)
│ ├── sampleX.prodigal # Prodigal gene predictions (GFF)
│ ├── contigs_500.fna # Prodigal nucleotide sequences
│ ├── salmon_index/ # Salmon index
│ ├── salmon/ # Salmon quantification results
│ │ ├── quant.sf # Abundance quantification
│ │ ├── logs/ # Salmon logs
- Clone the repository:
  git clone https://github.com/your-username/your-repo-name.git
  cd your-repo-name
- Ensure all required tools are installed (see Prerequisites).
- Verify the host Bowtie2 index path in the script (the HOST_INDEX variable) and update it to your local path if necessary:
  HOST_INDEX="/path/to/your/bowtie2/mouse"
- Update the input directory path in the script (the INPUT_DIR variable) to your data directory:
  INPUT_DIR="/path/to/total_folder"
Make the script executable:
chmod +x script1.sh
Run the script:
./script1.sh
The script will process each sample in INPUT_DIR, creating result directories (sampleX_results) in the current working directory.
Assume your data is in /data/metagenomics/total_folder with two samples:
/data/metagenomics/total_folder/
├── sample1/
│ ├── sample1_1.fastq.gz
│ ├── sample1_2.fastq.gz
├── sample2/
│ ├── sample2_1.fastq.gz
│ ├── sample2_2.fastq.gz
Update script1.sh:
INPUT_DIR="/data/metagenomics/total_folder"
HOST_INDEX="/path/to/bowtie2/mouse"
Run:
./script1.sh
Results will be generated in the current directory:
./sample1_results/
./sample2_results/
Error Handling: The script checks for input file existence and tool execution status, skipping failed samples with error messages.
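The skip-on-failure pattern described above typically looks like the sketch below; `run_step` is an illustrative helper and the exact messages in script1.sh may differ:

```shell
#!/usr/bin/env bash
# Run one pipeline step for a sample; on failure, report it and signal
# the caller (via the return code) to skip the rest of that sample.
run_step() {
    local sample=$1; shift
    if ! "$@"; then
        echo "ERROR: step '$*' failed for $sample; skipping sample" >&2
        return 1
    fi
}

# Inside the per-sample loop (illustrative):
#   run_step "$sample" fastp -i "$r1" -I "$r2" ... || continue
```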
Parallelization: The script uses multiple threads (4 for fastp/bowtie2, 8 for megahit/salmon). Adjust thread counts (-t, -p) based on your system's resources.
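If you tune thread counts often, one option is to centralize them in variables near the top of the script. QC_THREADS and ASM_THREADS below are illustrative names, not variables from script1.sh:

```shell
# Illustrative tunables; script1.sh hard-codes these values inline
QC_THREADS=4     # used by fastp (-w) and bowtie2 (-p)
ASM_THREADS=8    # used by megahit (-t) and salmon quant (-p)

# Example usage in each command:
#   fastp   -w "$QC_THREADS"  ...
#   bowtie2 -p "$QC_THREADS"  ...
#   megahit -t "$ASM_THREADS" ...
#   salmon quant -p "$ASM_THREADS" ...
```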
Large Datasets: For large datasets, consider running the script on a high-performance computing cluster with a job scheduler (e.g., SLURM).
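For a cluster run, a minimal SLURM wrapper might look like this; the resource values are placeholders to adjust for your cluster and data size:

```bash
#!/bin/bash
#SBATCH --job-name=metagenomics
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --output=metagenomics_%j.log

# Load tools via environment modules or conda, depending on your cluster setup
# module load fastp bowtie2 megahit seqkit prodigal salmon

./script1.sh
```

Submit with `sbatch` (e.g., `sbatch run_pipeline.sh`).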
Customization: Modify tool parameters (e.g., salmon --validateMappings, seqkit -m 500) in the script to suit your analysis needs.
WSL2 Environment: This script was developed and tested under WSL2 (Debian). All data is assumed to live under WSL's mounted /mnt directory. Before running, copy the input data to a directory on the native Linux filesystem (e.g., /usr/home/ or another non-mounted directory), as megahit does not work reliably on WSL-mounted Windows drives.
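Copying the data off the mounted Windows drive can be done with a small helper like the one below; `copy_data` and the destination path are illustrative, not part of script1.sh:

```shell
#!/usr/bin/env bash
# Copy input data from a mounted Windows drive to native Linux storage.
copy_data() {
    local src=$1 dest=$2
    mkdir -p "$dest"
    cp -r "$src"/. "$dest"/
}

# Example (paths are placeholders):
#   copy_data /mnt/d/Datas/Metagenomics/total_folder "$HOME/metagenomics/total_folder"
# then point the script at the native copy:
#   INPUT_DIR="$HOME/metagenomics/total_folder"
```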
Contributions are welcome! Please open an issue or submit a pull request with improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or support, please open an issue or contact 1636770513@qq.com.