Whole Genome Sequencing File Processing
This bash script implements an end-to-end genomic analysis pipeline designed to run on a local server. It takes raw paired-end sequencing reads through quality control, alignment, duplicate marking, base quality score recalibration, and variant calling, using the tools listed below.
The following tools must be installed and available in your system PATH:
- fastp
- bwa
- samtools
- gatk
- deepvariant
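A quick way to confirm the prerequisites is to look each tool up on PATH before starting. The sketch below is illustrative and not taken from the script: `check_tools` is a hypothetical helper. The executable names are copied from the list above and may differ on your system (DeepVariant, for example, is often invoked as `run_deepvariant`).

```shell
#!/usr/bin/env bash
# check_tools is a hypothetical helper, not part of the pipeline script.
# It reports every tool name that cannot be found on PATH.
check_tools() {
    missing=""
    for tool in "$@"; do
        # command -v succeeds only if the name resolves to something runnable
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Missing required tools:$missing" >&2
        return 1
    fi
}
```

Calling `check_tools fastp bwa samtools gatk deepvariant || exit 1` near the top of the script would stop the run before any work starts.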
Edit the script to set the following variables:
- INPUT_R1: Path to input FASTQ file for read 1
- INPUT_R2: Path to input FASTQ file for read 2
- REFERENCE_GENOME: Path to reference genome FASTA file
- KNOWN_SITES: Path to known sites VCF file
- OUTPUT_DIR: Path to output directory
- THREADS: Number of threads to use (adjust based on your server's capabilities)
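As an illustration, the variable block might look like the following once edited. Every path and the thread count are placeholders invented for this example; substitute the locations of your own files.

```shell
# Placeholder values for illustration only; replace with your own paths.
INPUT_R1="/data/run01/sample_R1.fastq.gz"
INPUT_R2="/data/run01/sample_R2.fastq.gz"
REFERENCE_GENOME="/data/reference/genome.fa"
KNOWN_SITES="/data/reference/known_sites.vcf.gz"
OUTPUT_DIR="/data/run01/results"
THREADS=16
```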
Make the script executable:
chmod +x genomic_analysis_pipeline.sh
Run the script:
./genomic_analysis_pipeline.sh
The pipeline runs the following steps in order:
- Quality Control and Trimming: Uses fastp to perform quality control and trimming on input FASTQ files.
- Alignment to Reference Genome: Aligns trimmed reads to the reference genome using BWA-MEM.
- Sorting and Indexing BAM Files: Sorts and indexes the aligned reads using samtools.
- Marking Duplicates: Marks duplicate reads in the BAM file.
- Base Quality Score Recalibration (BQSR): Performs base quality score recalibration using GATK.
- Variant Calling: Calls variants using DeepVariant.
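The steps above can be sketched as one shell function per stage. This is a rough outline under stated assumptions, not the script's actual contents: the function names, the `SAMPLE` variable, the output file names, and the read group values are invented for illustration. Using GATK's MarkDuplicates for duplicate marking and `run_deepvariant` as the DeepVariant entry point are also assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder defaults so the sketch is self-contained; the real script sets these.
: "${INPUT_R1:=sample_R1.fastq.gz}"
: "${INPUT_R2:=sample_R2.fastq.gz}"
: "${REFERENCE_GENOME:=reference.fa}"
: "${KNOWN_SITES:=known_sites.vcf.gz}"
: "${OUTPUT_DIR:=results}"
: "${THREADS:=8}"
SAMPLE="sample"   # hypothetical name used for output files and the read group

qc_trim() {
    fastp -i "$INPUT_R1" -I "$INPUT_R2" \
        -o "$OUTPUT_DIR/${SAMPLE}_R1.trimmed.fastq.gz" \
        -O "$OUTPUT_DIR/${SAMPLE}_R2.trimmed.fastq.gz" \
        -w "$THREADS"
}

align_sort_index() {
    # GATK expects a read group, so one is attached at alignment time
    # (PL:ILLUMINA is an assumption about the sequencing platform).
    bwa mem -t "$THREADS" \
        -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
        "$REFERENCE_GENOME" \
        "$OUTPUT_DIR/${SAMPLE}_R1.trimmed.fastq.gz" \
        "$OUTPUT_DIR/${SAMPLE}_R2.trimmed.fastq.gz" |
        samtools sort -@ "$THREADS" -o "$OUTPUT_DIR/${SAMPLE}.sorted.bam" -
    samtools index "$OUTPUT_DIR/${SAMPLE}.sorted.bam"
}

mark_duplicates() {
    # Assumption: duplicates are marked with GATK's bundled MarkDuplicates.
    gatk MarkDuplicates \
        -I "$OUTPUT_DIR/${SAMPLE}.sorted.bam" \
        -O "$OUTPUT_DIR/${SAMPLE}.dedup.bam" \
        -M "$OUTPUT_DIR/${SAMPLE}.dup_metrics.txt" \
        --CREATE_INDEX true
}

recalibrate() {
    # BQSR is two passes: build the recalibration table, then apply it.
    gatk BaseRecalibrator \
        -I "$OUTPUT_DIR/${SAMPLE}.dedup.bam" \
        -R "$REFERENCE_GENOME" \
        --known-sites "$KNOWN_SITES" \
        -O "$OUTPUT_DIR/${SAMPLE}.recal.table"
    gatk ApplyBQSR \
        -I "$OUTPUT_DIR/${SAMPLE}.dedup.bam" \
        -R "$REFERENCE_GENOME" \
        --bqsr-recal-file "$OUTPUT_DIR/${SAMPLE}.recal.table" \
        -O "$OUTPUT_DIR/${SAMPLE}.recal.bam"
    samtools index "$OUTPUT_DIR/${SAMPLE}.recal.bam"
}

call_variants() {
    # Assumption: DeepVariant is reachable as run_deepvariant.
    run_deepvariant --model_type=WGS \
        --ref="$REFERENCE_GENOME" \
        --reads="$OUTPUT_DIR/${SAMPLE}.recal.bam" \
        --output_vcf="$OUTPUT_DIR/${SAMPLE}.vcf.gz" \
        --output_gvcf="$OUTPUT_DIR/${SAMPLE}.g.vcf.gz" \
        --num_shards="$THREADS"
}

run_pipeline() {
    mkdir -p "$OUTPUT_DIR"
    qc_trim
    align_sort_index
    mark_duplicates
    recalibrate
    call_variants
}
```

Calling `run_pipeline` after the variables are set runs the stages in sequence. Piping `bwa mem` straight into `samtools sort` avoids writing an intermediate SAM file, which matters for WGS-sized inputs.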
The script generates several output files in the specified output directory, including:
- Trimmed FASTQ files
- Aligned, sorted, and indexed BAM files
- Recalibrated BAM file
- VCF and gVCF files containing called variants
The script will exit immediately if any command fails, helping to catch errors early in the pipeline.
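In bash, this fail-fast behaviour usually comes from `set -e`. A minimal sketch of a typical preamble follows, assuming the script uses something similar; the `-u` and `pipefail` options and the `trap` are illustrative additions and may not match the script exactly.

```shell
#!/usr/bin/env bash
# -e           exit as soon as a command returns a non-zero status
# -u           treat use of an unset variable as an error
# -o pipefail  make a pipeline fail if any stage fails, not only the last one
set -euo pipefail

# Optional: report where the failure happened before the script exits.
trap 'echo "Pipeline failed at line $LINENO" >&2' ERR
```

Without `pipefail`, a failure in the first half of a piped command (for example an aligner feeding a sorter) could go unnoticed because only the last command's exit status is checked.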
You can customize the pipeline by modifying the parameters passed to each tool. Refer to the documentation of individual tools for more information on available options.
- This pipeline is designed for whole genome sequencing (WGS) data.
- Ensure you have sufficient disk space in the output directory.
- The pipeline may take several hours to complete, depending on the size of your input data and the computational resources available.
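Free space in the output directory can be checked before a long run. The helper below is a hypothetical sketch (`has_free_space` is not part of the script), and the threshold you pass in is your own estimate for your data.

```shell
#!/usr/bin/env bash
# has_free_space DIR MIN_GB
# Hypothetical helper: succeeds if DIR's filesystem has at least MIN_GB GiB free.
has_free_space() {
    dir=$1
    min_gb=$2
    # df -P gives one line per filesystem; with -k, column 4 is available KiB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, `has_free_space "$OUTPUT_DIR" 500 || echo "Low disk space" >&2` would warn when fewer than 500 GiB are available; the figure 500 is only a placeholder.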
If you encounter any issues:
- Check that all required tools are properly installed and in your PATH.
- Verify that input files exist and are readable.
- Ensure you have write permissions in the output directory.
- Check the script's terminal output and the server logs for any error messages.
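The file and permission checks above can be automated with a short preflight step. This is a sketch with hypothetical helper names (`check_readable`, `check_writable_dir`), not code from the script.

```shell
#!/usr/bin/env bash
# check_readable FILE...
# Hypothetical helper: reports each file that is missing or unreadable.
check_readable() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "Cannot read input file: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# check_writable_dir DIR
# Hypothetical helper: fails if DIR does not exist or is not writable.
check_writable_dir() {
    if [ ! -d "$1" ] || [ ! -w "$1" ]; then
        echo "Output directory missing or not writable: $1" >&2
        return 1
    fi
}
```

A call such as `check_readable "$INPUT_R1" "$INPUT_R2" "$REFERENCE_GENOME" "$KNOWN_SITES" && check_writable_dir "$OUTPUT_DIR"` covers the input and output checks in one line.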
For further assistance, please contact your system administrator or bioinformatics support team.