NAPLS Genomic Data Processing Pipeline (ENIGMA-DTI QC)

Overview

This document describes the pre-imputation quality control (QC) pipeline applied to the North American Prodrome Longitudinal Study Phase 3 (NAPLS3) genomic data. The pipeline follows the ENIGMA-DTI QC Protocol and is implemented as a series of shell scripts designed to run on the Hoffman2 cluster.

The primary goal is to prepare the raw NAPLS3 genotype data (genome build: hg19/GRCh37) for imputation by renaming SNPs and applying rigorous QC.

The workflow consists of the following stages:

(Optional Setup) Creation of dbSNP binary files for efficient SNP mapping.
SNP Renaming: Standardizing variant identifiers to rsIDs or chr:pos:ref:alt.
ENIGMA-DTI QC Part 1: Initial filtering, sex/phenotype updates, and sex checks.
ENIGMA-DTI QC Part 2: Duplicate/relatedness checks, HapMap3 merging, and MDS analysis for ancestry outlier removal.
ENIGMA-DTI QC Part 3: PCA covariate generation, summary statistics calculation, and final results packaging.

Prerequisites

Software

PLINK v1.9: (/u/project/cbearden/hughesdy/software/plinkv1.9/plink)
PLINK v2.0: (/u/project/cbearden/hughesdy/software/plink2)
rsid_tools: ($HOME/apps/rsid_tools/bin/rsid_tools) - Installation required, see rsid_tools GitHub.
R: (v4.2.2+ recommended) with packages: data.table, ggplot2, calibrate, rmarkdown, tinytex, knitr, xfun. (Loaded via module R/4.2.2-BIO in scripts).
Hoffman2 Modules: parallel, bcftools, htslib, aria2.

Input Data

Raw NAPLS3 Genotypes: Located at /u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/NAPLS3_n710.{bed,bim,fam}. Genotyped on Illumina Infinium Global Screening Array-24 (GSAMD-24v1-0_20011747_A1).
Phenotype/Sample Information: Located in processed_genotype/enigma/DTIgenetics/info/:
- NAPLS3_Terra_samplestab_phenofile.txt: Contains subject IDs, sex, and case/control status.
- napls3_MS_diffusion.csv: Contains list of subjects with DTI data.
dbSNP VCF (for Stage 0): dbSNP build 156 for GRCh37 (GCF_000001405.25.gz and .tbi). Can be downloaded automatically or provided locally.

Environment

Scripts are designed for the Hoffman2 cluster environment.
Access to project directories (/u/project/cbearden/, /u/home/c/cobeaman/) and $SCRATCH space is required.

Workflow Steps

The pipeline can be executed using the master script run_napls_qc.sh or by running the individual scripts sequentially.

Detailed NAPLS3 Genomic Data Pre-Imputation QC Workflow (ENIGMA-DTI Protocol Implementation)

Stage 0: Create rsID Binaries (Optional - `processed_genotype/01_create_rsid_binaries.sh`)

Purpose: Generates binary index files from a dbSNP VCF. These files allow rsid_tools (used in Stage 1) to quickly map variant coordinates to rsIDs.
Execution: This is typically a one-time setup. The master script run_napls_qc.sh checks for existing binaries in $RS_BIN_DIR (defined as $HOME/scratch/GRCh37_dbSNP156_Binaries/Standard) and skips this step if they are found.
Inputs: dbSNP VCF file (e.g., GCF_000001405.25.gz) and its index (.tbi).
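Outputs: Binary .bin files (e.g., GRCh37_1.hash2rsid.bin, GRCh37_1.rsid2pos.bin, etc.) placed in $OUTPUT_DIR (defined as $HOME/project-cbearden/napls/binaries in the script). Note that the master script expects the binaries in $RS_BIN_DIR, so when Stage 0 is run on its own, make sure the two locations agree (for example, by copying or linking the files) before running Stage 1.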
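
Because the standalone script and the master script may point at different directories, a small check like the following can reconcile them. This is a minimal sketch, not the logic used in run_napls_qc.sh: the helper names have_bins and link_bins are hypothetical, and the only assumption about the binaries is that they end in .bin as described above.

```shell
have_bins() {
    # usage: have_bins <directory>
    # True if the directory holds at least one .bin file.
    for f in "$1"/*.bin; do
        [ -e "$f" ] && return 0
    done
    return 1
}

link_bins() {
    # usage: link_bins <source_dir> <target_dir>
    # Symlink every .bin file from the source into the target directory.
    # Pass absolute paths so the links resolve from anywhere.
    mkdir -p "$2" || return 1
    for f in "$1"/*.bin; do
        [ -e "$f" ] || continue
        ln -sf "$f" "$2/"
    done
}

# Example usage (paths taken from the text above):
# have_bins "$RS_BIN_DIR" || link_bins "$HOME/project-cbearden/napls/binaries" "$RS_BIN_DIR"
```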

Stage 1: SNP Renaming (`processed_genotype/01_rename_snps_direct.sh`)

Purpose: Renames variant identifiers in the raw NAPLS3 .bim file. It attempts to find the corresponding rsID using the dbSNP binaries created in Stage 0. If an rsID is not found, it uses a composite ID format (chr:pos:ref:alt).
Inputs:
- Raw NAPLS3 PLINK files (/u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/NAPLS3_n710.*).
- rsID binary files (from Stage 0, located via $RS_BIN_DIR).
Outputs:
- Renamed PLINK fileset: processed_genotype/NAPLS3_n710_renamed*.{bed,bim,fam}.
- Renaming map file: processed_genotype/final_snp_rename*.txt.
- Log files in processed_genotype/logs/<jobid>_rename_snps_direct/.
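
After renaming, it can be worth checking how many variants ended up with an rsID and whether any identifiers now collide, since duplicate IDs tend to cause trouble in later PLINK steps. The sketch below is not part of the pipeline scripts; the function names are hypothetical and it relies only on the standard .bim layout, where column 2 is the variant ID.

```shell
summarize_ids() {
    # usage: summarize_ids <file.bim>
    # Count IDs that look like rs<digits> versus everything else.
    awk '{ if ($2 ~ /^rs[0-9]+$/) rs++; else other++ }
         END { printf "rsID\t%d\nother\t%d\n", rs, other }' "$1"
}

duplicate_ids() {
    # usage: duplicate_ids <file.bim>
    # Print each variant ID that occurs more than once.
    awk '{ print $2 }' "$1" | sort | uniq -d
}
```

Both functions take the path to a single renamed .bim file; an empty result from duplicate_ids means every identifier is unique.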

Stage 2: ENIGMA-DTI QC Part 1 (`processed_genotype/02_enigma_dti_qc_napls3_part1.sh`)

Purpose: Implements ENIGMA-DTI QC Steps 1-3. Filters subjects, updates sex and phenotype information, performs initial SNP/sample QC, splits the X chromosome, and performs sex checks.
Inputs:
- Renamed PLINK files from Stage 1 (processed_genotype/NAPLS3_n710_renamed*.{bed,bim,fam}).
- Phenotype file (processed_genotype/enigma/DTIgenetics/info/NAPLS3_Terra_samplestab_phenofile.txt).
- DTI subject list (processed_genotype/enigma/DTIgenetics/info/napls3_MS_diffusion.csv).
Outputs (in processed_genotype/enigma/DTIgenetics/<jobid>_enigma_dti_qc_napls3_part1/):
- QC'd PLINK fileset: *_QC1.{bed,bim,fam}.
- Sex mismatch list: sex_mismatches.txt.
- Summary files and logs.
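
To inspect sex-check results by hand, the flagged samples can be pulled out of a PLINK 1.9 .sexcheck file. This is a sketch under the assumption that the standard --check-sex output (columns FID, IID, PEDSEX, SNPSEX, STATUS, F) is available; the function name is hypothetical, and the pipeline's own sex_mismatches.txt may be built differently or use another layout.

```shell
sex_problems() {
    # usage: sex_problems <file.sexcheck>
    # Skip the header row and keep samples whose STATUS column is PROBLEM.
    # Output columns: FID, IID, PEDSEX (reported), SNPSEX (inferred).
    awk 'NR > 1 && $5 == "PROBLEM" { print $1 "\t" $2 "\t" $3 "\t" $4 }' "$1"
}
```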

Stage 3: ENIGMA-DTI QC Part 2 (`processed_genotype/02_enigma_dti_qc_napls3_part2.sh`)

Purpose: Implements ENIGMA-DTI QC Steps 4-6. Checks for duplicates and relatedness, merges data with HapMap3 reference, performs MDS analysis to identify and remove ancestry outliers (targeting European ancestry based on CEU/TSI cluster).
Inputs: Output directory from Stage 2 (processed_genotype/enigma/DTIgenetics/<jobid>_enigma_dti_qc_napls3_part1/).
Outputs (in processed_genotype/enigma/DTIgenetics/<jobid>_enigma_dti_qc_napls3_part2/):
- QC'd PLINK fileset after outlier removal: *_QC3.{bed,bim,fam}.
- MDS plots (before and after outlier removal): mdsplot_*.pdf.
- Outlier lists: *_pop_strat_mds.outlier.txt, *_pop_strat_mds.eur.txt.
- Duplicate/Relatedness counts.
- Summary files and logs.
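
The duplicate/relatedness check can be reviewed manually from a PLINK 1.9 .genome file, in which PI_HAT is the tenth column. The following is a sketch only: the function name is hypothetical, and the default threshold of 0.25 is an illustrative assumption rather than the value used by the pipeline or the ENIGMA-DTI protocol.

```shell
related_pairs() {
    # usage: related_pairs <file.genome> [pi_hat_threshold]
    # .genome columns: FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT ...
    # Prints both sample IDs and PI_HAT for pairs above the threshold.
    # The 0.25 default is an assumed value for illustration.
    awk -v thr="${2:-0.25}" \
        'NR > 1 && $10 + 0 > thr + 0 { print $1, $2, $3, $4, $10 }' "$1"
}
```

A PI_HAT close to 1 usually points to a duplicate sample or monozygotic twins, while lower values indicate progressively more distant relatives.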

Stage 4: ENIGMA-DTI QC Part 3 (`processed_genotype/02_enigma_dti_qc_napls3_part3.sh`)

Purpose: Implements ENIGMA-DTI QC Steps 8-9 and final packaging. Generates PCA covariates, calculates pre- and post-QC summary statistics, creates summary reports (text and PDF), and packages essential results into a zip archive for submission.
Inputs: Output directories from Stage 2 and Stage 3.
Outputs (in processed_genotype/enigma/DTIgenetics/<jobid>_enigma_dti_qc_napls3_part3/):
- PCA results: *_PCACovariates.{eigenval,eigenvec,log}.
- PCA scree plot: screeplot_*.pdf.
- Summary statistics files: *_basic_stats_preQC.txt, *_basic_stats_postQC.txt, *_qc_summary.txt.
- Summary reports: *_QC3_summary.txt, summary_report.pdf.
- Final submission archive: *_ENIGMA-DTI_FilesToSend.zip (contains logs, stats, plots).
- output_all/ directory containing intermediate and final files.
- output_final/ directory containing files included in the zip archive.
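
For a quick look at the PCA results without opening the scree plot, each component's share of the reported eigenvalues can be computed from the .eigenval file (one eigenvalue per line). This is a sketch with a hypothetical function name. Note that PLINK writes only the top components, so these shares are relative to the components listed in the file, not to the total variance.

```shell
variance_explained() {
    # usage: variance_explained <file.eigenval>
    # First pass sums the eigenvalues; second pass prints each one's share.
    awk 'NR == FNR { sum += $1; next }
         { printf "PC%d\t%.4f\n", FNR, $1 / sum }' "$1" "$1"
}
```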

Running the Pipeline

Using the Master Script (Recommended)

The entire pipeline can be run using the master orchestration script run_napls_qc.sh.

Review Configuration: Check the environment variables defined within run_napls_qc.sh (e.g., NAPLS3_DIR, WORK_DIR, SCRATCH_DIR, tool paths) and adjust if necessary for your environment.
Submit Job: Submit the script to the Hoffman2 scheduler:
```
qsub run_napls_qc.sh
```
Monitoring:
- The main pipeline log is written to $LOG_DIR/napls3_qc_run.log (where $LOG_DIR is defined in the script, e.g., processed_genotype/logs/<jobid>_napls_qc_master).
- Individual script logs are stored within their respective output directories (e.g., processed_genotype/enigma/DTIgenetics/<jobid>_partX/logs/).
- The pipeline uses a checkpoint file ($LOG_DIR/napls3_qc_checkpoint.txt) to track completed steps, allowing resumption if interrupted.
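
The checkpoint mechanism can be pictured roughly as follows. This is an illustrative sketch, not the code in run_napls_qc.sh: the helper names step_done, mark_done, and run_step are hypothetical, and it assumes the checkpoint file simply holds one completed step name per line.

```shell
# Assumed location; the real script defines $LOG_DIR itself.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-${LOG_DIR:-.}/napls3_qc_checkpoint.txt}"

step_done() {
    # True if the step name appears as a whole line in the checkpoint file.
    [ -f "$CHECKPOINT_FILE" ] && grep -qxF "$1" "$CHECKPOINT_FILE"
}

mark_done() {
    # Record a completed step.
    echo "$1" >> "$CHECKPOINT_FILE"
}

run_step() {
    # usage: run_step <step_name> <command> [args...]
    # Runs the command unless the step is already recorded; records it on success.
    local name="$1"; shift
    if step_done "$name"; then
        echo "Skipping $name (already completed)"
        return 0
    fi
    "$@" && mark_done "$name"
}

# Example usage (script path taken from the stage description above):
# run_step rename_snps bash processed_genotype/01_rename_snps_direct.sh
```

Because a step is recorded only after its command exits successfully, a failed or interrupted step is rerun on the next submission.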

Running Individual Scripts

Alternatively, the scripts (01_create_rsid_binaries.sh, 01_rename_snps_direct.sh, 02_enigma_dti_qc_napls3_part1.sh, etc.) can be run sequentially via qsub or directly in an interactive session. Ensure the necessary inputs from the previous step are available and correctly located. The master script uses cached_find to locate outputs dynamically, which would need manual replication or hardcoding if running scripts individually.
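
When running scripts individually, one way to stand in for the master script's dynamic lookup is to pick the most recently modified output directory of the previous stage. The sketch below is an assumption about how that could be done by hand and does not reproduce cached_find; the function name latest_dir is hypothetical.

```shell
latest_dir() {
    # usage: latest_dir <base_dir> <suffix>
    # Prints the newest directory named <base_dir>/*_<suffix>, or fails if none exists.
    newest=""
    for d in "$1"/*_"$2"/; do
        [ -d "$d" ] || continue
        if [ -z "$newest" ] || [ "$d" -nt "$newest" ]; then
            newest="$d"
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "${newest%/}"
}

# Example usage (directory naming taken from the stage descriptions above):
# PART1_DIR="$(latest_dir processed_genotype/enigma/DTIgenetics enigma_dti_qc_napls3_part1)"
```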

Output Structure

SNP Renaming: Renamed files (NAPLS3_n710_renamed*) are placed directly in processed_genotype/.
ENIGMA QC: Each part of the ENIGMA QC creates a timestamped/job-ID-based directory within processed_genotype/enigma/DTIgenetics/.
- ..._part1/: Contains *_QC1.* files, logs, sex mismatch info.
- ..._part2/: Contains *_QC3.* files (final QC'd dataset), MDS plots, outlier lists, logs.
- ..._part3/: Contains PCA results, summary stats/reports, logs, and the final *_ENIGMA-DTI_FilesToSend.zip archive.
Logs: Overall pipeline logs are in processed_genotype/logs/<jobid>_napls_qc_master/. Logs specific to each step are within the step's output directory.
Final Summary: A comprehensive summary of the pipeline run is generated at processed_genotype/napls3_qc_pipeline_summary.txt.

Troubleshooting

Job Failures: Check the .log file corresponding to the qsub job ID in the relevant log directory (master log or step-specific log).
Prerequisite Errors: Ensure all required software is installed/loaded and input files exist and are accessible. The master script performs checks at the start.
File Not Found: Verify that output files from previous steps were generated correctly and that paths used in subsequent scripts are accurate. The master script attempts to find these dynamically.
R Script Errors: Check the R script output within the main log file (*_run.log) for specific error messages, often related to missing packages or data format issues.
PLINK Errors: Consult the PLINK .log files generated within the step's output directory for detailed error messages.
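
To scan many log files at once, the lines PLINK marks with "Error:" or "Warning:" can be collected in one pass. This is a small convenience sketch with a hypothetical function name; it searches any file ending in .log under the given directory, so it may also pick up logs from other tools.

```shell
scan_logs() {
    # usage: scan_logs <directory>
    # Prints file:line:text for every line starting with "Error:" or "Warning:".
    find "$1" -type f -name '*.log' -exec grep -Hn -E '^(Error|Warning):' {} +
}

# Example usage (directory naming taken from the Output Structure section):
# scan_logs processed_genotype/enigma/DTIgenetics
```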

References

ENIGMA Resources

ENIGMA-DTI QC Protocol Summary (Aug 2024)
ENIGMA-DTI Quality Control (QC) Protocol
ENIGMA_DTI_GWAS GitHub Repository
ENIGMA Genetics Protocols Overview
ENIGMA Genetics GitHub Repository

Imputation Servers

Michigan Imputation Server 2
TOPMed Imputation Server
Sanger Imputation Service
Kiel EagleImp-web Imputation Server
1. EagleImp: fast and accurate genome-wide phasing and imputation in a single tool
  1. "For common variants investigated in typical genome-wide association studies, EagleImp provided same or higher imputation accuracy than the Sanger Imputation Service, Michigan Imputation Server and the newly developed TOPMed Imputation Server, despite larger (not publicly available) reference panels."
2. EagleImp Github
Helmholtz Munich Imputation Server (HMIS)
1. Toward GDPR compliance with the Helmholtz Munich genotype imputation server

Tools & Formats

PLINK 1.9 Documentation
PLINK 2.0 Documentation
rsid_tools GitHub

History

The ENIGMA group requested access to our NAPLS genotype data in order to perform a diffusion imaging GWAS.
We initially identified an older ENIGMA Genetics processing pipeline, which has since been superseded by the current version listed below.
1. ENIGMA Genetics Protocols
2. ENIGMA Protocols for Imputation and Genetic Associations

ENIGMA-DTI Quality Control (QC) Protocol - Pre Imputation Tasks/Info

The NAPLS3 data has been downloaded.
1. NAPLS2 data will be downloaded soon and processed once a pipeline has been established and validated for NAPLS3.
2. Start with NAPLS3.
3. The data is already in ENIGMA's required genome build: hg19/GRCh37.
4. The data was genotyped on the Illumina Infinium Global Screening Array-24 chip, referred to as GSAMD-24v1-0_20011747_A1 in the raw data.
5. The bed, bim, and fam files are located at /u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/NAPLS3_n710*
Before starting the ENIGMA-DTI Quality Control (QC) Protocol - Pre Imputation, SNP names must be changed:
1. The .bim file has 6 columns: https://www.cog-genomics.org/plink/1.9/formats#bim
2. Convert variant names to rs format.
  1. ENIGMA may request chr:bp format [chromosome number : base-pair location].
  2. rs format is a good place to start and will make things cleaner.
  3. Most variants are already in rs format.
    1. Some have a GSA- prefix.
3. Make a new text file with two columns:
  1. old variant name [currently in the .bim file]
  2. new variant name
4. Remove the GSA- prefix from variant names and write the resulting rs number to the new (second) column.
5. A few variants are in chr:bp format or some other format.
  1. These can probably be left as they are.
  2. Otherwise, look them up here [list of rs IDs linked to their chr:bp positions for build hg19]: /u/project/cbearden/hughesdy/NAPLS/rsDict/hg19/noDups/AllChr_Sorted_Tabdelim_nochr.txt
Because the base-pair location is listed after the chromosome in the first column of this file, you can use it to match each entry to the corresponding SNP in chr:bp format in the NAPLS data.
Then add the new name to the renaming file.
1. More information on renaming [--update-name documentation]: https://www.cog-genomics.org/plink/1.9/data#update_map
2. Path for PLINK v1.9: /u/project/cbearden/hughesdy/software/plinkv1.9/plink
3. Path for PLINK v2.0: /u/project/cbearden/hughesdy/software/plink2
4. The two versions are largely equivalent, but plink2 can work with more efficient versions of the bed/bim/fam files.
5. Only v1.9 should be needed.
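
The manual renaming steps above can be sketched as a short script. This is an illustration only: the function name and output file names are hypothetical, and it assumes the prefixed IDs have exactly the form GSA-rs followed by digits. Stripping the prefix can create a duplicate if the bare rsID is already present in the .bim file, so check for collisions before applying the map.

```shell
make_rename_map() {
    # usage: make_rename_map <file.bim>
    # Emits "old<TAB>new" for IDs of the form GSA-rs<digits>; other IDs are left alone.
    awk '$2 ~ /^GSA-rs[0-9]+$/ { print $2 "\t" substr($2, 5) }' "$1"
}

# Example usage (output names are placeholders):
# make_rename_map NAPLS3_n710.bim > rename_map.txt
# plink --bfile NAPLS3_n710 --update-name rename_map.txt --make-bed --out NAPLS3_n710_renamed
```

By default, --update-name reads the old ID from column 1 and the new ID from column 2, which matches the two-column layout described above.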
