This document describes the pre-imputation Quality Control (QC) pipeline applied to the North American Prodrome Longitudinal Study (NAPLS) Phase 3 (NAPLS3) genomic data. The pipeline follows the ENIGMA-DTI Quality Control (QC) Protocol and is implemented via a series of shell scripts designed to run on the Hoffman2 cluster.
The primary goal is to prepare the raw NAPLS3 genotype data (genome build: hg19/GRCh37) for imputation by renaming SNPs and applying rigorous QC.
The workflow consists of the following stages:
- (Optional Setup) Creation of dbSNP binary files for efficient SNP mapping.
- SNP Renaming: Standardizing variant identifiers to rsIDs or `chr:pos:ref:alt`.
- ENIGMA-DTI QC Part 1: Initial filtering, sex/phenotype updates, and sex checks.
- ENIGMA-DTI QC Part 2: Duplicate/relatedness checks, HapMap3 merging, and MDS analysis for ancestry outlier removal.
- ENIGMA-DTI QC Part 3: PCA covariate generation, summary statistics calculation, and final results packaging.
- PLINK v1.9: (`/u/project/cbearden/hughesdy/software/plinkv1.9/plink`)
- PLINK v2.0: (`/u/project/cbearden/hughesdy/software/plink2`)
- rsid_tools: (`$HOME/apps/rsid_tools/bin/rsid_tools`) - Installation required, see rsid_tools GitHub.
- R: (v4.2.2+ recommended) with packages: `data.table`, `ggplot2`, `calibrate`, `rmarkdown`, `tinytex`, `knitr`, `xfun`. (Loaded via module `R/4.2.2-BIO` in scripts.)
- Hoffman2 Modules: `parallel`, `bcftools`, `htslib`, `aria2`.
- Raw NAPLS3 Genotypes: Located at `/u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/NAPLS3_n710.{bed,bim,fam}`. Genotyped on Illumina Infinium Global Screening Array-24 (`GSAMD-24v1-0_20011747_A1`).
- Phenotype/Sample Information: Located in `processed_genotype/enigma/DTIgenetics/info/`:
  - `NAPLS3_Terra_samplestab_phenofile.txt`: Contains subject IDs, sex, and case/control status.
  - `napls3_MS_diffusion.csv`: Contains the list of subjects with DTI data.
- dbSNP VCF (for Stage 0): dbSNP build 156 for GRCh37 (`GCF_000001405.25.gz` and `.tbi`). Can be downloaded automatically or provided locally.
- Scripts are designed for the Hoffman2 cluster environment.
- Access to project directories (`/u/project/cbearden/`, `/u/home/c/cobeaman/`) and `$SCRATCH` space is required.
The pipeline can be executed using the master script `run_napls_qc.sh` or by running the individual scripts sequentially.
- Purpose: Stage 0 generates binary index files from a dbSNP VCF. These files allow `rsid_tools` (used in Stage 1) to quickly map variant coordinates to rsIDs.
- Execution: This is typically a one-time setup. The master script (`run_napls_qc.sh`) checks for existing binaries in `$RS_BIN_DIR` (defined as `$HOME/scratch/GRCh37_dbSNP156_Binaries/Standard`) and skips this step if they are found.
- Inputs: dbSNP VCF file (e.g., `GCF_000001405.25.gz`) and its index (`.tbi`).
- Outputs: Binary `.bin` files (e.g., `GRCh37_1.hash2rsid.bin`, `GRCh37_1.rsid2pos.bin`, etc.) placed in `$OUTPUT_DIR`. Note that the script defines `$OUTPUT_DIR` as `$HOME/project-cbearden/napls/binaries`, whereas the master script expects the binaries in `$RS_BIN_DIR`.
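The skip-if-present check for the dbSNP binaries could look roughly like the sketch below. It is not taken from `run_napls_qc.sh`: the helper name `binaries_present` is hypothetical, and treating "at least one `.bin` file exists" as proof of a complete build is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether the dbSNP binary build can be skipped.
# Assumption: any *.bin file in the directory means the binaries were built.

binaries_present() {
    local dir="$1"
    # Succeeds when the directory exists and holds at least one .bin file.
    [ -d "$dir" ] && [ -n "$(find "$dir" -maxdepth 1 -name '*.bin' -print -quit)" ]
}

# Default taken from the path documented for $RS_BIN_DIR.
RS_BIN_DIR="${RS_BIN_DIR:-$HOME/scratch/GRCh37_dbSNP156_Binaries/Standard}"

if binaries_present "$RS_BIN_DIR"; then
    echo "Found dbSNP binaries in $RS_BIN_DIR; the binary build can be skipped."
else
    echo "No dbSNP binaries in $RS_BIN_DIR; the binary build must run."
fi
```

A stricter check might count one file per chromosome, which would also catch a partially completed build.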
- Purpose: Stage 1 renames variant identifiers in the raw NAPLS3 `.bim` file. It attempts to find the corresponding rsID using the dbSNP binaries created in Stage 0. If an rsID is not found, it uses a composite ID format (`chr:pos:ref:alt`).
- Inputs:
  - Raw NAPLS3 PLINK files (`/u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/NAPLS3_n710.*`).
  - rsID binary files (from Stage 0, located via `$RS_BIN_DIR`).
- Outputs:
  - Renamed PLINK fileset: `processed_genotype/NAPLS3_n710_renamed*.{bed,bim,fam}`.
  - Renaming map file: `processed_genotype/final_snp_rename*.txt`.
  - Log files in `processed_genotype/logs/<jobid>_rename_snps_direct/`.
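To make the renaming rule concrete, here is a simplified sketch that builds a two-column rename map from a `.bim` file. It is not the pipeline's actual logic: the real step looks up rsIDs by position with `rsid_tools`, whereas this sketch only strips a `GSA-` prefix and otherwise falls back to a composite ID. The function name `make_rename_map`, the demo rows, and the choice to treat column 6 as `ref` and column 5 as `alt` are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch: derive an "old ID <TAB> new ID" map from a .bim file.
# Assumption: .bim column 6 is used as ref and column 5 as alt.

make_rename_map() {
    # .bim columns: chromosome, variant ID, cM, base-pair position, allele 1, allele 2
    awk 'BEGIN { OFS = "\t" }
    {
        old = $2
        if (old ~ /^rs[0-9]+$/) {
            new = old                          # already an rsID, keep it
        } else if (old ~ /^GSA-rs[0-9]+$/) {
            new = substr(old, 5)               # drop the 4-character "GSA-" prefix
        } else {
            new = $1 ":" $4 ":" $6 ":" $5      # composite chr:pos:ref:alt fallback
        }
        if (new != old) print old, new         # only list IDs that change
    }' "$1"
}

# Made-up demo rows, for illustration only.
demo_bim="$(mktemp)"
cat > "$demo_bim" <<'EOF'
1 rs3094315 0 752566 G A
1 GSA-rs12562034 0 768448 A G
2 2:10133 0 10133 T C
EOF

make_rename_map "$demo_bim"
rm -f "$demo_bim"
```

A map in this old-name/new-name layout is the format PLINK's `--update-name` expects.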
- Purpose: Stage 2 implements ENIGMA-DTI QC Steps 1-3. It filters subjects, updates sex and phenotype information, performs initial SNP/sample QC, splits the X chromosome, and performs sex checks.
- Inputs:
  - Renamed PLINK files from Stage 1 (`processed_genotype/NAPLS3_n710_renamed*.{bed,bim,fam}`).
  - Phenotype file (`processed_genotype/enigma/DTIgenetics/info/NAPLS3_Terra_samplestab_phenofile.txt`).
  - DTI subject list (`processed_genotype/enigma/DTIgenetics/info/napls3_MS_diffusion.csv`).
- Outputs (in `processed_genotype/enigma/DTIgenetics/<jobid>_enigma_dti_qc_napls3_part1/`):
  - QC'd PLINK fileset: `*_QC1.{bed,bim,fam}`.
  - Sex mismatch list: `sex_mismatches.txt`.
  - Summary files and logs.
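As an illustration of how a sex mismatch list can be derived, the sketch below pulls flagged samples out of a PLINK 1.9 `.sexcheck` report (columns `FID IID PEDSEX SNPSEX STATUS F`). Whether the pipeline builds `sex_mismatches.txt` this way is an assumption, as are the function name `list_sex_mismatches` and the demo rows.

```shell
#!/usr/bin/env bash
# Sketch: list samples that PLINK's sex check flagged as PROBLEM.
# Assumption: the pipeline's own mismatch list may be produced differently.

list_sex_mismatches() {
    # Skip the header row, then keep FID and IID where STATUS is PROBLEM.
    awk 'NR > 1 && $5 == "PROBLEM" { print $1, $2 }' "$1"
}

# Made-up demo report, for illustration only.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
FID IID PEDSEX SNPSEX STATUS F
F1 S1 1 1 OK 0.98
F2 S2 2 1 PROBLEM 0.95
F3 S3 2 2 OK 0.02
EOF

list_sex_mismatches "$demo"   # prints: F2 S2
rm -f "$demo"
```

Note that PLINK also reports `PROBLEM` when the genotype-based sex is ambiguous (`SNPSEX` of 0), so a flagged sample is not always a true mismatch.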
- Purpose: Stage 3 implements ENIGMA-DTI QC Steps 4-6. It checks for duplicates and relatedness, merges the data with the HapMap3 reference, and performs MDS analysis to identify and remove ancestry outliers (targeting European ancestry based on the CEU/TSI cluster).
- Inputs: Output directory from Stage 2 (`processed_genotype/enigma/DTIgenetics/<jobid>_enigma_dti_qc_napls3_part1/`).
- Outputs (in `processed_genotype/enigma/DTIgenetics/<jobid>_enigma_dti_qc_napls3_part2/`):
  - QC'd PLINK fileset after outlier removal: `*_QC3.{bed,bim,fam}`.
  - MDS plots (before and after outlier removal): `mdsplot_*.pdf`.
  - Outlier lists: `*_pop_strat_mds.outlier.txt`, `*_pop_strat_mds.eur.txt`.
  - Duplicate/Relatedness counts.
  - Summary files and logs.
- Purpose: Stage 4 implements ENIGMA-DTI QC Steps 8-9 and final packaging. It generates PCA covariates, calculates pre- and post-QC summary statistics, creates summary reports (text and PDF), and packages essential results into a zip archive for submission.
- Inputs: Output directories from Stage 2 and Stage 3.
- Outputs (in `processed_genotype/enigma/DTIgenetics/<jobid>_enigma_dti_qc_napls3_part3/`):
  - PCA results: `*_PCACovariates.{eigenval,eigenvec,log}`.
  - PCA scree plot: `screeplot_*.pdf`.
  - Summary statistics files: `*_basic_stats_preQC.txt`, `*_basic_stats_postQC.txt`, `*_qc_summary.txt`.
  - Summary reports: `*_QC3_summary.txt`, `summary_report.pdf`.
  - Final submission archive: `*_ENIGMA-DTI_FilesToSend.zip` (contains logs, stats, plots).
  - `output_all/` directory containing intermediate and final files.
  - `output_final/` directory containing files included in the zip archive.
The entire pipeline can be run using the master orchestration script `run_napls_qc.sh`.
- Review Configuration: Check the environment variables defined within `run_napls_qc.sh` (e.g., `NAPLS3_DIR`, `WORK_DIR`, `SCRATCH_DIR`, tool paths) and adjust if necessary for your environment.
- Submit Job: Submit the script to the Hoffman2 scheduler: `qsub run_napls_qc.sh`
- Monitoring:
  - The main pipeline log is written to `$LOG_DIR/napls3_qc_run.log` (where `$LOG_DIR` is defined in the script, e.g., `processed_genotype/logs/<jobid>_napls_qc_master`).
  - Individual script logs are stored within their respective output directories (e.g., `processed_genotype/enigma/DTIgenetics/<jobid>_partX/logs/`).
  - The pipeline uses a checkpoint file (`$LOG_DIR/napls3_qc_checkpoint.txt`) to track completed steps, allowing resumption if interrupted.
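A checkpoint file of this kind can be handled with a few small helpers. The sketch below shows one way to do it under stated assumptions: the function names (`step_done`, `mark_done`, `run_step`), the one-step-name-per-line file format, and the step name `rename_snps` are hypothetical and are not taken from `run_napls_qc.sh`.

```shell
#!/usr/bin/env bash
# Sketch: resume-friendly step runner backed by a plain-text checkpoint file.
# Assumption: the checkpoint file holds one completed step name per line.

CHECKPOINT_FILE="${CHECKPOINT_FILE:-$(mktemp)}"

step_done() {
    # True if the exact step name is already recorded.
    grep -qxF -- "$1" "$CHECKPOINT_FILE" 2>/dev/null
}

mark_done() {
    echo "$1" >> "$CHECKPOINT_FILE"
}

run_step() {
    local name="$1"; shift
    if step_done "$name"; then
        echo "Skipping $name (already completed)"
        return 0
    fi
    # Record the step only if its command succeeds.
    "$@" && mark_done "$name"
}

run_step rename_snps echo "renaming SNPs"   # runs the command
run_step rename_snps echo "renaming SNPs"   # skipped on the second call
```

Because a step is recorded only after its command exits successfully, rerunning after an interruption repeats the failed step and everything after it.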
Alternatively, the scripts (`01_create_rsid_binaries.sh`, `01_rename_snps_direct.sh`, `02_enigma_dti_qc_napls3_part1.sh`, etc.) can be run sequentially via `qsub` or directly in an interactive session. Ensure the necessary inputs from the previous step are available and correctly located. The master script uses `cached_find` to locate outputs dynamically, which would need manual replication or hardcoding if running scripts individually.
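When running scripts one at a time, the previous stage's job-ID-prefixed output directory has to be located by hand. The sketch below is a stand-in for that lookup, not a reimplementation of `cached_find`, whose behavior is not described here. The function name `latest_output_dir`, the demo job IDs, and the choice of "most recently modified directory wins" are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: find the newest output directory whose name ends in a given suffix.
# Assumption: directory names contain no spaces or newlines.

latest_output_dir() {
    local base="$1" suffix="$2"
    # ls -t sorts by modification time, newest first; -d lists the directories themselves.
    ls -1dt "$base"/*"$suffix"/ 2>/dev/null | head -n 1
}

# Demo with two made-up job IDs in a temporary directory.
base="$(mktemp -d)"
mkdir "$base/1001_enigma_dti_qc_napls3_part1" "$base/1002_enigma_dti_qc_napls3_part1"
touch -t 202401010000 "$base/1001_enigma_dti_qc_napls3_part1"
touch -t 202402010000 "$base/1002_enigma_dti_qc_napls3_part1"

PART1_DIR="$(latest_output_dir "$base" _enigma_dti_qc_napls3_part1)"
echo "Most recent Part 1 output: $PART1_DIR"
rm -rf "$base"
```

An empty result means no matching directory exists, so a calling script should check for that before continuing.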
- SNP Renaming: Renamed files (`NAPLS3_n710_renamed*`) are placed directly in `processed_genotype/`.
- ENIGMA QC: Each part of the ENIGMA QC creates a timestamped/job-ID-based directory within `processed_genotype/enigma/DTIgenetics/`.
  - `..._part1/`: Contains `*_QC1.*` files, logs, sex mismatch info.
  - `..._part2/`: Contains `*_QC3.*` files (final QC'd dataset), MDS plots, outlier lists, logs.
  - `..._part3/`: Contains PCA results, summary stats/reports, logs, and the final `*_ENIGMA-DTI_FilesToSend.zip` archive.
- Logs: Overall pipeline logs are in `processed_genotype/logs/<jobid>_napls_qc_master/`. Logs specific to each step are within the step's output directory.
- Final Summary: A comprehensive summary of the pipeline run is generated at `processed_genotype/napls3_qc_pipeline_summary.txt`.
- Job Failures: Check the `.log` file corresponding to the `qsub` job ID in the relevant log directory (master log or step-specific log).
- Prerequisite Errors: Ensure all required software is installed/loaded and input files exist and are accessible. The master script performs checks at the start.
- File Not Found: Verify that output files from previous steps were generated correctly and that paths used in subsequent scripts are accurate. The master script attempts to find these dynamically.
- R Script Errors: Check the R script output within the main log file (`*_run.log`) for specific error messages, often related to missing packages or data format issues.
- PLINK Errors: Consult the PLINK `.log` files generated within the step's output directory for detailed error messages.
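For prerequisite errors, a quick manual check before submitting can save a failed job. The sketch below is a hypothetical helper (`check_prereqs` is not part of the pipeline). It treats each argument as either a command on `PATH` or a file path, and it does not reproduce the checks the master script itself performs.

```shell
#!/usr/bin/env bash
# Sketch: report which required tools or files are missing.
# Assumption: each argument is either a command name or a filesystem path.

check_prereqs() {
    local missing=0 item
    for item in "$@"; do
        if command -v "$item" >/dev/null 2>&1 || [ -e "$item" ]; then
            echo "ok       $item"
        else
            echo "MISSING  $item"
            missing=1
        fi
    done
    # Non-zero exit status if anything was missing.
    return "$missing"
}

# Demo with tools that exist on most systems; substitute the real tool and input paths.
check_prereqs bash awk || echo "Fix the items marked MISSING before submitting."
```

On Hoffman2 the arguments would be the PLINK and `rsid_tools` paths and the raw `.bed`/`.bim`/`.fam` files listed earlier.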
- ENIGMA-DTI QC Protocol Summary (Aug 2024)
- ENIGMA-DTI Quality Control (QC) Protocol
- ENIGMA_DTI_GWAS GitHub Repository
- ENIGMA Genetics Protocols Overview
- ENIGMA Genetics GitHub Repository
- Michigan Imputation Server 2
- TOPMed Imputation Server
- Sanger Imputation Service
- Kiel EagleImp-web Imputation Server
  - EagleImp: fast and accurate genome-wide phasing and imputation in a single tool
  - "For common variants investigated in typical genome-wide association studies, EagleImp provided same or higher imputation accuracy than the Sanger Imputation Service, Michigan Imputation Server and the newly developed TOPMed Imputation Server, despite larger (not publicly available) reference panels."
  - EagleImp GitHub
- Helmholtz Munich Imputation Server (HMIS)
- The ENIGMA group wanted access to our NAPLS genotype data to perform a diffusion imaging GWAS.
- We initially identified an older ENIGMA Genetics processing pipeline, which has since been updated to the current version specified below.
ENIGMA-DTI Quality Control (QC) Protocol - Pre Imputation Tasks/Info
- The NAPLS3 data has been downloaded
- NAPLS2 data will be downloaded soon, and processed after a pipeline has been established and validated for NAPLS3
- Start with N3
- The data is already formatted in ENIGMA's required genome build: hg19/GRCh37
- The data was genotyped using the Illumina Infinium Global Screening Array-24 chip, referred to as `GSAMD-24v1-0_20011747_A1` in the raw data
- bed, bim, and fam files are located here: `/u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/NAPLS3_n710*`
- Before starting the ENIGMA-DTI Quality Control (QC) Protocol - Pre Imputation, SNP names must be changed:
  - If you look in the .bim file, you'll see 6 columns: https://www.cog-genomics.org/plink/1.9/formats#bim
  - Convert to `rs` format
    - ENIGMA may request `chr:bp` format [chromosome number : base-pair location]
    - `rs` is a good place to start and will make things cleaner
    - Most of them are already in `rs` format
      - Some have a `GSA-` prefix
  - Make a new text file with two columns
    - old variant name [currently in the bim file]
    - new variant name
  - Remove the `GSA-` prefix from variant names and transfer the rs # to the new (second) column.
  - There are a couple of wonky ones that are in `chr:bp` format or some other format.
    - For these, you can honestly probably get away with keeping them like that
    - Otherwise you can look here [list of `rs` ids linked to their `chr:bp` format for build `hg19`]: `/u/project/cbearden/hughesdy/NAPLS/rsDict/hg19/noDups/AllChr_Sorted_Tabdelim_nochr.txt`
      - Because the base-pair location is listed after the chromosome in the first column of this file, you can use that information to match it to the corresponding SNP in `chr:bp` format in the NAPLS data.
      - Then add the new name to the renaming file
  - More information on renaming [--update-name documentation]: https://www.cog-genomics.org/plink/1.9/data#update_map
- Path for plinkv1.9: `/u/project/cbearden/hughesdy/software/plinkv1.9/plink`
- Path for plinkv2.0: `/u/project/cbearden/hughesdy/software/plink2`
  - They're pretty much the same, but plink2 can work with more efficient versions of the bed/bim/fam files.
  - Should only need v1.9 though
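The lookup step in these notes (matching `chr:bp` names against the rs dictionary) could be sketched as below. The function name `map_chrbp_to_rsid` and the demo rows are made up, and the dictionary layout is an assumption: `chr:bp` in column 1, as the notes describe, and the rsID in column 2, which the notes do not state.

```shell
#!/usr/bin/env bash
# Sketch: map .bim variants that lack an rs name to rsIDs via a chr:bp dictionary.
# Assumption: dictionary column 1 is chr:bp and column 2 is the rsID.

map_chrbp_to_rsid() {
    local lookup="$1" bim="$2"
    awk 'BEGIN { OFS = "\t" }
         NR == FNR { rsid[$1] = $2; next }        # first file: remember chr:bp -> rsID
         { key = $1 ":" $4 }                      # .bim: chromosome and base-pair position
         (key in rsid) && $2 !~ /^rs/ {           # only rename variants without an rs name
             print $2, rsid[key]
         }' "$lookup" "$bim"
}

# Made-up demo data, for illustration only.
lookup="$(mktemp)"; bim="$(mktemp)"
cat > "$lookup" <<'EOF'
1:1000 rs900001
2:2000 rs900002
EOF
cat > "$bim" <<'EOF'
1 1:1000 0 1000 A G
2 rs900002 0 2000 C T
3 3:3000 0 3000 G A
EOF

map_chrbp_to_rsid "$lookup" "$bim" > rename_map_demo.txt
cat rename_map_demo.txt
rm -f "$lookup" "$bim" rename_map_demo.txt

# The resulting two-column file could then be applied with PLINK 1.9, e.g.:
#   plink --bfile NAPLS3_n710 --update-name rename_map.txt --make-bed --out NAPLS3_n710_renamed
```

Variants with no dictionary entry (the third demo row) are left out of the map and keep their original names, in line with the note that keeping them as-is is acceptable.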