
NAPLS3 Genomic Data Pre-Imputation QC Workflow (ENIGMA-DTI Protocol Implementation)

1. Overview

This document provides a comprehensive, step-by-step explanation of the pre-imputation Quality Control (QC) workflow applied to the North American Prodrome Longitudinal Study Phase 3 (NAPLS3) genomic dataset. The implementation, executed via shell scripts on the Hoffman2 cluster, closely follows the ENIGMA-DTI QC Protocol (v1.1, July 2024).

Goal: To rigorously prepare the raw NAPLS3 genotype data (Genome Build: hg19/GRCh37) for subsequent imputation by standardizing variant identifiers (SNP renaming) and applying the specific QC filters mandated by the ENIGMA-DTI protocol.

Orchestration: The entire pipeline is managed by the master script run_napls_qc.sh. This script coordinates the execution of several modular scripts, each responsible for a specific stage of the workflow. It manages directories, handles job submission parameters for Hoffman2's SGE scheduler, verifies prerequisites, uses checkpointing (napls3_qc_checkpoint.txt) for resumability, and logs progress.

2. Workflow Stages

The pipeline is divided into distinct stages, each implemented by a dedicated script:

Stage 0: Create rsID Binaries (Optional Setup)

  • Script: 01_create_rsid_binaries.sh
  • Protocol Relevance: This is a preparatory step, not explicitly part of the ENIGMA-DTI protocol document, but necessary for the chosen SNP renaming tool (rsid_tools).
  • Purpose: To generate efficient binary index files from a comprehensive dbSNP VCF (Build 156 for GRCh37). These binaries allow the rsid_tools annotate command (used in Stage 1) to rapidly map variant coordinates (chromosome, position, alleles) to their corresponding reference SNP cluster IDs (rsIDs).
  • How it Works (Hoffman2 Implementation):
    1. Environment Setup: Loads necessary Hoffman2 modules (parallel, bcftools, htslib). Defines directories, including $SCRATCH for intensive I/O and $OUTPUT_DIR for final binaries.
    2. dbSNP Acquisition (get_dbsnp_files):
      • Checks $SCRATCH for existing dbSNP VCF (GCF_000001405.25.gz) and index (.tbi). If found, creates symbolic links.
      • Checks EXISTING_VCF_DIR (if specified) and copies files if found.
      • If not found locally, downloads the VCF and index from the NCBI FTP site using curl.
    3. VCF Verification (verify_vcf): Uses bcftools view -h to quickly check if the downloaded/linked VCF header is readable, ensuring the file is not corrupted before processing.
    4. Parallel Chromosome Extraction: Uses parallel and bcftools view -r <region> to extract data for each chromosome (1-22, X, Y, M) from the main dbSNP VCF into separate compressed VCF files (chr{}.vcf.gz) within $TEMP_DIR. This breaks down the large VCF for parallel processing. NCBI contig names (e.g., NC_000001.10) are mapped to simple chromosome numbers/letters.
    5. Parallel Binary Creation: Uses parallel again to process each chromosome's extracted VCF (see the sketch after this stage):
      • bcftools query: Extracts relevant fields (CHROM, ID, POS, REF, ALT).
      • awk: Replaces the NCBI chromosome name with the simple name (e.g., '1', 'X').
      • sed 's/rs//g': Removes the 'rs' prefix from rsIDs, as expected by rsid_tools make_bin.
      • sort: Sorts the data numerically by position.
      • bgzip: Compresses the sorted TSV file.
      • rsid_tools make_bin: Creates the .hash2rsid.bin and .rsid2pos.bin files for the chromosome from the compressed TSV.
    6. Transfer & Cleanup: Moves the generated .bin files from $TEMP_DIR to the final $OUTPUT_DIR. Optionally cleans up temporary files in $SCRATCH.
  • Why this Stage: Pre-generating these binaries significantly speeds up the SNP renaming process in the next stage, which would otherwise involve much slower lookups in the large text-based dbSNP VCF. Parallelization leverages Hoffman2's multi-core nodes.
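
The following is a minimal sketch of the per-chromosome pipeline from step 5 of this stage. The build_chr_bin function name is invented for illustration, and the rsid_tools make_bin invocation is an assumption based on the description above; consult the tool's own help for the actual interface.

```bash
# Illustrative per-chromosome binary creation (Stage 0, step 5). The pipeline
# maps the NCBI contig name to a simple chromosome name, strips the 'rs'
# prefix, sorts numerically by position, compresses the TSV, and builds the
# binaries. The make_bin invocation is assumed, not taken from the script.
build_chr_bin() {
  local chr=$1
  bcftools query -f '%CHROM\t%ID\t%POS\t%REF\t%ALT\n' "$TEMP_DIR/chr${chr}.vcf.gz" |
    awk -v c="$chr" 'BEGIN{OFS="\t"} {$1=c; print}' |
    sed 's/rs//g' |
    sort -t$'\t' -k3,3n |
    bgzip > "$TEMP_DIR/chr${chr}.tsv.gz"
  rsid_tools make_bin "$TEMP_DIR/chr${chr}.tsv.gz"
}
export -f build_chr_bin
export TEMP_DIR
parallel build_chr_bin ::: {1..22} X Y M
```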

Stage 1: SNP Renaming

  • Script: 01_rename_snps_direct.sh
  • Protocol Relevance: Addresses the implicit requirement for standardized variant IDs before merging with reference panels (like HapMap3 in Stage 3). While the protocol doesn't mandate a specific tool, consistent naming (preferably rsIDs) is crucial.
  • Purpose: To update the variant identifiers in the NAPLS3 .bim file. It prioritizes mapping variants to rsIDs using the dbSNP binaries created in Stage 0. If a variant cannot be mapped to an rsID, it assigns a composite key in the format chr:pos:ref:alt. This ensures every variant has a unique and informative identifier.
  • How it Works (Hoffman2 Implementation):
    1. Environment Setup: Loads parallel module. Defines paths to tools (plink2, rsid_tools), input/output directories, and the location of the dbSNP binaries ($RS_BIN_DIR). Uses $SCRATCH for intermediate files.
    2. Input Copying: Copies the raw NAPLS3 .bed, .bim, .fam files to a job-specific $SCRATCH_DIR.
    3. Binary Linking (setup_links): Creates symbolic links in $SCRATCH_DIR/bin_links pointing to the actual binary files in $RS_BIN_DIR. This avoids potential issues with rsid_tools path length limits or special characters.
    4. BIM Preprocessing: Uses awk to read the input .bim file. It filters for valid SNPs (numeric position, ACGT alleles), handles chromosome names (mapping PAR1/PAR2 to X), creates a composite key (chr:pos:ref:alt using REF:ALT alleles), and outputs a temporary map (preprocessed_map.txt) containing OriginalID <tab> CompositeKey. The awk expression !seen[$2]++ keeps only the first occurrence of each composite key.
    5. Parallel Annotation (annotate_chromosome function called by parallel):
      • For each chromosome (1-22, X, Y):
        • awk: Extracts the composite keys belonging to that chromosome from preprocessed_map.txt into chr{}_ids.txt.
        • rsid_tools annotate: Uses the dbSNP binaries (via links) to find rsIDs for the composite keys in chr{}_ids.txt. Outputs results to chr{}/annotated/hash2rsid*.tsv.
        • awk: Merges the rsid_tools output (CompositeKey, FoundID, rsID/.) with the preprocessed_map.txt to create a chromosome-specific map (chr{}_map.txt) containing OriginalID <tab> FinalID (where FinalID is the rsID if found, otherwise the OriginalID).
    6. Map Combination: Concatenates all chr{}_map.txt files and uses awk '!seen[$1]++' to remove duplicate OriginalIDs introduced during parallel processing. Any OriginalIDs from preprocessed_map.txt absent from the concatenated map (variants that rsid_tools could not process or map) are appended with their OriginalID as the FinalID, so every original variant is accounted for. Saves the complete map as final_snp_rename.txt (steps 6-7 are sketched after this stage).
    7. PLINK Renaming: Uses plink2 --update-name with the final_snp_rename.txt file (specifying columns 1 and 2 for old and new IDs) to create the new, renamed PLINK fileset (NAPLS3_n710_renamed*.{bed,bim,fam}). --merge-par handles the PAR regions correctly. --rm-dup force-first list removes variants with duplicate positions, keeping the first one encountered and listing removed duplicates. --not-chr MT excludes the mitochondrial chromosome.
    8. Output Comparison: Uses awk to compare the original .bim and the newly created renamed .bim. It reconstructs the composite key from the original BIM and checks if it exists in the renamed BIM, listing any variants present in the original but missing in the renamed output (missing_variants*.txt).
    9. Result Transfer: Copies the final renamed PLINK fileset, the final map, the duplicate list, and the missing variants list from $SCRATCH back to the $WORK_DIR.
  • Why this Stage: Standardizes variant IDs for compatibility with reference datasets (HapMap3) and downstream tools. Using rsIDs where possible is standard practice. Composite keys provide unique identifiers for unmapped variants. rsid_tools + parallel offers an efficient way to perform the mapping on Hoffman2. plink2 is used for the actual file update.
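
A minimal sketch of steps 6-7, assuming the file names used in the description above (the real outputs carry timestamps). The --update-name column arguments follow plink2's new-ID-column-then-old-ID-column convention, matching the OriginalID <tab> FinalID map layout.

```bash
# Combine per-chromosome maps, keeping the first entry per OriginalID.
cat chr*_map.txt | awk '!seen[$1]++' > final_snp_rename.txt
# Append variants rsid_tools could not map, with OriginalID as FinalID.
awk 'NR==FNR {done[$1]; next} !($1 in done) {print $1 "\t" $1}' \
  final_snp_rename.txt preprocessed_map.txt > unmapped_ids.txt
cat unmapped_ids.txt >> final_snp_rename.txt
# Rename, merge PARs, drop positional duplicates (keeping the first),
# and exclude the mitochondrial chromosome.
plink2 --bfile NAPLS3_n710 \
  --update-name final_snp_rename.txt 2 1 \
  --merge-par \
  --rm-dup force-first list \
  --not-chr MT \
  --make-bed --out NAPLS3_n710_renamed
```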

Stage 2: ENIGMA-DTI QC Part 1 (Protocol Steps 1-3)

  • Script: 02_enigma_dti_qc_napls3_part1.sh
  • Protocol Relevance: Directly implements the initial QC steps outlined in the ENIGMA-DTI protocol: filtering for DTI availability, ensuring correct sex/phenotype coding, splitting the X chromosome, applying basic SNP/sample filters, and performing a sex check.
  • Purpose: To perform initial data cleaning and filtering specific to the ENIGMA-DTI project requirements, focusing on sample selection, data consistency, and basic quality thresholds before more intensive checks.
  • How it Works (Hoffman2 Implementation):
    1. Environment Setup: Loads parallel. Defines paths, including input files (renamed PLINK set from Stage 1, phenotype/DTI info files) and output directories ($TEMP_DIR, $FINAL_DIR). Sets the output prefix (ANC_DATA). Copies input PLINK files to $TEMP_DIR.
    2. Phenotype/Sex Update:
      • Reads the main phenotype file (NAPLS3_Terra_samplestab_phenofile.txt) using awk to create a map (sex_map.txt) of internal ID to sex code (1=Male, 2=Female).
      • Uses awk again to update the 5th column (sex) in the input .fam file (input.fam) based on the sex_map.txt, saving as input_sex_updated.fam.
    3. DTI Subject Filtering:
      • Creates an ID map (id_map.txt) linking internal IDs to subject IDs (awk on PHENO_FILE).
      • Extracts subject IDs with DTI data from napls3_MS_diffusion.csv (awk on DTI_FILE, saving as dti_ids_raw.txt).
      • Uses join to find the intersection between the id_map.txt and dti_ids_raw.txt, outputting a list of internal IDs (FID and IID) for subjects with DTI data (dti_fid_list.txt).
      • Uses plink1.9 --keep with dti_fid_list.txt on the sex-updated input data to create a dataset (*_dti.*) containing only individuals with DTI data. (Protocol Step: "Very important Please remove individuals who do not have DTI data available...")
    4. Phenotype Recoding:
      • Creates a phenotype map (pheno_map.txt) from PHENO_FILE mapping internal ID to case/control status (1=Control, 2=Case) (awk).
      • Uses awk to update the 6th column (phenotype) in the *_dti.fam file based on pheno_map.txt. Defaults to 1 (Control) if not found in the map. (Protocol Step: "assign phenotype... coding controls... as 1 and cases as 2")
    5. X Chromosome Splitting:
      • Uses plink1.9 --split-x b37 no-fail on the *_dti dataset to handle pseudo-autosomal regions (PARs) correctly, creating *_splitx.*. It attempts hg19 if b37 fails. (Protocol Step: "Split-X to deal with pseudo-autosomal regions")
    6. Initial Filtering & Pruning:
      • Applies basic SNP/sample QC filters to the *_splitx dataset using plink1.9: --mind 1 (per-sample missingness threshold of 100%, i.e., no sample-level filter), --geno 0.01 (removes SNPs missing >1% of calls), --maf 0.05 (removes SNPs with MAF < 5%), --hwe 1e-06 (removes SNPs failing the HWE test at p < 1e-6, computed in controls only when cases are present). Creates *_filtered.*. (Protocol Step: Implicit in filtering before the sex check, though thresholds differ slightly from the protocol example's pre-sexcheck filter)
      • Performs LD pruning using plink1.9 --indep-pairphase 20000 2000 0.5 (window of 20,000 variants, step of 2,000 variants, r2 threshold 0.5) in parallel across chromosomes (run_parallel_pruning function) on the *_filtered dataset. Creates *_pruned.prune.in. (Protocol Step: Uses --indep-pairphase as in protocol example)
      • Extracts the pruned SNPs using plink1.9 --extract to create the *_pruned.* dataset.
    7. Sex Check:
      • Performs sex check on the *_splitx dataset using plink1.9 --check-sex 0.2 0.8. This compares reported sex (from .fam) with genetic sex inferred from X chromosome homozygosity (F-statistic). Thresholds 0.2/0.8 define boundaries for female/male calls. Creates *_sexcheck.sexcheck. (Protocol Step: "plink --bfile ${datafile}_splitX --check-sex 0.2 0.8")
      • Uses grep PROBLEM to extract individuals with conflicting sex information into sex_mismatches.txt (steps 7-8 are sketched after this stage).
    8. Sex Mismatch Removal:
      • If sex_mismatches.txt is not empty, uses plink1.9 --remove to exclude these individuals from the *_splitx dataset, creating the final *_QC1.* dataset for this stage. (Protocol Step: "plink --bfile ${datafile} --remove sex.drop --make-bed --out ${datafile}_QC1")
      • If no mismatches, *_QC1.* is simply a copy of *_splitx.*.
    9. Reporting: Calculates final counts (cases, controls, X SNPs) and generates summary files.
  • Why this Stage: This stage enforces initial data quality and consistency according to ENIGMA's requirements. Filtering by DTI availability focuses the analysis. Correct sex/phenotype coding is essential for checks and association studies. Splitting X is crucial for accurate sex inference. Basic filtering removes unreliable data points. The sex check identifies and removes sample mix-ups or incorrect reporting.
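
A minimal sketch of the sex check and mismatch removal (steps 7-8), using the protocol's ${datafile} placeholder; the sex.drop intermediate follows the protocol quote in step 8.

```bash
# Compare reported sex against X-chromosome homozygosity (F-statistic);
# F < 0.2 is called female, F > 0.8 male.
plink --bfile ${datafile}_splitx --check-sex 0.2 0.8 --out ${datafile}_sexcheck
grep PROBLEM ${datafile}_sexcheck.sexcheck > sex_mismatches.txt
if [ -s sex_mismatches.txt ]; then
  awk '{print $1, $2}' sex_mismatches.txt > sex.drop   # FID, IID
  plink --bfile ${datafile}_splitx --remove sex.drop \
    --make-bed --out ${datafile}_QC1
else
  for ext in bed bim fam; do
    cp ${datafile}_splitx.$ext ${datafile}_QC1.$ext
  done
fi
```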

Stage 3: ENIGMA-DTI QC Part 2 (Protocol Steps 4-7)

  • Script: 02_enigma_dti_qc_napls3_part2.sh
  • Protocol Relevance: Implements ENIGMA-DTI steps for identifying and removing duplicate samples, assessing relatedness, merging with the HapMap3 reference panel, performing Multi-Dimensional Scaling (MDS) for ancestry visualization, and removing ancestry outliers.
  • Purpose: To refine the dataset by removing technical duplicates, assessing cryptic relatedness, and ensuring the sample consists of individuals with the target ancestry (European, in this case) by comparing against a standard reference panel (HapMap3).
  • How it Works (Hoffman2 Implementation):
    1. Environment Setup: Loads aria2, parallel, R. Defines paths, including the input directory (output of Part 1), $TEMP_DIR, $FINAL_DIR, HapMap3 URL/directory. Copies the *_QC1.* files from Part 1 input to $TEMP_DIR.
    2. Duplicate/Relatedness Pruning:
      • Applies basic filtering (--mind 0.1, --geno 0.01, --maf 0.05) to *_QC1 creating *_QC1tmp. Note --mind 0.1 is looser than later filters.
      • Performs LD pruning (plink1.9 --indep-pairwise 100 5 0.2) on *_QC1tmp creating *_QC1pruned.*. (Protocol Step: Uses --indep-pairwise as in protocol)
    3. Duplicate Check & Removal:
      • Calculates pairwise Identity-By-Descent (IBD) using plink1.9 --genome --min 0.9 on the pruned data (*_QC1pruned), outputting pairs with PI_HAT > 0.9 to pihat_duplicates.genome. (Protocol Step: "plink --bfile ${datafile}_QC1pruned --genome --min 0.9")
      • Uses awk to extract one individual (FID, IID) from each duplicate pair found in pihat_duplicates.genome, excluding known monozygotic twin pairs specified in the script (mz_twins array). Saves this list to pihat_duplicates.txt.
      • Counts the number of individuals to be removed (dup_count) and saves it.
      • If dup_count > 0, uses plink1.9 --remove with pihat_duplicates.txt on the unpruned *_QC1 dataset to create *_QC2.*. (Protocol Step: "plink --bfile ${datafile}_QC1 --remove pihat_duplicates.txt --make-bed --out ${datafile}_QC2")
      • If dup_count == 0, links *_QC1.* files to *_QC2.*.
    4. Relatedness Check:
      • Calculates pairwise IBD using plink1.9 --genome --min 0.25 --max 0.9 on the pruned data (*_QC1pruned), outputting pairs with 0.25 < PI_HAT < 0.9 to pihat_relatedness.genome. (Protocol Step: "plink --bfile ${datafile}_QC1pruned --genome --min 0.25 --max 0.9")
      • Counts the number of related pairs (rel_count) and saves it. These individuals are not removed per the protocol.
    5. HapMap3 Preparation:
      • Downloads HapMap3 reference data (HM3_b37.{bed,bim,fam}.gz) using aria2c if not already present in $HAPMAP_DIR. Uses flock for safe concurrent downloads if multiple jobs run. Decompresses using pigz. Creates HM3_b37.snplist.txt. (Protocol Step: "Download the following 3 files...")
    6. MDS Filtering & Preparation:
      • Applies stricter QC filters to *_QC2 for MDS: plink1.9 --mind 1 (no sample missingness filter), --hwe 1e-6, --geno 0.05, --maf 0.01. Creates *_QC2_filtered.*. (Protocol Step: "plink --bfile ${datafile}_QC2 --mind 1 --hwe 1e-6 --geno 0.05 --maf 0.01")
      • Extracts SNPs present in HapMap3 from *_QC2_filtered using plink1.9 --extract HM3_b37.snplist.txt, creating *_QC2local.*. (Protocol Step: "plink --bfile ${datafile}_QC2_filtered --extract HM3_b37.snplist.txt")
      • Uses awk to identify and exclude ambiguous SNPs (A/T, C/G) from *_QC2_filtered.bim, saving the list of unambiguous SNPs to local.snplist.txt. (Protocol Step: "awk '{ if (($5=="T" && $6=="A")...")
      • Extracts these unambiguous SNPs from the HapMap3 data using plink1.9 --extract local.snplist.txt, creating HM3_b37_external.*. (Protocol Step: "plink --bfile HM3_b37 --extract local.snplist.txt")
      • Identifies and excludes multi-allelic SNPs found within the local data (*_QC2local.bim) before merging.
      • Uses plink1.9 --flip-scan to identify potential strand flips between local data and reference. Flips necessary SNPs using plink1.9 --flip.
    7. Merging:
      • Attempts to merge the prepared local data (*_QC2local_flipped or *_QC2local_no_multi) with the prepared HapMap3 data (HM3_b37_external_no_multi) using plink1.9 --bmerge. (Protocol Step: "plink --bfile ${datafile}_QC2local --bmerge HM3_b37_external...")
      • If the merge fails due to mismatching SNPs (a -merge.missnp file is created), it excludes these problematic SNPs (plink1.9 --exclude) and retries the merge, as sketched after this stage. (Protocol Step: Handles merge errors similarly to the protocol's alternatives)
    8. MDS Calculation:
      • Performs MDS on the successfully merged dataset (*_QC2local_HM3b37merge) using plink1.9 --cluster --mind .05 --mds-plot 10 --extract local.snplist.txt. Calculates 10 MDS components. (Protocol Step: "plink --bfile ${datafile}_QC2local_HM3b37merge --cluster --mind .05 --mds-plot 10")
      • Formats the MDS output (.mds) into TSV and CSV formats for R.
    9. Outlier Identification & Plotting (R Script):
      • Executes the R script mds_plot.R.
      • Loads MDS data (.csv). Assigns population labels (NAPLS3 cohort vs. HapMap3 populations).
      • Generates a PDF plot (mdsplot_*_outliersincluded.pdf) showing MDS components C1 vs C2, color-coded by population.
      • Defines specific thresholds for C1 (-0.06 to -0.04) and C2 (0.055 to 0.07) to isolate the CEU/TSI (European) cluster, as specified in the script comments reflecting ENIGMA guidance.
      • Flags individuals from the NAPLS3 cohort falling outside these C1/C2 boundaries as outliers.
      • Writes lists of outliers (*_pop_strat_mds.outlier.txt) and non-outliers/Europeans (*_pop_strat_mds.eur.txt).
      • Generates a second PDF plot (mdsplot_*_outliersexcluded.pdf) showing only the HapMap3 reference and the non-outlier NAPLS3 individuals. (Protocol Step: R code section for plotting and outlier identification)
    10. Outlier Removal:
      • If outliers were identified, uses plink1.9 --keep with the *_pop_strat_mds.eur.txt list on the *_QC2_filtered dataset (the dataset before merging with HapMap3) to create the final *_QC3.* dataset for this stage. (Protocol Step: "plink --bfile ${datafile}_QC2 --keep ${datafile}_pop_strat_mds.eur.txt --make-bed --out ${datafile}_QC3")
      • If no outliers were found, links *_QC2_filtered.* to *_QC3.*.
    11. Summary Report: Generates a text file summarizing removals during this stage.
  • Why this Stage: Removes technical artifacts (duplicates). Assesses sample structure (relatedness, ancestry). Ensures the final dataset used for imputation and association testing primarily consists of the target ancestry (European), minimizing confounding due to population stratification, by comparing to HapMap3 and removing individuals distant from the CEU/TSI cluster in MDS space.
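
A minimal sketch of the merge-and-retry logic from step 7; the intermediate prefixes are assumptions. plink1.9 writes mismatching variants to <out>-merge.missnp and exits nonzero when a merge fails.

```bash
loc=${datafile}_QC2local_flipped      # prepared local data
ref=HM3_b37_external_no_multi         # prepared HapMap3 reference
out=${datafile}_QC2local_HM3b37merge
if ! plink --bfile "$loc" --bmerge "$ref" --make-bed --out "$out"; then
  # Drop the variants plink flagged as mismatching, then retry once.
  plink --bfile "$loc" --exclude "${out}-merge.missnp" \
    --make-bed --out "${loc}_clean"
  plink --bfile "$ref" --exclude "${out}-merge.missnp" \
    --make-bed --out "${ref}_clean"
  plink --bfile "${loc}_clean" --bmerge "${ref}_clean" \
    --make-bed --out "$out"
fi
```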

Stage 4: ENIGMA-DTI QC Part 3 (Protocol Steps 8-9 & Packaging)

  • Script: 02_enigma_dti_qc_napls3_part3.sh
  • Purpose: Finalizes the QC process by generating PCA covariates for downstream association analyses, calculating comprehensive pre- and post-QC summary statistics, creating summary reports, and packaging the essential output files required by ENIGMA.
  • Protocol Alignment: Directly implements ENIGMA-DTI Steps 8 ("Get genetic principal components") and 9 ("Cohort QC summary data"), plus the final data packaging instructions.
  • Execution Details:
    1. Setup: Creates necessary directories (TEMP_DIR, ANC_DIR, FINAL_DIR, OUTPUT_ALL, OUTPUT_FINAL, SCRIPT_DIR, LOG_DIR). Defines constants and paths, including locating the output directories from Part 1 and Part 2 using ls -t and grep. Creates R scripts (pca_plot.R, summary_report.Rmd) dynamically in $SCRIPT_DIR. Sets up a trap cleanup EXIT INT TERM to ensure results are saved even if the script is interrupted. Initializes log files.
    2. Copy Inputs: Uses find and transfer_files (an rsync wrapper) to copy required files from Part 1 (*_QC1.fam, *_QC1.bim, sex_mismatches.txt renamed to sexcheck_PROBLEM.txt, *_QC_summary.txt renamed to snp_count_X.txt) and Part 2 (mdsplot*.pdf, *duplicates_count.txt, *relatedness_count.txt, *QC2_filtered.fam/log, *pop_strat_mds.outlier.txt, *QC3.bed/bim/fam, local.snplist.txt) into the temporary ANC_DIR. This centralizes inputs for this stage.
    3. Step 8: Generate PCA Covariates:
      • Runs plink1.9 --pca on the final QC'd dataset (*_QC3) using the common SNP list (local.snplist.txt) generated in Part 2. This calculates the top 20 principal components by default, outputting .eigenval and .eigenvec files to ANC_DIR.
      • Executes the pca_plot.R script using Rscript. This script reads the .eigenval file, calculates the variance explained by each PC, and uses ggplot2 to generate a scree plot (screeplot_*.pdf), saving it to ANC_DIR.
    4. Step 9: Generate Summary Statistics:
      • Calls the generate_stats function for pre-QC stats: Reads the *_QC1.fam file (copied from Part 1), uses awk to count cases/controls and males/females within each group, calculates proportions using bc, and writes the results to *_basic_stats_preQC.txt in ANC_DIR (the counting logic is sketched after this list).
      • Calls generate_stats function for post-QC stats: Reads the *_QC3.fam file (copied from Part 2), performs the same counts and calculations as above, and writes results to *_basic_stats_postQC.txt in ANC_DIR.
      • Calls generate_snp_summary: Creates *_qc_summary.txt in ANC_DIR. It populates this file by:
        • Counting lines (wc -l) in *_QC1.bim and *_QC1.fam for pre-QC SNP/sample counts.
        • Grepping specific lines from the *_QC2_filtered.log (copied from Part 2) to extract counts of SNPs/samples removed due to missingness (--geno, --mind), MAF (--maf), and HWE (--hwe). Uses awk to get the numeric count. Handles cases where counts might be zero or missing (0 is output).
        • Calculating the number of samples removed as MDS outliers by subtracting the line count of *_QC3.fam from *_QC2_filtered.fam.
        • Counting lines in *_QC3.bim and *_QC3.fam for post-QC SNP/sample counts.
      • Calls generate_summary_report_txt: Creates *_QC3_summary.txt in ANC_DIR. This provides a human-readable summary including initial/final sample counts, counts of removed duplicates and outliers, final case/control numbers, final SNP count, and the specific MDS outlier thresholds used (hardcoded in the script based on Part 2 R script).
    5. Generate PDF Summary Report:
      • Calls generate_summary_report_pdf. This function first copies the text summary (*_QC3_summary.txt) into the $SCRIPT_DIR as summary_report.txt.
      • It then executes an Rscript command that uses the rmarkdown package to render the summary_report.Rmd file (created during setup). The Rmd file simply includes the content of summary_report.txt.
      • The R script includes logic to install tinytex (a LaTeX distribution) if needed and handles potential rendering errors.
      • The resulting summary_report.pdf is saved in $SCRIPT_DIR and then copied to $OUTPUT_ALL.
    6. Packaging and Cleanup:
      • The cleanup function (triggered by the trap at the end of the script or on interruption) performs the final packaging.
      • It copies all generated files from ANC_DIR and SCRIPT_DIR into $OUTPUT_ALL. It also copies logs from Part 1 and Part 2 directories into $OUTPUT_ALL for a complete archive.
      • It then identifies the specific files required for ENIGMA submission (based on the protocol's list: specific logs, stats files, MDS plots) by searching ANC_DIR and OUTPUT_ALL.
      • It copies these required files into $OUTPUT_FINAL, renaming the MDS plots to include the $ANC_DATA prefix for consistency.
      • It checks if the number of files in $OUTPUT_FINAL matches the expected count (27, or 26 if *QC2.log wasn't generated) and logs a warning if there's a mismatch.
      • It logs any missing required files to missing_submission_files.txt.
      • Finally, it creates the *_ENIGMA-DTI_FilesToSend.zip archive containing the contents of $OUTPUT_FINAL using zip -r -j (the -j flag junks the paths, putting all files in the root of the zip).
    • Key Tools: plink1.9, R (data.table, ggplot2, rmarkdown, tinytex, knitr, xfun), bc, awk, find, grep, wc, rsync, zip.
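
As an illustration of the counting logic in step 4, here is a minimal stand-in for generate_stats; the real function's body may differ. It assumes standard .fam coding (column 5 sex: 1=male, 2=female; column 6 phenotype: 1=control, 2=case).

```bash
generate_stats() {
  local fam=$1 out=$2
  # Count cases/controls and the sex breakdown within each group.
  awk '
    $6 == 2 { cases++;    if ($5 == 1) cm++; else if ($5 == 2) cf++ }
    $6 == 1 { controls++; if ($5 == 1) km++; else if ($5 == 2) kf++ }
    END {
      printf "Cases: %d (males %d, females %d)\n", cases, cm, cf
      printf "Controls: %d (males %d, females %d)\n", controls, km, kf
    }' "$fam" > "$out"
  # Proportions are computed with bc, as in the real script.
  local n_cases n_total
  n_cases=$(awk '$6 == 2' "$fam" | wc -l)
  n_total=$(wc -l < "$fam")
  echo "Case proportion: $(echo "scale=3; $n_cases / $n_total" | bc)" >> "$out"
}
generate_stats "${ANC_DATA}_QC3.fam" "${ANC_DATA}_basic_stats_postQC.txt"
```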

3. Prerequisites

3.1 Software & Tools

  • PLINK v1.9: /u/project/cbearden/hughesdy/software/plinkv1.9/plink (Used for most QC steps)
  • PLINK v2.0: /u/project/cbearden/hughesdy/software/plink2 (Used in Stage 1 for --update-name and --rm-dup)
  • rsid_tools: $HOME/apps/rsid_tools/bin/rsid_tools (Used in Stage 1 for SNP mapping; requires separate installation)
  • R: v4.2.2+ (Loaded via module R/4.2.2-BIO)
    • Required R Packages: data.table, ggplot2, calibrate (for MDS plot), rmarkdown, tinytex, knitr, xfun (for PDF report). Scripts attempt installation via install.packages if missing.
  • Hoffman2 Modules: parallel, bcftools, htslib (for Stage 0), aria2 (for HapMap download in Stage 3).
  • Standard Unix Utilities: bash, awk, grep, sort, join, curl, zip, rsync, timeout, bc, find, tee, cat, mv, cp, ln, mkdir, rm, stat, tr, wc, pigz (or gunzip), tail, head.

3.2 Input Data

  • Raw NAPLS3 Genotypes: /u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/NAPLS3_n710.{bed,bim,fam} (Build hg19/GRCh37).
  • Phenotype/Sample Information: Located in processed_genotype/enigma/DTIgenetics/info/:
    • NAPLS3_Terra_samplestab_phenofile.txt: Subject IDs (column 3), sex (column 7), case/control status (column 8). Used in Part 1 for updating .fam file and ID mapping.
    • napls3_MS_diffusion.csv: List of subjects with DTI data (column 1, format SITE-S####). Used in Part 1 for filtering individuals (--keep).
  • dbSNP VCF (for Stage 0): dbSNP build 156 for GRCh37 (GCF_000001405.25.gz and .tbi). Downloaded automatically by 01_create_rsid_binaries.sh if not found locally or specified via EXISTING_VCF_DIR.
  • HapMap3 Reference Data (for Stage 3): HM3_b37.{bed,bim,fam}. Downloaded automatically by 02_enigma_dti_qc_napls3_part2.sh from ENIGMA website (aria2c) if not present in temp directory ($HAPMAP_DIR).

3.3 Environment

  • Hoffman2 Cluster: Scripts rely on SGE job scheduler syntax (#$) and module system (module load).
  • Directory Access: Read access to input data directories (/u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/) and write access to working directories (/u/home/c/cobeaman/project-cbearden/napls/gprep/processed_genotype/, $SCRATCH).
  • Environment Variables: The master script run_napls_qc.sh defines key paths (e.g., WORK_DIR, SCRATCH_DIR, LOG_DIR, tool paths) which are exported and inherited by the subscripts. Subscripts also define their own specific paths relative to these.

4. Execution

The pipeline is executed by submitting the master script run_napls_qc.sh to the Hoffman2 scheduler:

```bash
qsub run_napls_qc.sh
```

  • Configuration: Review and adjust environment variables within run_napls_qc.sh if necessary (e.g., paths to software, base directories).
  • Monitoring:
    • Overall pipeline progress and logs: ${LOG_DIR}/napls3_qc_run.log (where $LOG_DIR is defined in run_napls_qc.sh).
    • Individual script logs: Stored within subdirectories corresponding to each stage (e.g., ${FINAL_DIR}/logs/ for Part 3). Check both the main SGE log ($JOB_ID_*.log) and the script-specific run log (*_run.log) within these directories.
    • SGE job output: $HOME/project-cbearden/napls/gprep/processed_genotype/logs/$JOB_ID_napls_qc_master.log.
  • Resumption: The pipeline uses a checkpoint file (${LOG_DIR}/napls3_qc_checkpoint.txt) to track completed stages. If the job is interrupted, resubmitting run_napls_qc.sh will detect completed stages via the step_completed function and skip them, resuming from the first incomplete stage (sketched below).
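
A minimal sketch of the checkpoint pattern, assuming a plain-text checkpoint file with one completed stage name per line; step_completed is the master script's function name, but this body is illustrative.

```bash
CHECKPOINT="${LOG_DIR}/napls3_qc_checkpoint.txt"

step_completed() {
  # True if the stage name appears as a whole line in the checkpoint file.
  grep -qx "$1" "$CHECKPOINT" 2>/dev/null
}

run_stage() {
  local name=$1; shift
  if step_completed "$name"; then
    echo "Skipping completed stage: $name"
    return 0
  fi
  "$@" && echo "$name" >> "$CHECKPOINT"   # record success
}

run_stage stage1_rename bash 01_rename_snps_direct.sh
```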

5. Output Structure

  • SNP Renaming (Stage 1):
    • Renamed PLINK files: processed_genotype/NAPLS3_n710_renamed*.{bed,bim,fam} (Timestamped)
    • Renaming map: processed_genotype/final_snp_rename*.txt (Timestamped)
    • Missing variants list: processed_genotype/missing_variants*.txt (Timestamped)
    • Logs: processed_genotype/logs/<jobid>_rename_snps_direct/
  • ENIGMA QC (Stages 2-4): Each part creates a job-ID-based directory within DTIgenetics.
    • Part 1 (..._part1/):
      • *_QC1.{bed,bim,fam}: Dataset after initial filtering (DTI subjects), sex/phenotype update, X-split, and sex check removal.
      • sex_mismatches.txt: List of individuals removed due to sex mismatch (FID, IID).
      • *_QC_summary.txt: Basic counts (Total N, Cases, Controls, X SNPs) for the QC1 dataset.
      • Logs: logs subdirectory (includes run.log, PLINK logs).
    • Part 2 (..._part2/):
      • *_QC3.{bed,bim,fam}: Final QC'd dataset after duplicate removal and MDS outlier removal. This is the primary input for Part 3's PCA.
      • mdsplot_*.pdf: MDS plots (before and after outlier removal).
      • *_pop_strat_mds.outlier.txt: List of removed ancestry outliers (FID, IID).
      • *_pop_strat_mds.eur.txt: List of individuals kept (European ancestry cluster) (FID, IID).
      • *_QC1pruned_duplicates_count.txt, *_QC1pruned_relatedness_count.txt: Counts from duplicate/relatedness checks.
      • *_QC2_summary.txt: Text summary of removals in Part 2.
      • local.snplist.txt: List of non-ambiguous SNPs common between dataset and HapMap3, used for MDS and PCA.
      • Logs: logs subdirectory (includes run.log, PLINK logs, R script output).
    • Part 3 (..._part3/):
      • *_PCACovariates.{eigenval,eigenvec}: PCA results (eigenvalues and eigenvectors).
      • screeplot_*.pdf: PCA scree plot.
      • *_basic_stats_preQC.txt, *_basic_stats_postQC.txt: Case/control/sex counts before/after QC.
      • *_qc_summary.txt: Detailed SNP/sample removal counts across steps (missingness, MAF, HWE, MDS).
      • *_QC3_summary.txt: Text summary report of final counts and thresholds.
      • summary_report.pdf: PDF version of the summary report.
      • output_all/: Archive of most intermediate and final files from all QC steps (useful for debugging).
      • output_final/: Curated set of files required for ENIGMA submission (specific logs, stats, plots).
      • *_ENIGMA-DTI_FilesToSend.zip: Final zip archive containing output_final contents.
      • missing_submission_files.txt: List of expected submission files that were not found (if any).
      • Logs: logs subdirectory (includes run.log, PLINK logs, R script output).
  • Master Script Logs: processed_genotype/logs/<jobid>_napls_qc_master/ (includes overall run log napls3_qc_run.log and checkpoint file napls3_qc_checkpoint.txt).
  • Overall Summary: processed_genotype/napls3_qc_pipeline_summary.txt (Generated at the end of the master script, summarizing runtime, outputs, and completion status).

6. Troubleshooting

  • Job Failure: Check the SGE output log ($HOME/project-cbearden/napls/gprep/processed_genotype/logs/$JOB_ID_*.log) and the pipeline run log (${LOG_DIR}/napls3_qc_run.log). Also check logs within the specific stage's output directory (e.g., ${FINAL_DIR}/logs/). Look for ERROR: messages in the logs.
  • Prerequisite Errors: Ensure all software (PLINK, R, rsid_tools, etc.) is correctly installed/loaded and paths in run_napls_qc.sh are accurate. Verify input files (NAPLS3_n710.*, phenotype files) exist and are accessible. The master script's verify_prerequisites function checks most common issues.
  • File Not Found: Check if the previous step completed successfully and generated the expected output files. The cached_find function in the master script relies on specific naming patterns and timestamps; ensure outputs exist in the expected locations (WORK_DIR, ENIGMA_DIR/<jobid>_partX/). Errors in transfer_files (rsync) might also cause issues.
  • PLINK Errors: Consult the PLINK .log files generated within the relevant stage's output directory (often in output_all or logs subdirectories). Common errors involve memory limits, file format issues, or conflicting parameters.
  • R Script Errors: Look for errors within the main run log (*_run.log) or specific R output files/logs. Common issues include missing packages (though scripts attempt installation), incorrect input file formats/paths, or problems with plotting libraries. PDF generation errors might relate to LaTeX (tinytex) setup; check tinytex installation and logs.
  • rsid_tools Errors: Ensure rsid_tools is correctly installed and the path is set. Verify the dbSNP binary files (Stage 0) were created successfully and are accessible in $RS_BIN_DIR (or linked correctly in Stage 1's scratch space).
  • Permissions: Ensure write permissions in $WORK_DIR, $SCRATCH_DIR, and their subdirectories. Check read permissions for input data.
  • Timeout Errors: If a step times out (reported by the master script), it may need larger resource requests (runtime h_rt, memory h_data, or cores pe shared) in the SGE header (#$) of the failing script or of the master script.
