This document provides a comprehensive, step-by-step explanation of the pre-imputation Quality Control (QC) workflow applied to the North American Prodrome Longitudinal Study Phase 3 (NAPLS3) genomic dataset. The implementation, executed via shell scripts on the Hoffman2 cluster, follows the ENIGMA-DTI QC Protocol (v1.1, July 2024).

Goal: To prepare the raw NAPLS3 genotype data (genome build hg19/GRCh37) for subsequent imputation by standardizing variant identifiers (SNP renaming) and applying the specific QC filters mandated by the ENIGMA-DTI protocol.

Orchestration: The entire pipeline is managed by the master script `run_napls_qc.sh`. This script coordinates the execution of several modular scripts, each responsible for a specific stage of the workflow. It manages directories, handles job-submission parameters for Hoffman2's SGE scheduler, verifies prerequisites, uses checkpointing (`napls3_qc_checkpoint.txt`) for resumability, and logs progress.
The pipeline is divided into distinct stages, each implemented by a dedicated script:
- Script: `01_create_rsid_binaries.sh`
- Protocol Relevance: A preparatory step, not explicitly part of the ENIGMA-DTI protocol document, but necessary for the chosen SNP renaming tool (`rsid_tools`).
- Purpose: To generate efficient binary index files from a comprehensive dbSNP VCF (build 156 for GRCh37). These binaries allow the `rsid_tools annotate` command (used in Stage 1) to rapidly map variant coordinates (chromosome, position, alleles) to their corresponding reference SNP cluster IDs (rsIDs).
- How it Works (Hoffman2 Implementation):
  - Environment Setup: Loads the necessary Hoffman2 modules (`parallel`, `bcftools`, `htslib`). Defines directories, including `$SCRATCH` for intensive I/O and `$OUTPUT_DIR` for the final binaries.
  - dbSNP Acquisition (`get_dbsnp_files`):
    - Checks `$SCRATCH` for an existing dbSNP VCF (`GCF_000001405.25.gz`) and index (`.tbi`); if found, creates symbolic links.
    - Checks `EXISTING_VCF_DIR` (if specified) and copies the files if found there.
    - Otherwise, downloads the VCF and index from the NCBI FTP site using `curl`.
  - VCF Verification (`verify_vcf`): Uses `bcftools view -h` to quickly check that the downloaded/linked VCF header is readable, ensuring the file is not corrupted before processing.
  - Parallel Chromosome Extraction: Uses `parallel` and `bcftools view -r <region>` to extract each chromosome (1-22, X, Y, M) from the main dbSNP VCF into a separate compressed VCF (`chr{}.vcf.gz`) within `$TEMP_DIR`, breaking the large VCF down for parallel processing. NCBI contig names (e.g., `NC_000001.10`) are mapped to simple chromosome numbers/letters.
  - Parallel Binary Creation (sketched below): Uses `parallel` again to process each chromosome's extracted VCF:
    - `bcftools query`: Extracts the relevant fields (CHROM, ID, POS, REF, ALT).
    - `awk`: Replaces the NCBI chromosome name with the simple name (e.g., '1', 'X').
    - `sed 's/rs//g'`: Removes the 'rs' prefix from rsIDs, as expected by `rsid_tools make_bin`.
    - `sort`: Sorts the data numerically by position.
    - `bgzip`: Compresses the sorted TSV file.
    - `rsid_tools make_bin`: Creates the chromosome's `.hash2rsid.bin` and `.rsid2pos.bin` files from the compressed TSV.
  - Transfer & Cleanup: Moves the generated `.bin` files from `$TEMP_DIR` to the final `$OUTPUT_DIR`, and optionally cleans up temporary files in `$SCRATCH`.
- Why this Stage: Pre-generating these binaries significantly speeds up the SNP renaming process in the next stage, which would otherwise involve much slower lookups in the large text-based dbSNP VCF. Parallelization leverages Hoffman2's multi-core nodes.
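For concreteness, a minimal sketch of the per-chromosome binary creation step is shown below. The `make_chr_bin` helper is hypothetical, and the exact `rsid_tools make_bin` invocation is an assumption based on the description above; the `bcftools`/`awk`/`sort`/`bgzip` chain mirrors the documented pipeline (with the `sed` step folded into `awk`).

```bash
#!/bin/bash
# Illustrative sketch of the per-chromosome TSV -> binary step described above.
export TEMP_DIR=${TEMP_DIR:-$SCRATCH/rsid_tmp}

make_chr_bin() {
  local chr=$1                                  # simple name: 1-22, X, Y, M
  # CHROM, ID, POS, REF, ALT -> TSV; substitute the simple chromosome name,
  # strip the 'rs' prefix, sort numerically by position, compress.
  bcftools query -f '%CHROM\t%ID\t%POS\t%REF\t%ALT\n' "$TEMP_DIR/chr${chr}.vcf.gz" |
    awk -v c="$chr" 'BEGIN{OFS="\t"} { $1 = c; sub(/^rs/, "", $2); print }' |
    sort -t $'\t' -k3,3n |
    bgzip > "$TEMP_DIR/chr${chr}.tsv.gz"
  # Build the .hash2rsid.bin / .rsid2pos.bin indexes for this chromosome
  # (assumed rsid_tools syntax).
  rsid_tools make_bin "$TEMP_DIR/chr${chr}.tsv.gz"
}
export -f make_chr_bin

parallel -j "${NSLOTS:-4}" make_chr_bin ::: {1..22} X Y M
```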
- Script: `01_rename_snps_direct.sh`
- Protocol Relevance: Addresses the implicit requirement for standardized variant IDs before merging with reference panels (like HapMap3 in Stage 3). While the protocol doesn't mandate a specific tool, consistent naming (preferably rsIDs) is crucial.
- Purpose: To update the variant identifiers in the NAPLS3 `.bim` file. The script prioritizes mapping variants to rsIDs using the dbSNP binaries created in Stage 0; any variant that cannot be mapped to an rsID is assigned a composite key in the format `chr:pos:ref:alt`. This ensures every variant has a unique and informative identifier.
- How it Works (Hoffman2 Implementation):
  - Environment Setup: Loads the `parallel` module. Defines paths to tools (`plink2`, `rsid_tools`), input/output directories, and the location of the dbSNP binaries (`$RS_BIN_DIR`). Uses `$SCRATCH` for intermediate files.
  - Input Copying: Copies the raw NAPLS3 `.bed`, `.bim`, `.fam` files to a job-specific `$SCRATCH_DIR`.
  - Binary Linking (`setup_links`): Creates symbolic links in `$SCRATCH_DIR/bin_links` pointing to the actual binary files in `$RS_BIN_DIR`, avoiding potential issues with `rsid_tools` path-length limits or special characters.
  - BIM Preprocessing (first sketch below): Uses `awk` to read the input `.bim` file. It filters for valid SNPs (numeric position, ACGT alleles), handles chromosome names (mapping PAR1/PAR2 to X), creates a composite key (`chr:pos:ref:alt`, using the REF:ALT alleles), and outputs a temporary map (`preprocessed_map.txt`) containing `OriginalID <tab> CompositeKey`. `!seen[$2]++` ensures only the first occurrence of each variant ID is kept.
  - Parallel Annotation (`annotate_chromosome` function called by `parallel`), for each chromosome (1-22, X, Y):
    - `awk`: Extracts the composite keys belonging to that chromosome from `preprocessed_map.txt` into `chr{}_ids.txt`.
    - `rsid_tools annotate`: Uses the dbSNP binaries (via the links) to find rsIDs for the composite keys in `chr{}_ids.txt`, writing results to `chr{}/annotated/hash2rsid*.tsv`.
    - `awk`: Merges the `rsid_tools` output (CompositeKey, FoundID, rsID/.) with `preprocessed_map.txt` to create a chromosome-specific map (`chr{}_map.txt`) containing `OriginalID <tab> FinalID`, where FinalID is the rsID if found and the OriginalID otherwise.
  - Map Combination: Concatenates all `chr{}_map.txt` files and uses `awk '!seen[$1]++'` to remove duplicate OriginalIDs introduced during parallel processing. Appends any OriginalIDs from `preprocessed_map.txt` that are absent from the concatenated map (variants `rsid_tools` couldn't process or map), so every original variant is accounted for, with its OriginalID as the FinalID. Saves the complete map as `final_snp_rename.txt`.
  - PLINK Renaming (second sketch below): Uses `plink2 --update-name` with the `final_snp_rename.txt` file (specifying columns 1 and 2 as the old and new IDs) to create the renamed PLINK fileset (`NAPLS3_n710_renamed*.{bed,bim,fam}`). `--merge-par` handles the PAR regions correctly; `--rm-dup force-first list` removes variants with duplicate positions, keeping the first one encountered and listing the removed duplicates; `--not-chr MT` excludes the mitochondrial chromosome.
  - Output Comparison: Uses `awk` to compare the original `.bim` against the renamed `.bim`: it reconstructs the composite key from the original BIM, checks whether it exists in the renamed BIM, and lists any variants present in the original but missing from the renamed output (`missing_variants*.txt`).
  - Result Transfer: Copies the final renamed PLINK fileset, the final map, the duplicate list, and the missing-variants list from `$SCRATCH` back to `$WORK_DIR`.
- Why this Stage: Standardizes variant IDs for compatibility with reference datasets (HapMap3) and downstream tools. Using rsIDs where possible is standard practice, and composite keys provide unique identifiers for unmapped variants. `rsid_tools` + `parallel` offers an efficient way to perform the mapping on Hoffman2; `plink2` performs the actual file update.
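The composite-key logic can be illustrated with a minimal `awk` sketch. Treating the `.bim` A2 column as REF and A1 as ALT, and mapping PLINK's `XY`/`25` PAR code back to X, are assumptions consistent with the description above; the real script handles more edge cases.

```bash
# Hypothetical sketch of the BIM preprocessing step (.bim columns:
# chr, ID, cM, pos, A1, A2).
awk 'BEGIN { OFS = "\t" }
     $4 ~ /^[0-9]+$/ && $5 ~ /^[ACGT]$/ && $6 ~ /^[ACGT]$/ {
       c = $1
       if (c == "25" || c == "XY") c = "X"   # fold the PARs back into X
       key = c ":" $4 ":" $6 ":" $5          # chr:pos:ref:alt (A2=REF, A1=ALT)
       if (!seen[$2]++) print $2, key        # first occurrence of each ID only
     }' input.bim > preprocessed_map.txt
```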
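And a sketch of the renaming call itself; paths and the output name are illustrative. In `final_snp_rename.txt`, column 1 holds the old ID and column 2 the new ID, hence the `2 1` column arguments (new-ID column, then old-ID column) to `--update-name`.

```bash
# Sketch of the Stage 1 renaming step (illustrative paths).
plink2 \
  --bfile "$SCRATCH_DIR/NAPLS3_n710" \
  --update-name final_snp_rename.txt 2 1 \
  --merge-par \
  --rm-dup force-first list \
  --not-chr MT \
  --make-bed \
  --out "$SCRATCH_DIR/NAPLS3_n710_renamed"
```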
- Script: `02_enigma_dti_qc_napls3_part1.sh`
- Protocol Relevance: Directly implements the initial QC steps outlined in the ENIGMA-DTI protocol: filtering for DTI availability, ensuring correct sex/phenotype coding, splitting the X chromosome, applying basic SNP/sample filters, and performing a sex check.
- Purpose: To perform initial data cleaning and filtering specific to the ENIGMA-DTI project requirements, focusing on sample selection, data consistency, and basic quality thresholds before more intensive checks.
- How it Works (Hoffman2 Implementation):
  - Environment Setup: Loads `parallel`. Defines paths, including the input files (the renamed PLINK set from Stage 1 plus the phenotype/DTI info files) and output directories (`$TEMP_DIR`, `$FINAL_DIR`). Sets the output prefix (`ANC_DATA`). Copies the input PLINK files to `$TEMP_DIR`.
  - Phenotype/Sex Update:
    - Reads the main phenotype file (`NAPLS3_Terra_samplestab_phenofile.txt`) using `awk` to create a map (`sex_map.txt`) of internal ID to sex code (1=Male, 2=Female).
    - Uses `awk` again to update the 5th column (sex) of the input `.fam` file (`input.fam`) based on `sex_map.txt`, saving the result as `input_sex_updated.fam`.
  - DTI Subject Filtering:
    - Creates an ID map (`id_map.txt`) linking internal IDs to subject IDs (`awk` on `PHENO_FILE`).
    - Extracts the subject IDs with DTI data from `napls3_MS_diffusion.csv` (`awk` on `DTI_FILE`), saving them as `dti_ids_raw.txt`.
    - Uses `join` to intersect `id_map.txt` with `dti_ids_raw.txt`, outputting a list of internal IDs (FID and IID) for subjects with DTI data (`dti_fid_list.txt`).
    - Uses `plink1.9 --keep` with `dti_fid_list.txt` on the sex-updated input data to create a dataset (`*_dti.*`) containing only individuals with DTI data. (Protocol Step: "Very important Please remove individuals who do not have DTI data available...")
  - Phenotype Recoding:
    - Creates a phenotype map (`pheno_map.txt`) from `PHENO_FILE` mapping internal ID to case/control status (1=Control, 2=Case) via `awk`.
    - Uses `awk` to update the 6th column (phenotype) of the `*_dti.fam` file based on `pheno_map.txt`, defaulting to 1 (Control) when an ID is not found in the map. (Protocol Step: "assign phenotype... coding controls... as 1 and cases as 2")
  - X Chromosome Splitting: Uses `plink1.9 --split-x b37 no-fail` on the `*_dti` dataset to handle the pseudo-autosomal regions (PARs) correctly, creating `*_splitx.*`; it retries with `hg19` if `b37` fails. (Protocol Step: "Split-X to deal with pseudo-autosomal regions")
  - Initial Filtering & Pruning:
    - Applies basic SNP/sample QC filters to the `*_splitx` dataset using `plink1.9`: `--mind 1` (effectively no per-sample missingness filter at this step), `--geno 0.01` (removes SNPs missing >1% of calls), `--maf 0.05` (removes SNPs with MAF < 5%), and `--hwe 1e-06` (removes SNPs failing the HWE exact test at p < 1e-6; with case/control data PLINK applies this to controls only by default). Creates `*_filtered.*`. (Protocol Step: implicit filtering before the sex check, though the thresholds differ slightly from the protocol example's pre-sexcheck filter.)
    - Performs LD pruning using `plink1.9 --indep-pairphase 20000 2000 0.5` (window of 20,000 SNPs, step of 2,000 SNPs, r² threshold 0.5) in parallel across chromosomes (`run_parallel_pruning` function) on the `*_filtered` dataset, creating `*_pruned.prune.in`. (Protocol Step: uses `--indep-pairphase` as in the protocol example.)
    - Extracts the pruned SNPs with `plink1.9 --extract` to create the `*_pruned.*` dataset.
  - Sex Check (sketched after this stage):
    - Performs a sex check on the `*_splitx` dataset using `plink1.9 --check-sex 0.2 0.8`, which compares the reported sex (from the `.fam`) with the genetic sex inferred from X-chromosome homozygosity (F statistic); the 0.2/0.8 thresholds define the boundaries for female/male calls. Creates `*_sexcheck.sexcheck`. (Protocol Step: "plink --bfile ${datafile}_splitX --check-sex 0.2 0.8")
    - Uses `grep PROBLEM` to extract individuals with conflicting sex information into `sex_mismatches.txt`.
  - Sex Mismatch Removal:
    - If `sex_mismatches.txt` is not empty, uses `plink1.9 --remove` to exclude those individuals from the `*_splitx` dataset, creating the final `*_QC1.*` dataset for this stage. (Protocol Step: "plink --bfile ${datafile} --remove sex.drop --make-bed --out ${datafile}_QC1")
    - If there are no mismatches, `*_QC1.*` is simply a copy of `*_splitx.*`.
  - Reporting: Calculates the final counts (cases, controls, X SNPs) and generates summary files.
- Why this Stage: This stage enforces initial data quality and consistency according to ENIGMA's requirements. Filtering by DTI availability focuses the analysis. Correct sex/phenotype coding is essential for checks and association studies. Splitting X is crucial for accurate sex inference. Basic filtering removes unreliable data points. The sex check identifies and removes sample mix-ups or incorrect reporting.
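A minimal sketch of the sex check and mismatch removal, following the protocol commands quoted above; the `${ANC_DATA}` file prefixes are illustrative.

```bash
# Sex check on the split-X dataset; 0.2/0.8 are the female-max / male-min
# F-statistic thresholds.
plink --bfile "${ANC_DATA}_splitx" --check-sex 0.2 0.8 --out "${ANC_DATA}_sexcheck"

# Collect FID/IID of individuals whose reported and genetic sex conflict.
grep PROBLEM "${ANC_DATA}_sexcheck.sexcheck" | awk '{print $1, $2}' > sex_mismatches.txt

if [ -s sex_mismatches.txt ]; then
  plink --bfile "${ANC_DATA}_splitx" --remove sex_mismatches.txt \
        --make-bed --out "${ANC_DATA}_QC1"
else
  # No mismatches: QC1 is just a copy of the split-X fileset.
  for ext in bed bim fam; do
    cp "${ANC_DATA}_splitx.${ext}" "${ANC_DATA}_QC1.${ext}"
  done
fi
```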
- Script: `02_enigma_dti_qc_napls3_part2.sh`
- Protocol Relevance: Implements the ENIGMA-DTI steps for identifying and removing duplicate samples, assessing relatedness, merging with the HapMap3 reference panel, performing Multi-Dimensional Scaling (MDS) for ancestry visualization, and removing ancestry outliers.
- Purpose: To refine the dataset by removing technical duplicates, assessing cryptic relatedness, and ensuring the sample consists of individuals with the target ancestry (European, in this case) by comparing against a standard reference panel (HapMap3).
- How it Works (Hoffman2 Implementation):
  - Environment Setup: Loads `aria2`, `parallel`, and `R`. Defines paths, including the input directory (the output of Part 1), `$TEMP_DIR`, `$FINAL_DIR`, and the HapMap3 URL/directory. Copies the `*_QC1.*` files from the Part 1 output to `$TEMP_DIR`.
  - Duplicate/Relatedness Pruning:
    - Applies basic filtering (`--mind 0.1`, `--geno 0.01`, `--maf 0.05`) to `*_QC1`, creating `*_QC1tmp`. Note that `--mind 0.1` is looser than the later filters.
    - Performs LD pruning (`plink1.9 --indep-pairwise 100 5 0.2`) on `*_QC1tmp`, creating `*_QC1pruned.*`. (Protocol Step: uses `--indep-pairwise` as in the protocol.)
  - Duplicate Check & Removal (first sketch after this stage):
    - Calculates pairwise Identity-By-Descent (IBD) using `plink1.9 --genome --min 0.9` on the pruned data (`*_QC1pruned`), outputting pairs with PI_HAT > 0.9 to `pihat_duplicates.genome`. (Protocol Step: "plink --bfile ${datafile}_QC1pruned --genome --min 0.9")
    - Uses `awk` to extract one individual (FID, IID) from each duplicate pair found in `pihat_duplicates.genome`, excluding the known monozygotic twin pairs specified in the script (`mz_twins` array), and saves the list to `pihat_duplicates.txt`.
    - Counts the number of individuals to be removed (`dup_count`) and saves it.
    - If `dup_count > 0`, uses `plink1.9 --remove` with `pihat_duplicates.txt` on the unpruned `*_QC1` dataset to create `*_QC2.*`. (Protocol Step: "plink --bfile ${datafile}_QC1 --remove pihat_duplicates.txt --make-bed --out ${datafile}_QC2")
    - If `dup_count == 0`, links the `*_QC1.*` files to `*_QC2.*`.
  - Relatedness Check:
    - Calculates pairwise IBD using `plink1.9 --genome --min 0.25 --max 0.9` on the pruned data (`*_QC1pruned`), outputting pairs with 0.25 < PI_HAT < 0.9 to `pihat_relatedness.genome`. (Protocol Step: "plink --bfile ${datafile}_QC1pruned --genome --min 0.25 --max 0.9")
    - Counts the number of related pairs (`rel_count`) and saves it. Per the protocol, these individuals are not removed.
  - HapMap3 Preparation: Downloads the HapMap3 reference data (`HM3_b37.{bed,bim,fam}.gz`) using `aria2c` if it is not already present in `$HAPMAP_DIR`, using `flock` for safe concurrent downloads when multiple jobs run. Decompresses with `pigz` and creates `HM3_b37.snplist.txt`. (Protocol Step: "Download the following 3 files...")
  - MDS Filtering & Preparation:
    - Applies stricter QC filters to `*_QC2` for MDS: `plink1.9 --mind 1` (no sample-missingness filter), `--hwe 1e-6`, `--geno 0.05`, `--maf 0.01`. Creates `*_QC2_filtered.*`. (Protocol Step: "plink --bfile ${datafile}_QC2 --mind 1 --hwe 1e-6 --geno 0.05 --maf 0.01")
    - Extracts the SNPs present in HapMap3 from `*_QC2_filtered` using `plink1.9 --extract HM3_b37.snplist.txt`, creating `*_QC2local.*`. (Protocol Step: "plink --bfile ${datafile}_QC2_filtered --extract HM3_b37.snplist.txt")
    - Uses `awk` to identify and exclude strand-ambiguous SNPs (A/T, C/G) from `*_QC2_filtered.bim`, saving the list of unambiguous SNPs to `local.snplist.txt`. (Protocol Step: "awk '{ if (($5=="T" && $6=="A")...")
    - Extracts these unambiguous SNPs from the HapMap3 data using `plink1.9 --extract local.snplist.txt`, creating `HM3_b37_external.*`. (Protocol Step: "plink --bfile HM3_b37 --extract local.snplist.txt")
    - Identifies and excludes multi-allelic SNPs found within the local data (`*_QC2local.bim`) before merging.
    - Uses `plink1.9 --flip-scan` to identify potential strand flips between the local data and the reference, then flips the necessary SNPs with `plink1.9 --flip`.
  - Merging:
    - Attempts to merge the prepared local data (`*_QC2local_flipped` or `*_QC2local_no_multi`) with the prepared HapMap3 data (`HM3_b37_external_no_multi`) using `plink1.9 --bmerge`. (Protocol Step: "plink --bfile ${datafile}_QC2local --bmerge HM3_b37_external...")
    - If the merge fails due to mismatching SNPs (a `-merge.missnp` file is created), it excludes the problematic SNPs (`plink1.9 --exclude`) and retries the merge. (Protocol Step: handles merge errors as in the protocol's alternatives.)
  - MDS Calculation:
    - Performs MDS on the successfully merged dataset (`*_QC2local_HM3b37merge`) using `plink1.9 --cluster --mind .05 --mds-plot 10 --extract local.snplist.txt`, calculating 10 MDS components. (Protocol Step: "plink --bfile ${datafile}_QC2local_HM3b37merge --cluster --mind .05 --mds-plot 10")
    - Formats the MDS output (`.mds`) into TSV and CSV formats for R.
  - Outlier Identification & Plotting: executes the R script `mds_plot.R` (second sketch after this stage), which:
    - Loads the MDS data (`.csv`) and assigns population labels (NAPLS3 cohort vs. HapMap3 populations).
    - Generates a PDF plot (`mdsplot_*_outliersincluded.pdf`) of MDS components C1 vs. C2, color-coded by population.
    - Defines specific thresholds for C1 (`-0.06` to `-0.04`) and C2 (`0.055` to `0.07`) to isolate the CEU/TSI (European) cluster, as specified in the script comments reflecting ENIGMA guidance.
    - Flags individuals from the NAPLS3 cohort falling outside these C1/C2 boundaries as outliers.
    - Writes lists of the outliers (`*_pop_strat_mds.outlier.txt`) and of the non-outliers/Europeans (`*_pop_strat_mds.eur.txt`).
    - Generates a second PDF plot (`mdsplot_*_outliersexcluded.pdf`) showing only the HapMap3 reference and the non-outlier NAPLS3 individuals. (Protocol Step: the R code section for plotting and outlier identification.)
  - Outlier Removal:
    - If outliers were identified, uses `plink1.9 --keep` with the `*_pop_strat_mds.eur.txt` list on the `*_QC2_filtered` dataset (the dataset before merging with HapMap3) to create the final `*_QC3.*` dataset for this stage. (Protocol Step: "plink --bfile ${datafile}_QC2 --keep ${datafile}_pop_strat_mds.eur.txt --make-bed --out ${datafile}_QC3")
    - If no outliers were found, links `*_QC2_filtered.*` to `*_QC3.*`.
  - Summary Report: Generates a text file summarizing the removals during this stage.
- Why this Stage: Removes technical artifacts (duplicates). Assesses sample structure (relatedness, ancestry). Ensures the final dataset used for imputation and association testing primarily consists of the target ancestry (European), minimizing confounding due to population stratification, by comparing to HapMap3 and removing individuals distant from the CEU/TSI cluster in MDS space.
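A minimal sketch of the duplicate check and removal, following the protocol commands quoted above; the MZ-twin exclusion handled by the real script is omitted here.

```bash
# Pairwise IBD, reporting only pairs with PI_HAT > 0.9 (likely duplicates).
plink --bfile "${ANC_DATA}_QC1pruned" --genome --min 0.9 --out pihat_duplicates

# .genome columns begin FID1 IID1 FID2 IID2; keep one member of each pair.
awk 'NR > 1 {print $1, $2}' pihat_duplicates.genome | sort -u > pihat_duplicates.txt

if [ -s pihat_duplicates.txt ]; then
  plink --bfile "${ANC_DATA}_QC1" --remove pihat_duplicates.txt \
        --make-bed --out "${ANC_DATA}_QC2"
fi
```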
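The outlier flagging in `mds_plot.R` amounts to a box filter on C1/C2; a shell equivalent using the thresholds stated above might look as follows. Restricting the test to NAPLS3 individuals (as the real script does) is omitted, and the `.mds` column positions (FID, IID, SOL, C1, C2, ...) follow the standard PLINK layout.

```bash
# Hypothetical shell equivalent of the C1/C2 box filter in mds_plot.R.
awk 'NR > 1 {
       in_box = ($4 >= -0.06 && $4 <= -0.04 && $5 >= 0.055 && $5 <= 0.07)
       if (!in_box) print $1, $2               # FID IID of each outlier
     }' merged.mds > pop_strat_mds.outlier.txt
```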
- Script: `02_enigma_dti_qc_napls3_part3.sh`
- Purpose: Finalizes the QC process by generating PCA covariates for downstream association analyses, calculating comprehensive pre- and post-QC summary statistics, creating summary reports, and packaging the essential output files required by ENIGMA.
- Protocol Alignment: Directly implements ENIGMA-DTI Steps 8 ("Get genetic principal components") and 9 ("Cohort QC summary data"), plus the final data-packaging instructions.
- Execution Details:
  - Setup: Creates the necessary directories (`TEMP_DIR`, `ANC_DIR`, `FINAL_DIR`, `OUTPUT_ALL`, `OUTPUT_FINAL`, `SCRIPT_DIR`, `LOG_DIR`). Defines constants and paths, locating the output directories from Part 1 and Part 2 via `ls -t` and `grep`. Creates the R scripts (`pca_plot.R`, `summary_report.Rmd`) dynamically in `$SCRIPT_DIR`. Sets up a `trap cleanup EXIT INT TERM` so results are saved even if the script is interrupted. Initializes the log files.
  - Copy Inputs: Uses `find` and `transfer_files` (an `rsync` wrapper) to copy the required files from Part 1 (`*_QC1.fam`, `*_QC1.bim`, `sex_mismatches.txt` renamed to `sexcheck_PROBLEM.txt`, `*_QC_summary.txt` renamed to `snp_count_X.txt`) and Part 2 (`mdsplot*.pdf`, `*duplicates_count.txt`, `*relatedness_count.txt`, `*QC2_filtered.fam/log`, `*pop_strat_mds.outlier.txt`, `*QC3.bed/bim/fam`, `local.snplist.txt`) into the temporary `ANC_DIR`, centralizing the inputs for this stage.
  - Step 8: Generate PCA Covariates (sketched after this stage):
    - Runs `plink1.9 --pca` on the final QC'd dataset (`*_QC3`) using the common SNP list (`local.snplist.txt`) generated in Part 2. This calculates the top 20 principal components by default, writing `.eigenval` and `.eigenvec` files to `ANC_DIR`.
    - Executes the `pca_plot.R` script via `Rscript`; it reads the `.eigenval` file, calculates the variance explained by each PC, and uses `ggplot2` to generate a scree plot (`screeplot_*.pdf`) in `ANC_DIR`.
  - Step 9: Generate Summary Statistics:
    - Calls the `generate_stats` function for the pre-QC stats: reads the `*_QC1.fam` file (copied from Part 1), uses `awk` to count cases/controls and males/females within each group, calculates proportions with `bc`, and writes the results to `*_basic_stats_preQC.txt` in `ANC_DIR`.
    - Calls `generate_stats` for the post-QC stats: reads the `*_QC3.fam` file (copied from Part 2), performs the same counts and calculations, and writes the results to `*_basic_stats_postQC.txt` in `ANC_DIR`.
    - Calls `generate_snp_summary`, which creates `*_qc_summary.txt` in `ANC_DIR` by:
      - Counting lines (`wc -l`) in `*_QC1.bim` and `*_QC1.fam` for the pre-QC SNP/sample counts.
      - Grepping specific lines from the `*_QC2_filtered.log` (copied from Part 2) to extract the counts of SNPs/samples removed due to missingness (`--geno`, `--mind`), MAF (`--maf`), and HWE (`--hwe`), using `awk` to pull out the numeric count; where a count is zero or missing, `0` is written.
      - Calculating the number of samples removed as MDS outliers by subtracting the line count of `*_QC3.fam` from that of `*_QC2_filtered.fam`.
      - Counting lines in `*_QC3.bim` and `*_QC3.fam` for the post-QC SNP/sample counts.
    - Calls `generate_summary_report_txt`, which creates `*_QC3_summary.txt` in `ANC_DIR`: a human-readable summary including initial/final sample counts, the numbers of removed duplicates and outliers, final case/control numbers, the final SNP count, and the specific MDS outlier thresholds used (hardcoded in the script to match the Part 2 R script).
  - Generate PDF Summary Report:
    - Calls `generate_summary_report_pdf`, which first copies the text summary (`*_QC3_summary.txt`) into `$SCRIPT_DIR` as `summary_report.txt`.
    - It then executes an `Rscript` command that uses the `rmarkdown` package to render the `summary_report.Rmd` file (created during setup); the Rmd file simply includes the content of `summary_report.txt`.
    - The R script includes logic to install `tinytex` (a LaTeX distribution) if needed and handles potential rendering errors.
    - The resulting `summary_report.pdf` is saved in `$SCRIPT_DIR` and copied to `$OUTPUT_ALL`.
  - Packaging and Cleanup:
    - The `cleanup` function (triggered by the `trap` at the end of the script or on interruption) performs the final packaging.
    - It copies all generated files from `ANC_DIR` and `SCRIPT_DIR` into `$OUTPUT_ALL`, along with the logs from the Part 1 and Part 2 directories, to form a complete archive.
    - It then identifies the specific files required for ENIGMA submission (based on the protocol's list: specific logs, stats files, MDS plots) by searching `ANC_DIR` and `$OUTPUT_ALL`.
    - It copies these required files into `$OUTPUT_FINAL`, renaming the MDS plots to include the `$ANC_DATA` prefix for consistency.
    - It checks whether the number of files in `$OUTPUT_FINAL` matches the expected count (27, or 26 if `*QC2.log` wasn't generated), logging a warning on mismatch, and lists any missing required files in `missing_submission_files.txt`.
    - Finally, it creates the `*_ENIGMA-DTI_FilesToSend.zip` archive from the contents of `$OUTPUT_FINAL` using `zip -r -j` (the `-j` flag junks the paths, putting all files in the root of the zip).
  - Key Tools: `plink1.9`, `R` (`data.table`, `ggplot2`, `rmarkdown`, `tinytex`, `knitr`, `xfun`), `bc`, `awk`, `find`, `grep`, `wc`, `rsync`, `zip`.
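A sketch of the Step 8 PCA call described above; paths are illustrative, and `plink1.9 --pca` computes 20 components by default.

```bash
# PCA on the final QC'd dataset, restricted to the common unambiguous SNPs;
# writes .eigenval / .eigenvec files used as association covariates.
plink --bfile "${ANC_DIR}/${ANC_DATA}_QC3" \
      --extract "${ANC_DIR}/local.snplist.txt" \
      --pca \
      --out "${ANC_DIR}/${ANC_DATA}_PCACovariates"
```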
- PLINK v1.9: `/u/project/cbearden/hughesdy/software/plinkv1.9/plink` (used for most QC steps)
- PLINK v2.0: `/u/project/cbearden/hughesdy/software/plink2` (used in Stage 1 for `--update-name` and `--rm-dup`)
- rsid_tools: `$HOME/apps/rsid_tools/bin/rsid_tools` (used in Stage 1 for SNP mapping; requires separate installation)
- R: v4.2.2+ (loaded via the `R/4.2.2-BIO` module)
  - Required R packages: `data.table`, `ggplot2`, `calibrate` (for the MDS plot), `rmarkdown`, `tinytex`, `knitr`, `xfun` (for the PDF report). The scripts attempt installation via `install.packages` if a package is missing.
- Hoffman2 modules: `parallel`, `bcftools`, `htslib` (for Stage 0), `aria2` (for the HapMap3 download in Stage 3).
- Standard Unix utilities: `bash`, `awk`, `grep`, `sort`, `join`, `curl`, `zip`, `rsync`, `timeout`, `bc`, `find`, `tee`, `cat`, `mv`, `cp`, `ln`, `mkdir`, `rm`, `stat`, `tr`, `wc`, `pigz` (or `gunzip`), `tail`, `head`.
- Raw NAPLS3 Genotypes: `/u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/NAPLS3_n710.{bed,bim,fam}` (build hg19/GRCh37).
- Phenotype/Sample Information: located in `processed_genotype/enigma/DTIgenetics/info/`:
  - `NAPLS3_Terra_samplestab_phenofile.txt`: subject IDs (column 3), sex (column 7), case/control status (column 8). Used in Part 1 for updating the `.fam` file and for ID mapping.
  - `napls3_MS_diffusion.csv`: list of subjects with DTI data (column 1, format `SITE-S####`). Used in Part 1 for filtering individuals (`--keep`).
- dbSNP VCF (for Stage 0): dbSNP build 156 for GRCh37 (`GCF_000001405.25.gz` and `.tbi`). Downloaded automatically by `01_create_rsid_binaries.sh` if not found locally or specified via `EXISTING_VCF_DIR`.
- HapMap3 Reference Data (for Stage 3): `HM3_b37.{bed,bim,fam}`. Downloaded automatically by `02_enigma_dti_qc_napls3_part2.sh` from the ENIGMA website (`aria2c`) if not present in the temp directory (`$HAPMAP_DIR`).
- Hoffman2 Cluster: the scripts rely on SGE job-scheduler directives (`#$`) and the module system (`module load`).
- Directory Access: read access to the input data directories (`/u/project/cbearden/hughesdy/NAPLS/raw_genotype/NAPLS3/`) and write access to the working directories (`/u/home/c/cobeaman/project-cbearden/napls/gprep/processed_genotype/`, `$SCRATCH`).
- Environment Variables: the master script `run_napls_qc.sh` defines the key paths (e.g., `WORK_DIR`, `SCRATCH_DIR`, `LOG_DIR`, tool paths), which are exported and inherited by the subscripts; subscripts also define their own specific paths relative to these.
The pipeline is executed by submitting the master script `run_napls_qc.sh` to the Hoffman2 scheduler:

```bash
qsub run_napls_qc.sh
```

- Configuration: Review and adjust the environment variables within `run_napls_qc.sh` if necessary (e.g., paths to software, base directories).
- Monitoring:
  - Overall pipeline progress and logs: `${LOG_DIR}/napls3_qc_run.log` (where `$LOG_DIR` is defined in `run_napls_qc.sh`).
  - Individual script logs: stored within the subdirectories corresponding to each stage (e.g., `${FINAL_DIR}/logs/` for Part 3). Check both the main SGE log (`$JOB_ID_*.log`) and the script-specific run log (`*_run.log`) within these directories.
  - SGE job output: `$HOME/project-cbearden/napls/gprep/processed_genotype/logs/$JOB_ID_napls_qc_master.log`.
- Resumption: The pipeline uses a checkpoint file (`${LOG_DIR}/napls3_qc_checkpoint.txt`) to track completed stages. If the job is interrupted, resubmitting `run_napls_qc.sh` will detect the completed steps via the `step_completed` function and skip them, resuming from the first incomplete stage (see the sketch below).
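A minimal sketch of the checkpointing pattern described above; the actual `step_completed` implementation and the stage name shown are assumptions.

```bash
CHECKPOINT="${LOG_DIR}/napls3_qc_checkpoint.txt"

step_completed() {                 # usage: step_completed <stage_name>
  grep -qxF "$1" "$CHECKPOINT" 2>/dev/null
}
mark_completed() {
  echo "$1" >> "$CHECKPOINT"
}

# Hypothetical stage name; run the stage only if it has not been checkpointed.
if ! step_completed "stage1_rename"; then
  bash 01_rename_snps_direct.sh && mark_completed "stage1_rename"
fi
```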
- SNP Renaming (Stage 1):
  - Renamed PLINK files: `processed_genotype/NAPLS3_n710_renamed*.{bed,bim,fam}` (timestamped)
  - Renaming map: `processed_genotype/final_snp_rename*.txt` (timestamped)
  - Missing-variants list: `processed_genotype/missing_variants*.txt` (timestamped)
  - Logs: `processed_genotype/logs/<jobid>_rename_snps_direct/`
- ENIGMA QC (Stages 2-4): each part creates a job-ID-based directory within DTIgenetics.
  - Part 1 (`..._part1/`):
    - `*_QC1.{bed,bim,fam}`: dataset after initial filtering (DTI subjects), sex/phenotype update, X-split, and sex-check removal.
    - `sex_mismatches.txt`: individuals removed due to sex mismatch (FID, IID).
    - `*_QC_summary.txt`: basic counts (total N, cases, controls, X SNPs) for the QC1 dataset.
    - Logs: logs subdirectory (includes `run.log` and PLINK logs).
  - Part 2 (`..._part2/`):
    - `*_QC3.{bed,bim,fam}`: final QC'd dataset after duplicate removal and MDS outlier removal; the primary input for Part 3's PCA.
    - `mdsplot_*.pdf`: MDS plots (before and after outlier removal).
    - `*_pop_strat_mds.outlier.txt`: removed ancestry outliers (FID, IID).
    - `*_pop_strat_mds.eur.txt`: individuals kept (European ancestry cluster) (FID, IID).
    - `*_QC1pruned_duplicates_count.txt`, `*_QC1pruned_relatedness_count.txt`: counts from the duplicate/relatedness checks.
    - `*_QC2_summary.txt`: text summary of the removals in Part 2.
    - `local.snplist.txt`: non-ambiguous SNPs common to the dataset and HapMap3, used for MDS and PCA.
    - Logs: logs subdirectory (includes `run.log`, PLINK logs, R script output).
  - Part 3 (`..._part3/`):
    - `*_PCACovariates.{eigenval,eigenvec}`: PCA results (eigenvalues and eigenvectors).
    - `screeplot_*.pdf`: PCA scree plot.
    - `*_basic_stats_preQC.txt`, `*_basic_stats_postQC.txt`: case/control/sex counts before/after QC.
    - `*_qc_summary.txt`: detailed SNP/sample removal counts across steps (missingness, MAF, HWE, MDS).
    - `*_QC3_summary.txt`: text summary report of the final counts and thresholds.
    - `summary_report.pdf`: PDF version of the summary report.
    - `output_all/`: archive of most intermediate and final files from all QC steps (useful for debugging).
    - `output_final/`: curated set of files required for ENIGMA submission (specific logs, stats, plots).
    - `*_ENIGMA-DTI_FilesToSend.zip`: final zip archive containing the `output_final` contents.
    - `missing_submission_files.txt`: expected submission files that were not found (if any).
    - Logs: logs subdirectory (includes `run.log`, PLINK logs, R script output).
- Master Script Logs: `processed_genotype/logs/<jobid>_napls_qc_master/` (includes the overall run log `napls3_qc_run.log` and the checkpoint file `napls3_qc_checkpoint.txt`).
- Overall Summary: `processed_genotype/napls3_qc_pipeline_summary.txt` (generated at the end of the master script, summarizing runtime, outputs, and completion status).
- Job Failure: Check the SGE output log (`$HOME/project-cbearden/napls/gprep/processed_genotype/logs/$JOB_ID_*.log`) and the pipeline run log (`${LOG_DIR}/napls3_qc_run.log`). Also check the logs within the specific stage's output directory (e.g., `${FINAL_DIR}/logs/`). Look for `ERROR:` messages in the logs.
- Prerequisite Errors: Ensure all software (PLINK, R, rsid_tools, etc.) is correctly installed/loaded and that the paths in `run_napls_qc.sh` are accurate. Verify that the input files (`NAPLS3_n710.*`, phenotype files) exist and are accessible. The master script's `verify_prerequisites` function checks the most common issues.
- File Not Found: Check whether the previous step completed successfully and generated the expected output files. The `cached_find` function in the master script relies on specific naming patterns and timestamps; ensure the outputs exist in the expected locations (`WORK_DIR`, `ENIGMA_DIR/<jobid>_partX/`). Errors in `transfer_files` (rsync) can also cause this.
- PLINK Errors: Consult the PLINK `.log` files generated within the relevant stage's output directory (often in `output_all` or the logs subdirectories). Common errors involve memory limits, file-format issues, or conflicting parameters.
- R Script Errors: Look for errors in the main run log (`*_run.log`) or the specific R output files/logs. Common issues include missing packages (though the scripts attempt installation), incorrect input file formats/paths, or problems with plotting libraries. PDF-generation errors may relate to the LaTeX (`tinytex`) setup; check the `tinytex` installation and logs.
- `rsid_tools` Errors: Ensure `rsid_tools` is correctly installed and its path is set. Verify that the dbSNP binary files (Stage 0) were created successfully and are accessible in `$RS_BIN_DIR` (or linked correctly in Stage 1's scratch space).
- Permissions: Ensure write permissions in `$WORK_DIR`, `$SCRATCH_DIR`, and their subdirectories. Check read permissions for the input data.
- Timeout Errors: If a step times out (reported by the master script), it may need more resources (time `h_rt`, memory `h_data`, or cores `pe shared`) requested in the SGE header (`#$`) of the failing script or of the master script.
- ENIGMA-DTI QC Protocol: GitHub Link, Summary HTML
- PLINK 1.9: Documentation
- PLINK 2.0: Documentation
- rsid_tools: GitHub Repository
- R Project: Homepage
- GNU Parallel: Documentation
- Hoffman2 Cluster: User Guide