lowestprime  /   BP-DNAm  /  
Latest commit

af16c5c · Dec 4, 2024

Accelerated Biological Aging in Bipolar Disorder

Motivation and Background

This project investigates accelerated biological aging in the largest bipolar disorder DNA methylation cohort to date, aiming to identify differences in epigenetic age acceleration, along with their drivers and modifiers, between individuals with bipolar disorder and controls. DNA methylation data from Illumina EPIC arrays are preprocessed and quality-controlled, with specific attention to missing probes and data normalization. GrimAge2 and other epigenetic aging algorithms from the pyaging Python package are applied. Statistical analyses, including t-tests, ANCOVA, and correlation analysis, are conducted in R and Python to assess differences in GrimAge2 age acceleration between diagnostic groups while covarying for age and sex. Plots for data exploration and presentation are generated with the Python libraries seaborn and matplotlib. The R packages minfi, BioAge, dnaMethyAge, and methylclock are used to prepare for epigenetic clock analysis. Finally, R's data.table and Python's pandas are used to prepare, clean, and transform the raw data for analysis. Future research will compare multiple methylation aging clocks, characterize the individual contributions of GrimAge2 subcomponents, and explore the effects of lithium treatment and other environmental modifiers on epigenetic age acceleration in bipolar disorder.

Relevant Literature

  1. Meta-analysis of epigenetic aging in schizophrenia reveals multifaceted relationships with age, sex, illness duration, and polygenic risk

Computational Overview

  • Programming Languages: R and Python.
  • R Packages: minfi, BioAge, dnaMethyAge, methylclock, dplyr, tidyr, data.table, purrr, ggplot2, plotly, RColorBrewer, reshape2, GenomicRanges, SummarizedExperiment, qs, bigmemory, doParallel, parallel, arrow.
  • Python Packages: pyaging, pandas, numpy, scipy, seaborn, matplotlib, sklearn (specifically KMeans, StandardScaler), statsmodels, pygam, pyarrow.
  • High-Performance Computing (HPC): Conducted in the Hoffman2 HPC environment utilizing SGE job scheduling and parallel processing in R for computationally intensive tasks.
  • Data Management: Performed data cleaning, transformation, merging, and subsetting across both R and Python. Efficiently processed large datasets using packages including bigmemory and pyarrow. Generated reproducible analysis workflows by logging key data characteristics (e.g., data dimensions, timestamps) to filenames.
  • Statistical Analysis: Conducted descriptive statistics, correlation analysis, t-tests, and ANCOVA; generalized additive models (GAMs) are planned.
  • Data Visualization: Created a wide range of static visualizations for exploratory data analysis and presentation of results.
  • Version Control: Utilized GitHub for code sharing and version control.
  • Workflow Design: Designed and implemented a multi-stage analysis pipeline involving data preprocessing, clock calculation, statistical analysis, visualization, and reporting, including integration of R and Python components.

Results

Stage 1 - Data Acquisition and Preparation

  • Data Acquisition: Acquired raw DNA methylation data (likely IDAT files) from Illumina EPIC arrays along with accompanying sample sheets containing demographic and diagnostic information. Potentially integrated data from multiple sources (e.g., "Bipolar 2023 Sample Sheet", "2000_sample_covariates", "highcov_technical_covariates", "Complete BIG Data").
  • Data Import and Formatting: Imported data into R and converted to appropriate formats (e.g., GenomicRatioSet) for downstream analysis using minfi. Used R's read.csv, read_excel, and read.table for sample sheet information. Employed Python's pyarrow.feather for efficient loading of preprocessed and saved data subsets.
  • Data Cleaning and Quality Control (QC): Performed quality control procedures, including:
    • Checking for missing data in both methylation and sample annotation data.
    • Addressing missing probe information using external resources like the mepylome package and manifest files.
    • Removing duplicate probe data.
    • Comparing predicted and reported sex.
  • Data Wrangling and Transformation: Manipulated and transformed data using dplyr, tidyr, data.table in R and pandas in Python. This included renaming columns, recoding variables (e.g., Gender), handling "_REP" sample duplicates, merging datasets, calculating age in months/years from date data, and summarizing missing data patterns.
  • Data Subsetting: Created subsets of data for specific analyses (e.g., selecting samples with complete data, extracting specific CpG sites related to GrimAge2).

| Characteristic | Bipolar | Other |
| --- | --- | --- |
| Count | 1530 | 912 |
| Male | 655 (42.8%) | 382 (41.9%) |
| Female | 875 (57.2%) | 530 (58.1%) |
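
The wrangling steps listed under Stage 1 (recoding Gender, handling "_REP" duplicates, deriving age from dates, and tabulating sex by diagnosis) can be sketched in pandas as follows. This is a minimal illustration, not the project's code: the sample sheet, its column names (`BirthDate`, `CollectionDate`, `BaseID`, and so on), and the rule of preferring the non-replicate sample are all assumptions.

```python
import pandas as pd

# Hypothetical sample sheet; column names and values are illustrative only.
sheet = pd.DataFrame({
    "SampleID": ["S001", "S001_REP", "S002", "S003"],
    "Gender": ["F", "F", "M", "F"],
    "BirthDate": ["1970-01-15", "1970-01-15", "1985-06-30", "1990-03-01"],
    "CollectionDate": ["2020-01-15", "2020-01-15", "2021-06-30", "2022-09-01"],
    "Diagnosis": ["Bipolar", "Bipolar", "Other", "Bipolar"],
})

# Flag technical replicates by the "_REP" suffix and recover the base sample ID.
sheet["IsReplicate"] = sheet["SampleID"].str.endswith("_REP")
sheet["BaseID"] = sheet["SampleID"].str.replace(r"_REP$", "", regex=True)

# Recode Gender to a 0/1 indicator (1 = female).
sheet["Female"] = sheet["Gender"].map({"F": 1, "M": 0})

# Age from the two dates (365.25 days per year is an approximation).
birth = pd.to_datetime(sheet["BirthDate"])
collected = pd.to_datetime(sheet["CollectionDate"])
sheet["AgeYears"] = (collected - birth).dt.days / 365.25
sheet["AgeMonths"] = sheet["AgeYears"] * 12

# Keep one row per individual, preferring the non-replicate sample (assumed rule).
primary = (
    sheet.sort_values("IsReplicate", kind="stable")
    .drop_duplicates("BaseID", keep="first")
    .reset_index(drop=True)
)

# Sex counts by diagnosis, in the style of a characteristics table.
counts = pd.crosstab(primary["Female"], primary["Diagnosis"])
print(counts)
```

Keeping the replicate flag as a column, rather than dropping replicates immediately, leaves the "_REP" pairs available for later concordance checks.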

Density Plot of Normalized Beta Values

Stage 2 - Epigenetic Clock Analysis

  • GrimAge2 Calculation: Calculated GrimAge2 and AgeAccelGrim2 using custom R functions leveraging bigmemory for efficient handling of large matrices and doParallel for parallel processing of subcomponents. This included loading pre-trained GrimAge2 model weights and reference values.
  • Other Clock Calculations: Calculated various epigenetic clocks using R packages (DNAmAge, DunedinPoAm, DunedinPACE) and Python package (pyaging). This required handling missing CpG sites for each clock and managing compatibility between R and Python data structures.
  • Probe Analysis and Verification: Compared the CpG sites required by GrimAge2 with the available CpG sites in the methylation data and reference array annotations (IlluminaHumanMethylationEPICv2anno.20a1.hg38). Identified and documented missing probes.
  • Descriptive Statistics: Computed descriptive statistics (e.g., mean, standard deviation, median, quartiles) for age, GrimAge2, and AgeAccelGrim2, stratified by diagnosis, using data.table and pandas.
  • Correlation Analysis: Calculated Pearson, Spearman, and Kendall correlations between chronological age and GrimAge2 using R's stats package.
  • Comparative Analysis: Performed t-tests and ANCOVA to compare AgeAccelGrim2 between bipolar and control groups, with age as a covariate, using R's stats package and Python's statsmodels package.
  • Data Visualization: Generated various plots, including density plots, box plots, violin plots, scatter plots, bar plots, and pie charts, to visualize data distributions, correlations, and group differences using ggplot2, plotly in R and seaborn, matplotlib in Python. This involved customizing plot aesthetics, adding statistical annotations (p-values, effect sizes), and creating multi-panel figures.
  • Data Export and Reporting: Exported results and summary tables to CSV and Excel files using data.table's fwrite in R and pandas' DataFrame.to_csv in Python for reporting and sharing.
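
As an illustration of the correlation step above, the sketch below computes Pearson, Spearman, and Kendall correlations between chronological age and a clock estimate with scipy, and derives an age-acceleration value as the residual from regressing the clock on age. The data are simulated, and the residual definition is an assumption about how AgeAccelGrim2 is obtained; the project itself reports using R's stats package for the correlations.

```python
import numpy as np
from scipy import stats

# Simulated data standing in for chronological age and DNAmGrimAge2 (not real results).
rng = np.random.default_rng(0)
age = rng.uniform(20, 85, size=200)
grimage2 = 10 + 0.9 * age + rng.normal(0, 4, size=200)

# Age acceleration as the residual of a linear regression of the clock on age
# (assumed definition; residuals are uncorrelated with age by construction).
slope, intercept = np.polyfit(age, grimage2, deg=1)
age_accel = grimage2 - (intercept + slope * age)

# Three correlation measures between chronological age and the clock estimate.
pearson_r, pearson_p = stats.pearsonr(age, grimage2)
spearman_rho, spearman_p = stats.spearmanr(age, grimage2)
kendall_tau, kendall_p = stats.kendalltau(age, grimage2)

print(f"Pearson r={pearson_r:.3f}, Spearman rho={spearman_rho:.3f}, Kendall tau={kendall_tau:.3f}")
```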

BPDNAm GrimAge2 Source Code Variables

  1. DNAmGrimAge2 and AgeAccelGrim2
  2. Seven DNAm-based plasma protein estimates
  3. DNAm-based pack years (DNAmPACKYRS)

| Name | Variable | Unit |
| --- | --- | --- |
| DNAm GrimAge2 | DNAmGrimAge2 | year |
| GrimAge2 age acceleration | AgeAccelGrim2 | year |
| DNAm Growth differentiation factor 15 | DNAmGDF15 | pg/mL |
| DNAm Beta-2-microglobulin | DNAmB2M | pg/mL |
| DNAm Cystatin-C | DNAmCystatinC | pg/mL |
| DNAm Tissue Inhibitor Metalloproteinases 1 | DNAmTIMP1 | pg/mL |
| DNAm Adrenomedullin | DNAmADM | pg/mL |
| DNAm Plasminogen activator inhibitor 1 | DNAmPAI1 | pg/mL |
| DNAm Leptin | DNAmLeptin | pg/mL |
| DNAm log C-reactive protein | DNAmlogCRP | mg/L (in CRP) |
| DNAm log hemoglobin A1C | DNAmlogA1C | % (in A1C) |
| DNAm smoking pack years | DNAmPACKYRS | |
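
| Metric | Chronological Age (Bipolar) | Chronological Age (Other) | DNAmGrimAge2 (Bipolar) | DNAmGrimAge2 (Other) | AgeAccelGrim2 (Bipolar) | AgeAccelGrim2 (Other) |
| --- | --- | --- | --- | --- | --- | --- |
| Mean | 50.19 | 53.45 | 59.43 | 59.98 | 0.66 | -1.10 |
| SD | 12.39 | 15.53 | 9.66 | 11.88 | 4.09 | 3.85 |
| Min | 19.00 | 18.00 | 31.61 | 29.69 | -9.18 | -9.49 |
| Max | 85.00 | 91.30 | 90.03 | 95.51 | 3.18 | 1.35 |
| Q1 | 42.00 | 44.35 | 53.00 | 52.85 | 14.75 | 15.00 |
| Q2 | 51.00 | 56.00 | 59.79 | 61.55 | -2.21 | -3.91 |
| Q3 | 59.00 | 65.00 | 66.37 | 68.38 | 0.27 | -1.61 |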
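
A summary of this shape (mean, SD, min, max, and quartiles, stratified by diagnosis) can be produced with a pandas groupby. The sketch below uses a small made-up data frame; the values are placeholders, not cohort data, and the helper names `q1` and `q3` are hypothetical.

```python
import pandas as pd

# Toy data; values are placeholders, not cohort results.
df = pd.DataFrame({
    "Diagnosis": ["Bipolar", "Bipolar", "Bipolar", "Other", "Other", "Other"],
    "Age": [40.0, 50.0, 60.0, 30.0, 50.0, 70.0],
    "AgeAccelGrim2": [1.0, 2.0, 3.0, -2.0, 0.0, 2.0],
})


def q1(s):
    """First quartile (25th percentile)."""
    return s.quantile(0.25)


def q3(s):
    """Third quartile (75th percentile)."""
    return s.quantile(0.75)


# One row per diagnosis; columns are (variable, statistic) pairs.
summary = (
    df.groupby("Diagnosis")[["Age", "AgeAccelGrim2"]]
    .agg(["mean", "std", "min", "max", q1, "median", q3])
    .round(2)
)
print(summary.T)  # transpose so statistics run down the rows
```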

Correlation Between Age and GrimAge2 with Missing Probes Summary

DNAm GrimAge2 vs Chronological Age by Diagnosis

AgeAccelGrim2 by Diagnosis

Distribution of AgeAccelGrim2 by Diagnosis

Density Distribution of AgeAccelGrim2 by Diagnosis

Mean AgeAccelGrim2 by Diagnosis
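
A group comparison of AgeAccelGrim2 by diagnosis, first as a t-test and then as an ANCOVA covarying for age and sex, might look like the sketch below. The data are simulated with an arbitrary built-in group difference, the use of Welch's variant of the t-test is an assumption, and the model formula is illustrative rather than the project's actual specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Simulated data with an arbitrary group effect (not real results).
rng = np.random.default_rng(1)
n = 600
df = pd.DataFrame({
    "Diagnosis": rng.choice(["Bipolar", "Other"], size=n),
    "Age": rng.uniform(20, 85, size=n),
    "Female": rng.integers(0, 2, size=n),
})
effect = np.where(df["Diagnosis"] == "Bipolar", 2.0, 0.0)
df["AgeAccelGrim2"] = effect - 1.0 * df["Female"] + rng.normal(0, 3, size=n)

# Put "Other" first so it becomes the reference level in the model.
df["Diagnosis"] = pd.Categorical(df["Diagnosis"], categories=["Other", "Bipolar"])

# Unadjusted comparison: Welch's t-test (does not assume equal variances).
bp = df.loc[df["Diagnosis"] == "Bipolar", "AgeAccelGrim2"]
other = df.loc[df["Diagnosis"] == "Other", "AgeAccelGrim2"]
t_stat, t_p = stats.ttest_ind(bp, other, equal_var=False)

# ANCOVA as a linear model: diagnosis effect adjusted for age and sex.
model = smf.ols("AgeAccelGrim2 ~ Diagnosis + Age + Female", data=df).fit()
adj_diff = model.params["Diagnosis[T.Bipolar]"]
adj_p = model.pvalues["Diagnosis[T.Bipolar]"]

print(f"t = {t_stat:.2f} (p = {t_p:.3g}); adjusted difference = {adj_diff:.2f} (p = {adj_p:.3g})")
```

The coefficient on `Diagnosis[T.Bipolar]` is the adjusted mean difference in years between groups at equal age and sex.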

| SampleID | Female | Age | Diagnosis | DNAmGrimAge2 | AgeAccelGrim2 | DNAmADM | DNAmCystatinC | DNAmGDF15 | DNAmLeptin | DNAmPAI1 | DNAmTIMP1 | DNAmlogCRP | DNAmlogA1C | DNAmPACKYRS | DNAmB2M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 431-BG00001 | 1.0 | 51.0 | BipolarI | 56.85389522787445 | -2.4955535070147974 | 819.7894 | 1273303.1 | 1267.2058 | 115190.06 | 121522.39 | 33414.67 | -4.123713 | 1.3817582 | 170.05283 | 4716012.5 |
| 431-BG00002 | 1.0 | 33.0 | BipolarI | 44.49051949954246 | -2.1223872026143624 | 840.1452 | 1285489.1 | 1326.883 | 124625.3 | 121982.63 | 32898.43 | -4.2515163 | 1.4281118 | 164.60991 | 4838856.0 |
| 431-BG00003 | 0.0 | 49.0 | BipolarI | 57.901584527482825 | -0.03269287043615776 | 827.7234 | 1353035.5 | 1505.8076 | 119267.16 | 115462.27 | 33098.508 | -4.18434 | 1.3656142 | 166.24052 | 4772260.0 |
| 431-BG00004 | 0.0 | 41.0 | BipolarI | 58.85398316038897 | 6.580391110351066 | 835.9474 | 1316691.6 | 1615.7949 | 121670.67 | 112583.19 | 33452.492 | -4.7012987 | 1.3756903 | 168.39299 | 4817889.5 |
| 431-BG00006 | 0.0 | 64.0 | BipolarI | 67.78206206165513 | -0.7660003635408827 | 842.70087 | 1318320.0 | 1619.623 | 122580.05 | 122025.65 | 32792.11 | -4.18104 | 1.4210433 | 173.75922 | 4698416.0 |

Stage 3 - Additional Analysis

Tasks

  • Double-check the QC steps below and search for the ~180 missing probes
  • Perform other analyses on the dataset
  • Compare methylation data between _REP and non-_REP sample pairs, and acquire missing information for these samples if it differs
  • Split the "Other" non-bipolar samples with higher granularity
    • Ensure age and sex matching between groups, and covary/control for other variables; see publications in Overview Presentation for QC guidance
  • Compute biological age acceleration using other methylation clocks in pyaging: a Python-based compendium of GPU-optimized aging clocks, and compare with GrimAge2
  • Examine plasma protein estimates and other clock outputs for outsized influence on AgeAccelGrim2 and other measures of accelerated biological aging
    • Compare lithium effects on aging in bipolar disorder; see Methylcheck
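
For the missing-probe search in the first task, one possible starting point is a set comparison between the CpGs a clock requires and the probes present in the beta matrix. The sketch below assumes that probe IDs may carry EPICv2-style suffixes (such as `_TC21`) that need stripping before matching, and that duplicate probes are averaged; both are assumptions, and the probe IDs and beta values are made up.

```python
import pandas as pd

# Hypothetical list of CpGs required by a clock.
required_cpgs = ["cg00000029", "cg00000108", "cg00000109", "cg00000165"]

# Hypothetical beta matrix (probes x samples) with suffixed probe IDs.
betas = pd.DataFrame(
    {"S001": [0.52, 0.50, 0.81, 0.10], "S002": [0.49, 0.51, 0.79, 0.12]},
    index=["cg00000029_TC21", "cg00000029_BC21", "cg00000108_BC11", "cg00000236_TC11"],
)

# Strip everything from the first underscore to recover the base probe ID.
base_ids = betas.index.str.replace(r"_.*$", "", regex=True)

# Collapse duplicate probes by averaging their beta values (assumed rule).
collapsed = betas.groupby(base_ids).mean()

# Compare required CpGs against what is available after collapsing.
available = set(collapsed.index)
present = sorted(set(required_cpgs) & available)
missing = sorted(set(required_cpgs) - available)
coverage = len(present) / len(required_cpgs)

print(f"{len(missing)} of {len(required_cpgs)} required CpGs missing: {missing}")
```

Running the comparison both before and after suffix stripping would show how many of the missing probes are naming mismatches rather than probes truly absent from the array.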