Accelerated Biological Aging in Bipolar Disorder

Motivation and Background

This project investigates accelerated biological aging in the largest bipolar disorder DNA methylation cohort to date. The aim is to identify differences in epigenetic age acceleration between individuals with bipolar disorder and controls, along with the drivers and modifiers of those differences. DNA methylation data from Illumina EPIC arrays are preprocessed and quality controlled, with particular attention to missing probes and normalization. The R packages minfi, BioAge, dnaMethyAge, and methylclock are used to prepare the data for epigenetic clock analysis, and GrimAge2 and other epigenetic aging algorithms from the pyaging Python package are then applied. Statistical analyses in R and Python, including t-tests, ANCOVA, and correlation analysis, assess differences in GrimAge2 age acceleration between diagnostic groups while adjusting for age and sex. Data wrangling uses R's data.table and Python's pandas to clean and transform the raw data, and plots for exploration and presentation are generated with seaborn and matplotlib. Future work will compare multiple methylation aging clocks, characterize the individual contributions of the GrimAge2 subcomponents, and explore the effects of lithium treatment and other environmental modifiers on epigenetic age acceleration in bipolar disorder.

Relevant Literature
Meta-analysis of epigenetic aging in schizophrenia reveals multifaceted relationships with age, sex, illness duration, and polygenic risk

Computational Overview

Programming Languages: R and Python.
R Packages: minfi, BioAge, dnaMethyAge, methylclock, dplyr, tidyr, data.table, purrr, ggplot2, plotly, RColorBrewer, reshape2, GenomicRanges, SummarizedExperiment, qs, bigmemory, doParallel, parallel, arrow.
Python Packages: pyaging, pandas, numpy, scipy, seaborn, matplotlib, sklearn (specifically KMeans, StandardScaler), statsmodels, pygam, pyarrow.
High-Performance Computing (HPC): Conducted in the Hoffman2 HPC environment utilizing SGE job scheduling and parallel processing in R for computationally intensive tasks.
Data Management: Performed data cleaning, transformation, merging, and subsetting across both R and Python. Efficiently processed large datasets using packages including bigmemory and pyarrow. Generated reproducible analysis workflows by logging key data characteristics (e.g., data dimensions, timestamps) in output filenames.
Statistical Analysis: Conducted descriptive statistics, correlation analysis, t-tests, and ANCOVA; generalized additive models (GAMs) are planned.
Data Visualization: Created a wide range of static visualizations for exploratory data analysis and presentation of results.
Version Control: Utilized GitHub for code sharing and version control.
Workflow Design: Designed and implemented a multi-stage analysis pipeline involving data preprocessing, clock calculation, statistical analysis, visualization, and reporting, including integration of R and Python components.
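
The filename logging mentioned under Data Management can be sketched as follows. This is a minimal illustration, not the project's own code: the helper `stamped_path` and the `<stem>_<rows>x<cols>_<timestamp>` naming scheme are assumptions made for the example.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

import pandas as pd


def stamped_path(df: pd.DataFrame, stem: str, out_dir: str = ".",
                 ext: str = "csv", now: Optional[datetime] = None) -> Path:
    """Build '<stem>_<rows>x<cols>_<YYYYmmdd-HHMMSS>.<ext>' (hypothetical scheme)."""
    now = now or datetime.now()
    n_rows, n_cols = df.shape
    return Path(out_dir) / f"{stem}_{n_rows}x{n_cols}_{now:%Y%m%d-%H%M%S}.{ext}"


# Toy frame; a fixed timestamp is passed only to make the example deterministic.
demo = pd.DataFrame({"SampleID": ["s1", "s2", "s3"], "Age": [51.0, 33.0, 49.0]})
path = stamped_path(demo, "grimage2_results", now=datetime(2024, 1, 2, 3, 4, 5))
print(path.name)  # grimage2_results_3x2_20240102-030405.csv
```

Encoding the dimensions in the name makes it easy to spot when a rerun silently dropped samples or probes.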

Results

Overview Presentation

Stage 1 - Data Acquisition and Preparation

Data Acquisition: Acquired raw DNA methylation data (likely IDAT files) from Illumina EPIC arrays along with accompanying sample sheets containing demographic and diagnostic information. Potentially integrated data from multiple sources (e.g., "Bipolar 2023 Sample Sheet", "2000_sample_covariates", "highcov_technical_covariates", "Complete BIG Data").
Data Import and Formatting: Imported data into R and converted to appropriate formats (e.g., GenomicRatioSet) for downstream analysis using minfi. Used R's read.csv, read_excel, and read.table for sample sheet information. Employed Python's pyarrow.feather for efficient loading of preprocessed and saved data subsets.
Data Cleaning and Quality Control (QC): Performed quality control procedures, including:
- Checking for missing data in both methylation and sample annotation data.
- Addressing missing probe information using external resources like the mepylome package and manifest files.
- Removing duplicate probe data.
- Comparing predicted and reported sex.
Data Wrangling and Transformation: Manipulated and transformed data using dplyr, tidyr, data.table in R and pandas in Python. This included renaming columns, recoding variables (e.g., Gender), handling "_REP" sample duplicates, merging datasets, calculating age in months/years from date data, and summarizing missing data patterns.
Data Subsetting: Created subsets of data for specific analyses (e.g., selecting samples with complete data, extracting specific CpG sites related to GrimAge2).
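
The wrangling steps above (recoding Gender, handling "_REP" duplicates, deriving age from dates) might look like this in pandas. It is a sketch under assumptions: the column names (`DOB`, `CollectionDate`, `BaseID`) and the toy values are invented for illustration, and keeping the non-replicate run is one possible policy rather than the one the project necessarily used.

```python
import pandas as pd

# Toy sample sheet; column names and values are illustrative, not the cohort's.
sheet = pd.DataFrame({
    "SampleID": ["A001", "A001_REP", "A002", "A003"],
    "Gender": ["F", "F", "M", "Female"],
    "DOB": ["1970-03-15", "1970-03-15", "1988-11-02", "1955-07-30"],
    "CollectionDate": ["2021-03-15", "2021-03-15", "2021-11-01", "2020-07-30"],
})

# Flag replicate runs and recover the base sample ID they belong to.
sheet["is_rep"] = sheet["SampleID"].str.endswith("_REP")
sheet["BaseID"] = sheet["SampleID"].str.replace(r"_REP$", "", regex=True)

# Recode free-text Gender into a 0/1 Female indicator.
female_map = {"F": 1, "FEMALE": 1, "M": 0, "MALE": 0}
sheet["Female"] = sheet["Gender"].str.upper().map(female_map)

# Age at collection in completed months, then in years.
dob = pd.to_datetime(sheet["DOB"])
collected = pd.to_datetime(sheet["CollectionDate"])
months = (collected.dt.year - dob.dt.year) * 12 + (collected.dt.month - dob.dt.month)
months = months - (collected.dt.day < dob.dt.day).astype(int)
sheet["AgeMonths"] = months
sheet["AgeYears"] = months / 12

# Keep one row per base sample, preferring the non-replicate run.
primary = (sheet.sort_values("is_rep", kind="stable")
                .drop_duplicates("BaseID", keep="first"))
print(primary[["SampleID", "Female", "AgeMonths", "AgeYears"]])
```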

Cohort Demographics

Characteristic	Bipolar	Other
Count	1530	912
Male	655 (42.8%)	382 (41.9%)
Female	875 (57.2%)	530 (58.1%)
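
As a quick check, the percentages in this table can be reproduced from the counts. The chi-square test of sex balance between groups is an addition for illustration and is not reported in the source.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the demographics table above.
counts = pd.DataFrame(
    {"Bipolar": [655, 875], "Other": [382, 530]},
    index=["Male", "Female"],
)

totals = counts.sum()                        # per-group sample counts
percent = (counts / totals * 100).round(1)   # column-wise percentages

print(totals.to_dict())   # {'Bipolar': 1530, 'Other': 912}
print(percent)

# Illustrative addition: test whether sex distribution differs between groups.
chi2, p_value, dof, expected = chi2_contingency(counts.to_numpy())
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}")
```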

Density Plot of Normalized Beta Values

Stage 2 - Epigenetic Clock Analysis

GrimAge2 Calculation: Calculated GrimAge2 and AgeAccelGrim2 using custom R functions leveraging bigmemory for efficient handling of large matrices and doParallel for parallel processing of subcomponents. This included loading pre-trained GrimAge2 model weights and reference values.
Other Clock Calculations: Calculated various epigenetic clocks using R packages (DNAmAge, DunedinPoAm, DunedinPACE) and Python package (pyaging). This required handling missing CpG sites for each clock and managing compatibility between R and Python data structures.
Probe Analysis and Verification: Compared the CpG sites required by GrimAge2 with the available CpG sites in the methylation data and reference array annotations (IlluminaHumanMethylationEPICv2anno.20a1.hg38). Identified and documented missing probes.
Descriptive Statistics: Computed descriptive statistics (e.g., mean, standard deviation, median, quartiles) for age, GrimAge2, and AgeAccelGrim2, stratified by diagnosis, using data.table and pandas.
Correlation Analysis: Calculated Pearson, Spearman, and Kendall correlations between chronological age and GrimAge2 using R's stats package.
Comparative Analysis: Performed t-tests and ANCOVA to compare AgeAccelGrim2 between bipolar and control groups, with age as a covariate, using R's stats package and Python's statsmodels.
Data Visualization: Generated density plots, box plots, violin plots, scatter plots, bar plots, and pie charts to visualize data distributions, correlations, and group differences, using ggplot2 and plotly in R and seaborn and matplotlib in Python. This involved customizing plot aesthetics, adding statistical annotations (p-values, effect sizes), and creating multi-panel figures.
Data Export and Reporting: Exported results and summary tables to CSV and Excel files using data.table's fwrite in R and pandas' DataFrame.to_csv in Python for reporting and sharing.
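
The correlation and comparative analyses above can be sketched in Python on synthetic data. This is not the project's code and the numbers are simulated. It assumes age acceleration is taken as the residual from regressing the clock estimate on chronological age, which is the usual definition of AgeAccelGrim; the group effect, noise level, and formula terms are illustrative choices.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "Age": rng.uniform(18, 90, n),
    "Female": rng.integers(0, 2, n),
    "Diagnosis": rng.choice(["Bipolar", "Other"], n),
})
# Synthetic clock output: tracks age, with an arbitrary +2 year shift for one group.
df["DNAmGrimAge2"] = (20 + 0.75 * df["Age"]
                      + 2.0 * (df["Diagnosis"] == "Bipolar")
                      + rng.normal(0, 4, n))

# Age acceleration: residual of the clock estimate regressed on chronological age.
df["AgeAccelGrim2"] = smf.ols("DNAmGrimAge2 ~ Age", data=df).fit().resid

# Pearson, Spearman, and Kendall correlations between age and the clock estimate.
pearson_r, _ = stats.pearsonr(df["Age"], df["DNAmGrimAge2"])
spearman_rho, _ = stats.spearmanr(df["Age"], df["DNAmGrimAge2"])
kendall_tau, _ = stats.kendalltau(df["Age"], df["DNAmGrimAge2"])

# Welch t-test on age acceleration between groups.
bp = df.loc[df["Diagnosis"] == "Bipolar", "AgeAccelGrim2"]
other = df.loc[df["Diagnosis"] == "Other", "AgeAccelGrim2"]
t_stat, t_p = stats.ttest_ind(bp, other, equal_var=False)

# ANCOVA: group effect on age acceleration, adjusting for age and sex.
ancova = smf.ols(
    "AgeAccelGrim2 ~ C(Diagnosis, Treatment(reference='Other')) + Age + C(Female)",
    data=df,
).fit()
print(ancova.summary().tables[1])
```

Setting "Other" as the reference level means the diagnosis coefficient reads directly as the adjusted difference in years for the bipolar group.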

BPDNAm GrimAge2 Source Code Variables

DNAmGrimAge2 and AgeAccelGrim2
Seven DNAm-based plasma protein estimates
DNAm-based pack years (DNAmPACKYRS)

Name	Variable	Unit
DNAm GrimAge2	DNAmGrimAge2	year
GrimAge2 age acceleration	AgeAccelGrim2	year
DNAm Growth differentiation factor 15	DNAmGDF15	pg/mL
DNAm Beta-2-microglobulin	DNAmB2M	pg/mL
DNAm Cystatin-C	DNAmCystatinC	pg/mL
DNAm Tissue Inhibitor Metalloproteinases 1	DNAmTIMP1	pg/mL
DNAm Adrenomedullin	DNAmADM	pg/mL
DNAm Plasminogen activator inhibitor 1	DNAmPAI1	pg/mL
DNAm Leptin	DNAmLeptin	pg/mL
DNAm log C-reactive protein	DNAmlogCRP	mg/L (in CRP)
DNAm log hemoglobin A1C	DNAmlogA1C	% (in A1C)
DNAm smoking pack years	DNAmPACKYRS

Summary Statistics

	Chronological Age		DNAmGrimAge2		AgeAccelGrim2
Metric	Bipolar	Other	Bipolar	Other	Bipolar	Other
Mean	50.19	53.45	59.43	59.98	0.66	-1.10
SD	12.39	15.53	9.66	11.88	4.09	3.85
Min	19.00	18.00	31.61	29.69	-9.18	-9.49
Max	85.00	91.30	90.03	95.51	14.75	15.00
Q₁	42.00	44.35	53.00	52.85	-2.21	-3.91
Q₂	51.00	56.00	59.79	61.55	0.27	-1.61
Q₃	59.00	65.00	66.37	68.38	3.18	1.35
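
A table with this layout (metrics as rows, one Bipolar/Other column pair per variable) can be built with pandas. The sketch below uses toy values rather than cohort data. It sets the row order explicitly, because `describe()` returns min, quartiles, then max, which is a different order from the one used in the table.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "Diagnosis": ["Bipolar"] * 4 + ["Other"] * 4,
    "Age": [30.0, 40.0, 50.0, 60.0, 35.0, 45.0, 55.0, 65.0],
    "AgeAccelGrim2": [-1.0, 0.0, 1.0, 2.0, -2.0, -1.0, 0.0, 1.0],
})

# describe() order is count, mean, std, min, 25%, 50%, 75%, max; pick rows by name.
order = ["mean", "std", "min", "max", "25%", "50%", "75%"]
labels = ["Mean", "SD", "Min", "Max", "Q1", "Q2", "Q3"]

table = pd.concat(
    {var: toy.groupby("Diagnosis")[var].describe().T.loc[order]
     for var in ["Age", "AgeAccelGrim2"]},
    axis=1,
)
table.index = labels  # relabel only after selecting rows by their original names
print(table.round(2))
```

Selecting rows by name before relabeling keeps each value attached to the statistic it belongs to.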

SampleID	Female	Age	Diagnosis	DNAmGrimAge2	AgeAccelGrim2	DNAmADM	DNAmCystatinC	DNAmGDF15	DNAmLeptin	DNAmPAI1	DNAmTIMP1	DNAmlogCRP	DNAmlogA1C	DNAmPACKYRS	DNAmB2M
431-BG00001	1.0	51.0	BipolarI	56.85389522787445	-2.4955535070147974	819.7894	1273303.1	1267.2058	115190.06	121522.39	33414.67	-4.123713	1.3817582	170.05283	4716012.5
431-BG00002	1.0	33.0	BipolarI	44.49051949954246	-2.1223872026143624	840.1452	1285489.1	1326.883	124625.3	121982.63	32898.43	-4.2515163	1.4281118	164.60991	4838856.0
431-BG00003	0.0	49.0	BipolarI	57.901584527482825	-0.03269287043615776	827.7234	1353035.5	1505.8076	119267.16	115462.27	33098.508	-4.18434	1.3656142	166.24052	4772260.0
431-BG00004	0.0	41.0	BipolarI	58.85398316038897	6.580391110351066	835.9474	1316691.6	1615.7949	121670.67	112583.19	33452.492	-4.7012987	1.3756903	168.39299	4817889.5
431-BG00006	0.0	64.0	BipolarI	67.78206206165513	-0.7660003635408827	842.70087	1318320.0	1619.623	122580.05	122025.65	32792.11	-4.18104	1.4210433	173.75922	4698416.0

Stage 3 - Additional Analysis

Tasks

Double-check the QC steps below and search for the ~180 missing probes
- Complete the "A cross-package Bioconductor workflow for analysing methylation array data" vignette for the cohort DNAm data
- Compare predicted vs reported sex
Perform other analyses on the dataset
- Run DMR and standard analyses with this cohort to replicate prior work
- See Methylcheck and Methylize
Compare methylation data for _REP vs non-_REP sample pairs, and acquire missing information for these samples if the pairs differ
Split the "Other" non-bipolar samples with higher granularity
- Ensure age and sex matching between groups and control for other covariates; see the publications in the Overview Presentation for QC guidance
Compute biological age acceleration using other methylation clocks in pyaging (a Python-based compendium of GPU-optimized aging clocks) and compare with GrimAge2
Assess whether the plasma protein estimates and other clock outputs have an outsized influence on AgeAccelGrim2 and other measures of accelerated biological aging
- Compare lithium effects on aging in bipolar disorder; see Methylcheck
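
For the missing-probe search in the first task, one possible check is sketched below. It is an illustration, not the project's method. It assumes that some probes look missing only because the array manifest appends a suffix such as `_TC21` to probe names, as EPIC v2 manifests are understood to do. The suffix pattern, the helper `probe_report`, and the probe IDs are all placeholders for illustration.

```python
import re

import pandas as pd

# Assumed suffix pattern (e.g. "_TC21"); verify against the actual manifest.
SUFFIX = re.compile(r"_[A-Z]{2}\d{2}$")


def probe_report(required, available, strip_suffix=True):
    """Report which required CpG IDs are absent from the available probe names."""
    avail = pd.Index(available)
    if strip_suffix:
        avail = pd.Index([SUFFIX.sub("", p) for p in avail]).unique()
    req = pd.Index(required).unique()
    missing = req.difference(avail)
    return {
        "n_required": len(req),
        "n_found": len(req) - len(missing),
        "missing": sorted(missing),
    }


# Placeholder IDs, not the GrimAge2 CpG list.
required = ["cg00000029", "cg00000108", "cg00000109", "cg99999999"]
available = ["cg00000029_TC21", "cg00000108_BC21", "cg00000236_TC11"]

raw = probe_report(required, available, strip_suffix=False)
fixed = probe_report(required, available)
print(raw["n_found"], fixed["n_found"])  # 0 2
print(fixed["missing"])                  # ['cg00000109', 'cg99999999']
```

Comparing the two reports separates probes that are absent from the array from probes that are present under a different name.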
