Generate synthetic longitudinal correlated data using distributions and correlations from real-world observational data

ronhandels/synthetic-correlated-data

Generate synthetic data

  • This code generates synthetic data based on an existing dataset and/or user input.
  • For a detailed application in Alzheimer's disease, see "Generate Synthetic Data in R for a Hypothetical Alzheimer's Disease Trial" on medRxiv (go to preview/download PDF).
  • The R code is located in the file named "R code". Please manually adjust the working directory in this code, which is marked by the comment "USER-INPUT REQUIRED" under the heading "PREPARE".

Code steps explained

These steps are also described in the detailed application in Alzheimer's disease, "Generate Synthetic Data in R for a Hypothetical Alzheimer's Disease Trial", on medRxiv (go to preview/download PDF).

Part A: Original Data

  1. Obtain original real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study [*] on
    • i) demographic (age, sex, education),
    • ii) clinical (cognition: Mini-Mental State Examination (MMSE) and Alzheimer's Disease Assessment Scale (ADAS); function: Functional Activities Questionnaire (FAQ); composite cognition/function: Clinical Dementia Rating (CDR), Alzheimer’s Disease Composite Score (ADCOMS)) and
    • iii) biological (genetics: APOE4; cerebrospinal fluid: ABeta, Tau; imaging: PET-SUVR-centiloid) outcomes at baseline, 6, 12 and/or 18-month follow-up (35 variables), with missing data multiple-imputed to obtain 10 sets of 537 individuals.
  2. Estimate (theoretical) minimum and maximum (all continuous variables) and proportions (all categorical variables).
  3. Rescale to 0-1 range (continuous).
  4. Estimate beta distribution shape parameters (method of moments; continuous).
  5. Transform to cumulative probability using the beta cumulative distribution function (CDF) (using shape parameters; continuous) and to cumulative probability (categorical).
  6. Transform to a normal distribution (using quantile function, i.e., inverse cumulative distribution function).
  7. Estimate variance-covariance matrix.
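
Steps 2 to 7 can be sketched in base R as follows. This is a minimal illustration on hypothetical toy data, not the repository's code: the variable names (`score`, `age`), the limits, and the helper `beta_mom` are assumptions made for this example, and categorical variables are left out for brevity.

```r
set.seed(42)

# Hypothetical toy data (not ADNI): two correlated, bounded continuous variables
n <- 500
z <- rnorm(n)
dat <- data.frame(
  score = 30 * plogis(1.5 + 0.8 * z + rnorm(n, sd = 0.5)),  # lies in (0, 30)
  age   = 55 + 35 * plogis(0.3 * z + rnorm(n, sd = 0.7))    # lies in (55, 90)
)

# Step 2: (theoretical) minimum and maximum per variable (assumed values)
lims <- list(score = c(0, 30), age = c(55, 90))

# Step 4 helper: beta shape parameters by the method of moments
beta_mom <- function(u) {
  m <- mean(u)
  s2 <- var(u)
  k <- m * (1 - m) / s2 - 1
  c(shape1 = m * k, shape2 = (1 - m) * k)
}

shapes <- list()
normal_scores <- dat
for (nm in names(dat)) {
  # Step 3: rescale to the 0-1 range
  u <- (dat[[nm]] - lims[[nm]][1]) / (lims[[nm]][2] - lims[[nm]][1])
  # Step 4: estimate beta shape parameters
  shapes[[nm]] <- beta_mom(u)
  # Step 5: cumulative probability from the beta CDF
  p <- pbeta(u, shapes[[nm]][["shape1"]], shapes[[nm]][["shape2"]])
  # Step 6: standard normal quantile function (inverse CDF)
  normal_scores[[nm]] <- qnorm(p)
}

# Step 7: variance-covariance matrix on the normal scale
V <- cov(normal_scores)
print(shapes)
print(V)
```

In this sketch, observations lying exactly on a theoretical minimum or maximum would map to a cumulative probability of 0 or 1 and therefore to an infinite normal score; the toy data avoid this by construction.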

Part B: Synthetic data

  1. Generate random correlated normal data (mean=0, SD=1) using Cholesky decomposition of the variance-covariance matrix from Part A, step 7.
  2. Transform to cumulative probability using the normal cumulative distribution function (CDF).
  3. Transform to beta distribution (using quantile function, i.e., inverse cumulative distribution function) (using beta distribution shape parameters from Part A, step 4; continuous).
  4. Rescale to original range (using minimum and maximum and proportions from Part A, step 2).
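
The four Part B steps can be sketched in base R as follows. The inputs `V`, `shapes` and `lims` are hypothetical values standing in for the Part A results, and only continuous variables are shown; this is an illustration under those assumptions rather than the repository's code.

```r
set.seed(123)

# Assumed inputs (hypothetical values standing in for Part A results)
vars <- c("score", "age")
V <- matrix(c(1.0, 0.4,
              0.4, 1.0), nrow = 2, dimnames = list(vars, vars))
shapes <- list(score = c(6, 2), age = c(3, 3))       # beta shape parameters
lims   <- list(score = c(0, 30), age = c(55, 90))    # minimum and maximum

n_syn <- 1000

# Step 1: correlated normal draws via Cholesky decomposition.
# chol() returns an upper-triangular R with t(R) %*% R equal to V,
# so independent draws Z multiplied by R have covariance V.
Z <- matrix(rnorm(n_syn * length(vars)), nrow = n_syn)
X <- Z %*% chol(V)

# Step 2: the normal CDF gives cumulative probabilities
P <- pnorm(X)
colnames(P) <- vars

# Steps 3-4: beta quantile function, then rescale to the original range
syn <- sapply(vars, function(nm) {
  u <- qbeta(P[, nm], shapes[[nm]][1], shapes[[nm]][2])
  lims[[nm]][1] + u * (lims[[nm]][2] - lims[[nm]][1])
})
syn <- as.data.frame(syn)

summary(syn)
cor(syn)
```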

[*] The example data available on GitHub were generated by taking the original ADNI data and adding a fixed change as well as large random variation to each variable. Therefore, the example data do not contain any original ADNI data, do not represent ADNI data, and, in our view, are not clinically correct. We therefore recommend using the example data only to understand our method in our tutorial, in which these example data are used. Note that part C of the "Application in Alzheimer's Disease" is not part of the code on GitHub.

Limitations

The results of this study are subject to several limitations.

  • First, correlations among the data were estimated on the normalized scale and assumed to be representative of correlations on the original scale. This assumption is likely incorrect when the data are not normally distributed. However, for much of the data the impact seems relatively small, as reflected by the relatively small deviation in correlations between the observed and synthetically recreated data.
  • Second, missing data and drop-out were not simulated, which limits how well the data represent a real-world setting, where missing data and drop-out in trials are common.
  • Third, we believe synthetic data are generally only as good as the underlying models that parameterize them. Our method assumes that the data can be correctly described by the parameters of the beta distribution and the correlation coefficient. Data from multimodal distributions or with non-linear associations are likely to be recreated incorrectly by our method.
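
The first limitation can be illustrated with a small simulation. The sketch below uses a hypothetical correlation (`rho`) and deliberately skewed, hypothetical beta shape parameters to show that the Pearson correlation on the original (beta) scale can differ from the correlation on the normal scale; it is an illustration only and is not taken from the repository's code.

```r
set.seed(7)

rho <- 0.6   # hypothetical correlation on the normal scale
n <- 1e5

# Two standard normal variables with correlation rho
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Map to strongly skewed beta marginals (hypothetical shape parameters)
x1 <- qbeta(pnorm(z1), 0.5, 5)   # right-skewed
x2 <- qbeta(pnorm(z2), 5, 0.5)   # left-skewed

r_normal   <- cor(z1, z2)
r_original <- cor(x1, x2)
c(normal_scale = r_normal, original_scale = r_original)
```

The more the marginal distributions depart from a normal shape, the larger the gap between the two correlations tends to be; for roughly symmetric beta distributions the gap is expected to be small.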

Application in Alzheimer's Disease

INTRODUCTION: Representative data from recent Alzheimer's Disease (AD) trials are difficult to obtain. We aimed to generate a synthetic version of an original real-world observational dataset, subsequently apply a plausible AD treatment effect, and make our method available as open source.

METHODS: Synthetic data were generated in the following steps: (1) Obtain real-world data from the ADNI study on demographic (age, sex, education), clinical (cognition: MMSE and ADAS; function: FAQ; composite cognition/function: CDR, ADCOMS) and biological (genetics: APOE4; cerebrospinal fluid: ABeta, Tau; imaging: PET-SUVR-centiloid) outcomes at baseline, 6, 12 and/or 18-month follow-up (35 variables), with missing data multiple-imputed to obtain 10 sets of 537 individuals. (2) Estimate (theoretical) minimum and maximum (all continuous variables) and proportions (all categorical variables). (3) Rescale to 0-1 range (continuous). (4) Estimate beta distribution shape parameters (method of moments; continuous). (5) Transform to cumulative probability using the beta cumulative distribution function (using shape parameters; continuous) and to cumulative probability (categorical). (6) Transform to a normal distribution. (7) Estimate variance-covariance matrix. (8) Generate random correlated normal data using Cholesky decomposition of the variance-covariance matrix. (9) Transform to cumulative probability using the normal cumulative distribution function. (10) Transform to beta distribution (using shape parameters; continuous). (11) Rescale to original range. (12) Keep half as control arm, and half as intervention arm, and estimate change from baseline. (13) Multiply intervention change from baseline with self-defined hypothetical relative treatment effect. We assumed correlations on the normalized scale were similar to correlations on the original scale. R code is available on GitHub: https://github.com/ronhandels/synthetic-correlated-data.
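
Steps (12) and (13) are not part of the code on GitHub. A minimal sketch of how they could look in R is given below, assuming a hypothetical outcome measured at baseline (`cdr_m0`) and at 18 months (`cdr_m18`), a split by row order, and a hypothetical relative treatment effect of 0.75 (25% less change in the intervention arm). All names and values are assumptions for illustration.

```r
set.seed(2023)

# Hypothetical synthetic data: one outcome at baseline and at 18 months
n <- 200
syn <- data.frame(id = seq_len(n), cdr_m0 = runif(n, 0.5, 4))
syn$cdr_m18 <- syn$cdr_m0 + rnorm(n, mean = 1.5, sd = 1)

# Step 12: keep half as control arm and half as intervention arm
# (split by row order here, which is an assumption), then change from baseline
syn$arm <- rep(c("control", "intervention"), each = n / 2)
syn$change <- syn$cdr_m18 - syn$cdr_m0

# Step 13: multiply the intervention change by the relative treatment effect
rel_effect <- 0.75  # hypothetical value
is_int <- syn$arm == "intervention"
syn$change_trt <- ifelse(is_int, syn$change * rel_effect, syn$change)

# Follow-up value implied by the adjusted change
syn$cdr_m18_trt <- syn$cdr_m0 + syn$change_trt

aggregate(change_trt ~ arm, data = syn, FUN = mean)
```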

RESULTS: The synthetic distributions and means over time were very similar to those of the original data (visually assessed). The median absolute difference in pairwise correlations between original and synthetic data was 0.02 (95th percentile=0.11, max=0.18).
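
A comparison of pairwise correlations of this kind could be computed as sketched below. The helper `cor_diff_summary` and the toy data are hypothetical and only show the calculation of the median, 95th percentile and maximum of the absolute differences; they do not reproduce the reported values.

```r
set.seed(99)

# Summarize absolute differences in pairwise correlations between two datasets
# that share the same columns (hypothetical helper)
cor_diff_summary <- function(orig, syn) {
  d <- abs(cor(orig) - cor(syn))
  d <- d[upper.tri(d)]  # each variable pair once, diagonal excluded
  c(median = median(d), p95 = unname(quantile(d, 0.95)), max = max(d))
}

# Hypothetical toy data drawn twice from the same process
make_toy <- function(n) {
  z <- rnorm(n)
  data.frame(a = z + rnorm(n), b = 0.5 * z + rnorm(n), c = rnorm(n))
}
orig_toy <- make_toy(500)
syn_toy  <- make_toy(500)

res <- cor_diff_summary(orig_toy, syn_toy)
print(res)
```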

CONCLUSION: We judged our method sufficiently valid to generate synthetic correlated plausible hypothetical trial results.

Other work

See the file "poster syntehtic data ISPOR.pdf" for details and supporting figures on an application, accepted at the ISPOR conference (www.ispor.org) in Copenhagen, 2023.

Acknowledgment

Developers:

  • Ron Handels (Maastricht University, Netherlands)
  • Linus Jonsson (Karolinska Institutet, Sweden)
  • Lars Lau Raket (Lund University, Sweden)

Data

Data used in the "Application in Alzheimer's Disease" were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
