- This code generates synthetic data based on an existing dataset and/or user input.
- For a detailed application in Alzheimer's Disease, see "Generate Synthetic Data in R for a Hypothetical Alzheimer's Disease Trial" on medRxiv (go to preview/download PDF).
- The R code is located in the file named "R code". Before running it, manually set the working directory in this code; the relevant line is marked by the comment "USER-INPUT REQUIRED" under the heading "PREPARE".
The steps below can also be found in the detailed application in Alzheimer's Disease, "Generate Synthetic Data in R for a Hypothetical Alzheimer's Disease Trial", on medRxiv (go to preview/download PDF).
1. Obtain original real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study* on
   - i) demographic (age, sex, education),
   - ii) clinical (cognition: Mini-Mental State Examination (MMSE) and Alzheimer's Disease Assessment Scale (ADAS); function: Functional Activities Questionnaire (FAQ); composite cognition/function: Clinical Dementia Rating (CDR), Alzheimer’s Disease Composite Score (ADCOMS)) and
   - iii) biological (genetics: APOE4; cerebrospinal fluid: ABeta, Tau; imaging: PET-SUVR-centiloid) outcomes at baseline, 6, 12 and/or 18-month follow-up (35 variables), with missing data multiple-imputed to obtain 10 sets of 537 individuals.
2. Estimate (theoretical) minimum and maximum (all continuous variables) and proportions (all categorical variables).
3. Rescale to 0-1 range (continuous).
4. Estimate beta distribution shape parameters (method of moments; continuous).
5. Transform to cumulative probability using the cumulative distribution function (CDF) (using shape parameters from step 4; continuous) and to cumulative probability (categorical).
6. Transform to a normal distribution (using the quantile function, i.e., inverse cumulative distribution function).
7. Estimate variance-covariance matrix.
8. Generate random correlated normal data (mean=0, SD=1) using Cholesky decomposition of the variance-covariance matrix from step 7.
9. Transform to cumulative probability using the normal cumulative distribution function (CDF).
10. Transform to beta distribution (using the quantile function, i.e., inverse cumulative distribution function, with the beta distribution shape parameters from step 4; continuous).
11. Rescale to original range (using minimum and maximum and proportions from step 2).
[*] Note that the example data available on GitHub were generated by taking the original ADNI data and adding a fixed change as well as large random variation to each variable. Therefore, the example data do not contain any original ADNI data, do not represent ADNI data, and, in our view, are not clinically correct. We therefore recommend using the example data only to understand the method in our tutorial, in which these example data are used. Note that part C of the "Application in Alzheimer's Disease" is not part of the code on GitHub.
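
Steps 2 to 11 above can be sketched for continuous variables in a few lines of base R. This is a minimal illustration under stated assumptions, not the code from the "R code" file: the input data are simulated stand-ins (not ADNI data), the variable names and limits in `lim` are hypothetical, and categorical variables are left out.

```r
set.seed(42)

# Hypothetical stand-in for the original data (NOT ADNI data): two correlated variables.
n  <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + 0.8 * rnorm(n)
orig <- cbind(age = 72 + 7 * x1, mmse = 30 * plogis(1.5 + x2))

# Step 2: (theoretical) minimum and maximum per variable (assumed values).
lim <- rbind(age = c(40, 105), mmse = c(0, 30))

# Step 3: rescale each variable to the 0-1 range.
scaled <- sweep(sweep(orig, 2, lim[, 1], "-"), 2, lim[, 2] - lim[, 1], "/")

# Step 4: beta shape parameters by the method of moments.
m <- colMeans(scaled)
v <- apply(scaled, 2, var)
k <- m * (1 - m) / v - 1
shape1 <- m * k
shape2 <- (1 - m) * k

# Steps 5-6: beta CDF, then the standard normal quantile function.
z <- sapply(colnames(scaled), function(j) {
  qnorm(pbeta(scaled[, j], shape1[j], shape2[j]))
})

# Step 7: variance-covariance matrix on the normal scale.
S <- cov(z)

# Step 8: correlated normal draws. chol() returns the upper-triangular
# factor R with t(R) %*% R equal to S, so independent N(0, 1) draws
# post-multiplied by R have covariance S.
n_new <- 5000
z_new <- matrix(rnorm(n_new * ncol(z)), ncol = ncol(z)) %*% chol(S)
colnames(z_new) <- colnames(z)

# Steps 9-10: normal CDF, then the beta quantile function.
u_new <- sapply(colnames(z_new), function(j) {
  qbeta(pnorm(z_new[, j]), shape1[j], shape2[j])
})

# Step 11: rescale back to the original range.
synth <- sweep(sweep(u_new, 2, lim[, 2] - lim[, 1], "*"), 2, lim[, 1], "+")

# Compare correlations in the stand-in and the synthetic data.
print(round(cor(orig), 2))
print(round(cor(synth), 2))
```

Because the beta distribution is bounded, every synthetic value stays within the limits chosen in step 2, which is one reason to prefer theoretical limits over the observed minimum and maximum.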
The results of this study are subject to several limitations.
- First, correlations were estimated on the normalized scale and assumed to be representative of correlations on the original scale. This assumption is likely incorrect when the data are not normally distributed. However, for much of the data the impact seems relatively small, as reflected by the relatively small deviation between the correlations in the observed data and those in the synthetically recreated data.
- Second, missing data and drop-out were not simulated, which limits the representativeness of the data for a real-world setting, where missing data and drop-out in trials are common.
- Third, we believe synthetic data are in general only as good as the underlying models that parameterize them. Our method assumes that the data can be correctly described by the parameters of the beta distribution and the correlation coefficient. Data from multimodal distributions or with non-linear associations are therefore likely to be recreated incorrectly by our method.
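
The deviation in correlations mentioned in the first limitation can be checked by comparing all pairwise correlations between two datasets. The sketch below is one way to do this in base R; the helper `cor_abs_diff` and the simulated inputs are hypothetical, and the use of Pearson correlation is an assumption.

```r
# Summarize absolute differences in pairwise (Pearson) correlations
# between two datasets that share the same columns.
cor_abs_diff <- function(a, b) {
  d <- abs(cor(a) - cor(b))
  d <- d[upper.tri(d)]  # each pair once, diagonal excluded
  c(median = median(d),
    p95    = unname(quantile(d, 0.95)),
    max    = max(d))
}

# Hypothetical example: two independent samples from the same model,
# standing in for "original" and "synthetic" data.
set.seed(3)
S   <- 0.5 + 0.5 * diag(4)  # unit variances, all pairwise correlations 0.5
gen <- function(n) matrix(rnorm(n * 4), ncol = 4) %*% chol(S)
orig_demo  <- gen(500)
synth_demo <- gen(500)

res <- cor_abs_diff(orig_demo, synth_demo)
print(round(res, 3))
```

Small values indicate that the synthetic data preserve the correlation structure; large values for particular pairs can point to variables for which the normal-scale assumption works poorly.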
INTRODUCTION: Representative data of recent Alzheimer's Disease (AD) trials are difficult to obtain. We aimed to generate a synthetic version of an original real-world observational dataset, subsequently apply a plausible AD treatment effect, and make our method available as open source.
METHODS: Synthetic data were generated in the following steps: (1) Obtain real-world data from the ADNI study on demographic (age, sex, education), clinical (cognition: MMSE and ADAS; function: FAQ; composite cognition/function: CDR, ADCOMS) and biological (genetics: APOE4; cerebrospinal fluid: ABeta, Tau; imaging: PET-SUVR-centiloid) outcomes at baseline, 6, 12 and/or 18-month follow-up (35 variables), with missing data multiple-imputed to obtain 10 sets of 537 individuals. (2) Estimate (theoretical) minimum and maximum (all continuous variables) and proportions (all categorical variables). (3) Rescale to 0-1 range (continuous). (4) Estimate beta distribution shape parameters (method of moments; continuous). (5) Transform to cumulative probability distribution function (using shape parameters; continuous) and to cumulative probability (categorical). (6) Transform to a normal distribution. (7) Estimate variance-covariance matrix. (8) Generate random correlated normal data using Cholesky decomposition of the variance-covariance matrix. (9) Transform to cumulative probability distribution function. (10) Transform to beta distribution (using shape parameters; continuous). (11) Rescale to original range. (12) Keep half as control arm, and half as intervention arm, and estimate change from baseline. (13) Multiply intervention change from baseline with self-defined hypothetical relative treatment effect. We assumed correlations on the normalized scale were similar to correlations on the original scale. R code is available on GitHub: https://github.com/ronhandels/synthetic-correlated-data.
RESULTS: The synthetic distributions and means over time were highly similar to those of the original data (visually assessed). The median absolute difference in pairwise correlations between original and synthetic data was 0.02 (95th percentile=0.11, max=0.18).
CONCLUSION: We judged our method sufficiently valid to generate plausible, correlated synthetic results for a hypothetical trial.
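
Steps (12) and (13) of the methods above can be illustrated with a short base R sketch. All values here are hypothetical: the outcome is a simulated score at baseline and 18 months, the arms are assigned by splitting the rows in half, and `multiplier` is an assumed reading of the "relative treatment effect" as the factor applied to the change from baseline.

```r
set.seed(7)

# Hypothetical synthetic outcome (illustrative values only):
# a score at baseline and at 18 months, declining over time.
n        <- 200
baseline <- rnorm(n, mean = 25, sd = 2)
month18  <- baseline - rexp(n, rate = 1 / 3)

# Step 12: keep half as control arm and half as intervention arm,
# and estimate change from baseline.
arm    <- rep(c("control", "intervention"), each = n / 2)
change <- month18 - baseline

# Step 13: multiply the intervention arm's change from baseline by a
# self-defined factor (assumed: 0.75 keeps 75% of the decline).
multiplier <- 0.75
change_adj <- ifelse(arm == "intervention", change * multiplier, change)
month18_adj <- baseline + change_adj

# Mean change from baseline per arm after applying the effect.
print(round(tapply(change_adj, arm, mean), 2))
```

Applying the factor to the change from baseline, rather than to the follow-up value itself, leaves baseline values and the control arm untouched.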
See the file "poster syntehtic data ISPOR.pdf" for details and supporting figures on an application, accepted at the ISPOR conference (www.ispor.org) in Copenhagen, 2023.
- Ron Handels (Maastricht University, Netherlands)
- Linus Jonsson (Karolinska Institutet, Sweden)
- Lars Lau Raket (Lund University, Sweden)
Data used in the "Application in Alzheimer's Disease" were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf