Computationally efficient and stable real-world synthetic emergency room electronic health record data generation: high similarity and privacy preserving diffusion model approach: A retrospective cohort study
Precis Future Med > Volume 8(3); 2024 > Article
Aguirre, Yu, Jung, Yoon, and Cha: Computationally efficient and stable real-world synthetic emergency room electronic health record data generation: high similarity and privacy preserving diffusion model approach: A retrospective cohort study

Abstract

Purpose

This study aimed to develop real-world synthetic electronic health record (EHR) data for emergency departments using computationally efficient and stable diffusion probabilistic models.

Methods

In this study, we compared the performance of diffusion models and state-of-the-art generative adversarial networks (GANs) in terms of statistical similarity, privacy, medical usefulness, and the feasibility of using synthetic data for machine learning purposes.

Results

Our results demonstrate that diffusion models are significantly more computationally efficient than GANs and perform comparably or slightly better in terms of similarity, privacy, and utility. We also found that the data quality of the diffusion model is statistically very similar for both categorical and continuous values and can address class imbalance precisely. Moreover, the usefulness of synthetic data is almost identical to that of real EHR data. Our privacy analysis showed that the synthetic data generated by the diffusion models were private.

Conclusion

These findings have significant implications for improving the efficiency of emergency settings and enabling real-time emergency room data modeling. This demonstrates the potential of diffusion models for generating computationally efficient high-quality synthetic data. The study concluded that diffusion models can generate real-world synthetic EHRs that are computationally efficient, private, and high-quality, and can be used for machine learning purposes in emergency settings.

INTRODUCTION

The use of electronic health records (EHRs) has increased rapidly in recent decades, and many EHR applications have been developed. However, clinical data are highly sensitive and cannot be transferred because of privacy and ethical concerns. This makes it difficult to use clinical data in critical moments, such as disasters and emergency room (ER) settings. To the best of our knowledge, this is the first real-world synthetic EHR data generation using diffusion models for the ER.
Previous research on synthetic data generation has shown promising results; for example, the Wasserstein generative adversarial network (WGAN) [1], WGAN with gradient penalty (WGAN-GP) [2], conditional tabular generative adversarial network (CTGAN) [3], and specific medical generative adversarial networks (GANs) for EHR data generation have been developed with apparent success such as medGAN [4], electronic medical record WGAN (EMR-WGAN) [5], anonymization through data synthesis GAN (ADS-GAN) [6], and the most recent conditional tabular generative adversarial network (CTABGAN) [7] and CTABGAN+ [8]. The CTABGAN+ results [8] show that the synthetic data of CTABGAN+ remarkably resemble the real data for all three types of variables in five different datasets and address class imbalance. The data similarity and analysis utility also showed that CTABGAN+ outperformed state-of-the-art (SOTA) models achieving up to 17% higher accuracy for the five machine learning algorithms.
Diffusion probabilistic models [9,10] are currently becoming the leading paradigm of generative modeling for many important data modalities. As the most prevalent models in the computer vision community [11], diffusion models have shown significant improvements over GANs. Examples of diffusion models that outperform GANs in specific tasks include Guided Language to Image Diffusion for Generation and Editing (GLIDE) [12] for text-to-image generation and Diffwave [13] for audio synthesis. However, few studies have used diffusion models for synthetic tabular data generation; the currently available options are the electronic health record diffusion model (EHRDiff) [14], the medical diffusion model (MedDiff) [15], and the Tabular Denoising Diffusion Probabilistic Model (TabDDPM) [16]. Because the implementation of EHRDiff is not available and MedDiff was recently released without addressing its privacy level, whereas TabDDPM has shown remarkable performance in synthetic data generation for various non-medical tabular datasets, TabDDPM was selected for this study.
In this study, we aimed to compare a diffusion model (TabDDPM) against the SOTA GAN models CTABGAN and CTABGAN+ by investigating their benefits in ER settings. First, we compared the qualitative and quantitative qualities of TabDDPM with those of the GAN-based models. Second, we validated that diffusion models offer the benefit of using a single neural network for more stable and easier training while also preventing common GAN problems, such as mode collapse [17]. Finally, especially for the ER setting, we show that diffusion models are more computationally efficient, which opens new possibilities such as generating dynamic ER patient models in various conditions, including disasters, where modeling synthetic data quickly can be of dramatic importance. Furthermore, this computational efficiency enables real-time simulations with clinical outcomes, such as building a model to predict the type of patients who will arrive in the next hour. GANs have emerged as a powerful technique for generating synthetic data that closely resembles real-world data. However, GANs face several challenges that must be addressed for their effective and efficient use in various applications.

GANs limitations

Mode collapse

Mode collapse is a well-known problem in GANs in which the generator produces a limited set of outputs, often failing to capture the entire distribution of the training data. Consequently, the generated samples exhibit a lack of diversity. This poses a significant concern in the medical domain, as it leads to the creation of completely biased synthetic data that do not accurately represent the original population.

Instability

GANs are notoriously difficult to train because of their instability, which often results in divergence or oscillations during training. This instability is often attributed to the non-convex nature of the GAN objective function. Given that GANs comprise two neural networks, proper training of each network relies heavily on the other. Therefore, addressing instability is a challenging problem that is closely tied to the architecture.

Computational efficiency

GANs are computationally expensive owing to their architecture, two neural networks, and the large number of training iterations required for convergence.

Large amounts of data

GANs require a substantial amount of training data to accurately learn the underlying data distribution. However, in certain scenarios, the objective is to generate synthetic data from a small dataset. This can be particularly valuable in ER settings, where modeling a limited number of patients to synthesize data on a small timescale (e.g., a day or week of patient data in the ER) can prove useful.

GAN improvements

Regarding the mentioned limitations, extensive research has been conducted to alleviate the constraints associated with GANs. For instance, the introduction of WGAN [1] has demonstrated its effectiveness in mitigating mode collapse by incorporating the Wasserstein distance into the training process. Furthermore, the stability and model convergence were enhanced by WGAN-GP [2]. Building on these advancements, novel GAN models have been developed, which are the focus of this study because they represent robust and SOTA approaches to synthetic data generation.
Despite their notable performance improvements, GAN-based models still face various challenges that are difficult to overcome. One such challenge is the computational inefficiency arising from the use of two neural networks, which renders the synthetic data generation process computationally expensive. Although notable strides have been made in enhancing stability through the integration of the Wasserstein distance and architectural refinements, achieving complete stability remains a formidable task because of the non-convex nature of the objective functions of GANs.

Diffusion models

To comprehensively examine the potential advantages of diffusion models over GANs in EHR data generation, it is crucial to address the following key distinctions. Unlike GANs, diffusion models generate synthetic data directly, thus eliminating the need for separate generator and discriminator networks. Consequently, diffusion models circumvent the instability issues commonly associated with GANs, such as mode collapse and oscillations. Furthermore, diffusion models exhibit reduced sensitivity to hyperparameters, simplifying the training process and mitigating the risk of overfitting.
Another notable advantage of diffusion models is their ability to accurately model complex distributions. These models are specifically designed to capture the evolution of a probability distribution over time, enabling them to leverage the full expressive power of neural networks to learn intricate and multimodal distributions. Consequently, diffusion models capture subtle patterns and correlations within the data, that may not be noticed by GANs.
Theoretically, these characteristics imply that diffusion models not only offer enhanced stability compared to GANs but also exhibit superior time efficiency, as they rely on a single network for training purposes.

METHODS

We confirmed that all methods were performed in accordance with the relevant guidelines and regulations. All experimental protocols were approved by the Institutional Review Board (IRB) of the Samsung Medical Center licensing committee (IRB file number 2023-11-166). The requirement for informed consent was waived by the IRB for the 111,228 ER participants and/or their legal guardians.

TabDDPM

For this study, we created tabular synthetic data using TabDDPM [16]. As described in [16], diffusion models [9,10] are a type of generative model that aims to approximate the target distribution by starting with a given parametric distribution, usually a standard Gaussian distribution. At each Markov step, a deep neural network is utilized to learn how to invert the diffusion process using a known Gaussian kernel. The equivalence of diffusion models and score matching has been demonstrated [10]; both represent different perspectives on gradually converting a simple known distribution into a target distribution via an iterative denoising process [18,19]. Recent studies [11,20] have advanced diffusion models by introducing improved learning protocols and model architectures. These advancements have led to the superior performance of DDPMs over GANs in the field of computer vision, demonstrating better generative quality and diversity. The TabDDPM work also demonstrates the successful application of diffusion models to tabular data generation problems.
We performed exhaustive hyperparameter tuning of the model using the parameter-tuning algorithm described in the TabDDPM work. We trained a multilayer perceptron with layer sizes of 512 × 1,024 × 1,024 × 1,024 × 1,024 × 256 for 50,000 steps, with a learning rate of 0.00015 and a batch size of 256. An NVIDIA GeForce RTX 3060 graphics processing unit (GPU) was used to accelerate the training.

Study setting and population

This study used ER data from a tertiary hospital emergency department (ED) in a metropolitan area of South Korea, which serves approximately 80,000 annual ED visits and has a capacity of approximately 2,000 beds. The study population consisted of 111,228 ED patients admitted between 2016 and 2020, excluding patients under 18 years of age, those who died upon arrival, and those with cardiac arrest.

Experiment

Selection of predictors

As described in Table 1, the predictors are age, gender, heart rate, diastolic blood pressure, respiratory rate (RR), systolic blood pressure, oxygen saturation, ER visit count, intensive care unit (ICU) visit count, myocardial infarction, congestive heart failure, peripheral vascular disease, stroke, dementia, chronic pulmonary disease, rheumatoid disease, peptic ulcer disease, diabetes without chronic complications, diabetes with complications, hemiplegia/paraplegia, kidney disease, local tumor/leukemia/lymphoma, metastatic solid tumor, mild liver disease, and severe liver disease.

Outcomes

Our primary outcome was the prediction of death, considering all previously described predictors.

Fair comparison

To ensure a fair and unbiased comparison among the models, it is important to note that they were not trained for an equal number of steps or epochs. This approach introduces a potential source of bias, because different model architectures may require varying amounts of time to converge to an optimal fit. However, to mitigate this concern, extensive fine-tuning was performed, leading to the identification of optimal models for our dataset. The selected models and their respective training durations were as follows: TabDDPM was trained for 50,000 steps, CTABGAN+ underwent 300 epochs (equivalent to 66,600 steps), and CTGAN was trained for 100 epochs (equivalent to 11,100 steps).

Similarity

To examine the degree of similarity between synthetic and real data, a comprehensive analysis of their similarities was conducted. It is important to note that the methodology outlined in [7] was employed to calculate the Wasserstein distance, Jensen-Shannon divergence (JSD), and discrepancy in the pairwise correlation.

Variable correlation

To assess the statistical similarity of each variable, the mean ± standard deviation of the continuous variables was computed, and the count and standard deviation of the binary variables were analyzed. The dataset was split into two groups based on the outcome variable, specifically “death in hospital.” Subsequently, the similarity between the real data and the synthetic data generated by TabDDPM was quantified.
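As an illustration, the per-outcome summary described above can be sketched with pandas; the toy records and column names below are hypothetical stand-ins for the actual cohort:

```python
import pandas as pd

# toy cohort with hypothetical column names (not the study's data)
df = pd.DataFrame({
    "death_in_hospital": [0, 0, 0, 1, 1],
    "heart_rate": [78, 85, 90, 110, 120],
    "diabetes": [0, 1, 0, 1, 1],
})

# continuous variables: mean ± standard deviation per outcome group
continuous = df.groupby("death_in_hospital")["heart_rate"].agg(["mean", "std"])
# binary variables: count and standard deviation per outcome group
binary = df.groupby("death_in_hospital")["diabetes"].agg(["sum", "std"])
```

The same `groupby` pattern extends to every predictor in Table 1, yielding one summary row per outcome class for the real and synthetic datasets to compare side by side.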

Kolmogorov-Smirnov test

The relationship between each pair of samples or datasets was compared using the Kolmogorov-Smirnov (KS) test [21,22], specifically the two-sample KS test. This test examines the likelihood that two sets of samples were derived from the same probability distribution. A significance level of 0.05 was chosen as the threshold, where a KS statistic exceeding this threshold suggests that the data do not originate from the same distribution. To perform this test, both underfitting and overfitting scenarios of the models were considered, rather than only the optimal CTABGAN+ and TabDDPM training epochs or steps. This approach was adopted to verify the stability of TabDDPM’s training.
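A minimal sketch of this two-sample KS check, using SciPy on synthetic Gaussian stand-ins for one continuous variable (the data and threshold interpretation are illustrative, not the study's code):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real = rng.normal(120, 15, 5000)       # e.g., a systolic blood pressure column
synthetic = rng.normal(120, 15, 5000)  # a well-fitted synthetic counterpart

# two-sample KS test: small statistic = distributions are close
stat, p_value = ks_2samp(real, synthetic)
ks_similarity = 1 - stat           # the kind of "KS score" reported per variable
passes = bool(stat <= 0.05)        # threshold of 0.05 on the KS statistic
```

Running this per variable yields the per-column scores that can then be tracked across underfitted, optimal, and overfitted checkpoints.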

Wasserstein distance

The Wasserstein distance [23] was utilized to quantify the degree to which synthetic datasets accurately replicate the distributions of individual continuous/mixed variables in comparison with real datasets.
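A small illustration of this metric, using SciPy on toy Gaussian samples (not the study's data), showing that a synthetic distribution with a shifted mean scores a larger distance:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
real = rng.normal(80, 12, 5000)      # e.g., a heart-rate column
good = rng.normal(80, 12, 5000)      # synthetic sample matching the real distribution
shifted = rng.normal(95, 12, 5000)   # synthetic sample with a shifted mean

d_good = wasserstein_distance(real, good)        # small: distributions align
d_shifted = wasserstein_distance(real, shifted)  # large: poor replication
```

A smaller distance indicates that the synthetic variable more faithfully replicates the real one, which is why the metric is reported per continuous/mixed column.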

Jensen-Shannon divergence

The JSD metric was employed to quantitatively assess the dissimilarity between the probability mass distributions of individual categorical variables in both real and synthetic datasets. However, it is important to note that the JSD metric may not provide accurate evaluations of the quality of continuous variables, particularly in scenarios where there is no overlap between the synthetic and original datasets. Consequently, the aforementioned limitation is effectively addressed by utilizing the Wasserstein distance, as previously discussed.
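A short sketch of the JSD comparison for a single imbalanced binary variable; the probability masses below are made-up examples. Note that SciPy's `jensenshannon` returns the JS distance (the square root of the divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# probability mass of a binary variable (e.g., a 90/10 imbalanced comorbidity flag)
real_pmf = np.array([0.90, 0.10])
good_pmf = np.array([0.88, 0.12])   # synthetic data preserving the imbalance
poor_pmf = np.array([0.50, 0.50])   # synthetic data that lost the imbalance

# base=2 keeps the JS distance within [0, 1]
jsd_good = jensenshannon(real_pmf, good_pmf, base=2)
jsd_poor = jensenshannon(real_pmf, poor_pmf, base=2)
```

A value near 0 indicates nearly identical categorical distributions, while larger values flag variables whose class proportions drifted during generation.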

Differences in pairwise correlation

Feature interaction preservation in synthetic datasets was evaluated by computing separate pairwise correlation matrices for both real and synthetic datasets. The Pearson correlation coefficient is used to quantify the correlation between two continuous variables, whereas the Theil uncertainty coefficient [24] is utilized to measure the correlation between two categorical features. Additionally, the correlation ratio is used to assess the correlation between the categorical and continuous variables.
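Of these three measures, Theil's uncertainty coefficient is the least standard, so a minimal from-scratch sketch may help; this is an illustrative implementation, not the study's code:

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy of a categorical sample (natural log)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def theils_u(x, y):
    # Uncertainty coefficient U(x|y): the fraction of H(x) explained by y, in [0, 1].
    # Asymmetric by design: U(x|y) need not equal U(y|x).
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    n = len(x)
    h_x_given_y = 0.0
    for y_val, y_count in Counter(y).items():
        subset = [xi for xi, yi in zip(x, y) if yi == y_val]
        h_x_given_y += (y_count / n) * entropy(subset)
    return (h_x - h_x_given_y) / h_x

u_same = theils_u(["yes", "no", "yes", "no"], ["yes", "no", "yes", "no"])   # 1.0
u_indep = theils_u(["yes", "no", "yes", "no"], ["a", "a", "b", "b"])        # 0.0
```

Computing such a coefficient for every categorical pair in both datasets, then differencing the two matrices, yields the pairwise-correlation discrepancy reported in the Results.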

Distribution similarity

To evaluate the similarity of the distributions, visual comparisons were performed by plotting the distributions of both real and synthetic data. The analysis involved a comparison of the synthetic data generated by CTGAN, CTABGAN+, and TabDDPM with the real data, encompassing both continuous and binary variables.

Pearson similarity

To calculate the feature correlation within our dataset, the Pearson correlation coefficient [25] was employed. The Pearson correlation coefficient is a statistical metric used to assess the linear correlation between two sets of data, thereby enabling evaluation of the extent to which the correlation between features is preserved. This coefficient was calculated for both the real and synthetic datasets generated by CTGAN, CTABGAN+, and TabDDPM.
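A brief sketch of the Pearson-difference comparison on toy bivariate Gaussian stand-ins (the correlation values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# "real" data with feature correlation 0.60; "synthetic" data with 0.55
real = rng.multivariate_normal([0, 0], [[1.0, 0.60], [0.60, 1.0]], size=2000)
synth = rng.multivariate_normal([0, 0], [[1.0, 0.55], [0.55, 1.0]], size=2000)

corr_real = np.corrcoef(real, rowvar=False)
corr_synth = np.corrcoef(synth, rowvar=False)
diff = np.abs(corr_real - corr_synth)   # "softer" (smaller) differences = better preservation
```

Plotting `diff` as a heatmap over all numerical features produces the kind of comparison shown in Supplementary Fig. 2, where a flatter plot indicates better-preserved correlations.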

Inter-variable relationship

To ensure that inter-variable relationships also reflect medical reality in the synthetic data, a correlation analysis between different variables was performed.

Utility

The training data used is detailed in Table 1. All the variables listed in the table were used as input variables. The target variable for prediction was hospital mortality, described in Table 1 as either alive or dead.

Training on synthetic datasets and testing on real datasets (TSTR) test

In addition to assessing data similarity, a common task for evaluating the usability of synthetic data in machine learning tasks is training on synthetic datasets and testing on real datasets (referred to as TSTR) [26,27].
To establish a baseline, a random selection of 70% of the dataset was used for training, and logistic regression [28], random forest [29], and Extreme Gradient Boosting (XGBoost) [30] models were trained on these data. The remaining 30% was reserved for testing on the real data. This approach, known as training on real and testing on real (TRTR), provides the benchmark that synthetic data aims to achieve. The performance of each model was evaluated using five-fold cross-validation.
Subsequently, the CTABGAN+, TabDDPM, and CTGAN models were trained using 70% of the training set to generate synthetic data. These synthetic datasets were then used to train the same machine learning models employed in the TRTR setting, and testing was conducted on the remaining 30% of real test data (TSTR). Five-fold cross-validation was performed to evaluate the performance of the models. The area under the receiver operating characteristic curve (AUROC) was calculated for each model on the test datasets.
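The TSTR loop can be sketched as follows; the `make_cohort` helper is a hypothetical stand-in that plays the role of both the real test split and a generator's output, so the numbers carry no clinical meaning:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

def make_cohort(n):
    # toy stand-in for the ER cohort: two predictors drive a binary mortality outcome
    X = rng.normal(size=(n, 2))
    prob = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))
    y = (rng.random(n) < prob).astype(int)
    return X, y

X_real_test, y_real_test = make_cohort(3000)  # held-out real 30% split
X_synth, y_synth = make_cohort(3000)          # pretend output of a trained generator

# TSTR: fit on the synthetic data, evaluate AUROC on the real test data
model = LogisticRegression().fit(X_synth, y_synth)
auroc = roc_auc_score(y_real_test, model.predict_proba(X_real_test)[:, 1])
```

If the synthetic data faithfully capture the real distribution, the TSTR AUROC should approach the TRTR baseline obtained from training directly on real data.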

Privacy

Evaluating the privacy of diffusion model-generated synthetic data is crucial, as the main goal is to produce secure and protected data. It is essential to thoroughly assess privacy to guarantee that the original data cannot be traced back from the synthetic data. Overfitted diffusion models pose a significant risk by potentially replicating the original data, making privacy evaluation even more important.
The Euclidean distance was chosen as the metric for calculating distance to closest record (DCR) because of its suitability for measuring proximity in a multidimensional numerical feature space. It provides a straightforward and intuitive way to assess how closely synthetic records approximate the original data distribution. This metric was selected for its simplicity, computational efficiency, and widespread acceptance in data analysis applications.

DCR

The DCR [31,32] metric quantifies the Euclidean distance between a synthetic record and its closest neighboring real record. To calculate the DCR, we used the Euclidean distance metric to measure the proximity between each synthetic record and the closest real record. We utilized all features listed in Table 1 for this analysis. For each synthetic record, we computed the distance to every real record. The distance between a synthetic record $s$ and a real record $r$, with features $(s_1, s_2, \ldots, s_n)$ and $(r_1, r_2, \ldots, r_n)$ respectively, is calculated as:
$$d(s, r) = \sqrt{\sum_{i=1}^{n} (s_i - r_i)^2}.$$
For each synthetic record, we identified the real record with the smallest Euclidean distance. This smallest distance is recorded as the DCR for that synthetic record. To provide a robust estimate of the privacy risk, the 5th percentile of the DCR values across all synthetic records was calculated. Higher DCR values indicate a lower risk of privacy breach, as it means that synthetic records are far from real records.
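A compact sketch of the DCR computation on random stand-in matrices (feature values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
real = rng.normal(size=(500, 5))       # 500 real records, 5 numeric features
synthetic = rng.normal(size=(200, 5))  # 200 synthetic records

# pairwise Euclidean distances via broadcasting, shape (n_synthetic, n_real)
dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
dcr = dists.min(axis=1)              # distance to the closest real record
dcr_5th = np.percentile(dcr, 5)      # robust (5th percentile) privacy estimate
```

An exact copy of a real record would yield a DCR of 0, which is why higher values of the 5th-percentile summary indicate lower privacy risk.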

NNDR

The nearest neighbor distance ratio (NNDR) measures the ratio between the Euclidean distance of the closest real neighbor and the second closest real neighbor to each synthetic record. As with DCR, we used all features listed in Table 1. For each synthetic record, we calculated the Euclidean distance to every real record. For each synthetic record, we identified the real record with the smallest distance (closest neighbor) and the real record with the second smallest distance (second closest neighbor). The NNDR is then calculated as the ratio of the distance to the closest neighbor to the distance to the second closest neighbor:
$$\mathrm{NNDR} = \frac{d(s, r_{\text{closest}})}{d(s, r_{\text{second closest}})}.$$
This ratio lies within the range 0 to 1, where higher values indicate better privacy, as they suggest that synthetic records are not closely mimicking any specific real record. Similarly to DCR, we computed the 5th percentile of the NNDR values to provide a robust estimate of privacy risk. Lower NNDR values indicate a higher risk of revealing sensitive information from the closest real record.
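The NNDR computation extends the same distance matrix with the second-nearest neighbor; again a sketch on random stand-in matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
real = rng.normal(size=(500, 5))
synthetic = rng.normal(size=(200, 5))

dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
# partial sort so columns 0 and 1 hold the closest and second-closest distances
two_nearest = np.partition(dists, 1, axis=1)[:, :2]
nndr = two_nearest[:, 0] / two_nearest[:, 1]   # ratio in (0, 1]
nndr_5th = np.percentile(nndr, 5)              # robust privacy estimate
```

An NNDR near 1 means the closest and second-closest real neighbors are about equally far away, so the synthetic record is not singling out any one real patient.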

Membership inference attack

Membership inference attack (MIA) [33] examines the likelihood of data being a member of the training data used to generate synthetic data. In this study, MIA serves as the primary metric for privacy assessment, as it provides a means to evaluate the degree of privacy relative to different models. The implemented procedure follows the description provided in [27]. While passing this test does not guarantee absolute safety against attacks, MIA serves as evidence of privacy against a common adversarial attack. The threshold for good privacy is set at 50%, as a random classifier would guess correctly half the time, and any value surpassing this threshold indicates some degree of data leakage.
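As a loose illustration only (the study follows the procedure in [27]), a simple distance-threshold attacker can be sketched as below; with synthetic data independent of the training records, the attacker should land near the 50% baseline. All data and the attack rule here are assumptions, not the study's implementation:

```python
import numpy as np

rng = np.random.default_rng(6)
train = rng.normal(size=(300, 4))      # records used to fit the generator
holdout = rng.normal(size=(300, 4))    # records the generator never saw
synthetic = rng.normal(size=(600, 4))  # stand-in for well-behaved synthetic data

def min_dist(records, reference):
    # each record's Euclidean distance to its nearest synthetic neighbor
    d = np.linalg.norm(records[:, None, :] - reference[None, :, :], axis=2)
    return d.min(axis=1)

d_train, d_holdout = min_dist(train, synthetic), min_dist(holdout, synthetic)
threshold = np.median(np.concatenate([d_train, d_holdout]))
# attacker guesses "member" when a record sits unusually close to the synthetic data
accuracy = ((d_train < threshold).mean() + (d_holdout >= threshold).mean()) / 2
```

Accuracy meaningfully above 0.5 would mean training members sit systematically closer to the synthetic data than non-members, i.e., some degree of leakage.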

Time efficiency

The time efficiencies of the models were evaluated to assess their computational performances. Specifically, CTABGAN+ and TabDDPM were compared, as they demonstrated similar and favorable performances across most metrics. However, it is important to note that a limitation of this comparison was the inability to utilize a GPU for CTGAN, which may have resulted in an unfair comparison in terms of time efficiency.

RESULTS

Similarity

Variable correlation (analytical)

Table 1 presents a description of the similarities between the real and synthetic datasets generated by the diffusion model, TabDDPM. Notably, the continuous variables exhibit a high degree of fidelity to real data, as evidenced by their realistic mean and standard deviation values. Furthermore, the distribution of binary variables, while exhibiting some differences, remained highly comparable across both mortality classes in hospitalized patients.
Notably, CTGAN did not demonstrate adequate proficiency in addressing the issue of imbalanced binary variables. In contrast, both TabDDPM and CTABGAN+ exhibited exceptional performance, even in the face of imbalanced classes.

Distribution similarity

Fig. 1 presents a comparative analysis of the continuous variables for CTGAN, CTABGAN+, and TabDDPM. The results demonstrate that CTABGAN+ and TabDDPM perform similarly to each other, and both outperform CTGAN. Notably, both models demonstrate exceptional proficiency in addressing complex distributions, with TabDDPM exhibiting a slight edge in capturing the sharpness of the distributions. Further examples are presented in Supplementary Fig. 1.
Fig. 2 illustrates a comparison of the binary categorical values for the minority class, revealing that while CTGAN displays inadequate performance, both CTABGAN+ and TabDDPM yield comparable results to the original data. Nevertheless, TabDDPM exhibited slightly more precise results in addressing the imbalance than CTABGAN+.

Pearson similarity

Supplementary Fig. 2 shows the pairwise Pearson correlations among the numerical features of the dataset. Pearson correlation values were obtained for each variable in both the original dataset and corresponding synthetic data, and the plots represent the differences between these values. Thus, a softer plot indicates higher similarity in the correlations among the variables.
The results showed that CTGAN performed the worst, whereas CTABGAN+ displayed the best performance in most cases, with TabDDPM performing well in the majority of cases, except for two variables representing count values, number of ER visits (count ER) and number of ICU stays (count ICU). Therefore, TabDDPM demonstrates exceptional performance in generating mixed data, except for count values. Nevertheless, CTABGAN+ exhibits a slight advantage in this respect.

Kolmogorov-Smirnov test (analytical)

In reference to the KS similarity test, Supplementary Fig. 3 illustrates that both CTABGAN+ and TabDDPM display scores above 0.95, thereby satisfying the KS test criteria. Conversely, CTGAN’s scores were insufficient for acceptance. In addition, the instability of CTABGAN+ is evident in the plot. The optimal models for TabDDPM and CTABGAN+ were found to be 50,000 steps and 300 epochs, respectively. However, when underfitting occurred in CTABGAN+ (150 epochs), the KS scores spiked, leading to rejection of the KS test. The same trend was observed when overfitting occurred (450 epochs). In contrast, TabDDPM exhibited lower variability, with the KS test being accepted even when overfitting (75,000 steps) occurred, and underfitting (25,000 steps) was within the threshold of acceptance. These findings highlight the significance of proper model fitting, because inadequate tuning may lead to instability in the results.

Wasserstein distance, Jensen-Shannon divergence, and difference in pairwise correlation (analytical)

Table 2 presents the results of the three metrics used to evaluate the similarities between TabDDPM (50,000 steps), CTABGAN+ (300 epochs), and CTGAN. For all metrics, a smaller distance indicates greater similarity to the real data. The results indicated that both TabDDPM and CTABGAN+ significantly outperformed CTGAN, with TabDDPM displaying superior performance in terms of Wasserstein distance and JSD metrics.
Regarding the difference in pairwise correlation, both TabDDPM and CTABGAN+ again surpassed CTGAN; however, this time, CTABGAN+ exhibited superior performance compared to TabDDPM. This phenomenon may be attributed to the differences in count values, as demonstrated in the subsequent Pearson correlation analysis.

Inter-variable relationship

Supplementary Fig. 4 presents the comparison between real and synthetic data for inter-variable correlations: systolic and diastolic blood pressure by heart rate, and RR and oxygen saturation (SpO2) by age. The results show that the synthetic data maintain the relationships between relevant clinical variables, resembling the correlations of the real data in each of the ranges with minimal deviation.

Utility

TSTR test

The utility of synthetic data generation is key because it allows training on synthetic data for various purposes. Table 3 presents the results of the five-fold TSTR using logistic regression, random forest, and XGBoost. TabDDPM and CTABGAN+ outperform CTGAN, and TabDDPM performs better than CTABGAN+ for random forest and XGBoost; however, it exhibits worse performance for logistic regression. Again, the impact of overfitting (450 epochs) on CTABGAN+ can be seen in the area under the curve (AUC) score, whereas TabDDPM (75,000 steps) appears to be more stable. These results suggest the instability of GAN models. The poor performance of logistic regression may be attributed to its assumption of a linear relationship between variables, which is not always true, and its tendency to underperform on imbalanced datasets, because the algorithm is optimized to minimize the overall error rate, which can lead to poor performance in the minority class.

Privacy

DCR and NNDR

In both cases, higher DCR and NNDR values between the real and synthetic data indicate greater privacy. Therefore, Table 4 suggests that CTGAN is the most private model. We observed a common trade-off between privacy and similarity. As discussed earlier, CTGAN does not perform well in terms of similarity or utility, leading to better privacy. Comparing CTABGAN+ and TabDDPM in both tests, TabDDPM was found to be slightly more private than CTABGAN+, although this difference was not significant.

Membership inference attack

Table 4 displays the results of the five-fold MIA test. For the data to be secure under this type of attack, the MIA score should be close to 0.5. We can see that in all cases the test passes and all the models are secure.

Time efficiency

Table 3 presents the time efficiency comparison between CTABGAN+ and TabDDPM. CTABGAN+ required an average of 2 hours and 41 minutes to train for 300 epochs, whereas TabDDPM required 4 minutes. Considering that both algorithms perform similarly on most metrics, the difference in time efficiency is remarkable: CTABGAN+ required 36 times more time to perform the same task.

DISCUSSION

The results of our study demonstrate that diffusion models exhibit performances comparable to those of SOTA GANs in terms of similarity, privacy, and utility. Analytical and graphical evidence supports the notion that diffusion models perform similarly, and in some cases slightly better, than GANs. Privacy analysis, measured by DCR and NNDR values, indicated a similar level of privacy between SOTA models, with all models passing the MIA test. In terms of utility, the diffusion model outperformed GANs in all aspects except logistic regression.
Although our study confirmed the competitiveness of diffusion models with respect to GANs, their clear advantages lie in their computational efficiency and stability. Specifically, our findings reveal that the TabDDPM diffusion model offers significant benefits in terms of training efficiency, being 36 times faster than the SOTA CTABGAN+ model. It is noteworthy that the efficiency gains of the diffusion models become even more relevant as the dataset size increases. By contrast, GANs are less practical for large datasets because of the substantial difference in processing time. Additionally, the requirement for repeated data generation iterations further exacerbates the discrepancy between diffusion models and GANs. The computational power required to execute these iterations places GANs at a disadvantage. Moreover, when considering multicenter synthetic data generation, these challenges are amplified because data generation iterations affect other centers, potentially necessitating updates to their datasets and further extending the overall time required. Thus, the discrepancy in the processing time can be significantly magnified, depending on the objectives of the study.
Our findings are consistent with those of previous studies in other domains, which demonstrated that diffusion models outperform GANs in terms of stability, efficiency, and data quality. In the context of synthetic data generation in ER settings, this study reinforces that advantage: although previous SOTA GANs can generate high-quality synthetic data, diffusion models are markedly more efficient and stable. The stability observed in diffusion models also suggests their potential for superior performance on small datasets, mitigating issues such as mode collapse that are commonly associated with GANs.
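For readers unfamiliar with the mechanism, the forward half of a denoising diffusion probabilistic model (the family TabDDPM belongs to) can be sketched in a few lines. This is a generic DDPM illustration with an assumed linear beta schedule, not the study's implementation:

```python
import numpy as np

# Forward-noising kernel of a DDPM (generic illustration, not the study's code):
#   q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)
rng = np.random.default_rng(0)
T = 1_000
betas = np.linspace(1e-4, 0.02, T)     # per-step noise variances (assumed schedule)
alphas_bar = np.cumprod(1.0 - betas)   # fraction of original signal kept at step t

def forward_noise(x0, t):
    """Sample x_t directly from x_0 via the closed-form forward kernel."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.standard_normal(5)            # one standardized "patient row"
x_mid, x_end = forward_noise(x0, T // 2), forward_noise(x0, T - 1)
# The reverse (denoising) network is trained to undo this corruption; by the
# final step essentially no signal remains, so generation starts from noise.
print(f"signal kept: t=500 -> {alphas_bar[T // 2]:.3f}, t=999 -> {alphas_bar[-1]:.1e}")
```

The stability noted above comes partly from this setup: training reduces to a simple denoising regression at randomly sampled steps, with no adversarial min-max game to collapse.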
However, several limitations should be acknowledged. Although we conducted an in-depth benchmark, including a test with a variational autoencoder, we excluded it from our study because of its limited ability to generate minority class samples. In addition, the exclusion of the training time of CTGAN, which was not trained using a GPU, was necessary to ensure a fair comparison. To further strengthen the evidence on privacy, future research could explore additional privacy attacks, such as attribute inference or re-identification attacks.
The implications of our findings are highly relevant in ER settings, where the efficiency and stability of diffusion models can facilitate the generation of dynamic patient models under various conditions, including disaster scenarios, in which fast and accurate data modeling is critical. For instance, in critical situations such as the recent coronavirus disease 2019 (COVID-19) pandemic, timely sharing of initial data plays a crucial role in assessing the situation and responding effectively. Time was essential, as sharing data rapidly allowed for a comprehensive understanding of the disease outbreak and facilitated prompt decision-making. In this context, the efficiency and stability of diffusion models have become even more important, as they enable the generation of dynamic patient models in real-time. This capability becomes particularly valuable when hospital computers are offline and lack powerful GPU capabilities, because diffusion models can assist in synthetic data generation under such resource constraints. Furthermore, the stability exhibited by diffusion models suggests their utility in small dataset generation, enabling the synthesis of small batches of patient data for real-time analyses. Overall, the ability of diffusion models to support fast and accurate data modeling is vital for addressing critical events, such as disease outbreaks, in the ER setting.
In conclusion, this study compared the performance of diffusion models with SOTA GANs in synthetic data generation, with a focus on ER settings. The results showed that diffusion models are comparable to GANs in terms of similarity, privacy, and utility, with the advantage of being more computationally efficient and stable. This study also highlights the potential relevance of diffusion models in generating dynamic ER patient models and real-time simulations of clinical outcomes. Future research should explore differentially private diffusion models and adapt them to generate more relevant synthetic data for hospitals with time constraints.

Supplementary materials

Supplementary Fig. 1.
Comparison of correlation values between real and synthetic data with tabular denoising diffusion probabilistic model (TabDDPM), conditional tabular generative adversarial network + (CTABGAN+), and conditional tabular generative adversarial network (CTGAN). The described variables are (A, B, C) age, (D, E, F) heart rate (HR), (G, H, I) diastolic blood pressure (DBP), and (J, K, L) systolic blood pressure (SBP).
pfm-2024-00030-Supplementary-Fig-1.pdf
Supplementary Fig. 2.
Comparison of correlation values between real and synthetic data with (A) conditional tabular generative adversarial network (CTGAN), (B) tabular denoising diffusion probabilistic model (TabDDPM), and (C) conditional tabular generative adversarial network + (CTABGAN+).
pfm-2024-00030-Supplementary-Fig-2.pdf
Supplementary Fig. 3.
Kolmogorov-Smirnov (KS) statistics for overfitted and underfitted versions of both models. The underfitted models are tabular denoising diffusion probabilistic model (TabDDPM) with 25,000 steps and conditional tabular generative adversarial network + (CTABGAN+) with 150 epochs; the overfitted models are TabDDPM with 75,000 steps and CTABGAN+ with 450 epochs. The results show that when CTABGAN+ is overfitted or underfitted, the KS statistic hypothesis is not accepted, and its value is markedly lower than that of TabDDPM. The table shows the inverse KS statistic, where 0.95 is the threshold. CTGAN, conditional tabular generative adversarial network.
pfm-2024-00030-Supplementary-Fig-3.pdf
Supplementary Fig. 4.
(A, B, C, D) Correlation between systolic blood pressure (SBP) and diastolic blood pressure (DBP) by heart rate on the top, and between respiratory rate (RR) and oxygen saturation (SpO2) by age below. The results show that the synthetic data maintain the relationships between relevant clinical variables, closely matching the correlations of the real data in each range with minimal deviation.
pfm-2024-00030-Supplementary-Fig-4.pdf

CONFLICTS OF INTEREST

Won Chul Cha has been an editorial board member of Precision and Future Medicine since January 2023. He was not involved in the review process of this original article. Jinsung Yoon has no conflict of interest with Google.

Notes

AUTHOR CONTRIBUTIONS

Conception or design: JA, JYY, KHJ, JY, WCC.

Acquisition, analysis, or interpretation of data: JA, JYY, KHJ, JY, WCC.

Drafting the work or revising: JA, JYY, KHJ, JY, WCC.

Final approval of the manuscript: JA, JYY, KHJ, JY, WCC.

ACKNOWLEDGEMENTS

This study was supported by a Samsung Medical Center grant #OTA2101731.

Fig. 1.
Comparison of continuous variables among (A) tabular denoising diffusion probabilistic model (TabDDPM), (B) conditional tabular generative adversarial network + (CTABGAN+), and (C) conditional tabular generative adversarial network (CTGAN). The blue distribution refers to real data and the red to the new or synthetic data. Heart rate (HR) continuous variable comparison.
pfm-2024-00030f1.jpg
Fig. 2.
Comparison of categorical (binary minority class) variables between tabular denoising diffusion probabilistic model (TabDDPM), conditional tabular generative adversarial network + (CTABGAN+), and conditional tabular generative adversarial network (CTGAN). The blue distribution refers to original data and the red one to the new or synthetic data. MI, myocardial infarction; CHF, congestive heart failure; PVD, peripheral vascular disease; STR, stroke; DEM, dementia; CPD, chronic pulmonary disease; RD, rheumatoid disease; PUD, peptic ulcer disease; DM WOC, diabetes without chronic complication; DM C, diabetes with complication; HEMI, hemiplegia/paraplegia; RENAL, kidney disease; TU LE, local tumor/leukemia/lymphoma; MST, metastatic solid tumor; MLD, mild liver disease; SLD, severe liver disease; DTH IN, death in hospital.
pfm-2024-00030f2.jpg
Table 1.
Comparison between real data and synthetic data values generated by TabDDPM, with Kolmogorov-Smirnov statistics for each variable
Variable Real (Overall / Alive / Dead) Synthetic TabDDPM (Overall / Alive / Dead) KS
No. of patients 111,228 105,086 6,142 111,228 104,988 6,240
Age (yr) 57.4±16.5 57.0±16.5 62.9±14.0 57.1±16.3 56.8±16.4 62.5±13.9 0.0101
Male 57,535 (51.7) 53,722 (51.1) 3,813 (62.1) 57,573 (51.8) 53,644 (51.1) 3,929 (63.0) 0.0003
Female 53,693 (48.3) 51,364 (48.9) 2,329 (37.9) 53,655 (48.2) 51,344 (48.9) 2,311 (37.0) 0.0003
HR (bpm) 91.4±21.7 90.8±21.2 102.7±26.8 91.2±21.9 90.5±21.4 103.0±27.3 0.0063
DBP (mm Hg) 76.9±16.5 77.3±16.1 70.9±20.4 77.1±16.4 77.5±16.0 70.3±20.8 0.0064
RR (bpm) 20.0±3.40 19.9±3.20 21.6±5.90 20.0±3.30 19.9±3.00 21.5±5.8 0.0117
SBP (mm Hg) 126.6±27.5 127.3±26.9 115.5±34.2 126.6±27.3 127.4±26.6 114.2±34.8 0.0064
SpO2 (%) 96.9±4.20 97.0±3.80 93.9±8.00 96.9±4.30 97.1±3.90 94.0±8.20 0.0131
Count ER 0.8±1.80 0.7±1.70 1.6±2.40 0.7±3.20 0.7±2.90 1.9±5.50 0.0185
Count ICU 0.2±0.80 0.2±0.70 0.4±1.00 0.2±0.90 0.2±0.80 0.4±1.40 0.0155
Count surgery 0.2±0.60 0.2±0.60 0.3±0.70 0.2±0.50 0.2±0.50 0.3±0.60 0.0106
MI 1,837 (1.7) 1,729 (1.6) 108 (1.8) 1,449 (1.3) 1,377 (1.3) 72 (1.2) 0.0034
CHF 3,929 (3.5) 3,652 (3.5) 277 (4.5) 3,228 (2.9) 3,010 (2.9) 218 (3.5) 0.0063
PVD 2,378 (2.1) 2,240 (2.1) 138 (2.2) 1,817 (1.6) 1,711 (1.6) 106 (1.7) 0.0050
STR 8,668 (7.8) 8,197 (7.8) 471 (7.7) 7,699 (6.9) 7,311 (7.0) 388 (6.2) 0.0087
DEM 2,804 (2.5) 2,598 (2.5) 206 (3.4) 2,365 (2.1) 2,242 (2.1) 123 (2.0) 0.0039
CPD 6,397 (5.8) 5,932 (5.6) 465 (7.6) 5,753 (5.2) 5,318 (5.1) 435 (7.0) 0.0057
RD 1,378 (1.2) 1,296 (1.2) 82 (1.3) 1,080 (1.0) 1,027 (1.0) 53 (0.8) 0.0026
PUD 4,666 (4.2) 4,380 (4.2) 286 (4.7) 3,890 (3.5) 3,628 (3.5) 262 (4.2) 0.0069
DM WOC 11,826 (10.6) 11,056 (10.5) 770 (12.5) 10,642 (9.6) 9,895 (9.4) 747 (12.0) 0.0106
DM C 4,088 (3.7) 3,897 (3.7) 191 (3.1) 3,286 (3.0) 3,130 (3.0) 156 (2.5) 0.0072
HEMI 606 (0.5) 567 (0.5) 39 (0.6) 464 (0.4) 440 (0.4) 24 (0.4) 0.0012
RENAL 6,949 (6.2) 6,626 (6.3) 323 (5.3) 6,058 (5.4) 5,811 (5.5) 247 (4.0) 0.0080
TU LE 38,971 (35.0) 34,901 (33.2) 4,070 (66.3) 38,454 (34.6) 33,918 (32.3) 4,536 (72.7) 0.0046
MST 5,521 (5.0) 4,812 (4.6) 709 (11.5) 5,016 (4.5) 4,211 (4.0) 805 (12.9) 0.0045
MLD 10,103 (9.1) 9,228 (8.8) 875 (14.2) 9,223 (8.3) 8,297 (7.9) 926 (14.8) 0.0079
SLD 792 (0.7) 717 (0.7) 75 (1.2) 580 (0.5) 515 (0.5) 65 (1.0) 0.0019

Values are presented as mean±standard deviation or number (%).

TabDDPM, tabular denoising diffusion probabilistic model; KS, Kolmogorov-Smirnov; HR, heart rate; DBP, diastolic blood pressure; RR, respiratory rate; SBP, systolic blood pressure; SpO2, oxygen saturation; count ER, count of emergency room visit; count ICU, count of intensive care visit; count surgery, count of surgery visit; MI, myocardial infarction; CHF, congestive heart failure; PVD, peripheral vascular disease; STR, stroke; DEM, dementia; CPD, chronic pulmonary disease; RD, rheumatoid disease; PUD, peptic ulcer disease; DM WOC, diabetes without chronic complication; DM C, diabetes with complication; HEMI, hemiplegia/paraplegia; RENAL, kidney disease; TU LE, local tumor/leukemia/lymphoma; MST, metastatic solid tumor; MLD, mild liver disease; SLD, severe liver disease.
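The per-variable KS statistics in the rightmost column of Table 1 can be reproduced with SciPy's two-sample test. The sketch below uses simulated normal columns with the age mean/SD taken from Table 1 as an illustration, not the study's actual data:

```python
import numpy as np
from scipy.stats import ks_2samp

# Simulated stand-ins for one real and one synthetic column
# (mean/SD for age borrowed from Table 1; not the study's data).
rng = np.random.default_rng(42)
real_age = rng.normal(57.4, 16.5, 10_000)
synth_age = rng.normal(57.1, 16.3, 10_000)

# Two-sample Kolmogorov-Smirnov: max gap between the two empirical CDFs.
stat, p = ks_2samp(real_age, synth_age)
# A small statistic (Table 1 reports 0.0101 for age) means the
# distributions are close.
print(f"KS statistic = {stat:.4f}, p = {p:.3f}")
```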

Table 2.
Comparison of average WD, JSD, and correlation distance for three methods
Average WD (Cont.) Average JSD (Cat.) Diff. Pair-Wise distance
TabDDPM 0.001472 0.011608 0.803866
CTABGAN+ 0.002614 0.015443 0.452392
CTGAN 0.025507 0.08734 1.880659

TabDDPM outperforms the other methods on the WD and JSD metrics, while CTABGAN+ performs best on pair-wise correlation distance.

WD, Wasserstein distance; JSD, Jensen-Shannon divergence; Cont., continuous; Cat., categorical; Diff., difference; TabDDPM, tabular denoising diffusion probabilistic model; CTABGAN+, conditional tabular generative adversarial network +; CTGAN, conditional tabular generative adversarial network.
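Both Table 2 metrics are standard and easy to compute with SciPy. The sketch below is our assumption of the computation on simulated stand-in columns (illustrative values borrowed from Table 1); whether the study scales WD this way, or reports the JS distance or its square, is not stated:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
real_hr = rng.normal(91.4, 21.7, 5_000)    # simulated "real" heart rate
synth_hr = rng.normal(91.2, 21.9, 5_000)   # simulated "synthetic" heart rate

# Continuous variable: Wasserstein distance on jointly min-max scaled values,
# so distances are comparable across variables with different units.
lo = min(real_hr.min(), synth_hr.min())
hi = max(real_hr.max(), synth_hr.max())
wd = wasserstein_distance((real_hr - lo) / (hi - lo), (synth_hr - lo) / (hi - lo))

# Categorical variable: Jensen-Shannon distance between class frequencies
# (male/female shares from Table 1); square it to get the divergence.
real_sex, synth_sex = [0.517, 0.483], [0.518, 0.482]
js = jensenshannon(real_sex, synth_sex, base=2)

print(f"WD = {wd:.4f}, JS = {js:.6f}")
```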

Table 3.
AUC and computational efficiency analysis
AUC value Logistic regression Random forest XGBoost Average time Range of time
Real 0.790±0.004 0.801±0.007 0.814±0.004 NA NA
TabDDPM 50,000 steps 0.748±0.009 0.788±0.004 0.800±0.000 0:04:36 0:04:29–0:04:49
CTABGAN+ 300 epoch 0.786±0.004 0.778±0.007 0.790±0.006 2:41:46 2:41:07–2:43:21
CTGAN 100 epoch 0.634±0.078 0.650±0.031 0.532±0.060 NA NA

Values are presented as mean±standard deviation. The table shows the AUC of logistic regression, random forest, and XGBoost, along with the standard deviation across five folds. TabDDPM showed better utility in most cases, except for logistic regression, which may have been more sensitive to subtle changes in the data. The time comparison on the right side of the table shows that TabDDPM is on average 36 times faster than CTABGAN+. CTGAN training times were excluded because the model could not be trained with graphics processing units (GPU), so the comparison would have been biased.

AUC, area under the curve; XGBoost, Extreme Gradient Boosting; NA, not applicable; TabDDPM, tabular denoising diffusion probabilistic model; CTABGAN+, conditional tabular generative adversarial network +; CTGAN, conditional tabular generative adversarial network.
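Utility evaluations like Table 3 commonly follow a train-on-synthetic, test-on-real (TSTR) protocol; whether the authors used exactly this split is our assumption. A minimal sketch with simulated placeholder data and a logistic regression classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_cohort(n, shift=0.0):
    """Simulated cohort: 5 features, binary outcome from a logistic model."""
    X = rng.normal(0, 1, (n, 5))
    logits = X @ np.array([1.2, -0.8, 0.5, 0.0, 0.3]) + shift
    y = rng.uniform(size=n) < 1 / (1 + np.exp(-logits))
    return X, y.astype(int)

X_real, y_real = make_cohort(5_000)
X_synth, y_synth = make_cohort(5_000, shift=0.05)  # stand-in "synthetic" data

# TSTR: fit on synthetic, score on real; a high AUC means the synthetic
# data preserved the feature-outcome relationships.
clf = LogisticRegression(max_iter=1_000).fit(X_synth, y_synth)
auc = roc_auc_score(y_real, clf.predict_proba(X_real)[:, 1])
print(f"TSTR AUC = {auc:.3f}")
```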

Table 4.
Comparison of DCR and NNDR between real and synthetic data using three methods
Variable DCR, 5th percentile (Real vs. Synthetic / Real / Synthetic) NNDR, 5th percentile (Real vs. Synthetic / Real / Synthetic) MIA
TabDDPM 0.300490 0.219638 0.229497 0.629353 0.553303 0.570862 0.467±0.045
CTABGAN+ 0.279647 0.219638 0.256688 0.624098 0.553303 0.559522 0.483±0.039
CTGAN 0.966723 0.219638 0.54074 0.812043 0.553303 0.586514 0.479±0.025

Values are presented as mean±standard deviation. Excluding CTGAN, TabDDPM shows the best performance. A higher Real vs. Synthetic distance indicates greater privacy. MIA values are also included; all models pass the test, as their values are below the 0.5 threshold, indicating they are not vulnerable to the attack.

DCR, distance to closest record; NNDR, nearest neighbor distance ratio; MIA, membership inference attack; TabDDPM, tabular denoising diffusion probabilistic model; CTABGAN+, conditional tabular generative adversarial network +; CTGAN, conditional tabular generative adversarial network.
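The DCR and NNDR columns of Table 4 can be sketched with a nearest-neighbor search: for each synthetic row, DCR is the distance to its closest real row, and NNDR is the ratio of the closest to the second-closest distance; the 5th percentile summarizes worst-case proximity. The data below are simulated, and the study's exact feature scaling (and the real-vs.-real baseline split) is our assumption:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Simulated standardized records, 8 features each (not the study's data).
rng = np.random.default_rng(7)
real = rng.normal(0, 1, (2_000, 8))
synth = rng.normal(0, 1, (2_000, 8))

# For every synthetic row, find its two nearest real rows.
nn = NearestNeighbors(n_neighbors=2).fit(real)
dists, _ = nn.kneighbors(synth)            # columns: 1st and 2nd neighbor
dcr = np.percentile(dists[:, 0], 5)        # DCR, 5th percentile
nndr = np.percentile(dists[:, 0] / dists[:, 1], 5)  # NNDR, 5th percentile

# Larger values mean synthetic rows do not sit on top of real rows,
# i.e., no near-copies of real patients leaked into the synthetic set.
print(f"DCR(5%) = {dcr:.3f}, NNDR(5%) = {nndr:.3f}")
```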

REFERENCES

1. Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. Proc Mach Learn Res 2017;70:214–23.

2. Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved training of wasserstein GANs. Adv Neural Inf Process Syst 2017;30:1–11. https://papers.nips.cc/paper_files/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf.

3. Xu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K. Modeling tabular data using conditional GAN. Adv Neural Inf Process Syst 2019;32:1–11. https://papers.nips.cc/paper_files/paper/2019/file/254ed7d2de3b23ab10936522dd547b78-Paper.pdf.

4. Choi E, Biswal S, Malin B, Duke J, Stewart WF, Sun J. Generating multi-label discrete patient records using generative adversarial networks. Proc Mach Learn Res 2017;68:286–305.

5. Zhang Z, Yan C, Mesa DA, Sun J, Malin BA. Ensuring electronic medical record simulation through better training, modeling, and evaluation. J Am Med Inform Assoc 2020;27:99–108.
6. Yoon J, Drumright LN, van der Schaar M. Anonymization through data synthesis using generative adversarial networks (ADS-GAN). IEEE J Biomed Health Inform 2020;24:2378–88.
7. Zhao Z, Kunar A, Birke R, Chen LY. CTAB-GAN: effective table data synthesizing. Proc Mach Learn Res 2021;157:97–112.

8. Zhao Z, Kunar A, Birke R, Chen LY. CTAB-GAN+: enhancing tabular data synthesis [Preprint]. Posted 2022 Apr 1. arXiv 2204.00401. https://doi.org/10.48550/arXiv.2204.00401.

9. Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics. Proc Mach Learn Res 2015;37:2256–65.

10. Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Adv Neural Inf Process Syst 2020;33:1–12. https://papers.nips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.

11. Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis. Adv Neural Inf Process Syst 2021;34:1–15. https://papers.nips.cc/paper_files/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf.

12. Nichol A, Dhariwal P, Ramesh A, Shyam P, Mishkin P, McGrew B, et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models [Preprint]. Posted 2021 Dec 20. arXiv 2112.10741. https://doi.org/10.48550/arXiv.2112.10741.

13. Kong Z, Ping W, Huang J, Zhao K, Catanzaro B. Diffwave: a versatile diffusion model for audio synthesis [Preprint]. Posted 2020 Sep 21. arXiv 2009.09761. https://doi.org/10.48550/arXiv.2009.09761.

14. Yuan H, Zhou S, Yu S. EHRDiff: exploring realistic EHR synthesis with diffusion models [Preprint]. Posted 2023 Mar 10. arXiv 2303.05656. https://doi.org/10.48550/arXiv.2303.05656.

15. He H, Zhao S, Xi Y, Ho JC. MedDiff: generating electronic health records using accelerated denoising diffusion model [Preprint]. Posted 2023 Feb 8. arXiv 2302.04355. https://doi.org/10.48550/arXiv.2302.04355.

16. Kotelnikov A, Baranchuk D, Rubachev I, Babenko A. TabDDPM: modelling tabular data with diffusion models [Preprint]. Posted 2022 Sep 30. arXiv 2209.15421. https://doi.org/10.48550/arXiv.2209.15421.

17. Thanh-Tung H, Tran T. Catastrophic forgetting and mode collapse in GANs. 2020 International Joint Conference on Neural Networks (IJCNN) 2020 Jul 19-24; Glasgow, UK. IEEE; 2020. p. 1-10. https://ieeexplore.ieee.org/document/9207181.
18. Song Y, Ermon S. Generative modeling by estimating gradients of the data distribution. Adv Neural Inf Process Syst 2019;32:1–13. https://papers.nips.cc/paper_files/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf.

19. Song Y, Ermon S. Improved techniques for training score-based generative models. Adv Neural Inf Process Syst 2020;33:1–11. https://papers.nips.cc/paper_files/paper/2020/file/92c3b916311a5517d9290576e3ea37ad-Paper.pdf.

20. Nichol A, Dhariwal P. Improved denoising diffusion probabilistic models. Proc Mach Learn Res 2021;139:8162–71.

21. Kolmogorov AN. Sulla determinazione empirica di una legge di distribuzione. Giorn Dell’inst Ital Degli Att 1933;4:83–91.

22. Smirnov N. Table for estimating the goodness of fit of empirical distributions. Ann Math Stat 1948;19:279–81.
23. Vershik AM. Kantorovich metric: initial history and littleknown applications. J Math Sci 2006;133:1410–7.
24. Theil H. A rank-invariant method of linear and polynomial regression analysis. In: Raj B, Koerts J, editors. Henri Theil's contributions to economics and econometrics: econometric theory and methodology. Springer; 1992. p. 345–81.

25. Pearson K. Contributions to the mathematical theory of evolution. Philos Trans R Soc Lond A 1894;185:71–110.

26. Yan C, Yan Y, Wan Z, Zhang Z, Omberg L, Guinney J, et al. A Multifaceted benchmarking of synthetic electronic health record generation models. Nat Commun 2022;13:7609.
27. Yoon J, Mizrahi M, Ghalaty NF, Jarvinen T, Ravi AS, Brune P, et al. EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. NPJ Digit Med 2023;6:141.
28. Berkson J. Application of the logistic function to bio-assay. J Am Stat Assoc 1944;39:357–65.
29. Cutler A, Cutler DR, Stevens JR. Random forests. In: Zhang C, Ma Y, editors. Ensemble machine learning: methods and applications. Springer; 2012. p. 157–75.

30. Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, et al. Xgboost: Extreme Gradient Boosting. R package version 0.4-2. R Foundation for Statistical Computing; 2015. p. 1–4.

31. Lu PH, Wang PC, Yu CM. Empirical evaluation on synthetic data generation with generative adversarial network. In: Akerkar R, Jung JJ, editors. WIMS2019: Proceedings of the 9th International Conference on Web Intelligence, Mining and Semantics; 2019 Jun 26-28; Seoul, Korea. Association for Computing Machinery; 2019. p. 1-6.

32. Park N, Mohammadi M, Gorde K, Jajodia S, Park H, Kim Y. Data synthesis based on generative adversarial networks [Preprint]. Posted 2018 Jun 9. arXiv 1806.03384. https://doi.org/10.48550/arXiv.1806.03384.
33. Liu C, Wang C, Peng K, Huang H, Li Y, Cheng W. Socinf: membership inference attacks on social media health data with machine learning. IEEE Trans Comput Soc Syst 2019;6:907–21.


Editorial Office
Sungkyunkwan University School of Medicine
2066 Seobu-ro, Jangan-gu, Suwon, Gyeonggi-do 16419, Korea
Tel: +82-31-299-6038    Fax: +82-31-299-6029    E-mail: pfmjournal@skku.edu                

Copyright © 2024 by Sungkyunkwan University School of Medicine.
