Add synthetic-EHR generative evaluation metrics #1148
Open
chufangao wants to merge 2 commits into
Adds pyhealth/metrics/generative/, a subpackage for evaluating synthetic EHR data along privacy, utility, and statistical-fidelity axes:

- privacy.py: NNAAR, membership inference attack, discriminator privacy
- utility.py: machine learning efficacy (TRTR vs TSTR), code-prevalence similarity (R2, Pearson, RMSE)
- utils.py: shared data prep, an LSTM classifier, and a random-forest baseline
- evaluate_synthetic_ehr(): convenience orchestrator for the full suite

These functions are ported from a standalone evaluation script. The MIMIC-specific data-loading/CLI glue is dropped; the metrics work on any flat EHR dataframe. Public functions are re-exported from pyhealth.metrics.

Adds unit tests in tests/core/test_generative_metrics.py and Sphinx docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Adds pyhealth/metrics/generative/, a subpackage for evaluating synthetic EHR data along three axes: privacy, utility, and statistical fidelity.

- privacy.py: calc_nnaar (Nearest Neighbor Adversarial Accuracy Risk), calc_membership_inference (membership inference attack), and compute_discriminator_privacy (real-vs-synthetic discriminator score).
- utility.py: compute_mle (machine learning efficacy, TRTR vs TSTR) and compute_prevalence_metrics (code-prevalence similarity: R², Pearson, RMSE).
- utils.py: shared data prep, a self-contained LSTM classifier, and a random-forest baseline.
- evaluate_synthetic_ehr(): convenience orchestrator that runs the full suite and returns one merged {metric: (mean, std)} dict.

The metrics are ported from a standalone evaluation script. The MIMIC-specific data-loading/CLI glue is dropped so the functions work on any flat EHR dataframe (one row per patient/visit/code event). Public functions are re-exported from pyhealth.metrics.

Cleanups applied during the port
- logging instead of bare print calls.
- .cpu().numpy()).
- Replaced scipy.stats.pearsonr with numpy.corrcoef to avoid an undeclared scipy dependency.

Tests
tests/core/test_generative_metrics.py: 18 unittest cases, all passing:

- 13 functional tests covering each metric and the orchestrator (lstm + rf modes, argument validation).
- 5 behavioral tests (TestMetricsBehavior) that verify each metric responds sensibly across three synthetic datasets: an exact copy of the training data, a similar set (~15% of codes perturbed), and a different set (disjoint code vocabulary):
  - 0 → 0.03 → 0.26; exact copy → RMSE 0, R²/Pearson = 1
  - 1.0 → 0.1 → 0.0
  - 1.0 → 0.94 → 0.46 (chance for unrelated data)
  - 1.0 → 0.98 → 0.81

Docs
Added docs/api/metrics/pyhealth.metrics.generative.rst and a toctree entry in docs/api/metrics.rst.

Notes
The discriminator-privacy score is degenerate for exact copies (the model
predicts a constant on identical features, so the score reflects test-split
balance rather than 0.5). The behavioral test asserts the robust direction —
disjoint synthetic data is cleanly flagged while real-derived data is not.
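
For reviewers unfamiliar with the code-prevalence similarity metric, here is a minimal dependency-free sketch of the idea: compute the fraction of patients carrying each code in the real and synthetic data, then compare the two prevalence vectors with RMSE, R², and Pearson correlation. This is not the code in this PR. The function names code_prevalence and prevalence_similarity, the (patient_id, code) tuple input, and the choice of real prevalence as the R² target are all assumptions made for illustration. The PR itself works on dataframes and uses numpy.corrcoef for the Pearson term.

```python
import math
from collections import defaultdict


def code_prevalence(events, vocabulary):
    """Fraction of distinct patients that have each code in `vocabulary`.

    `events` is an iterable of (patient_id, code) pairs; this flat layout is
    an assumption for the sketch, not the PR's actual input format.
    """
    patients = set()
    patients_per_code = defaultdict(set)
    for patient_id, code in events:
        patients.add(patient_id)
        patients_per_code[code].add(patient_id)
    if not patients:
        raise ValueError("events must contain at least one patient")
    return [len(patients_per_code[code]) / len(patients) for code in vocabulary]


def prevalence_similarity(real, synthetic):
    """RMSE, R^2, and Pearson r between two equal-length prevalence vectors.

    R^2 treats `real` as the target and `synthetic` as the prediction
    (an assumption; the PR may define it differently).
    """
    if len(real) != len(synthetic) or not real:
        raise ValueError("prevalence vectors must be non-empty and equal length")
    n = len(real)
    mean_real = sum(real) / n
    mean_syn = sum(synthetic) / n
    ss_res = sum((r - s) ** 2 for r, s in zip(real, synthetic))
    ss_real = sum((r - mean_real) ** 2 for r in real)
    ss_syn = sum((s - mean_syn) ** 2 for s in synthetic)
    cov = sum((r - mean_real) * (s - mean_syn) for r, s in zip(real, synthetic))

    rmse = math.sqrt(ss_res / n)
    # Both statistics are undefined when a vector has zero variance.
    r2 = 1.0 - ss_res / ss_real if ss_real > 0 else float("nan")
    if ss_real > 0 and ss_syn > 0:
        pearson = cov / math.sqrt(ss_real * ss_syn)
    else:
        pearson = float("nan")
    return {"rmse": rmse, "r2": r2, "pearson": pearson}


# Toy data: four real patients, and a synthetic set with shifted prevalences.
vocabulary = ["A", "B", "C"]
real_events = [("p1", "A"), ("p1", "B"), ("p2", "A"),
               ("p3", "C"), ("p4", "A"), ("p4", "C")]
shifted_events = [("s1", "B"), ("s2", "B"), ("s3", "C"), ("s4", "B")]

real_prev = code_prevalence(real_events, vocabulary)
exact = prevalence_similarity(real_prev, code_prevalence(real_events, vocabulary))
shifted = prevalence_similarity(real_prev, code_prevalence(shifted_events, vocabulary))

print("exact copy:", exact)
print("shifted:", shifted)
```

On an exact copy the sketch yields RMSE 0 with R² and Pearson equal to 1, which is the same boundary behavior the behavioral tests assert for the real metric.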