Add synthetic-EHR generative evaluation metrics #1148

Open
chufangao wants to merge 2 commits into sunlabuiuc:master from chufangao:chil26_evals2

@chufangao
Collaborator

Summary

Adds pyhealth/metrics/generative/, a subpackage for evaluating synthetic
EHR data along three axes: privacy, utility, and statistical fidelity.

  • privacy.py: calc_nnaar (Nearest Neighbor Adversarial Accuracy
    Risk), calc_membership_inference (membership inference attack), and
    compute_discriminator_privacy (real-vs-synthetic discriminator score).
  • utility.py: compute_mle (machine learning efficacy, comparing
    train-on-real/test-on-real, TRTR, against train-on-synthetic/test-on-real,
    TSTR) and compute_prevalence_metrics (code-prevalence similarity: R²,
    Pearson, RMSE).
  • utils.py: shared data prep, a self-contained LSTM classifier, and a
    random-forest baseline.
  • evaluate_synthetic_ehr(): convenience orchestrator that runs the
    full suite and returns one merged {metric: (mean, std)} dict.

The metrics are ported from a standalone evaluation script. The
MIMIC-specific data-loading/CLI glue is dropped so the functions work on any
flat EHR dataframe (one row per patient/visit/code event). Public functions
are re-exported from pyhealth.metrics.
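
To make the "flat EHR dataframe" input concrete, here is a minimal sketch of how flat event rows could be pivoted into a patient-by-code multi-hot matrix before any metric is computed. It is an illustration only: the helper name `multi_hot_matrix`, the use of plain `(patient_id, code)` tuples instead of a dataframe, and the example codes are assumptions, not the data prep in utils.py.

```python
import numpy as np


def multi_hot_matrix(events, vocab=None):
    """Pivot flat (patient_id, code) rows into a patient-by-code 0/1 matrix.

    Hypothetical helper for illustration; not part of pyhealth.
    """
    patients = sorted({pid for pid, _ in events})
    if vocab is None:
        vocab = sorted({code for _, code in events})
    row = {pid: i for i, pid in enumerate(patients)}
    col = {code: j for j, code in enumerate(vocab)}
    matrix = np.zeros((len(patients), len(vocab)), dtype=np.int8)
    for pid, code in events:
        if code in col:  # codes outside the given vocabulary are ignored
            matrix[row[pid], col[code]] = 1  # repeated events collapse to 1
    return matrix, patients, vocab


# One row per patient/code event; p2 has the same code recorded twice.
events = [("p1", "I10"), ("p1", "E11"), ("p2", "I10"), ("p2", "I10"), ("p3", "N18")]
matrix, patients, vocab = multi_hot_matrix(events)
```

Passing the same `vocab` when encoding real and synthetic data keeps the columns aligned, which any column-wise comparison between the two needs.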

Cleanups applied during the port

  • logging instead of bare print calls.
  • Fixed a latent CUDA crash in the LSTM eval loop: tensors are now moved
    to the host with .cpu() before .numpy() is called.
  • Replaced scipy.stats.pearsonr with numpy.corrcoef to avoid an
    undeclared scipy dependency.
  • Input dataframes are copied instead of mutated in place.
  • Google-style docstrings, type hints, PEP8 (≤88 chars).
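
As an illustration of the scipy-free correlation mentioned above, the following sketch computes code-prevalence RMSE, Pearson (via numpy.corrcoef), and R² from two multi-hot matrices. The function name `prevalence_similarity` and the choice to define R² as the coefficient of determination with synthetic prevalence treated as the prediction of real prevalence are assumptions; compute_prevalence_metrics may define these differently.

```python
import numpy as np


def prevalence_similarity(real, synth):
    """Compare per-code prevalence of two patient-by-code 0/1 matrices.

    Hypothetical sketch; assumes both matrices share the same column order
    and that real prevalence is not constant across codes.
    """
    real_prev = np.asarray(real, dtype=float).mean(axis=0)
    synth_prev = np.asarray(synth, dtype=float).mean(axis=0)

    rmse = float(np.sqrt(np.mean((real_prev - synth_prev) ** 2)))
    # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is Pearson's r.
    pearson = float(np.corrcoef(real_prev, synth_prev)[0, 1])
    # R^2 with synthetic prevalence as the "prediction" of real prevalence.
    ss_res = float(np.sum((real_prev - synth_prev) ** 2))
    ss_tot = float(np.sum((real_prev - real_prev.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "pearson": pearson, "rmse": rmse}


real = np.array([[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]])
exact_copy_metrics = prevalence_similarity(real, real.copy())
```

For an exact copy the residuals are all zero, so RMSE is 0 and both R² and Pearson equal 1, which matches the behavior the tests below assert for the prevalence metric.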

Tests

tests/core/test_generative_metrics.py — 18 unittest cases, all passing:

  • 13 functional tests covering each metric and the orchestrator
    (lstm + rf modes, argument validation).

  • 5 behavioral tests (TestMetricsBehavior) that verify each metric
    responds sensibly across three synthetic datasets — an exact copy of the
    training data, a similar set (~15% of codes perturbed), and a different
    set (disjoint code vocabulary):

    | Metric | Verified behavior |
    |---|---|
    | Prevalence | RMSE 0 → 0.03 → 0.26; exact copy → RMSE 0, R²/Pearson = 1 |
    | NNAAR | Flags memorization: 1.0 → 0.1 → 0.0 |
    | Membership inference | Attack accuracy 1.0 → 0.94 → 0.46 (chance for unrelated data) |
    | Discriminator privacy | Disjoint-vocabulary data trivially flagged; real-derived data is not |
    | MLE (utility) | Exact copy reproduces real utility exactly; ratio degrades 1.0 → 0.98 → 0.81 |
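
The two extremes of the membership-inference row can be reproduced with a deliberately simple distance-based attack. This is a sketch under stated assumptions: the attack rule (guess "member" when a candidate has an exact match in the synthetic set), the helper names, and the toy bit-pattern data are all hypothetical and are not taken from calc_membership_inference or the PR's tests.

```python
import numpy as np


def encode(ids, n_bits, offset=0, width=None):
    """Toy multi-hot rows: active columns are the binary digits of each id."""
    width = width or n_bits
    rows = np.zeros((len(ids), width), dtype=np.int8)
    for r, value in enumerate(ids):
        for b in range(n_bits):
            rows[r, offset + b] = (value >> b) & 1
    return rows


def attack_accuracy(members, non_members, synthetic):
    """Guess 'member' when a candidate has a synthetic record at distance 0."""
    candidates = np.vstack([members, non_members])
    is_member = np.concatenate(
        [np.ones(len(members), dtype=bool), np.zeros(len(non_members), dtype=bool)]
    )
    # Hamming distance from every candidate to every synthetic record.
    dists = (candidates[:, None, :] != synthetic[None, :, :]).sum(axis=2)
    guessed_member = dists.min(axis=1) == 0
    return float((guessed_member == is_member).mean())


train = encode(range(1, 33), n_bits=7, width=14)     # members
holdout = encode(range(33, 65), n_bits=7, width=14)  # non-members
exact_copy = train.copy()                            # memorizing generator
disjoint = encode(range(1, 33), n_bits=7, offset=7, width=14)  # other codes

acc_exact = attack_accuracy(train, holdout, exact_copy)
acc_disjoint = attack_accuracy(train, holdout, disjoint)
```

With an exact copy every member has a zero-distance match and no holdout record does, so the attack is perfect. With a disjoint code vocabulary nothing matches, every candidate is guessed as a non-member, and accuracy falls to chance on the balanced candidate set.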

Docs

Added docs/api/metrics/pyhealth.metrics.generative.rst and a toctree entry
in docs/api/metrics.rst.

Notes

The discriminator-privacy score is degenerate for exact copies (the model
predicts a constant on identical features, so the score reflects test-split
balance rather than 0.5). The behavioral test asserts the robust direction —
disjoint synthetic data is cleanly flagged while real-derived data is not.
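
The degenerate case described in this note comes down to simple arithmetic: once the discriminator outputs one label for every row, its accuracy is just the share of that label in the test split. The sketch below shows this with a hypothetical 35/25 split; the function name and the split sizes are assumptions chosen for illustration.

```python
import numpy as np


def constant_prediction_accuracy(test_labels, constant):
    """Accuracy of a model that outputs the same label for every row."""
    test_labels = np.asarray(test_labels)
    return float((test_labels == constant).mean())


# Hypothetical test split: 35 real rows (label 1), 25 synthetic rows (label 0).
test_labels = np.array([1] * 35 + [0] * 25)
acc_predict_real = constant_prediction_accuracy(test_labels, 1)
acc_predict_synth = constant_prediction_accuracy(test_labels, 0)
```

Here the score would be 35/60 or 25/60 depending on which constant the model settles on, so it tracks how the random split happened to fall rather than landing on 0.5.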

chufangao and others added 2 commits May 17, 2026 23:46
Adds pyhealth/metrics/generative/, a subpackage for evaluating synthetic
EHR data along privacy, utility, and statistical-fidelity axes:

- privacy.py: NNAAR, membership inference attack, discriminator privacy
- utility.py: machine learning efficacy (TRTR vs TSTR), code-prevalence
  similarity (R2, Pearson, RMSE)
- utils.py: shared data prep, an LSTM classifier, and a random-forest
  baseline
- evaluate_synthetic_ehr(): convenience orchestrator for the full suite

These functions are ported from a standalone evaluation script. The
MIMIC-specific data-loading/CLI glue is dropped; the metrics work on any
flat EHR dataframe. Public functions are re-exported from
pyhealth.metrics. Adds unit tests in tests/core/test_generative_metrics.py
and Sphinx docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>