
Add R validation for 4 survey-enabled estimators #276

Open

igerber wants to merge 1 commit into main from survey-r-validation

Conversation


igerber (Owner) commented Apr 6, 2026

Summary

  • Add R benchmark script (benchmark_survey_estimators.R) generating golden JSON for 4 survey-enabled estimators: ImputationDiD, StackedDiD, SunAbraham, TripleDifference
  • Add Python tests consuming golden values (5 tests across 4 scenarios)
  • Document validation results in docs/benchmarks.rst with observed gaps and reproduction instructions

Methodology references (required if estimator / math changes)

  • N/A — no methodology changes. Tests validate existing survey weight propagation against R's survey::svyglm().

Validation

  • Tests added: tests/test_survey_estimator_validation.py (5 tests)
    • S1: ImputationDiD control-only WLS regression (coef gap <1e-10, SE 0.00%)
    • S1: ImputationDiD end-to-end smoke test (finite ATT/SE)
  • S2: StackedDiD with Q-weight × survey weight composition (coef gap <1e-10, SE 0.77%)
    • S3: SunAbraham IW-aggregated ATT (coef gap <1e-11, SE 0.00%)
    • S4: TripleDifference three-way interaction DDD (coef gap <1e-10, SE 0.36%)
  • R golden values generated by Rscript benchmarks/R/benchmark_survey_estimators.R
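The golden-value comparison the tests perform can be sketched as follows. This is an illustrative pattern only: the `golden_path` argument and the JSON keys `coef`/`se` are assumptions for the sketch, not the actual fixture schema in `tests/test_survey_estimator_validation.py`.

```python
import json

def check_against_golden(python_coef, python_se, golden_path, scenario,
                         coef_tol=1e-10, se_rel_tol=0.015):
    """Compare Python estimates to R golden values within tolerances.

    Point estimates are expected to agree almost exactly; standard errors
    only to a relative tolerance (1.5% per the PR description).
    """
    with open(golden_path) as f:
        golden = json.load(f)[scenario]
    assert abs(python_coef - golden["coef"]) < coef_tol
    assert abs(python_se - golden["se"]) / abs(golden["se"]) < se_rel_tol
```

Keeping the coefficient tolerance absolute and the SE tolerance relative mirrors the observed gaps above: coefficients match to near machine precision, while SEs differ by up to 0.77% due to variance-estimator differences between engines.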

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…kedDiD, SunAbraham, TripleDifference)

Expand survey R-validation coverage from 4 to 8 estimators. Each target
estimator reduces to a WLS regression under survey weights, validated
against R's survey::svyglm(). Point estimates match to <1e-10; SEs
within 0.77% (largest observed gap).
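The reduction described above can be illustrated with a minimal weighted-least-squares fit. This is a pure-Python sketch under a toy data-generating process, not the benchmark's actual code: for the point estimates, a survey-weighted regression is exactly WLS, which is also what `survey::svyglm()` returns for coefficients; the engines differ only in how standard errors are computed.

```python
import random

random.seed(0)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 - 1.0 * xi + random.gauss(0, 1) for xi in x]  # true beta = (2, -1)
w = [random.uniform(0.5, 2.0) for _ in range(n)]       # survey weights

# Closed-form weighted least squares for a single covariate
sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw       # weighted means
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
slope = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
intercept = ybar - slope * xbar
```

Because both engines solve this same weighted normal-equations problem, coefficient agreement to <1e-10 is expected; the 0.00%–0.77% SE gaps come from the variance estimators, not the fit.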

Deliverables:
- R benchmark script generating golden JSON (4 scenarios, 2 DGPs)
- Python tests consuming golden values (5 tests, 1.5% SE tolerance)
- Documentation in docs/benchmarks.rst

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot commented Apr 6, 2026

Overall Assessment

⚠️ Needs changes

Executive Summary

  • Cross-checked the PR against docs/methodology/REGISTRY.md and the affected estimator implementations. I did not find an undocumented estimator-method mismatch in ImputationDiD, StackedDiD, SunAbraham, or TripleDifference.
  • The S2 StackedDiD SE gap is a disclosed benchmark-harness asymmetry: the R side drops the finite population correction (FPC) on stacked data, while the Python estimator keeps the re-resolved stacked survey design.
  • P1: the new R benchmark helper computes inference inline without the repo’s NaN-all-or-nothing guard, so a future zero or non-finite SE would serialize misleading benchmark inference values.
  • P3: several golden values generated by the benchmark are never asserted in Python, so the exercised coverage is narrower than the fixture/docs suggest.
  • I could not execute the new test module here because the review environment does not have pytest or pandas.

Methodology

Cross-check against the registry and estimator docstrings did not surface an undocumented deviation in the library estimators touched by this PR.

  • Severity: P1. Impact: benchmarks/R/benchmark_survey_estimators.R:L195-L211 introduces a new inline inference helper (t_stat <- att / se, manual CI construction). If a benchmark case ever yields se <= 0 or non-finite SE, this will emit misleading Inf or point-estimate inference instead of the project’s required all-or-nothing NaN behavior. Because tests/test_survey_estimator_validation.py:L304-L325 only checks S4 coefficient/SE, those bad inference fields would land silently in the committed golden file. Concrete fix: replace the helper with guarded inference generation that sets all derived inference fields to NaN whenever se is non-finite or <= 0, and then assert those fields in Python.
  • Severity: P3. Impact: docs/benchmarks.rst:L802-L806 correctly discloses that S2 omits FPC on the R side while the Python estimator keeps it on the stacked design. I do not count this as a defect because it is documented and confined to the benchmark harness. Concrete fix: none required.
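The guarded inference the P1 finding asks for can be sketched as follows. The sketch is in Python for brevity (the actual fix belongs in the R helper in benchmarks/R/benchmark_survey_estimators.R), and the output field names are assumptions about the serialized schema:

```python
import math

def guarded_inference(att, se, t_crit, df):
    """All-or-nothing inference guard (a sketch, not the repo's R helper):
    if the SE is non-finite or <= 0, every derived field becomes NaN, so a
    degenerate case can never serialize an Inf t-stat or a confidence
    interval collapsed onto the point estimate."""
    if not (math.isfinite(se) and se > 0):
        nan = float("nan")
        return {"t_stat": nan, "ci_lower": nan, "ci_upper": nan, "df": nan}
    return {
        "t_stat": att / se,
        "ci_lower": att - t_crit * se,
        "ci_upper": att + t_crit * se,
        "df": df,
    }
```

The key property is atomicity: either every inference field is valid or every field is NaN, which is the project's required behavior and is trivially assertable from the Python side.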

Code Quality

  • No additional findings beyond the P1 inference helper above.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No blocking tech-debt issue introduced in the library code. The coverage gap below is currently untracked in TODO.md.

Security

  • No findings.

Documentation/Tests

  • No additional findings beyond the P3 golden-value coverage gap noted above.

Path to Approval

  1. Replace the inline S4 inference math in benchmarks/R/benchmark_survey_estimators.R:L195-L211 with guarded NaN propagation for zero or non-finite SEs.
  2. Extend tests/test_survey_estimator_validation.py:L304-L325 to assert the S4 inference fields generated by that helper (t_stat, CI, and df), so the benchmark cannot regress silently again.
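The requested S4 assertions could use a NaN-aware comparison so the guarded inference fields stay checkable even in the degenerate all-NaN case. A minimal helper sketch (the field names `t_stat`, CI bounds, and `df` follow the list above; how they are keyed in the fixture is an assumption):

```python
import math

def assert_close_or_both_nan(actual, expected, tol=1e-8):
    """NaN-aware field comparison: an all-NaN golden row and an all-NaN
    Python result agree; otherwise values must match within tol."""
    if math.isnan(actual) or math.isnan(expected):
        assert math.isnan(actual) and math.isnan(expected)
    else:
        assert abs(actual - expected) <= tol
```

Plain `==` or `pytest.approx` would reject NaN-vs-NaN, so without a helper like this the guarded path from step 1 could not be exercised by the tests.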
