
Add R validation for 4 survey-enabled estimators #276

Open

igerber wants to merge 1 commit into main from survey-r-validation

Conversation


igerber (Owner) commented Apr 6, 2026

Summary

  • Add R benchmark script (benchmark_survey_estimators.R) generating golden JSON for 4 survey-enabled estimators: ImputationDiD, StackedDiD, SunAbraham, TripleDifference
  • Add Python tests consuming golden values (5 tests across 4 scenarios)
  • Document validation results in docs/benchmarks.rst with observed gaps and reproduction instructions

Methodology references (required if estimator / math changes)

  • N/A — no methodology changes. Tests validate existing survey weight propagation against R's survey::svyglm().

Validation

  • Tests added: tests/test_survey_estimator_validation.py (5 tests)
    • S1: ImputationDiD control-only WLS regression (coef gap <1e-10, SE 0.00%)
    • S1: ImputationDiD end-to-end smoke test (finite ATT/SE)
  • S2: StackedDiD with Q-weight × survey weight composition (coef gap <1e-10, SE 0.77%)
    • S3: SunAbraham IW-aggregated ATT (coef gap <1e-11, SE 0.00%)
    • S4: TripleDifference three-way interaction DDD (coef gap <1e-10, SE 0.36%)
  • R golden values generated by Rscript benchmarks/R/benchmark_survey_estimators.R
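The golden-value comparison the tests perform can be sketched as follows. This is an illustrative pattern only: the `golden_path` argument and the JSON keys `coef`/`se` are assumptions for the sketch, not the actual fixture schema in `tests/test_survey_estimator_validation.py`.

```python
import json

def check_against_golden(python_coef, python_se, golden_path, scenario,
                         coef_tol=1e-10, se_rel_tol=0.015):
    """Compare Python estimates to R golden values within tolerances.

    Point estimates are expected to agree almost exactly; standard errors
    only to a relative tolerance (1.5% per the PR description).
    """
    with open(golden_path) as f:
        golden = json.load(f)[scenario]
    assert abs(python_coef - golden["coef"]) < coef_tol
    assert abs(python_se - golden["se"]) / abs(golden["se"]) < se_rel_tol
```

Keeping the coefficient tolerance absolute and the SE tolerance relative mirrors the observed gaps above: coefficients match to near machine precision, while SEs differ by up to 0.77% due to variance-estimator differences between engines.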

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…kedDiD, SunAbraham, TripleDifference)

Expand survey R-validation coverage from 4 to 8 estimators. Each target
estimator reduces to a WLS regression under survey weights, validated
against R's survey::svyglm(). Point estimates match to <1e-10; SEs
within 0.77% (largest observed gap).
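The reduction described above can be illustrated with a minimal weighted-least-squares fit. This is a pure-Python sketch under a toy data-generating process, not the benchmark's actual code: for the point estimates, a survey-weighted regression is exactly WLS, which is also what `survey::svyglm()` returns for coefficients; the engines differ only in how standard errors are computed.

```python
import random

random.seed(0)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 - 1.0 * xi + random.gauss(0, 1) for xi in x]  # true beta = (2, -1)
w = [random.uniform(0.5, 2.0) for _ in range(n)]       # survey weights

# Closed-form weighted least squares for a single covariate
sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw       # weighted means
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
slope = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
intercept = ybar - slope * xbar
```

Because both engines solve this same weighted normal-equations problem, coefficient agreement to <1e-10 is expected; the 0.00%–0.77% SE gaps come from the variance estimators, not the fit.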

Deliverables:
- R benchmark script generating golden JSON (4 scenarios, 2 DGPs)
- Python tests consuming golden values (5 tests, 1.5% SE tolerance)
- Documentation in docs/benchmarks.rst

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot commented Apr 6, 2026

Overall Assessment

⚠️ Needs changes

Executive Summary

  • Cross-checked the PR against docs/methodology/REGISTRY.md and the affected estimator implementations. I did not find an undocumented estimator-method mismatch in ImputationDiD, StackedDiD, SunAbraham, or TripleDifference.
  • The S2 StackedDiD SE gap is a disclosed benchmark-harness asymmetry: the R side drops the finite population correction (FPC) on stacked data, while the Python estimator keeps the re-resolved stacked survey design.
  • P1: the new R benchmark helper computes inference inline without the repo’s NaN-all-or-nothing guard, so a future zero or non-finite SE would serialize misleading benchmark inference values.
  • P3: several golden values generated by the benchmark are never asserted in Python, so the exercised coverage is narrower than the fixture/docs suggest.
  • I could not execute the new test module here because the review environment does not have pytest or pandas.

Methodology

Cross-check against the registry and estimator docstrings did not surface an undocumented deviation in the library estimators touched by this PR.

  • Severity: P1. Impact: benchmarks/R/benchmark_survey_estimators.R:L195-L211 introduces a new inline inference helper (t_stat <- att / se, manual CI construction). If a benchmark case ever yields se <= 0 or non-finite SE, this will emit misleading Inf or point-estimate inference instead of the project’s required all-or-nothing NaN behavior. Because tests/test_survey_estimator_validation.py:L304-L325 only checks S4 coefficient/SE, those bad inference fields would land silently in the committed golden file. Concrete fix: replace the helper with guarded inference generation that sets all derived inference fields to NaN whenever se is non-finite or <= 0, and then assert those fields in Python.
  • Severity: P3. Impact: docs/benchmarks.rst:L802-L806 correctly discloses that S2 omits FPC on the R side while the Python estimator keeps it on the stacked design. I do not count this as a defect because it is documented and confined to the benchmark harness. Concrete fix: none required.
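The guarded inference the P1 finding asks for can be sketched as follows. The sketch is in Python for brevity (the actual fix belongs in the R helper in benchmarks/R/benchmark_survey_estimators.R), and the output field names are assumptions about the serialized schema:

```python
import math

def guarded_inference(att, se, t_crit, df):
    """All-or-nothing inference guard (a sketch, not the repo's R helper):
    if the SE is non-finite or <= 0, every derived field becomes NaN, so a
    degenerate case can never serialize an Inf t-stat or a confidence
    interval collapsed onto the point estimate."""
    if not (math.isfinite(se) and se > 0):
        nan = float("nan")
        return {"t_stat": nan, "ci_lower": nan, "ci_upper": nan, "df": nan}
    return {
        "t_stat": att / se,
        "ci_lower": att - t_crit * se,
        "ci_upper": att + t_crit * se,
        "df": df,
    }
```

The key property is atomicity: either every inference field is valid or every field is NaN, which is the project's required behavior and is trivially assertable from the Python side.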

Code Quality

  • No additional findings beyond the P1 inference helper above.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No blocking tech-debt issue introduced in the library code. The coverage gap below is currently untracked in TODO.md.

Security

  • No findings.

Documentation/Tests

  • No additional findings beyond the P3 golden-value coverage gap noted above.

Path to Approval

  1. Replace the inline S4 inference math in benchmarks/R/benchmark_survey_estimators.R:L195-L211 with guarded NaN propagation for zero or non-finite SEs.
  2. Extend tests/test_survey_estimator_validation.py:L304-L325 to assert the S4 inference fields generated by that helper (t_stat, CI, and df), so the benchmark cannot regress silently again.
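The requested S4 assertions could use a NaN-aware comparison so the guarded inference fields stay checkable even in the degenerate all-NaN case. A minimal helper sketch (the field names `t_stat`, CI bounds, and `df` follow the list above; how they are keyed in the fixture is an assumption):

```python
import math

def assert_close_or_both_nan(actual, expected, tol=1e-8):
    """NaN-aware field comparison: an all-NaN golden row and an all-NaN
    Python result agree; otherwise values must match within tol."""
    if math.isnan(actual) or math.isnan(expected):
        assert math.isnan(actual) and math.isnan(expected)
    else:
        assert abs(actual - expected) <= tol
```

Plain `==` or `pytest.approx` would reject NaN-vs-NaN, so without a helper like this the guarded path from step 1 could not be exercised by the tests.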
