
Feature/wooldridge did #216

Merged

igerber merged 16 commits into igerber:main from wenddymacro:feature/wooldridge-did
Apr 4, 2026

Conversation

@wenddymacro
Contributor

@wenddymacro wenddymacro commented Mar 20, 2026

Summary

  • Rebase and forge third-party contribution implementing Wooldridge (2025, 2023) Extended Two-Way Fixed Effects estimator
  • OLS, logit, and Poisson QMLE paths with ASF-based ATT and delta-method SEs
  • Four aggregation types (simple, group, calendar, event) following Stata jwdid_estat
  • Fixed QMLE sandwich to use weight_type="aweight" for correct unweighted scores
  • Fixed control_group="never_treated" to correctly exclude pre-treatment observations from treated units
  • Fixed nonlinear rank-deficiency handling, bootstrap_weights validation, n_control_units metadata
  • Added solve_poisson IRLS solver to linalg.py
  • 63 tests covering all paths, edge cases, control group distinction, and methodology checks
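The summary mentions a new solve_poisson IRLS solver in linalg.py but does not show it; as a reference point, the core IRLS iteration for a Poisson/log-link fit can be sketched in plain NumPy (function name, signature, and defaults here are illustrative, not the actual linalg.py API):

```python
import numpy as np

def poisson_irls(X, y, tol=1e-10, max_iter=100):
    """Fit Poisson regression with log link via IRLS (illustrative sketch)."""
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)             # conditional mean, always positive
        W = mu                       # IRLS weights for Poisson / log link
        z = eta + (y - mu) / mu      # working response
        # Weighted least-squares step: solve (X'WX) beta = X'Wz
        XtW = X.T * W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

At convergence the score equation X'(y - mu) = 0 holds, which is the usual correctness check for a QMLE fit.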

Methodology references (required if estimator / math changes)

  • Method name(s): WooldridgeDiD / Extended Two-Way Fixed Effects (ETWFE)
  • Paper / source link(s):
  • Any intentional deviations from the source (and why):
    • Cell-level aggregation weights (n_{g,t}) instead of W2025 Eqs. 7.2-7.4 cohort-share weights — matches Stata jwdid_estat (documented in REGISTRY.md)
    • QMLE cluster-robust small-sample adjustment uses (G/(G-1))*(n-1)/(n-k) vs Stata's G/(G-1) only — conservative, inflates SEs slightly (documented in REGISTRY.md and TODO.md)
    • R's etwfe uses fixest for nonlinear paths; this uses direct QMLE via compute_robust_vcov (documented in REGISTRY.md)
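The small-sample deviation in the second bullet is easy to quantify directly; the cluster, observation, and parameter counts below are made up for illustration only:

```python
# Compare the two cluster-robust small-sample adjustments described above.
# G = clusters, n = observations, k = parameters (illustrative values).
G, n, k = 50, 2500, 40

stata_factor = G / (G - 1)                        # Stata jwdid-style: G/(G-1) only
this_factor = (G / (G - 1)) * (n - 1) / (n - k)   # factor used in this PR

print(stata_factor)
print(this_factor)
# SEs scale with the square root of the factor, so the extra (n-1)/(n-k)
# term inflates SEs only slightly unless k is large relative to n:
print((this_factor / stata_factor) ** 0.5)
```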

Validation

  • Tests added/updated: tests/test_wooldridge.py (63 tests), tests/test_linalg.py (4 solve_poisson tests)
  • Backtest / simulation / notebook evidence (if applicable): docs/tutorials/16_wooldridge_etwfe.ipynb
  • Paper review: docs/methodology/papers/wooldridge-etwfe-review.md

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

@igerber
Owner

igerber commented Mar 21, 2026

Thanks for this contribution — the Wooldridge ETWFE estimator fills a real gap in the library. The nonlinear outcome support (logit/Poisson via the ASF approach) is something we've had on our roadmap, and this is the right paper to implement it from.

I've done an independent review of Wooldridge (2023) against your implementation. The REGISTRY.md section is solid overall, but there are items that need to be addressed before we can move forward.

Files to remove
These should not be part of the PR:

  • uv.lock — we don't use uv
  • docs/superpowers/CHECKPOINT.md — implementation scaffold
  • docs/superpowers/plans/2026-03-18-wooldridge-did.md — same
  • docs/superpowers/specs/2026-03-18-wooldridge-did-design.md — same

PR template

Please fill in the methodology references, validation, and security fields in the PR description.

Methodology notes from paper review

I did an independent review of Wooldridge (2023) and have a few items to flag:

  1. Delta-method gradient verification. The paper does not write out the explicit gradient for the nonlinear ASF-based ATTs (it cites Wooldridge 2010, problem 12.17). Your implementation hand-codes this gradient. Please add a numerical gradient check (finite differences) as a test to verify correctness — this is the most error-prone part of the nonlinear path and the whole value proposition of this estimator depends on it.
  2. Aggregation weights. The paper describes aggregation conceptually (Section 3.1, p. C50) but doesn't give explicit weight formulas with equation numbers. Please confirm your weight definitions match Stata jwdid_estat and document the exact formulas in the REGISTRY.md section.
  3. Covariate centring. Wooldridge centres covariates at cohort means (X̂_{ig} = X_i - X̄_g, p. C48) in the treatment interaction terms. This affects interpretation of δ_{gs} on the index scale. Please verify this is implemented correctly.
  4. Proposition 3.1 equivalence. The paper proves that imputation and pooled estimation produce identical ATTs when using the canonical link function. This is a testable property - please add a test that verifies this equivalence numerically (e.g., compare imputation-based ATTs to pooled ATTs for logit/Poisson on simulated data).
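A central finite-difference check of the kind requested in item 1 might look like the sketch below. The ASF functional and its analytic gradient are illustrative stand-ins for the hand-coded implementation, not the library's actual code:

```python
import numpy as np

def numerical_gradient(f, beta, eps=1e-6):
    """Central finite-difference gradient of a scalar function at beta."""
    g = np.zeros_like(beta)
    for j in range(beta.size):
        e = np.zeros_like(beta)
        e[j] = eps
        g[j] = (f(beta + e) - f(beta - e)) / (2 * eps)
    return g

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative logit ASF-style ATT: mean over treated rows of
# Lambda(x'b) - Lambda(x'b - b_d), where b_d is the cell interaction coefficient.
def logit_asf_att(beta, X_treated, d_index):
    idx1 = X_treated @ beta
    return np.mean(expit(idx1) - expit(idx1 - beta[d_index]))

def logit_asf_att_grad(beta, X_treated, d_index):
    """Hand-coded gradient to be verified against finite differences."""
    idx1 = X_treated @ beta
    idx0 = idx1 - beta[d_index]
    lam1 = expit(idx1) * (1 - expit(idx1))
    lam0 = expit(idx0) * (1 - expit(idx0))
    ed = np.zeros_like(beta)
    ed[d_index] = 1.0
    # d(idx0)/d(beta_j) = x_ij - 1{j == d_index}
    return (lam1[:, None] * X_treated - lam0[:, None] * (X_treated - ed)).mean(axis=0)
```

A test then asserts `numerical_gradient(...)` and `logit_asf_att_grad(...)` agree to tight tolerance; any sign or index error in the hand-coded gradient shows up immediately.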

Validation
We need at least one numerical benchmark against Stata jwdid or R etwfe — verifying that ATT estimates match to a reasonable tolerance. See benchmarks/R/ for examples of how we validate other estimators.

Additionally, a test comparing OLS ETWFE ATT(g,t) to CallawaySantAnna ATT(g,t) on the same dataset would be valuable, since they should be approximately equivalent under linear PT with never-treated controls.

Code conventions

  • The nonlinear paths (logit/Poisson) build their own sandwich vcov manually. Our convention is to route all inference through compute_robust_vcov in linalg.py - see CLAUDE.md under "Unified linalg.py backend". Please refactor to use that where possible.
  • All inference fields (t_stat, p_value, conf_int) must be computed together using safe_inference() from diff_diff.utils. Please verify this is the case throughout.
  • The Poisson FE path (Section 3.3 of the paper) does not suffer from incidental parameters - this is a unique property of the exponential mean. If your implementation uses unit dummies for Poisson, this is fine, but please document the distinction from the logit path where unit dummies would cause incidental parameters problems.

Happy to answer questions on any of this. This is a meaningful contribution and we want to make sure it lands well.

@wenddymacro
Contributor Author

Methodology

Primary reference:
Wooldridge, J. M. (2025). Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. Empirical Economics, 69(5), 2545–2587.
(Published version of NBER Working Paper 29154.)

Secondary reference:
Wooldridge, J. M. (2023). Simple approaches to nonlinear difference-in-differences with panel data. The Econometrics Journal, 26(3), C31–C66.

Application reference:
Nagengast, A. J., Rios-Avila, F., & Yotov, Y. V. (2026). The European single market and intra-EU trade: an assessment with heterogeneity-robust difference-in-differences methods. Economica, 93(369), 298–331.
(Empirical application of ETWFE; co-authored by the jwdid package author.)

Callaway-Sant'Anna equivalence (Proposition 3.1):
OLS ETWFE ATT(g,t) matches CallawaySantAnna ATT(g,t) on mpdta to within 5e-3,
verified in test_ols_etwfe_att_matches_callaway_santanna.

Validation

  • tests/test_wooldridge.py::TestMethodologyCorrectness::test_ols_etwfe_att_matches_callaway_santanna
    confirms OLS ETWFE ATT(g,t) ≈ CallawaySantAnna ATT(g,t) on mpdta (7 matched
    (g,t) cells, max absolute difference 1.7e-3).
  • benchmarks/python/benchmark_wooldridge.py added for timing and accuracy
    benchmarking.

Security

  • No network calls, no subprocess execution, no user-controlled eval paths
    in any new code. All computation is local NumPy/pandas/scipy only.

wenddymacro added a commit to wenddymacro/diff-diff that referenced this pull request Mar 25, 2026
…vcov backend, add tests

- git rm uv.lock + 3 docs/superpowers/ files; extend .gitignore
- Extend compute_robust_vcov with optional weights for QMLE sandwich vcov
  (logit: w_i = p_i(1-p_i), Poisson: w_i = mu_i); Rust path bypassed
- Replace manual vcov loops in logit/Poisson paths with compute_robust_vcov
- Expose _gradient in group_time_effects for delta-method transparency
- Add 3 tests to TestComputeRobustVcov (weighted sandwich, bypass, null-weight parity)
- Add 3 tests to TestMethodologyCorrectness (logit gradient, Poisson gradient,
  OLS ETWFE ≡ CallawaySantAnna Proposition 3.1 equivalence)
- Document aggregation weights formula + deviation-from-R note in REGISTRY.md
- Add benchmark_wooldridge.py and register in run_benchmarks.py
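The weighted-sandwich extension described in the first bullets can be sketched generically; the function name and argument layout below are illustrative, not the real compute_robust_vcov signature:

```python
import numpy as np

def qmle_sandwich_vcov(X, resid, w, cluster_ids):
    """Cluster-robust QMLE sandwich vcov (illustrative; not the library's API).

    X           : (n, k) design matrix
    resid       : (n,) score residuals, e.g. y - p for logit, y - mu for Poisson
    w           : (n,) Hessian weights, e.g. p*(1-p) for logit, mu for Poisson
    cluster_ids : (n,) cluster labels
    """
    bread = np.linalg.inv(X.T @ (X * w[:, None]))
    k = X.shape[1]
    meat = np.zeros((k, k))
    for c in np.unique(cluster_ids):
        m = cluster_ids == c
        s = X[m].T @ resid[m]      # score summed within cluster
        meat += np.outer(s, s)
    return bread @ meat @ bread
```

With each observation as its own cluster this reduces to the HC0 sandwich, which makes a convenient parity test for the weighted path.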
igerber and others added 2 commits April 4, 2026 10:05
Rebase and forge third-party contribution (wenddymacro) implementing
Wooldridge (2025, 2023) Extended Two-Way Fixed Effects estimator.

- OLS, logit, and Poisson QMLE paths with ASF-based ATT
- Delta-method SEs for nonlinear models, cluster-robust sandwich
- Four aggregation types (simple, group, calendar, event)
- Fixed QMLE sandwich to use weight_type="aweight" for correct
  unweighted scores in bread computation
- Added solve_poisson IRLS solver to linalg.py
- Complete documentation: REGISTRY, README, CHANGELOG, tutorial
- 60 tests covering all paths, edge cases, and methodology checks

Co-Authored-By: wenddymacro <50739376+wenddymacro@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… P2s

P0: Fix _filter_sample so control_group="never_treated" excludes
pre-treatment observations from treated units. Previously both
settings produced identical samples (treated_mask=cohort>0 kept
all treated-unit obs regardless). Now never_treated restricts to
post-treatment obs from treated units + all never-treated obs.

P1: Add NaN checks in logit/Poisson ASF loops to skip cells whose
interaction coefficients were dropped due to rank deficiency.
Zero out NaN entries in beta before vcov computation.

P2a: Validate bootstrap_weights in __init__ — invalid values now
raise ValueError instead of silently falling through to Mammen.

P2b: Fix n_control_units metadata to count not-yet-treated units
when control_group="not_yet_treated" (was counting only cohort==0).

P3a: Change bootstrap loop to return_vcov=False (only coefs used).

P3b: Soften "matches Stata exactly" claims in tutorial to
"follows the jwdid specification" with registry deviation note.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: cfe0e5c1aac241714f914f296fa3920ba6cdbf94


Overall Assessment
⚠️ Needs changes

Executive Summary

Methodology

  • Severity P1. Impact: the never_treated path is an undocumented estimator change. _filter_sample() deletes treated pre-treatment observations, and _build_interaction_matrix() only creates post/anticipation cells, so aggregate("event") under never_treated cannot recover the placebo leads that the registry and official etwfe docs expect from that control choice. The test suite currently locks in the opposite behavior by asserting all(k >= 0) for never_treated. Concrete fix: keep treated histories in-sample for the reference ETWFE path, implement the never reference/control logic instead of sample deletion, and replace the all(k >= 0) test with a parity test that expects negative event times when the design supports them. diff_diff/wooldridge.py:L70, diff_diff/wooldridge.py:L105, docs/methodology/REGISTRY.md:L1148, tests/test_wooldridge.py:L486.
  • Severity P1. Impact: there is no assumption check that the requested control design is identified. fit() proceeds even if there are no treated cohorts, no never-treated units under never_treated, or no valid comparison cells. That is a missing identification guard on a new estimator, and other estimators in this library already fail fast here. Concrete fix: after sample construction, raise on len(groups) == 0; under never_treated, require at least one never-treated unit; under not_yet_treated, require at least one valid untreated comparison cohort/period. Add tests for all-never-treated, all-treated single-cohort, and never_treated with zero never-treated units. diff_diff/wooldridge.py:L290, diff_diff/wooldridge.py:L391, diff_diff/staggered.py:L1556.

Code Quality

  • Severity P1. Impact: the Poisson path does not honor rank_deficient_action. _fit_poisson() calls solve_poisson() without any rank handling, and solve_poisson() regularizes the Hessian with 1e-12 * I and silently breaks on LinAlgError rather than using the library's established drop/warn/error contract. On collinear nonlinear designs, that can mean late vcov failures or unstable unidentified ATT/SEs. Concrete fix: add the same QR-based rank detection / dropped-column expansion used by solve_ols() and solve_logit(), thread rank_deficient_action through _fit_poisson(), and add regression tests for all-eventually-treated and fully-treated-calendar-time Poisson panels. diff_diff/wooldridge.py:L724, diff_diff/linalg.py:L2347, tests/test_wooldridge.py:L652
  • Severity P1. Impact: n_bootstrap, bootstrap_weights, and seed are part of the public estimator API and get_params(), but only _fit_ols() uses them. For logit/Poisson, nonzero bootstrap requests are silently ignored and users still get analytic SEs. That is incomplete parameter propagation on an inference parameter. Concrete fix: either reject bootstrap settings when method != "ols" with a clear ValueError, or implement nonlinear bootstrap support and test the supported behavior explicitly. diff_diff/wooldridge.py:L210, diff_diff/wooldridge.py:L267, diff_diff/wooldridge.py:L347, diff_diff/wooldridge.py:L494, tests/test_wooldridge.py:L455
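The drop/expand pattern the rank-deficiency finding asks for can be sketched as follows; a greedy projection check stands in for pivoted QR here, and all names are illustrative rather than the library's actual helpers:

```python
import numpy as np

def drop_collinear(X, tol=1e-9):
    """Detect collinear columns (illustrative sketch of rank detection).

    Returns the indices of linearly independent columns; what happens to
    the dropped ones (warn, error, silent) is the rank_deficient_action
    policy decision discussed above.
    """
    kept = []
    basis = np.empty((X.shape[0], 0))
    for j in range(X.shape[1]):
        col = X[:, j].astype(float)
        if basis.shape[1]:
            col = col - basis @ (basis.T @ col)   # residual after projection
        nrm = np.linalg.norm(col)
        if nrm > tol * max(1.0, np.linalg.norm(X[:, j])):
            kept.append(j)
            basis = np.column_stack([basis, col / nrm])
    return kept

def expand_beta(beta_reduced, kept, k_full):
    """Re-expand reduced coefficients to full length, NaN for dropped columns."""
    beta = np.full(k_full, np.nan)
    beta[kept] = beta_reduced
    return beta
```

A production version would use a rank-revealing (pivoted) QR, but the kept-columns/NaN-expansion contract is the part the finding is about.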

Performance
No material findings.

Maintainability
No standalone findings beyond the method-coupling issue already called out in diff_diff/wooldridge.py:L70.

Tech Debt

  • Severity P3. Impact: the QMLE aweight small-sample adjustment, cell-count aggregation weights, canonical-link warning gap, and missing Stata golden-value tests are already documented in the Methodology Registry / TODO.md, so they are mitigated and should not block this PR. Concrete fix: none required for approval; keep them tracked as written. docs/methodology/REGISTRY.md:L1119, docs/methodology/REGISTRY.md:L1128, TODO.md:L75

Security
No findings.

Documentation/Tests

  • Severity P2. Impact: the docs describe a different estimator than the code. The tutorial says default OLS uses additive cohort/time FE and that t < g cells are available as placebo checks, while the implementation uses unit+time FE for OLS and only constructs t >= g - anticipation cells. The API page also says the implementation matches Stata exactly even though the registry/TODO already record deviations. Concrete fix: rewrite the OLS tutorial section to match the actual design, explain when negative event times exist, and replace “exactly” with “with documented deviations in REGISTRY.md”; update the tests so never_treated behavior is checked against the intended reference behavior instead of hard-coding k >= 0. docs/tutorials/16_wooldridge_etwfe.ipynb:L69, docs/tutorials/16_wooldridge_etwfe.ipynb:L100, docs/api/wooldridge_etwfe.rst:L4, docs/api/wooldridge_etwfe.rst:L13, tests/test_wooldridge.py:L486.
  • Execution note: I could not run the test suite in this sandbox because numpy and pandas are not installed.

Path to Approval

  1. Rework control_group="never_treated" so it matches the intended ETWFE/reference behavior, and add a regression test that negative event times appear when the design supports placebo leads.
  2. Add prefit identification checks for missing treated cohorts, missing never-treated controls, and missing valid untreated comparison cells.
  3. Implement rank-deficiency handling for Poisson that honors rank_deficient_action and add collinearity tests for nonlinear paths.
  4. Either reject bootstrap parameters for nonlinear methods or implement nonlinear bootstrap inference and test it.

…son rank handling

Finding 1 (P1): Rewrite never_treated implementation. Instead of deleting
pre-treatment observations (wrong), keep all data in-sample and control
the comparison pool via the interaction matrix. For never_treated,
_build_interaction_matrix creates ALL (g,t) pairs including pre-treatment
cells as placebo indicators, so only never-treated units remain in the
baseline. Pre-treatment coefficients serve as placebo/pre-trend tests
and aggregate("event") now produces negative event times.

Finding 2 (P1): Add identification checks in fit() — raise ValueError
for no treated cohorts, no never-treated units with never_treated control,
and empty interaction matrix.

Finding 3 (P1): Add rank_deficient_action to solve_poisson matching
solve_logit/solve_ols pattern. Uses QR-based rank detection, drops
collinear columns, expands beta with NaN. Warns on singular Hessian
instead of silent break.

Finding 4 (P1): Reject bootstrap params for nonlinear methods —
n_bootstrap > 0 with method != "ols" raises ValueError.

Finding 5 (P2): Fix tutorial to accurately describe OLS FE structure
(unit+time via within-transformation, not additive cohort+time).

Also: skip NaN coefficients in OLS/logit/Poisson ATT extraction, fix
vcov submatrix to only include identified cells.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 6446bacd379ed694ada44044c08fb05da121b6ad


Overall Assessment
⚠️ Needs changes

Executive Summary

  • The prior never_treated methodology issue appears addressed: treated histories are now kept in-sample and placebo leads are supported for that control choice.
  • The prior nonlinear-bootstrap propagation issue is also addressed: n_bootstrap > 0 is now explicitly rejected for method="logit" and method="poisson".
  • One P1 remains on identification: control_group="not_yet_treated" still has no fail-fast check for designs with zero untreated comparison support, so invalid ETWFE designs can fall through to estimation instead of raising a clear identification error.
  • One P1 remains on nonlinear inference: rank-deficient logit/Poisson designs drop columns in the solver, but the QMLE vcov is still computed on the original singular matrix, so rank_deficient_action="warn"/"silent" is not actually honored.
  • The documented QMLE small-sample factor, cell-count aggregation weights, canonical-link warning gap, and missing Stata goldens are properly tracked in REGISTRY.md / TODO.md; I did not count those against approval.

Methodology

  • Severity P1. Impact: WooldridgeDiD still lacks an explicit identification check for control_group="not_yet_treated" when no untreated comparison cells exist. fit() now rejects “no treated cohorts” and never_treated without never-treated units, but it does not verify that any untreated donor observations remain for the requested ATT cells. In single-cohort/all-treated designs, the OLS path can therefore fall through to all-dropped treatment columns and NaN overall output instead of a clear identification error. Concrete fix: after sample construction, explicitly verify that each estimable post/anticipation cell has at least one untreated comparison observation under the selected control-group rule, and raise ValueError otherwise; add regression tests for all-treated single-cohort and zero-comparison designs. diff_diff/wooldridge.py:L332-L361, diff_diff/wooldridge.py:L70-L140, diff_diff/linalg.py:L835-L881, tests/test_wooldridge.py:L787-L804, docs/methodology/REGISTRY.md:L1140-L1148
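The fail-fast guard proposed here could be sketched as below; the cohort encoding (cohort == 0 for never-treated, treatment starting at t == g) and the function name are illustrative, not the library's API:

```python
import numpy as np

def check_not_yet_treated_support(cohort, time):
    """Raise if some treated (g, t) cell has no untreated comparison
    observations under a not-yet-treated rule (illustrative sketch)."""
    cohort = np.asarray(cohort)
    time = np.asarray(time)
    for g in np.unique(cohort[cohort > 0]):
        for t in np.unique(time[(cohort == g) & (time >= g)]):
            # comparison pool at t: never-treated, or cohorts not yet treated
            donors = (cohort == 0) | (cohort > t)
            if not np.any(donors & (time == t)):
                raise ValueError(
                    f"No untreated comparison observations for cell (g={g}, t={t})"
                )
```

Running this before estimation turns the silent all-dropped-columns/NaN outcome described above into an explicit identification error.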

Code Quality

  • Severity P1. Impact: the prior rank-deficiency fix is only partial for nonlinear ETWFE. solve_logit() and solve_poisson() can drop collinear columns, but _fit_logit() / _fit_poisson() still pass the original singular design into compute_robust_vcov(). Because compute_robust_vcov() explicitly solves the full bread matrix, any rank-deficient logit/Poisson design will still error, so rank_deficient_action="warn"/"silent" is not honored on the inference path. Concrete fix: carry kept_cols out of the nonlinear solvers, compute the QMLE sandwich on the reduced design, expand the vcov back with NaNs, and form cell/aggregate gradients on that reduced parameter space; add logit/Poisson regression tests with dropped treatment cells. diff_diff/wooldridge.py:L621-L644, diff_diff/wooldridge.py:L767-L783, diff_diff/linalg.py:L1035-L1115, diff_diff/linalg.py:L1240-L1420, diff_diff/linalg.py:L2368-L2465
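The reduced-design vcov handling described above amounts to something like this sketch (function names and the kept-columns convention are illustrative):

```python
import numpy as np

def expand_vcov(vcov_reduced, kept, k_full):
    """Expand a vcov computed on kept columns back to full size,
    with NaN rows/columns for dropped parameters (illustrative sketch)."""
    V = np.full((k_full, k_full), np.nan)
    V[np.ix_(kept, kept)] = vcov_reduced
    return V

def delta_method_se(grad_full, vcov_reduced, kept):
    """Delta-method SE computed entirely in the reduced parameter space,
    so a dropped nuisance column does not poison an estimable quantity."""
    g = np.asarray(grad_full)[kept]   # restrict gradient to kept parameters
    return float(np.sqrt(g @ vcov_reduced @ g))
```

The point is that `grad @ vcov_full @ grad` on the NaN-expanded matrix propagates NaN, while the reduced-space product stays finite whenever the target quantity only depends on kept parameters.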

Performance

Maintainability

  • No standalone findings beyond the items above.

Tech Debt

  • Severity P3. Impact: the remaining QMLE small-sample-factor deviation, cell-count aggregation weights, missing canonical-link warning, and missing Stata golden tests are already documented/tracked, so they should not block this PR. Concrete fix: none required for approval; keep them tracked as written. docs/methodology/REGISTRY.md:L1114-L1128, TODO.md:L75-L78

Security

  • No findings.

Documentation/Tests

Execution note: I could not run pytest in this sandbox because the Python environment here does not have numpy / pandas installed.

Path to Approval

  1. Add a fail-fast identification guard for control_group="not_yet_treated" when no untreated comparison cells exist, and cover it with tests for all-treated single-cohort / zero-comparison designs.
  2. Rework nonlinear rank-deficiency handling so the logit/Poisson vcov and delta-method gradient path uses the reduced design rather than the original singular matrix, and add regression tests for dropped treatment cells in both nonlinear methods.

…ication checks, docs

P1: Compute nonlinear QMLE vcov on reduced design when columns are
dropped. Previously vcov was computed on the full singular matrix,
causing errors. Now logit and Poisson paths extract kept columns,
compute vcov on the reduced design, and expand back with NaN.

P1: Add not_yet_treated identification check — raises ValueError when
no untreated comparison observations exist (all units treated at all
observed times).

P2: Fix tutorial notebook — call aggregate('event') before using
event_study_effects in comparison plot cell.

P2: Replace last "exactly" claim in API docs with "follows the jwdid
specification" + registry deviation reference.

Fix flaky TestAllEventuallyTreated assertion — relax ATT > 0 to
isfinite (simulated data with rank deficiency can produce negative ATT
for individual cells despite positive true effect).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 168c2d35f93dc0252c31679259d901bb8dcda27c


Overall Assessment
⚠️ Needs changes

Executive Summary

  • The prior never_treated control-group defect and nonlinear-bootstrap propagation issue appear fixed.
  • The prior nonlinear rank-deficiency finding is only partially resolved: the vcov moved to the reduced design, but the delta-method gradients did not, so dropped nuisance columns can still poison nonlinear ATT SEs.
  • Severity P1 [Newly identified]: the OLS multiplier bootstrap ignores cluster= and always perturbs at the unit level, so bootstrap inference is wrong whenever clustering is set above the unit.
  • I do not keep the earlier not_yet_treated zero-support OLS concern open; the PR now explicitly drops unsupported cells and tests that behavior.
  • Public docs still overstate Stata parity and misdescribe the OLS fixed-effects structure; this is non-blocking.
  • The new benchmark still underreports total_seconds because aggregation happens outside the timer.

Methodology

  • Severity P1. Impact: the previous nonlinear rank-deficiency issue is only partially fixed. solve_logit() / solve_poisson() still expand dropped coefficients back to NaN (diff_diff/linalg.py:L1402-L1418, diff_diff/linalg.py:L2462-L2466), and _fit_logit() / _fit_poisson() now compute the sandwich on the reduced design but still build ATT gradients in the full parameter space (diff_diff/wooldridge.py:L638-L699, diff_diff/wooldridge.py:L795-L856). If the dropped column is a nuisance dummy/covariate rather than the target interaction, grad @ vcov_full @ grad propagates NaN into otherwise estimable ATT and aggregate SEs, which does not satisfy the Registry’s delta-method SE requirement for nonlinear ATT estimation (docs/methodology/REGISTRY.md:L1114-L1118). Concrete fix: keep kept_beta through the delta-method path, compute grad_reduced against vcov_reduced, and only expand for presentation; add logit/Poisson regressions with duplicated nuisance columns.
  • Severity P1 [Newly identified]. Impact: the OLS bootstrap path uses the requested cluster variable for analytic vcov (diff_diff/wooldridge.py:L472-L483) but ignores it for multiplier bootstrap, where weights are always drawn over unit (diff_diff/wooldridge.py:L546-L575). When cluster != unit, the reported bootstrap overall SE, CI, and p-value are computed under the wrong cluster structure. Concrete fix: generate one multiplier per unique cluster_ids, expand it back to observations, and add a regression test with a coarser cluster column plus n_bootstrap > 0.
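Drawing one multiplier per cluster and expanding it back to observations, as the concrete fix suggests, can be sketched with Mammen two-point weights (function name illustrative; the library may use a different weight menu):

```python
import numpy as np

def cluster_multipliers(cluster_ids, rng):
    """One Mammen two-point multiplier per cluster, expanded to observations
    (illustrative sketch of the fix described above)."""
    uniq, inv = np.unique(cluster_ids, return_inverse=True)
    # Mammen distribution: v = -(sqrt(5)-1)/2 w.p. (sqrt(5)+1)/(2*sqrt(5)),
    # v = (sqrt(5)+1)/2 otherwise; mean 0, variance 1, third moment 1.
    a = (np.sqrt(5) - 1) / 2
    b = (np.sqrt(5) + 1) / 2
    p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    draws = np.where(rng.random(uniq.size) < p, -a, b)
    return draws[inv]   # every observation in a cluster shares its draw
```

Because all observations in a cluster share one draw, the bootstrap respects the requested cluster structure instead of implicitly treating every unit as its own cluster.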

Code Quality

  • No standalone findings beyond the inference bugs above.

Performance

  • Severity P3. Impact: benchmark_wooldridge.py records total_seconds after results.aggregate("event"), but outside the timer, so estimation time and total time are identical and aggregation cost is omitted (benchmarks/python/benchmark_wooldridge.py:L71-L84). Concrete fix: include aggregation inside the timed block, or time estimation and aggregation separately.

Maintainability

  • No standalone findings beyond the methodology items.

Tech Debt

  • Severity P3. Impact: the remaining QMLE small-sample-factor deviation, cell-count aggregation weights, canonical-link warning gap, and missing Stata goldens are already documented/tracked, so I did not count them against approval (TODO.md:L79-L82, docs/methodology/REGISTRY.md:L1119-L1128). Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Make the OLS multiplier bootstrap generate weights at the effective analytic cluster level (cluster if set, else unit) and add regression coverage for cluster != unit.
  2. Move nonlinear delta-method SE calculations to the reduced parameter space used by the solver/vcov, and add rank-deficient logit/Poisson tests with duplicated nuisance columns.

Execution note: I did not run pytest in this sandbox because the available Python environment here does not have numpy / pandas installed.

…laims

P1-A: Compute nonlinear delta-method gradients in reduced parameter space
when rank-deficient. Previously gradients were in full space but vcov had
NaN for dropped columns, causing NaN propagation into estimable ATT SEs.
Now grad_reduced @ vcov_reduced @ grad_reduced for both logit and Poisson.

P1-B: OLS bootstrap now draws multiplier weights at the analytic cluster
level (cluster_ids), not always at the unit level. Previously ignored the
cluster parameter, producing wrong bootstrap SEs for coarser clustering.

P2: Fix remaining "equivalent to jwdid" claim in API docs and "additive
cohort + time FEs" in tutorial intro/summary. Now accurately describes
unit + time FE via within-transformation with documented deviations.

Also: update gradient tests to check SE finiteness directly (internal
_gradient key removed from public effects dict).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 2f9667ccc61b604348057df8bfa613f9a90d52f2


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The two prior implementation defects appear fixed in code: the OLS multiplier bootstrap now draws weights at the effective analytic cluster in diff_diff/wooldridge.py:546, and the nonlinear delta-method now projects gradients onto the kept-parameter space before SE propagation in diff_diff/wooldridge.py:638 and diff_diff/wooldridge.py:802.
  • Cross-checking against Wooldridge’s ETWFE/nonlinear DiD papers and the jwdid write-up, I did not find an undocumented methodology mismatch in the changed estimator logic; the remaining QMLE small-sample-factor and cell-weight differences are explicitly documented in the Registry/TODO and are non-blocking.
  • The PR still does not add regression coverage for either previously blocking edge case. Bootstrap tests only exercise the default unit-cluster path in tests/test_wooldridge.py:455, and nonlinear tests only cover full-rank happy paths in tests/test_wooldridge.py:537 and tests/test_wooldridge.py:554.
  • Under this repo’s review checklist, those untested new parameter/inference interactions remain P1.
  • Non-blockers remain in the benchmark timer and a few lingering “matches Stata” doc phrasings.

Methodology

  • Severity P3 informational. Impact: The estimator now matches the documented ETWFE structure I would expect from the sources: saturated cohort×time OLS with unit/time FE, ASF-based nonlinear ATT for logit/Poisson, never_treated placebo leads, and post-treatment-only simple/group aggregation. The QMLE aweight small-sample factor and cell-count aggregation weights are explicitly documented deviations in docs/methodology/REGISTRY.md:1116, docs/methodology/REGISTRY.md:1122, and TODO.md:79. Concrete fix: none for approval.

Code Quality

No findings beyond the test/doc items below.

Performance

  • Severity P3. Impact: total_seconds still excludes results.aggregate("event"), because timing stops at the end of the with Timer() block and the script then re-reads the same elapsed value; the benchmark therefore reports estimation time twice in benchmarks/python/benchmark_wooldridge.py:71 and benchmarks/python/benchmark_wooldridge.py:84. Concrete fix: either time estimation and aggregation separately or move aggregate("event") inside the timed block.

Maintainability

No findings beyond the missing regression coverage.

Tech Debt

  • Severity P3 informational. Impact: the remaining known limitations are properly tracked: QMLE small-sample-factor parity, cell-count aggregation weights, missing canonical-link warning, and missing Stata goldens in TODO.md:79, TODO.md:80, TODO.md:81, and TODO.md:82. Concrete fix: none for this PR.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Add a cluster != unit multiplier-bootstrap regression test for WooldridgeDiD.
  2. Add rank-deficient logit and Poisson regression tests with duplicated nuisance columns to lock in the reduced-space delta-method fix.

Execution note: I could not run pytest in this sandbox because numpy and pytest are not installed, so this is a static review.

…ciency

P1 test coverage for previously blocking fixes:
- TestBootstrapClusterLevel: bootstrap with cluster != unit exercises
  the clustered multiplier weight path, asserts different SE from unit-level
- TestNonlinearRankDeficiency: logit and Poisson with duplicated nuisance
  covariate (rank_deficient_action="silent") assert finite SEs for
  estimable ATT cells after dropped columns

P3: Replace remaining "matches Stata" with "follows Stata" in API docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 940739d134a554e8fbee1c0acc8053b78118f9c5


Overall Assessment

⚠️ Needs changes

Executive Summary

Methodology

The linear ETWFE structure and the pooled nonlinear logit/Poisson framing are consistent with the cited Wooldridge sources.

  • Severity P1. Impact: anticipation is a public estimator parameter, but the changed code only uses it when building treatment cells (diff_diff/wooldridge.py#L136). The not_yet_treated availability check remains hard-coded to cohort > time (diff_diff/wooldridge.py#L350), and control-unit metadata also ignores the anticipation window (diff_diff/wooldridge.py#L441). Because the Registry defines anticipation-adjusted treatment cells but does not document any “cell-only” restriction for this parameter (docs/methodology/REGISTRY.md#L1147), this is an undocumented partial propagation / missing assumption-check gap. Concrete fix: make the not_yet_treated guard and control-count logic anticipation-aware, carry anticipation through the results object if downstream summaries depend on it, and add a regression where all remaining pseudo-controls lie inside the anticipation window and fit() must fail clearly.
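The anticipation-aware donor rule this finding asks for reduces to a one-line mask; the cohort == 0 never-treated encoding and the function name are assumptions matching the surrounding discussion, not the library's actual code:

```python
import numpy as np

def valid_donors(cohort, t, anticipation=0):
    """Anticipation-aware not-yet-treated donor mask at calendar time t
    (illustrative sketch; cohort == 0 denotes never-treated). A unit whose
    effective start g - anticipation has already arrived is no longer a
    clean control, even though its nominal start g lies in the future."""
    cohort = np.asarray(cohort)
    return (cohort == 0) | (cohort - anticipation > t)
```

With anticipation > 0 the donor pool shrinks, which is exactly why the availability check and control-unit metadata need to apply the same rule as the cell construction.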

Code Quality

No additional findings beyond the anticipation propagation issue above.

Performance

  • Severity P3. Impact: total_seconds is still just the elapsed time from the estimation-only timer; results.aggregate("event") happens after the with Timer() block, but benchmarks/python/benchmark_wooldridge.py#L84 reuses the old value. The benchmark therefore reports estimation time twice instead of estimation-plus-aggregation time. Concrete fix: move results.aggregate("event") inside the timed block, or time aggregation separately and report both numbers.
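The separate-timers option suggested above could look like this (a sketch only; the estimator calls are stubbed out, and the Timer context manager here is a stand-in for whatever the benchmark actually uses):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(store, key):
    """Record elapsed wall time under `key` when the block exits."""
    start = time.perf_counter()
    yield
    store[key] = time.perf_counter() - start

timings = {}
with timer(timings, "estimation_seconds"):
    pass  # results = model.fit(df)
with timer(timings, "aggregation_seconds"):
    pass  # results.aggregate("event")
timings["total_seconds"] = (timings["estimation_seconds"]
                            + timings["aggregation_seconds"])
```

Reporting both components avoids the current situation where total_seconds silently equals estimation-only time.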

Maintainability

No additional findings.

Tech Debt

  • Severity P3 informational. Impact: the remaining ETWFE limitations are properly tracked already: QMLE small-sample-factor parity, cell-count aggregation weights, canonical-link warning, and missing Stata goldens at TODO.md#L80, TODO.md#L81, TODO.md#L82, and TODO.md#L83. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: anticipation coverage only asserts that extra (g,t) cells exist (tests/test_wooldridge.py#L614); it does not lock down anticipation-aware identification or downstream summaries, which is why the P1 path above remains unprotected. Concrete fix: add one regression that expects a clear error when only anticipation-window pseudo-controls remain, and one regression that fixes the intended overall_att / aggregate("group") / aggregate("calendar") behavior under anticipation > 0.

Path to Approval

  1. Make the not_yet_treated control-availability check anticipation-aware in diff_diff/wooldridge.py#L348, and align control-unit metadata in diff_diff/wooldridge.py#L436 with the same rule.
  2. Add anticipation regressions covering both failure and success paths: one dataset with no valid controls after applying the anticipation window that must raise, and one dataset that locks in the intended aggregate semantics for overall_att, group, and calendar.

Execution note: this was a static review only; pytest, numpy, pandas, and scipy are not installed in this sandbox.

…adata

Make not_yet_treated identification guard anticipation-aware: check
(cohort - anticipation) > time instead of cohort > time. Also fix
_count_control_units to use anticipation-adjusted threshold.

Add regression tests:
- test_anticipation_aware_identification_rejects_pseudo_controls:
  single-cohort design where anticipation=1 consumes all controls
- test_anticipation_aggregate_semantics: verify aggregations produce
  finite results with anticipation > 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 01dc4eac906e64c52608d209c8c978253ba14d78


Overall Assessment

⛔ Blocker

The core ETWFE regression structure and the nonlinear ASF/QMLE framing are broadly consistent with Wooldridge’s published linear and nonlinear DiD treatments, and the PR’s stated QMLE small-sample and cell-weight differences are already documented locally, so the remaining blockers are implementation defects rather than source-material disputes. (academic.oup.com)

Executive Summary

Methodology

Affected methods in this diff: WooldridgeDiD ETWFE OLS/logit/Poisson estimation, ASF-based ATT, delta-method SEs, and simple/group/calendar/event aggregation.

Code Quality

No separate findings beyond the methodology defects above.

Performance

Maintainability

No separate findings.

Tech Debt

  • Severity P3 informational. Impact: the remaining ETWFE limitations are already tracked in TODO.md: QMLE small-sample-factor parity at TODO.md#L80, cell-count aggregation weights at TODO.md#L81, canonical-link warnings at TODO.md#L82, and missing Stata golden tests at TODO.md#L83. Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new anticipation regression at tests/test_wooldridge.py#L647 only checks that outputs are finite, so it would not catch the aggregation-semantics mismatch above; the nonlinear rank-deficiency coverage at tests/test_wooldridge.py#L889 and tests/test_wooldridge.py#L914 only duplicates nuisance covariates and never forces a dropped treatment-interaction column, so it misses the Poisson P0 path. Concrete fix: add one anticipation test with known expected overall_att / group / calendar values under anticipation > 0, and one Poisson test that forces an interaction-column drop and asserts the affected cell is omitted/NaN, not att=0.0.

Path to Approval

  1. Fix Poisson dropped-cell handling in diff_diff/wooldridge.py#L804 by preserving the raw coefficient vector, skipping dropped interaction cells, and adding a regression test that forces treatment-interaction rank deficiency.
  2. Resolve the anticipation aggregation contract. Either:
    implement anticipation-aware post-treatment filters in overall_att, group, and calendar and carry anticipation into WooldridgeDiDResults; or
    document the current t >= g aggregation rule with a **Note:** in docs/methodology/REGISTRY.md and add tests that explicitly assert that behavior.

Execution note: static review only; numpy and pytest are unavailable in this sandbox.

…tics

P0: Poisson path zeroed NaN coefficients BEFORE checking for dropped
interaction cells, causing dropped cells to silently become att=0 and
get averaged into overall_att. Fix: preserve raw interaction coefficients
before NaN->0 zeroing and check against the raw values in the ASF loop.

P1: Document that simple/group/calendar aggregation uses t >= g as the
post-treatment threshold regardless of anticipation setting. Anticipation
cells are estimated but treated as pre-treatment placebos in aggregation.
Add REGISTRY **Note:** and test asserting overall_att equals manually
computed post-treatment-only weighted average.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 8a70087929ea099be607364a5225900d263de88d


Overall Assessment

⚠️ Needs changes

The prior Poisson dropped-cell blocker looks addressed in diff_diff/wooldridge.py:L806, and the earlier anticipation aggregation-threshold concern is now explicitly documented in docs/methodology/REGISTRY.md:L1148. But two unmitigated P1 issues remain in the new ETWFE API/assumption handling.

Executive Summary

  • The previous Poisson rank-deficiency blocker appears fixed: the Poisson path now preserves raw interaction coefficients before NaN -> 0 cleanup and skips dropped treatment cells during ASF construction in diff_diff/wooldridge.py:L806.
  • The previous anticipation aggregation mismatch is now mitigated by documentation: the Registry explicitly says simple/group/calendar aggregation uses t >= g even when anticipation > 0 in docs/methodology/REGISTRY.md:L1148.
  • Severity P1: anticipation still is not propagated into WooldridgeDiDResults, so event-study outputs cannot distinguish true placebo leads from anticipation effects, and summary('event') labels every negative k as [pre] in diff_diff/wooldridge_results.py:L238.
  • Severity P1 [Newly identified]: fit() never validates that cohort is time-invariant within unit, even though the estimator is defined on unit-level first-treatment cohorts in diff_diff/wooldridge.py:L315.
  • Severity P3: the new benchmark still underreports total_seconds because aggregation runs after the timed block in benchmarks/python/benchmark_wooldridge.py:L71 and benchmarks/python/benchmark_wooldridge.py:L83.

Methodology

Affected methods: WooldridgeDiD OLS/logit/Poisson ETWFE estimation, ASF ATT(g,t), and WooldridgeDiDResults event/simple/group/calendar aggregation.

  • Severity P1. Impact: the Registry now defines anticipation-window cells as estimated but distinct from post-treatment ATT in docs/methodology/REGISTRY.md:L1147 and docs/methodology/REGISTRY.md:L1148, but none of the three fit paths propagate anticipation into the results object in diff_diff/wooldridge.py:L530, diff_diff/wooldridge.py:L750, and diff_diff/wooldridge.py:L919. As a result, diff_diff/wooldridge_results.py:L238 marks all negative event times as [pre], which is semantically wrong for k in [-anticipation, -1]. Concrete fix: add an anticipation field to WooldridgeDiDResults, pass it from all fit paths, and make event-study rendering/export distinguish true pre-trend placebos (k < -anticipation) from anticipation effects.
  • Severity P1 [Newly identified]. Impact: cohort is documented as a first-treatment-period variable in diff_diff/wooldridge.py:L315, but fit() immediately normalizes NaNs and proceeds to sample construction in diff_diff/wooldridge.py:L321 without checking that cohort is constant within each unit. If callers pass a row-varying treatment-timing column instead of a true G_i, the estimator silently builds an invalid cohort design and can return wrong ATT(g,t) values. Concrete fix: before _filter_sample(), enforce df.groupby(unit)[cohort].nunique() == 1 with a clear error message and example offending unit.
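The fail-fast validation suggested in this finding is essentially the groupby-nunique check, which could be factored as (sketch; column-name defaults are assumptions):

```python
import pandas as pd

def validate_cohort_invariant(df, unit="id", cohort="cohort"):
    """Raise if any unit's cohort varies across time periods, naming
    example offending units in the error message."""
    counts = df.groupby(unit)[cohort].nunique()
    bad = counts[counts > 1]
    if not bad.empty:
        raise ValueError(
            f"cohort must be constant within unit; "
            f"offending unit(s): {list(bad.index[:5])}"
        )
```

Called before _filter_sample(), this turns the silent invalid-design case into an immediate, diagnosable error.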

Code Quality

  • No separate findings beyond the methodology issues above.

Performance

Maintainability

  • No separate findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new nonlinear rank-deficiency coverage in tests/test_wooldridge.py:L899 only duplicates nuisance covariates in tests/test_wooldridge.py:L903 and tests/test_wooldridge.py:L928; it still does not force a dropped treatment-interaction column, so the exact prior Poisson failure mode is not locked by a regression test. Concrete fix: add a Poisson case that makes one (g,t) interaction column collinear and assert that the affected cell is omitted rather than recorded as att=0.

Path to Approval

  1. Propagate anticipation into WooldridgeDiDResults from all three fit paths and make summary('event')/exports distinguish anticipation effects from true pre-treatment placebo leads.
  2. Add a unit-level validation that cohort is time-invariant within unit, and fail fast on violations with a clear error.
  3. Add regression tests for both fixes: one anticipation=1 event-study output test and one non-constant-cohort validation test.

Execution note: static review only. pytest and even numpy are unavailable in this sandbox, so I could not run the new test suite.

P1: Validate cohort is time-invariant within unit before fitting.
Raises ValueError with the offending unit if cohort varies across
time periods for any unit.

P1: Propagate anticipation into WooldridgeDiDResults. Event study
summary now labels anticipation-window cells as [antic] (not [pre]).
True pre-treatment placebo leads (k < -anticipation) still labeled
[pre]. Document t >= g aggregation rule in REGISTRY **Note:**.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 7fa18fef4691f421157c1ffb2e23743b95982338


Overall Assessment

⚠️ Needs changes

The prior P1s from the last AI review appear resolved, but one untracked P1 remains in the new nonlinear ETWFE API: the logit/Poisson paths still accept outcomes outside the model’s support and can therefore return ATT/SE output for invalid specifications.

Executive Summary

Methodology

Code Quality

  • Severity P3. Impact: rank_deficient_action is validated for logit in diff_diff/linalg.py:L1209-L1215, but not for the new Poisson path. Combined with the lack of constructor-level validation in diff_diff/wooldridge.py:L238-L251, a typo like "warnn" silently behaves like "silent" on rank-deficient Poisson fits in diff_diff/linalg.py:L2403-L2417. Concrete fix: validate rank_deficient_action in WooldridgeDiD.__init__ and/or mirror the solve_logit() valid_actions check inside solve_poisson().
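Mirroring the solve_logit() guard into the Poisson path could be as simple as (sketch; the valid option set is an assumption inferred from the review, not the library's verified list):

```python
VALID_RANK_ACTIONS = {"raise", "warn", "silent"}  # assumed option set

def check_rank_deficient_action(action):
    """Reject typos like 'warnn' instead of silently treating them
    as 'silent' (mirrors the solve_logit-style validation)."""
    if action not in VALID_RANK_ACTIONS:
        raise ValueError(
            f"rank_deficient_action must be one of "
            f"{sorted(VALID_RANK_ACTIONS)}, got {action!r}"
        )
```

Calling this in both WooldridgeDiD.__init__ and solve_poisson() covers the constructor and solver entry points.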

Performance

  • Severity P3. Impact: benchmark_wooldridge.py reports total_seconds as the same timer used for fit(), even though event aggregation happens afterward in benchmarks/python/benchmark_wooldridge.py:L71-L84. The benchmark therefore understates end-to-end ETWFE runtime. Concrete fix: either move results.aggregate("event") inside the timed block or report separate estimation_seconds and aggregation_seconds.

Maintainability

No findings.

Tech Debt

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the added nonlinear rank-deficiency coverage in tests/test_wooldridge.py:L903-L947 only duplicates nuisance covariates. It still does not force a dropped (g,t) treatment-interaction column, so the exact prior Poisson failure mode fixed in diff_diff/wooldridge.py:L819-L823 and diff_diff/wooldridge.py:L868-L871 is not pinned by a regression test. Concrete fix: add a logit/Poisson case where one treatment interaction is collinear and assert that the corresponding cell is omitted from group_time_effects rather than returned with att=0.

Path to Approval

  1. Add explicit nonlinear outcome-support validation before fitting: logit should reject non-finite or out-of-range outcomes (y < 0 or y > 1), and poisson should reject non-finite or negative outcomes.
  2. Add regression tests for both invalid-input failures, plus one dropped-treatment-interaction test that forces a collinear (g,t) column and asserts the cell is omitted rather than estimated as zero.
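The outcome-support validation requested in step 1 could be sketched as a single pre-fit check (function name and placement are assumptions; the method strings follow the estimator's "logit"/"poisson" values):

```python
import numpy as np

def validate_outcome_support(y, method):
    """Reject outcomes outside the nonlinear model's support before
    fitting: logit requires y in [0, 1], Poisson requires y >= 0,
    and both require finite values."""
    y = np.asarray(y, dtype=float)
    if not np.all(np.isfinite(y)):
        raise ValueError("outcome contains non-finite values")
    if method == "logit" and (np.any(y < 0) or np.any(y > 1)):
        raise ValueError("logit outcome must lie in [0, 1]")
    if method == "poisson" and np.any(y < 0):
        raise ValueError("poisson outcome must be non-negative")
```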

Static review only. This sandbox lacks pytest and even numpy, so I could not execute the added test suite or run numeric spot checks.

…ation

P1: Validate outcome support before nonlinear fitting. Logit rejects
y outside [0, 1], Poisson rejects y < 0, both reject non-finite values.
Previously out-of-support outcomes were silently accepted.

P3: Validate rank_deficient_action in solve_poisson (matching
solve_logit/solve_ols pattern). Typos like "warnn" now raise ValueError
instead of silently behaving like "silent".

Tests: TestOutcomeValidation with logit out-of-range and Poisson
negative outcome rejection tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 56aad1f7d644457b96157d49ae4351b6389abca0


Overall Assessment

⚠️ Needs changes

The prior nonlinear-outcome-support blocker from the last AI review appears resolved, but two unmitigated P1s remain in the new ETWFE implementation: the public covariate-adjusted design does not match the documented/source ETWFE specification, and the new sklearn-style set_params() path can silently switch estimator semantics on invalid values.

Executive Summary

Methodology

  • Severity P1 [Newly identified]. Impact: Methods affected: OLS, logit, and Poisson whenever any of exovar, xtvar, or xgvar is supplied. The registry says ETWFE covariates should be part of the cohort/time-specific treatment design in docs/methodology/REGISTRY.md:L1096-L1100 and docs/methodology/REGISTRY.md:L1135-L1138, but _prepare_covariates() only produces raw covariates and cohort interactions in diff_diff/wooldridge.py:L147-L185, and fit() just concatenates those columns once to X_int in diff_diff/wooldridge.py:L385-L403. That means demean_covariates=True is only demeaning a main-effect block, not implementing the documented ATT-parameterized covariate design, so covariate-adjusted ATT(g,t) can come from a different model than the one advertised.
  • Concrete fix: either implement the full documented covariate design end-to-end, or reject exovar/xtvar/xgvar at fit time until that methodology is actually supported. At minimum, add parity tests for one covariate-adjusted OLS case and one nonlinear case against a manually constructed interacted design or a Stata/R reference.

Code Quality

  • Severity P1 [Newly identified]. Impact: set_params() is advertised as sklearn-compatible in diff_diff/wooldridge.py:L288-L294 but does not re-run the constructor checks from diff_diff/wooldridge.py:L239-L250. Because downstream dispatch uses catch-all else branches, invalid values silently change estimator behavior: method="probit" falls into _fit_poisson() in diff_diff/wooldridge.py:L405-L443, invalid control_group falls into not-yet-treated logic in diff_diff/wooldridge.py:L91-L95, and invalid bootstrap_weights falls into Mammen weights in diff_diff/wooldridge.py:L571-L585. That is silent semantic drift on a public API.
  • Concrete fix: factor the existing constructor checks into a shared _validate_params() and call it from __init__, set_params(), and the start of fit(). Add regression tests that invalid method, control_group, anticipation, and bootstrap_weights all raise.
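The shared-validation pattern described in this fix can be sketched in miniature (the class name and option sets here are assumptions standing in for WooldridgeDiD's real parameters):

```python
class ParamValidator:
    """Sketch of factoring constructor checks into _validate_params()
    so __init__, set_params(), and fit() all enforce the same rules."""
    _VALID = {
        "method": {"ols", "logit", "poisson"},
        "control_group": {"never_treated", "not_yet_treated"},
        "bootstrap_weights": {"mammen", "rademacher"},
    }

    def __init__(self, **params):
        self.__dict__.update(params)
        self._validate_params()

    def set_params(self, **params):
        self.__dict__.update(params)
        self._validate_params()   # re-run the constructor checks
        return self

    def _validate_params(self):
        for name, valid in self._VALID.items():
            value = getattr(self, name, None)
            if value is not None and value not in valid:
                raise ValueError(
                    f"{name} must be one of {sorted(valid)}, got {value!r}")
        if getattr(self, "anticipation", 0) < 0:
            raise ValueError("anticipation must be >= 0")
```

With this in place, method="probit" raises immediately from set_params() instead of falling through a catch-all else into the Poisson path.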

Performance

  • Severity P3. Impact: benchmark_wooldridge.py stores estimation_seconds after the timer closes, then calls results.aggregate("event"), but still reports total_seconds = estimation_timer.elapsed in benchmarks/python/benchmark_wooldridge.py:L71-L84. The benchmark therefore understates end-to-end runtime and duplicates the estimation timing.
  • Concrete fix: either move aggregation inside the timed block or report separate aggregation_seconds and total_seconds.

Maintainability

No findings.

Tech Debt

  • Severity P3 informational. Impact: the PR’s declared ETWFE deviations and deferred items are properly tracked, so they should not count against approval: conservative QMLE cluster adjustment, cell-level n_{g,t} aggregation weights, canonical-link warning gap, and missing Stata goldens are all documented in docs/methodology/REGISTRY.md:L1114-L1128 and TODO.md:L80-L83.
  • Concrete fix: none required for approval.

Security

No findings.

Documentation/Tests

  • Severity P3. Impact: the new nonlinear rank-deficiency tests in tests/test_wooldridge.py:L903-L951 only duplicate nuisance covariates. They still do not force a dropped (g,t) treatment-interaction column, so the exact omission logic added in diff_diff/wooldridge.py:L833-L885 is not regression-locked.
  • Concrete fix: add a logit/Poisson test where one treatment-interaction column is made collinear and assert that the affected cell is omitted from group_time_effects rather than returned with att=0.

Path to Approval

  1. Implement the documented covariate-adjusted ETWFE design for exovar/xtvar/xgvar across OLS, logit, and Poisson, or explicitly reject those arguments until the correct interacted specification is supported.
  2. Add covariate-adjusted parity tests that compare ATT(g,t) against a manually constructed interacted design or an external reference implementation, so the covariate specification is locked.
  3. Add shared parameter validation for set_params() and fit() so invalid method, control_group, anticipation, and bootstrap_weights raise instead of silently selecting Poisson/not-yet-treated/Mammen.

Static review only. The sandbox here does not have numpy, pandas, or pytest, so I could not execute the added test suite.

P1: Implement treatment × demeaned-covariate interactions (W2025 Eq. 5.3).
Covariates are now included as both main effects and moderating interactions
with each (g,t) treatment cell indicator. Demeaning uses within-cohort means.
This allows ATT to vary with covariates within cells, matching the documented
ETWFE specification.

P1: Re-run parameter validation in set_params() after setattr. Previously
invalid values for method/control_group/bootstrap_weights/anticipation
silently took effect via catch-all else branches.

Updated REGISTRY to document interacted covariate specification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: f25ea09027a7e5ffa0640ba42dee13e6d6dab70a


Overall Assessment

⛔ Blocker

Executive Summary

  • Re-review status: the prior set_params() silent-rerouting P1 appears resolved. set_params() now revalidates method, control_group, anticipation, and bootstrap_weights in diff_diff/wooldridge.py:288.
  • The previously flagged covariate-design gap is only partially addressed. The PR now adds treatment × demeaned-covariate interactions to the ETWFE design, consistent with the updated registry note in docs/methodology/REGISTRY.md:1134.
  • Severity P0: covariate-adjusted nonlinear ETWFE is still incorrect. In logit and Poisson, the ASF counterfactual turns off only the scalar cell effect δ_{g,t} even though the fitted model now also includes cell-specific treatment × covariate terms. That yields wrong ATT(g,t), wrong aggregate ATT, and wrong delta-method SEs whenever exovar, xtvar, or xgvar is supplied.
  • The PR’s declared SE and aggregation deviations are properly documented/tracked, so they remain informational rather than blockers: QMLE small-sample adjustment, n_{g,t} aggregation weights, canonical-link warning gap, and missing Stata goldens are covered in docs/methodology/REGISTRY.md:1119, docs/methodology/REGISTRY.md:1128, and TODO.md:80.
  • The benchmark still underreports end-to-end runtime because total_seconds excludes the post-fit aggregation work in benchmarks/python/benchmark_wooldridge.py:71 and benchmarks/python/benchmark_wooldridge.py:83.

Methodology

  • Severity P0. Impact: Methods affected: WooldridgeDiD(method="logit") and WooldridgeDiD(method="poisson") whenever exovar, xtvar, or xgvar is passed. fit() now builds treatment × demeaned-covariate interactions for all methods in diff_diff/wooldridge.py:412 and diff_diff/wooldridge.py:437, and the registry documents that interacted design in docs/methodology/REGISTRY.md:1138. But the nonlinear ASF implementation still constructs the counterfactual as eta_0 = eta_base - delta and computes gradients as if only δ_{g,t} were treatment-specific in diff_diff/wooldridge.py:752, diff_diff/wooldridge.py:756, diff_diff/wooldridge.py:762, diff_diff/wooldridge.py:932, diff_diff/wooldridge.py:933, and diff_diff/wooldridge.py:941. That no longer matches the registry’s nonlinear ATT definition in docs/methodology/REGISTRY.md:1104, because once the fitted treatment effect includes cell-specific covariate terms, switching treatment off must zero the full active treatment block, not just the scalar cell intercept. The current code silently leaves the treatment × covariate contribution in the “untreated” predictor, so both point estimates and SEs are wrong with no warning. Concrete fix: either reject exovar/xtvar/xgvar for nonlinear methods until the ASF math is implemented correctly, or compute eta_0 and the delta-method gradients using the full set of active (g,t) treatment columns for each observation (cell dummy plus its cell × covariate interaction columns).
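The "zero the full active treatment block" fix can be sketched generically (assumed shapes and names; `treat_cols` would index the cell dummy plus its cell-by-covariate interaction columns for the observations in that cell):

```python
import numpy as np

def asf_att_cell(beta, X_treated, treat_cols, link_inv):
    """ASF-style ATT for one (g,t) cell: the counterfactual index must
    zero the FULL treatment block (cell dummy and its covariate
    interactions), not just the scalar cell effect."""
    eta1 = X_treated @ beta            # fitted treated index
    X0 = X_treated.copy()
    X0[:, treat_cols] = 0.0            # switch off the whole block
    eta0 = X0 @ beta                   # counterfactual untreated index
    return np.mean(link_inv(eta1) - link_inv(eta0))
```

With an identity link this reduces to the mean of the treatment-block contributions, which is a useful sanity check before trusting the logit/Poisson versions.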

Code Quality

Performance

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Fix covariate-adjusted nonlinear ATT computation in _fit_logit() and _fit_poisson(): when exovar, xtvar, or xgvar is present, turning treatment off for cell (g,t) must zero the full active treatment block, not just δ_{g,t}.
  2. Update the delta-method gradients and aggregate SE calculations to treat those cell × covariate coefficients as treatment-side parameters in the counterfactual, rather than generic baseline coefficients.
  3. Add regression tests for covariate-adjusted logit and Poisson ETWFE with heterogeneous treatment effects by covariate, and assert both point estimates and SEs against a manual ASF calculation or trusted external reference.

Static review only. The sandbox here does not have numpy, pandas, or pytest, so I could not execute the added test suite.

P0: Reject exovar/xtvar/xgvar for method="logit" and method="poisson"
with NotImplementedError. The nonlinear ASF counterfactual computation
only subtracts the scalar cell effect but must zero the full treatment
block (cell indicator + cell × covariate interactions) to be correct.
OLS covariate-adjusted ETWFE remains fully supported.

P1 (prior round): set_params() now re-validates all constrained params
after setattr, preventing silent dispatch to wrong paths via typos.

Updated rank-deficiency tests to not use covariates for nonlinear paths.
Added test_nonlinear_covariates_rejected asserting NotImplementedError.
Added TODO for full nonlinear covariate ASF implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 84fc6ed67adf8e5141f85b79f3ce7907367ae7e4


Overall Assessment

⛔ Blocker

Executive Summary

  • Re-review status: the prior nonlinear-covariate blocker is resolved. logit/poisson now reject exovar/xtvar/xgvar explicitly in diff_diff/wooldridge.py:358, and the deferred implementation is tracked in TODO.md:84.
  • Severity P0 [Newly identified]: WooldridgeDiD(method="ols") silently mis-estimates unbalanced panels because the new OLS path uses one-pass two-way demeaning instead of exact FE absorption.
  • The remaining issues are non-blocking: summary() hardcodes a 95% CI label, the new benchmark understates end-to-end runtime, and the new tests/docs still do not lock down the unbalanced-panel OLS case.

Methodology

  • Severity P0 [Newly identified]. Impact: WooldridgeDiD._fit_ols() always calls diff_diff/wooldridge.py:539, and the unweighted branch of within_transform() uses the one-pass y_it - \bar y_i - \bar y_t + \bar y formula in diff_diff/utils.py:1884. The project’s own methodology notes already say one-pass demeaning is exact only for balanced panels in docs/methodology/REGISTRY.md:871. Wooldridge’s published ETWFE discussion and the jwdid notes both treat unbalanced panels separately, so this is not a harmless implementation choice. On any unbalanced estimation sample, ATT(g,t), aggregate ATT, and their SEs can be wrong with no warning. Concrete fix: use exact FE absorption for unbalanced OLS samples (iterative alternating projections or explicit unit/time dummies), or reject unbalanced panels for method="ols" until that is implemented.
  • Severity P3 informational. Impact: the prior nonlinear ASF/covariate blocker is now mitigated correctly. The code raises NotImplementedError in diff_diff/wooldridge.py:365, there is a regression test in tests/test_wooldridge.py:943, and the limitation is tracked in TODO.md:84. Concrete fix: none required for approval.
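The alternating-projection route mentioned in the P0 finding above can be sketched directly (a minimal, unweighted version; the real implementation would reuse the library's existing iterative logic and handle weights):

```python
import numpy as np

def demean_two_way(y, unit, time, tol=1e-10, max_iter=1000):
    """Alternating-projection two-way demeaning. Unlike the one-pass
    y - ybar_i - ybar_t + ybar formula, iterating the unit and time
    projections converges to the exact within-transform on unbalanced
    panels as well as balanced ones."""
    y = np.asarray(y, dtype=float).copy()
    unit = np.asarray(unit)
    time = np.asarray(time)
    for _ in range(max_iter):
        delta = 0.0
        for ids in (unit, time):
            for g in np.unique(ids):
                mask = ids == g
                m = y[mask].mean()
                y[mask] -= m
                delta = max(delta, abs(m))
        if delta < tol:                # both sets of group means ~ 0
            break
    return y
```

On a balanced panel this agrees with the one-pass formula; on an unbalanced panel only the iterative version drives both sets of group means to zero.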

Code Quality

  • Severity P3. Impact: WooldridgeDiDResults.summary() hardcodes [95% CI] in diff_diff/wooldridge_results.py:197, so fits run with alpha != 0.05 display the correct interval under the wrong confidence-level label. Concrete fix: derive the displayed confidence level from self.alpha (or accept an alpha override like other results classes).
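Deriving the label from alpha is a one-liner; a sketch of the suggested summary() fix (helper name assumed):

```python
def ci_label(alpha=0.05):
    """Build the CI column label from alpha instead of hardcoding
    '[95% CI]', so alpha=0.10 fits display '[90% CI]'."""
    level = (1 - alpha) * 100
    return f"[{level:g}% CI]"
```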

Performance

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new OLS tests cover a balanced synthetic helper in tests/test_wooldridge.py:171 and mpdta smoke tests in tests/test_wooldridge.py:281, but there is no unbalanced-panel parity test that would catch the blocker above. Concrete fix: add an unbalanced OLS ETWFE regression test against an explicit unit+time-dummy OLS reference for both control groups.
  • Severity P3 informational. Impact: the API comparison table still says ETWFE supports covariates generically in docs/api/wooldridge_etwfe.rst:159, while nonlinear covariate-adjusted ETWFE now raises NotImplementedError in diff_diff/wooldridge.py:365. Because the limitation is tracked in TODO.md:84, this is informational rather than blocking. Concrete fix: add a short caveat that covariates are currently OLS-only.

Path to Approval

  1. Replace the OLS path’s one-pass within_transform with exact FE absorption on unbalanced samples. The lowest-risk route is to reuse the iterative alternating-projection logic already present in diff_diff/utils.py:1842; the acceptable short-term fallback is to reject unbalanced panels for method="ols" with a clear error.
  2. Add an unbalanced-panel parity test for WooldridgeDiD(method="ols") against explicit unit+time-dummy OLS for both control_group="not_yet_treated" and control_group="never_treated".
  3. If the short-term fix is to reject unbalanced panels, document that restriction in the ETWFE API page and Methodology Registry.

Static review only; this sandbox does not have numpy, pandas, or pytest, so I could not execute the new test suite.

Implement the full interacted ASF counterfactual for logit and Poisson
(W2023 Eq. 3.15). When computing eta_0 (counterfactual without treatment),
now zeros the FULL treatment block per cell: both the scalar cell effect
delta_gt AND all cell × covariate interaction effects xi_gt_j * x_hat_j.

Previously the ASF only subtracted the scalar delta, leaving covariate
interaction terms in the "untreated" predictor — wrong when covariates
are present. The delta-method gradient is also updated: for each cell ×
covariate coefficient, d(ATT)/d(xi_j) = mean[G'(eta_1) * x_hat_j].

Removes NotImplementedError for nonlinear covariates. All three methods
(OLS, logit, Poisson) now support exovar/xtvar/xgvar with the interacted
specification from Wooldridge (2025) Eq. 5.3.

Tests: added test_logit_with_covariates and test_poisson_with_covariates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 93f4d21cf8b427d63df668fcc41304df2220ea14


Overall Assessment

⛔ Blocker

Executive Summary

Methodology

Code Quality

  • Severity P3. Impact: WooldridgeDiDResults.summary() always prints [95% CI], even when alpha != 0.05, so the label can disagree with the interval actually shown. See diff_diff/wooldridge_results.py:L195-L198. Concrete fix: derive the displayed confidence level from self.alpha.

Performance

Maintainability

  • No findings.

Tech Debt

  • Severity P3 informational. Impact: the PR’s intentional ETWFE deviations and deferred items are tracked correctly: the QMLE aweight small-sample adjustment and cell-count aggregation weights are documented in the Methodology Registry, and the canonical-link warning gap plus Stata golden-value work are tracked in TODO.md. See docs/methodology/REGISTRY.md:L1114-L1128 and TODO.md:L80-L83. Concrete fix: none required for approval.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new suite covers balanced helpers and mpdta OLS parity, plus default-control-group logit/poisson smoke tests, but it still does not add the unbalanced OLS parity regression or any nonlinear control_group="never_treated" regression. That leaves both methodology issues above unguarded in CI. See tests/test_wooldridge.py:L171-L185, tests/test_wooldridge.py:L281-L335, tests/test_wooldridge.py:L386-L452, and tests/test_wooldridge.py:L571-L597. Concrete fix: add an explicit unbalanced-panel OLS parity test against unit+time-dummy OLS for both control groups, and add logit/poisson never_treated tests that assert the intended full ATT grid and aggregate outputs against a reference implementation or documented normalization.

Path to Approval

  1. Replace the OLS path’s one-pass unweighted demeaning with exact FE absorption on unbalanced samples, or reject unbalanced method="ols" fits and document that restriction in the Wooldridge API + Methodology Registry as an interim fix.
  2. Reparameterize nonlinear control_group="never_treated" to an identified, deterministic normalization consistent with jwdid, or reject that control-group setting for logit and poisson until the correct parameterization is implemented.
  3. Add targeted regression tests for both issues: unbalanced-panel OLS parity for never_treated and not_yet_treated, plus nonlinear never_treated tests that verify no intended post-treatment ATT cell is lost to arbitrary rank-deficiency handling.

Static review only; the sandbox does not have numpy/pandas, so I could not execute the estimator or the new tests.

igerber and others added 2 commits April 4, 2026 17:20
… CI label

P0: OLS path now uses iterative alternating-projection within-transformation
(via uniform weights) for exact FE absorption on both balanced and unbalanced
panels. The one-pass formula was only exact for balanced panels.
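The alternating-projection idea can be sketched in a few lines, independent of the library internals; `demean_two_way` is an illustrative name, not the actual `within_transform` implementation:

```python
import numpy as np

def demean_two_way(y, unit, time, tol=1e-10, max_iter=1000):
    """Alternately sweep out unit means and time means until the
    residual stops changing (alternating projections). Exact on both
    balanced and unbalanced panels, unlike the one-pass
    y_it - y_i. - y_.t + y_.. formula. Assumes 0-based integer codes.
    Sketch only, not the library's actual within_transform."""
    r = np.asarray(y, dtype=float).copy()
    for _ in range(max_iter):
        prev = r.copy()
        for idx in (unit, time):
            # Subtract current group means for this dimension.
            means = np.bincount(idx, weights=r) / np.bincount(idx)
            r = r - means[idx]
        if np.max(np.abs(r - prev)) < tol:
            break
    return r
```

At convergence the residual is orthogonal to every unit and time dummy, which is exactly the property the one-pass formula loses on unbalanced panels.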

P1: Nonlinear methods (logit, Poisson) with control_group="never_treated" now
restrict the interaction matrix to post-treatment cells only. Including all
(g,t) cells created exact collinearity between cohort dummies and cell indicator
sums, causing QR to drop arbitrary columns with a data-dependent normalization.
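The collinearity mechanism can be reproduced on a toy design (an illustrative construction, not the estimator's actual matrix builder):

```python
import numpy as np

# One treated unit (cohort g, first treated at t=2) and one
# never-treated unit, each observed at t = 0..3.
T = 4
Dg = np.r_[np.ones(T), np.zeros(T)]   # cohort dummy
t = np.r_[np.arange(T), np.arange(T)]

full_cells = np.stack([Dg * (t == k) for k in range(T)], axis=1)
post_cells = np.stack([Dg * (t == k) for k in (2, 3)], axis=1)

X_full = np.column_stack([np.ones(2 * T), Dg, full_cells])
X_post = np.column_stack([np.ones(2 * T), Dg, post_cells])

# Full (g,t) grid: D_g equals the sum of its cell indicators,
# so the design is exactly rank-deficient.
print(np.linalg.matrix_rank(X_full), X_full.shape[1])  # 5 6
# Post-treatment cells only: the dependency is broken.
print(np.linalg.matrix_rank(X_post), X_post.shape[1])  # 4 4
```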

P3: summary() CI label now derived from self.alpha instead of hardcoded 95%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 93f4d21cf8b427d63df668fcc41304df2220ea14


Overall Assessment

⛔ Blocker

The prior blocker is still unresolved: the OLS path still uses one-pass unweighted demeaning, so unbalanced panels can return wrong ATT(g,t) and wrong SEs without an estimator-side guard. The nonlinear control_group="never_treated" path also remains underidentified and still depends on generic QR column dropping to determine which ATT cells survive.

Executive Summary

  • The prior P0 is still present: WooldridgeDiD(method="ols") calls the unweighted one-pass within_transform(), which is only exact on balanced panels.
  • The prior P1 is still present: nonlinear control_group="never_treated" builds a rank-deficient design and lets QR dropping determine which ATT cells remain in group_time_effects.
  • The PR’s intentional QMLE/aggregation deviations are properly documented in the Methodology Registry and TODO.md; those are not blockers.
  • Two earlier P3s also remain: summary() hardcodes a 95% CI label, and the benchmark’s total_seconds still excludes aggregation time.
  • The new test suite still does not add the targeted regressions needed to lock down the two unresolved methodology issues.

Methodology

Wooldridge 2025 treats ETWFE OLS as the exact unit/time-FE regression or its pooled-OLS cohort-dummy equivalent, and explicitly notes that the earlier numerical equivalences do not carry over to unbalanced panels. Wooldridge 2023’s nonlinear ATT construction is based on pooled models with cohort and time dummies, with ATT(g,t) computed by turning the cell effect on/off while other cohort/time dummies are held fixed.

  • P0 [Previously raised, unresolved] diff_diff/wooldridge.py:L528, diff_diff/utils.py:L1880, docs/methodology/REGISTRY.md:L1094, docs/methodology/REGISTRY.md:L871: Impact: _fit_ols() still absorbs FE via the unweighted within_transform() path, and that helper still uses y_it - y_i - y_t + y one-pass demeaning. The repo’s own registry already states one-pass demeaning is only exact for balanced panels, so any unbalanced ETWFE sample can silently produce wrong cell ATTs, wrong aggregates, and wrong cluster-robust SEs. Concrete fix: use exact alternating-projection / dummy-equivalent FE absorption whenever the OLS sample is unbalanced, or reject unbalanced method="ols" fits until that path exists.
  • P1 [Previously raised, unresolved] diff_diff/wooldridge.py:L113, diff_diff/wooldridge.py:L686, diff_diff/wooldridge.py:L754, diff_diff/wooldridge.py:L876, diff_diff/wooldridge.py:L943, diff_diff/linalg.py:L1268, docs/methodology/REGISTRY.md:L1141: Impact: for control_group="never_treated", the nonlinear paths build the full treated (g,t) grid and then append cohort and time dummies. That creates exact linear dependencies between each treated-cohort dummy and the sum of its cell indicators. solve_logit() / solve_poisson() resolve this only by QR-dropping columns, and the ATT loops simply skip dropped cell coefficients. The resulting cell set and aggregates therefore depend on arbitrary column dropping rather than a documented ETWFE normalization. Concrete fix: implement a fixed identified normalization that preserves all intended post-treatment ATT cells, or reject control_group="never_treated" for method in {"logit", "poisson"} until that parameterization exists.
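The P0 defect is easy to exhibit: on an unbalanced panel, one-pass demeaned values are no longer orthogonal to the unit dummies, so a regression on them is not the fixed-effects estimate. A hedged toy sketch, not the library's code:

```python
import numpy as np

# Unbalanced toy panel: units 0 and 2 have 3 periods, units 1 and 3
# have only 2 (the missing cells make the panel unbalanced).
units = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
times = np.array([0, 1, 2, 0, 2, 0, 1, 2, 1, 2])
x = np.arange(10, dtype=float)

def onepass(v, g1, g2):
    # y_it - y_i. - y_.t + y_.. : exact only on balanced panels
    m1 = np.bincount(g1, weights=v) / np.bincount(g1)
    m2 = np.bincount(g2, weights=v) / np.bincount(g2)
    return v - m1[g1] - m2[g2] + v.mean()

r = onepass(x, units, times)
# On a balanced panel these per-unit sums would all be zero; here
# they are not, so the fixed effects were not fully absorbed.
print(np.bincount(units, weights=r))
```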

Code Quality

  • P3 diff_diff/wooldridge_results.py:L197: Impact: summary() always prints [95% CI], even when alpha != 0.05, so the label can disagree with the interval shown. Concrete fix: derive the displayed confidence level from self.alpha.

Performance

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • P3 tests/test_wooldridge.py:L571, tests/test_wooldridge.py:L761, tests/test_wooldridge.py:L789: Impact: the suite still does not add the two targeted regressions needed for the blocker paths: an unbalanced-panel OLS parity test against explicit unit+time-dummy OLS, and logit/poisson control_group="never_treated" tests that assert the intended full ATT grid survives rank deficiency. The unresolved methodology defects therefore remain unguarded in CI. Concrete fix: add those exact regressions and assert both cell-level outputs and overall aggregates.

Path to Approval

  1. Replace the OLS path’s one-pass unweighted demeaning with exact FE absorption on unbalanced samples, or reject unbalanced method="ols" fits and document that restriction.
  2. Reparameterize nonlinear control_group="never_treated" to a fixed, identified normalization consistent with the documented ETWFE estimand, or reject that setting for logit and poisson.
  3. Add targeted regression tests for both issues: unbalanced-panel OLS parity against explicit unit+time-dummy OLS, and nonlinear never_treated tests that verify no intended post-treatment ATT cell is lost to rank-deficiency handling.

Static review only. The sandbox lacks numpy/pandas, so I could not execute the estimator or the new tests.

@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: c847eab91bbbb7453d8338b541e47c5d379c6ddf


Overall Assessment

⚠️ Needs changes

This re-review suggests the prior unbalanced-panel OLS blocker and the prior nonlinear control_group="never_treated" normalization blocker are addressed. The remaining unmitigated issue is a new P1: the covariate-adjusted ETWFE paths do not implement Wooldridge’s full pooled/TWFE covariate basis, so exovar/xtvar/xgvar can silently estimate a restricted model instead of the paper’s covariate-adjusted estimator.

Executive Summary

  • Prior OLS unbalanced-panel issue looks resolved: the OLS path now forces iterative alternating-projection FE absorption, and there is a targeted parity test against explicit unit+time dummy OLS at tests/test_wooldridge.py:L1070-L1100.
  • Prior nonlinear never_treated issue looks resolved: nonlinear paths now use post-treatment cells only, with the normalization documented in docs/methodology/REGISTRY.md:L1153-L1154 and regression tests at tests/test_wooldridge.py:L1120-L1208.
  • P1 [Newly identified]: covariate-adjusted WooldridgeDiD omits the source-model f_t × X block, and D_g × X only appears if users manually duplicate variables into xgvar, so the covariate paths are not source-equivalent across OLS, logit, and Poisson.
  • The QMLE small-sample adjustment, cell-count aggregation weights, canonical-link warning gap, and missing Stata golden tests are already tracked in TODO.md:L80-L83 and are not blockers.
  • total_seconds in the new benchmark still excludes aggregation time at benchmarks/python/benchmark_wooldridge.py:L71-L84.
  • The new covariate tests only assert finiteness at tests/test_wooldridge.py:L943-L979, so CI would not catch the methodology mismatch.

Methodology

  • Severity: P1 [Newly identified]. Impact: when covariates are supplied, fit() builds only cell indicators, cell-by-demeaned-covariate interactions, and raw covariates; neither the OLS path nor the nonlinear pooled-QMLE paths add the f_t × X controls required by Wooldridge’s covariate-adjusted pooled/TWFE specification, and D_g × X is only present if the caller separately re-passes the same variables through xgvar. Wooldridge 2025’s covariate-adjusted ETWFE/TWM specification and Wooldridge 2023’s pooled nonlinear specification both require the control-side f_t × X block alongside cohort-specific D_g × X and cell-specific D_g × f_t × demeaned-X, so the current covariate paths are an undocumented restricted specification rather than the cited method. Concrete fix: automatically generate the full covariate basis for all three methods (X, D_g × X, f_t × X, D_g × f_t × demeaned-X) and thread the same ordering through OLS, logit, Poisson, and ASF delta-method gradients; otherwise reject covariate arguments or document the restriction as a formal REGISTRY.md deviation. Refs: diff_diff/wooldridge.py:L424-L456, diff_diff/wooldridge.py:L543-L550, diff_diff/wooldridge.py:L702-L706, diff_diff/wooldridge.py:L890-L896, docs/methodology/REGISTRY.md:L1135-L1140.

Code Quality

  • No findings.

Performance

  • Severity: P3. Impact: the benchmark reports total_seconds from the estimation timer even though results.aggregate("event") runs after the timed block, so the JSON understates end-to-end runtime. Concrete fix: either time estimation+aggregation together or record a separate aggregation timer and compute a true total. Refs: benchmarks/python/benchmark_wooldridge.py:L71-L84.
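The suggested fix amounts to timing each phase and summing them. A self-contained sketch with stand-in callables (the real benchmark's estimation and aggregation calls are not reproduced here):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - start

# Stand-ins for the benchmark's actual estimation/aggregation calls.
estimate = lambda: sum(i * i for i in range(100_000))
aggregate = lambda r: r / 2

result, estimation_seconds = timed(estimate)
event_study, aggregation_seconds = timed(aggregate, result)

report = {
    "estimation_seconds": estimation_seconds,
    "aggregation_seconds": aggregation_seconds,
    # A true end-to-end total, instead of reusing the frozen
    # estimation timer after aggregation has already run.
    "total_seconds": estimation_seconds + aggregation_seconds,
}
```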

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 informational. Impact: the QMLE small-sample adjustment, cell-count aggregation weights, canonical-link warning gap, and missing Stata golden tests are already documented/tracked, so they are not approval blockers under the project rules. Concrete fix: none required for this PR. Refs: docs/methodology/REGISTRY.md:L1115-L1129, TODO.md:L80-L83.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: the covariate-path tests only check that ATT/SE are finite, so they would not detect the missing f_t × X / automatic D_g × X blocks or any resulting methodology drift. Concrete fix: add covariate-adjusted reference tests that compare OLS against explicit dummy regression with the full paper basis, plus analogous logit/Poisson design or saved-reference tests. Refs: tests/test_wooldridge.py:L943-L979.

Path to Approval

  1. Bring the covariate-adjusted Wooldridge design into line with the cited sources across OLS, logit, and Poisson by adding the full control-side basis: X, automatic D_g × X, f_t × X, and D_g × f_t × demeaned-X.
  2. If the restricted covariate basis is intentional, add an explicit **Note/Deviation** entry in docs/methodology/REGISTRY.md and the Wooldridge API docs stating that the current covariate paths are not Eq. 5.3 / Eq. 3.12-equivalent.
  3. Add regression tests for covariate-adjusted ETWFE that fail without the full basis, not just finite-output smoke tests.

Static review only: this environment does not have pytest installed, and numpy is unavailable, so I could not execute the estimator or the new tests.

Add automatic cohort × covariate (D_g × X) and time × covariate (f_t × X)
interaction blocks to the ETWFE design matrix, completing the W2025 Eq. 5.3
specification. Previously only cell × demeaned-X and raw X were included,
silently fitting a restricted model.

D_g × X auto-generated for exovar/xtvar (xgvar already has these in
_prepare_covariates). f_t × X generated for all covariates, drop first
time period for identification. ASF and delta-method gradient code unchanged
— new blocks are nuisance parameters not zeroed in the counterfactual.
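Assuming cohort and time are integer-coded, the interaction blocks described above could be assembled roughly like this (a sketch, not the actual _prepare_covariates code; in particular it does not special-case a never-treated cohort code):

```python
import numpy as np

def interaction_blocks(X, cohort, time):
    """Broadcast covariates against one-hot cohort/time dummies to get
    D_g x X and f_t x X; the first time period is dropped for
    identification, mirroring the commit note. Illustrative only."""
    def one_hot(idx, drop_first=False):
        levels = np.unique(idx)
        D = (idx[:, None] == levels[None, :]).astype(float)
        return D[:, 1:] if drop_first else D

    Dg = one_hot(cohort)
    Ft = one_hot(time, drop_first=True)
    # Column-wise products: every dummy column times every covariate.
    DgX = (Dg[:, :, None] * X[:, None, :]).reshape(len(X), -1)
    FtX = (Ft[:, :, None] * X[:, None, :]).reshape(len(X), -1)
    return DgX, FtX
```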

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner

igerber commented Apr 4, 2026

/ai-review

@github-actions

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 2102b3d68f1e110d3b2399d41f2b165734b911ca


Overall Assessment

Looks good

Executive Summary

Methodology

No unmitigated findings. The prior covariate-basis mismatch appears fixed: the implementation now adds D_g × X, f_t × X, cell-specific covariate interactions, and the raw covariate block in diff_diff/wooldridge.py:L424-L489; the nonlinear ASF paths remove the full treatment block when constructing the counterfactual ATT in diff_diff/wooldridge.py:L794-L867 and diff_diff/wooldridge.py:L981-L1058; and the intended methodology is documented in docs/methodology/REGISTRY.md:L1135-L1154.
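For reference, the counterfactual construction the review checked can be sketched for the logit path as follows (an illustrative helper that assumes the treatment-block column indices are known; not the diff_diff API):

```python
import numpy as np

def logit_asf_att(X_treated, beta, treat_cols):
    """ATT for one (g,t) cell via average structural functions:
    predicted probability with the treatment block as fitted, minus
    the same rows with the entire treatment block zeroed out."""
    X_cf = X_treated.copy()
    X_cf[:, treat_cols] = 0.0       # remove the full treatment block
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return float(np.mean(sigmoid(X_treated @ beta) - sigmoid(X_cf @ beta)))
```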

Code Quality

No findings.

Performance

  • Severity: P3. Impact: the benchmark understates end-to-end runtime because results.aggregate("event") runs after the timer context has closed, while total_seconds just reuses the already-frozen estimation timer. Concrete fix: either include aggregation inside a broader timer or add a separate aggregation timer and report estimation_seconds, aggregation_seconds, and a true total_seconds. Refs: benchmarks/python/benchmark_wooldridge.py:L71-L84

Maintainability

No findings.

Tech Debt

  • Severity: P3 informational. Impact: the remaining ETWFE gaps that differ from the paper or Stata are explicitly documented/tracked: QMLE small-sample adjustment, cell-count aggregation weights, canonical-link warning gap, and missing Stata golden-value tests. Under the project rules these are not blockers. Concrete fix: none required for this PR; continue tracking via TODO.md. Refs: docs/methodology/REGISTRY.md:L1115-L1129, TODO.md:L80-L83

Security

No findings.

Documentation/Tests

  • Severity: P3. Impact: the benchmark header says it “validates OLS ETWFE ATT(g,t) against Callaway-Sant'Anna,” but the script never fits CallawaySantAnna or checks any tolerance, so the claimed validation is not actually performed. Concrete fix: either add the comparison and emit/assert the max ATT(g,t) gap, or rewrite the docstring/CLI description so it describes timing/output only. Refs: benchmarks/python/benchmark_wooldridge.py:L2-L9, benchmarks/python/benchmark_wooldridge.py:L67-L145
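If the comparison route is taken, the asserted validation could be as simple as the following; `etwfe_att`/`cs_att` are stand-in result dicts with made-up numbers, not the benchmark's real variables:

```python
# Compare ATT(g,t) from the two estimators and enforce a tolerance.
etwfe_att = {(2004, 2005): 0.021, (2004, 2006): 0.035}
cs_att    = {(2004, 2005): 0.020, (2004, 2006): 0.036}

max_gap = max(abs(etwfe_att[k] - cs_att[k]) for k in etwfe_att)
assert max_gap < 0.01, f"ATT(g,t) gap {max_gap} exceeds tolerance"
```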

Static review only: pytest and numpy were unavailable in this environment, so I could not execute the new tests or run ad hoc numerical checks.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 4, 2026
@igerber igerber merged commit 54bea41 into igerber:main Apr 4, 2026
20 of 22 checks passed