Commit a875ffd
fix(utils): Conference matching and merging (#196)
* Add diagnostic script for conference data sync pipeline
This script traces data flow through the sync pipeline to identify
where matching breaks occur. Key findings from running diagnostics:
- Title normalization (tidy_df_names) works correctly for most cases
- Mapping system successfully converts PyCon DE -> PyCon Germany etc.
- Identified false positive risk: PyCon Austria vs PyCon Australia (93% match)
- EuroPython in CSV has no YAML equivalent (new conference, not a match issue)
The diagnostic script provides step-by-step analysis of:
1. Raw data loading from YAML and CSV sources
2. Column mapping transformation
3. Title normalization with mappings
4. Fuzzy matching scores and thresholds
5. Problem case identification
* Add exclusion rules to prevent Austria/Australia false match
The fuzzy matching system would incorrectly suggest matching
PyCon Austria with PyCon Australia at 93% similarity. This adds:
1. New `exclusions` section in titles.yml for known false-positive pairs:
   - PyCon Austria <-> PyCon Australia (and all their abbreviations)
2. New `load_exclusions()` function in yaml.py to parse exclusion pairs
3. Updated `fuzzy_match()` in interactive_merge.py to:
   - Load exclusions at startup
   - Check exclusions before accepting any fuzzy match
   - Log when an exclusion prevents a match
4. Updated diagnostic script to show exclusion status in output
This prevents the dangerous false-positive where two different
country conferences could be incorrectly merged.
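A minimal sketch of the check-before-accept idea described above. It uses the standard library's difflib as a stand-in for the project's fuzzy scorer, and the names `EXCLUDED_PAIRS`, `similarity`, `is_excluded`, and `best_match` are hypothetical, not the functions in interactive_merge.py:

```python
from difflib import SequenceMatcher

# Hypothetical exclusion pairs for illustration; the project keeps its real
# pairs in YAML data files.
EXCLUDED_PAIRS = {frozenset({"PyCon Austria", "PyCon Australia"})}


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stand-in for the real scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def is_excluded(a, b, excluded_pairs):
    """True if the unordered pair (a, b) is listed as a known false positive."""
    return frozenset({a, b}) in excluded_pairs


def best_match(name, candidates, excluded_pairs=EXCLUDED_PAIRS, threshold=90.0):
    """Return the best candidate at or above threshold, skipping excluded pairs."""
    best, best_score = None, threshold
    for candidate in candidates:
        if is_excluded(name, candidate, excluded_pairs):
            continue  # never accept a known false positive, whatever the score
        score = similarity(name, candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best
```

With the pair listed, `best_match("PyCon Austria", ["PyCon Australia"])` returns `None` even though the raw similarity is above the threshold. Passing an empty exclusion set lets the false positive through.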
* Harmonize permanent exclusions with session-based rejections
The system now has two complementary exclusion mechanisms:
1. Permanent exclusions (titles.yml -> exclusions:)
   - Version-controlled in the repo
   - For known false-positives like Austria/Australia
   - Always applied, never prompted
2. Session rejections (.tmp/rejections.yml)
   - User-generated during interactive sessions
   - When user says "no" to a fuzzy match, it's saved here
   - Automatically applied in future runs (no re-prompting)
Both are loaded and combined into a unified exclusion set at runtime.
The is_excluded() check now covers both types.
Also fixed: rejections now store conference names instead of row indices,
making them portable across different dataframe orderings.
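As a rough illustration of why name-keyed pairs are portable, the sketch below combines two lists of rejected pairs into one lookup set. The helper names and the second sample pair are assumptions made for this example. The real loader reads the pairs from YAML.

```python
def normalize_pair(name_a, name_b):
    """Build an order-independent, case-insensitive key for a pair of names."""
    return frozenset({name_a.strip().lower(), name_b.strip().lower()})


def build_exclusion_set(permanent, rejections):
    """Combine curated exclusions and user rejections into a single set."""
    return {normalize_pair(a, b) for a, b in [*permanent, *rejections]}


def is_excluded(name_a, name_b, exclusions):
    """True if the two names were ever marked as 'never match'."""
    return normalize_pair(name_a, name_b) in exclusions


# Illustrative data only.
permanent = [("PyCon Austria", "PyCon Australia")]
rejections = [("PyData Berlin", "PyCon Berlin")]  # hypothetical user rejection
exclusions = build_exclusion_set(permanent, rejections)
```

Because the key is built from the names themselves, the lookup gives the same answer regardless of row order in either dataframe, which a row-index key cannot guarantee.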
* Track rejections.yml in version control
Move rejections.yml from .tmp/ (ignored) to data/ (tracked).
This means:
- User rejections are now permanent and version-controlled
- Team can review and curate rejected matches
- No more temporary session-based rejections that get lost
The file uses the same format as titles.yml for consistency.
* Consolidate exclusions into rejections.yml
Remove the separate `exclusions` section from titles.yml and move
everything to rejections.yml. This simplifies the system:
- One file (rejections.yml) for all "never match" pairs
- Includes both known false-positives (Austria/Australia) and user rejections
- Remove unused load_exclusions() function
- Update diagnostic script to match
Now there's just one place to look for rejected matches.
* fix: improve conference name matching and normalization
Phase 2 improvements to the data sync pipeline:
- Fix critical bug in load_title_mappings where set() was creating character
sets instead of string sets for variations
- Add country code expansion (ISO 3166) - "PyCon PL" now normalizes to
"PyCon Poland", enabling proper matching
- Add custom conference_scorer() using multiple fuzzy matching strategies
(token_sort_ratio, token_set_ratio, ratio, partial_ratio)
- Fix path resolution in yaml.py to use module-relative paths, preventing
creation of empty files in wrong directories
- Fix Python 2/3 compatibility issue (iteritems -> items) in utils.py
- Add additional conference name mappings (PyData, EuroPython, etc.)
- Update diagnostic script to use improved functions
Results: 14/15 CSV conferences now match exactly (up from ~12), with
Austria/Australia exclusion working correctly.
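Two of the fixes above are easy to show in isolation. The sketch below reproduces the `set()` pitfall and a simplified country code expansion. The helper names and the four-entry code table are assumptions for illustration, not the project's actual implementation.

```python
# Pitfall: set() iterates over its argument, so a bare string becomes a set
# of characters rather than a set containing that one string.
buggy = set("PyCon DE")   # individual characters
fixed = {"PyCon DE"}      # one-element set holding the whole string


def as_string_set(variations):
    """Accept a single string or an iterable of strings; return a set of strings."""
    if isinstance(variations, str):
        return {variations}
    return set(variations)


# Tiny illustrative subset of ISO 3166 alpha-2 codes.
COUNTRY_CODES = {"PL": "Poland", "DE": "Germany", "AT": "Austria", "AU": "Australia"}


def expand_country_code(title):
    """Replace a trailing two-letter country code with the country name."""
    tokens = title.split()
    if tokens and tokens[-1] in COUNTRY_CODES:
        tokens[-1] = COUNTRY_CODES[tokens[-1]]
    return " ".join(tokens)
```

Here `expand_country_code("PyCon PL")` gives `"PyCon Poland"`, while a title without a trailing code, such as `"EuroPython"`, is returned unchanged.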
* feat: add input validation, merge tracking, and clear merge strategy
Phase 3 improvements to fix the merge logic:
1. Input Validation (validation.py):
   - validate_dataframe() checks required columns and data types
   - validate_merge_inputs() validates both dataframes before merge
   - ensure_conference_strings() fixes non-string conference names
   - Clear error messages when data is malformed
2. Merge Strategy:
   - YAML is now explicitly the source of truth
   - Remote data enriches YAML with new information
   - Placeholder values (TBA, TBD) replaced by actual values
   - resolve_conflict() helper with logging
3. Data Preservation:
   - MergeReport class tracks all merge operations
   - MergeRecord captures each match attempt with before/after state
   - validate_no_data_loss() detects if conferences were dropped
   - Comprehensive summary report available
4. Import Script Updates:
   - fuzzy_match() now returns 3-tuple: (merged_df, remote_df, report)
   - Backwards compatible - checks tuple length
   - Logs merge statistics after each year's processing
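The merge strategy above can be sketched for a single field as follows. This is an assumption-laden illustration: the signature of `resolve_conflict`, the `is_placeholder` helper, and the placeholder list are guesses made for this example and may differ from the code in the commit.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed placeholder spellings, compared case-insensitively.
PLACEHOLDERS = {"", "tba", "tbd", "none"}


def is_placeholder(value):
    """True for None or for strings such as 'TBA' / 'TBD' that carry no data."""
    return value is None or str(value).strip().lower() in PLACEHOLDERS


def resolve_conflict(field, yaml_value, remote_value):
    """Pick the value to keep for one field, treating YAML as source of truth."""
    if is_placeholder(yaml_value) and not is_placeholder(remote_value):
        # YAML has nothing useful, so the remote value enriches it.
        logger.info("%s: filling placeholder %r with %r", field, yaml_value, remote_value)
        return remote_value
    if not is_placeholder(remote_value) and remote_value != yaml_value:
        # A genuine disagreement: YAML wins, but the conflict is logged.
        logger.info("%s: keeping YAML %r over remote %r", field, yaml_value, remote_value)
    return yaml_value
```

For example, `resolve_conflict("place", "TBA", "Vienna, Austria")` returns the remote value, while `resolve_conflict("place", "Berlin", "Munich")` keeps `"Berlin"` and logs the disagreement.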
* test: add comprehensive tests for data sync pipeline (Phase 4)
- Add test_validation.py (39 tests) for validation module:
  - ValidationError exception class
  - MergeRecord/MergeReport dataclasses
  - DataFrame validation functions
  - Data preservation checks
- Add test_pipeline_integration.py (29 tests) for full pipeline:
  - Merge strategy configuration
  - Placeholder detection (TBA, TBD, None)
  - Conference scorer functions
  - Conflict resolution logic
  - End-to-end pipeline tests with mock data
  - Data loss prevention tests
- Update test_interactive_merge.py (10 tests):
  - Support 3-tuple return from fuzzy_match
  - Remove xfail markers for fixed bugs
  - Improve assertion clarity
All 78 tests pass with 2026 CSV data validation:
- 14/15 CSV conferences match exactly (93%)
- PyCon Austria/Australia correctly excluded
- EuroPython now correctly merged
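A data loss prevention test of the kind listed above might look roughly like this. Both `find_dropped_conferences` and the test are hypothetical stand-ins written for this example. The project's `validate_no_data_loss()` operates on dataframes and may behave differently, for instance around renamed conferences.

```python
def find_dropped_conferences(before_names, after_names):
    """Return names present before the merge but missing afterwards, sorted."""
    return sorted(set(before_names) - set(after_names))


def test_merge_keeps_every_conference():
    """Pytest-style check: nothing from the input may vanish from the output."""
    before = ["PyCon Austria", "EuroPython", "PyCon Poland"]
    after = ["EuroPython", "PyCon Poland", "PyCon Austria", "PyData Berlin"]
    # New entries are fine; dropped entries are not.
    assert find_dropped_conferences(before, after) == []
```

A set difference is enough here because only the presence of each name matters, not its position or multiplicity.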
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* style: fix linting errors from pre-commit hooks
- Fix black formatting issues (line length, spacing)
- Fix isort import ordering (force single line imports)
- Fix ruff PT001/PT023 (remove unnecessary parentheses from pytest decorators)
- Fix ruff RUF059 (prefix unused variables with underscore)
- Fix ruff F401 (remove unused ValidationError import)
- Remove unused DataFrame definitions in tests
All 78 tests pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* style: apply additional ruff auto-fixes
- Add missing trailing comma (COM812)
- Replace for loops with list.extend for performance (PERF401)
- Replace repeated append with extend (FURB113)
- Use ternary operators where appropriate (SIM108)
* style: fix remaining pre-commit linting errors
- Add noqa E402 comments for imports after sys.path.insert in diagnostic_pipeline.py
- Remove unnecessary list() call in sorted() (C414)
- Rename unused loop variable i to _i (B007)
- Add full docstring parameters and return annotations for docsig compliance
- Apply trailing comma fixes from ruff
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* chore: remove diagnostic script
Development artifact no longer needed after pipeline fixes.
* style: suppress S603 subprocess security warning in git_parser.py
The subprocess call uses a hardcoded 'git' command with controlled arguments,
not untrusted user input.
* refactor: remove backwards compatibility for fuzzy_match return value
fuzzy_match now always returns a 3-tuple (merged, remote, report), so the
conditional unpacking code is no longer needed.
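The effect on callers can be sketched with a stub. `fuzzy_match_stub` below is a hypothetical placeholder that works on plain lists and a dict. The real `fuzzy_match` returns dataframes and a MergeReport.

```python
def fuzzy_match_stub(yaml_rows, remote_rows):
    """Stand-in that always returns (merged, remote, report)."""
    merged = list(yaml_rows)
    report = {"yaml_rows": len(yaml_rows), "remote_rows": len(remote_rows)}
    return merged, list(remote_rows), report


# Before: callers had to handle both the old 2-tuple and the new 3-tuple.
result = fuzzy_match_stub(["EuroPython"], ["PyCon Poland"])
if len(result) == 3:
    merged, remote, report = result
else:
    merged, remote = result
    report = None

# After: the return shape is fixed, so plain unpacking is enough.
merged, remote, report = fuzzy_match_stub(["EuroPython"], ["PyCon Poland"])
```

Dropping the length check means a mismatch in the return shape now fails immediately with a `ValueError` at the unpacking site instead of silently producing a `None` report.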
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 0ed6caa commit a875ffd
13 files changed
Lines changed: 2137 additions & 128 deletions