[FLINK-39749][mysql-cdc] Support configurable string chunk key comparison mode to align with MySQL collation#4413
Conversation
1c15142 to
9e4bea6
Compare
…pare-mode option This commit introduces a new configuration option `scan.incremental.snapshot.string-key.compare-mode` to fix chunk splitting and binlog event routing issues when MySQL collation differs from Java's natural String ordering. Problem: - Java String.compareTo() is case-sensitive (Unicode code point order). - MySQL collations like utf8mb4_general_ci are case-insensitive. - This mismatch causes chunk boundaries computed by Java to diverge from actual MySQL row ordering, leading to premature unbounded chunks, overlapping splits, or lost binlog events. Solution: - Introduce ChunkKeyCompareMode enum: DEFAULT, CASE_INSENSITIVE, BINARY. - DEFAULT: preserves existing behavior (String.compareTo()). - CASE_INSENSITIVE: uses String.compareToIgnoreCase() for Java-side comparisons. - BINARY: injects BINARY keyword in SQL predicates and uses byte-level comparison in Java. Changes cover all three API layers: - DataStream API (MySqlSourceBuilder) - Flink SQL (MySqlTableSourceFactory) - Pipeline YAML (MySqlDataSourceFactory) Also updates documentation (EN + ZH) and adds test coverage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9e4bea6 to
9a43277
Compare
| </td> | ||
| </tr> | ||
| <tr> | ||
| <td>scan.incremental.snapshot.string-key.compare-mode</td> |
There was a problem hiding this comment.
We can use SHOW VARIABLES LIKE "collation_server" to query the character set in use from MySQL, avoiding the introduction of a new configuration parameter.
There was a problem hiding this comment.
Good point, but collation_server is just the server default — the actual table or column can override it. To auto-detect properly we'd have to query information_schema.COLUMNS per table at startup, and it gets messy with composite PKs where each column might have a different collation.
So I went with explicit config for now — default keeps backward compat and won't break anyone. We can add an auto mode later: detect each table's chunk-key collation during snapshot, store it per-table in the split state, and handle them individually
What is the purpose of the change
Fix chunk splitting and binlog event routing issues when MySQL collation differs from Java's natural String ordering.
Java `String.compareTo()` is case-sensitive (Unicode code point order), while MySQL collations like `utf8mb4_general_ci` are case-insensitive. This mismatch causes chunk boundaries computed by Java to diverge from actual MySQL row ordering, leading to premature unbounded chunks, overlapping splits, or lost binlog events.
See FLINK-39749 for details.
Brief change log
Verifying this change
This change is already covered by existing tests (`MySqlTableSourceFactoryTest`) and has been verified in a production Pipeline job (MySQL -> Paimon) with `CHAR(36)` UUID primary key and `utf8mb4_general_ci` collation.
Does this pull request potentially affect one of the following parts
Documentation