Skip to content

[FLINK-39749][mysql-cdc] Support configurable string chunk key comparison mode to align with MySQL collation#4413

Open
ziyanTOP wants to merge 1 commit into
apache:masterfrom
ziyanTOP:flink-cdc-string-key-compare-mode
Open

[FLINK-39749][mysql-cdc] Support configurable string chunk key comparison mode to align with MySQL collation#4413
ziyanTOP wants to merge 1 commit into
apache:masterfrom
ziyanTOP:flink-cdc-string-key-compare-mode

Conversation

@ziyanTOP
Copy link
Copy Markdown

What is the purpose of the change

Fix chunk splitting and binlog event routing issues when MySQL collation differs from Java's natural String ordering.

Java `String.compareTo()` is case-sensitive (Unicode code point order), while MySQL collations like `utf8mb4_general_ci` are case-insensitive. This mismatch causes chunk boundaries computed by Java to diverge from actual MySQL row ordering, leading to premature unbounded chunks, overlapping splits, or lost binlog events.

See FLINK-39749 for details.

Brief change log

  • Introduce `ChunkKeyCompareMode` enum: `DEFAULT`, `CASE_INSENSITIVE`, `BINARY`
  • `DEFAULT`: preserves existing behavior (`String.compareTo()`)
  • `CASE_INSENSITIVE`: uses `String.compareToIgnoreCase()` for Java-side comparisons
  • `BINARY`: injects `BINARY` keyword in SQL predicates and uses byte-level comparison in Java
  • Cover all three API layers: DataStream API, Flink SQL, Pipeline YAML
  • Update documentation (EN + ZH) and add test coverage

Verifying this change

This change is already covered by existing tests (`MySqlTableSourceFactoryTest`) and has been verified in a production Pipeline job (MySQL -> Paimon) with `CHAR(36)` UUID primary key and `utf8mb4_general_ci` collation.

Does this pull request potentially affect one of the following parts

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes (affects chunk key comparison in snapshot/binlog phases)
  • Anything that affects deployment or recovery: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? docs

@github-actions github-actions Bot added docs Improvements or additions to documentation mysql-cdc-connector mysql-pipeline-connector labels May 25, 2026
@ziyanTOP ziyanTOP changed the title [FLINK-39749][mysql-cdc] Add scan.incremental.snapshot.string-key.compare-mode option [FLINK-39749][mysql-cdc] Support configurable string chunk key comparison mode to align with MySQL collation May 25, 2026
@ziyanTOP ziyanTOP force-pushed the flink-cdc-string-key-compare-mode branch from 1c15142 to 9e4bea6 Compare May 25, 2026 10:25
…pare-mode option

This commit introduces a new configuration option `scan.incremental.snapshot.string-key.compare-mode`
to fix chunk splitting and binlog event routing issues when MySQL collation differs from Java's
natural String ordering.

Problem:
- Java String.compareTo() is case-sensitive (Unicode code point order).
- MySQL collations like utf8mb4_general_ci are case-insensitive.
- This mismatch causes chunk boundaries computed by Java to diverge from actual MySQL row ordering,
  leading to premature unbounded chunks, overlapping splits, or lost binlog events.

Solution:
- Introduce ChunkKeyCompareMode enum: DEFAULT, CASE_INSENSITIVE, BINARY.
- DEFAULT: preserves existing behavior (String.compareTo()).
- CASE_INSENSITIVE: uses String.compareToIgnoreCase() for Java-side comparisons.
- BINARY: injects BINARY keyword in SQL predicates and uses byte-level comparison in Java.

Changes cover all three API layers:
- DataStream API (MySqlSourceBuilder)
- Flink SQL (MySqlTableSourceFactory)
- Pipeline YAML (MySqlDataSourceFactory)

Also updates documentation (EN + ZH) and adds test coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ziyanTOP ziyanTOP force-pushed the flink-cdc-string-key-compare-mode branch from 9e4bea6 to 9a43277 Compare May 25, 2026 10:29
lvyanquan
lvyanquan previously approved these changes May 29, 2026
</td>
</tr>
<tr>
<td>scan.incremental.snapshot.string-key.compare-mode</td>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use SHOW VARIABLES LIKE "collation_server" to query the character set in use from MySQL, avoiding the introduction of a new configuration parameter.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, but collation_server is just the server default — the actual table or column can override it. To auto-detect properly we'd have to query information_schema.COLUMNS per table at startup, and it gets messy with composite PKs where each column might have a different collation.
So I went with explicit config for now — default keeps backward compat and won't break anyone. We can add an auto mode later: detect each table's chunk-key collation during snapshot, store it per-table in the split state, and handle them individually

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation mysql-cdc-connector mysql-pipeline-connector

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants