[FLINK-39622] [postgres] Fix O(N²) JDBC metadata lookups in CustomPostgresSchema d…#4403
Open
ThorneANN wants to merge 1 commit into
CustomPostgresSchema#readTableSchema invokes jdbcConnection.readSchema with
the full captured-table filter, so a single call already loads metadata for
every captured table. However, the cache-population loop iterates only the
requested subset and discards the rest. As a result, snapshot startup performs
one full pg_catalog scan per split. Because each scan covers all N captured
tables and the number of splits grows with N, total lookup work scales as
O(N²), causing severe latency on multi-tenant Postgres deployments that
capture hundreds of tables across schemas.
This change caches every table discovered by readSchema into schemasByTableId,
while the returned tableChanges still contains only the originally requested
subset. Subsequent splits are then served entirely from the cache without
another pg_catalog scan.
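The caching behavior described above can be illustrated with a minimal, self-contained sketch. It is not the actual patch: `SchemaCacheSketch`, `fullScan`, `scanCount`, and `getTableSchemas` are hypothetical names, plain strings stand in for the real table-id and schema types, and the `Supplier` stands in for the `jdbcConnection.readSchema` call with the full captured-table filter.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative sketch only; all names here are assumptions, not connector code.
class SchemaCacheSketch {
    // Stand-in for schemasByTableId: table id -> schema (a String for brevity).
    private final Map<String, String> schemasByTableId = new HashMap<>();
    // Hypothetical stand-in for the full-filter metadata scan.
    private final Supplier<Map<String, String>> fullScan;
    // Counts how many full scans were issued, to make the saving visible.
    int scanCount = 0;

    SchemaCacheSketch(Supplier<Map<String, String>> fullScan) {
        this.fullScan = fullScan;
    }

    Map<String, String> getTableSchemas(List<String> requested) {
        if (!schemasByTableId.keySet().containsAll(requested)) {
            scanCount++;
            // Cache every table the scan discovered, not only the requested ids.
            schemasByTableId.putAll(fullScan.get());
        }
        // The returned map still contains only the requested subset.
        Map<String, String> result = new HashMap<>();
        for (String id : requested) {
            String schema = schemasByTableId.get(id);
            if (schema != null) {
                result.put(id, schema);
            }
        }
        return result;
    }
}
```

In this sketch, a request for one table followed by a request for a different captured table triggers a single scan, since the first scan has already populated the cache for both.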
Also fixes a related issue where getTableSchema(List) re-fetched
already-cached tables by passing the full tableIds list to readTableSchema
instead of the unmatched subset.
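The second fix can be sketched the same way: split the requested ids into cached and unmatched, and pass only the unmatched ones to the loader. This is a hedged illustration under assumed names (`UnmatchedLookupSketch`, `preload`, `fetchRequests`); the `Function` is a hypothetical stand-in for `readTableSchema`, and strings replace the real schema types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only; all names here are assumptions, not connector code.
class UnmatchedLookupSketch {
    private final Map<String, String> schemasByTableId = new HashMap<>();
    // Hypothetical stand-in for readTableSchema(List).
    private final Function<List<String>, Map<String, String>> readTableSchema;
    // Records the id list of each loader call so the behavior can be inspected.
    final List<List<String>> fetchRequests = new ArrayList<>();

    UnmatchedLookupSketch(Function<List<String>, Map<String, String>> readTableSchema) {
        this.readTableSchema = readTableSchema;
    }

    // Seeds the cache, simulating a table loaded by an earlier split.
    void preload(String tableId, String schema) {
        schemasByTableId.put(tableId, schema);
    }

    Map<String, String> getTableSchema(List<String> tableIds) {
        Map<String, String> result = new HashMap<>();
        List<String> unmatched = new ArrayList<>();
        for (String id : tableIds) {
            String cached = schemasByTableId.get(id);
            if (cached != null) {
                result.put(id, cached);
            } else {
                unmatched.add(id);
            }
        }
        if (!unmatched.isEmpty()) {
            // Pass only the cache misses, not the full tableIds list.
            fetchRequests.add(unmatched);
            Map<String, String> fetched = readTableSchema.apply(unmatched);
            schemasByTableId.putAll(fetched);
            for (String id : unmatched) {
                String schema = fetched.get(id);
                if (schema != null) {
                    result.put(id, schema);
                }
            }
        }
        return result;
    }
}
```

With one table preloaded and two requested, the loader in this sketch receives only the missing id, and a repeat of the same request issues no loader call at all.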