
Conversation

@martyngigg martyngigg (Contributor) commented Jan 13, 2026

Summary

Vastly simplifies running the test suite for the elt-common package. We use a local SQLite catalog through pyiceberg to avoid the need to run an external REST catalog.

The RestCatalog configuration is kept for running tests against the ingestion scripts.
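
For context, a minimal sketch of the local-catalog approach using pyiceberg's SQL catalog backed by SQLite; the paths and namespace name here are illustrative, not the repository's actual configuration:

```python
# Minimal sketch: a throwaway SQLite-backed Iceberg catalog via pyiceberg's SqlCatalog.
# Requires the pyiceberg "sql-sqlite" extra; all names and paths below are illustrative.
from tempfile import TemporaryDirectory

from pyiceberg.catalog import load_catalog

workdir = TemporaryDirectory()
catalog = load_catalog(
    "default",
    **{
        "type": "sql",                                    # selects pyiceberg's SqlCatalog
        "uri": f"sqlite:///{workdir.name}/catalog.db",    # local catalog database
        "warehouse": f"file://{workdir.name}/warehouse",  # local table storage
    },
)
# Recent pyiceberg versions create the catalog tables automatically on first use.
catalog.create_namespace("test_ns")                       # no external services needed
```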

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for SQL-based Iceberg catalog configuration as an alternative to REST-based catalogs.
    • Made Trino authentication credentials (username and password) optional for improved flexibility.
  • Tests

    • Updated test infrastructure to support multiple catalog types and streamlined test fixture setup.


@coderabbitai coderabbitai bot commented Jan 13, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


The pull request consolidates elt-common testing workflows, adds dual Iceberg catalog type support (REST and SQL), refactors credential handling to accommodate both modes, and restructures test fixtures and helpers to support the new SQL catalog pathway alongside existing REST-based infrastructure.

Changes

Cohort / File(s) — Summary

GitHub Actions Workflows
  • .github/actions/run-pytest-with-uv/action.yml, .github/workflows/elt-common_e2e_tests.yml, .github/workflows/elt-common_unit_tests.yml — Removed custom pytest action and existing separate unit/e2e workflows
  • .github/workflows/elt-common_tests.yml
  • .github/workflows/warehouses_e2e_tests.yml

PyIceberg Catalog Configuration
  • elt-common/src/elt_common/dlt_destinations/pyiceberg/configuration.py — Added SQL catalog credentials class, refactored REST credentials with validation, introduced dual-mode catalog type resolution
  • elt-common/src/elt_common/dlt_destinations/pyiceberg/helpers.py
  • elt-common/src/elt_common/dlt_destinations/pyiceberg/pyiceberg.py

Credential & Engine Management
  • elt-common/src/elt_common/iceberg/trino.py — Made user and password fields optional in TrinoCredentials; updated engine creation to handle None values

Test Fixture & Helper Infrastructure
  • elt-common/src/elt_common/testing/__init__.py — Updated environment prefix and added catalog_type and warehouse_name configuration fields
  • elt-common/src/elt_common/testing/dlt.py
  • elt-common/src/elt_common/testing/fixtures.py
  • elt-common/src/elt_common/testing/lakekeeper.py
  • elt-common/src/elt_common/testing/sqlcatalog.py

Test Configuration & Updates
  • elt-common/pytest.ini — Added requires_trino pytest marker for test categorisation
  • elt-common/tests/e2e_tests/elt_common/iceberg/conftest.py
  • elt-common/tests/unit_tests/dlt_destinations/pyiceberg/test_helpers.py
  • elt-common/tests/unit_tests/iceberg/maintenance/test_table_maintenance.py
  • elt-common/tests/unit_tests/iceberg/test_trino.py

Poem

🐰 A rabbit hops through workflows clean,
Where REST and SQL catalogs convene,
With credentials flexible, fixtures refined,
And test helpers perfectly aligned,
The dual pathways now intertwined! 🌟

🚥 Pre-merge checks (✅ 2 passed | ❌ 1 failed)

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 40.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately and specifically describes the main objective of the changeset: removing the external-service requirements from the elt-common e2e tests by switching to a SQLite-based catalog.




@martyngigg martyngigg marked this pull request as ready for review January 13, 2026 15:38
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @elt-common/src/elt_common/iceberg/trino.py:
- Around line 23-24: The dataclass fields user and password were made optional
(str | None) but from_env still expects the env vars and will raise KeyError;
update the dataclass to give these fields explicit defaults (e.g., user: str |
None = None and password: str | None = None) or modify the from_env factory in
TrinoConnection (or similar) to use os.getenv or dict.get with default=None for
the corresponding environment variables so absent vars do not raise.
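
A minimal sketch of one way to apply the trino.py fix above, assuming a simple dataclass and TRINO_*-style environment-variable names (both are illustrative, not the repository's exact code):

```python
import os
from dataclasses import dataclass


@dataclass
class TrinoCredentials:
    host: str
    user: str | None = None       # explicit default: absent env var no longer raises
    password: str | None = None

    @classmethod
    def from_env(cls) -> "TrinoCredentials":
        return cls(
            host=os.environ["TRINO_HOST"],     # still required
            user=os.getenv("TRINO_USER"),      # None when unset, instead of KeyError
            password=os.getenv("TRINO_PASSWORD"),
        )
```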

In @elt-common/src/elt_common/testing/lakekeeper.py:
- Around line 20-22: The constructor (__init__) of the class calls
access_token(settings) which performs network I/O and lacks retry handling,
causing unhandled exceptions during Server instantiation; modify initialization
to avoid network calls: either add tenacity-based retry logic inside
access_token() or change the class to lazily obtain the token (store settings in
__init__ and call access_token(settings) on first use), and ensure behavior
mirrors existing retry-protected methods (purge_warehouse, delete_warehouse) so
transient network errors are retried rather than bubbling out of __init__.
- Around line 171-188: token_endpoint() and access_token() call
requests.get/post without timeouts causing potential indefinite hangs; update
the requests calls to include a timeout (use the same default as
_request_with_auth, e.g. timeout=10.0) so both token_endpoint(settings)
(requests.get) and access_token(settings) (requests.post) pass timeout=10.0,
keep response.raise_for_status()/response.json() unchanged, or alternatively
refactor to reuse the existing _request_with_auth helper that already defaults
timeout=10.0.
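
A minimal sketch of lazy, retry-protected token acquisition with an explicit request timeout for the lakekeeper.py fixes above; the settings attributes, retry parameters, and method names are assumptions, not the module's actual code:

```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential


class Server:
    def __init__(self, settings):
        self.settings = settings
        self._access_token: str | None = None   # no network I/O at construction time

    @property
    def access_token(self) -> str:
        """Fetch the token on first use, mirroring the retry-protected methods."""
        if self._access_token is None:
            self._access_token = self._fetch_access_token()
        return self._access_token

    @retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=0.5, max=10))
    def _fetch_access_token(self) -> str:
        response = requests.post(
            self.settings.token_url,              # assumed attribute
            data={
                "grant_type": "client_credentials",
                "client_id": self.settings.client_id,
                "client_secret": self.settings.client_secret,
            },
            timeout=10.0,                          # same default as _request_with_auth
        )
        response.raise_for_status()
        return response.json()["access_token"]
```
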
🧹 Nitpick comments (12)
elt-common/src/elt_common/iceberg/trino.py (1)

107-111: Fallback credentials use a hardcoded username.

The fallback BasicAuthentication("trino", "") assumes a default username of "trino" when credentials are missing. This is reasonable for local/test scenarios but worth documenting if production use is intended, as silent fallback to a generic user could mask configuration errors.
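
If the silent fallback is kept, a small warning makes the behaviour visible. This sketch assumes the trino client's BasicAuthentication; the helper name is illustrative:

```python
import warnings

from trino.auth import BasicAuthentication


def _auth_from_credentials(user: str | None, password: str | None) -> BasicAuthentication:
    """Build Trino auth, warning when falling back to the generic default user."""
    if user is None:
        warnings.warn("No Trino user configured; falling back to the default 'trino' user.")
        return BasicAuthentication("trino", "")
    return BasicAuthentication(user, password or "")
```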

elt-common/tests/unit_tests/iceberg/test_trino.py (1)

8-21: Test name is misleading and return value is not verified.

The test name suggests it verifies that only Iceberg tables are returned, but it only asserts the SQL command executed. Consider either:

  1. Renaming to test_trino_query_engine_list_tables_executes_correct_sql, or
  2. Mocking execute to return sample data and asserting the formatted output.
Option 2: Verify the return value
 def test_trino_query_engine_list_tables_returns_only_iceberg_tables(
     mocker: MockerFixture,
 ):
     mock_creds = mocker.MagicMock()
     mocker.patch.object(TrinoQueryEngine, "_create_engine", mocker.MagicMock())
     trino = TrinoQueryEngine(mock_creds)
-    trino_execute_spy = mocker.spy(trino, "execute")
+    mock_rows = [("schema1", "table1"), ("schema2", "table2")]
+    mocker.patch.object(trino, "execute", return_value=mock_rows)
 
-    trino.list_iceberg_tables()
+    result = trino.list_iceberg_tables()
 
-    assert trino_execute_spy.call_count == 1
-    expected_cmd_re = re.compile(r"^select \* from system.iceberg_tables$")
-    command_match = expected_cmd_re.match(trino_execute_spy.call_args[0][0])
-    assert command_match is not None
+    trino.execute.assert_called_once()
+    assert "system.iceberg_tables" in trino.execute.call_args[0][0]
+    assert result == ["schema1.table1", "schema2.table2"]
elt-common/tests/unit_tests/iceberg/maintenance/test_table_maintenance.py (1)

26-39: Consider simplifying the mock verification.

Using mocker.spy() on a MagicMock is redundant since MagicMock already tracks all calls. You can verify directly via trino_engine.execute.call_args without the spy wrapper.

♻️ Suggested simplification
 def test_iceberg_maintenance_commands_run_expected_trino_alter_table_command(
     mocker: MockerFixture,
     command: str,
     command_args: Dict[str, str],
 ):
     trino_engine = mocker.MagicMock()
     iceberg_maint = IcebergTableMaintenaceSql(trino_engine)
     table_id = "ns1.table1"
-    trino_execute_spy = mocker.spy(trino_engine, "execute")
     getattr(iceberg_maint, command)(table_id, **command_args)
 
-    assert trino_execute_spy.call_count == 1
+    assert trino_engine.execute.call_count == 1
     expected_cmd_re = re.compile(rf"^alter table {table_id} execute ({command})(.+)?$")
-    command_match = expected_cmd_re.match(trino_execute_spy.call_args[0][0])
+    command_match = expected_cmd_re.match(trino_engine.execute.call_args[0][0])
     assert command_match is not None
     assert command_match.group(1) == command
     if command_args:
         for key in command_args.keys():
             assert key in command_match.group(2)
elt-common/src/elt_common/dlt_destinations/pyiceberg/configuration.py (2)

53-65: Validation logic is sound; minor readability improvement possible.

The authentication property validation correctly ensures all-or-none semantics. The counting expression could be simplified for readability.

♻️ Optional simplification
-        non_null_count = sum(map(lambda x: 1 if x is not None else 0, auth_props.values()))
+        non_null_count = sum(1 for v in auth_props.values() if v is not None)

68-78: Consider adding validation for required SQL credentials.

The on_resolved method is a no-op, but uri and warehouse are essential for SQL catalog connectivity. Consider validating that these fields are non-empty to fail fast with a clear error message rather than during catalog connection.

♻️ Suggested validation
     def on_resolved(self) -> None:
-        pass
+        if not self.uri:
+            raise DestinationTerminalException("Missing required configuration value: 'uri' for SQL catalog.")
+        if not self.warehouse:
+            raise DestinationTerminalException("Missing required configuration value: 'warehouse' for SQL catalog.")
elt-common/src/elt_common/testing/sqlcatalog.py (1)

8-20: Potential resource management issue with TemporaryDirectory.

The TemporaryDirectory is stored as an instance attribute but not used as a context manager, which means its automatic cleanup won't trigger. The current design relies on external cleanup via shutil.rmtree in fixtures.py. This works but creates tight coupling and potential for resource leaks if the fixture teardown doesn't execute.

Consider either:

  1. Adding an explicit cleanup method to SqlCatalogWarehouse
  2. Documenting that callers must handle cleanup
♻️ Option 1: Add explicit cleanup method
 class SqlCatalogWarehouse:
     def __init__(self, warehouse_name: str):
         self.name = warehouse_name
         self.workdir = TemporaryDirectory()
         self.uri = f"sqlite:///{self.workdir.name}/{warehouse_name}.db"
         self.warehouse_path = f"file://{self.workdir.name}/{warehouse_name}"
 
+    def cleanup(self) -> None:
+        """Clean up the temporary directory"""
+        self.workdir.cleanup()
+
     def connect(self) -> PyIcebergCatalog:
         """Connect to the warehouse in the catalog"""
         creds = PyIcebergSqlCatalogCredentials()
         creds.uri = self.uri
         creds.warehouse = self.warehouse_path
         return load_catalog(name="default", **creds.as_dict())
elt-common/src/elt_common/testing/fixtures.py (3)

35-39: Prefer using TemporaryDirectory.cleanup() for consistency.

Using shutil.rmtree(warehouse.workdir.name) works, but since SqlCatalogWarehouse stores a TemporaryDirectory object, calling its cleanup() method would be more consistent and handles edge cases (e.g., directory already removed).

♻️ Suggested change
         def cleanup_func():
-            shutil.rmtree(warehouse.workdir.name)
+            warehouse.workdir.cleanup()

64-75: Exception handling may be too narrow.

The cleanup logic only catches RuntimeError, but other exception types could occur during warehouse purge/delete operations (e.g., requests.RequestException, tenacity.RetryError). Consider catching Exception to ensure cleanup warnings are shown for any failure type.

♻️ Suggested change
-            except RuntimeError as exc:
+            except Exception as exc:
                 warnings.warn(
                     f"Error deleting test warehouse '{str(warehouse.project_id)}'. It may need to be removed manually."
                 )
                 warnings.warn(f"Error:\n{str(exc)}")

83-89: Consider adding a type hint for clarity.

The type annotation was removed because warehouse can be either SqlCatalogWarehouse or RestCatalogWarehouse. Adding a Union type hint would improve IDE support and documentation.

♻️ Suggested type hint
+from .lakekeeper import RestCatalogWarehouse
+
 @pytest.fixture
-def destination_config(warehouse):
+def destination_config(warehouse: SqlCatalogWarehouse | RestCatalogWarehouse):
     destination_config = PyIcebergDestinationTestConfiguration(warehouse)
elt-common/src/elt_common/testing/dlt.py (2)

52-59: Inconsistent use of local variables.

Line 53 creates server and server_settings locals, but lines 54-58 continue to use self.warehouse.server directly. Use the locals consistently for readability.

♻️ Suggested fix
         else:
             server, server_settings = self.warehouse.server, self.warehouse.server.settings
-            environ["DESTINATION__PYICEBERG__CREDENTIALS__URI"] = str(
-                self.warehouse.server.catalog_endpoint()
-            )
-            environ["DESTINATION__PYICEBERG__CREDENTIALS__PROJECT_ID"] = str(
-                self.warehouse.server.settings.project_id
-            )
+            environ["DESTINATION__PYICEBERG__CREDENTIALS__URI"] = str(server.catalog_endpoint())
+            environ["DESTINATION__PYICEBERG__CREDENTIALS__PROJECT_ID"] = str(
+                server_settings.project_id
+            )

46-51: Consider setting CATALOG_TYPE explicitly for REST path.

The SQL branch sets CATALOG_TYPE="sql", but the REST branch relies on the default. For explicitness and symmetry, consider adding:

environ["DESTINATION__PYICEBERG__CATALOG_TYPE"] = "rest"

Also applies to: 52-78

elt-common/src/elt_common/testing/lakekeeper.py (1)

24-26: Naming conflict and behavioural inconsistency.

Server.token_endpoint (property, line 24) constructs the token URL directly, whilst token_endpoint() (function, line 171) fetches it via OpenID Connect discovery. Both have the same name but different behaviours:

  • Property: settings.openid_provider_uri + "/protocol/openid-connect/token" (assumes Keycloak path)
  • Function: fetches from .well-known/openid-configuration (spec-compliant discovery)

Consider renaming the module-level function (e.g., discover_token_endpoint) or updating the property to use discovery for consistency.

Also applies to: 171-174
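
A minimal sketch of the suggested rename, keeping the spec-compliant discovery behaviour; discover_token_endpoint and the settings attribute are assumptions, not existing code:

```python
import requests


def discover_token_endpoint(settings) -> str:
    """Resolve the token endpoint via OpenID Connect discovery."""
    response = requests.get(
        f"{settings.openid_provider_uri}/.well-known/openid-configuration",
        timeout=10.0,   # matches the timeout suggested for the other token calls
    )
    response.raise_for_status()
    return response.json()["token_endpoint"]
```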

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 8f6c4e5 and 97921a2.

📒 Files selected for processing (21)
  • .github/actions/run-pytest-with-uv/action.yml
  • .github/workflows/elt-common_e2e_tests.yml
  • .github/workflows/elt-common_tests.yml
  • .github/workflows/elt-common_unit_tests.yml
  • .github/workflows/warehouses_e2e_tests.yml
  • elt-common/pytest.ini
  • elt-common/src/elt_common/dlt_destinations/pyiceberg/configuration.py
  • elt-common/src/elt_common/dlt_destinations/pyiceberg/helpers.py
  • elt-common/src/elt_common/dlt_destinations/pyiceberg/pyiceberg.py
  • elt-common/src/elt_common/iceberg/trino.py
  • elt-common/src/elt_common/testing/__init__.py
  • elt-common/src/elt_common/testing/dlt.py
  • elt-common/src/elt_common/testing/fixtures.py
  • elt-common/src/elt_common/testing/lakekeeper.py
  • elt-common/src/elt_common/testing/sqlcatalog.py
  • elt-common/tests/e2e_tests/elt_common/iceberg/conftest.py
  • elt-common/tests/unit_tests/dlt_destinations/pyiceberg/test_helpers.py
  • elt-common/tests/unit_tests/iceberg/__init__.py
  • elt-common/tests/unit_tests/iceberg/maintenance/__init__.py
  • elt-common/tests/unit_tests/iceberg/maintenance/test_table_maintenance.py
  • elt-common/tests/unit_tests/iceberg/test_trino.py
💤 Files with no reviewable changes (5)
  • .github/workflows/elt-common_e2e_tests.yml
  • .github/workflows/elt-common_unit_tests.yml
  • elt-common/tests/e2e_tests/elt_common/iceberg/conftest.py
  • .github/actions/run-pytest-with-uv/action.yml
  • elt-common/src/elt_common/dlt_destinations/pyiceberg/helpers.py
🧰 Additional context used
🧬 Code graph analysis (5)
elt-common/tests/unit_tests/iceberg/test_trino.py (1)
elt-common/src/elt_common/iceberg/trino.py (2)
  • TrinoQueryEngine (45-119)
  • list_iceberg_tables (83-91)
elt-common/src/elt_common/testing/fixtures.py (3)
elt-common/src/elt_common/testing/__init__.py (2)
  • Settings (35-101)
  • storage_config (80-101)
elt-common/src/elt_common/testing/lakekeeper.py (4)
  • Server (17-94)
  • create_warehouse (44-74)
  • purge_warehouse (77-79)
  • delete_warehouse (82-87)
elt-common/src/elt_common/testing/sqlcatalog.py (1)
  • SqlCatalogWarehouse (8-20)
elt-common/src/elt_common/testing/dlt.py (2)
elt-common/src/elt_common/testing/sqlcatalog.py (1)
  • SqlCatalogWarehouse (8-20)
elt-common/src/elt_common/testing/lakekeeper.py (4)
  • RestCatalogWarehouse (98-168)
  • catalog_endpoint (28-33)
  • token_endpoint (25-26)
  • token_endpoint (171-174)
elt-common/tests/unit_tests/iceberg/maintenance/test_table_maintenance.py (1)
elt-common/src/elt_common/iceberg/maintenance/__init__.py (1)
  • IcebergTableMaintenaceSql (14-48)
elt-common/src/elt_common/dlt_destinations/pyiceberg/pyiceberg.py (2)
elt-common/src/elt_common/dlt_destinations/pyiceberg/factory.py (1)
  • pyiceberg (15-58)
elt-common/src/elt_common/dlt_destinations/pyiceberg/configuration.py (1)
  • connection_properties (104-106)
🔇 Additional comments (12)
.github/workflows/warehouses_e2e_tests.yml (1)

1-1: LGTM!

The workflow and job name changes are sensible simplifications that improve readability without affecting functionality.

Also applies to: 27-27

elt-common/pytest.ini (1)

3-4: LGTM!

The requires_trino marker is well-documented and follows pytest conventions for conditionally deselecting tests that depend on external infrastructure.
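
For reference, a minimal sketch of how such a marker is typically applied; the test body is illustrative, not one of the repository's tests:

```python
import pytest


@pytest.mark.requires_trino   # declared in pytest.ini so pytest does not warn about it
def test_list_iceberg_tables_against_live_trino():
    ...  # would exercise a real Trino instance


# Deselect these tests where no Trino is available:
#   pytest -m "not requires_trino"
```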

.github/workflows/elt-common_tests.yml (2)

44-47: LGTM!

The test execution step correctly uses uv run pytest since the environment is not activated, ensuring the virtual environment context is properly managed.


26-26: No action required. ubuntu-slim is a valid GitHub-hosted runner label designed for lightweight jobs and is appropriate for test workflows.

Likely an incorrect or invalid review comment.

elt-common/tests/unit_tests/dlt_destinations/pyiceberg/test_helpers.py (1)

14-14: LGTM!

The import change correctly reflects the API refactor where RestCatalog is now sourced directly from pyiceberg.catalog.rest rather than being re-exported through the helpers module. The mock-based tests continue to function correctly since they reference the same underlying class.

elt-common/src/elt_common/dlt_destinations/pyiceberg/configuration.py (1)

88-109: LGTM!

The resolve_type decorator pattern cleanly supports dynamic credential type resolution based on catalog_type. The delegation of on_resolved to the credentials ensures validation runs for both REST and SQL configurations.

elt-common/src/elt_common/testing/__init__.py (1)

36-44: This is a new file, not a modification—there is no breaking change.

The Settings class with env_prefix="elt_common_testing_" is being introduced for the first time. There was no previous "tests_" prefix to migrate from. The git history shows this file was created in commit 97921a2, not modified from an earlier state.

No existing code in the repository currently references these environment variables, so there are no compatibility concerns for internal usage.

Likely an incorrect or invalid review comment.

elt-common/src/elt_common/dlt_destinations/pyiceberg/pyiceberg.py (1)

27-27: LGTM – The switch from create_catalog to load_catalog correctly uses PyIceberg's recommended pattern. The credentials' as_dict() method properly includes the "type" key (either "rest" or "sql") required by load_catalog to determine the correct catalog implementation. The approach of passing name="default" alongside **config.connection_properties allows the configuration to drive catalog selection whilst maintaining the expected defaults.

elt-common/src/elt_common/testing/dlt.py (2)

17-18: LGTM!

The imports correctly bring in both warehouse types to support the new dual catalog configuration.


38-38: LGTM!

The union type annotation clearly indicates support for both catalog types.

elt-common/src/elt_common/testing/lakekeeper.py (2)

97-102: LGTM!

The rename from Warehouse to RestCatalogWarehouse clearly distinguishes it from SqlCatalogWarehouse and improves code clarity.


104-114: LGTM!

The connect() method correctly uses PyIcebergRestCatalogCredentials and the standard load_catalog API.

@martyngigg martyngigg merged commit 1e3f3c5 into main Jan 13, 2026
4 checks passed
@martyngigg martyngigg deleted the elt-common-e2e_tests-use-sqlite-catalog branch January 13, 2026 15:53