security: implement SQL parameterization and table whitelisting in da… by MOHITKOURAV01 · Pull Request #282 · openml/server-api

MOHITKOURAV01 · 2026-03-20T12:10:15Z

Changes

This Pull Request addresses critical SQL Injection (SQLi) vulnerabilities identified in the datasets.py and tasks.py routers. The changes replace insecure raw string concatenation with secure SQL parameterization and implement strict table whitelisting.

Key Changes:

SQL Parameterization: Refactored list_datasets in src/routers/openml/datasets.py to use SQLAlchemy text() and bindparam. This includes correctly handling list parameters with expanding=True for IN clauses to prevent injection via tags or IDs.
Table Whitelisting: Implemented a strict whitelist for dynamic table lookups in src/routers/openml/tasks.py to prevent unauthorized table access through manipulated input.
Code Quality & Type Safety: Resolved mypy type-hint mismatches in datasets.py and suppressed false-positive security warnings (S608) after verifying structural safety with the new parameterized approach.
Static Analysis: Verified that all pre-commit hooks (including mypy and ruff) are passing.

How to Test

Security Verification: Attempt to send a malicious SQL payload (e.g., 'study_14'); DROP TABLE dataset; --) to the /datasets/list endpoint.
Expected Result: The application should now treat the input as a literal string value rather than executing it as a command, returning no results or an error safely.
Local Tests: - Run pre-commit run --all-files to verify linting and security checks.
- Run pytest to ensure that standard dataset and task lookups still function correctly.

Checklist

I have performed a self-review of my own pull request.
Tests pass locally (pre-commit passed; module loading verified in a fresh venv).
I have commented my code in hard-to-understand areas.
Changes are already covered under existing tests.
This PR and the commits have been created autonomously by a bot/agent.

Related Issues

Closes #281

coderabbitai · 2026-03-20T12:10:38Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Replaced ad-hoc SQL in the tasks router with calls to a new async helper database.tasks.get_lookup_data(...). The router now verifies referenced lookup keys exist in task_inputs and are non-empty, parses lookup IDs to integers returning HTTP 400 on failure, treats missing fetched fields as empty strings, and coerces fetched values to strings. Added ALLOWED_LOOKUP_TABLES and PK_MAPPING and implemented get_lookup_data(table, id_, expdb) to enforce an allowlist, select by the configured primary key, execute a parameterized SELECT * FROM {table} WHERE {pk} = :id_, and return a single row mapping or None; disallowed tables raise ValueError.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check	❓ Inconclusive	Per reviewer feedback, the PR includes overlapping dataset parameterization work already merged in PR `#286`; however, the lookup-table whitelist changes remain in-scope and necessary for issue `#281`.	Remove redundant dataset parameterization changes already covered by PR `#286` and keep only the lookup-table whitelisting logic for tasks.py, as suggested by the PGijsbers review comment.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the security-focused changes (SQL parameterization and table whitelisting) that form the core of this pull request.
Description check	✅ Passed	The description comprehensively explains the SQL injection vulnerabilities being addressed and documents the key changes, testing approach, and checklist items.
Linked Issues check	✅ Passed	The PR addresses issue `#281` by implementing SQL parameterization in datasets.py and table whitelisting in tasks.py, meeting the core security objectives of treating user input as data and validating dynamic table names.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

sourcery-ai

Hey - I've found 2 issues, and left some high level feedback:

The allowed_tables set in _fill_json_template is recreated on every call; consider moving it to a module-level constant or using an Enum so the whitelist is centralized and easier to maintain.
Although table names are now whitelisted in _fill_json_template, the SQL is still constructed via f-strings; consider mapping the user-facing keys to hardcoded SQL table identifiers instead of interpolating the string directly to further reduce surface for mistakes.
list_datasets is now complex enough to require multiple noqa flags (C901, PLR0913, PLR0915); consider extracting some of the filter/parameter-building logic into smaller helper functions to simplify control flow and make it easier to reason about the query.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The `allowed_tables` set in `_fill_json_template` is recreated on every call; consider moving it to a module-level constant or using an Enum so the whitelist is centralized and easier to maintain.
- Although table names are now whitelisted in `_fill_json_template`, the SQL is still constructed via f-strings; consider mapping the user-facing keys to hardcoded SQL table identifiers instead of interpolating the string directly to further reduce surface for mistakes.
- `list_datasets` is now complex enough to require multiple noqa flags (C901, PLR0913, PLR0915); consider extracting some of the filter/parameter-building logic into smaller helper functions to simplify control flow and make it easier to reason about the query.

## Individual Comments

### Comment 1
<location path="src/routers/openml/tasks.py" line_range="133" />
<code_context>
             table, _ = field.split(".")
+            # List of tables allowed for [LOOKUP:table.column] directive.
+            # This is a security measure to prevent SQL injection via table names.
+            allowed_tables = {"estimation_procedure", "evaluation_measure", "task_type", "dataset"}
+            if table not in allowed_tables:
+                msg = f"Table {table} is not allowed for lookup."
</code_context>
<issue_to_address>
**suggestion:** Hoist `allowed_tables` to a module-level constant and consider a more specific exception type.

Right now the set is rebuilt on every call; defining `ALLOWED_LOOKUP_TABLES = {...}` at module level avoids that overhead and makes the allowlist easier to manage. For the error, consider raising something the router can reliably translate to an appropriate HTTP status (e.g., a FastAPI `HTTPException` or a custom domain error) instead of a bare `ValueError` that may bubble up as a 500 if unhandled.

Suggested implementation:

```python
        (field,) = match.groups()
        if field not in fetched_data:
            table, _ = field.split(".")
            if table not in ALLOWED_LOOKUP_TABLES:
                msg = f"Table {table} is not allowed for lookup."
                raise HTTPException(status_code=400, detail=msg)

```

To fully implement the suggestion, you should also:

1. Define the module-level constant (near the top of `src/routers/openml/tasks.py`, after imports or alongside other constants):

```python
ALLOWED_LOOKUP_TABLES = {
    "estimation_procedure",
    "evaluation_measure",
    "task_type",
    "dataset",
}
```

2. Ensure `HTTPException` is imported from FastAPI. If you currently have something like:

```python
from fastapi import APIRouter
```

update it to:

```python
from fastapi import APIRouter, HTTPException
```

(or add a separate `from fastapi import HTTPException` line, depending on your style).

If you prefer a custom domain error instead of `HTTPException`, define that exception class in an appropriate module and replace the `HTTPException(...)` call with raising your domain-specific exception that the router already translates into an HTTP 4xx status.
</issue_to_address>

### Comment 2
<location path="src/routers/openml/datasets.py" line_range="156" />
<code_context>
         else ""
     )

-    def quality_clause(quality: str, range_: str | None) -> str:
+    def quality_clause(range_: str | None, param_name: str) -> str:
         if not range_:
</code_context>
<issue_to_address>
**issue (complexity):** Consider making `quality_clause` a pure function that returns its SQL and params and simplifying inline `bindparams` usage to avoid hidden coupling and extra mutable state.

You can keep the parameterization benefits while reducing indirection and hidden coupling by making `quality_clause` pure and letting the caller manage params, and by simplifying `bindparams` handling.

### 1. Make `quality_clause` explicit and side‑effect free

Instead of mutating `params` and relying on magic names like `quality_{param_name}`, have `quality_clause` return both the SQL fragment and its own parameter dict:

```python
def quality_clause(
    quality_name: str,
    range_: str | None,
    *,
    prefix: str,
) -> tuple[str, dict[str, Any]]:
    if not range_:
        return "", {}

    if not (match := re.match(integer_range_regex, range_)):
        msg = f"`range_` not a valid range: {range_}"
        raise ValueError(msg)

    start, end = match.groups()
    clause_params: dict[str, Any] = {"quality_name": quality_name}

    if end:
        clause_params[f"{prefix}_start"] = int(start)
        clause_params[f"{prefix}_end"] = int(end[2:])
        value_clause = f"`value` BETWEEN :{prefix}_start AND :{prefix}_end"
    else:
        clause_params[f"{prefix}_val"] = int(start)
        value_clause = f"`value` = :{prefix}_val"

    sql = f""" AND
        d.`did` IN (
            SELECT `data`
            FROM data_quality
            WHERE `quality`=:quality_name AND {value_clause}
        )
    """
    return sql, clause_params
```

And at the call site:

```python
number_instances_filter, p = quality_clause(
    "NumberOfInstances", number_instances, prefix="instances"
)
params.update(p)

number_classes_filter, p = quality_clause(
    "NumberOfClasses", number_classes, prefix="classes"
)
params.update(p)

number_features_filter, p = quality_clause(
    "NumberOfFeatures", number_features, prefix="features"
)
params.update(p)

number_missing_values_filter, p = quality_clause(
    "NumberOfMissingValues", number_missing_values, prefix="missing"
)
params.update(p)
```

This removes:

- `params["quality_instances"] = "NumberOfInstances"` etc.
- The implicit dependency on `params` naming conventions inside `quality_clause`.

All parameter names are now explicit and local to the call site.

### 2. Simplify `bindparams` usage

You can also avoid a separate `bind_params` list and keep everything inline:

```python
matching_filter = (
    text(
        f"""
        SELECT d.`did`, d.`name`, d.`version`, d.`format`, d.`file_id`,
               IFNULL(cs.`status`, 'in_preparation')
        FROM dataset AS d
        LEFT JOIN ({current_status}) AS cs ON d.`did` = cs.`did`
        WHERE {visible_to_user} {where_name} {where_version} {where_uploader}
              {where_data_id} {matching_tag} {number_instances_filter} {number_features_filter}
              {number_classes_filter} {number_missing_values_filter}
          AND IFNULL(cs.`status`, 'in_preparation') IN :statuses
        LIMIT :limit OFFSET :offset
        """
    )
    .bindparams(
        bindparam("statuses", expanding=True),
        bindparam("data_ids", expanding=True) if "data_ids" in params else bindparam("data_ids"),
    )
)
```

Or, if you prefer, only bind `data_ids` when needed:

```python
matching_filter = text(...)

if "data_ids" in params:
    matching_filter = matching_filter.bindparams(
        bindparam("statuses", expanding=True),
        bindparam("data_ids", expanding=True),
    )
else:
    matching_filter = matching_filter.bindparams(
        bindparam("statuses", expanding=True),
    )
```

This keeps the “what is bound” logic close to the query construction and removes the extra mutable `bind_params` list to track.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

src/routers/openml/tasks.py

src/routers/openml/datasets.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/routers/openml/datasets.py`:
- Around line 236-239: The bindparams call in the query building code currently
always adds bindparam("data_ids") even when "data_ids" is not in params, causing
SQLAlchemy ArgumentError; change the logic around the bindparams call (the
bindparams invocation shown) to construct a list of bind parameters (e.g.,
always include bindparam("statuses", expanding=True) and only append
bindparam("data_ids", expanding=True) when "data_ids" exists in params) and then
call .bindparams(*params_list) so "data_ids" is not bound on the no-filter path.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 000a2fb3-103f-40fb-98af-53276317bd1f

📥 Commits

Reviewing files that changed from the base of the PR and between 8dc0b56 and 193a0dd.

📒 Files selected for processing (2)

src/routers/openml/datasets.py
src/routers/openml/tasks.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/routers/openml/tasks.py

src/routers/openml/datasets.py

PGijsbers · 2026-03-23T12:32:52Z

I had already started work on input sanitation/structuring for dataset lists, it's now merged with #286. Those changes can be removed from this PR, but having a check on the lookup tables is still reasonable. If you could update the PR to only add that change, I would be happy to merge it.

for more information, see https://pre-commit.ci

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/routers/openml/tasks.py (1)

155-159: ⚠️ Potential issue | 🟡 Minor

Potential TypeError when lookup value is non-string.

Values from row_data.items() (line 156) can be integers, floats, or other non-string types depending on the database column. When template.replace() is called at line 159, it expects a string as the second argument.

Proposed fix

             for column, value in row_data.items():
-                fetched_data[f"{table}.{column}"] = value
+                fetched_data[f"{table}.{column}"] = str(value) if value is not None else ""
         if match.string == template:
             return fetched_data[field]
         template = template.replace(match.group(), fetched_data[field])

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/routers/openml/tasks.py` around lines 155 - 159, The replacement can
raise a TypeError if fetched_data[field] is not a string; in the template
substitution loop (where row_data items are added to fetched_data and template =
template.replace(match.group(), fetched_data[field]) is called) convert the
replacement value to a string (e.g., str(fetched_data[field]) or use an explicit
safe formatting that handles None) before calling template.replace, and
optionally guard against missing keys by using fetched_data.get(field, "") so
the code in this block (variables: fetched_data, template, match, field,
row_data) always passes a string to template.replace.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/routers/openml/tasks.py`:
- Line 20: Remove the duplicate, unused constant ALLOWED_LOOKUP_TABLES from this
module: it's defined but never referenced here and duplicates validation already
implemented in database.tasks.get_lookup_data(). Locate the symbol
ALLOWED_LOOKUP_TABLES in this file and delete the definition; ensure no other
references in this module rely on it, and leave the single source of truth in
database.tasks.get_lookup_data() intact.

---

Outside diff comments:
In `@src/routers/openml/tasks.py`:
- Around line 155-159: The replacement can raise a TypeError if
fetched_data[field] is not a string; in the template substitution loop (where
row_data items are added to fetched_data and template =
template.replace(match.group(), fetched_data[field]) is called) convert the
replacement value to a string (e.g., str(fetched_data[field]) or use an explicit
safe formatting that handles None) before calling template.replace, and
optionally guard against missing keys by using fetched_data.get(field, "") so
the code in this block (variables: fetched_data, template, match, field,
row_data) always passes a string to template.replace.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 45be442c-676a-4e51-9aa1-7f0e7b1f87a7

📥 Commits

Reviewing files that changed from the base of the PR and between 567f15c and 7d2e510.

📒 Files selected for processing (2)

src/database/tasks.py
src/routers/openml/tasks.py

src/routers/openml/tasks.py

for more information, see https://pre-commit.ci

MOHITKOURAV01 · 2026-03-23T20:02:20Z

Hi @PGijsbers Thanks for the feedback! I've updated the PR to remove the redundant dataset sanitation (already in #286) and kept only the lookup table security checks. Also refactored the code to satisfy CI complexity requirements. Ready for review!

PGijsbers · 2026-03-24T14:55:30Z

Also refactored the code to satisfy CI complexity requirements.

Please undo this. Keep changes limited only to those requested.

sourcery-ai bot reviewed Mar 20, 2026

View reviewed changes

src/routers/openml/tasks.py Outdated Show resolved Hide resolved

src/routers/openml/datasets.py Outdated Show resolved Hide resolved

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

src/routers/openml/datasets.py Outdated Show resolved Hide resolved

MOHITKOURAV01 closed this Mar 23, 2026

MOHITKOURAV01 force-pushed the feature/sql-parameterization branch from 567f15c to f6ce940 Compare March 23, 2026 19:46

Restore lookup table checks (removing redundant dataset sanitation)

67f7e03

MOHITKOURAV01 reopened this Mar 23, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

7d2e510

for more information, see https://pre-commit.ci

coderabbitai bot reviewed Mar 23, 2026

View reviewed changes

src/routers/openml/tasks.py Outdated Show resolved Hide resolved

MOHITKOURAV01 and others added 5 commits March 24, 2026 01:26

Fix type mismatch and constant consistency for CI

b66ed27

Refactor: remove redundant ALLOWED_LOOKUP_TABLES from router

72fde8a

Refactor: use safe .get() for fetched_data in template replacement

05180c9

Refactor: extract _perform_lookup to reduce complexity

868f4f4

[pre-commit.ci] auto fixes from pre-commit.com hooks

4aff467

for more information, see https://pre-commit.ci

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

security: implement SQL parameterization and table whitelisting in da…#282

security: implement SQL parameterization and table whitelisting in da…#282
MOHITKOURAV01 wants to merge 7 commits intoopenml:mainfrom
MOHITKOURAV01:feature/sql-parameterization

MOHITKOURAV01 commented Mar 20, 2026

Uh oh!

coderabbitai bot commented Mar 20, 2026 •

edited

Loading

Reviews paused

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

sourcery-ai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

PGijsbers commented Mar 23, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

MOHITKOURAV01 commented Mar 23, 2026

Uh oh!

PGijsbers commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

MOHITKOURAV01 commented Mar 20, 2026

Changes

How to Test

Checklist

Related Issues

Uh oh!

coderabbitai bot commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

PGijsbers commented Mar 23, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MOHITKOURAV01 commented Mar 23, 2026

Uh oh!

PGijsbers commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Mar 20, 2026 •

edited

Loading