
Scrubber eats too much memory#4310

Merged
stuartc merged 7 commits into main from 4307-scrubber-eats-too-much-memory on Jan 15, 2026

Conversation

Collaborator

@midigofrank midigofrank commented Jan 13, 2026

Description

This PR does 2 things:

  • Limits the number of sensitive values the credential body can have. This is configurable via MAX_CREDENTIAL_SENSITIVE_VALUES; the default is 50.
  • Removes duplicate sensitive values before scrubbing, so we avoid unnecessary iterations.
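The safe-key filtering and deduplication described above can be sketched roughly like this (the safe_keys list and variable names here are illustrative, not the actual module's):

```elixir
# Illustrative sketch: collect sensitive values from a credential body,
# drop safe keys and nil values, then de-duplicate so the scrubber does
# not iterate over repeated values.
safe_keys = ~w(user username host port)

body = %{
  "username" => "admin",
  "password" => "s3cret",
  "token" => "s3cret",
  "apiKey" => "abc123",
  "note" => nil
}

sensitive =
  body
  |> Enum.reject(fn {k, v} -> String.downcase(k) in safe_keys or is_nil(v) end)
  |> Enum.map(fn {_k, v} -> v end)
  |> Enum.uniq()

# "s3cret" appears only once even though two keys share it.
```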

Closes #4307

Validation steps

  1. Start your server with MAX_CREDENTIAL_SENSITIVE_VALUES=2 mix phx.server
  2. Try creating a raw JSON credential with more than 2 sensitive keys. See https://github.com/OpenFn/lightning/blob/main/lib/lightning/credentials/sensitive_values.ex#L11 for the list of safe keys. Safe keys should not count toward the limit.

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

Pre-submission checklist

  • I have performed a self-review of my code.
  • I have implemented and tested all related authorization policies. (e.g., :owner, :admin, :editor, :viewer)
  • I have updated the changelog.
  • I have ticked a box in "AI usage" in this PR

@midigofrank midigofrank self-assigned this Jan 13, 2026
@github-project-automation github-project-automation Bot moved this to New Issues in Core Jan 13, 2026

codecov Bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.21%. Comparing base (2fcd819) to head (83062c4).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...ing_web/live/credential_live/raw_body_component.ex 72.72% 3 Missing ⚠️
lib/lightning/credentials/credential_body.ex 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4310      +/-   ##
==========================================
- Coverage   89.29%   89.21%   -0.09%     
==========================================
  Files         425      425              
  Lines       19915    19940      +25     
==========================================
+ Hits        17783    17789       +6     
- Misses       2132     2151      +19     


@midigofrank midigofrank force-pushed the 4307-scrubber-eats-too-much-memory branch from a150e2a to 42fb158 Compare January 14, 2026 09:05
@midigofrank midigofrank marked this pull request as ready for review January 14, 2026 13:14
Contributor

@elias-ba elias-ba left a comment


Hey @midigofrank, this is a really good solution for a hard problem. I have a few observations and questions. Could you check them and let me know what you think?

Comment thread lib/lightning/config.ex Outdated

config :lightning, Lightning.Scrubber,
  max_credential_sensitive_values:
    env!("MAX_CREDENTIAL_SENSITIVE_VALUES", :integer, 50)
Contributor


Should we add this to .env.example? And maybe to DEPLOYMENT.md as well?

Collaborator Author


I didn't think it was necessary given the default is already a high value. But yeah, let me include it in the DEPLOYMENT.md docs.

Member


Agreeing with Frank here (disclaimer: I rarely look at or use .env.example). While we usually throw an exhaustive list of options into that file (essentially listing the defaults), it's such a niche option that it's probably fine not to include it.
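For context, a sketch of how an accessor might read the option above back out of app config (module and function names here are assumptions, mirroring the env!/3 declaration in the diff):

```elixir
# Sketch with assumed names: read the scrubber limit from app config,
# falling back to the same default of 50 the PR description mentions.
defmodule MyApp.Config do
  @doc "Returns the configured cap on sensitive values per credential."
  def max_credential_sensitive_values do
    :lightning
    |> Application.get_env(Lightning.Scrubber, [])
    |> Keyword.get(:max_credential_sensitive_values, 50)
  end
end
```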

{%{}, types}
|> Ecto.Changeset.cast(params, [:value])
|> maybe_add_error(valid?, error, touched)
|> validate_sensitive_values_count(body, touched)
Contributor


This parses the body string again via parse_for_validation/1, but it was already parsed in parse_body/1 above. Could we consider passing the already-parsed body_json to avoid double parsing?

Collaborator Author


No, it doesn't. We need the body as a map, and body_json is a binary. It uses the body, and it only parses the body if it is a binary.

Member


I'm with Elias here: it does get parsed twice when it's a string. We enter via validate_body, calling parse_body with a binary, and validate_sensitive_values_count, which calls parse_for_validation. Also, parse_body takes a string and parses it as a map, and then in another match it takes a non-empty map and encodes it as a string; that's hurting my brain!

Member


Thinking some more, while I'm sure it does get parsed twice when it's a string; the previous (i.e. current) implementation in this component smells a bit now around the validation logic.

We don't use CredentialBody.changeset here, I'm not sure why though. That would require a bit of work to change.
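A minimal sketch of the single-parse shape being discussed, with the body normalized to a map exactly once so downstream validations never re-parse the string. All names are illustrative, and the decoder is injected only to keep the sketch self-contained (the real code would call a JSON library directly):

```elixir
# Illustrative only, not the PR's code: accept either a JSON string or
# an already-parsed map, decode at most once, and hand the resulting
# map to every validation step.
defmodule BodySketch do
  # decode is a fun like &Jason.decode/1, injected for the sketch.
  def normalize(body, decode) when is_binary(body) do
    case decode.(body) do
      {:ok, map} when is_map(map) -> {:ok, map}
      _ -> {:error, :invalid_json}
    end
  end

  # Already a map: no parsing needed at all.
  def normalize(body, _decode) when is_map(body), do: {:ok, body}
end
```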

      {:error, "contains too many sensitive keys (250). Max allowed is 200"}
  """
  def validate_sensitive_values_count(body) do
    max_values = Lightning.Config.max_credential_sensitive_values()
Contributor


What happens to existing credentials that already exceed this limit? If a user tries to update an unrelated field on a credential with 100+ sensitive values, will the update fail? Could we consider whether a migration or audit is needed to identify any existing credentials over the limit?

Collaborator Author


Nothing happens to them if they don't touch the body. In the prod db, the highest has 19. It's only the one irregular credential that has 500, and that will be dealt with at the user level.

String.downcase(k) in @safe_keys || is_nil(v)
end)
|> Enum.map(fn {_k, v} -> v end)
|> Enum.uniq()
Contributor


Good optimization. Just noting this subtly changes the return value. Previously duplicate values were included multiple times. Should be fine but worth a sanity check that no code path relied on the duplicate count.

Collaborator Author


Yeah, we're good. The scrubber looks for the values as it scrubs; duplicate values would just mean the scrubber makes extra iterations.
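A toy illustration of the point above: if the scrubber does one replacement pass per listed value, de-duplicating drops the redundant passes without changing the output (scrub here is a stand-in, not the actual Lightning.Scrubber):

```elixir
# Each entry in the list costs a full pass over the text, so a
# duplicate value is pure wasted work: the second pass finds nothing.
values = ["s3cret", "s3cret", "abc123"]

scrub = fn text, vals ->
  Enum.reduce(vals, text, fn v, acc -> String.replace(acc, v, "***") end)
end

text = "password=s3cret key=abc123"

# Two passes instead of three, identical result.
scrub.(text, Enum.uniq(values)) == scrub.(text, values)
```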

@github-project-automation github-project-automation Bot moved this from New Issues to In review in Core Jan 14, 2026
@midigofrank midigofrank requested a review from elias-ba January 15, 2026 08:50
@stuartc stuartc merged commit b2b2596 into main Jan 15, 2026
6 of 8 checks passed
@stuartc stuartc deleted the 4307-scrubber-eats-too-much-memory branch January 15, 2026 14:43
@github-project-automation github-project-automation Bot moved this from In review to Done in Core Jan 15, 2026

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Scrubber Consumes 45+ MB Memory per Run with Large Credentials

3 participants