Reduce false positives, increase true positives, improve performance #51

Merged
ParzivalHack merged 2 commits into ParzivalHack:main from satoridev01:reduce-false-positives-pyspector
May 13, 2026

@satoridev01
Contributor

This is a follow-up on minimizing false positives and increasing true positives. It makes PySpector's taint analysis faster and more accurate across multiple repositories, cutting scan time and improving the signal-to-noise ratio.

Rules

142 rules were deleted, 2 were disabled, 28 were added and 41 were modified. There is a total of 127 rules.

New Rules

| Rule | What it detects | Severity | Confirmed TP |
|---|---|---|---|
| SSTI001 | `render_template_string(user_input)`, `env.from_string(tainted)` | Critical | pygoat |
| ORM001 | SQLAlchemy `text(f"SELECT...{var}")` | Critical | |
| ORM002 | Django `raw()`, `order_by(tainted)`, `extra(tainted)` (CVE-2021-35042) | Critical | django |
| DESER725 | `jsonpickle.decode()` | Critical | |
| DESER726 | `dill.loads()` | Critical | |
| DESER_JOBLIB001 | `joblib.load()`: ML model deserialization via pickle | Critical | sklearn ×11 |
| DESER_NUMPY001 | `numpy.load(allow_pickle=True)` | Critical | tensorflow ×1 |
| DESER_TORCH001 | `torch.load()` without `weights_only=True` | Critical | |
| TLS001 | `requests.get(url, verify=False)`, `ssl=False` | High | stock |
| SSH001 | Paramiko `AutoAddPolicy()`: SSH MITM | High | |
| JWT001 | `jwt.decode(options={"verify_signature": False})` | High | pygoat |
| ZIPSLIP001 | `extractall()` without path validation | High | cpython ×4, ansible ×2 |
| XXE001 | `lxml.etree.parse()` without `resolve_entities=False` | High | |
| FLASK001 | `app.run(debug=True)` | Critical | pygoat, ivpa |
| OPEN_REDIRECT001 | `redirect(tainted_url)`, `HttpResponseRedirect(tainted)` | High | |
| PLAIN_PWD001 | `Model.objects.create(password=tainted)`: plaintext DB storage | Critical | pygoat, ivpa |
| DJANGO_DEBUG001 | `DEBUG = True` in settings (Django and Flask) | Critical | pygoat ×2, flask |
| ENV_URL001 | `os.environ.get("*_URL")` as HTTP endpoint: SSRF (AST rule) | High | semgrep ×2 |
| COOKIE_FILE001 | Env var used as cookie jar file path | High | |
| ENV_GIT_URL001 | CI env var URL → git fetch: CI token exfiltration (AST rule) | High | semgrep ×1 |
| RUAMEL_UNSAFE001 | `YAML(typ="unsafe")` | Critical | |
| SQL_CONCAT001 | `"SELECT..." + user_var`: SQL via string concatenation | High | pygoat ×5, ivpa ×1 |
| HARDCODED_PWD001 | `PASSWORD = 'literal'` at module level | High | ivpa |
| SHELL_BYPASS001 | `subprocess.run(["bash", "-c", user_cmd])`: shell bypass | High | |
| PY306_CACHE | `pickle.loads()` in cache backends: cache poisoning → RCE | Critical | django ×6 |
| G101B | Uppercase secret constants (`SECRET_KEY`, `API_KEY` ≥ 16 chars) | High | pygoat ×3 |
| DESER724 | `types.FunctionType()` from deserialized bytecode: arbitrary code execution | Critical | |
| SANDBOX307 | `object.__subclasses__()` traversal: Python sandbox escape | Critical | |
| SANDBOX308 | `__init__.__globals__` access: Python sandbox escape via global namespace | Critical | |
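
To make the ZIPSLIP001 entry concrete, here is a minimal sketch of the kind of path validation whose absence the rule flags. The helper name `is_within_directory` and the example paths are hypothetical and not part of PySpector; the rule's actual matching logic lives in the TOML rule file and is not reproduced here.

```python
import os


def is_within_directory(base_dir: str, member_name: str) -> bool:
    """Return True if extracting member_name under base_dir stays inside base_dir."""
    base = os.path.realpath(base_dir)
    # Absolute member names replace base in os.path.join, so they are caught below.
    target = os.path.realpath(os.path.join(base, member_name))
    try:
        return os.path.commonpath([base, target]) == base
    except ValueError:
        # Raised for paths on different drives (Windows); treat as unsafe.
        return False


if __name__ == "__main__":
    for name in ("docs/readme.txt", "../evil.txt", "/etc/passwd"):
        print(name, is_within_directory("/tmp/out", name))
```

An extraction loop would call such a check on every archive member name before writing it. For tar archives, Python 3.12 also offers `extractall(filter="data")`, which is the "safe filter" the modified ZIPSLIP001 rule excludes.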

Modified Rules

| Rule | Change | Impact |
|---|---|---|
| ADMIN795 | `exclude_pattern`: reduced FPs on test credentials | |
| BACKUP801 | Pattern requires word char before extension (`\w\.bak`); excludes `.rst`/`.md` | Eliminated 7 FPs in cpython docs |
| CRYPTO708 | Extended to `random.choices()`, `random.sample()`, `random.randrange()` | Catches API key generation with weak PRNG |
| DELATTR834 | Converted from AST pattern to taint sink (`delattr(obj, tainted_attr)`) | |
| DESER723 | `description`, `remediation`: clarified `marshal.loads` risk | |
| FORMAT864 | Converted from AST pattern to taint sink (`.format(tainted)`) | |
| G101 | `exclude_pattern`: added test/fixture exclusions | |
| G103 | Excludes `def` lines (API param defaults) and chained assignments | Eliminated 4 FPs in ftplib, netrc |
| GETATTR828 | `exclude_file_pattern = "*serializer*,*schema*,*/pandas/core/*,*/pandas/io/*"` | Eliminated 22 FPs in pandas, 9 in django |
| GLOBALS843 | Removed subscript match; only `exec`/`eval` with `globals()` | Eliminated FPs from module attribute registration |
| HASH807 | Activated with broader context exclusions (was disabled) | |
| HTTPS789 | `exclude_file_pattern`: excluded test files | |
| IMPORT825 | `exclude_pattern`, `remediation`: reduced test discovery FPs | |
| LOG741 | `description`, `severity`, `remediation`, `pattern`: narrowed to log injection | |
| OAUTH774 | `exclude_pattern`: reduced FPs on OAuth callbacks | |
| OPEN1149 | Converted from AST pattern to taint sink; severity and confidence updated | |
| OPEN_REDIRECT001 | `exclude_file_pattern` for Django contrib/views (relative + absolute paths) | Eliminated 15 FPs in django framework code |
| ORM001 | Word boundary on `text` keyword; same migration exclusions | Eliminated 29 FPs in django (`gettext(...)`) |
| PATH813 | `exclude_pattern`: reduced FPs on safe path joins | |
| PERM650 | Converted from regex pattern to taint sink for SQL injection | |
| PY002 | `exclude_file_pattern = "*/cache/backends/*"` | Cache backends covered by PY306_CACHE; prevents double-reporting |
| PY101 | `exclude_file_pattern = "*/migrations/*,*/alembic/*,*/backends/*"` | Eliminated 69 FPs in django (ORM DDL infrastructure) |
| PY103 | Converted from AST pattern to taint sink (`os.system(tainted)`) | |
| PY105 | Converted from regex to taint sink (`mark_safe(tainted)`) | |
| PY106 | `ast_match`: tightened subprocess `shell=True` detection | |
| PY107/PY302 | `file_content_exclude = "from ruamel.yaml\|import ruamel"`: new per-file content exclusion mechanism | Eliminated 14 FPs in semgrep (all ruamel.yaml safe usage) |
| PY201 | Extended exclude for MD5 checksum contexts | Eliminated TF and pandas checksum FPs |
| PY202 | `exclude_pattern`: excluded SHA1 in non-crypto contexts | |
| PY507 | Converted from regex to taint sink (`.exec(tainted)`) | |
| RAND810 | Converted from AST pattern to taint sink (`random.seed(tainted)`) | |
| REGEX870 | `description`, `pattern`, `exclude_pattern`, `remediation`: ReDoS narrowed | |
| SEC501 | Excludes quoted references, definitions, and method calls on `.exec` | Eliminated docstring FPs across all repos |
| SER522 | Converted from regex pattern to taint sink | |
| SETATTR831 | Converted from AST pattern to taint sink (`setattr(obj, tainted_attr, val)`) | |
| SHELL631 | Converted from regex pattern to taint sink for SQL injection | |
| SHELL675 | Converted from regex pattern to taint sink for SQL interpolation | |
| SHELL689 | Converted from regex pattern to taint sink for subprocess | |
| SQL586 | Converted from regex pattern to taint sink for SQL formatting | |
| SQL693 | Converted from regex pattern to taint sink for SQL execute | |
| SYMLINK816 | `description`, `pattern`, `remediation`: symlink traversal clarified | |
| TIMING759 | Excludes null-check patterns | Eliminated timing oracle FPs from presence checks |
| TLS001 | Extended exclude for internal array operations | Eliminated 6 pandas internal FPs |
| TOKEN771 | `description`, `confidence`, `exclude_pattern`, `remediation`: JWT expiry check refined | |
| ZIPSLIP001 | Added safe-filter exclusion; excludes regex string accessor | Eliminated satori-cli (Python 3.12 safe filter) + pandas ×4 FPs |
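
The ORM001 change (word boundary on `text`) can be illustrated with a small regex comparison. Both patterns below are hypothetical stand-ins written for this example; the real ORM001 pattern is not shown in this PR text. The sketch only demonstrates why a `\b` anchor stops `gettext(...)` from matching.

```python
import re

# Hypothetical patterns for illustration only, not the actual rule.
LOOSE = re.compile(r'text\(\s*f["\']')       # also matches inside "gettext("
ANCHORED = re.compile(r'\btext\(\s*f["\']')  # requires a word boundary before "text"

samples = [
    'query = text(f"SELECT * FROM users WHERE id = {uid}")',
    'label = gettext(f"Hello {name}")',
]


def matches(pattern: re.Pattern, line: str) -> bool:
    """Return True if the pattern occurs anywhere in the line."""
    return pattern.search(line) is not None


if __name__ == "__main__":
    for line in samples:
        print(matches(LOOSE, line), matches(ANCHORED, line), line)
```

With the anchor, only the SQLAlchemy-style call is reported, while the translation helper is ignored because there is no word boundary between `get` and `text`.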

Taint Engine Changes (taint_analysis.rs)

| Area | What changed |
|---|---|
| Sources | The engine now recognizes more entry points as attacker-controlled: HTTP handler parameters detected via route decorators, HTTP client responses, and file contents loaded via deserialization functions. File contents are always treated as potentially attacker-controlled, even when the file path itself was chosen by the operator; this is what enables supply-chain detection. |
| Origin model | Not all external input is equally dangerous. Data coming from CLI arguments, environment variables, or deployment configuration is operator-supplied and treated as trusted. Data coming from HTTP requests or deserialized file contents is attacker-controlled. Sinks only fire on the latter, which eliminates an entire class of false positives on CLI tools without touching any individual rule. |
| Propagation | Taint now follows data across function call boundaries, through class attributes, and through control flow constructs like loops, context managers, and exception handlers. Previously, taint was lost as soon as data crossed a function boundary or was stored in an object. |
| Sanitizers | The engine recognizes functions that clean data: database query escaping clears SQL taint, and HTML escaping clears HTML taint. Partially sanitized data (e.g., HTML-escaped but not shell-escaped) does not get promoted to fully clean. |
| Performance | Analysis skips functions that have no tainted data flowing through them, runs the convergence loop and final pass in parallel, and caches control flow graphs between iterations. Combined with the call graph improvements, this reduced scan time by roughly 2–9× on the repos that completed under both versions (for example, pandas went from 549s to 64s). |
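
The per-context sanitizer behavior described above can be sketched as follows. This is a simplified model, not the Rust implementation in `taint_analysis.rs`: the taint kinds, the `SANITIZERS` table, and both function names are assumptions made for illustration.

```python
# Hypothetical sanitizer table: each sanitizer clears exactly one taint kind.
SANITIZERS = {
    "html.escape": "html",
    "shlex.quote": "shell",
}

ALL_KINDS = frozenset({"sql", "html", "shell"})


def apply_sanitizer(kinds: frozenset, sanitizer: str) -> frozenset:
    """Remove only the taint kind this sanitizer is known to clear."""
    cleared = SANITIZERS.get(sanitizer)
    return kinds - {cleared} if cleared else kinds


def sink_fires(kinds: frozenset, sink_kind: str) -> bool:
    """A sink fires only if the value still carries the matching taint kind."""
    return sink_kind in kinds


if __name__ == "__main__":
    escaped = apply_sanitizer(ALL_KINDS, "html.escape")
    print(sink_fires(escaped, "html"), sink_fires(escaped, "shell"))
```

Tracking taint as a set of kinds is what keeps an HTML-escaped value from being treated as safe for a shell sink.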

New Infrastructure

| Feature | Description |
|---|---|
| `file_content_exclude` field on `Rule` | Per-file content regex checked once before any analysis; prevents a rule from firing on files that import a specific library |
| Comma-separated `exclude_file_pattern` | Was treated as a single literal pattern; now split on comma. This fixed all multi-pattern exclusions that were silently not working |
| `vulnerable_keyword` on `TaintSinkRule` | Sink only fires for a specific named kwarg (e.g., `create(password=tainted)`); prevents positional arg FPs |
| CLI vs HTTP taint origin | `@app.command()` / Click / Typer parameters → `TaintOrigin::OperatorConfig`; HTTP request parameters → `TaintOrigin::HttpRequest`. Operator-supplied paths are not injection vectors. `FILE_DESERIALIZER` results always produce `HttpRequest` regardless of file path origin, preserving supply-chain detection. |
| `sys.argv` / `os.environ` → `OperatorConfig` | `sys.argv[n]` and `os.environ.get()` now produce `TaintOrigin::OperatorConfig`. Eliminates PY305 FPs on stdlib tools (timeit, pdb, runpy) and Django management shell. |
| Duplicate rule consolidation | 8 groups of rules shared identical patterns, so each location was firing 2-5× for the same vulnerability. Duplicates deleted; one canonical rule per pattern remains. |
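
The comma-split fix can be sketched in a few lines. This example uses Python's `fnmatch` as a stand-in for whatever glob matching the Rust core performs, which is an assumption; the function names are hypothetical. The pattern string is the PY101 value from the Modified Rules table.

```python
from fnmatch import fnmatchcase


def is_excluded_literal(path: str, exclude_file_pattern: str) -> bool:
    """Old behavior as described: the whole string is one glob pattern."""
    return fnmatchcase(path, exclude_file_pattern)


def is_excluded(path: str, exclude_file_pattern: str) -> bool:
    """New behavior as described: split on commas, match any single pattern."""
    patterns = [p.strip() for p in exclude_file_pattern.split(",") if p.strip()]
    return any(fnmatchcase(path, p) for p in patterns)


if __name__ == "__main__":
    pattern = "*/migrations/*,*/alembic/*,*/backends/*"
    path = "myapp/migrations/0001_initial.py"
    print(is_excluded_literal(path, pattern), is_excluded(path, pattern))
```

Treated as one literal pattern, the string can only match a path that itself contains commas, which is why multi-pattern exclusions never took effect before the split.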

Taint Input Model

How input sources are classified

| Origin | TaintOrigin | Is attacker-controlled? | Example |
|---|---|---|---|
| HTTP request parameters | `HttpRequest` | Yes | `request.POST.get("q")`, `request.args["id"]`, FastAPI path/query params |
| HTTP request headers/cookies | `HttpRequest` | Yes | `request.COOKIES["session"]`, `request.headers["X-Token"]` |
| File contents (deserializers) | `HttpRequest` | Yes (supply chain) | `json.load(f)`, `yaml.load(f)`, `pickle.load(f)`, `toml.load(f)`, even if `f` came from a CLI-specified path |
| CLI arguments | `OperatorConfig` | No (operator-trusted) | `@app.command()` params (Typer), `@click.argument()`, `@click.option()`, `sys.argv[n]` |
| Environment variables | `OperatorConfig` | No (in web app threat model) | `os.environ.get("DB_URL")`, set by deployment operator |
| Environment variables (CI) | n/a (AST rule only) | Yes (in CI/supply-chain threat model) | `os.environ.get("SEMGREP_URL")`: in GitHub Actions, env vars can be set by PR authors via workflow triggers; ENV_URL001 / ENV_GIT_URL001 catch this regardless of taint origin |
| Hardcoded literals | `DeveloperDefined` | No | String constants, integer literals |
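
The classification above can be modeled as a small sketch. The enum mirrors the three origin names used in this PR, but the Python class and both functions are hypothetical illustrations and not the engine's Rust types.

```python
from enum import Enum


class TaintOrigin(Enum):
    HTTP_REQUEST = "HttpRequest"
    OPERATOR_CONFIG = "OperatorConfig"
    DEVELOPER_DEFINED = "DeveloperDefined"


def is_attacker_controlled(origin: TaintOrigin) -> bool:
    """Taint sinks fire only for data classified as HttpRequest."""
    return origin is TaintOrigin.HTTP_REQUEST


def origin_after_file_deserializer(path_origin: TaintOrigin) -> TaintOrigin:
    """File contents are attacker-controlled no matter who chose the path."""
    # path_origin is deliberately ignored; see the "File contents" row above.
    return TaintOrigin.HTTP_REQUEST


if __name__ == "__main__":
    cli_path = TaintOrigin.OPERATOR_CONFIG
    print(is_attacker_controlled(cli_path))
    print(is_attacker_controlled(origin_after_file_deserializer(cli_path)))
```

A CLI-supplied path is trusted on its own, while the data deserialized from that path is not, which is how the model keeps supply-chain findings without flagging ordinary CLI tools.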

Benchmark Scans: Previous vs Current

| Repo | Files | Funcs OLD | Funcs NEW | Time OLD | Time NEW | Findings OLD | Findings NEW | TP est. NEW | FP est. NEW | S/N NEW |
|---|---|---|---|---|---|---|---|---|---|---|
| django/django | 2,876 | 26,964 ⚠️ | 7,137 | N/A¹ | 99s | N/A | 68 | ~22 | ~46 | 32% |
| pallets/flask | 78 | 1,139 | 315 | 4.6s | 1.2s | 27 | 7 | 5 | 2 | 71% |
| pandas-dev/pandas | 537 | 7,934 | 7,171 | 549s | 64s | 412 | 15 | 8 | 7 | 53% |
| scikit-learn/scikit-learn | 743 | 3,811 | 3,725 | 152s | 29s | 135 | 41 | ~37 | ~4 | 90% |
| psf/requests | 37 | 623 | 227 | 3.1s | 0.6s | 11 | 5 | 3 | 2 | 60% |
| parzivalhack/pyspector | 19 | 145 | 109 | 3.2s | 3.9s | 23 | 4 | 4 | 0 | 100% |
| satorici/satori-cli | 42 | 190 | 190 | 3.9s | 0.9s | 29 | 3 | 2 | 1 | 67% |
| fastapi/fastapi | 1,109 | 4,376 | 875 | 32.2s | 3.9s | 69 | 0 | 0 | 0 | n/a |
| adeyosemanputra/pygoat | 80 | 173 | 173 | 2.8s | 0.75s | 116 | 72 | 68 | 4 | 94% |
| mukxl/Intentionally-Vulnerable-Python-Application | 1 | 6 | 6 | 0.2s | 0.28s | 8 | 7 | 7 | 0 | 100% |
| ansible/ansible | 1,772 | 9,504 ⚠️ | 4,416 | N/A¹ | 28s | N/A | 124 | ~55 | ~69 | 44% |
| python/cpython | 1,424 | ⚠️ | 14,599 | N/A¹ | 150s | N/A | 274 | ~60 | ~214 | 22% |
| tensorflow/tensorflow | 2,266 | ⚠️ | 16,974 | N/A¹ | 134s | N/A | 29 | ~18 | ~11 | 62% |
| semgrep/semgrep | 706 | 2,040 | 1,342 | 37.0s | 17s | 139 | 11 | 7 | 4 | 64% |

¹ The previous version has no test-file exclusion, so call graphs of 9,504–26,964 functions cause OOM or timeouts. The new branch excludes test files, reducing function counts by 50–70%.
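
The S/N column appears to be the share of findings estimated to be true positives, TP / (TP + FP). That definition is an assumption inferred from the table's figures, not something stated in the PR. A minimal sketch using a few rows from the table above:

```python
from typing import Optional


def signal_to_noise(tp: int, fp: int) -> Optional[int]:
    """Percentage of findings that are true positives, or None with no findings."""
    total = tp + fp
    if total == 0:
        return None
    return round(100 * tp / total)


def speedup(old_seconds: float, new_seconds: float) -> float:
    """Ratio of old scan time to new scan time, rounded to one decimal."""
    return round(old_seconds / new_seconds, 1)


# (TP est., FP est.) pairs copied from the benchmark table.
estimates = {
    "pygoat": (68, 4),
    "django": (22, 46),
    "pandas": (8, 7),
    "fastapi": (0, 0),
}

if __name__ == "__main__":
    for repo, (tp, fp) in estimates.items():
        print(repo, signal_to_noise(tp, fp))
    print("pandas speedup:", speedup(549, 64))
```

Returning `None` for zero findings matches the fastapi row, where no ratio can be computed.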

True positives by repo (to be analyzed)

| Repo | Confirmed TPs | Key findings |
|---|---|---|
| adeyosemanputra/pygoat | ~68 | CSRF×25, timing×7, pickle×4, eval×3, FLASK001, PLAIN_PWD001, DJANGO_DEBUG001×2 |
| mukxl/Intentionally-Vulnerable-Python-Application | 7 | PY002 (pickle), HARDCODED_PWD001, timing, FLASK001 |
| django/django | ~22 | PY306_CACHE×6 (cache poisoning→RCE), ORM002×3, PY106 |
| scikit-learn/scikit-learn | ~37 | DESER_JOBLIB001×11, pickle×4, ZIPSLIP001×1, HASH807×1 |
| ansible/ansible | ~55 | SHELL602×7, ZIPSLIP001×1, PY305×3 (strategy/collection loader), PY002×4 |
| semgrep/semgrep | 7 | ENV_URL001×2 (SEMGREP_URL SSRF), ENV_GIT_URL001×1 (CI token theft), OPEN1149×1, HASH807×4 (SHA-256 for token hashing) |
| python/cpython | ~60 | DESER723×3 (marshal/zipimport), ZIPSLIP001×2, SSRF_001×2, PY002 (IDLE RPC), IMPORT825 (logging config) |
| tensorflow/tensorflow | ~18 | LOG741×7 (log injection), DESER723/724 (bytecode), DESER_NUMPY001×1, HASH807×3 |
| psf/requests | 3 | TIMING759×2 (password == in auth), G405 |
| pallets/flask | 5 | exec() in from_pyfile(), SHA1, DJANGO_DEBUG001 |
| satorici/satori-cli | 2 | SSRF_001×2 (API response URL used in HTTP client) |
| parzivalhack/pyspector | 4 | Supply-chain: PATH813 + OPEN1149×2 (aipocgen.py, json.load config), HASH807×1 |

Signal-to-noise ratio

New

mukxl/Intentionally-Vulnerable-Python-Application  ████████████████████ 100% — all vulns caught
parzivalhack/pyspector                             ████████████████████ 100% — 4 supply-chain TPs
adeyosemanputra/pygoat                             ███████████████████░  94% — ground truth
scikit-learn/scikit-learn                          █████████████████░░░  90% — DESER_JOBLIB001 ×11
pallets/flask                                      ██████████████░░░░░░  71% — exec() intentional
satorici/satori-cli                                █████████████░░░░░░░  67% — SSRF TPs confirmed
semgrep/semgrep                                    █████████████░░░░░░░  64% — CI security + HASH807
tensorflow/tensorflow                              ████████████░░░░░░░░  62% — log injection + bytecode
psf/requests                                       ████████████░░░░░░░░  60% — timing oracle in auth
fastapi/fastapi                                    ████████████████████  n/a — zero findings (true negatives)
pandas-dev/pandas                                  ███████████░░░░░░░░░  53% — GETATTR828 delegation
ansible/ansible                                    █████████░░░░░░░░░░░  44% — automation attack surface
django/django                                      ██████░░░░░░░░░░░░░░  32% — ORM infrastructure FPs
python/cpython                                     █████░░░░░░░░░░░░░░░  22% — interpreter by design

Old (repos that completed)

mukxl/Intentionally-Vulnerable-Python-Application  ████████████████████ 100% (same)
adeyosemanputra/pygoat                             █████████████░░░░░░░  ~67% (INPUT1143, XSS517 FPs)
scikit-learn/scikit-learn                          ████░░░░░░░░░░░░░░░░  ~22% (CENTER927, CRYPTO708 noise)
pandas-dev/pandas                                  ██░░░░░░░░░░░░░░░░░░   ~2% (412 findings, ~8 real)
fastapi/fastapi                                    ░░░░░░░░░░░░░░░░░░░░   ~0% (69 findings, 0 real)
semgrep/semgrep                                    ██░░░░░░░░░░░░░░░░░░   ~3% (139 findings, ~4 real)
satorici/satori-cli                                ███░░░░░░░░░░░░░░░░░   ~7% (29 findings, ~2 real)
psf/requests                                       █████░░░░░░░░░░░░░░░  ~27% (11 findings, 3 real)

Files changed

  • src/pyspector/rules/built-in-rules.toml — 142 rules deleted, 28 added, 41 modified, 2 activated; net: 269 → 127 rules
  • src/pyspector/_rust_core/src/analysis/taint_analysis.rs — taint engine: CLI vs HTTP origin, sys.argv/os.environ → OperatorConfig, dead function removed
  • src/pyspector/_rust_core/src/graph/call_graph_builder.rs — O(1) call resolution, test/docs file exclusion
  • src/pyspector/_rust_core/src/analysis/ast_analysis.rs — per-file exclusion pre-filter, unused import removed
  • src/pyspector/_rust_core/src/analysis/mod.rs — phase timing, parallel scanning
  • src/pyspector/_rust_core/src/rules.rs — file_content_exclude, vulnerable_keyword, comma-split patterns
  • src/pyspector/cli.py — per-phase timing instrumentation
  • src/pyspector/reporting.py — severity serialization fixed (was uppercasing "HIGH", now preserves "High")
  • src/pyspector/triage.py — unused import removed
  • tests/unit/ — 168 tests, all passing (including previously broken reporting_test.py)

Tests changed

168 passed, 0 failed  (was: 116 on main; reporting_test.py had 2 pre-existing failures now fixed)
+52 new tests covering new rules, engine changes, taint origins, and deduplication

Rules: 269 → 127 (-142 deleted, +28 added, 41 modified, 2 disabled)
Tests: 116 → 168 (all passing)

Major changes:
- Taint engine rewrite: CLI vs HTTP origin (OperatorConfig vs HttpRequest),
  inter-procedural propagation, sanitizer tracking, FILE_DESERIALIZER
  always upgrades to HttpRequest for supply-chain detection
- 28 new rules: SSTI, ORM (Django/SQLAlchemy), ML deserialization (joblib,
  numpy, torch), TLS/SSH/JWT/XXE, ZipSlip, sandbox escapes, plain-password
  storage, CI env-var SSRF
- 142 rule deletions: 96 Python builtins (never sinks), 22 exact-pattern
  duplicates, 12 JS/Node rules (wrong language), 7 broken/backwards rules,
  4 redundant with taint-based equivalents
- Performance: O(1) call graph (was O(n²)), AST pre-filter, test/docs
  file exclusion, parallel CFG and convergence. Pandas 549s → 64s,
  sklearn 152s → 29s, fastapi 32s → 4s.

Benchmark (14 repos, old vs new findings, S/N):
- pandas: 412 → 15 (-96%, 53% S/N)
- semgrep: 139 → 11 (-92%, 64%)
- fastapi: 69 → 0 (-100%)
- satori: 29 → 3 (-90%)
- pygoat: 116 → 72 (-38%, 94% S/N — ground truth)
- sklearn: 135 → 41 (-70%, 90%)
- 4 large repos previously OOM now complete (django, ansible, cpython, tf)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Owner

@ParzivalHack left a comment

PR is great as always, local pip package runs smoothly and all unit tests are passing. Merging :D

@ParzivalHack ParzivalHack merged commit 30e9fbc into ParzivalHack:main May 13, 2026
1 of 3 checks passed