Reduce false positives, increase true positives, improve performance by satoridev01 · Pull Request #51 · ParzivalHack/PySpector

satoridev01 · 2026-05-13T01:09:59Z

This is a follow up on minimizing false positives, increasing true positives and making PySpector taint analysis faster and better for multiple repositories to speed up and reduce the sample to noise ratio.

Rules

142 rules were deleted, 2 were disabled, 28 were added and 41 were modified. There is a total of 127 rules.

New Rules

Rule	What it detects	Severity	Confirmed TP
`SSTI001`	`render_template_string(user_input)`, `env.from_string(tainted)`	Critical	pygoat
`ORM001`	SQLAlchemy `text(f"SELECT...{var}")`	Critical	—
`ORM002`	Django `raw()`, `order_by(tainted)`, `extra(tainted)` (CVE-2021-35042)	Critical	django
`DESER725`	`jsonpickle.decode()`	Critical	—
`DESER726`	`dill.loads()`	Critical	—
`DESER_JOBLIB001`	`joblib.load()` — ML model deserialization via pickle	Critical	sklearn ×11
`DESER_NUMPY001`	`numpy.load(allow_pickle=True)`	Critical	tensorflow ×1
`DESER_TORCH001`	`torch.load()` without `weights_only=True`	Critical	—
`TLS001`	`requests.get(url, verify=False)`, `ssl=False`	High	stock
`SSH001`	Paramiko `AutoAddPolicy()` — SSH MITM	High	—
`JWT001`	`jwt.decode(options={"verify_signature": False})`	High	pygoat
`ZIPSLIP001`	`extractall()` without path validation	High	cpython ×4, ansible ×2
`XXE001`	`lxml.etree.parse()` without `resolve_entities=False`	High	—
`FLASK001`	`app.run(debug=True)`	Critical	pygoat, ivpa
`OPEN_REDIRECT001`	`redirect(tainted_url)`, `HttpResponseRedirect(tainted)`	High	—
`PLAIN_PWD001`	`Model.objects.create(password=tainted)` — plaintext DB storage	Critical	pygoat, ivpa
`DJANGO_DEBUG001`	`DEBUG = True` in settings (Django and Flask)	Critical	pygoat ×2, flask
`ENV_URL001`	`os.environ.get("*_URL")` as HTTP endpoint — SSRF (AST rule)	High	semgrep ×2
`COOKIE_FILE001`	Env var used as cookie jar file path	High	—
`ENV_GIT_URL001`	CI env var URL → `git fetch` — CI token exfiltration (AST rule)	High	semgrep ×1
`RUAMEL_UNSAFE001`	`YAML(typ="unsafe")`	Critical	—
`SQL_CONCAT001`	`"SELECT..." + user_var` — SQL via string concatenation	High	pygoat ×5, ivpa ×1
`HARDCODED_PWD001`	`PASSWORD = 'literal'` at module level	High	ivpa
`SHELL_BYPASS001`	`subprocess.run(["bash", "-c", user_cmd])` — shell bypass	High	—
`PY306_CACHE`	`pickle.loads()` in cache backends — cache poisoning → RCE	Critical	django ×6
`G101B`	Uppercase secret constants (`SECRET_KEY`, `API_KEY` ≥ 16 chars)	High	pygoat ×3
`DESER724`	`types.FunctionType()` from deserialized bytecode — arbitrary code execution	Critical	—
`SANDBOX307`	`object.__subclasses__()` traversal — Python sandbox escape	Critical	—
`SANDBOX308`	`__init__.__globals__` access — Python sandbox escape via global namespace	Critical	—

Modified Rules

Rule	Change	Impact
`ADMIN795`	`exclude_pattern` — reduced FPs on test credentials
`BACKUP801`	Pattern requires word char before extension (`\w\.bak`); excludes `.rst/.md`	Eliminated 7 FPs in cpython docs
`CRYPTO708`	Extended to `random.choices()`, `random.sample()`, `random.randrange()`	Catches API key generation with weak PRNG
`DELATTR834`	Converted from AST pattern to taint sink (`delattr(obj, tainted_attr)`)
`DESER723`	`description`, `remediation` — clarified marshal.loads risk
`FORMAT864`	Converted from AST pattern to taint sink (`.format(tainted)`)
`G101`	`exclude_pattern` — added test/fixture exclusions
`G103`	Excludes `def` lines (API param defaults) and chained assignments	Eliminated 4 FPs in ftplib, netrc
`GETATTR828`	`exclude_file_pattern = "serializer,schema,/pandas/core/,/pandas/io/"`	Eliminated 22 FPs in pandas, 9 in django
`GLOBALS843`	Removed subscript match — only exec/eval with globals()	Eliminated FPs from module attribute registration
`HASH807`	Activated with broader context exclusions (was disabled)
`HTTPS789`	`exclude_file_pattern` — excluded test files
`IMPORT825`	`exclude_pattern`, `remediation` — reduced test discovery FPs
`LOG741`	`description`, `severity`, `remediation`, `pattern` — narrowed to log injection
`OAUTH774`	`exclude_pattern` — reduced FPs on OAuth callbacks
`OPEN1149`	Converted from AST pattern to taint sink; severity and confidence updated
`OPEN_REDIRECT001`	`exclude_file_pattern` for Django contrib/views (relative + absolute paths)	Eliminated 15 FPs in django framework code
`ORM001`	Word boundary on `text` keyword; same migration exclusions	Eliminated 29 FPs in django (`gettext(...)`)
`PATH813`	`exclude_pattern` — reduced FPs on safe path joins
`PERM650`	Converted from regex pattern to taint sink for SQL injection
`PY002`	`exclude_file_pattern = "/cache/backends/"`	Cache backends covered by PY306_CACHE — prevents double-reporting
`PY101`	`exclude_file_pattern = "/migrations/,/alembic/,/backends/"`	Eliminated 69 FPs in django (ORM DDL infrastructure)
`PY103`	Converted from AST pattern to taint sink (`os.system(tainted)`)
`PY105`	Converted from regex to taint sink (`mark_safe(tainted)`)
`PY106`	`ast_match` — tightened subprocess shell=True detection
`PY107`/`PY302`	`file_content_exclude = "from ruamel.yaml\|import ruamel"` — new per-file content exclusion mechanism	Eliminated 14 FPs in semgrep (all ruamel.yaml safe usage)
`PY201`	Extended exclude for MD5 checksum contexts	Eliminated TF and pandas checksum FPs
`PY202`	`exclude_pattern` — excluded SHA1 in non-crypto contexts
`PY507`	Converted from regex to taint sink (`.exec(tainted)`)
`RAND810`	Converted from AST pattern to taint sink (`random.seed(tainted)`)
`REGEX870`	`description`, `pattern`, `exclude_pattern`, `remediation` — ReDoS narrowed
`SEC501`	Excludes quoted references, definitions, and method calls on `.exec`	Eliminated docstring FPs across all repos
`SER522`	Converted from regex pattern to taint sink
`SETATTR831`	Converted from AST pattern to taint sink (`setattr(obj, tainted_attr, val)`)
`SHELL631`	Converted from regex pattern to taint sink for SQL injection
`SHELL675`	Converted from regex pattern to taint sink for SQL interpolation
`SHELL689`	Converted from regex pattern to taint sink for subprocess
`SQL586`	Converted from regex pattern to taint sink for SQL formatting
`SQL693`	Converted from regex pattern to taint sink for SQL execute
`SYMLINK816`	`description`, `pattern`, `remediation` — symlink traversal clarified
`TIMING759`	Excludes null-check patterns	Eliminated timing oracle FPs from presence checks
`TLS001`	Extended exclude for internal array operations	Eliminated 6 pandas internal FPs
`TOKEN771`	`description`, `confidence`, `exclude_pattern`, `remediation` — JWT expiry check refined
`ZIPSLIP001`	Added safe-filter exclusion; excludes regex string accessor	Eliminated satori-cli (Python 3.12 safe filter) + pandas ×4 FPs

Taint Engine Changes (`taint_analysis.rs`)

Area	What changed
Sources	The engine now recognizes more entry points as attacker-controlled: HTTP handler parameters detected via route decorators, HTTP client responses, and file contents loaded via deserialization functions. File contents are always treated as potentially attacker-controlled even when the file path itself was chosen by the operator — this is what enables supply-chain detection.
Origin model	Not all external input is equally dangerous. Data coming from CLI arguments, environment variables, or deployment configuration is operator-supplied and treated as trusted. Data coming from HTTP requests or deserialized file contents is attacker-controlled. Sinks only fire on the latter, which eliminates an entire class of false positives on CLI tools without touching any individual rule.
Propagation	Taint now follows data across function call boundaries, through class attributes, and through control flow constructs like loops, context managers, and exception handlers. Previously, taint was lost as soon as data crossed a function boundary or was stored in an object.
Sanitizers	The engine recognizes functions that clean data — database query escaping clears SQL taint, HTML escaping clears HTML taint. Partially sanitized data (e.g., HTML-escaped but not shell-escaped) does not get promoted to fully clean.
Performance	Analysis skips functions that have no tainted data flowing through them, runs the convergence loop and final pass in parallel, and caches control flow graphs between iterations. Combined with the call graph improvements, this reduced scan time on large repos by 2–5×.

New Infrastructure

Feature	Description
`file_content_exclude` field on Rule	Per-file content regex checked ONCE before any analysis — prevents rule from firing on files that import a specific library
Comma-separated `exclude_file_pattern`	Was treated as a single literal pattern; now split on comma — fixed all multi-pattern exclusions that were silently not working
`vulnerable_keyword` on TaintSinkRule	Sink only fires for a specific named kwarg (e.g., `create(password=tainted)`) — prevents positional arg FPs
CLI vs HTTP taint origin	`@app.command()` / Click / Typer parameters → `TaintOrigin::OperatorConfig`; HTTP request parameters → `TaintOrigin::HttpRequest`. Operator-supplied paths are not injection vectors. FILE_DESERIALIZER results always produce `HttpRequest` regardless of file path origin, preserving supply-chain detection.
`sys.argv` / `os.environ` → `OperatorConfig`	`sys.argv[n]` and `os.environ.get()` now produce `TaintOrigin::OperatorConfig`. Eliminates PY305 FPs on stdlib tools (timeit, pdb, runpy) and Django management shell.
Duplicate rule consolidation	8 groups of rules shared identical patterns — each location was firing 2-5× for the same vulnerability. Duplicates deleted; one canonical rule per pattern remains.

Taint Input Model

How input sources are classified

Origin	`TaintOrigin`	Is attacker-controlled?	Example
HTTP request parameters	`HttpRequest`	Yes	`request.POST.get("q")`, `request.args["id"]`, FastAPI path/query params
HTTP request headers/cookies	`HttpRequest`	Yes	`request.COOKIES["session"]`, `request.headers["X-Token"]`
File contents (deserializers)	`HttpRequest`	Yes — supply chain	`json.load(f)`, `yaml.load(f)`, `pickle.load(f)`, `toml.load(f)` — even if `f` came from a CLI-specified path
CLI arguments	`OperatorConfig`	No — operator-trusted	`@app.command()` params (Typer), `@click.argument()`, `@click.option()`, `sys.argv[n]`
Environment variables	`OperatorConfig`	No — in web app threat model	`os.environ.get("DB_URL")` — set by deployment operator
Environment variables (CI)	— (AST rule only)	Yes — in CI/supply-chain threat model	`os.environ.get("SEMGREP_URL")` — in GitHub Actions, env vars can be set by PR authors via workflow triggers; `ENV_URL001` / `ENV_GIT_URL001` catch this regardless of taint origin
Hardcoded literals	`DeveloperDefined`	No	String constants, integer literals

Benchmark Scans: Previous vs Current

Repo	Files	Funcs OLD	Funcs NEW	Time OLD	Time NEW	Findings OLD	Findings NEW	TP est. NEW	FP est. NEW	S/N NEW
django/django	2,876	26,964 ⚠️	7,137	N/A¹	99s	N/A	68	~22	~46	32%
pallets/flask	78	1,139	315	4.6s	1.2s	27	7	5	2	71%
pandas-dev/pandas	537	7,934	7,171	549s	64s	412	15	8	7	53%
scikit-learn/scikit-learn	743	3,811	3,725	152s	29s	135	41	~37	~4	90%
psf/requests	37	623	227	3.1s	0.6s	11	5	3	2	60%
parzivalhack/pyspector	19	145	109	3.2s	3.9s	23	4	4	0	100%
satorici/satori-cli	42	190	190	3.9s	0.9s	29	3	2	1	67%
fastapi/fastapi	1,109	4,376	875	32.2s	3.9s	69	0	0	0	—
adeyosemanputra/pygoat	80	173	173	2.8s	0.75s	116	72	68	4	94%
mukxl/Intentionally-Vulnerable-Python-Application	1	6	6	0.2s	0.28s	8	7	7	0	100%
ansible/ansible	1,772	9,504 ⚠️	4,416	N/A¹	28s	N/A	124	~55	~69	44%
python/cpython	1,424	— ⚠️	14,599	N/A¹	150s	N/A	274	~60	~214	22%
tensorflow/tensorflow	2,266	— ⚠️	16,974	N/A¹	134s	N/A	29	~18	~11	62%
semgrep/semgrep	706	2,040	1,342	37.0s	17s	139	11	7	4	64%

¹ The previous version has no test file exclusion — call graphs of 9,504–26,964 functions cause OOM/timeout. New branch excludes test files, reducing function counts by 50–70%.

True positives by repo (to be analyzed)

Repo	Confirmed TPs	Key findings
adeyosemanputra/pygoat	~68	CSRF×25, timing×7, pickle×4, eval×3, FLASK001, PLAIN_PWD001, DJANGO_DEBUG001×2
mukxl/Intentionally-Vulnerable-Python-Application	7	PY002 (pickle), HARDCODED_PWD001, timing, FLASK001
django/django	~22	PY306_CACHE×6 (cache poisoning→RCE), ORM002×3, PY106
scikit-learn/scikit-learn	~37	DESER_JOBLIB001×11, pickle×4, ZIPSLIP001×1, HASH807×1
ansible/ansible	~55	SHELL602×7, ZIPSLIP001×1, PY305×3 (strategy/collection loader), PY002×4
semgrep/semgrep	7	ENV_URL001×2 (SEMGREP_URL SSRF), ENV_GIT_URL001×1 (CI token theft), OPEN1149×1, HASH807×4 (SHA-256 for token hashing)
python/cpython	~60	DESER723×3 (marshal/zipimport), ZIPSLIP001×2, SSRF_001×2, PY002 (IDLE RPC), IMPORT825 (logging config)
tensorflow/tensorflow	~18	LOG741×7 (log injection), DESER723/724 (bytecode), DESER_NUMPY001×1, HASH807×3
psf/requests	3	TIMING759×2 (password `==` in auth), G405
pallets/flask	5	exec() in from_pyfile(), SHA1, DJANGO_DEBUG001
satorici/satori-cli	2	SSRF_001×2 (API response URL used in HTTP client)
parzivalhack/pyspector	4	Supply-chain: PATH813 + OPEN1149×2 (aipocgen.py, json.load config), HASH807×1

Signal-to-noise ratio

New

mukxl/Intentionally-Vulnerable-Python-Application  ████████████████████ 100% — all vulns caught
parzivalhack/pyspector                             ████████████████████ 100% — 4 supply-chain TPs
adeyosemanputra/pygoat                             ███████████████████░  94% — ground truth
scikit-learn/scikit-learn                          █████████████████░░░  90% — DESER_JOBLIB001 ×11
pallets/flask                                      ██████████████░░░░░░  71% — exec() intentional
satorici/satori-cli                                █████████████░░░░░░░  67% — SSRF TPs confirmed
semgrep/semgrep                                    █████████████░░░░░░░  64% — CI security + HASH807
tensorflow/tensorflow                              ████████████░░░░░░░░  62% — log injection + bytecode
psf/requests                                       ████████████░░░░░░░░  60% — timing oracle in auth
fastapi/fastapi                                    ████████████████████  n/a — zero findings (true negatives)
pandas-dev/pandas                                  ██████░░░░░░░░░░░░░░  53% — GETATTR828 delegation
ansible/ansible                                    █████████░░░░░░░░░░░  44% — automation attack surface
django/django                                      ██████░░░░░░░░░░░░░░  32% — ORM infrastructure FPs
python/cpython                                     █████░░░░░░░░░░░░░░░  22% — interpreter by design

Old (repos that completed)

mukxl/Intentionally-Vulnerable-Python-Application  ████████████████████ 100% (same)
adeyosemanputra/pygoat                             █████████████░░░░░░░  ~67% (INPUT1143, XSS517 FPs)
scikit-learn/scikit-learn                          ████░░░░░░░░░░░░░░░░  ~22% (CENTER927, CRYPTO708 noise)
pandas-dev/pandas                                  ██░░░░░░░░░░░░░░░░░░   ~2% (412 findings, ~8 real)
fastapi/fastapi                                    ░░░░░░░░░░░░░░░░░░░░   ~0% (69 findings, 0 real)
semgrep/semgrep                                    ██░░░░░░░░░░░░░░░░░░   ~3% (139 findings, ~4 real)
satorici/satori-cli                                ███░░░░░░░░░░░░░░░░░   ~7% (29 findings, ~2 real)
psf/requests                                       █████░░░░░░░░░░░░░░░  ~27% (11 findings, 3 real)

Files changed

src/pyspector/rules/built-in-rules.toml — 142 rules deleted, 25 added, 41 modified, 2 activated; net: 269 → 127 rules
src/pyspector/_rust_core/src/analysis/taint_analysis.rs — taint engine: CLI vs HTTP origin, sys.argv/os.environ → OperatorConfig, dead function removed
src/pyspector/_rust_core/src/graph/call_graph_builder.rs — O(1) call resolution, test/docs file exclusion
src/pyspector/_rust_core/src/analysis/ast_analysis.rs — per-file exclusion pre-filter, unused import removed
src/pyspector/_rust_core/src/analysis/mod.rs — phase timing, parallel scanning
src/pyspector/_rust_core/src/rules.rs — file_content_exclude, vulnerable_keyword, comma-split patterns
src/pyspector/cli.py — per-phase timing instrumentation
src/pyspector/reporting.py — severity serialization fixed (was uppercasing "HIGH", now preserves "High")
src/pyspector/triage.py — unused import removed
tests/unit/ — 168 tests, all passing (including previously broken reporting_test.py)

Tests changed

168 passed, 0 failed  (was: 116 on main; reporting_test.py had 2 pre-existing failures now fixed)
+52 new tests covering new rules, engine changes, taint origins, and deduplication

Rules: 269 → 127 (-142 deleted, +28 added, 41 modified, 2 disabled) Tests: 116 → 168 (all passing) Major changes: - Taint engine rewrite: CLI vs HTTP origin (OperatorConfig vs HttpRequest), inter-procedural propagation, sanitizer tracking, FILE_DESERIALIZER always upgrades to HttpRequest for supply-chain detection - 28 new rules: SSTI, ORM (Django/SQLAlchemy), ML deserialization (joblib, numpy, torch), TLS/SSH/JWT/XXE, ZipSlip, sandbox escapes, plain-password storage, CI env-var SSRF - 142 rule deletions: 96 Python builtins (never sinks), 22 exact-pattern duplicates, 12 JS/Node rules (wrong language), 7 broken/backwards rules, 4 redundant with taint-based equivalents - Performance: O(1) call graph (was O(n²)), AST pre-filter, test/docs file exclusion, parallel CFG and convergence. Pandas 549s → 64s, sklearn 152s → 29s, fastapi 32s → 4s. Benchmark (14 repos, old vs new findings, S/N): - pandas: 412 → 15 (-96%, 53% S/N) - semgrep: 139 → 11 (-92%, 64%) - fastapi: 69 → 0 (-100%) - satori: 29 → 3 (-90%) - pygoat: 116 → 72 (-38%, 94% S/N — ground truth) - sklearn: 135 → 41 (-70%, 90%) - 4 large repos previously OOM now complete (django, ansible, cpython, tf) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ParzivalHack

PR is great as always, local pip package runs smoothly and all unit tests are passing. Merging :D

ParzivalHack approved these changes May 13, 2026

View reviewed changes

Merge branch 'main' into reduce-false-positives-pyspector

5e8af18

ParzivalHack merged commit 30e9fbc into ParzivalHack:main May 13, 2026
1 of 3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce false positives, increase true positives, improve performance#51

Reduce false positives, increase true positives, improve performance#51
ParzivalHack merged 2 commits into
ParzivalHack:mainfrom
satoridev01:reduce-false-positives-pyspector

satoridev01 commented May 13, 2026

Uh oh!

ParzivalHack left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

satoridev01 commented May 13, 2026

Rules

New Rules

Modified Rules

Taint Engine Changes (taint_analysis.rs)

New Infrastructure

Taint Input Model

How input sources are classified

Benchmark Scans: Previous vs Current

True positives by repo (to be analyzed)

Signal-to-noise ratio

New

Old (repos that completed)

Files changed

Tests changed

Uh oh!

ParzivalHack left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Taint Engine Changes (`taint_analysis.rs`)