Reduce false positives, increase true positives, improve performance#51
Merged
ParzivalHack merged 2 commits intoMay 13, 2026
Merged
Conversation
Rules: 269 → 127 (-142 deleted, +28 added, 41 modified, 2 disabled) Tests: 116 → 168 (all passing) Major changes: - Taint engine rewrite: CLI vs HTTP origin (OperatorConfig vs HttpRequest), inter-procedural propagation, sanitizer tracking, FILE_DESERIALIZER always upgrades to HttpRequest for supply-chain detection - 28 new rules: SSTI, ORM (Django/SQLAlchemy), ML deserialization (joblib, numpy, torch), TLS/SSH/JWT/XXE, ZipSlip, sandbox escapes, plain-password storage, CI env-var SSRF - 142 rule deletions: 96 Python builtins (never sinks), 22 exact-pattern duplicates, 12 JS/Node rules (wrong language), 7 broken/backwards rules, 4 redundant with taint-based equivalents - Performance: O(1) call graph (was O(n²)), AST pre-filter, test/docs file exclusion, parallel CFG and convergence. Pandas 549s → 64s, sklearn 152s → 29s, fastapi 32s → 4s. Benchmark (14 repos, old vs new findings, S/N): - pandas: 412 → 15 (-96%, 53% S/N) - semgrep: 139 → 11 (-92%, 64%) - fastapi: 69 → 0 (-100%) - satori: 29 → 3 (-90%) - pygoat: 116 → 72 (-38%, 94% S/N — ground truth) - sklearn: 135 → 41 (-70%, 90%) - 4 large repos previously OOM now complete (django, ansible, cpython, tf) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
ParzivalHack
approved these changes
May 13, 2026
Owner
ParzivalHack
left a comment
There was a problem hiding this comment.
PR is great as always, local pip package runs smoothly and all unit tests are passing. Merging :D
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a follow up on minimizing false positives, increasing true positives and making PySpector taint analysis faster and better for multiple repositories to speed up and reduce the sample to noise ratio.
Rules
142 rules were deleted, 2 were disabled, 28 were added and 41 were modified. There is a total of 127 rules.
New Rules
SSTI001render_template_string(user_input),env.from_string(tainted)ORM001text(f"SELECT...{var}")ORM002raw(),order_by(tainted),extra(tainted)(CVE-2021-35042)DESER725jsonpickle.decode()DESER726dill.loads()DESER_JOBLIB001joblib.load()— ML model deserialization via pickleDESER_NUMPY001numpy.load(allow_pickle=True)DESER_TORCH001torch.load()withoutweights_only=TrueTLS001requests.get(url, verify=False),ssl=FalseSSH001AutoAddPolicy()— SSH MITMJWT001jwt.decode(options={"verify_signature": False})ZIPSLIP001extractall()without path validationXXE001lxml.etree.parse()withoutresolve_entities=FalseFLASK001app.run(debug=True)OPEN_REDIRECT001redirect(tainted_url),HttpResponseRedirect(tainted)PLAIN_PWD001Model.objects.create(password=tainted)— plaintext DB storageDJANGO_DEBUG001DEBUG = Truein settings (Django and Flask)ENV_URL001os.environ.get("*_URL")as HTTP endpoint — SSRF (AST rule)COOKIE_FILE001ENV_GIT_URL001git fetch— CI token exfiltration (AST rule)RUAMEL_UNSAFE001YAML(typ="unsafe")SQL_CONCAT001"SELECT..." + user_var— SQL via string concatenationHARDCODED_PWD001PASSWORD = 'literal'at module levelSHELL_BYPASS001subprocess.run(["bash", "-c", user_cmd])— shell bypassPY306_CACHEpickle.loads()in cache backends — cache poisoning → RCEG101BSECRET_KEY,API_KEY≥ 16 chars)DESER724types.FunctionType()from deserialized bytecode — arbitrary code executionSANDBOX307object.__subclasses__()traversal — Python sandbox escapeSANDBOX308__init__.__globals__access — Python sandbox escape via global namespaceModified Rules
ADMIN795exclude_pattern— reduced FPs on test credentialsBACKUP801\w\.bak); excludes.rst/.mdCRYPTO708random.choices(),random.sample(),random.randrange()DELATTR834delattr(obj, tainted_attr))DESER723description,remediation— clarified marshal.loads riskFORMAT864.format(tainted))G101exclude_pattern— added test/fixture exclusionsG103deflines (API param defaults) and chained assignmentsGETATTR828exclude_file_pattern = "*serializer*,*schema*,*/pandas/core/*,*/pandas/io/*"GLOBALS843HASH807HTTPS789exclude_file_pattern— excluded test filesIMPORT825exclude_pattern,remediation— reduced test discovery FPsLOG741description,severity,remediation,pattern— narrowed to log injectionOAUTH774exclude_pattern— reduced FPs on OAuth callbacksOPEN1149OPEN_REDIRECT001exclude_file_patternfor Django contrib/views (relative + absolute paths)ORM001textkeyword; same migration exclusionsgettext(...))PATH813exclude_pattern— reduced FPs on safe path joinsPERM650PY002exclude_file_pattern = "*/cache/backends/*"PY101exclude_file_pattern = "*/migrations/*,*/alembic/*,*/backends/*"PY103os.system(tainted))PY105mark_safe(tainted))PY106ast_match— tightened subprocess shell=True detectionPY107/PY302file_content_exclude = "from ruamel.yaml|import ruamel"— new per-file content exclusion mechanismPY201PY202exclude_pattern— excluded SHA1 in non-crypto contextsPY507.exec(tainted))RAND810random.seed(tainted))REGEX870description,pattern,exclude_pattern,remediation— ReDoS narrowedSEC501.execSER522SETATTR831setattr(obj, tainted_attr, val))SHELL631SHELL675SHELL689SQL586SQL693SYMLINK816description,pattern,remediation— symlink traversal clarifiedTIMING759TLS001TOKEN771description,confidence,exclude_pattern,remediation— JWT expiry check refinedZIPSLIP001Taint Engine Changes (
taint_analysis.rs)New Infrastructure
file_content_excludefield on Ruleexclude_file_patternvulnerable_keywordon TaintSinkRulecreate(password=tainted)) — prevents positional arg FPs@app.command()/ Click / Typer parameters →TaintOrigin::OperatorConfig; HTTP request parameters →TaintOrigin::HttpRequest. Operator-supplied paths are not injection vectors. FILE_DESERIALIZER results always produceHttpRequestregardless of file path origin, preserving supply-chain detection.sys.argv/os.environ→OperatorConfigsys.argv[n]andos.environ.get()now produceTaintOrigin::OperatorConfig. Eliminates PY305 FPs on stdlib tools (timeit, pdb, runpy) and Django management shell.Taint Input Model
How input sources are classified
TaintOriginHttpRequestrequest.POST.get("q"),request.args["id"], FastAPI path/query paramsHttpRequestrequest.COOKIES["session"],request.headers["X-Token"]HttpRequestjson.load(f),yaml.load(f),pickle.load(f),toml.load(f)— even iffcame from a CLI-specified pathOperatorConfig@app.command()params (Typer),@click.argument(),@click.option(),sys.argv[n]OperatorConfigos.environ.get("DB_URL")— set by deployment operatoros.environ.get("SEMGREP_URL")— in GitHub Actions, env vars can be set by PR authors via workflow triggers;ENV_URL001/ENV_GIT_URL001catch this regardless of taint originDeveloperDefinedBenchmark Scans: Previous vs Current
True positives by repo (to be analyzed)
==in auth), G405Signal-to-noise ratio
New
Old (repos that completed)
Files changed
src/pyspector/rules/built-in-rules.toml— 142 rules deleted, 25 added, 41 modified, 2 activated; net: 269 → 127 rulessrc/pyspector/_rust_core/src/analysis/taint_analysis.rs— taint engine: CLI vs HTTP origin, sys.argv/os.environ → OperatorConfig, dead function removedsrc/pyspector/_rust_core/src/graph/call_graph_builder.rs— O(1) call resolution, test/docs file exclusionsrc/pyspector/_rust_core/src/analysis/ast_analysis.rs— per-file exclusion pre-filter, unused import removedsrc/pyspector/_rust_core/src/analysis/mod.rs— phase timing, parallel scanningsrc/pyspector/_rust_core/src/rules.rs—file_content_exclude,vulnerable_keyword, comma-split patternssrc/pyspector/cli.py— per-phase timing instrumentationsrc/pyspector/reporting.py— severity serialization fixed (was uppercasing "HIGH", now preserves "High")src/pyspector/triage.py— unused import removedtests/unit/— 168 tests, all passing (including previously broken reporting_test.py)Tests changed