Skip to content

feat: use tree-sitter to parse requirements.txt more correctly#391

Open
Strum355 wants to merge 1 commit intoguacsec:mainfrom
Strum355:nsc/requirements-txt-treesitter
Open

feat: use tree-sitter to parse requirements.txt more correctly#391
Strum355 wants to merge 1 commit intoguacsec:mainfrom
Strum355:nsc/requirements-txt-treesitter

Conversation

@Strum355
Copy link
Member

@Strum355 Strum355 commented Feb 26, 2026

Description

Replaced the original manual "parsing" of requirements.txt with a proper tree-sitter based approach. For more info on tree-sitter see https://tree-sitter.github.io/tree-sitter/

Related issues (if any): #242

Checklist

  • I have followed this repository's contributing guidelines.
  • I will adhere to the project's code of conduct.

Additional information

Tested this with https://github.com/fabric8-analytics/fabric8-analytics-vscode-extension, some changes required there will be included in the PR that bumps the version of this library too

@Strum355 Strum355 force-pushed the nsc/requirements-txt-treesitter branch 4 times, most recently from cd46b6f to d906a56 Compare February 27, 2026 16:53
@Strum355 Strum355 changed the title wip feat: use tree-sitter to parse requirements.txt more correctly feat: use tree-sitter to parse requirements.txt more correctly Feb 27, 2026
@Strum355 Strum355 force-pushed the nsc/requirements-txt-treesitter branch 3 times, most recently from 0c78fce to 62d1183 Compare March 4, 2026 12:55
@Strum355 Strum355 marked this pull request as ready for review March 4, 2026 13:04
@Strum355 Strum355 requested a review from ruromero March 4, 2026 13:04
@qodo-code-review
Copy link

Review Summary by Qodo

Replace requirements.txt parsing with tree-sitter and add async support

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Replace manual requirements.txt parsing with tree-sitter for better accuracy
• Add async/await support to provider methods for proper Promise handling
• Implement tree-sitter-based requirement parsing with version extraction
• Add web-tree-sitter and tree-sitter-requirements dependencies
Diagram
flowchart LR
  A["Manual String Parsing"] -->|"Replace with"| B["Tree-Sitter Parser"]
  B -->|"Extract"| C["Package Names & Versions"]
  D["Sync Methods"] -->|"Convert to"| E["Async Methods"]
  E -->|"Return"| F["Promise<Provided>"]
  G["Dependencies"] -->|"Add"| H["web-tree-sitter<br/>tree-sitter-requirements"]
Loading

Grey Divider

File Changes

1. src/providers/requirements_parser.js ✨ Enhancement +34/-0

New tree-sitter based requirements.txt parser module

src/providers/requirements_parser.js


2. src/providers/python_controller.js ✨ Enhancement +66/-68

Refactor to use tree-sitter parsing and async methods

src/providers/python_controller.js


3. src/providers/python_pip.js ✨ Enhancement +38/-55

Convert provider methods to async and use tree-sitter

src/providers/python_pip.js


View more (8)
4. src/analysis.js 🐞 Bug fix +2/-2

Add await keywords for async provider method calls

src/analysis.js


5. src/provider.js ✨ Enhancement +1/-1

Update Provider typedef to support Promise return types

src/provider.js


6. src/sbom.js Formatting +5/-5

Minor formatting improvements for consistency

src/sbom.js


7. package.json Dependencies +3/-0

Add tree-sitter and web-tree-sitter dependencies

package.json


8. test/providers/python_pip.test.js 🧪 Tests +54/-54

Update tests to handle async provider methods

test/providers/python_pip.test.js


9. test/providers/tst_manifests/pip/pip_requirements_txt_ignore/requirements.txt 🧪 Tests +5/-0

Add test cases for complex requirement specifications

test/providers/tst_manifests/pip/pip_requirements_txt_ignore/requirements.txt


10. test/providers/tst_manifests/pip/pip_requirements_txt_ignore/expected_component_sbom.json 🧪 Tests +24/-0

Update expected SBOM with new parsed dependencies

test/providers/tst_manifests/pip/pip_requirements_txt_ignore/expected_component_sbom.json


11. test/providers/tst_manifests/pip/pip_requirements_txt_ignore/expected_stack_sbom.json 🧪 Tests +24/-0

Update expected stack SBOM with new parsed dependencies

test/providers/tst_manifests/pip/pip_requirements_txt_ignore/expected_stack_sbom.json


Grey Divider

Qodo Logo

@qodo-code-review
Copy link

qodo-code-review bot commented Mar 4, 2026

Code Review by Qodo

🐞 Bugs (5) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Missing await on install 🐞 Bug ✓ Correctness
Description
getDependencies() calls the now-async #installingRequirementsOneByOne() without awaiting it, so
dependency discovery and cleanup can run before installs finish. This can yield incomplete/incorrect
dependency graphs and pip install/uninstall races.
Code

src/providers/python_controller.js[R153-156]

				this.#installingRequirementsOneByOne()
			}
		}
-		let dependencies = this.#getDependenciesImpl(includeTransitive)
+		let dependencies = await this.#getDependenciesImpl(includeTransitive)
Evidence
The best-effort installation method was changed to async (it awaits requirement parsing), but the
caller does not await it and immediately proceeds to compute dependencies and then uninstall,
creating a race condition.

src/providers/python_controller.js[124-176]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`Python_controller.getDependencies()` calls `#installingRequirementsOneByOne()` without `await`, but that method is `async` and awaits parsing before installing packages. This introduces a race where dependency tree calculation and uninstall can run before installation completes.

## Issue Context
This happens only in the `TRUSTIFY_DA_PYTHON_INSTALL_BEST_EFFORTS=true` path.

## Fix Focus Areas
- src/providers/python_controller.js[129-176]

## Suggested fix
- Change `this.#installingRequirementsOneByOne()` to `await this.#installingRequirementsOneByOne()`.
- (Optional hardening) Replace `requirements.forEach(...)` with `for (const req of requirements) { ... }` to ensure deterministic sequential installs and clearer error propagation.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. SBOM generated twice 🐞 Bug ➹ Performance
Description
python_pip.provideStack() calls createSbomStackAnalysis() twice, doubling runtime and repeating
heavy side effects like environment creation and pip operations. This can also produce inconsistent
results if installs/uninstalls happen twice.
Code

src/providers/python_pip.js[R45-52]

+async function provideStack(manifest, opts = {}) {
+	const sbom = await createSbomStackAnalysis(manifest, opts)
+	console.log(`WHAT ${JSON.stringify(sbom, null, '\t')}`)
	return {
		ecosystem,
-		content: createSbomStackAnalysis(manifest, opts),
+		content: await createSbomStackAnalysis(manifest, opts),
		contentType: 'application/vnd.cyclonedx+json'
	}
Evidence
provideStack computes the SBOM once into a variable (then logs it) but returns the result of a
second call. createSbomStackAnalysis invokes Python_controller.getDependencies which may
install/uninstall packages, so duplication is expensive and risky.

src/providers/python_pip.js[45-52]
src/providers/python_pip.js[186-201]
src/providers/python_controller.js[136-165]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`python_pip.provideStack()` calls `createSbomStackAnalysis()` twice. This doubles execution time and repeats expensive side effects (pip install/uninstall and/or virtualenv operations).

## Issue Context
`createSbomStackAnalysis()` constructs a `Python_controller` and calls `getDependencies()`, which may install dependencies and then uninstall them.

## Fix Focus Areas
- src/providers/python_pip.js[45-52]
- src/providers/python_pip.js[186-202]

## Suggested fix
- Replace `content: await createSbomStackAnalysis(...)` with `content: sbom`.
- Remove the `console.log(...)` line, or gate it behind `TRUSTIFY_DA_DEBUG === &quot;true&quot;` and log a bounded message (not full SBOM).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Git dep uses git+ssh 🐞 Bug ⛯ Reliability
Description
The added GitHub dependency resolves in package-lock.json to a git+ssh URL, which commonly fails in
CI/user environments without GitHub SSH keys configured. This can block installation entirely.
Code

package.json[R56-57]

+		"tree-sitter-requirements": "github:Strum355/tree-sitter-requirements",
+		"web-tree-sitter": "^0.26.6",
Evidence
package.json adds a GitHub shorthand dependency; the lockfile shows npm resolved it via
git+ssh://git@github.com/..., which requires SSH-based git access and will often fail for public CI
runners or users without SSH setup.

package.json[46-58]
package-lock.json[7507-7512]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The dependency `tree-sitter-requirements` is declared via GitHub shorthand. In `package-lock.json` it resolves to `git+ssh://git@github.com/...`, which frequently fails without SSH keys.

## Issue Context
This affects every `npm ci` / `npm install` consumer of this package.

## Fix Focus Areas
- package.json[56-57]
- package-lock.json[7507-7511]

## Suggested fix
- Update `package.json` to use an HTTPS git URL pinned to a commit/tag, e.g. `&quot;tree-sitter-requirements&quot;: &quot;git+https://github.com/Strum355/tree-sitter-requirements.git#&lt;sha&gt;&quot;` (or a published semver npm package).
- Re-run `npm install` (or `npm ci` with regeneration as appropriate) to update `package-lock.json` so `resolved` uses `git+https` not `git+ssh`.
- Verify CI can install without SSH configuration.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

4. WASM init not cached 🐞 Bug ➹ Performance
Description
requirements_parser helpers each call init(), repeatedly running Parser.init() and Language.load()
which reloads WASM and the language multiple times per request. This adds avoidable latency and
could cause duplicate initialization behavior.
Code

src/providers/requirements_parser.js[R7-34]

+async function init() {
+	await Parser.init({
+		locateFile() {
+			return require.resolve('web-tree-sitter/web-tree-sitter.wasm')
+		}
+	});
+	return await Language.load(require.resolve('tree-sitter-requirements/tree-sitter-requirements.wasm'));
+}
+
+export async function getParser() {
+	const language = await init();
+	return new Parser().setLanguage(language);
+}
+
+export async function getRequirementQuery() {
+	const language = await init();
+	return new Query(language, '(requirement (package) @name) @req');
+}
+
+export async function getIgnoreQuery() {
+	const language = await init();
+	return new Query(language, '((requirement (package) @name) @req . (comment) @comment (#match? @comment "^#[\\t ]*exhortignore"))');
+}
+
+export async function getPinnedVersionQuery() {
+	const language = await init();
+	return new Query(language, '(version_spec (version_cmp) @cmp (version) @version (#eq? @cmp "=="))');
+}
Evidence
The module defines init() that performs expensive WASM initialization and language loading; every
exported accessor calls init() independently. Callers often request multiple accessors (parser +
multiple queries), multiplying the work.

src/providers/requirements_parser.js[7-34]
src/providers/python_controller.js[46-56]
src/providers/python_pip.js[92-96]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`requirements_parser.js` re-runs `Parser.init()` and `Language.load()` every time `getParser()` / `get*Query()` is called. Callers often call multiple helpers per request, causing redundant WASM work.

## Issue Context
Both `Python_controller` and `python_pip` call multiple helpers (parser + queries).

## Fix Focus Areas
- src/providers/requirements_parser.js[7-34]
- src/providers/python_controller.js[46-56]
- src/providers/python_pip.js[92-96]

## Suggested fix
- Add a module-level memoized `languagePromise` (or `initPromise`) so initialization is performed once.
- Implement `getLanguage()` that returns the cached promise.
- Update `getParser()` and all `get*Query()` functions to call `await getLanguage()` instead of `await init()`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Debug logs; tests commented 🐞 Bug ✧ Quality
Description
Placeholder console logs (e.g., "BANANA"/"WHAT") are left in production code, and multiple tests are
commented out, reducing signal and coverage for key scenarios. This makes failures noisier and
increases regression risk for the new parsing logic.
Code

src/providers/python_controller.js[R102-114]

+	async #parseRequirements() {
+		const content = fs.readFileSync(this.pathToRequirements).toString();
+		const tree = (await this.parser).parse(content);
+		return Promise.all((await this.requirementsQuery).matches(tree.rootNode).map(async (match) => {
+			const reqNode = match.captures.find(c => c.name === 'req').node;
+			const name = match.captures.find(c => c.name === 'name').node.text;
+			const versionMatches = (await this.pinnedVersionQuery).matches(reqNode);
+			const version = versionMatches.length > 0
+				? versionMatches[0].captures.find(c => c.name === 'version').node.text
+				: null;
+			console.log(`BANANA ${name} ${version}`)
+			return { name, version };
+		}));
Evidence
The code prints requirement names/versions unconditionally during parsing and prints the full stack
SBOM in provideStack; meanwhile, python_pip tests disable several suites/cases instead of fixing
them or marking them skipped with rationale.

src/providers/python_controller.js[98-114]
src/providers/python_pip.js[45-51]
test/providers/python_pip.test.js[50-117]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Production code contains unconditional debug `console.log(...)` statements, and multiple tests are commented out, lowering test coverage for important scenarios.

## Issue Context
The new tree-sitter parsing behavior is complex and needs strong regression coverage.

## Fix Focus Areas
- src/providers/python_controller.js[98-114]
- src/providers/python_pip.js[45-51]
- test/providers/python_pip.test.js[50-117]

## Suggested fix
- Remove the `console.log(&#x27;BANANA ...&#x27;)` and `console.log(&#x27;WHAT ...&#x27;)` statements, or wrap them with `if (process.env.TRUSTIFY_DA_DEBUG === &#x27;true&#x27;)` and avoid logging full SBOM payloads.
- Re-enable the commented-out tests; if they are flaky/slow, convert them to `test.skip(...)` or `suite.skip(...)` with a comment explaining why and how to re-enable, rather than commenting blocks out.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@Strum355 Strum355 force-pushed the nsc/requirements-txt-treesitter branch from 62d1183 to d7f26ca Compare March 4, 2026 13:05
Comment on lines +56 to +57
"tree-sitter-requirements": "github:Strum355/tree-sitter-requirements",
"web-tree-sitter": "^0.26.6",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

3. Git dep uses git+ssh 🐞 Bug ⛯ Reliability

The added GitHub dependency resolves in package-lock.json to a git+ssh URL, which commonly fails in
CI/user environments without GitHub SSH keys configured. This can block installation entirely.
Agent Prompt
## Issue description
The dependency `tree-sitter-requirements` is declared via GitHub shorthand. In `package-lock.json` it resolves to `git+ssh://git@github.com/...`, which frequently fails without SSH keys.

## Issue Context
This affects every `npm ci` / `npm install` consumer of this package.

## Fix Focus Areas
- package.json[56-57]
- package-lock.json[7507-7511]

## Suggested fix
- Update `package.json` to use an HTTPS git URL pinned to a commit/tag, e.g. `"tree-sitter-requirements": "git+https://github.com/Strum355/tree-sitter-requirements.git#<sha>"` (or a published semver npm package).
- Re-run `npm install` (or `npm ci` with regeneration as appropriate) to update `package-lock.json` so `resolved` uses `git+https` not `git+ssh`.
- Verify CI can install without SSH configuration.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@Strum355 Strum355 force-pushed the nsc/requirements-txt-treesitter branch from d7f26ca to 57aafed Compare March 4, 2026 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Naive parsing of python's requirements.txt

1 participant