Add AST-based chunking for Python files #2

Open
devangpratap wants to merge 6 commits into main from feature/ast-chunking

@devangpratap
Owner

Summary

  • Added ast_parser.py that uses Python's built-in ast module to extract functions, classes, and module-level code as individual chunks
  • Integrated AST parsing into the indexer with automatic fallback to line-based chunking for unsupported files or syntax errors
  • Added AST_ENABLED and AST_EXTENSIONS config options for easy toggling
  • Improved search result display to show symbol type/name for AST-parsed chunks
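The extraction step described in the first bullet could look roughly like the sketch below. It is not the actual ast_parser.py: the `Chunk` record, the `extract_chunks` name, and the choice to group consecutive module-level statements into a single chunk are assumptions made for illustration. Only the use of Python's built-in `ast` module comes from the PR.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical chunk record; the real ast_parser.py may use a different shape."""
    symbol_type: str  # "function", "class", or "module"
    name: str
    start_line: int   # 1-indexed, inclusive
    end_line: int     # 1-indexed, inclusive
    text: str


def extract_chunks(source: str) -> List[Chunk]:
    """Split Python source into top-level function, class, and module-level chunks.

    Raises SyntaxError if the source does not parse, so a caller can fall back.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[Chunk] = []
    pending: List[ast.stmt] = []  # consecutive module-level statements

    def flush_pending() -> None:
        # Emit the accumulated module-level statements as one chunk.
        if not pending:
            return
        start, end = pending[0].lineno, pending[-1].end_lineno
        chunks.append(Chunk("module", "<module>", start, end,
                            "\n".join(lines[start - 1:end])))
        pending.clear()

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            flush_pending()
            # Start at the first decorator, if any, so the chunk is complete.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append(Chunk(kind, node.name, start, end,
                                "\n".join(lines[start - 1:end])))
        else:
            pending.append(node)
    flush_pending()
    return chunks
```

In this sketch methods stay inside their class chunk. Whether the real parser also emits methods as separate chunks is not something this example tries to reproduce. `end_lineno` requires Python 3.8 or later.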

Why

Line-based chunking often splits functions in half, producing embeddings that don't represent any coherent code unit. AST-aware chunking ensures each chunk is a complete function, class, or logical block, which should improve search relevance.

Test plan

  • Run python main.py -d . on the Codeseek repo itself and verify chunks map to actual functions/classes
  • Test with a file containing syntax errors to confirm fallback works
  • Disable AST via config and verify line-based chunking still works
  • Search for a function by description and confirm it ranks higher than before
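The fallback and toggle cases in the test plan can be checked against a small model of the intended behaviour. The sketch below is an assumption-laden stand-in, not the indexer's real code: `chunk_source`, `ast_chunks`, `line_chunks`, and `LINES_PER_CHUNK` are hypothetical names, and the default values of `AST_ENABLED` and `AST_EXTENSIONS` are guesses. Only the two setting names come from the PR.

```python
import ast
from pathlib import Path
from typing import List, Tuple

# Setting names come from the PR description; the values here are assumptions.
AST_ENABLED = True
AST_EXTENSIONS = {".py"}
LINES_PER_CHUNK = 40  # hypothetical window size for the line-based fallback


def line_chunks(source: str, size: int = LINES_PER_CHUNK) -> List[Tuple[str, str]]:
    """Fallback: fixed-size windows of lines, each tagged "lines"."""
    lines = source.splitlines()
    return [("lines", "\n".join(lines[i:i + size]))
            for i in range(0, len(lines), size)]


def ast_chunks(source: str) -> List[Tuple[str, str]]:
    """One chunk per top-level function or class, each tagged "ast".

    Module-level code is left out here to keep the sketch short.
    """
    tree = ast.parse(source)  # raises SyntaxError on invalid code
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(("ast", segment))
    return chunks


def chunk_source(path: str, source: str,
                 ast_enabled: bool = AST_ENABLED,
                 ast_extensions=AST_EXTENSIONS) -> List[Tuple[str, str]]:
    """Try AST chunking for supported extensions, else use line-based chunks."""
    if ast_enabled and Path(path).suffix in ast_extensions:
        try:
            chunks = ast_chunks(source)
            if chunks:
                return chunks
        except SyntaxError:
            pass  # fall through to line-based chunking
    return line_chunks(source)
```

Passing `ast_enabled=False` models disabling AST via config, and a source string with a syntax error exercises the fallback path without needing a file on disk.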

Introduces ast_parser.py which uses Python's built-in ast module
to extract functions, classes, and methods as individual chunks.
This provides semantically meaningful code units for embedding
instead of arbitrary line-based splits.

The indexer now attempts AST-based parsing for .py files before
falling back to line-based chunking. Added AST_ENABLED and
AST_EXTENSIONS settings to config for easy control.

Show the symbol type and name (function/class) in search results
when available from AST parsing. Also increased preview from 5
to 8 lines for better context visibility.

Document the new AST-based parsing step and add ast_parser.py
to the project structure section.
@devangpratap force-pushed the feature/ast-chunking branch from 6671ed8 to 1d25325 on May 9, 2026 04:38