How It Works
This page explains PCART's internal mechanisms: the pipeline, workspace layout, pkl lifecycle, matching strategy, and data flow.
Source Code
  ──► Preprocess             (code flattening + instrumentation)
  ──► API Extract            (AST-based call discovery)
  ──► API Map                (dynamic + static signature match)
  ──► Compatibility Analyze  (change type detection)
  ──► Auto Repair            (AST-level fix + dynamic/static validation)
  ──► Report
- Preprocess: Flattens control flow (list/dict comprehensions, conditional returns), converts tabs to spaces, merges multi-line calls into single lines, then instruments the code with recordValue.py to capture runtime API call data.
- API Extract: Walks the project AST to discover all calls to the target library, resolving aliases from import statements, assignment chains (a = X(); a.f()), and with/async with context managers. Produces structured CallsiteRecord dictionaries keyed by artifact id.
- API Map: Matches each discovered API call to its library definition, first via dynamic matching (running the instrumented code), then falling back to static fuzzy matching if dynamic matching fails.
- Compatibility Analyze: Compares current vs target API signatures to detect parameter changes (addition, removal, renaming, reordering, type changes, positional↔keyword conversions).
- Auto Repair: Applies AST-level fixes for incompatible calls, then validates fixes by running the repaired code in the target environment if parameter values can be recovered from pkl files, otherwise by static signature validation.
- Report: Generates an internal report inside the workspace, then exports it to Report/runs/{run_id}/{command_id}/ for user access.
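The kind of comparison performed in the Compatibility Analyze stage can be illustrated with a small sketch built on the standard inspect module. The helper name diff_signatures, the result layout, and the two resize functions are assumptions made up for this example; they are not PCART's internal API, and reordering and type-change detection are left out for brevity.

```python
import inspect


def diff_signatures(current, target):
    """Compare two callables and report basic parameter changes.

    Hypothetical helper for illustration only; not PCART's implementation.
    """
    cur = inspect.signature(current).parameters
    tgt = inspect.signature(target).parameters

    changes = {
        "removed": [name for name in cur if name not in tgt],
        "added": [name for name in tgt if name not in cur],
        "kind_changed": [],
    }
    # A parameter present in both versions may still change kind,
    # e.g. positional-or-keyword -> keyword-only.
    for name in cur:
        if name in tgt and cur[name].kind != tgt[name].kind:
            changes["kind_changed"].append(name)
    return changes


# Two stand-in versions of the same API (invented for this example).
def resize_v1(image, size, interpolation="bilinear"):
    ...


def resize_v2(image, *, size, mode="bilinear", antialias=False):
    ...


print(diff_signatures(resize_v1, resize_v2))
```

A name-based diff like this cannot tell a rename (interpolation to mode) from an unrelated removal plus addition; separating those cases needs extra evidence such as position or default values.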
Each PCART run creates an isolated workspace under PCARTRuns/runs/:
PCARTRuns/runs/
└── {project}__{timestamp}-{NNN}/    # run_id
    └── cmd-001/                     # command_id
        ├── Copy/                    # Instrumented project copy for pkl generation
        │   ├── {projName}/          # Project copy
        │   └── pkl/                 # Runtime pkl files and manifests
        ├── Dynamic/                 # Stripped project copy for dynamic matching scripts
        │   └── {projName}/
        ├── data/                    # Intermediate JSON (call dicts, match snapshots)
        ├── temp/                    # Temporary files
        ├── Report/                  # Internal report (exported to Report/runs/...)
        └── metadata.json            # Run configuration metadata
User-visible reports are exported to:
Report/runs/
└── {run_id}/
    └── {command_id}/
        ├── {project}.txt      # Repair report
        ├── patches/           # Source patches
        └── fixed_project/     # Repaired project copy
The Copy/bak_<projName>/ backup still exists inside the workspace for single-callsite re-instrumentation during target-environment pkl regeneration.
pkl (pickle) files are the core data artifact that carries runtime API call information between pipeline stages.
When the instrumented project runs in the currentEnv:
- paraValueDict collects each API call's receiver object and parameter values at runtime.
- apiCoveredSet records which callsites were actually executed.
- callsiteInfoDict records structured callsite metadata for each instrumented call.
- At process exit (atexit), savePkls() writes pkl files to the workspace's Copy/pkl/.
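The collect-then-flush pattern described above can be sketched as follows. This is a simplified illustration, not the contents of recordValue.py: the helpers record_call, save_pkls, and install are hypothetical, the payload layout is assumed, and the standard pickle module is used for serialization.

```python
import atexit
import pickle
from pathlib import Path

# In-memory stores, named after the ones described above.
paraValueDict = {}
apiCoveredSet = set()


def record_call(artifact_id, receiver, args, kwargs):
    """Remember one executed callsite (hypothetical helper)."""
    paraValueDict[artifact_id] = {
        "receiver": receiver,
        "args": args,
        "kwargs": kwargs,
    }
    apiCoveredSet.add(artifact_id)


def save_pkls(out_dir):
    """Write one pkl per recorded callsite and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for artifact_id, payload in paraValueDict.items():
        path = out / f"{artifact_id}.pkl"
        with path.open("wb") as fh:
            pickle.dump(payload, fh)
        written.append(path)
    return written


def install(out_dir):
    """Register the flush so it runs when the interpreter exits."""
    atexit.register(save_pkls, out_dir)
```

Deferring the write to process exit keeps the instrumented calls cheap; the trade-off is that nothing is saved if the process is killed before atexit handlers run.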
pkl files are named by the structured artifact id produced by CallsiteIdentity.artifactId(), which uniquely identifies each callsite:
{rel_path_slug}__L{lineno}C{col_offset}__{call_slug}__{SHA256_hash}.pkl
Example: src_main_py__L42C5__torch_tensor__a1b2c3d4...64hex.pkl
The artifact id combines:
- File slug: project-relative path, normalized (64 char max)
- Position: L{lineno}C{col_offset} for precise source location
- Call slug: normalized API name before the first "(" (48 char max)
- Hash: SHA256 of the full callsite payload for global uniqueness
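A minimal sketch of how the four parts could be combined is shown below. The slug rule (runs of non-alphanumeric characters become "_") and the exact payload that gets hashed are assumptions for illustration; the real CallsiteIdentity.artifactId() may normalize and hash differently.

```python
import hashlib
import re


def slugify(text, max_len):
    """Replace runs of non-alphanumeric characters with '_' and truncate."""
    return re.sub(r"[^0-9A-Za-z]+", "_", text).strip("_")[:max_len]


def artifact_id(rel_path, lineno, col_offset, call_text):
    """Build an id in the documented shape (hypothetical helper)."""
    file_slug = slugify(rel_path, 64)
    # Call slug: the API name before the first "(".
    call_slug = slugify(call_text.split("(", 1)[0], 48)
    # Assumption: the hashed payload is path, position, and call text.
    payload = f"{rel_path}:{lineno}:{col_offset}:{call_text}"
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{file_slug}__L{lineno}C{col_offset}__{call_slug}__{digest}"


print(artifact_id("src/main.py", 42, 5, "torch.tensor([1, 2, 3])"))
```

The readable prefix makes files easy to locate by eye, while the hash keeps ids distinct when slugs are truncated or two paths normalize to the same slug.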
For calls within with/async with blocks, two candidate pkl files are generated per callsite:
| Suffix | Content |
|---|---|
| __object.pkl | Receiver captured as a runtime object via dill |
| __expr.pkl | Receiver captured as a string expression (fallback) |
The dual-candidate approach handles cases where the runtime object cannot be pickled (e.g., locked resources, C extension objects).
Each callsite also produces a .manifest.json file tracking candidate statuses:
{
"callsite": "src_main_py__L42C5__client_publish__a1b2c3d4...64hex",
"covered": true,
"candidates": [
{"kind": "object", "status": "saved", "pkl": "src_main_py__L42C5__client_publish__a1b2...__object.pkl"},
{"kind": "expr", "status": "saved", "pkl": "src_main_py__L42C5__client_publish__a1b2...__expr.pkl"}
]
}
Candidates marked save_failed are re-generated in the target environment if needed.
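The dual-candidate idea and the manifest can be sketched together. In this illustration the standard pickle module stands in for dill, the helper save_candidates is hypothetical, and the manifest file name {artifact_id}.manifest.json is an assumption based on the suffix mentioned above.

```python
import json
import pickle
import tempfile
import threading
from pathlib import Path


def save_candidates(pkl_dir, artifact_id, receiver, receiver_expr):
    """Try both receiver representations and record each outcome."""
    pkl_dir = Path(pkl_dir)
    pkl_dir.mkdir(parents=True, exist_ok=True)
    candidates = []
    for kind, value in (("object", receiver), ("expr", receiver_expr)):
        name = f"{artifact_id}__{kind}.pkl"
        try:
            data = pickle.dumps(value)  # stdlib pickle stands in for dill
        except Exception:
            candidates.append({"kind": kind, "status": "save_failed", "pkl": name})
            continue
        (pkl_dir / name).write_bytes(data)
        candidates.append({"kind": kind, "status": "saved", "pkl": name})

    manifest = {"callsite": artifact_id, "covered": True, "candidates": candidates}
    manifest_path = pkl_dir / f"{artifact_id}.manifest.json"  # assumed file name
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest


# A lock cannot be pickled, so only the expression candidate is saved.
manifest = save_candidates(
    tempfile.mkdtemp(), "demo_callsite", threading.Lock(), "threading.Lock()"
)
print(manifest["candidates"])
```

Recording a status per candidate lets later stages skip files that were never written instead of discovering the failure at load time.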
If a pkl from currentEnv cannot be loaded in targetEnv (due to serialization format changes), PCART automatically:
- Re-instruments the source file for the failed callsite only.
- Re-runs the project in targetEnv.
- Saves the new pkl with a new_ prefix (e.g., new_src_main_py__L42C5__client_publish__a1b2...__object.pkl).
pkl files are consumed by three later steps:
- Dynamic matching (dynamicMatch.py): Loads the pkl, reconstructs the API callable, and uses inspect.signature() to extract the parameter signature.
- Value addition (addValueForAPI.py): Loads the pkl to fill in concrete parameter values for repair validation.
- Repair validation (verifySingle.py): Loads the pkl and attempts to call the repaired API with recovered values.
When multiple pkl files exist for a callsite, PCART tries them in priority order:
- new_<artifact_id>__object.pkl (re-generated in target env, runtime object)
- new_<artifact_id>__expr.pkl (re-generated in target env, expression fallback)
- new_<artifact_id>.pkl (re-generated in target env, legacy format)
- <artifact_id>__object.pkl (original from current env, runtime object)
- <artifact_id>__expr.pkl (original from current env, expression fallback)
- <artifact_id>.pkl (original from current env, legacy format)
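The priority order above amounts to two nested preferences: regenerated files before originals, and object before expr before legacy. A minimal sketch, with hypothetical helper names, could look like this. It returns the first candidate that exists on disk; a fuller version would also move on to the next candidate when loading fails.

```python
from pathlib import Path


def candidate_names(artifact_id):
    """Candidate file names, highest priority first."""
    suffixes = ("__object.pkl", "__expr.pkl", ".pkl")
    prefixes = ("new_", "")  # files regenerated in the target env win
    return [f"{p}{artifact_id}{s}" for p in prefixes for s in suffixes]


def pick_pkl(pkl_dir, artifact_id):
    """Return the first existing candidate path, or None."""
    for name in candidate_names(artifact_id):
        path = Path(pkl_dir) / name
        if path.is_file():
            return path
    return None
```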
PCART uses a two-tier matching strategy: dynamic matching first, with static matching as the fallback.
Dynamic matching:
- Load the pkl file for the callsite.
- Reconstruct the API callable using the runtime receiver object and parameter values.
- Call inspect.signature() to get the actual runtime signature.
- Record the internal file path (inspect.getfile()) for disambiguation.
Matching falls back to static when:
- No pkl exists for the callsite (not covered by test execution).
- inspect.signature() cannot produce a signature (built-in or C extension API).
- pkl fails to load (serialization error).
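The signature lookup and its failure path can be sketched with the standard inspect module. The function name runtime_signature and the None return convention are assumptions for this example; dynamicMatch.py may signal the fallback differently.

```python
import inspect


def runtime_signature(api):
    """Return (signature, source_file), or None when static matching is needed."""
    try:
        sig = inspect.signature(api)
    except (TypeError, ValueError):
        # ValueError: no signature is available (seen with some C extension APIs).
        # TypeError: the object is not a supported callable.
        return None
    try:
        source_file = inspect.getfile(api)
    except (TypeError, OSError):
        source_file = None  # built-in objects have no source file
    return sig, source_file
```

Treating both exception types as "no dynamic match" keeps the caller simple: a None result always means the static tier should take over.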
Static matching:
- Match the API call name against the pre-extracted library API definitions (from LibAPIExtraction/).
- Use fuzzy matching: match by the last segment of the API name, then filter by name overlap.
- Check for import aliases via the library's __init__.py assignments.
- Distinguish between .pyi-declared (built-in) APIs and .py-declared APIs.
PCART detects API calls through with and async with context managers:
async with aiofiles.open("file.txt") as f:  # ← withitem: aiofiles.open("file.txt")
    await f.read()                          # ← alias "f" → aiofiles.open("file.txt").read()
How it works:
- Extraction (WithVisitor): Records each withitem's context_expr (the API call) and optional_vars (the alias), along with the line number range of the with block.
- Alias resolution (modifyWithName): When a call uses the alias (e.g., f.read()), it is recursively resolved to the full API path (e.g., aiofiles.open("file.txt").read()). For nested with blocks with same-named aliases, the innermost scope takes precedence.
- Receiver capture (recordValue.py): For withitem callers, both the runtime object and the string expression are saved as separate pkl candidates, since the runtime object may not survive serialization.
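The extraction and alias-resolution steps can be sketched with the standard ast module. The class WithAliasResolver below is a hypothetical stand-in for WithVisitor and modifyWithName, not their actual code: it resolves a single level of aliasing, does not record the context expressions themselves, and relies on ast.unparse, which requires Python 3.9 or later.

```python
import ast

SOURCE = '''
async def main():
    async with aiofiles.open("file.txt") as f:
        await f.read()
'''


class WithAliasResolver(ast.NodeVisitor):
    """Map with-statement aliases to context expressions and expand aliased calls."""

    def __init__(self):
        self.scopes = []    # stack of {alias: context expression text}
        self.resolved = []  # expanded call texts found inside with bodies

    def _visit_with(self, node):
        aliases = {}
        for item in node.items:
            if isinstance(item.optional_vars, ast.Name):
                aliases[item.optional_vars.id] = ast.unparse(item.context_expr)
        self.scopes.append(aliases)
        for stmt in node.body:
            self.visit(stmt)
        self.scopes.pop()

    visit_With = _visit_with
    visit_AsyncWith = _visit_with

    def visit_Call(self, node):
        text = ast.unparse(node)
        root = node.func
        while isinstance(root, ast.Attribute):
            root = root.value
        if isinstance(root, ast.Name):
            # Search the innermost scope first so nested aliases shadow outer ones.
            for scope in reversed(self.scopes):
                if root.id in scope:
                    text = scope[root.id] + text[len(root.id):]
                    break
        self.resolved.append(text)
        self.generic_visit(node)


resolver = WithAliasResolver()
resolver.visit(ast.parse(SOURCE))
print(resolver.resolved)
```

Note that ast.unparse normalizes string quotes, so the expanded call is reported with single quotes around file.txt even though the source used double quotes.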
To distinguish the same API called at different source locations, PCART uses a structured CallsiteIdentity to generate a stable, readable artifact id:
from dataclasses import dataclass

@dataclass(frozen=True)
class CallsiteIdentity:
    rel_path: str         # project-relative file path
    lineno: int           # line number
    col_offset: int       # column offset
    end_lineno: int       # end line number
    end_col_offset: int   # end column offset
    call_text: str        # original call text (for report display)
    normalized_call: str  # whitespace/quote-normalized call text (for identity)
The artifactId() method produces a stable key in the form:
{rel_slug}__L{lineno}C{col_offset}__{call_slug}__{SHA256_hash}
The CallsiteRecord separates this identity into three concerns:
- artifact_id: Used for pkl/json/manifest/shared dictionary keys throughout the pipeline
- call_text: Original call text for report display
- format_api: Restored API path for static fuzzy matching
This key is used consistently throughout the pipeline:
- Preprocess: As the dictionary key in paraValueDict and callsiteInfoDict instrumentation.
- pkl naming: As the base for pkl file names.
- Dynamic matching: As the lookup key to find the correct pkl.
- Shared dictionary: As the cache key to avoid redundant matching across files.
Without callsite-based identification, two calls to the same API at different lines would share one pkl and one match result, causing incorrect signature assignments.
- Quick-Start: Basic usage tutorial
- Configuration-Guide: Complete configuration reference
- Troubleshooting: Common issues and solutions