How It Works

PCART Bot edited this page May 11, 2026 · 5 revisions

How PCART Works

This page explains PCART's internal mechanisms: the pipeline, workspace layout, pkl lifecycle, matching strategy, and data flow.

Pipeline Overview

Source Code ──► Preprocess ──► API Extract ──► API Map ──► Compatibility Analyze ──► Auto Repair ──► Report
                  │                │              │              │                      │
                  ▼                ▼              ▼              ▼                      ▼
             Code flattening   AST-based     Dynamic+Static   Change type          AST-level fix
             + instrumentation call discovery  signature match  detection          + dynamic/static validate

Step-by-step

  1. Preprocess: Flattens control flow (list/dict comprehensions, conditional returns), converts tabs to spaces, merges multi-line calls into single lines, then instruments the code with recordValue.py to capture runtime API call data.
  2. API Extract: Walks the project AST to discover all calls to the target library, resolving aliases from import statements, assignment chains (a = X(); a.f()), and with/async with context managers. Produces structured CallsiteRecord dictionaries keyed by artifact id.
  3. API Map: Matches each discovered API call to its library definition, first via dynamic matching (running the instrumented code), then via static fuzzy matching if dynamic matching fails.
  4. Compatibility Analyze: Compares current vs target API signatures to detect parameter changes (addition, removal, renaming, reordering, type changes, positional↔keyword conversions).
  5. Auto Repair: Applies AST-level fixes to incompatible calls, then validates each fix. If parameter values can be recovered from pkl files, the repaired code is run in the target environment; otherwise the fix is checked by static signature validation.
  6. Report: Generates an internal report inside the workspace, then exports it to Report/runs/{run_id}/{command_id}/ for user access.
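
To make step 4 more concrete, the following is a minimal sketch of signature comparison built only on the standard inspect module. It is not PCART's implementation: diff_signatures, resize_v1, and resize_v2 are hypothetical names invented for this example, and the sketch reports a rename as a removal plus an addition rather than pairing the two.

```python
import inspect


def diff_signatures(old_func, new_func):
    """Return a list of (change_type, detail) tuples between two callables."""
    old = inspect.signature(old_func).parameters
    new = inspect.signature(new_func).parameters
    changes = []

    # Parameters present only in the old signature were removed (or renamed).
    for name in old:
        if name not in new:
            changes.append(("removed", name))

    # Parameters present only in the new signature were added (or renamed).
    for name in new:
        if name not in old:
            changes.append(("added", name))

    # Same name but different kind: positional <-> keyword-only conversion.
    for name in old:
        if name in new and old[name].kind != new[name].kind:
            changes.append(("kind_changed", name))

    # Parameters shared by both signatures but appearing in a different order.
    common_old = [n for n in old if n in new]
    common_new = [n for n in new if n in old]
    if common_old != common_new:
        changes.append(("reordered", ",".join(common_new)))

    return changes


# Hypothetical "current" and "target" versions of one API.
def resize_v1(image, width, height, mode="bilinear"):
    ...


def resize_v2(image, height, width, *, interpolation="bilinear"):
    ...


changes = diff_signatures(resize_v1, resize_v2)
```

For these two stand-in functions, changes contains a removal of mode, an addition of interpolation, and a reordering of the shared parameters.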

Workspace Layout

Each PCART run creates an isolated workspace under PCARTRuns/runs/:

PCARTRuns/runs/
└── {project}__{timestamp}-{NNN}/       # run_id
    └── cmd-001/                         # command_id
        ├── Copy/                        # Instrumented project copy for pkl generation
        │   ├── {projName}/              #   Project copy
        │   └── pkl/                     #   Runtime pkl files and manifests
        ├── Dynamic/                     # Stripped project copy for dynamic matching scripts
        │   └── {projName}/
        ├── data/                        # Intermediate JSON (call dicts, match snapshots)
        ├── temp/                        # Temporary files
        ├── Report/                      # Internal report (exported to Report/runs/...)
        └── metadata.json                # Run configuration metadata

User-visible reports are exported to:

Report/runs/
└── {run_id}/
    └── {command_id}/
        ├── {project}.txt               # Repair report
        ├── patches/                     # Source patches
        └── fixed_project/               # Repaired project copy

A Copy/bak_<projName>/ backup is also kept inside the workspace (it is not shown in the tree above). It is used for single-callsite re-instrumentation during target-environment pkl regeneration.
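
As a small illustration of how the two directory trees relate, the sketch below builds the workspace and report paths from a run id and command id. The helper names workspace_dir and report_dir are hypothetical, and the run id value is made up; only the directory names come from the layout above.

```python
from pathlib import Path


def workspace_dir(root, run_id, command_id):
    """Isolated workspace for one command of one run."""
    return Path(root) / "PCARTRuns" / "runs" / run_id / command_id


def report_dir(root, run_id, command_id):
    """User-visible export location for the same command."""
    return Path(root) / "Report" / "runs" / run_id / command_id


run_id = "demo__20260511-001"  # made-up value in the {project}__{timestamp}-{NNN} shape
ws = workspace_dir(".", run_id, "cmd-001")
pkl_dir = ws / "Copy" / "pkl"          # runtime pkl files and manifests
exported = report_dir(".", run_id, "cmd-001")
```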

pkl File Lifecycle

pkl (pickle) files are the core data artifact that carries runtime API call information between pipeline stages.

Generation (recordValue.py)

When the instrumented project runs in the currentEnv:

  1. paraValueDict collects each API call's receiver object and parameter values at runtime.
  2. apiCoveredSet records which callsites were actually executed.
  3. callsiteInfoDict records structured callsite metadata for each instrumented call.
  4. At process exit (atexit), savePkls() writes pkl files to the workspace's Copy/pkl/.
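
The recording flow above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of recordValue.py: the record() and install() helpers and the payload layout are hypothetical, and the standard pickle module stands in for dill so the example has no third-party dependency.

```python
import atexit
import pickle
from pathlib import Path

paraValueDict = {}      # artifact id -> receiver and parameter values
apiCoveredSet = set()   # artifact ids of callsites that actually executed


def record(artifact_id, receiver, args, kwargs):
    """Hypothetical hook called by the instrumented code at each callsite."""
    paraValueDict[artifact_id] = {
        "receiver": receiver,
        "args": args,
        "kwargs": kwargs,
    }
    apiCoveredSet.add(artifact_id)


def savePkls(out_dir):
    """Write one pkl per recorded callsite; return the paths that were saved."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for artifact_id, payload in paraValueDict.items():
        path = out_dir / f"{artifact_id}.pkl"
        try:
            with open(path, "wb") as fh:
                pickle.dump(payload, fh)
            saved.append(path)
        except Exception:
            # Unpicklable payload: drop the partial file and move on.
            path.unlink(missing_ok=True)
    return saved


def install(out_dir):
    """Register savePkls to run once, when the interpreter exits."""
    atexit.register(savePkls, out_dir)
```

An instrumented project would call install() once at startup and record() at each callsite. Writing at exit rather than per call keeps the instrumentation overhead inside the hot path small.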

Naming

pkl files are named by the structured artifact id produced by CallsiteIdentity.artifactId(), which uniquely identifies each callsite:

{rel_path_slug}__L{lineno}C{col_offset}__{call_slug}__{SHA256_hash}.pkl

Example: src_main_py__L42C5__torch_tensor__a1b2c3d4...64hex.pkl

The artifact id combines:

  • File slug: project-relative path, normalized (64 char max)
  • Position: L{lineno}C{col_offset} for precise source location
  • Call slug: normalized API name before the first ( (48 char max)
  • Hash: SHA256 of the full callsite payload for global uniqueness
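
A rough sketch of how such an id could be assembled is shown below. The normalization rule in slugify() and the exact fields fed into the hash are assumptions made for illustration; the real CallsiteIdentity.artifactId() may normalize and hash differently.

```python
import hashlib
import re


def slugify(text, max_len):
    """Collapse every run of non-alphanumeric characters to '_' and truncate."""
    slug = re.sub(r"[^0-9A-Za-z]+", "_", text).strip("_")
    return slug[:max_len]


def artifact_id(rel_path, lineno, col_offset, normalized_call):
    file_slug = slugify(rel_path, 64)
    # API name is everything before the first "(".
    call_slug = slugify(normalized_call.split("(", 1)[0], 48)
    # Assumed payload: path, position, and the normalized call text.
    payload = f"{rel_path}:{lineno}:{col_offset}:{normalized_call}"
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{file_slug}__L{lineno}C{col_offset}__{call_slug}__{digest}"


aid = artifact_id("src/main.py", 42, 5, "torch.tensor([1, 2, 3])")
```

Here aid starts with src_main_py__L42C5__torch_tensor__ and ends with a 64-character hex digest. The slugs keep the name readable, while the digest keeps it unique even after truncation.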

For calls within with/async with blocks, two candidate pkl files are generated per callsite:

Suffix          Content
__object.pkl    Receiver captured as a runtime object via dill
__expr.pkl      Receiver captured as a string expression (fallback)

The dual-candidate approach handles cases where the runtime object cannot be pickled (e.g., locked resources, C extension objects).

Manifest

Each callsite also produces a .manifest.json file tracking candidate statuses:

{
  "callsite": "src_main_py__L42C5__client_publish__a1b2c3d4...64hex",
  "covered": true,
  "candidates": [
    {"kind": "object", "status": "saved", "pkl": "src_main_py__L42C5__client_publish__a1b2...__object.pkl"},
    {"kind": "expr",  "status": "saved", "pkl": "src_main_py__L42C5__client_publish__a1b2...__expr.pkl"}
  ]
}

Candidates marked save_failed are re-generated in the target environment if needed.
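
As an illustration, a manifest in the shape shown above can be scanned for candidates that need regeneration with a few lines of standard-library code. The function name and the shortened ids in the example manifest are hypothetical.

```python
import json


def candidates_to_regenerate(manifest_text):
    """Return the kinds of all candidates whose status is save_failed."""
    manifest = json.loads(manifest_text)
    return [
        cand["kind"]
        for cand in manifest.get("candidates", [])
        if cand.get("status") == "save_failed"
    ]


# Made-up manifest: the object candidate could not be pickled.
example_manifest = """
{
  "callsite": "demo_py__L1C0__client_publish__abc",
  "covered": true,
  "candidates": [
    {"kind": "object", "status": "save_failed"},
    {"kind": "expr", "status": "saved", "pkl": "demo_py__L1C0__client_publish__abc__expr.pkl"}
  ]
}
"""

failed_kinds = candidates_to_regenerate(example_manifest)
```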

Regeneration in Target Environment

If a pkl from currentEnv cannot be loaded in targetEnv (due to serialization format changes), PCART automatically:

  1. Re-instruments the source file for the failed callsite only.
  2. Re-runs the project in targetEnv.
  3. Saves the new pkl with a new_ prefix (e.g., new_src_main_py__L42C5__client_publish__a1b2...__object.pkl).

Consumption

  • Dynamic matching (dynamicMatch.py): Loads the pkl, reconstructs the API callable, and uses inspect.signature() to extract the parameter signature.
  • Value addition (addValueForAPI.py): Loads the pkl to fill in concrete parameter values for repair validation.
  • Repair validation (verifySingle.py): Loads the pkl and attempts to call the repaired API with recovered values.

Candidate Order

When multiple pkl files exist for a callsite, PCART tries them in priority order:

  1. new_<artifact_id>__object.pkl (re-generated in target env, runtime object)
  2. new_<artifact_id>__expr.pkl (re-generated in target env, expression fallback)
  3. new_<artifact_id>.pkl (re-generated in target env, legacy format)
  4. <artifact_id>__object.pkl (original from current env, runtime object)
  5. <artifact_id>__expr.pkl (original from current env, expression fallback)
  6. <artifact_id>.pkl (original from current env, legacy format)
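
The priority order above can be expressed compactly. This sketch only checks which file exists; it does not attempt to load the pkl, so falling through to the next candidate after a failed load is left out. The function names are hypothetical.

```python
from pathlib import Path


def candidate_names(artifact_id):
    """File names in priority order: regenerated first, then originals."""
    suffixes = ("__object.pkl", "__expr.pkl", ".pkl")
    return [
        f"{prefix}{artifact_id}{suffix}"
        for prefix in ("new_", "")
        for suffix in suffixes
    ]


def first_existing(pkl_dir, artifact_id):
    """Return the highest-priority candidate present in pkl_dir, or None."""
    for name in candidate_names(artifact_id):
        path = Path(pkl_dir) / name
        if path.is_file():
            return path
    return None
```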

Dynamic vs Static Matching

PCART uses a two-tier matching strategy:

Dynamic Matching (preferred)

  1. Load the pkl file for the callsite.
  2. Reconstruct the API callable using the runtime receiver object and parameter values.
  3. Call inspect.signature() to get the actual runtime signature.
  4. Record the internal file path (inspect.getfile()) for disambiguation.

Falls back to static when:

  • No pkl exists for the callsite (not covered by test execution).
  • inspect.signature() cannot produce a signature and raises ValueError (as can happen for built-in or C extension APIs).
  • pkl fails to load (serialization error).
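
A minimal sketch of the dynamic step and its fallback signal follows. It relies only on documented behaviour of the inspect module: inspect.signature() raises ValueError when no signature can be provided and TypeError when the object is not supported. The function name dynamic_signature and the convention of returning None to mean "fall back to static matching" are assumptions of this example.

```python
import inspect
import json


def dynamic_signature(api):
    """Return (signature, defining file) or None if static matching is needed."""
    try:
        sig = inspect.signature(api)
    except (ValueError, TypeError):
        # No introspectable signature: caller should fall back to static matching.
        return None
    try:
        source_file = inspect.getfile(api)
    except (TypeError, OSError):
        # Built-in objects have no source file; keep the signature anyway.
        source_file = None
    return sig, source_file


result = dynamic_signature(json.dumps)   # pure-Python function: works
fallback = dynamic_signature(42)         # not a callable: returns None
```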

Static Matching (fallback)

  1. Match the API call name against the pre-extracted library API definitions (from LibAPIExtraction/).
  2. Use fuzzy matching: match by the last segment of the API name, then filter by name overlap.
  3. Check for import aliases via the library's __init__.py assignments.
  4. Distinguish between .pyi-declared (built-in) APIs and .py-declared APIs.
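
Step 2 can be pictured with the sketch below. It is one plausible reading of "match by the last segment, then filter by name overlap", not PCART's actual scoring: the function name, the overlap measure, and the pkg.* definitions are all invented for illustration.

```python
def static_match(call_name, definitions):
    """Return the definitions that best match a dotted call name."""
    call_parts = call_name.split(".")
    last = call_parts[-1]

    # Stage 1: keep definitions whose final segment equals the called name.
    candidates = [d for d in definitions if d.split(".")[-1] == last]
    if not candidates:
        return []

    # Stage 2: score by how many leading segments are shared with the call.
    def overlap(defn):
        return len(set(defn.split(".")[:-1]) & set(call_parts[:-1]))

    best = max(overlap(d) for d in candidates)
    return [d for d in candidates if overlap(d) == best]


# Hypothetical pre-extracted library definitions.
definitions = ["pkg.io.reader.load", "pkg.model.load", "pkg.io.writer.save"]
matches = static_match("pkg.io.load", definitions)
```

In this example pkg.io.reader.load wins because it shares both pkg and io with the call, while pkg.model.load shares only pkg.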

withitem / async with Support

PCART detects API calls through with and async with context managers:

async with aiofiles.open("file.txt") as f:   # ← withitem: aiofiles.open("file.txt")
    await f.read()                            # ← alias "f" → aiofiles.open("file.txt").read()

How it works:

  1. Extraction (WithVisitor): Records each withitem's context_expr (the API call) and optional_vars (the alias), along with the line number range of the with block.
  2. Alias resolution (modifyWithName): When a call uses the alias (e.g., f.read()), it is recursively resolved to the full API path (e.g., aiofiles.open("file.txt").read()). For nested with blocks with same-named aliases, the innermost scope takes precedence.
  3. Receiver capture (recordValue.py): For withitem callers, both the runtime object and the string expression are saved as separate pkl candidates, since the runtime object may not survive serialization.
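
The extraction and alias-resolution steps can be sketched with the standard ast module. This is a simplified stand-in for WithVisitor and modifyWithName: the class and function names are hypothetical, only simple "as name" targets are handled, resolution is a single level rather than recursive, and ast.unparse() requires Python 3.9 or later (it also normalizes string quotes).

```python
import ast


class WithAliasCollector(ast.NodeVisitor):
    """Collect (alias, context expression, first line, last line) per withitem."""

    def __init__(self):
        self.items = []

    def _record(self, node):
        for item in node.items:
            if isinstance(item.optional_vars, ast.Name):
                self.items.append((
                    item.optional_vars.id,
                    ast.unparse(item.context_expr),
                    node.lineno,
                    node.end_lineno,
                ))
        self.generic_visit(node)  # descend into nested with blocks

    visit_With = _record
    visit_AsyncWith = _record


def resolve(alias_call, lineno, items):
    """Rewrite 'alias.method()' using the innermost enclosing with block."""
    name, _, rest = alias_call.partition(".")
    enclosing = [it for it in items if it[0] == name and it[2] <= lineno <= it[3]]
    if not enclosing:
        return alias_call
    innermost = max(enclosing, key=lambda it: it[2])  # latest start line
    return f"{innermost[1]}.{rest}"


source = '''
async def main():
    async with aiofiles.open("file.txt") as f:
        await f.read()
'''

collector = WithAliasCollector()
collector.visit(ast.parse(source))
resolved = resolve("f.read()", 4, collector.items)
```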

callsite-based Identification

To distinguish the same API called at different source locations, PCART uses a structured CallsiteIdentity to generate a stable, readable artifact id:

from dataclasses import dataclass

@dataclass(frozen=True)
class CallsiteIdentity:
    rel_path: str         # project-relative file path
    lineno: int           # line number
    col_offset: int       # column offset
    end_lineno: int       # end line number
    end_col_offset: int   # end column offset
    call_text: str        # original call text (for report display)
    normalized_call: str  # whitespace/quote-normalized call text (for identity)

The artifactId() method produces a stable key in the form:

{rel_path_slug}__L{lineno}C{col_offset}__{call_slug}__{SHA256_hash}

The CallsiteRecord separates this identity into three concerns:

  • artifact_id: Used for pkl/json/manifest/shared dictionary keys throughout the pipeline
  • call_text: Original call text for report display
  • format_api: Restored API path for static fuzzy matching

The artifact id is used consistently throughout the pipeline:

  • Preprocess: As the dictionary key in paraValueDict and callsiteInfoDict instrumentation.
  • pkl naming: As the base for pkl file names.
  • Dynamic matching: As the lookup key to find the correct pkl.
  • Shared dictionary: As the cache key to avoid redundant matching across files.

Without callsite-based identification, two calls to the same API at different lines would share one pkl and one match result, causing incorrect signature assignments.
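
The collision is easy to reproduce in miniature. In the sketch below, which uses a plain (name, line, column) tuple as a simplified stand-in for the artifact id, keying by API name alone keeps only one of two json.dumps calls, while keying by callsite keeps both.

```python
import ast

source = (
    "import json\n"
    "a = json.dumps({})\n"
    "b = json.dumps([], indent=2)\n"
)

by_name = {}       # keyed by API name only
by_callsite = {}   # keyed by API name plus source position

for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Call):
        name = ast.unparse(node.func)        # requires Python 3.9+
        call_text = ast.unparse(node)
        by_name[name] = call_text            # second call overwrites the first
        by_callsite[(name, node.lineno, node.col_offset)] = call_text
```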

Related Pages