diff --git a/ISSUE_396_ANALYSIS.md b/ISSUE_396_ANALYSIS.md
new file mode 100644
index 00000000..28186d1b
--- /dev/null
+++ b/ISSUE_396_ANALYSIS.md
@@ -0,0 +1,208 @@
+# Issue #396: Duplicate Python Values - Analysis and Solutions
+
+## Problem Summary
+
+When Python objects are stored in egglog using `PyObject`, cloudpickle is used to serialize them to bytes. These bytes are then used for equality comparison and hashing in the Rust layer (`PyPickledValue`).
+
+**The Issue**: Cloudpickle does not always produce identical byte sequences for the same object when pickled by value. This particularly affects:
+- Classes defined in the `__main__` module
+- Closures and nested functions
+- Objects that cannot be imported normally
+
+When cloudpickle produces different bytes for the same logical object, the e-graph treats them as distinct nodes, creating duplicates.
+
+## Architecture Analysis
+
+### Current Flow:
+1. Python object → `cloudpickle.dumps()` → bytes
+2. Bytes stored in `PyObjectDecl.pickled` (Python) and `PyPickledValue` (Rust)
+3. Rust uses byte-level equality: `impl Eq for PyPickledValue { ... }`
+4. If the bytes differ → different e-graph nodes
+
+### Root Cause:
+The `PyPickledValue` struct in `src/py_object_sort.rs` implements equality based on raw bytes:
+```rust
+#[derive(PartialEq, Eq, Hash, Clone)]
+pub struct PyPickledValue(pub Vec<u8>);
+```
+
+## Proposed Solutions
+
+### Solution 1: Object Identity Cache ⭐ (RECOMMENDED)
+**Approach**: Maintain a Python-side cache that maps object identity to pickled bytes.
+
+**Implementation** (sketched below):
+- Add a cache in the EGraph: a `dict[int, bytes]` mapping `id(obj)` to pickled bytes
+- When creating a PyObject:
+  1. Check if `id(obj)` exists in the cache
+  2. If yes, reuse those bytes
+  3. If no, pickle and cache
+- Use weak references to avoid keeping objects alive unnecessarily
+
+**Pros**:
+- ✅ Fast - no repeated pickling
+- ✅ Solves the common case where the same object is added multiple times
+- ✅ Minimal changes to the existing architecture
+- ✅ Preserves object identity semantics
+
+**Cons**:
+- ❌ Only works while objects are alive
+- ❌ Doesn't help with value equality (two equal but distinct objects)
+- ❌ Requires Python-side state management
+
+**Example Code Location**: `python/egglog/egraph_state.py` - add the cache to `EGraphState`
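+
+A minimal sketch of this cache, assuming a hypothetical `PyObjectCache` helper (the class and method names are illustrative, not existing egglog API):
+
+```python
+import weakref
+
+import cloudpickle
+
+
+class PyObjectCache:
+    """Map id(obj) -> pickled bytes while the object is known to be alive."""
+
+    def __init__(self) -> None:
+        self._pickled: dict[int, bytes] = {}
+        # Weak map id(obj) -> obj: an entry silently disappears when the
+        # object is collected, marking the cached bytes as stale
+        self._refs = weakref.WeakValueDictionary()
+
+    def dumps(self, obj: object) -> bytes:
+        obj_id = id(obj)
+        if self._refs.get(obj_id) is obj:
+            # Same live object: reuse the bytes so the e-graph sees a
+            # single node even if cloudpickle would emit different bytes
+            return self._pickled[obj_id]
+        pickled = cloudpickle.dumps(obj)
+        try:
+            self._refs[obj_id] = obj
+            self._pickled[obj_id] = pickled
+        except TypeError:
+            # Some objects (e.g. ints) are not weak-referenceable, so
+            # they cannot be cached safely; pickle them every time
+            pass
+        return pickled
+```
+
+The identity check `self._refs.get(obj_id) is obj` also guards against CPython recycling an `id()` after garbage collection, which a bare `dict[int, bytes]` lookup would not catch.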
+
+---
+
+### Solution 2: Hash-Based Deduplication
+**Approach**: Use Python's `hash()` for hashable objects, fall back to bytes for others.
+
+**Implementation** (sketched below):
+- Modify `dump()` to compute and store both:
+  - Pickled bytes (for reconstruction)
+  - Python hash (for equality)
+- In Rust, compare using the stored hash for hashable objects
+- Keep objects alive in a registry
+
+**Pros**:
+- ✅ Respects Python's hash semantics
+- ✅ Works for hashable objects
+
+**Cons**:
+- ❌ Doesn't work for unhashable objects (dicts, lists, etc.)
+- ❌ Complex lifecycle management
+- ❌ Hash collisions possible
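+
+A sketch of the dual representation (`HashedPickle` is a hypothetical name, not part of the current codebase):
+
+```python
+import cloudpickle
+
+
+class HashedPickle:
+    """Pickled payload plus the Python hash used for equality."""
+
+    __slots__ = ("pickled", "py_hash")
+
+    def __init__(self, obj: object) -> None:
+        self.pickled = cloudpickle.dumps(obj)  # kept for reconstruction
+        self.py_hash = hash(obj)  # raises TypeError for dicts, lists, ...
+
+    def __hash__(self) -> int:
+        return self.py_hash
+
+    def __eq__(self, other: object) -> bool:
+        # Equal hashes do not guarantee equal objects - this is the
+        # "hash collisions possible" con above; the registry of live
+        # objects would be needed to confirm equality on a collision
+        return isinstance(other, HashedPickle) and self.py_hash == other.py_hash
+```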
+
+---
+
+### Solution 3: Unpickled Equality Comparison
+**Approach**: When comparing `PyPickledValue` for equality, unpickle and use Python's `==` (sketched below).
+
+**Implementation**:
+- Override `PartialEq` in Rust to unpickle both objects
+- Call back into Python to perform the equality check
+- Cache the result to avoid repeated unpickling
+
+**Pros**:
+- ✅ Respects Python value equality semantics
+- ✅ Conceptually simple
+
+**Cons**:
+- ❌ **Very expensive** - unpickles on every equality check
+- ❌ May not work for objects without `__eq__`
+- ❌ Could be 100-1000x slower
+- ❌ Not suitable for use in hash tables
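+
+A Python-side model of what the overridden `PartialEq` would have to do on every comparison (illustrative only; the real change would live in `src/py_object_sort.rs`):
+
+```python
+import cloudpickle
+
+
+def pickled_values_equal(bytes_a: bytes, bytes_b: bytes) -> bool:
+    # Fast path: identical bytes certainly encode the same value
+    if bytes_a == bytes_b:
+        return True
+    # Slow path: reconstruct both objects and defer to Python's ==.
+    # Running this on every e-graph equality check is why the approach
+    # is estimated above to be orders of magnitude slower.
+    return cloudpickle.loads(bytes_a) == cloudpickle.loads(bytes_b)
+```
+
+Note that any container keyed by `PyPickledValue` also needs `Hash` to stay consistent with this `Eq`, and byte-level hashing would violate that - hence the "not suitable for use in hash tables" con.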
+
+---
+
+### Solution 4: Canonical Pickle Format
+**Approach**: Normalize the pickled bytes to ensure determinism.
+
+**Implementation**:
+- Post-process the pickled bytes to remove non-deterministic elements
+- Sort dictionaries, normalize timestamps, etc.
+
+**Pros**:
+- ✅ Would work with the existing architecture
+
+**Cons**:
+- ❌ **Very difficult** to implement correctly
+- ❌ May break cloudpickle's reconstruction
+- ❌ The pickle format is complex and opaque
+- ❌ May not be possible for all objects
+
+---
+
+### Solution 5: Hybrid Identity + Value Cache
+**Approach**: Combine the object identity cache with value-based deduplication.
+
+**Implementation**:
+- First check the identity cache (fast path)
+- For new objects, compute a content-based fingerprint
+- Cache both the identity and fingerprint mappings
+
+**Pros**:
+- ✅ Best of both worlds
+- ✅ Handles both identity and value equality
+
+**Cons**:
+- ❌ Complex implementation
+- ❌ Hard to define a "content fingerprint" reliably
+- ❌ Still has performance overhead
+
+---
+
+### Solution 6: User-Provided Keys
+**Approach**: Allow users to optionally provide a custom key for deduplication.
+
+**Implementation**:
+```python
+PyObject(obj, key="my_unique_key")
+```
+
+**Pros**:
+- ✅ Simple to implement
+- ✅ User has full control
+- ✅ Works for any object
+
+**Cons**:
+- ❌ Puts the burden on the user
+- ❌ Not automatic
+- ❌ Easy to misuse
+
+---
+
+## Recommendation
+
+**Implement Solution 1 (Object Identity Cache)**, as it:
+1. Solves the most common case (the same object added multiple times)
+2. Requires minimal changes
+3. Has good performance characteristics
+4. Can be implemented entirely in the Python layer
+
+### Implementation Plan:
+
+1. **Add a cache to the EGraph** (`python/egglog/egraph_state.py`):
+   ```python
+   class EGraphState:
+       def __init__(self) -> None:
+           self._py_object_cache: dict[int, bytes] = {}
+   ```
+
+2. **Modify PyObject creation** to check the cache first:
+   ```python
+   def create_py_object(self, obj: object) -> bytes:
+       obj_id = id(obj)
+       if obj_id in self._py_object_cache:
+           return self._py_object_cache[obj_id]
+       pickled = cloudpickle.dumps(obj)
+       self._py_object_cache[obj_id] = pickled
+       return pickled
+   ```
+
+3. **Use weak references** so the cache neither keeps objects alive nor serves stale entries (see the demonstration after this list):
+   ```python
+   import weakref
+
+   # Weakly map id(obj) -> obj; an entry disappearing tells us the
+   # cached bytes for that id are stale and must be evicted
+   self._py_object_refs = weakref.WeakValueDictionary()
+   ```
+
+4. **Add tests** to verify that deduplication works
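+
+The weak-reference step matters for correctness, not just memory: CPython can recycle an `id()` once its object is collected, so an id-keyed cache with no eviction can silently serve another object's bytes. A small demonstration (CPython-specific; the id reuse shown here is common but not guaranteed):
+
+```python
+class Widget:
+    pass
+
+
+a = Widget()
+stale_id = id(a)
+del a  # a is collected; its memory (and thus its id) may be reused
+b = Widget()
+
+# Often True on CPython because the freed slot is recycled for b.
+# A cache still holding stale_id -> pickled bytes for a would now
+# hand b the wrong bytes.
+print(id(b) == stale_id)
+```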
+ """ + + def make_adder(x): + def adder(y): + return x + y + + return adder + + egraph = EGraph() + + add_five = make_adder(5) + + # Create the same closure in the egraph multiple times + py_fn1 = egraph.let("fn1", PyObject(add_five)) + py_fn2 = egraph.let("fn2", PyObject(add_five)) + + egraph.run(10) + + try: + egraph.check(eq(py_fn1).to(py_fn2)) + print("✓ Closures are equal - no duplicates") + except Exception as e: + print("✗ Closures are NOT equal - duplicates detected!") + pytest.fail(f"Duplicate PyObject nodes for closure: {e}") + + +if __name__ == "__main__": + # Run the tests manually to see output + print("Testing duplicate PyObjects with same class instance...") + test_duplicate_pyobjects_same_class_instance() + + print("\nTesting duplicate PyObjects with nested class...") + test_duplicate_pyobjects_nested_class() + + print("\nTesting duplicate PyObjects with closure...") + test_pyobject_with_closure()