Add test and analysis for issue #396: Duplicate Python Values #399
# Issue #396: Duplicate Python Values - Analysis and Solutions

## Problem Summary

When Python objects are stored in egglog using `PyObject`, cloudpickle serializes them to bytes. Those bytes are then used for equality comparison and hashing in the Rust layer (`PyPickledValue`).

**The Issue**: Cloudpickle does not always produce identical byte sequences for the same object when pickling by value. This particularly affects:
- Classes defined in the `__main__` module
- Closures and nested functions
- Objects that cannot be imported normally

When cloudpickle produces different bytes for the same logical object, the e-graph treats them as distinct nodes, creating duplicates.
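Byte-level comparison conflates value equality with serialization determinism. A minimal illustration of "same logical value, different bytes", using the stdlib `pickle` here as a stand-in for cloudpickle: two dicts that compare equal can still serialize differently, because pickle emits items in insertion order.

```python
import pickle

# Two dicts with identical contents but different insertion order
d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}

assert d1 == d2  # equal as Python values...
# ...but pickle serializes items in insertion order, so the bytes differ
assert pickle.dumps(d1) != pickle.dumps(d2)
```

Under byte-level equality these would become two distinct e-graph nodes even though Python considers them equal.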
## Architecture Analysis

### Current Flow:
1. Python object → `cloudpickle.dumps()` → bytes
2. Bytes stored in `PyObjectDecl.pickled` (Python) and `PyPickledValue` (Rust)
3. Rust compares at the byte level: `impl Eq for PyPickledValue { ... }`
4. If the bytes differ, egglog creates different e-graph nodes

### Root Cause:
The `PyPickledValue` struct in `src/py_object_sort.rs` derives equality from the raw bytes:
```rust
#[derive(PartialEq, Eq, Hash, Clone)]
pub struct PyPickledValue(pub Vec<u8>);
```
## Proposed Solutions

### Solution 1: Object Identity Cache ⭐ (RECOMMENDED)
**Approach**: Maintain a Python-side cache that maps object identity to pickled bytes.

**Implementation**:
- Add a cache in the EGraph: `Dict[int, bytes]` mapping `id(obj)` to pickled bytes
- When creating a PyObject:
  1. Check whether `id(obj)` exists in the cache
  2. If yes, reuse the cached bytes
  3. If no, pickle and cache
- Use weak references to avoid keeping objects alive unnecessarily

**Pros**:
- ✅ Fast - no repeated pickling
- ✅ Solves the common case where the same object is added multiple times
- ✅ Minimal changes to the existing architecture
- ✅ Preserves object identity semantics

**Cons**:
- ❌ Only works while objects are alive
- ❌ Doesn't help with value equality (two equal but distinct objects)
- ❌ Requires Python-side state management

**Example Code Location**: `python/egglog/egraph_state.py` - add the cache to `EGraphState`
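A minimal sketch of the identity cache (a hypothetical helper, not the library's API; stdlib `pickle` stands in for cloudpickle). The same live object always maps to the same bytes, and a `weakref.finalize` callback evicts the entry when the object is collected so a recycled `id()` cannot alias stale bytes:

```python
import pickle  # stand-in for cloudpickle
import weakref


class PickleCache:
    """Map id(obj) -> pickled bytes so the same live object always
    yields identical bytes (hypothetical sketch, not the library API)."""

    def __init__(self) -> None:
        self._bytes: dict[int, bytes] = {}

    def dumps(self, obj) -> bytes:
        key = id(obj)
        cached = self._bytes.get(key)
        if cached is not None:
            return cached
        data = pickle.dumps(obj)
        self._bytes[key] = data
        try:
            # Evict on collection so a recycled id() cannot alias stale bytes.
            weakref.finalize(obj, self._bytes.pop, key, None)
        except TypeError:
            pass  # many builtins (list, dict, int, ...) are not weak-referenceable
        return data
```

Repeated calls on the same object return the very same bytes object, so downstream byte-level equality in `PyPickledValue` sees one node instead of duplicates.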
---

### Solution 2: Hash-Based Deduplication
**Approach**: Use Python's `hash()` for hashable objects; fall back to bytes for the rest.

**Implementation**:
- Modify `dump()` to compute and store both:
  - Pickled bytes (for reconstruction)
  - The Python hash (for equality)
- In Rust, compare hashable objects using the stored hash
- Keep objects alive in a registry

**Pros**:
- ✅ Respects Python's hash semantics
- ✅ Works for hashable objects

**Cons**:
- ❌ Doesn't work for unhashable objects (dicts, lists, etc.)
- ❌ Complex lifecycle management
- ❌ Hash collisions are possible
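One way to sketch the key computation, with a hypothetical `dedup_key` helper (stdlib `pickle` standing in for cloudpickle): hashable objects dedupe on `(type, hash)`, while unhashable ones fall back to raw pickled bytes.

```python
import pickle


def dedup_key(obj):
    """Hypothetical dedup key: Python hash when available, pickled bytes otherwise."""
    try:
        # Pairing the hash with the type narrows accidental cross-type
        # collisions such as hash(1) == hash(1.0); true collisions within
        # a type can still wrongly merge distinct values (a listed con).
        return ("hash", type(obj), hash(obj))
    except TypeError:
        # Unhashable containers (dict, list, set, ...) fall back to bytes
        return ("bytes", pickle.dumps(obj))
```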
---

### Solution 3: Unpickled Equality Comparison
**Approach**: When comparing `PyPickledValue` for equality, unpickle both sides and use Python's `==`.

**Implementation**:
- Override `PartialEq` in Rust to unpickle both objects
- Call back into Python to perform the equality check
- Cache the result to avoid repeated unpickling

**Pros**:
- ✅ Respects Python value equality semantics
- ✅ Conceptually simple

**Cons**:
- ❌ **Very expensive** - unpickles on every equality check
- ❌ May not work for objects without `__eq__`
- ❌ Could be 100-1000x slower
- ❌ Not suitable for use in hash tables
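The comparison itself can be sketched as a hypothetical helper (stdlib `pickle` as a stand-in). Note that the byte-equality fast path only helps when the payloads already match; every mismatch pays two unpickle calls plus an `__eq__` call:

```python
import pickle


def pickled_eq(a_bytes: bytes, b_bytes: bytes) -> bool:
    """Compare two pickled payloads by Python value equality (Solution 3 sketch)."""
    if a_bytes == b_bytes:
        return True  # identical bytes: certainly the same value
    # Slow path: unpickle both sides and delegate to Python's ==
    return pickle.loads(a_bytes) == pickle.loads(b_bytes)
```

For example, two value-equal dicts with different key insertion order pickle to different bytes yet compare equal here, which is exactly the semantics byte-level equality misses.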
---

### Solution 4: Canonical Pickle Format
**Approach**: Normalize the pickled bytes to ensure determinism.

**Implementation**:
- Post-process the pickled bytes to remove non-deterministic elements
- Sort dictionaries, normalize timestamps, etc.

**Pros**:
- ✅ Would work with the existing architecture

**Cons**:
- ❌ **Very difficult** to implement correctly
- ❌ May break cloudpickle's reconstruction
- ❌ The pickle format is complex and opaque
- ❌ May not be possible for all objects
---

### Solution 5: Hybrid Identity + Value Cache
**Approach**: Combine the object identity cache with value-based deduplication.

**Implementation**:
- First check the identity cache (fast path)
- For new objects, compute a content-based fingerprint
- Cache both the identity and fingerprint mappings

**Pros**:
- ✅ Best of both worlds
- ✅ Handles both identity and value equality

**Cons**:
- ❌ Complex implementation
- ❌ Hard to define a "content fingerprint" reliably
- ❌ Still has performance overhead
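A sketch of the hybrid lookup, under strong assumptions: the fingerprint below (pickled bytes of a best-effort canonical form, only special-casing dicts) is hypothetical, and defining it reliably for arbitrary objects is precisely the hard part noted above. A production version would also evict dead `id()` entries as in Solution 1.

```python
import pickle  # stand-in for cloudpickle


class HybridCache:
    """Identity fast path plus a content-fingerprint fallback (sketch)."""

    def __init__(self) -> None:
        self._by_id: dict[int, bytes] = {}    # fast path: object identity
        self._by_fp: dict[bytes, bytes] = {}  # slow path: content fingerprint

    def dumps(self, obj) -> bytes:
        data = self._by_id.get(id(obj))
        if data is not None:
            return data
        fp = self._fingerprint(obj)
        data = self._by_fp.get(fp)
        if data is None:
            data = pickle.dumps(obj)
            self._by_fp[fp] = data
        self._by_id[id(obj)] = data
        return data

    @staticmethod
    def _fingerprint(obj) -> bytes:
        if isinstance(obj, dict):
            try:
                # Canonicalize key order so equal dicts share a fingerprint
                return pickle.dumps(sorted(obj.items()))
            except TypeError:
                pass  # unorderable keys: fall through to raw pickling
        return pickle.dumps(obj)
```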
---

### Solution 6: User-Provided Keys
**Approach**: Allow users to optionally provide a custom key for deduplication.

**Implementation**:
```python
PyObject(obj, key="my_unique_key")
```

**Pros**:
- ✅ Simple to implement
- ✅ The user has full control
- ✅ Works for any object

**Cons**:
- ❌ Puts the burden on the user
- ❌ Not automatic
- ❌ Easy to misuse
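The semantics can be sketched with a hypothetical helper (this `py_object_payload` function is illustrative, not the proposed API surface): when the user supplies a key, equality is decided by the key alone; otherwise it falls back to pickled bytes.

```python
import pickle


def py_object_payload(obj, key=None):
    """Hypothetical dedup payload: user key if given, pickled bytes otherwise."""
    if key is not None:
        # The user vouches that equal keys mean "the same object" -
        # this is the easy-to-misuse part: distinct objects with the
        # same key would silently merge.
        return ("key", key)
    return ("bytes", pickle.dumps(obj))
```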
---

## Recommendation

**Implement Solution 1 (Object Identity Cache)**, because it:
1. Solves the most common case (the same object added multiple times)
2. Requires minimal changes
3. Has good performance characteristics
4. Can be implemented entirely in the Python layer
### Implementation Plan:

1. **Add a cache to the EGraph** (`python/egglog/egraph_state.py`):
```python
from typing import Dict

class EGraphState:
    def __init__(self):
        self._py_object_cache: Dict[int, bytes] = {}
```

2. **Modify PyObject creation** to check the cache first:
```python
def create_py_object(obj):
    obj_id = id(obj)
    if obj_id in cache:
        return cache[obj_id]
    pickled = cloudpickle.dumps(obj)
    cache[obj_id] = pickled
    return pickled
```

3. **Use weak references** to avoid keeping objects alive:
```python
import weakref

self._py_object_refs = weakref.WeakValueDictionary()
```

4. **Add tests** to verify that deduplication works
### Future Enhancements:
- Add optional value-based deduplication for specific types
- Provide a user override for custom equality
- Add metrics/warnings when duplicates are detected
## Test Results

Created test file: `python/tests/test_issue_396_duplicate_pyobjects.py`
- The tests demonstrate the expected behavior
- In simple cases, cloudpickle IS deterministic
- Real-world cases with classes in `__main__` may still be non-deterministic
- The tests provide a framework to catch regressions
## References

- Issue: https://github.com/egraphs-good/egglog-python/issues/396
- PR Discussion: https://github.com/egraphs-good/egglog-python/pull/393
- Cloudpickle source: https://github.com/cloudpipe/cloudpickle/blob/f5199fe2bc102a5ee070c743336699fc885ca966/cloudpickle/cloudpickle.py#L291-L303
---

`python/tests/test_issue_396_duplicate_pyobjects.py`:
| """ | ||
| Test for issue #396: Duplicate Python Values | ||
|
|
||
| This test demonstrates that cloudpickle can produce different byte sequences | ||
| for the same object when pickled by-value, particularly for classes defined | ||
| in __main__, which results in duplicate egglog PyObject nodes. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import cloudpickle | ||
| import pytest | ||
|
|
||
| from egglog import * | ||
|
|
||
|
|
||
| def test_duplicate_pyobjects_same_class_instance(): | ||
| """ | ||
| Test that demonstrates the duplicate PyObject issue. | ||
|
|
||
| When a class is defined in __main__ (or in this test module), cloudpickle | ||
| serializes it by value rather than by reference. According to issue #396, | ||
| this can produce different byte sequences for the same object across | ||
| different pickle calls, leading to duplicate nodes in the e-graph. | ||
|
|
||
| EXPECTED BEHAVIOR: The same Python object should create equal PyObject nodes | ||
| ACTUAL BEHAVIOR: May create duplicate nodes if cloudpickle produces different bytes | ||
|
|
||
| Note: This test may pass in simple cases where cloudpickle is deterministic, | ||
| but the issue can still occur in real-world scenarios where: | ||
| - Classes are defined in __main__ | ||
| - Objects are pickled at different times/contexts | ||
| - There are closures or complex nested structures | ||
| """ | ||
|
|
||
| # Define a simple class in the test module (acts like __main__) | ||
| class MyClass: | ||
| def __init__(self, value) -> None: | ||
|
||
| self.value = value | ||
|
|
||
| def __eq__(self, other): | ||
| return isinstance(other, MyClass) and self.value == other.value | ||
|
|
||
| egraph = EGraph() | ||
|
|
||
| # Create the SAME Python object | ||
| obj = MyClass(42) | ||
|
|
||
| # Pickle the same object multiple times to check for determinism | ||
| pickled1 = cloudpickle.dumps(obj) | ||
| pickled2 = cloudpickle.dumps(obj) | ||
|
|
||
| # Report on pickle determinism | ||
| print("\nCloudpickle determinism check:") | ||
|
||
| print(f" First pickle length: {len(pickled1)}") | ||
| print(f" Second pickle length: {len(pickled2)}") | ||
| print(f" Bytes are identical: {pickled1 == pickled2}") | ||
|
|
||
| if pickled1 != pickled2: | ||
| print(" ⚠ Cloudpickle produced DIFFERENT bytes for the same object!") | ||
| else: | ||
| print(" ✓ Cloudpickle is deterministic in this case") | ||
|
Comment on lines
+60
to
+62
|
||
|
|
||
| # Create PyObject expressions from the same object instance multiple times | ||
| # In an ideal world, these should be recognized as equal | ||
| py_obj1 = egraph.let("obj1", PyObject(obj)) | ||
| py_obj2 = egraph.let("obj2", PyObject(obj)) | ||
|
|
||
| egraph.run(10) | ||
|
|
||
| # Check if the e-graph recognizes them as equal | ||
| # This should pass, but might fail if cloudpickle produced different bytes | ||
| try: | ||
| egraph.check(eq(py_obj1).to(py_obj2)) | ||
| print(" ✓ E-graph correctly identifies objects as equal") | ||
| except Exception as e: | ||
| print(" ✗ E-graph treats objects as DIFFERENT - duplicates detected!") | ||
| print(f" Error: {e}") | ||
|
Comment on lines
+77
to
+78
|
||
| pytest.fail( | ||
| f"Duplicate PyObject nodes detected for the same Python object.\n" | ||
| f"This means cloudpickle produced different bytes for identical objects.\n" | ||
| f"Error: {e}" | ||
| ) | ||
|
|
||
|
|
||
| def test_duplicate_pyobjects_nested_class(): | ||
| """ | ||
| Test with nested class definitions which are even more likely | ||
| to be serialized by-value and produce different bytes. | ||
| """ | ||
|
|
||
| class OuterClass: | ||
| class InnerClass: | ||
| def __init__(self, x) -> None: | ||
|
||
| self.x = x | ||
|
|
||
| egraph = EGraph() | ||
|
|
||
| obj = OuterClass.InnerClass(100) | ||
|
|
||
| # Create the same object in the egraph multiple times | ||
| py_obj1 = egraph.let("nested1", PyObject(obj)) | ||
| py_obj2 = egraph.let("nested2", PyObject(obj)) | ||
|
|
||
| # Check if they're equal | ||
| egraph.run(10) | ||
|
|
||
| try: | ||
| egraph.check(eq(py_obj1).to(py_obj2)) | ||
| print("✓ Nested objects are equal - no duplicates") | ||
| except Exception as e: | ||
| print("✗ Nested objects are NOT equal - duplicates detected!") | ||
| pytest.fail(f"Duplicate PyObject nodes for nested class: {e}") | ||
|
Comment on lines
+112
to
+113
|
||
|
|
||
|
|
||
| def test_pyobject_with_closure(): | ||
| """ | ||
| Test that functions with closures (which must be serialized by-value) | ||
| can create duplicate PyObject nodes. | ||
| """ | ||
|
|
||
| def make_adder(x): | ||
| def adder(y): | ||
| return x + y | ||
|
|
||
| return adder | ||
|
|
||
| egraph = EGraph() | ||
|
|
||
| add_five = make_adder(5) | ||
|
|
||
| # Create the same closure in the egraph multiple times | ||
| py_fn1 = egraph.let("fn1", PyObject(add_five)) | ||
| py_fn2 = egraph.let("fn2", PyObject(add_five)) | ||
|
|
||
| egraph.run(10) | ||
|
|
||
| try: | ||
| egraph.check(eq(py_fn1).to(py_fn2)) | ||
| print("✓ Closures are equal - no duplicates") | ||
| except Exception as e: | ||
| print("✗ Closures are NOT equal - duplicates detected!") | ||
|
Comment on lines
+141
to
+142
|
||
| pytest.fail(f"Duplicate PyObject nodes for closure: {e}") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| # Run the tests manually to see output | ||
| print("Testing duplicate PyObjects with same class instance...") | ||
| test_duplicate_pyobjects_same_class_instance() | ||
|
|
||
| print("\nTesting duplicate PyObjects with nested class...") | ||
| test_duplicate_pyobjects_nested_class() | ||
|
|
||
| print("\nTesting duplicate PyObjects with closure...") | ||
| test_pyobject_with_closure() | ||
Review comment: the star import pollutes the enclosing namespace, since the imported module `egglog` does not define `__all__`.