Commit cab10f6

Add storage types redesign spec
Design document for reimplementing blob, attach, filepath, and object types as a coherent AttributeType system. Separates storage location (@store) from encoding behavior.
1 parent c34a5b8 commit cab10f6

1 file changed: 363 additions, 0 deletions

# Storage Types Redesign Spec

## Overview

This document proposes a redesign of DataJoint's storage types (`blob`, `attach`, `filepath`, `object`) as a coherent system built on the `AttributeType` base class.

## Current State Analysis

### Existing Types

| Type | DB Column | Storage | Semantics |
|------|-----------|---------|-----------|
| `longblob` | LONGBLOB | Internal | Raw bytes |
| `blob@store` | binary(16) | External | Raw bytes via UUID |
| `attach` | LONGBLOB | Internal | `filename\0contents` |
| `attach@store` | binary(16) | External | File via UUID |
| `filepath@store` | binary(16) | External | Path-addressed file reference |
| `object` | JSON | External | Managed file/folder with ObjectRef |

### Problems with Current Design

1. **Scattered implementation**: Logic is split across `declare.py`, `table.py`, `fetch.py`, and `external.py`
2. **Inconsistent patterns**: Some types use AttributeType, others are hardcoded
3. **Implicit behaviors**: `longblob` previously auto-serialized values and now stores raw bytes
4. **Overlapping semantics**: The distinction between `blob@store` and `attach@store` is unclear
5. **No internal object type**: `object` always requires an external store

## Proposed Architecture

### Core Concepts

1. **Storage Location** (orthogonal to type):
   - **Internal**: Data stored directly in database column
   - **External**: Data stored in external storage, UUID reference in database

2. **Content Model** (what the type represents):
   - **Binary**: Raw bytes (no interpretation)
   - **Serialized**: Python objects encoded via DJ blob format
   - **File**: Single file with filename metadata
   - **Folder**: Directory structure
   - **Reference**: Pointer to externally-managed file (path-addressed)

3. **AttributeType** handles encoding/decoding between Python values and stored representation

### Type Hierarchy

```
                     AttributeType (base)
                               │
             ┌─────────────────┼──────────────────────┐
             │                 │                      │
        BinaryType      SerializedType         FileSystemType
       (passthrough)     (pack/unpack)                │
             │                 │              ┌───────┴───────┐
             │                 │              │               │
         longblob          <djblob>       <attach>       <filepath>
      longblob@store    <djblob@store> <attach@store>  filepath@store
```

### Proposed Types

#### 1. Raw Binary (`longblob`, `blob`, etc.)

**Not an AttributeType** - these are primitive MySQL types.

- Store and return raw bytes without transformation
- `@store` variant stores externally with content-addressed UUID
- No encoding/decoding needed

```python
import datajoint as dj

# Table definition (schema declaration omitted for brevity)
class RawData(dj.Manual):
    definition = """
    id : int
    ---
    data : longblob            # raw bytes in DB
    large_data : blob@store    # raw bytes externally
    """

# Usage
table = RawData()
table.insert1({'id': 1, 'data': b'raw bytes', 'large_data': b'large raw bytes'})
row = (table & 'id=1').fetch1()
assert row['data'] == b'raw bytes'  # bytes returned
```

#### 2. Serialized Objects (`<djblob>`)

**AttributeType** with DJ blob serialization.

- Input: Any Python object (arrays, dicts, lists, etc.)
- Output: Same Python object reconstructed
- Storage: DJ blob format (mYm/dj0 protocol)

```python
from typing import Any

import datajoint as dj
from datajoint import blob

# AttributeType is the proposed base class; its import is omitted here.
@dj.register_type
class DJBlobType(AttributeType):
    type_name = "djblob"
    dtype = "longblob"  # or "longblob@store" for external

    def encode(self, value, *, key=None) -> bytes:
        return blob.pack(value, compress=True)

    def decode(self, stored, *, key=None) -> Any:
        return blob.unpack(stored)
```

```python
import numpy as np

# Table definition
class ProcessedData(dj.Manual):
    definition = """
    id : int
    ---
    result : <djblob>                # serialized in DB
    large_result : <djblob@store>    # serialized externally
    """

# Usage (both attributes are required, so both are supplied)
table = ProcessedData()
table.insert1({
    'id': 1,
    'result': {'array': np.array([1, 2, 3]), 'meta': 'info'},
    'large_result': np.zeros((1000, 1000)),
})
row = (table & 'id=1').fetch1()
assert row['result']['meta'] == 'info'  # Python dict returned
```

#### 3. File Attachments (`<attach>`)

**AttributeType** for file storage with filename preservation.

- Input: File path (string or Path)
- Output: Local file path after download
- Storage: File contents with filename metadata

```python
from pathlib import Path

@dj.register_type
class AttachType(AttributeType):
    type_name = "attach"
    dtype = "longblob"  # or "longblob@store" for external

    # For internal storage
    def encode(self, filepath, *, key=None) -> bytes:
        path = Path(filepath)
        return path.name.encode() + b"\0" + path.read_bytes()

    def decode(self, stored, *, key=None) -> str:
        # Split on the first NUL only; the file contents may contain NUL bytes
        filename, contents = stored.split(b"\0", 1)
        # Download to configured path, return local filepath
        ...
```

**Key difference from blob**: Preserves the original filename and returns a file path, not bytes.

```python
# Table definition
class Attachments(dj.Manual):
    definition = """
    id : int
    ---
    config_file : <attach>        # small file in DB
    data_file : <attach@store>    # large file externally
    """

# Usage (both attributes are required, so both are supplied)
table = Attachments()
table.insert1({'id': 1, 'config_file': '/path/to/config.yaml', 'data_file': '/path/to/data.bin'})
row = (table & 'id=1').fetch1()
# row['config_file'] == '/downloads/config.yaml'  # local path
```
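
The `filename\0contents` layout used for internal attachments can be exercised without a database. The following is a minimal sketch of the round trip, assuming the download directory is passed in explicitly; `encode_attachment` and `decode_attachment` are hypothetical helper names, not part of DataJoint.

```python
import tempfile
from pathlib import Path


def encode_attachment(filepath) -> bytes:
    """Pack a file as: filename, one NUL byte, then the raw contents."""
    path = Path(filepath)
    return path.name.encode() + b"\0" + path.read_bytes()


def decode_attachment(stored: bytes, download_dir) -> str:
    """Write the contents under download_dir and return the local path."""
    # Split on the first NUL only: the contents may themselves contain NULs.
    filename, contents = stored.split(b"\0", 1)
    target = Path(download_dir) / filename.decode()
    target.write_bytes(contents)
    return str(target)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "config.yaml"
    source.write_bytes(b"a: 1\x00binary tail")  # contents include a NUL byte
    downloads = Path(tmp) / "downloads"
    downloads.mkdir()

    stored = encode_attachment(source)
    local_path = decode_attachment(stored, downloads)
    restored_name = Path(local_path).name
    restored_bytes = Path(local_path).read_bytes()
```

Because filenames cannot contain a NUL byte on common filesystems, splitting once on the first NUL recovers the name unambiguously even when the payload is binary.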

#### 4. Filepath References (`<filepath>`)

**AttributeType** for tracking externally-managed files.

- Input: File path in staging area
- Output: Local file path after sync
- Storage: Path-addressed (UUID = hash of relative path, not contents)
- Tracks `contents_hash` separately for verification

```python
@dj.register_type
class FilepathType(AttributeType):
    type_name = "filepath"
    dtype = "binary(16)"   # Always external (UUID reference)
    requires_store = True  # Must specify @store

    def encode(self, filepath, *, key=None) -> bytes:
        # Compute UUID from relative path
        # Track contents_hash separately
        ...

    def decode(self, uuid_bytes, *, key=None) -> str:
        # Sync file from remote to local stage
        # Verify contents_hash
        # Return local path
        ...
```

**Key difference from attach**:
- Path-addressed (same path = same UUID, even if contents change)
- Designed for managed file workflows where files may be updated
- Always external (no internal variant)

```python
# Table definition
class ManagedFiles(dj.Manual):
    definition = """
    id : int
    ---
    data_path : <filepath@store>
    """

# Usage - file must be in configured stage directory
table = ManagedFiles()
table.insert1({'id': 1, 'data_path': '/stage/experiment_001/data.h5'})
row = (table & 'id=1').fetch1()
# row['data_path'] == '/local_stage/experiment_001/data.h5'
```
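
The difference between path addressing and content addressing can be shown in a few lines. This is a sketch only: the use of MD5 and the helper names `path_uuid` and `contents_hash` are assumptions made for illustration, since the spec does not fix a hash function.

```python
import hashlib
import uuid
from pathlib import PurePosixPath


def path_uuid(relative_path: str) -> uuid.UUID:
    """UUID derived from the relative path only (MD5 is an assumption)."""
    normalized = PurePosixPath(relative_path).as_posix()
    return uuid.UUID(bytes=hashlib.md5(normalized.encode()).digest())


def contents_hash(data: bytes) -> uuid.UUID:
    """Separate hash of the file contents, used only for verification."""
    return uuid.UUID(bytes=hashlib.md5(data).digest())


rel = "experiment_001/data.h5"

# The identifier depends on the path, so it is stable across file updates...
id_before = path_uuid(rel)
id_after = path_uuid(rel)

# ...while the contents hash changes whenever the bytes change.
hash_v1 = contents_hash(b"version 1")
hash_v2 = contents_hash(b"version 2")
```

Under this scheme an updated file keeps its database reference, and a mismatch between the stored and recomputed contents hash signals that the file changed since it was recorded.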

#### 5. Managed Objects (`<object>`)

**AttributeType** for managed file/folder storage with lazy access.

- Input: File path, folder path, or ObjectRef
- Output: ObjectRef handle (lazy - no automatic download)
- Storage: JSON metadata column
- Supports direct writes (Zarr, HDF5) via fsspec

```python
@dj.register_type
class ObjectType(AttributeType):
    type_name = "object"
    dtype = "json"
    requires_store = True  # Must specify @store

    def encode(self, value, *, key=None) -> str:
        # Upload file/folder to object storage
        # Return JSON metadata
        ...

    def decode(self, json_str, *, key=None) -> ObjectRef:
        # Return ObjectRef handle (no download)
        ...
```

```python
# Table definition
class LargeData(dj.Manual):
    definition = """
    id : int
    ---
    zarr_data : <object@store>
    """

# Usage
table = LargeData()
table.insert1({'id': 1, 'zarr_data': '/path/to/data.zarr'})
row = (table & 'id=1').fetch1()
ref = row['zarr_data']       # ObjectRef handle
ref.download('/local/path')  # Explicit download
# Or direct access via fsspec
```
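
The lazy behaviour described above (decoding yields a handle, and data moves only on an explicit call) might look roughly like the sketch below. `SketchObjectRef` and the JSON field names `path` and `size` are hypothetical, and a local directory stands in for the object store.

```python
import json
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchObjectRef:
    """Hypothetical lazy handle: holds metadata, copies nothing until asked."""

    path: str
    size: int

    @classmethod
    def from_json(cls, json_str: str) -> "SketchObjectRef":
        meta = json.loads(json_str)
        return cls(path=meta["path"], size=meta["size"])

    def download(self, destination) -> Path:
        """Explicitly copy the object into the destination directory."""
        dest_dir = Path(destination)
        dest_dir.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy(self.path, dest_dir))


with tempfile.TemporaryDirectory() as tmp:
    # A local folder plays the role of the external store in this sketch.
    stored_file = Path(tmp) / "store" / "data.bin"
    stored_file.parent.mkdir()
    stored_file.write_bytes(b"12345")

    # What the JSON column might hold (field names are assumptions).
    column_value = json.dumps({"path": str(stored_file), "size": 5})

    ref = SketchObjectRef.from_json(column_value)  # no copy happens here
    local_dir = Path(tmp) / "local"
    exists_before = local_dir.exists()

    local_copy = ref.download(local_dir)  # data moves only now
    downloaded = local_copy.read_bytes()
```

Keeping the fetch path free of I/O is what makes this model suitable for large Zarr or HDF5 objects: a query over many rows returns handles quickly, and the caller decides which objects to materialize.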

### Storage Location Modifier (`@store`)

The `@store` suffix is orthogonal to the type and specifies external storage:

| Type | Without @store | With @store |
|------|---------------|-------------|
| `longblob` | Raw bytes in DB | Raw bytes in external store |
| `<djblob>` | Serialized in DB | Serialized in external store |
| `<attach>` | File in DB | File in external store |
| `<filepath>` | N/A (error) | Path reference in external store |
| `<object>` | N/A (error) | Object in external store |

Implementation:
- `@store` changes the underlying `dtype` to `binary(16)` (UUID)
- Creates a foreign-key relationship to the `~external_{store}` tracking table
- AttributeType's `encode()`/`decode()` work with the external table transparently
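
As an illustration of how the modifier could be separated from the type name at declaration time, here is a small parsing sketch. The regular expression, `TypeSpec`, and `parse_type_spec` are hypothetical and only mirror the table above; they are not the parser used by DataJoint.

```python
import re
from typing import NamedTuple, Optional

# Types with no internal variant, per the table above.
REQUIRES_STORE = {"filepath", "object"}

# Matches "<name>" or "<name@store>" (assumed syntax for this sketch).
SPEC_PATTERN = re.compile(r"^<(?P<name>\w+)(?:@(?P<store>\w+))?>$")


class TypeSpec(NamedTuple):
    type_name: str
    store: Optional[str]


def parse_type_spec(spec: str) -> TypeSpec:
    """Split an angle-bracket type spec into its type name and store."""
    match = SPEC_PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f"not an AttributeType spec: {spec!r}")
    name, store = match.group("name"), match.group("store")
    if store is None and name in REQUIRES_STORE:
        raise ValueError(f"<{name}> requires an external store, e.g. <{name}@store>")
    return TypeSpec(name, store)


internal = parse_type_spec("<djblob>")
external = parse_type_spec("<attach@main>")

try:
    parse_type_spec("<filepath>")
    filepath_error = None
except ValueError as err:
    filepath_error = str(err)
```

Parsing the store out first keeps the two axes independent: the type name selects the codec, and the presence of a store name selects where the encoded value is written.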

### Extended AttributeType Interface

For types that interact with the filesystem, we extend the base interface:

```python
import uuid
from pathlib import Path

from datajoint.external import ExternalTable

class FileSystemType(AttributeType):
    """Base for types that work with file paths."""

    # Standard interface
    def encode(self, value, *, key=None) -> bytes | str:
        """Convert input (path or value) to stored representation."""
        ...

    def decode(self, stored, *, key=None) -> str:
        """Convert stored representation to local file path."""
        ...

    # Extended interface for external storage
    def upload(self, filepath: Path, external: ExternalTable) -> uuid.UUID:
        """Upload file to external storage, return UUID."""
        ...

    def download(self, uuid: uuid.UUID, external: ExternalTable,
                 download_path: Path) -> Path:
        """Download from external storage to local path."""
        ...
```

### Configuration

```python
# datajoint config
dj.config['stores'] = {
    'main': {
        'protocol': 's3',
        'endpoint': 's3.amazonaws.com',
        'bucket': 'my-bucket',
        'location': 'datajoint/',
    },
    'archive': {
        'protocol': 'file',
        'location': '/mnt/archive/',
    }
}

dj.config['download_path'] = '/tmp/dj_downloads'  # For attach
dj.config['stage'] = '/data/stage'                # For filepath
```

## Migration Path

### Phase 1: Current State (Done)
- `<djblob>` AttributeType implemented
- `longblob` returns raw bytes
- Legacy `AttributeAdapter` wrapped for backward compatibility

### Phase 2: Attach as AttributeType
- Implement `<attach>` and `<attach@store>` as AttributeType
- Deprecate bare `attach` type (still works, emits warning)
- Move logic from `table.py`/`fetch.py` to the `AttachType` class
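
One way the "still works, emits warning" behaviour could be realised is to normalize legacy spellings at declaration time. The sketch below is an assumption about how that might look; `LEGACY_TO_NEW` and `normalize_type` are hypothetical names.

```python
import warnings

# Hypothetical mapping from legacy spellings to their AttributeType form.
LEGACY_TO_NEW = {"attach": "<attach>"}


def normalize_type(declared: str) -> str:
    """Return the new spelling, warning when a legacy one is used."""
    base, sep, store = declared.partition("@")
    if base in LEGACY_TO_NEW:
        new = LEGACY_TO_NEW[base]
        if sep:
            # Re-attach the store inside the angle brackets: <attach@store>
            new = f"{new[:-1]}@{store}>"
        warnings.warn(
            f"'{declared}' is deprecated; use '{new}' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new
    return declared


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bare = normalize_type("attach")
    with_store = normalize_type("attach@main")
    untouched = normalize_type("<attach>")
```

Existing table definitions keep working because the legacy name is rewritten rather than rejected, and the warning gives users a concrete replacement to migrate to.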

### Phase 3: Filepath as AttributeType
- Implement `<filepath@store>` as AttributeType
- Deprecate `filepath@store` syntax (redirect to `<filepath@store>`)

### Phase 4: Object Type Refinement
- Already implemented as a separate system
- Ensure consistency with AttributeType patterns
- Consider `<object@store>` syntax

### Phase 5: Cleanup
- Remove scattered type handling from `table.py` and `fetch.py`
- Consolidate external storage logic
- Update documentation

## Summary

| Type | Input | Output | Internal | External | Use Case |
|------|-------|--------|----------|----------|----------|
| `longblob` | bytes | bytes | Yes | Yes | Raw binary data |
| `<djblob>` | any | any | Yes | Yes | Python objects, arrays |
| `<attach>` | path | path | Yes | Yes | Files with filename |
| `<filepath>` | path | path | No | Yes | Managed file workflows |
| `<object>` | path/ref | ObjectRef | No | Yes | Large files, Zarr, HDF5 |

This design:
1. Makes all custom types consistent AttributeTypes
2. Separates storage location (`@store`) from encoding behavior
3. Provides clear semantics for each type
4. Enables gradual migration from current implementation
