| 1 | +# Storage Types Redesign Spec |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This document proposes a redesign of DataJoint's storage types (`blob`, `attach`, `filepath`, `object`) as a coherent system built on the `AttributeType` base class. |
| 6 | + |
| 7 | +## Current State Analysis |
| 8 | + |
| 9 | +### Existing Types |
| 10 | + |
| 11 | +| Type | DB Column | Storage | Semantics | |
| 12 | +|------|-----------|---------|-----------| |
| 13 | +| `longblob` | LONGBLOB | Internal | Raw bytes | |
| 14 | +| `blob@store` | binary(16) | External | Raw bytes via UUID | |
| 15 | +| `attach` | LONGBLOB | Internal | `filename\0contents` | |
| 16 | +| `attach@store` | binary(16) | External | File via UUID | |
| 17 | +| `filepath@store` | binary(16) | External | Path-addressed file reference | |
| 18 | +| `object` | JSON | External | Managed file/folder with ObjectRef | |
| 19 | + |
| 20 | +### Problems with Current Design |
| 21 | + |
| 22 | +1. **Scattered implementation**: Logic split across `declare.py`, `table.py`, `fetch.py`, `external.py` |
| 23 | +2. **Inconsistent patterns**: Some types use AttributeType, others are hardcoded |
| 24 | +3. **Implicit behaviors**: `longblob` previously auto-serialized values but now stores raw bytes |
| 25 | +4. **Overlapping semantics**: The distinction between `blob@store` and `attach@store` is unclear |
| 26 | +5. **No internal object type**: `object` always requires an external store |
| 27 | + |
| 28 | +## Proposed Architecture |
| 29 | + |
| 30 | +### Core Concepts |
| 31 | + |
| 32 | +1. **Storage Location** (orthogonal to type): |
| 33 | + - **Internal**: Data stored directly in database column |
| 34 | + - **External**: Data stored in external storage, UUID reference in database |
| 35 | + |
| 36 | +2. **Content Model** (what the type represents): |
| 37 | + - **Binary**: Raw bytes (no interpretation) |
| 38 | + - **Serialized**: Python objects encoded via DJ blob format |
| 39 | + - **File**: Single file with filename metadata |
| 40 | + - **Folder**: Directory structure |
| 41 | + - **Reference**: Pointer to externally-managed file (path-addressed) |
| 42 | + |
| 43 | +3. **AttributeType** handles encoding/decoding between Python values and stored representation |
| 44 | + |
| 45 | +### Type Hierarchy |
| 46 | + |
| 47 | +``` |
| 48 | +                  AttributeType (base) |
| 49 | +                            │ |
| 50 | +          ┌─────────────────┼─────────────────┐ |
| 51 | +          │                 │                 │ |
| 52 | +     BinaryType      SerializedType    FileSystemType |
| 53 | +    (passthrough)     (pack/unpack)           │ |
| 54 | +          │                 │          ┌──────┴──────┐ |
| 55 | +          │                 │          │             │ |
| 56 | +      longblob          <djblob>    <attach>       <filepath> |
| 57 | +   longblob@store    <djblob@store> <attach@store> <filepath@store> |
| 58 | +``` |
| 59 | + |
| 60 | +### Proposed Types |
| 61 | + |
| 62 | +#### 1. Raw Binary (`longblob`, `blob`, etc.) |
| 63 | + |
| 64 | +**Not an AttributeType** - these are primitive MySQL types. |
| 65 | + |
| 66 | +- Store/return raw bytes without transformation |
| 67 | +- `@store` variant stores externally with content-addressed UUID |
| 68 | +- No encoding/decoding needed |
| 69 | + |
| 70 | +```python |
| 71 | +# Table definition |
| 72 | +class RawData(dj.Manual): |
| 73 | +    definition = """ |
| 74 | +    id : int |
| 75 | +    --- |
| 76 | +    data : longblob # raw bytes in DB |
| 77 | +    large_data : blob@store # raw bytes externally |
| 78 | +    """ |
| 79 | + |
| 80 | +# Usage |
| 81 | +table.insert1({'id': 1, 'data': b'raw bytes', 'large_data': b'large raw bytes'}) |
| 82 | +row = (table & 'id=1').fetch1() |
| 83 | +assert row['data'] == b'raw bytes' # bytes returned |
| 84 | +``` |
| 85 | + |
| 86 | +#### 2. Serialized Objects (`<djblob>`) |
| 87 | + |
| 88 | +**AttributeType** with DJ blob serialization. |
| 89 | + |
| 90 | +- Input: Any Python object (arrays, dicts, lists, etc.) |
| 91 | +- Output: Same Python object reconstructed |
| 92 | +- Storage: DJ blob format (mYm/dj0 protocol) |
| 93 | + |
| 94 | +```python |
| 95 | +@dj.register_type |
| 96 | +class DJBlobType(AttributeType): |
| 97 | +    type_name = "djblob" |
| 98 | +    dtype = "longblob"  # or "longblob@store" for external |
| 99 | + |
| 100 | +    def encode(self, value, *, key=None) -> bytes: |
| 101 | +        return blob.pack(value, compress=True) |
| 102 | + |
| 103 | +    def decode(self, stored, *, key=None) -> Any: |
| 104 | +        return blob.unpack(stored) |
| 105 | +``` |
| 106 | + |
| 107 | +```python |
| 108 | +# Table definition |
| 109 | +class ProcessedData(dj.Manual): |
| 110 | +    definition = """ |
| 111 | +    id : int |
| 112 | +    --- |
| 113 | +    result : <djblob> # serialized in DB |
| 114 | +    large_result : <djblob@store> # serialized externally |
| 115 | +    """ |
| 116 | + |
| 117 | +# Usage |
| 118 | +table.insert1({'id': 1, 'result': {'array': np.array([1,2,3]), 'meta': 'info'}}) |
| 119 | +row = (table & 'id=1').fetch1() |
| 120 | +assert row['result']['meta'] == 'info' # Python dict returned |
| 121 | +``` |
| 122 | + |
| 123 | +#### 3. File Attachments (`<attach>`) |
| 124 | + |
| 125 | +**AttributeType** for file storage with filename preservation. |
| 126 | + |
| 127 | +- Input: File path (string or Path) |
| 128 | +- Output: Local file path after download |
| 129 | +- Storage: File contents with filename metadata |
| 130 | + |
| 131 | +```python |
| 132 | +@dj.register_type |
| 133 | +class AttachType(AttributeType): |
| 134 | +    type_name = "attach" |
| 135 | +    dtype = "longblob"  # or "longblob@store" for external |
| 136 | + |
| 137 | +    # For internal storage |
| 138 | +    def encode(self, filepath, *, key=None) -> bytes: |
| 139 | +        path = Path(filepath) |
| 140 | +        return path.name.encode() + b"\0" + path.read_bytes() |
| 141 | + |
| 142 | +    def decode(self, stored, *, key=None) -> str: |
| 143 | +        filename, contents = stored.split(b"\0", 1) |
| 144 | +        # Download to configured path, return local filepath |
| 145 | +        ... |
| 146 | +``` |
| 147 | + |
| 148 | +**Key difference from blob**: Preserves original filename, returns file path not bytes. |
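The internal `filename\0contents` layout can be exercised end to end with a small sketch. The helper names `encode_attachment` and `decode_attachment` are hypothetical, and the download directory is passed explicitly here instead of being read from configuration. Splitting on the first NUL only is what lets the file contents themselves contain NUL bytes.

```python
import tempfile
from pathlib import Path


def encode_attachment(filepath) -> bytes:
    """Pack a file as: filename bytes, one NUL byte, then the raw contents."""
    path = Path(filepath)
    return path.name.encode() + b"\0" + path.read_bytes()


def decode_attachment(stored: bytes, download_dir) -> str:
    """Split on the first NUL, write the contents under download_dir, return the path."""
    filename, contents = stored.split(b"\0", 1)
    # Keep only the final path component so a stored name cannot escape download_dir.
    target = Path(download_dir) / Path(filename.decode()).name
    target.write_bytes(contents)
    return str(target)


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    original = Path(src) / "config.yaml"
    original.write_bytes(b"rate: 30\x00kHz\n")  # payload deliberately contains a NUL
    stored = encode_attachment(original)
    restored = Path(decode_attachment(stored, dst))
    roundtrip_name = restored.name
    roundtrip_bytes = restored.read_bytes()
```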
| 149 | + |
| 150 | +```python |
| 151 | +# Table definition |
| 152 | +class Attachments(dj.Manual): |
| 153 | +    definition = """ |
| 154 | +    id : int |
| 155 | +    --- |
| 156 | +    config_file : <attach> # small file in DB |
| 157 | +    data_file : <attach@store> # large file externally |
| 158 | +    """ |
| 159 | + |
| 160 | +# Usage |
| 161 | +table.insert1({'id': 1, 'config_file': '/path/to/config.yaml'}) |
| 162 | +row = (table & 'id=1').fetch1() |
| 163 | +# row['config_file'] == '/downloads/config.yaml' # local path |
| 164 | +``` |
| 165 | + |
| 166 | +#### 4. Filepath References (`<filepath>`) |
| 167 | + |
| 168 | +**AttributeType** for tracking externally-managed files. |
| 169 | + |
| 170 | +- Input: File path in staging area |
| 171 | +- Output: Local file path after sync |
| 172 | +- Storage: Path-addressed (UUID = hash of relative path, not contents) |
| 173 | +- Tracks `contents_hash` separately for verification |
| 174 | + |
| 175 | +```python |
| 176 | +@dj.register_type |
| 177 | +class FilepathType(AttributeType): |
| 178 | +    type_name = "filepath" |
| 179 | +    dtype = "binary(16)"  # Always external (UUID reference) |
| 180 | +    requires_store = True  # Must specify @store |
| 181 | + |
| 182 | +    def encode(self, filepath, *, key=None) -> bytes: |
| 183 | +        # Compute UUID from relative path |
| 184 | +        # Track contents_hash separately |
| 185 | +        ... |
| 186 | + |
| 187 | +    def decode(self, uuid_bytes, *, key=None) -> str: |
| 188 | +        # Sync file from remote to local stage |
| 189 | +        # Verify contents_hash |
| 190 | +        # Return local path |
| 191 | +        ... |
| 192 | +``` |
| 193 | + |
| 194 | +**Key difference from attach**: |
| 195 | +- Path-addressed (same path = same UUID, even if contents change) |
| 196 | +- Designed for managed file workflows where files may be updated |
| 197 | +- Always external (no internal variant) |
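Path addressing can be contrasted with content hashing in a few lines. This is a sketch under stated assumptions: the spec does not say which hash functions are used, so `uuid.uuid5` with an arbitrary namespace stands in for "hash of relative path" and SHA-256 stands in for `contents_hash`. The function names are hypothetical.

```python
import hashlib
import uuid
from pathlib import PurePosixPath

# Arbitrary namespace chosen for this sketch only.
PATH_NAMESPACE = uuid.NAMESPACE_URL


def path_uuid(relative_path: str) -> uuid.UUID:
    """Derive the reference UUID from the relative path alone, never the contents."""
    normalized = PurePosixPath(relative_path).as_posix()
    return uuid.uuid5(PATH_NAMESPACE, normalized)


def contents_hash(data: bytes) -> str:
    """Hash of the file contents, tracked separately for verification."""
    return hashlib.sha256(data).hexdigest()


rel = "experiment_001/data.h5"
v1, v2 = b"first version", b"second version"

# Updating the file changes the contents hash but not the reference.
same_reference = path_uuid(rel) == path_uuid(rel)
contents_changed = contents_hash(v1) != contents_hash(v2)
```

Because the UUID is 16 bytes regardless of the path length, it fits the `binary(16)` column used for the reference.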
| 198 | + |
| 199 | +```python |
| 200 | +# Table definition |
| 201 | +class ManagedFiles(dj.Manual): |
| 202 | +    definition = """ |
| 203 | +    id : int |
| 204 | +    --- |
| 205 | +    data_path : <filepath@store> |
| 206 | +    """ |
| 207 | + |
| 208 | +# Usage - file must be in configured stage directory |
| 209 | +table.insert1({'id': 1, 'data_path': '/stage/experiment_001/data.h5'}) |
| 210 | +row = (table & 'id=1').fetch1() |
| 211 | +# row['data_path'] == '/local_stage/experiment_001/data.h5' |
| 212 | +``` |
| 213 | + |
| 214 | +#### 5. Managed Objects (`<object>`) |
| 215 | + |
| 216 | +**AttributeType** for managed file/folder storage with lazy access. |
| 217 | + |
| 218 | +- Input: File path, folder path, or ObjectRef |
| 219 | +- Output: ObjectRef handle (lazy - no automatic download) |
| 220 | +- Storage: JSON metadata column |
| 221 | +- Supports direct writes (Zarr, HDF5) via fsspec |
| 222 | + |
| 223 | +```python |
| 224 | +@dj.register_type |
| 225 | +class ObjectType(AttributeType): |
| 226 | +    type_name = "object" |
| 227 | +    dtype = "json" |
| 228 | +    requires_store = True  # Must specify @store |
| 229 | + |
| 230 | +    def encode(self, value, *, key=None) -> str: |
| 231 | +        # Upload file/folder to object storage |
| 232 | +        # Return JSON metadata |
| 233 | +        ... |
| 234 | + |
| 235 | +    def decode(self, json_str, *, key=None) -> ObjectRef: |
| 236 | +        # Return ObjectRef handle (no download) |
| 237 | +        ... |
| 238 | +``` |
| 239 | + |
| 240 | +```python |
| 241 | +# Table definition |
| 242 | +class LargeData(dj.Manual): |
| 243 | +    definition = """ |
| 244 | +    id : int |
| 245 | +    --- |
| 246 | +    zarr_data : <object@store> |
| 247 | +    """ |
| 248 | + |
| 249 | +# Usage |
| 250 | +table.insert1({'id': 1, 'zarr_data': '/path/to/data.zarr'}) |
| 251 | +row = (table & 'id=1').fetch1() |
| 252 | +ref = row['zarr_data'] # ObjectRef handle |
| 253 | +ref.download('/local/path') # Explicit download |
| 254 | +# Or direct access via fsspec |
| 255 | +``` |
| 256 | + |
| 257 | +### Storage Location Modifier (`@store`) |
| 258 | + |
| 259 | +The `@store` suffix is orthogonal to the type and specifies external storage: |
| 260 | + |
| 261 | +| Type | Without @store | With @store | |
| 262 | +|------|---------------|-------------| |
| 263 | +| `longblob` | Raw bytes in DB | Raw bytes in external store | |
| 264 | +| `<djblob>` | Serialized in DB | Serialized in external store | |
| 265 | +| `<attach>` | File in DB | File in external store | |
| 266 | +| `<filepath>` | N/A (error) | Path reference in external store | |
| 267 | +| `<object>` | N/A (error) | Object in external store | |
| 268 | + |
| 269 | +Implementation: |
| 270 | +- For UUID-referenced types, `@store` changes the underlying `dtype` to `binary(16)` (UUID); `<object@store>` keeps its `json` column |
| 271 | +- Creates FK relationship to `~external_{store}` tracking table |
| 272 | +- AttributeType's `encode()`/`decode()` work with the external table transparently |
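How a declaration might separate the type name from the `@store` modifier can be sketched as follows. The regular expression, the `TypeSpec` tuple, and `parse_type_spec` are hypothetical names for illustration, and the pattern assumes store names consist of word characters only. The set of store-requiring types follows the table above.

```python
import re
from typing import NamedTuple, Optional

# Matches <name> or <name@store>; an assumed grammar for this sketch.
TYPE_SPEC = re.compile(r"^<(?P<name>\w+)(?:@(?P<store>\w+))?>$")

# Types listed above as "N/A (error)" without @store.
REQUIRES_STORE = {"filepath", "object"}


class TypeSpec(NamedTuple):
    name: str
    store: Optional[str]


def parse_type_spec(spec: str) -> TypeSpec:
    """Split '<name@store>' into its type name and optional store name."""
    match = TYPE_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"not an AttributeType spec: {spec!r}")
    name, store = match["name"], match["store"]
    if store is None and name in REQUIRES_STORE:
        raise ValueError(f"<{name}> requires an external store, e.g. <{name}@store>")
    return TypeSpec(name, store)
```

Keeping the store as a separate field is one way to make the location modifier orthogonal to the encoding behavior: the same type class can be looked up by `name` whether or not `store` is set.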
| 273 | + |
| 274 | +### Extended AttributeType Interface |
| 275 | + |
| 276 | +For types that interact with the filesystem, we extend the base interface: |
| 277 | + |
| 278 | +```python |
| 279 | +class FileSystemType(AttributeType): |
| 280 | +    """Base for types that work with file paths.""" |
| 281 | + |
| 282 | +    # Standard interface |
| 283 | +    def encode(self, value, *, key=None) -> bytes | str: |
| 284 | +        """Convert input (path or value) to stored representation.""" |
| 285 | +        ... |
| 286 | + |
| 287 | +    def decode(self, stored, *, key=None) -> str: |
| 288 | +        """Convert stored representation to local file path.""" |
| 289 | +        ... |
| 290 | + |
| 291 | +    # Extended interface for external storage |
| 292 | +    def upload(self, filepath: Path, external: ExternalTable) -> uuid.UUID: |
| 293 | +        """Upload file to external storage, return UUID.""" |
| 294 | +        ... |
| 295 | + |
| 296 | +    def download(self, uuid: uuid.UUID, external: ExternalTable, |
| 297 | +                 download_path: Path) -> Path: |
| 298 | +        """Download from external storage to local path.""" |
| 299 | +        ... |
| 300 | +``` |
| 301 | + |
| 302 | +### Configuration |
| 303 | + |
| 304 | +```python |
| 305 | +# datajoint config |
| 306 | +dj.config['stores'] = { |
| 307 | +    'main': { |
| 308 | +        'protocol': 's3', |
| 309 | +        'endpoint': 's3.amazonaws.com', |
| 310 | +        'bucket': 'my-bucket', |
| 311 | +        'location': 'datajoint/', |
| 312 | +    }, |
| 313 | +    'archive': { |
| 314 | +        'protocol': 'file', |
| 315 | +        'location': '/mnt/archive/', |
| 316 | +    } |
| 317 | +} |
| 318 | + |
| 319 | +dj.config['download_path'] = '/tmp/dj_downloads' # For attach |
| 320 | +dj.config['stage'] = '/data/stage' # For filepath |
| 321 | +``` |
| 322 | + |
| 323 | +## Migration Path |
| 324 | + |
| 325 | +### Phase 1: Current State (Done) |
| 326 | +- `<djblob>` AttributeType implemented |
| 327 | +- `longblob` returns raw bytes |
| 328 | +- Legacy `AttributeAdapter` wrapped for backward compatibility |
| 329 | + |
| 330 | +### Phase 2: Attach as AttributeType |
| 331 | +- Implement `<attach>` and `<attach@store>` as AttributeType |
| 332 | +- Deprecate bare `attach` type (still works, emits warning) |
| 333 | +- Move logic from table.py/fetch.py to AttachType class |
| 334 | + |
| 335 | +### Phase 3: Filepath as AttributeType |
| 336 | +- Implement `<filepath@store>` as AttributeType |
| 337 | +- Deprecate `filepath@store` syntax (redirect to `<filepath@store>`) |
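The deprecation steps in Phases 2 and 3 could be handled by a small normalization shim. This is a sketch only: `normalize_type` and `LEGACY_TYPES` are hypothetical names, and the warning text is illustrative. Primitive types such as `longblob` pass through unchanged.

```python
import warnings

# Legacy bare spellings that gain angle brackets under this proposal.
LEGACY_TYPES = {"attach", "filepath"}


def normalize_type(declared: str) -> str:
    """Map a legacy spelling to its bracketed form, emitting a DeprecationWarning."""
    if declared.startswith("<"):
        return declared  # already in the new syntax
    base, sep, store = declared.partition("@")
    if base not in LEGACY_TYPES:
        return declared  # primitive types are left alone
    replacement = f"<{base}{sep}{store}>"
    warnings.warn(
        f"type '{declared}' is deprecated; use '{replacement}' instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return replacement
```

For example, `normalize_type("attach@main")` returns `"<attach@main>"` and warns, so existing table definitions keep working while they are migrated.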
| 338 | + |
| 339 | +### Phase 4: Object Type Refinement |
| 340 | +- Already implemented as a separate system |
| 341 | +- Ensure consistency with AttributeType patterns |
| 342 | +- Consider `<object@store>` syntax |
| 343 | + |
| 344 | +### Phase 5: Cleanup |
| 345 | +- Remove scattered type handling from table.py, fetch.py |
| 346 | +- Consolidate external storage logic |
| 347 | +- Update documentation |
| 348 | + |
| 349 | +## Summary |
| 350 | + |
| 351 | +| Type | Input | Output | Internal | External | Use Case | |
| 352 | +|------|-------|--------|----------|----------|----------| |
| 353 | +| `longblob` | bytes | bytes | ✓ | ✓ | Raw binary data | |
| 354 | +| `<djblob>` | any | any | ✓ | ✓ | Python objects, arrays | |
| 355 | +| `<attach>` | path | path | ✓ | ✓ | Files with filename | |
| 356 | +| `<filepath>` | path | path | ✗ | ✓ | Managed file workflows | |
| 357 | +| `<object>` | path/ref | ObjectRef | ✗ | ✓ | Large files, Zarr, HDF5 | |
| 358 | + |
| 359 | +This design: |
| 360 | +1. Makes all custom types consistent AttributeTypes |
| 361 | +2. Separates storage location (`@store`) from encoding behavior |
| 362 | +3. Provides clear semantics for each type |
| 363 | +4. Enables gradual migration from current implementation |