# Blobs

Blob attributes store serialized Python objects in the database. DataJoint
automatically serializes objects on insert and deserializes them on fetch.

## Defining Blob Attributes

```python
@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int
    ---
    signal : longblob      # numpy array
    metadata : longblob    # dictionary
    timestamps : longblob  # 1D array
    """
```

### Blob Sizes

| Type | Max Size | Use Case |
|------|----------|----------|
| `tinyblob` | 255 bytes | Small binary data |
| `blob` | 64 KB | Small arrays |
| `mediumblob` | 16 MB | Medium arrays |
| `longblob` | 4 GB | Large arrays, images |

Use `longblob` for most scientific data to avoid size limitations.

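As a rough guide, an array's in-memory size approximates its serialized blob
size (before compression), so NumPy's `nbytes` helps decide whether a given
blob type is large enough. A quick sketch:

```python
import numpy as np

arr = np.random.randn(1000, 64)        # float64 array
print(arr.nbytes)                      # 512000 bytes -> exceeds `blob` (64 KB)
print(arr.astype(np.float32).nbytes)   # 256000 bytes -> still needs mediumblob or longblob
```
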
## Inserting Blobs

```python
import numpy as np

# Insert numpy arrays
Recording.insert1({
    'recording_id': 1,
    'signal': np.random.randn(10000, 64),  # 10k samples, 64 channels
    'metadata': {'sampling_rate': 30000, 'gain': 1.5},
    'timestamps': np.linspace(0, 10, 10000)
})
```

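Several rows, each carrying its own blob values, can also be inserted in a
single `insert` call; each entry is serialized independently. A brief sketch
using the same `Recording` table:

```python
# Batch insert: a list of dicts, one per row
Recording.insert([
    {'recording_id': 2,
     'signal': np.random.randn(5000, 32),
     'metadata': {'sampling_rate': 20000, 'gain': 1.0},
     'timestamps': np.linspace(0, 5, 5000)},
    {'recording_id': 3,
     'signal': np.random.randn(5000, 32),
     'metadata': {'sampling_rate': 20000, 'gain': 1.0},
     'timestamps': np.linspace(0, 5, 5000)},
])
```
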
### Supported Types

DataJoint serializes these Python types:

**Scalars**
```python
data = {
    'int_val': 42,
    'float_val': 3.14159,
    'bool_val': True,
    'str_val': 'hello world',
}
```

**Collections**
```python
data = {
    'list_val': [1, 2, 3, 4, 5],
    'tuple_val': (1, 'a', 3.14),
    'set_val': {1, 2, 3},
    'dict_val': {'key1': 'value1', 'key2': [1, 2, 3]},
}
```

**NumPy Arrays**
```python
data = {
    'array_1d': np.array([1, 2, 3, 4, 5]),
    'array_2d': np.random.randn(100, 100),
    'array_3d': np.zeros((10, 256, 256)),  # e.g., video frames
    'complex_array': np.array([1+2j, 3+4j]),
    'structured': np.array([(1, 2.0), (3, 4.0)],
                           dtype=[('x', 'i4'), ('y', 'f8')]),
}
```

**Special Types**
```python
import uuid
from decimal import Decimal
from datetime import datetime, date

data = {
    'uuid_val': uuid.uuid4(),
    'decimal_val': Decimal('3.14159265358979'),
    'datetime_val': datetime.now(),
    'date_val': date.today(),
}
```

## Fetching Blobs

Blobs are automatically deserialized on fetch:

```python
# Fetch entire entity
record = (Recording & 'recording_id=1').fetch1()
signal = record['signal']      # numpy array
metadata = record['metadata']  # dict

# Fetch specific blob attribute
signal = (Recording & 'recording_id=1').fetch1('signal')
print(signal.shape)  # (10000, 64)
print(signal.dtype)  # float64

# Fetch multiple blobs
signal, timestamps = (Recording & 'recording_id=1').fetch1('signal', 'timestamps')
```
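
When a query matches multiple rows, `fetch` returns the blob attribute as a
sequence of deserialized objects, one per row. A short sketch, assuming several
recordings have been inserted:

```python
# Each element of `signals` is an independent numpy array
keys, signals = Recording.fetch('KEY', 'signal')
for key, signal in zip(keys, signals):
    print(key['recording_id'], signal.shape)
```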

## External Storage

For large blobs, use external storage to avoid database bloat:

```python
@schema
class LargeData(dj.Manual):
    definition = """
    data_id : int
    ---
    large_array : blob@external  # stored outside database
    """
```

Configure external storage in settings:

```json
{
    "stores": {
        "external": {
            "protocol": "file",
            "location": "/data/blobs"
        }
    }
}
```

See [External Store](../admin/external-store.md) for configuration details.
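
The same store can also be set from Python through `dj.config` before the
tables that use it are declared. Once configured, inserts and fetches of
externally stored blobs look exactly like internal ones. A minimal sketch,
assuming `/data/blobs` exists and is writable:

```python
import datajoint as dj
import numpy as np

dj.config['stores'] = {
    'external': {'protocol': 'file', 'location': '/data/blobs'}
}

# Usage is unchanged: serialization and external storage happen transparently
LargeData.insert1({'data_id': 1,
                   'large_array': np.zeros((4000, 4000), dtype=np.float32)})
arr = (LargeData & 'data_id=1').fetch1('large_array')
```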

## Compression

Blobs larger than 1 KiB are automatically compressed with zlib. This is
transparent to users: compression and decompression happen automatically on
insert and fetch.

```python
# Large array is compressed automatically
large_data = np.random.randn(1000000)  # ~8 MB uncompressed
Table.insert1({'data': large_data})    # Stored compressed
fetched = Table.fetch1('data')         # Decompressed automatically
```

## Performance Tips

### Use Appropriate Data Types

```python
# Good: use float32 when float64 precision isn't needed
signal = signal.astype(np.float32)  # Half the storage

# Good: use appropriate integer sizes
counts = counts.astype(np.uint16)   # If values < 65536
```

### Avoid Storing Redundant Data

```python
# Bad: store computed values that can be derived from the signal
Recording.insert1({
    'recording_id': 1,
    'signal': signal,
    'mean': signal.mean(),  # Can be computed from signal
    'std': signal.std(),    # Can be computed from signal
})

# Good: compute on fetch
signal = (Recording & 'recording_id=1').fetch1('signal')
mean, std = signal.mean(), signal.std()
```

### Consider Chunking Large Data

```python
# For very large data, consider splitting into chunks
@schema
class VideoFrame(dj.Manual):
    definition = """
    -> Video
    frame_num : int
    ---
    frame : longblob
    """

# Store frames individually rather than the entire video
for i, frame in enumerate(video_frames):
    VideoFrame.insert1({'video_id': 1, 'frame_num': i, 'frame': frame})
```

## MATLAB Compatibility

DataJoint's blob format is compatible with MATLAB's mYm serialization,
allowing data sharing between Python and MATLAB pipelines:

```python
# Data inserted from Python
Table.insert1({'data': np.array([[1, 2], [3, 4]])})
```

```matlab
% Fetched in MATLAB
data = fetch1(Table, 'data');
% data is a 2x2 matrix
```

## Common Patterns

### Store Model Weights

```python
@schema
class TrainedModel(dj.Computed):
    definition = """
    -> TrainingRun
    ---
    weights : longblob
    architecture : varchar(100)
    accuracy : float
    """

    def make(self, key):
        model = train_model(key)
        self.insert1(dict(
            key,
            weights=model.get_weights(),
            architecture=model.name,
            accuracy=evaluate(model)
        ))
```

### Store Image Data

```python
@schema
class Image(dj.Manual):
    definition = """
    image_id : int
    ---
    pixels : longblob     # HxWxC array
    format : varchar(10)  # 'RGB', 'RGBA', 'grayscale'
    """

# Insert image
import imageio
img = imageio.imread('photo.png')
Image.insert1({'image_id': 1, 'pixels': img, 'format': 'RGB'})

# Fetch and display
import matplotlib.pyplot as plt
pixels = (Image & 'image_id=1').fetch1('pixels')
plt.imshow(pixels)
```

### Store Time Series

```python
@schema
class TimeSeries(dj.Imported):
    definition = """
    -> Recording
    ---
    data : longblob        # NxT array (N channels, T samples)
    sampling_rate : float  # Hz
    start_time : float     # seconds
    """

    def make(self, key):
        data, sr, t0 = load_recording(key)
        self.insert1(dict(key, data=data, sampling_rate=sr, start_time=t0))
```
## Limitations

- Blob content is opaque to SQL queries (you cannot restrict by array values; see the sketch below)
- Large blobs increase database backup size
- Consider the [object type](object.md) for very large files or cloud storage
- Avoid storing objects with external references (file handles, connections)
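
Because blob contents cannot appear in restrictions, any filtering on array
values has to happen client-side after fetching. A brief sketch:

```python
# Blob columns cannot be used in restrictions, so narrow the query with
# regular attributes first, then filter on array values in Python
keys, signals = (Recording & 'recording_id < 100').fetch('KEY', 'signal')
high_variance = [k for k, s in zip(keys, signals) if s.std() > 1.0]
```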