
Commit f77f32f

Enhance populate and blob documentation with detailed examples
- populate.md: Complete rewrite with examples for make(), populate(), three-part make pattern, distributed processing, error handling, master-part pattern, common patterns
- blob.md: Comprehensive blob documentation covering definition, insertion, supported types, external storage, compression, performance tips, MATLAB compatibility, common patterns
1 parent 9aa2a13 commit f77f32f

2 files changed (+582, -260 lines)


docs/src/datatypes/blob.md

Lines changed: 278 additions & 17 deletions
@@ -1,26 +1,287 @@
# Blobs

Blob attributes store serialized Python objects in the database. DataJoint
automatically serializes objects on insert and deserializes them on fetch.

## Defining Blob Attributes

```python
@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int
    ---
    signal : longblob       # numpy array
    metadata : longblob     # dictionary
    timestamps : longblob   # 1D array
    """
```

### Blob Sizes

| Type | Max Size | Use Case |
|------|----------|----------|
| `tinyblob` | 255 bytes | Small binary data |
| `blob` | 64 KB | Small arrays |
| `mediumblob` | 16 MB | Medium arrays |
| `longblob` | 4 GB | Large arrays, images |

Use `longblob` for most scientific data to avoid size limitations.
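
A quick way to pick a blob type is to check the array's in-memory size with NumPy's `nbytes`. This is a rough guide only; the serialized blob adds a small header and may be compressed, so the stored size differs somewhat:

```python
import numpy as np

# 10,000 samples x 64 channels of float64: 10000 * 64 * 8 = 5,120,000 bytes (~4.9 MB)
signal = np.zeros((10000, 64), dtype=np.float64)
print(signal.nbytes)  # 5120000 -> too large for `blob` (64 KB); fits `mediumblob` or `longblob`

# The same data as float32 is half the size
print(signal.astype(np.float32).nbytes)  # 2560000
```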

## Inserting Blobs

```python
import numpy as np

# Insert numpy arrays
Recording.insert1({
    'recording_id': 1,
    'signal': np.random.randn(10000, 64),  # 10k samples, 64 channels
    'metadata': {'sampling_rate': 30000, 'gain': 1.5},
    'timestamps': np.linspace(0, 10, 10000)
})
```

### Supported Types

DataJoint serializes these Python types:

**Scalars**
```python
data = {
    'int_val': 42,
    'float_val': 3.14159,
    'bool_val': True,
    'str_val': 'hello world',
}
```

**Collections**
```python
data = {
    'list_val': [1, 2, 3, 4, 5],
    'tuple_val': (1, 'a', 3.14),
    'set_val': {1, 2, 3},
    'dict_val': {'key1': 'value1', 'key2': [1, 2, 3]},
}
```

**NumPy Arrays**
```python
data = {
    'array_1d': np.array([1, 2, 3, 4, 5]),
    'array_2d': np.random.randn(100, 100),
    'array_3d': np.zeros((10, 256, 256)),  # e.g., video frames
    'complex_array': np.array([1+2j, 3+4j]),
    'structured': np.array([(1, 2.0), (3, 4.0)],
                           dtype=[('x', 'i4'), ('y', 'f8')]),
}
```

**Special Types**
```python
import uuid
from decimal import Decimal
from datetime import datetime, date

data = {
    'uuid_val': uuid.uuid4(),
    'decimal_val': Decimal('3.14159265358979'),
    'datetime_val': datetime.now(),
    'date_val': date.today(),
}
```

## Fetching Blobs

Blobs are automatically deserialized on fetch:

```python
# Fetch entire entity
record = (Recording & 'recording_id=1').fetch1()
signal = record['signal']      # numpy array
metadata = record['metadata']  # dict

# Fetch specific blob attribute
signal = (Recording & 'recording_id=1').fetch1('signal')
print(signal.shape)  # (10000, 64)
print(signal.dtype)  # float64

# Fetch multiple blobs
signal, timestamps = (Recording & 'recording_id=1').fetch1('signal', 'timestamps')
```
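
When a restriction matches several rows, use `fetch` (rather than `fetch1`) to retrieve the blob column across all of them; each entry is deserialized separately. A short sketch, assuming multiple recordings exist (`order_by` is optional):

```python
# One deserialized array per matching row
signals = Recording.fetch('signal', order_by='recording_id')
for s in signals:
    print(s.shape)
```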

## External Storage

For large blobs, use external storage to avoid database bloat:

```python
@schema
class LargeData(dj.Manual):
    definition = """
    data_id : int
    ---
    large_array : blob@external  # stored outside database
    """
```

Configure external storage in settings:

```json
{
    "stores": {
        "external": {
            "protocol": "file",
            "location": "/data/blobs"
        }
    }
}
```
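
The same store can also be configured at runtime through `dj.config` before the tables that use it are accessed; a minimal sketch mirroring the JSON settings above:

```python
import datajoint as dj

# Runtime equivalent of the "stores" entry in the settings file
dj.config['stores'] = {
    'external': {
        'protocol': 'file',
        'location': '/data/blobs',
    }
}
```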

See [External Store](../admin/external-store.md) for configuration details.

## Compression

Blobs larger than 1 KiB are automatically compressed using zlib. This is
transparent to users; compression and decompression happen automatically.

```python
# Large array is compressed automatically
large_data = np.random.randn(1000000)  # ~8 MB uncompressed
Table.insert1({'data': large_data})    # Stored compressed
fetched = Table.fetch1('data')         # Decompressed automatically
```
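
How much compression helps depends on the content: arrays with structure or many repeated values shrink dramatically, while random noise barely compresses. The standalone sketch below applies zlib directly to the raw array bytes to illustrate the effect; it is not DataJoint's exact on-disk format:

```python
import zlib
import numpy as np

noise = np.random.randn(1_000_000)   # ~8 MB of random float64
zeros = np.zeros(1_000_000)          # ~8 MB of zeros
print(len(zlib.compress(noise.tobytes())) / noise.nbytes)  # close to 1.0 (little gain)
print(len(zlib.compress(zeros.tobytes())) / zeros.nbytes)  # a tiny fraction (large gain)
```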

## Performance Tips

### Use Appropriate Data Types

```python
# Good: use float32 when float64 precision isn't needed
signal = signal.astype(np.float32)  # Half the storage

# Good: use appropriate integer sizes
counts = counts.astype(np.uint16)   # If values < 65536
```

### Avoid Storing Redundant Data

```python
# Bad: store computed values that can be derived
Recording.insert1({
    'signal': signal,
    'mean': signal.mean(),  # Can be computed from signal
    'std': signal.std(),    # Can be computed from signal
})

# Good: compute on fetch
signal = Recording.fetch1('signal')
mean, std = signal.mean(), signal.std()
```

### Consider Chunking Large Data

```python
# For very large data, consider splitting into chunks
@schema
class VideoFrame(dj.Manual):
    definition = """
    -> Video
    frame_num : int
    ---
    frame : longblob
    """

# Store frames individually rather than entire video
for i, frame in enumerate(video_frames):
    VideoFrame.insert1({'video_id': 1, 'frame_num': i, 'frame': frame})
```
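
Each `insert1` call is a separate round trip to the database. When storing many small chunks, passing a list of rows to a single `insert` call is usually faster; a sketch using the `VideoFrame` table above:

```python
# Insert all frames in one call instead of one insert1 per frame
VideoFrame.insert([
    {'video_id': 1, 'frame_num': i, 'frame': frame}
    for i, frame in enumerate(video_frames)
])
```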

## MATLAB Compatibility

DataJoint's blob format is compatible with MATLAB's mYm serialization,
allowing data sharing between Python and MATLAB pipelines:

```python
# Data inserted from Python
Table.insert1({'data': np.array([[1, 2], [3, 4]])})
```

```matlab
% Fetched in MATLAB
data = fetch1(Table, 'data');
% data is a 2x2 matrix
```

## Common Patterns

### Store Model Weights

```python
@schema
class TrainedModel(dj.Computed):
    definition = """
    -> TrainingRun
    ---
    weights : longblob
    architecture : varchar(100)
    accuracy : float
    """

    def make(self, key):
        model = train_model(key)
        self.insert1(dict(
            key,
            weights=model.get_weights(),
            architecture=model.name,
            accuracy=evaluate(model)
        ))
```

### Store Image Data

```python
@schema
class Image(dj.Manual):
    definition = """
    image_id : int
    ---
    pixels : longblob     # HxWxC array
    format : varchar(10)  # 'RGB', 'RGBA', 'grayscale'
    """

# Insert image
import imageio
img = imageio.imread('photo.png')
Image.insert1({'image_id': 1, 'pixels': img, 'format': 'RGB'})

# Fetch and display
import matplotlib.pyplot as plt
pixels = (Image & 'image_id=1').fetch1('pixels')
plt.imshow(pixels)
```

### Store Time Series

```python
@schema
class TimeSeries(dj.Imported):
    definition = """
    -> Recording
    ---
    data : longblob        # NxT array (N channels, T samples)
    sampling_rate : float  # Hz
    start_time : float     # seconds
    """

    def make(self, key):
        data, sr, t0 = load_recording(key)
        self.insert1(dict(key, data=data, sampling_rate=sr, start_time=t0))
```

## Limitations

- Blob content is opaque to SQL queries (you can't filter on array values in a restriction; filter after fetching, as in the sketch below)
- Large blobs increase database backup size
- Consider the [object type](object.md) for very large files or cloud storage
- Avoid storing objects with external references (file handles, connections)
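
Because the database cannot see inside a blob, any selection based on blob contents must happen in Python after fetching. A minimal sketch using the `Recording` table from above (the amplitude threshold is arbitrary, for illustration only):

```python
import numpy as np

# Restrict on regular columns in SQL, then filter on blob contents in Python
keys, signals = Recording.fetch('KEY', 'signal')
quiet = [k for k, s in zip(keys, signals) if np.abs(s).max() < 1.0]
```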
