11# Filepath Datatype
22
3- ## Configuration & usage
3+ Note: Filepath Datatype is available as a preview feature in DataJoint Python v0.12.
4+ As a preview feature, it must be explicitly enabled. To do so, make sure
5+ to set the environment variable `FILEPATH_FEATURE_SWITCH=TRUE` prior to use.
46
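One way to satisfy the note above from within Python is sketched below. It assumes the switch only needs to be present in the process environment before DataJoint is used; exporting the variable in the shell before starting Python should work equally well.

```python
import os

# Set the preview switch before DataJoint is imported or used,
# so that the feature check can see it.
os.environ["FILEPATH_FEATURE_SWITCH"] = "TRUE"

# import datajoint as dj  # import only after the variable has been set
```
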
5- https://github.com/datajoint/datajoint-python/issues/481
7+ ## Configuration & Usage
68
7- The ` filepath ` attribute type links DataJoint records to files already
9+ Corresponding to issue
10+ [#481](https://github.com/datajoint/datajoint-python/issues/481),
11+ the `filepath` attribute type links DataJoint records to files already
812managed outside of DataJoint. This can aid in sharing data with
9- other systems, such as allowing an image viewer application to
13+ other systems such as allowing an image viewer application to
1014directly use files from a DataJoint pipeline, or to allow downstream
11- tables to reference data which lives outside of the DataJoint
12- pipeline .
15+ tables to reference data that resides outside of DataJoint
16+ pipelines.
1317
1418To define a table using the `filepath` datatype, an existing DataJoint
1519[store](../../sysadmin/external-store.md) should be created and then referenced in the
1620new table definition. For example, given a simple store:
1721
18- ``` json
19- dj.config['stores'] = {
20- 'data': {
21- 'protocol': 'file',
22- 'location': '/data',
23- 'stage': '/data'
24- }
25- }
22+ ``` python
23+ dj.config['stores'] = {
24+     'data': {
25+         'protocol': 'file',
26+         'location': '/data',
27+         'stage': '/data'
28+     }
29+ }
2630```
2731
28- We can define an ScanImages table as follows:
32+ we can define a `ScanImages` table as follows:
2933
3034``` python
3135@schema
3236class ScanImages(dj.Manual):
33- definition = """
34- -> Session
35- image_id: int
36- ---
37- image_path: filepath@data
38- """
37+     definition = """
38+     -> Session
39+     image_id: int
40+     ---
41+     image_path: filepath@data
42+     """
3943```
4044
41- This table can now be used for tracking paths within the ' /data' area .
45+ This table can now be used for tracking paths within the `/data` local directory.
4246For example:
4347
4448``` python
@@ -50,27 +54,43 @@ For example:
5054As can be seen from the example, unlike [blob](blobs.md) records, file
5155paths are managed as path locations to the underlying file.
5256
53- ## Filepath integrity notes
57+ ## Integrity Notes
5458
5559Unlike other data in DataJoint, data in `filepath` records are
5660deliberately intended for shared use outside of DataJoint. To help
57- ensure integrity of filepath records, DataJoint will record a
58- checksum of the file data on insert, and will verify this checksum
59- on fetch. However, since the underlying file data may be shared
61+ ensure integrity of `filepath` records, DataJoint will record a
62+ checksum of the file data on `insert`, and will verify this checksum
63+ on `fetch`. However, since the underlying file data may be shared
6064with other applications, special care should be taken to ensure
6165records stored in `filepath` attributes are not modified outside
6266of the pipeline, or, if they are, that records in the pipeline are
63- updated accordingly. A safe method of changing filepath data is
67+ updated accordingly. A safe method of changing `filepath` data is
6468as follows:
6569
66- 1 . Delete filepath database record
67- - This will ensure that any downstream records in the pipeline depending
68- on the ` filepath ` record are purged from the database
69- 2 . Modify filepath data
70- 3 . Re-insert corresponding filepath record
71- - This will add the record back to DataJoint with an updated file checksum
72- 4 . Compute any downstream dependencies, if needed
73- - This will ensure that downstream results dependent on the filepath
74- record are updated to reflect the newer filepath contents.
70+ 1. Delete the `filepath` database record.
71+    This will ensure that any downstream records in the pipeline depending
72+    on the `filepath` record are purged from the database.
73+ 2. Modify `filepath` data.
74+ 3. Re-insert the corresponding `filepath` record.
75+    This will add the record back to DataJoint with an updated file checksum.
76+ 4. Compute any downstream dependencies, if needed.
77+    This will ensure that downstream results dependent on the `filepath`
78+    record are updated to reflect the newer `filepath` contents.
79+
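The four steps above could be wrapped in a small helper, sketched here under stated assumptions. The function name `replace_filepath_record` and its parameters are hypothetical and not part of DataJoint; `table` is assumed to be a DataJoint table such as `ScanImages`, and `downstream` an iterable of auto-populated tables. Only the standard restriction (`&`), `delete()`, `insert1()`, and `populate()` calls are used.

```python
from pathlib import Path


def replace_filepath_record(table, key, row, path, new_bytes, downstream=()):
    """Hypothetical helper following the delete / modify / re-insert / recompute steps."""
    # 1. Delete the record; dependent downstream records are removed with it.
    (table & key).delete()
    # 2. Modify the file contents outside of DataJoint.
    Path(path).write_bytes(new_bytes)
    # 3. Re-insert the record so a fresh checksum is stored.
    table.insert1(row)
    # 4. Recompute downstream results, if there are any.
    for computed in downstream:
        computed.populate()
```

Note that `delete()` may prompt for confirmation depending on the DataJoint safe-mode setting, so an unattended script may need to account for that.
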
80+ ### Disable Fetch Verification
81+
82+ Note: Skipping the checksum is not recommended, since the checksum ensures file
83+ integrity, i.e. that downloaded files are not corrupted. With S3 stores, most of the
84+ time to complete a `.fetch()` is spent on the file download itself rather than on
85+ evaluating the checksum, so this option will primarily benefit `filepath` usage connected to a local `file` store.
86+
87+ To skip checksum verification for large files, set a threshold in bytes
88+ above which checksums are no longer evaluated, as in the example below:
89+
90+ ``` python
91+ dj.config["filepath_checksum_size_limit"] = 5 * 1024 ** 3  # Skip for all files greater than 5GiB
92+ ```
93+
94+ The default is `None`, which means checksums are always verified.
7595
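The threshold semantics described above can be illustrated with a small standalone sketch. This is not DataJoint's internal code: the helper `should_verify_checksum` is hypothetical, and it assumes that a file exactly at the limit is still verified, since the example comment only says files greater than the limit are skipped.

```python
from typing import Optional

GIB = 1024 ** 3  # bytes in one gibibyte


def should_verify_checksum(file_size: int, size_limit: Optional[int]) -> bool:
    """Return True when a file of `file_size` bytes would still be checksummed."""
    if size_limit is None:
        # Default setting: always verify.
        return True
    # Files greater than the limit are skipped.
    return file_size <= size_limit


limit = 5 * GIB  # 5368709120 bytes, the value used in the config example
```

With this rule, a 1 GiB file is verified under a 5 GiB limit, while a 6 GiB file is not.
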
7696<!-- TODO: purging filepath data -->