tutorial on creating metadata for user-created data #128
Fixes #126 adds sphinx-copybutton extension
tmchartrand
left a comment
before merging this needs:
- [ ] add a few glossary terms with links
- [ ] merge the minor changes to the Processing schema that this PR assumes have already happened
dougollerenshaw
left a comment
Notes from the data outreach meeting discussion:
- There should be a clear distinction between data assets that are aggregated for analysis (the output truly requires all inputs) vs those that are aggregated for convenience (someone wants to combine multiple assets because it's easier to track). The latter should just be separate assets with complete inherited metadata (as in, the subject.json should be carried through).
- The text for the aggregated asset should be explicit about the fact that subject and procedures JSON files can be omitted. Right now it's implicit in that they are not listed as required.
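One way to make the rule in these notes explicit is a small helper that states which metadata files each kind of asset carries. This is only a sketch: `AggregationKind`, `required_metadata_files`, and the file lists are hypothetical names chosen for illustration, not part of aind-data-schema.

```python
from enum import Enum


class AggregationKind(Enum):
    ANALYSIS = "analysis"  # the output truly requires all inputs
    CONVENIENCE = "convenience"  # inputs are combined only for easier tracking


# Hypothetical file lists, for illustration only.
ALWAYS_REQUIRED = ("data_description.json", "processing.json")
SUBJECT_LEVEL = ("subject.json", "procedures.json")


def required_metadata_files(kind: AggregationKind) -> list[str]:
    """Return the metadata files each resulting asset should carry."""
    if kind is AggregationKind.ANALYSIS:
        # One aggregated asset: subject and procedures can be omitted.
        return list(ALWAYS_REQUIRED)
    # Convenience grouping: keep separate assets, each inheriting the
    # complete metadata of its input, including subject.json.
    return list(ALWAYS_REQUIRED + SUBJECT_LEVEL)
```

Stating the omission in code (or in a table in the docs) would make it explicit rather than implied by absence from the required list.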
| ``` | ||
| - Typically all published external data will be *derived* not *raw* data. | ||
| (Accommodating raw external data may require data schema adjustments.) | ||
| - For `project_name` a shortened version of a related manuscript can be used. |
I'm wondering if we want this or we want the project name to be the project it's being used for? (e.g. as for aggregated assets). I can kind of talk myself into both directions
updated this with "external data" as a placeholder project name, still need to figure out where to record this for standardization - probably a constant somewhere in data schema models.
saskiad
left a comment
lots of comments; most of the big ones are things we discussed yesterday
| ## Storage locations | ||
| For scientist-derived data that is relatively stable (won't be replaced often),
relatively stable but not going into published result? (as a way to distinguish from the next section?)
since very few people have permission to create assets directly in aind-open-data, I think starting with internal is still the generic approach, but I'll see if I can make it clearer that they should request the transfer ASAP if it's going directly towards a published result.
dbirman
left a comment
Just putting comments for now, I think we really need to put the rules about single-asset or aggregate asset analysis in aind-data-schema docs
| Steps 1-3 can be scripted end-to-end within a Code Ocean capsule copied from the [metadata template capsule](https://codeocean.com/capsule/1234567/tree), | ||
| or scripts to add metadata can be added to an existing analysis capsule based on the snippets below. | ||
| ## Metadata for non-AIND data |
Shouldn't this whole section go after the intermediate results section? Also I'm confused because the title of this page is about scientist-derived data so I wasn't expecting this section to pop up in the middle.
I'm considering non-AIND data a subset of "scientist-derived" since someone would be importing it as needed for their analysis. It would make a bit more sense if we used "non-pipeline derived data" or something similar, which was one alternative we considered.
I will move it after though, you're right that makes more sense.
| ### Data Description | ||
| The data description for derived data records the origin and organizational context |
I think we should be clear here that this is about multi-asset aggregate analyses. Probably will be best to link from here to the data schema figure that I will hopefully create soon showing how the requirements play out.
we want folks to write a new data description whenever the analysis is for a different project from the input data, so I figured it made the most sense to present this as the generic approach, and copying the data description with the helper methods as a special case shortcut below.
| ### Putting it all together | ||
| #### Single-input results |
This whole section we should remove and link to the aind-data-schema docs page that I need to create.
yes definitely, feel free to copy some from here. or rather this section will be replaced with some tips on using helper methods in aind-data-schema and whatever metadata manager library that accomplish the same thing.
| from datetime import datetime | ||
| import aind_data_schema.core.data_description as ds | ||
| creation_time = datetime(2024,4,21) |
should the creation time be when the data was accessed rather than when the original team created it? I imagine that might be hard to know in some cases?
I say to use a related publication date instead as a fallback; I feel like one or the other is usually available.
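The fallback described here could be sketched as below. `choose_creation_time` is a hypothetical helper, not an aind-data-schema function, and treating the publication date as midnight UTC is an assumption made only so the result is timezone-aware.

```python
from datetime import date, datetime, time, timezone
from typing import Optional


def choose_creation_time(
    original_creation: Optional[datetime],
    publication_date: Optional[date],
) -> datetime:
    """Prefer the original creation time; fall back to a publication date."""
    if original_creation is not None:
        return original_creation
    if publication_date is not None:
        # Assumption: represent the publication date as midnight UTC.
        return datetime.combine(publication_date, time.min, tzinfo=timezone.utc)
    raise ValueError("Need either a creation time or a related publication date")


# Example: only the publication date of the related manuscript is known.
creation_time = choose_creation_time(None, date(2024, 4, 21))
```

Raising when neither value is known keeps the gap visible instead of silently recording the access date.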
| new_md.acquisition.write_standard_file(output_path) | ||
| ``` | ||
| If the same investigators and project team are responsible for the processing as the input data, |
I'd just say project name. People get really confused about investigators despite all the documentation around it. Investigators are tied to project so let's not bring up a question of whether the investigators are the same or not.
Maybe also what to do if the project is not the same - e.g. people need to get the right list of investigators and funding, so if they are doing it manually I'm worried they won't do it correctly...
we maybe need a metadata service endpoint to help with this?
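Until something like that exists, the project-name check suggested above could be sketched as follows. `same_project` is a hypothetical helper, and comparing names case-insensitively with whitespace collapsed is an assumption, not an agreed rule.

```python
def same_project(input_project_name: str, analysis_project_name: str) -> bool:
    """Return True when both names refer to the same project.

    Assumption: names match if they are equal after collapsing
    whitespace and ignoring case.
    """

    def normalize(name: str) -> str:
        return " ".join(name.split()).casefold()

    return normalize(input_project_name) == normalize(analysis_project_name)
```

When this returns `True`, the input data description can be copied as the shortcut case; when it returns `False`, a new data description is needed, with investigators and funding looked up for the analysis project rather than typed in by hand.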
saskiad
left a comment
I have a few comments that might be worth addressing or can be dealt with after Wednesday
📚 Documentation preview 📚: https://scicomp-docs--128.org.readthedocs.build/en/128/