
tutorial on creating metadata for user-created data #128

Merged
tmchartrand merged 4 commits into main from 126-tutorial-on-creating-metadata-for-intermediate-result-assets
May 27, 2026

Conversation

Member

@tmchartrand tmchartrand commented May 14, 2026

Fixes #126
adds sphinx-copybutton extension


📚 Documentation preview 📚: https://scicomp-docs--128.org.readthedocs.build/en/128/

Member Author

@tmchartrand tmchartrand left a comment

Before merging, this needs:
- [ ] add a few glossary terms with links
- [ ] merge the minor changes to the Processing schema that this assumes have happened

Contributor

@dougollerenshaw dougollerenshaw left a comment

Notes from the data outreach meeting discussion:

  • There should be a clear distinction between data assets that are aggregated for analysis (the output truly requires all inputs) vs those that are aggregated for convenience (someone wants to combine multiple assets because it's easier to track). The latter should just be separate assets with complete inherited metadata (as in, the subject.json should be carried through).
  • The text for the aggregated asset should be explicit about the fact that subject and procedures JSON files can be omitted. Right now it's implicit in that they are not listed as required.
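
A minimal sketch of the rule in the two notes above, not the tutorial's own code. The function name, the flag, and the assumption that `data_description.json` and `processing.json` always accompany a derived asset are hypothetical; only the treatment of `subject.json` and `procedures.json` comes from the meeting notes.

```python
# Files assumed here to accompany every derived asset (an assumption, not a schema rule).
ALWAYS_EXPECTED = ("data_description.json", "processing.json")
# Files that describe one subject, so they are only inherited from a single input asset.
PER_SUBJECT = ("subject.json", "procedures.json")


def expected_metadata_files(n_inputs: int, output_requires_all_inputs: bool) -> tuple[str, ...]:
    """Return the metadata files a derived asset would be expected to carry."""
    if n_inputs < 1:
        raise ValueError("a derived asset needs at least one input asset")
    if n_inputs > 1 and not output_requires_all_inputs:
        # Aggregated for convenience only: keep the results as separate assets.
        raise ValueError("write one derived asset per input instead of aggregating")
    if n_inputs == 1:
        # Single-input result: inherit the complete metadata, including subject.json.
        return ALWAYS_EXPECTED + PER_SUBJECT
    # True aggregate: subject and procedures files can be omitted.
    return ALWAYS_EXPECTED
```

Raising an error for the convenience case is one possible design choice; it keeps the "separate assets" guidance from being silently bypassed.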

Comment thread docs/source/explore_analyze/create_processing_metadata.md Outdated
Comment thread docs/source/explore_analyze/create_processing_metadata.md Outdated
```
- Typically all published external data will be *derived* not *raw* data.
(Accommodating raw external data may require data schema adjustments.)
- For `project_name` a shortened version of a related manuscript can be used.
Contributor

I'm wondering whether we want this, or whether the project name should be the project the data is being used for (e.g. as for aggregated assets). I can talk myself into either direction.

Member Author

Updated this to use "external data" as a placeholder project name. We still need to figure out where to record this for standardization, probably as a constant somewhere in the data schema models.

Contributor

@saskiad saskiad left a comment

Lots of comments; most of the big things were things we discussed yesterday.

@tmchartrand tmchartrand marked this pull request as ready for review May 26, 2026 22:41
@tmchartrand tmchartrand requested review from dbirman and saskiad May 26, 2026 22:41
Comment thread docs/source/glossary.md
Comment thread docs/source/glossary.md


## Storage locations
For scientist-derived data that is relatively stable (won't be replaced often),
Contributor

Relatively stable but not going into a published result? (As a way to distinguish this from the next section?)

Member Author

Since very few people have permission to create assets directly in aind-open-data, I think starting with internal is still the generic approach, but I'll see if I can make it clearer that they should request the transfer ASAP if it's going directly towards a published result.

Member

@dbirman dbirman left a comment

Just leaving comments for now. I think we really need to put the rules about single-asset vs. aggregate-asset analysis in the aind-data-schema docs.

Comment thread docs/source/explore_analyze/create_processing_metadata.md Outdated
Steps 1-3 can be scripted end-to-end within a Code Ocean capsule copied from the [metadata template capsule](https://codeocean.com/capsule/1234567/tree),
or scripts to add metadata can be added to an existing analysis capsule based on the snippets below.

## Metadata for non-AIND data
Member

Shouldn't this whole section go after the intermediate results section? Also I'm confused because the title of this page is about scientist-derived data so I wasn't expecting this section to pop up in the middle.

Member Author

I'm considering non-AIND data a subset of "scientist-derived" data, since someone would be importing it as needed for their analysis. It would make a bit more sense if we used "non-pipeline derived data" or something similar, which was one alternative we considered.
I will move it after, though; you're right that that makes more sense.


### Data Description

The data description for derived data records the origin and organizational context
Member

I think we should be clear here that this is about multi-asset aggregate analyses. Probably will be best to link from here to the data schema figure that I will hopefully create soon showing how the requirements play out.

Member Author

We want folks to write a new data description whenever the analysis is for a different project from the input data, so I figured it made the most sense to present this as the generic approach, with copying the data description via the helper methods as a special-case shortcut below.


### Putting it all together

#### Single-input results
Member

We should remove this whole section and link to the aind-data-schema docs page that I need to create.

Member Author

Yes, definitely; feel free to copy some from here. Or rather, this section will be replaced with some tips on using helper methods, in aind-data-schema and whatever metadata manager library, that accomplish the same thing.

from datetime import datetime
import aind_data_schema.core.data_description as ds

creation_time = datetime(2024, 4, 21)
Contributor

Should the creation time be when the data was accessed rather than when the original team created it? I imagine the latter might be hard to know in some cases.

Member Author

I say to use a related publication date as a fallback instead; I feel like one or the other is usually available.

new_md.acquisition.write_standard_file(output_path)
```

If the same investigators and project team are responsible for the processing as the input data,
Contributor

I'd just say project name. People get really confused about investigators despite all the documentation around it. Investigators are tied to the project, so let's not raise the question of whether the investigators are the same or not.
Maybe also cover what to do if the project is not the same, e.g. people need to get the right list of investigators and funding, so if they are doing it manually I'm worried they won't do it correctly...

Contributor

Maybe we need a metadata service endpoint to help with this?
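
A rough sketch of the check discussed in this thread, working on a data description loaded as a plain dict rather than through aind-data-schema. The helper name is hypothetical, and the field names `project_name`, `investigators`, and `funding_source` are assumed to match `data_description.json`; it is not a substitute for the helper methods or a metadata service endpoint.

```python
import copy


def inherit_project_fields(input_dd: dict, output_project_name: str) -> dict:
    """Copy project-level fields from an input data description when the project matches."""
    if input_dd.get("project_name") != output_project_name:
        # Different project: investigators and funding belong to the new project,
        # so they should be looked up for it rather than copied from the input.
        raise ValueError(
            "project differs from the input asset; supply investigators and "
            "funding for the new project instead of copying them"
        )
    # Assumed field names; only the keys present in the input are carried over.
    keys = ("project_name", "investigators", "funding_source")
    return {key: copy.deepcopy(input_dd[key]) for key in keys if key in input_dd}
```

Keying the decision on the project name alone follows the suggestion above that investigators are tied to the project and need not be compared separately.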

Contributor

@saskiad saskiad left a comment

I have a few comments that might be worth addressing, or that can be dealt with after Wednesday.

@tmchartrand tmchartrand merged commit 475efb5 into main May 27, 2026
1 check passed
@tmchartrand tmchartrand deleted the 126-tutorial-on-creating-metadata-for-intermediate-result-assets branch May 27, 2026 04:59
Development

Successfully merging this pull request may close these issues.

tutorial on creating metadata for intermediate result assets

4 participants