tutorial on creating metadata for user-created data by tmchartrand · Pull Request #128 · AllenNeuralDynamics/aind-software-docs

tmchartrand · 2026-05-14T17:17:08Z

Fixes #126
adds sphinx-copybutton extension

📚 Documentation preview 📚: https://scicomp-docs--128.org.readthedocs.build/en/128/

tmchartrand

Before merging, this needs:
- [ ] add a few glossary terms w/links
- [ ] merge the minor changes to the Processing schema that this assumes have happened

dougollerenshaw

Notes from the data outreach meeting discussion:

There should be a clear distinction between data assets that are aggregated for analysis (the output truly requires all inputs) vs those that are aggregated for convenience (someone wants to combine multiple assets because it's easier to track). The latter should just be separate assets with complete inherited metadata (as in, the subject.json should be carried through).
The text for the aggregated asset should be explicit about the fact that subject and procedures JSON files can be omitted. Right now it's implicit in that they are not listed as required.

saskiad · 2026-05-15T16:57:58Z

+```
+- Typically all published external data will be *derived* not *raw* data.
+(Accommodating raw external data may require data schema adjustments.)
+- For `project_name` a shortened version of a related manuscript can be used.


I'm wondering if we want this or we want the project name to be the project it's being used for? (e.g. as for aggregated assets). I can kind of talk myself into both directions

Updated this with "external data" as a placeholder project name. Still need to figure out where to record this for standardization, probably as a constant somewhere in the data schema models.
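
For discussion, a minimal sketch of what such a constant could look like. The names `EXTERNAL_DATA_PROJECT_NAME` and `resolve_project_name` are hypothetical and are not part of aind-data-schema; only the placeholder string "external data" comes from this thread.

```python
from typing import Optional

# Hypothetical constant; where it should live is still undecided in this thread.
EXTERNAL_DATA_PROJECT_NAME = "external data"


def resolve_project_name(project_name: Optional[str] = None) -> str:
    """Return the given project name, or the placeholder when none is given."""
    if project_name and project_name.strip():
        return project_name.strip()
    return EXTERNAL_DATA_PROJECT_NAME
```

Keeping the string in one place would let the tutorial snippets and any validation code refer to the same value instead of retyping it.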

saskiad

Lots of comments; most of the big ones are things we discussed yesterday.

saskiad · 2026-05-26T22:57:08Z

+
+
+## Storage locations
+For scientist-derived data that is relatively stable (won't be replaced often),


Relatively stable but not going into a published result? (As a way to distinguish it from the next section?)

Since very few people have permission to create assets directly in aind-open-data, I think starting with internal is still the generic approach, but I'll see if I can make it clearer that they should request the transfer ASAP if it's going directly towards a published result.

dbirman

Just putting comments for now. I think we really need to put the rules about single-asset vs. aggregate-asset analysis in the aind-data-schema docs.

dbirman · 2026-05-26T22:52:50Z

+Steps 1-3 can be scripted end-to-end within a Code Ocean capsule copied from the [metadata template capsule](https://codeocean.com/capsule/1234567/tree), 
+or scripts to add metadata can be added to an existing analysis capsule based on the snippets below.
+
+## Metadata for non-AIND data


Shouldn't this whole section go after the intermediate results section? Also I'm confused because the title of this page is about scientist-derived data so I wasn't expecting this section to pop up in the middle.

I'm considering non-AIND data a subset of "scientist-derived", since someone would be importing it as needed for their analysis. It would make a bit more sense if we used "non-pipeline derived data" or something similar, which was one alternative we considered.
I will move it after, though. You're right that that makes more sense.

dbirman · 2026-05-26T22:56:24Z

+
+### Data Description 
+
+The data description for derived data records the origin and organizational context 


I think we should be clear here that this is about multi-asset aggregate analyses. Probably will be best to link from here to the data schema figure that I will hopefully create soon showing how the requirements play out.

We want folks to write a new data description whenever the analysis is for a different project from the input data, so I figured it made the most sense to present this as the generic approach, with copying the data description via the helper methods as a special-case shortcut below.

dbirman · 2026-05-26T22:57:09Z

+
+### Putting it all together
+
+#### Single-input results


We should remove this whole section and link to the aind-data-schema docs page that I need to create.

Yes, definitely, feel free to copy some from here. Or rather, this section will be replaced with some tips on using helper methods in aind-data-schema, and in whatever metadata manager library we settle on, that accomplish the same thing.

saskiad · 2026-05-26T22:59:27Z

+from datetime import datetime
+import aind_data_schema.core.data_description as ds
+
+creation_time = datetime(2024,4,21)


Should the creation time be when the data was accessed rather than when the original team created it? I imagine the latter might be hard to know in some cases.

I say to use a related publication date as a fallback instead. I feel like one or the other is usually available.
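
To make the fallback concrete, here is a minimal sketch using only the standard library. The helper `pick_creation_time` and its parameter names are hypothetical, not an aind-data-schema API; the date is the one from the snippet under review.

```python
from datetime import datetime
from typing import Optional


def pick_creation_time(
    original_creation: Optional[datetime] = None,
    publication_date: Optional[datetime] = None,
) -> datetime:
    """Prefer the original creation time; fall back to the publication date."""
    if original_creation is not None:
        return original_creation
    if publication_date is not None:
        return publication_date
    # Neither is known: fail loudly rather than silently using the access time.
    raise ValueError("Need an original creation time or a related publication date")


# Only the publication date is known here, so it is used as the creation time.
creation_time = pick_creation_time(publication_date=datetime(2024, 4, 21))
```

Raising when neither date is available is one possible choice; falling back to the access time would be another, and the thread has not settled that.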

saskiad · 2026-05-26T23:04:20Z

+new_md.acquisition.write_standard_file(output_path)
+```
+
+If the same investigators and project team are responsible for the processing as the input data, 


I'd just say project name. People get really confused about investigators despite all the documentation around it. Investigators are tied to the project, so let's not raise the question of whether the investigators are the same or not.
Maybe also cover what to do if the project is not the same. People need to get the right list of investigators and funding, so if they are doing it manually I'm worried they won't do it correctly...

Maybe we need a metadata service endpoint to help with this?
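
As a sketch of the project-name-only rule, assuming a hypothetical stand-in class rather than the real aind-data-schema model: `DescriptionStub` and `can_reuse_description` are illustrative names, and the case-insensitive comparison is an assumption, not something agreed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DescriptionStub:
    """Hypothetical stand-in holding only the fields discussed in this thread."""

    project_name: str
    investigators: List[str] = field(default_factory=list)
    funding_source: List[str] = field(default_factory=list)


def can_reuse_description(input_desc: DescriptionStub, analysis_project: str) -> bool:
    """True when the analysis belongs to the same project as the input data."""
    # Assumption: ignore case and surrounding whitespace when comparing names.
    return input_desc.project_name.strip().lower() == analysis_project.strip().lower()
```

When this returns `False`, the investigators and funding would need to be looked up for the new project rather than copied, which is the step a metadata service endpoint could automate.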

saskiad

I have a few comments that might be worth addressing now or can be dealt with after Wednesday.

tutorial on creating metadata for user-created data

42914dc

tmchartrand requested review from dougollerenshaw and saskiad May 14, 2026 17:17

tmchartrand commented May 14, 2026

dougollerenshaw reviewed May 14, 2026
