Skip to content

Conversation

@amas0
Copy link
Collaborator

@amas0 amas0 commented Nov 6, 2025

Submission Checklist

  • Run unit tests
  • Declare copyright holder and open-source license: see below

Summary

This change proposes a new standard naming convention for output files to address some inconsistencies identified in #833.

Specifically it fixes the mismatched stdout mismatched index and enforces all output files to stick to a naming convention that starts with {model_name}-{datetime}_{id}. Extra names, e.g. stdout, profile, will be append with an _{extra} before the file suffix.

This structure was chosen as it is compatible with the intended new argument save_cmdstan_config referenced in #714.

Under these proposed changes, a default 4-chain sample with latent dynamics and profiling saved will create the following output files:

bernoulli-20251106132150_1.csv
bernoulli-20251106132150_1_diagnostic.csv
bernoulli-20251106132150_1_profile.csv
bernoulli-20251106132150_1_stdout.txt
bernoulli-20251106132150_2.csv
bernoulli-20251106132150_2_diagnostic.csv
bernoulli-20251106132150_2_profile.csv
bernoulli-20251106132150_2_stdout.txt
bernoulli-20251106132150_3.csv
bernoulli-20251106132150_3_diagnostic.csv
bernoulli-20251106132150_3_profile.csv
bernoulli-20251106132150_3_stdout.txt
bernoulli-20251106132150_4.csv
bernoulli-20251106132150_4_diagnostic.csv
bernoulli-20251106132150_4_profile.csv
bernoulli-20251106132150_4_stdout.txt

Not included in this PR, but anticipated is the inclusion of the cmdstan config saved, which produces the following output if included:

bernoulli-20251106132150_1_config.json
bernoulli-20251106132150_1.csv
bernoulli-20251106132150_1_diagnostic.csv
bernoulli-20251106132150_1_profile.csv
bernoulli-20251106132150_1_stdout.txt
bernoulli-20251106132150_2_config.json
bernoulli-20251106132150_2.csv
bernoulli-20251106132150_2_diagnostic.csv
bernoulli-20251106132150_2_profile.csv
bernoulli-20251106132150_2_stdout.txt
bernoulli-20251106132150_3_config.json
bernoulli-20251106132150_3.csv
bernoulli-20251106132150_3_diagnostic.csv
bernoulli-20251106132150_3_profile.csv
bernoulli-20251106132150_3_stdout.txt
bernoulli-20251106132150_4_config.json
bernoulli-20251106132150_4.csv
bernoulli-20251106132150_4_diagnostic.csv
bernoulli-20251106132150_4_profile.csv
bernoulli-20251106132150_4_stdout.txt

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): myself

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@amas0
Copy link
Collaborator Author

amas0 commented Nov 6, 2025

@WardBrian Do you have any insight into this test failure? I don't have a windows machine on hand to look more closely.

@WardBrian
Copy link
Member

I think it might just be sporadic, lets see

Related to the PR, I think this might be wrong when using STAN_THREADs to run multiple chains in one executable. In those cases, the _ID will always be forced to be at the end. For example, if you run

$ ./bernoulli sample num_chains=4 data file=bernoulli.data.json output diagnostic_file=foo.csv
$ ls
bernoulli            bernoulli.data.R  foo_1.csv  foo_3.csv  output_1.csv  output_3.csv
bernoulli.data.json  bernoulli.stan    foo_2.csv  foo_4.csv  output_2.csv  output_4.csv

Runset has to deal with this complexity of abstracting over whether we picked the filenames or cmdstan did, at least until the minimum version is 2.37 and we can use a comma separated list and do whatever we want.

I suspect we may be lacking sufficient tests that set force_one_process_per_chain=False to actually expose these issues

Note that when it is time to set save_cmdstan_config, only one of those files will be generated regardless of num_chains, so that will add yet another special case.

@amas0
Copy link
Collaborator Author

amas0 commented Nov 6, 2025

Hmmmm, okay I see the distinctions in a quick test. Enabling STAN_THREADS and sampling under this current code will generate (with save_cmdstan_config enabled):

bernoulli-20251106145201_1.csv
bernoulli-20251106145201_2.csv
bernoulli-20251106145201_3.csv
bernoulli-20251106145201_4.csv
bernoulli-20251106145201_config.json
bernoulli-20251106145201_diagnostic_1.csv
bernoulli-20251106145201_diagnostic_2.csv
bernoulli-20251106145201_diagnostic_3.csv
bernoulli-20251106145201_diagnostic_4.csv
bernoulli-20251106145201_profile.csv
bernoulli-20251106145201_stdout.txt

So we get on per-chain files (just diagnostic) an appended _{id}.

This is my understanding:

In the case where we run one-process per file, we can get consistency across the output files, because we set the the ID as part of the base file names we give to cmdstan, but if we sample in a single process, cmdstan will generate the per-chain files with an appended _{id}, which seems to only affect the diagnostic file (and the stancsv, but extra bits to the filename makes it the same) since everything else is per-process.

Does that sound right?

I'm not sure if we can enforce absolute consistency between the single vs multi process cases unless we renamed files after they were written by cmdstan? In the one process for all chains case, we could rename diagnostic files to have the id before _diagnostic to make the id always be in the beginning. If we attempted to move the id to the end, then we would have to rename the config.json file in the multiple process case since that seems to inherit naming from the stancsv name.

We could also just accept that when force_one_process_per_chain is true, then the ids occur first, and in the alternative, they occur after (which would only happen when save_latent_dynamics is enabled anyways).

@WardBrian
Copy link
Member

I think it’s more consistent if we always end with _id, and then when we only have one chain we just know we won’t have any _# for those other files

I’m still hopeful one day we can make the multithreaded behavior the default and nuke all this…

@amas0
Copy link
Collaborator Author

amas0 commented Nov 6, 2025

Okay cool -- I've swapped the order. This will consistent throughout all the output files we have right now, regardless of single/multi process. For clarity, this is the output now:

one_process_per_chain = False

bernoulli-20251106180710_1.csv
bernoulli-20251106180710_2.csv
bernoulli-20251106180710_3.csv
bernoulli-20251106180710_4.csv
bernoulli-20251106180710_diagnostic_1.csv
bernoulli-20251106180710_diagnostic_2.csv
bernoulli-20251106180710_diagnostic_3.csv
bernoulli-20251106180710_diagnostic_4.csv
bernoulli-20251106180710_profile.csv
bernoulli-20251106180710_stdout.txt

and one_process_per_chain = True:

bernoulli-20251106180852_1.csv
bernoulli-20251106180852_2.csv
bernoulli-20251106180852_3.csv
bernoulli-20251106180852_4.csv
bernoulli-20251106180852_diagnostic_1.csv
bernoulli-20251106180852_diagnostic_2.csv
bernoulli-20251106180852_diagnostic_3.csv
bernoulli-20251106180852_diagnostic_4.csv
bernoulli-20251106180852_profile_1.csv
bernoulli-20251106180852_profile_2.csv
bernoulli-20251106180852_profile_3.csv
bernoulli-20251106180852_profile_4.csv
bernoulli-20251106180852_stdout_1.txt
bernoulli-20251106180852_stdout_2.txt
bernoulli-20251106180852_stdout_3.txt
bernoulli-20251106180852_stdout_4.txt

I've got to add some tests for these cases, update some docstrings, and probably update some documentation somewhere. Will let you know when this is good to go.

What are the major barriers to making the multithreaded behavior the default? Would certainly clean some of this up..

@WardBrian
Copy link
Member

What are the major barriers to making the multithreaded behavior the default? Would certainly clean some of this up

See the discussion at stan-dev/cmdstan#1179. tl;dr is that our current Windows toolchains suffer from pretty dramatic slowdowns when threading is enabled at compile time. This isn't an inherent Windows issue (e.g., using WSL or clang-cl does not have this issue), so there's some hope that updated or swapped-out toolchains could allow us to do this

@amas0
Copy link
Collaborator Author

amas0 commented Nov 7, 2025

@WardBrian Going through updating docstrings and such. Quite a few places that say:

Output files written to the temporary directory contain an additional
8-character random string, e.g. 'bernoulli-201912081451-1-5nm6as7u.csv'.

I'm assuming this is quite out of date? Blame places these lines at ~5 years old and this isn't something that happens on my machine. Just wanted to double check that this isn't a platform-specific thing or an edge case before I remove it.

@WardBrian
Copy link
Member

Oh yeah, I don't think we've done that in a very long time

@amas0
Copy link
Collaborator Author

amas0 commented Nov 7, 2025

I think this should be good to go @WardBrian

Copy link
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

@amas0 amas0 merged commit b8bcfab into stan-dev:develop Nov 8, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mismatch between -stdout.txt output files and also some other output naming conventions

2 participants