feat: use histogram samples for t-test analysis #155
base: main
Conversation
Justification: Each sample in the histogram represents a durationPerOp value, calculated by dividing the cumulative time of a run by the number of iterations of the function under test executed in that run — that is, the average execution time of a single execution. opsSec and opsSecPerRun are then averages of the samples, which are themselves averages. Therefore, using opsSecPerRun as a t-test sample inaccurately applies the calculation to an average of averages, when it is meant to be applied to a set of averages totalling a minimum of 30 samples, with 40 preferable. In other words, it's the histogram entry that represents a valid t-test sample.
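The "average of averages" distortion can be illustrated with a small sketch (all numbers and names here are hypothetical, not the library's code): collapsing 30 histogram samples into 3 per-suite averages hides most of the spread the t-test is supposed to see.

```javascript
// Hypothetical illustration (not the library's code): 30 durationPerOp
// histogram samples vs. the 3 per-suite averages a repeatSuite=3 run
// would hand to the t-test.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// 30 made-up histogram samples: durationPerOp = cumulative time / iterations.
const histogramSamples = [
  9.8, 10.2, 10.1, 9.9, 10.4, 9.7, 10.0, 10.3, 9.6, 10.5,
  9.9, 10.1, 10.2, 9.8, 10.0, 10.3, 9.7, 10.4, 9.9, 10.1,
  10.0, 9.8, 10.2, 9.9, 10.1, 10.3, 9.7, 10.0, 10.2, 9.8,
];

// Collapsing each "suite" of 10 samples down to its average discards spread:
const suiteAverages = [0, 1, 2].map((i) =>
  mean(histogramSamples.slice(i * 10, (i + 1) * 10))
);

// The variance estimate the t-test would use is far smaller for the
// averages than for the underlying samples.
console.log(sampleVariance(histogramSamples) > sampleVariance(suiteAverages)); // true
```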
…s too small. This will help me sort out inconclusive tests without missing misconfigured ones. This is necessitated by the changes in the previous commit that allow for failure instead of forcing success.
please remove it
```diff
 const suite = new Suite({
-  ttest: true, // Automatically sets repeatSuite=30
+  ttest: true,
+  minSamples: 30, // minSamples x repeatSuite must be > 30
```
why?
```diff
-Enable t-test mode with `ttest: true`. This automatically sets `repeatSuite=30` to collect enough
-independent samples for reliable statistical analysis (per the Central Limit Theorem):
+Enable t-test mode with `ttest: true`. Requires 30 independent samples for reliable statistical analysis (per the
```
This shouldn't be the case. It should be 30 full suites (regardless of samples).
You have asserted several times now that suites are samples and samples are not samples without expanding on why, aside from restating your assertion in essentially the same words.
Why are samples not samples?
I've been over that code and the way in which count and time are calculated is consistent with the notion of 'sample' as I've seen it described in the literature. Where is my error?
By treating a whole suite, measured over several seconds, as a single sample, you're taking an average of an average, which makes the results of the t-test less accurate.
Justification:
Each sample in the histogram represents a durationPerOp value, calculated by dividing the cumulative time of a run by the number of iterations of the function under test executed in that run — that is, the average execution time of a single execution. opsSec and opsSecPerRun are then averages of the samples, which are themselves averages.
Therefore, using opsSecPerRun as a t-test sample inaccurately applies the calculation to an average of averages, when it is meant to be applied to a set of averages totalling a minimum of 30 samples, with 40 preferable.
In other words, it's a histogram entry that represents a valid t-test sample.
When repeatSuite > 1, the additional samples accumulate in the histogram.
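As a sketch of what running the test directly on accumulated histogram samples looks like, here is a minimal Welch's t-statistic over two sample sets (an illustration only, not the project's implementation; the function and variable names are made up):

```javascript
// Minimal Welch's t-test sketch: operates directly on two arrays of
// histogram samples (durationPerOp values), as the justification proposes.
// Illustrative only; not the library's actual implementation.
function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t-statistic and Welch–Satterthwaite degrees of freedom.
function welch(a, b) {
  const va = sampleVariance(a) / a.length;
  const vb = sampleVariance(b) / b.length;
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}

const { t, df } = welch([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]);
console.log(t, df); // t = -1, df = 8
```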
This change also converts the forced override of the input options to a warning if the sample size is too small. Let the users pick whether they want minSamples: 30 or repeatSuite: 3. The code already had support for omitting the significance data if the low bar is not met.
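The warning path described here could look roughly like this (the option names come from the diff; the threshold constant, defaults, and function are assumptions for illustration):

```javascript
// Hypothetical sketch of the described behavior: warn instead of forcing
// repeatSuite=30 when too few samples will be collected for the t-test.
// MIN_TTEST_SAMPLES and the defaults are assumptions, not the library's.
const MIN_TTEST_SAMPLES = 30;

function checkTTestOptions({ ttest = false, minSamples = 10, repeatSuite = 1 } = {}) {
  if (!ttest) return { ok: true, warning: null };
  const total = minSamples * repeatSuite;
  if (total < MIN_TTEST_SAMPLES) {
    return {
      ok: false, // significance data is omitted, options are not overridden
      warning: `t-test needs at least ${MIN_TTEST_SAMPLES} samples but ` +
        `minSamples x repeatSuite = ${total}; increase one of them.`,
    };
  }
  return { ok: true, warning: null };
}

console.log(checkTTestOptions({ ttest: true, minSamples: 30 }).ok); // true
console.log(checkTTestOptions({ ttest: true, minSamples: 10 }).ok); // false
```

Either `minSamples: 30` with a single suite or `minSamples: 10` with `repeatSuite: 3` would pass this check, matching the "let the users pick" intent above.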