Remove duplicate setup code in examples #14
Conversation
no longer necessary, now that input data is generated exactly once

Co-authored-by: Robert Chisholm <Robadob@Robadob.org>
Thank you for your pull request 😃

🤖 This automated message can help you check the rendered files in your submission for clarity. If you have any questions, please feel free to open an issue in {sandpaper}. If you have files that automatically render output (e.g. R Markdown), then you should check for the following:

Rendered Changes
🔍 Inspect the changes: https://github.com/carpentries-incubator/pando-python/compare/md-outputs..md-outputs-PR-14
The following changes were observed in the rendered markdown documents:

What does this mean? If you have source files that require output and figures to be generated (e.g. R Markdown), then it is important to make sure the generated figures and output are reproducible. This output provides a way for you to inspect the output in a diff-friendly manner so that it's easy to see the changes that occur due to new software versions or randomisation.

⏱️ Updated at 2025-10-30 13:19:17 +0000
Robadob left a comment
May want to reduce repeats to speed up the timing runs.
Can't decide whether to include the sort in the timing of binary search or not. (it is a cost, but you'd normally sort once and search many times)
Happy to merge as is, but those are my thoughts.
```python
        ls_out.append(i)
# ...
repeats = 1000
```
```diff
-repeats = 1000
+repeats = 100
```
As noted in the earlier PR, may want to cut this down given how slow uniqueList is, to make it easier for people to try it themselves.
Results from 1k repeats locally after these changes basically align with what's already there. If repeats is dropped, the times will scale down in proportion (see the sketch after the numbers below).
uniqueSet: 0.45ms
uniqueSetAdd: 0.98ms
uniqueList: 641.24ms
uniqueListSort: 3.03ms
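For context, a minimal sketch of how a reduced `repeats` value would feed into the timing call, assuming the benchmark passes it as `timeit`'s `number` argument (which is why the reported totals scale linearly with it); the function and input here are illustrative stand-ins, not the lesson's actual harness:

```python
import random
import timeit

# Hypothetical sizes and names, mirroring the timings above
N, M = 25000, 2
ls_in = [random.randint(0, int(N * M)) for _ in range(N)]

def uniqueSet(ls):
    """Deduplicate by converting to a set (one of the variants timed above)."""
    return set(ls)

repeats = 100  # reduced from 1000 so learners can run it quickly
t = timeit.timeit(lambda: uniqueSet(ls_in), number=repeats)
print(f"uniqueSet: {t * 1000:.2f}ms")  # total time scales linearly with repeats
```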
```python
return st, ls  # Return both set and list

st = set([random.randint(0, int(N*M)) for i in range(N)])
ls = list(st)
ls.sort()  # Sort required for binary search
```
Tbh, the cost of sorting might be worth including in the benchmark to make it "fair".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. However, I experimented with this a bit and it seems to be more subtle than I initially expected:
- If I use `st = set([random.randint(0, int(N*M)) for i in range(N)]); ls = list(st)`, then `ls` starts out as a "mostly sorted" list (e.g. something like `[32771, 32772, 5, 7, 9, 11, 14, 32785, 17, 19, …]`; I'm guessing due to implementation details of the `set` iteration and hashing implementations?). Including the sorting in the benchmark seemed to have little (<10%) impact on `binary_search_list`.
- However, if I switch it around (`ls = [random.randint(0, int(N*M)) for i in range(N)]; st = set(ls)`) so `ls` becomes truly random, then `binary_search_list` becomes about 50% slower.
The latter is probably the more realistic (no one's going to first create a set and then turn it into a list just to search it; they'll just start out with some list that they want to search). On the other hand, by that argument, it might be fairer to include the `st = set(ls)` when benchmarking `search_set`, which would make it slower by a similar 50-ish% …
Overall, since it wouldn’t affect the qualitative comparison either way, I’d err on the side of keeping it as is. (The section on NumPy arrays vs. lists had a brief discussion about the overhead of conversion, so it’s okay to skip over that here.)
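A minimal sketch of the two setup orderings discussed above, with illustrative sizes rather than the lesson's exact values; it times only the sort, which is the part that differs between the two cases:

```python
import random
import timeit

# Illustrative sizes; the lesson's values may differ
N, M = 25000, 2

# Variant 1: build the set first, then the list -- iterating the set of
# small ints yields a "mostly sorted" ls, so Timsort finishes it cheaply
st = set([random.randint(0, int(N * M)) for i in range(N)])
ls_mostly_sorted = list(st)

# Variant 2: build the list directly (the set would be created from it
# afterwards) -- ls keeps its random order, so sorting it costs more
ls_random = [random.randint(0, int(N * M)) for i in range(N)]

# This is why including the sort barely moves binary_search_list in the
# first variant but slows it down noticeably in the second
print("sort mostly sorted:", timeit.timeit(lambda: sorted(ls_mostly_sorted), number=100))
print("sort random:       ", timeit.timeit(lambda: sorted(ls_random), number=100))
```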
Bonus fun fact: With `ls = […]; st = set(ls)` I've found the `search_set` times vary between 0.55 and 0.69 ms for different random inputs. (But reproducible within ±0.02 ms for a fixed input.) However, for the "slow" inputs, running `st = set(set(ls))` instead would pretty reliably drop the time down to 0.57±0.02 ms. 🤯
I'm guessing the outer `set()` call is technically not a no-op in that case, because the order in which elements get added to the new set is different from the one in `st = set(ls)`? But I really can't afford to get nerd-sniped by that and study the implementation of sets right now …
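A hedged sketch for reproducing that observation; `search_set` here is a stand-in membership-counting loop rather than the lesson's function, and whether the gap actually shows up will depend on the particular random input, as noted above:

```python
import random
import timeit

N, M = 25000, 2
random.seed(12345)  # fix the input so each run is reproducible for that input
ls = [random.randint(0, int(N * M)) for _ in range(N)]

st_once = set(ls)        # elements inserted in the list's (random) order
st_twice = set(set(ls))  # elements re-inserted in the first set's iteration order

def search_set(st):
    # stand-in search: count how many probe values are present in the set
    return sum(1 for i in range(0, int(N * M)) if i in st)

for name, st in [("set(ls)", st_once), ("set(set(ls))", st_twice)]:
    t = timeit.timeit(lambda: search_set(st), number=100)
    print(f"{name}: {t * 1000 / 100:.2f}ms per call")
```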
Fair to assume that some inputs require more data reallocation, or produce less balanced hash maps (e.g. worse-performing lookups). Interesting to know it's visible in practice, though.
```python
        j += 1
# ...
repeats = 1000
```
```diff
-repeats = 1000
+repeats = 10
```
As above, maybe want to reduce this so it's quicker for people to run locally. This one is much slower.
It would be nice to use timeit's auto-ranging functionality, which just ensures the run takes longer than 0.2 s, but it's sadly a lot more verbose (see the sketch after the timings below).
My times for 1000×: curiously, linear search is only about 1.3× what's already there, whereas the other two are about 2×, but eh.
search_set: 1.03ms
linear_search_list: 2141.45ms
binary_search_list: 6.96ms
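For reference, a minimal sketch of that auto-ranging approach using `timeit.Timer.autorange()`; the benchmark body below is a stand-in, not the lesson's `linear_search_list`:

```python
import random
import timeit

# Stand-in benchmark body; the lesson's linear_search_list differs
N = 25000
ls = [random.randint(0, N * 2) for _ in range(N)]

def bench():
    return sum(1 for i in range(0, N * 2, 100) if i in ls)

# autorange() keeps increasing the loop count until the total run takes
# at least 0.2 s, then returns (number_of_loops, total_seconds)
number, total = timeit.Timer(bench).autorange()
print(f"linear search stand-in: {total / number * 1000:.2f}ms per call ({number} loops)")
```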
I’ll open an issue to check run times for all code samples; I vaguely remember there being another slow one 🤔
```python
    if i in ls:
        ct += 1
# ...
repeats = 1000
```
```diff
-repeats = 1000
+repeats = 100
```
My 1000× times: manual search is about 1.3× what's already there, the operator version much the same (a sketch of the comparison follows the numbers).
manualSearch: 200.51ms
operatorSearch: 27.39ms
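For context, a hedged sketch of the kind of comparison these names suggest, a hand-written loop versus the built-in `in` operator; the lesson's actual `manualSearch`/`operatorSearch` may differ in detail:

```python
import random
import timeit

# Illustrative size, kept small so the manual loop stays quick to run
N = 1000
ls = [random.randint(0, N * 2) for _ in range(N)]
targets = [random.randint(0, N * 2) for _ in range(N)]

def manualSearch():
    ct = 0
    for t in targets:
        for item in ls:       # element-by-element loop in Python bytecode
            if item == t:
                ct += 1
                break
    return ct

def operatorSearch():
    ct = 0
    for t in targets:
        if t in ls:           # same linear scan, but executed in C
            ct += 1
    return ct

repeats = 10
print("manualSearch:  ", timeit.timeit(manualSearch, number=repeats))
print("operatorSearch:", timeit.timeit(operatorSearch, number=repeats))
```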
Auto-generated via `{sandpaper}`
Source : c48867e
Branch : md-outputs
Author : GitHub Actions <actions@github.com>
Time : 2025-11-02 01:18:15 +0000
Message : markdown source builds
Auto-generated via `{sandpaper}`
Source : 1134124
Branch : main
Author : Jost Migenda <jost.migenda@kcl.ac.uk>
Time : 2025-11-02 01:17:13 +0000
Message : Merge pull request #14 from JostMigenda/examples_setup
Remove duplicate setup code in examples
Auto-generated via `{sandpaper}`
Source : 9ce30eb
Branch : md-outputs
Author : GitHub Actions <actions@github.com>
Time : 2025-11-05 09:15:29 +0000
Message : markdown source builds
See previous discussion and review of the first two commits in RSE-Sheffield#89. The two commits added today resolve both open questions discussed there (the discrepancy in relative timing for the list/set search comparison, and the bytecode in the appendix).
(The default output format of the `dis` module has changed a bit in Python 3.13; that's why the last commit looks a bit messy. I think it's nicer though, with named jump labels instead of instruction offsets.)
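For anyone checking that appendix locally, the bytecode comes from the standard `dis` module; a minimal placeholder example (the lesson disassembles its own snippets, not this function):

```python
import dis

def example(ls):
    # placeholder with a loop and a branch so jump targets appear
    ct = 0
    for i in ls:
        if i == 42:
            ct += 1
    return ct

# On Python 3.13 the default disassembly shows named jump labels (L1, L2, ...)
# rather than raw instruction offsets as jump targets.
dis.dis(example)
```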