Skip to content

Conversation

@JostMigenda
Copy link
Collaborator

See previous discussion and review of the first two commits in RSE-Sheffield#89. The two commits added today resolve both open questions discussed there (discrepancy in relative timing for the list/set search comparison and bytecode in appendix).

(The default output format of the dis module has changed a bit in 3.13; that’s why the last commit looks a bit messy. I think it’s nicer though, with named jump labels instead of instruction offsets.)

JostMigenda and others added 4 commits July 7, 2025 11:47
no longer necessary, now that input data is generated exactly once

Co-authored-by: Robert Chisholm <Robadob@Robadob.org>
@JostMigenda JostMigenda requested a review from Robadob October 30, 2025 13:18
@github-actions
Copy link

github-actions bot commented Oct 30, 2025

Thank you!

Thank you for your pull request 😃

🤖 This automated message can help you check the rendered files in your submission for clarity. If you have any questions, please feel free to open an issue in {sandpaper}.

If you have files that automatically render output (e.g. R Markdown), then you should check for the following:

  • 🎯 correct output
  • 🖼️ correct figures
  • ❓ new warnings
  • ‼️ new errors

Rendered Changes

🔍 Inspect the changes: https://github.com/carpentries-incubator/pando-python/compare/md-outputs..md-outputs-PR-14

The following changes were observed in the rendered markdown documents:

 md5sum.txt                                 |   6 +-
 optimisation-data-structures-algorithms.md |  62 ++++------
 optimisation-using-python.md               |  13 +-
 technical-appendix.md                      | 183 +++++++++++++++--------------
 4 files changed, 124 insertions(+), 140 deletions(-)
What does this mean?

If you have source files that require output and figures to be generated (e.g. R Markdown), then it is important to make sure the generated figures and output are reproducible.

This output provides a way for you to inspect the output in a diff-friendly manner so that it's easy to see the changes that occur due to new software versions or randomisation.

⏱️ Updated at 2025-10-30 13:19:17 +0000

github-actions bot pushed a commit that referenced this pull request Oct 30, 2025
Copy link
Collaborator

@Robadob Robadob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to reduce repeats to speed up the timing runs.

Can't decide whether to include the sort in the timing of binary search or not. (it is a cost, but you'd normally sort once and search many times)

Happy to merge as is, but those are my thoughts.

ls_out.append(i)

repeats = 1000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
repeats = 1000
repeats = 100

As noted in the earlier PR, may want to cut this down given how slow uniqueList is, making it easier for people to try it themselves.

Results from 1k repeats locally after these changes, they basically align with what's already there. If repeats is dropped the time will divide too.

uniqueSet: 0.45ms
uniqueSetAdd: 0.98ms
uniqueList: 641.24ms
uniqueListSort: 3.03ms

return st, ls # Return both set and list
st = set([random.randint(0, int(N*M)) for i in range(N)])
ls = list(st)
ls.sort() # Sort required for binary search
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh, the cost of sorting might be worth including in the benchmark to make it "fair".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. However, I experimented with this a bit and it seems to be more subtle than I initially expected:

  • If I use st = set([random.randint(0, int(N*M)) for i in range(N)]); ls = list(st) then ls starts out as a “mostly sorted” list (e.g. something like [32771, 32772, 5, 7, 9, 11, 14, 32785, 17, 19, …]; I’m guessing due to implementation details of the set iteration and hashing implementations?) Including the sorting in the benchmark seemed to have little (<10%) impact on binary_search_list.
  • However, if I switch it around (ls = [random.randint(0, int(N*M)) for i in range(N)]; st = set(ls)) so ls becomes truly random, then binary_search_list becomes about 50% slower.

The latter is probably the more realistic (no-one’s gonna first create a set and then transform it to a list to search it; they’ll just start out having some list that they want to search). On the other hand, by that argument, it might be fairer to include the st = set(ls) when benchmarking search_set, which would make it slower by a similar 50-ish % …

Overall, since it wouldn’t affect the qualitative comparison either way, I’d err on the side of keeping it as is. (The section on NumPy arrays vs. lists had a brief discussion about the overhead of conversion, so it’s okay to skip over that here.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonus fun fact: With ls = […]; st = set(ls) I’ve found the search_set times vary between 0.55 and 0.69 ms for different random inputs. (But reproducible within ±0.02 ms for a fixed input.) However, for the “slow” inputs, running st = set(set(ls)) instead would pretty reliably drop the time down to 0.57±0.02 ms. 🤯

I’m guessing the outer set() call is technically not a no-op in that case, because the order in which elements get added to the new set is different from the one of st = set(ls)? But I really can’t afford to get nerdsniped by that and study the implementation of sets right now …

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair to assume that some inputs require more data-reallocation, or less balanced hashmaps (e.g worse performing lookups). Though interesting to know it's visible in practice.

j += 1


repeats = 1000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
repeats = 1000
repeats = 10

As above, maybe want to reduce this so it's quicker for people to run locally. This one is much slower.

It would be nice to use timeit's auto functionality which just ensures longer than 0.2s, but its a lot more verbose sadly.

My time for 1000x, curiously linear search is only 1.3x whereas the other two are 2x, but eh.

search_set: 1.03ms
linear_search_list: 2141.45ms
binary_search_list: 6.96ms

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll open an issue to check run times for all code samples; I vaguely remember there being another slow one 🤔

if i in ls:
ct += 1

repeats = 1000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
repeats = 1000
repeats = 100

My 1000x time, manual search 1.3x operator much the same.

manualSearch: 200.51ms
operatorSearch: 27.39ms

@JostMigenda JostMigenda merged commit 1134124 into carpentries-incubator:main Nov 2, 2025
3 checks passed
github-actions bot pushed a commit that referenced this pull request Nov 2, 2025
Auto-generated via `{sandpaper}`
Source  : 1134124
Branch  : main
Author  : Jost Migenda <jost.migenda@kcl.ac.uk>
Time    : 2025-11-02 01:17:13 +0000
Message : Merge pull request #14 from JostMigenda/examples_setup

Remove duplicate setup code in examples
github-actions bot pushed a commit that referenced this pull request Nov 2, 2025
Auto-generated via `{sandpaper}`
Source  : c48867e
Branch  : md-outputs
Author  : GitHub Actions <actions@github.com>
Time    : 2025-11-02 01:18:15 +0000
Message : markdown source builds

Auto-generated via `{sandpaper}`
Source  : 1134124
Branch  : main
Author  : Jost Migenda <jost.migenda@kcl.ac.uk>
Time    : 2025-11-02 01:17:13 +0000
Message : Merge pull request #14 from JostMigenda/examples_setup

Remove duplicate setup code in examples
@JostMigenda JostMigenda deleted the examples_setup branch November 3, 2025 07:37
github-actions bot pushed a commit that referenced this pull request Nov 4, 2025
Auto-generated via `{sandpaper}`
Source  : c48867e
Branch  : md-outputs
Author  : GitHub Actions <actions@github.com>
Time    : 2025-11-02 01:18:15 +0000
Message : markdown source builds

Auto-generated via `{sandpaper}`
Source  : 1134124
Branch  : main
Author  : Jost Migenda <jost.migenda@kcl.ac.uk>
Time    : 2025-11-02 01:17:13 +0000
Message : Merge pull request #14 from JostMigenda/examples_setup

Remove duplicate setup code in examples
github-actions bot pushed a commit to JostMigenda/pando-python that referenced this pull request Nov 5, 2025
Auto-generated via `{sandpaper}`
Source  : 1134124
Branch  : main
Author  : Jost Migenda <jost.migenda@kcl.ac.uk>
Time    : 2025-11-02 01:17:13 +0000
Message : Merge pull request carpentries-incubator#14 from JostMigenda/examples_setup

Remove duplicate setup code in examples
github-actions bot pushed a commit to JostMigenda/pando-python that referenced this pull request Nov 5, 2025
Auto-generated via `{sandpaper}`
Source  : 9ce30eb
Branch  : md-outputs
Author  : GitHub Actions <actions@github.com>
Time    : 2025-11-05 09:15:29 +0000
Message : markdown source builds

Auto-generated via `{sandpaper}`
Source  : 1134124
Branch  : main
Author  : Jost Migenda <jost.migenda@kcl.ac.uk>
Time    : 2025-11-02 01:17:13 +0000
Message : Merge pull request carpentries-incubator#14 from JostMigenda/examples_setup

Remove duplicate setup code in examples
github-actions bot pushed a commit that referenced this pull request Nov 11, 2025
Auto-generated via `{sandpaper}`
Source  : c48867e
Branch  : md-outputs
Author  : GitHub Actions <actions@github.com>
Time    : 2025-11-02 01:18:15 +0000
Message : markdown source builds

Auto-generated via `{sandpaper}`
Source  : 1134124
Branch  : main
Author  : Jost Migenda <jost.migenda@kcl.ac.uk>
Time    : 2025-11-02 01:17:13 +0000
Message : Merge pull request #14 from JostMigenda/examples_setup

Remove duplicate setup code in examples
github-actions bot pushed a commit that referenced this pull request Nov 18, 2025
Auto-generated via `{sandpaper}`
Source  : c48867e
Branch  : md-outputs
Author  : GitHub Actions <actions@github.com>
Time    : 2025-11-02 01:18:15 +0000
Message : markdown source builds

Auto-generated via `{sandpaper}`
Source  : 1134124
Branch  : main
Author  : Jost Migenda <jost.migenda@kcl.ac.uk>
Time    : 2025-11-02 01:17:13 +0000
Message : Merge pull request #14 from JostMigenda/examples_setup

Remove duplicate setup code in examples
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants