From 8696b442adfbf946d1b742f245b73ef36790d742 Mon Sep 17 00:00:00 2001
From: Robert Chisholm
Date: Sun, 30 Mar 2025 19:39:46 +0100
Subject: [PATCH 01/16] work in progress

---
 config.yaml                    |  1 +
 learners/technical-appendix.md | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)
 create mode 100644 learners/technical-appendix.md

diff --git a/config.yaml b/config.yaml
index 536472a0..8e11e735 100644
--- a/config.yaml
+++ b/config.yaml
@@ -77,6 +77,7 @@ episodes:
 learners:
 - setup.md
 - registration.md
+- technical-appendix.md
 - acknowledgements.md
 - ppp.md
 - reference.md
diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md
new file mode 100644
index 00000000..c8fc82d7
--- /dev/null
+++ b/learners/technical-appendix.md
@@ -0,0 +1,17 @@
+---
+title: Technical Appendix
+---
+
+The topics covered here exceed the level of knowledge required to benefit from the course; however, they provide a more technical explanation of some of the concepts that can help you understand your code's performance.
+
+**Contents**
+
+- []()
+- []()
+- []()
+
+## 
+
+## 
+
+## 
\ No newline at end of file

From abf3532d8efc54fd7750614785f9bb4cf86951e6 Mon Sep 17 00:00:00 2001
From: Robert Chisholm
Date: Sun, 30 Mar 2025 19:50:10 +0100
Subject: [PATCH 02/16] dis

Doesn't feel worth adding a cross-reference for this section.
---
 episodes/optimisation-using-python.md | 128 --------------------------
 learners/technical-appendix.md        | 125 ++++++++++++++++++++++++-
 2 files changed, 123 insertions(+), 130 deletions(-)

diff --git a/episodes/optimisation-using-python.md b/episodes/optimisation-using-python.md
index a7f28b11..b7888895 100644
--- a/episodes/optimisation-using-python.md
+++ b/episodes/optimisation-using-python.md
@@ -150,134 +150,6 @@ operatorSearch: 28.43ms
An easy approach to follow is that if two blocks of code do the same operation, the one that contains less Python is probably faster.
This won't apply if you're using 3rd party packages written purely in Python though.

-::::::::::::::::::::::::::::::::::::: callout

### Python bytecode


You can use `dis` to view the bytecode generated by Python, the amount of byte-code more strongly correlates with how much code is being executed by the Python interpreter. However, this still does not account for whether functions called are implemented using Python or C.

The pure Python search compiles to 82 lines of byte-code.
- -```python -import dis - -def manualSearch(): - ls = generateInputs() - ct = 0 - for i in range(0, int(N*M), M): - for j in range(0, len(ls)): - if ls[j] == i: - ct += 1 - break - -dis.dis(manualSearch) -``` -```output - 11 0 LOAD_GLOBAL 0 (generateInputs) - 2 CALL_FUNCTION 0 - 4 STORE_FAST 0 (ls) - - 12 6 LOAD_CONST 1 (0) - 8 STORE_FAST 1 (ct) - - 13 10 LOAD_GLOBAL 1 (range) - 12 LOAD_CONST 1 (0) - 14 LOAD_GLOBAL 2 (int) - 16 LOAD_GLOBAL 3 (N) - 18 LOAD_GLOBAL 4 (M) - 20 BINARY_MULTIPLY - 22 CALL_FUNCTION 1 - 24 LOAD_GLOBAL 4 (M) - 26 CALL_FUNCTION 3 - 28 GET_ITER - >> 30 FOR_ITER 24 (to 80) - 32 STORE_FAST 2 (i) - - 14 34 LOAD_GLOBAL 1 (range) - 36 LOAD_CONST 1 (0) - 38 LOAD_GLOBAL 5 (len) - 40 LOAD_FAST 0 (ls) - 42 CALL_FUNCTION 1 - 44 CALL_FUNCTION 2 - 46 GET_ITER - >> 48 FOR_ITER 14 (to 78) - 50 STORE_FAST 3 (j) - - 15 52 LOAD_FAST 0 (ls) - 54 LOAD_FAST 3 (j) - 56 BINARY_SUBSCR - 58 LOAD_FAST 2 (i) - 60 COMPARE_OP 2 (==) - 62 POP_JUMP_IF_FALSE 38 (to 76) - - 16 64 LOAD_FAST 1 (ct) - 66 LOAD_CONST 2 (1) - 68 INPLACE_ADD - 70 STORE_FAST 1 (ct) - - 17 72 POP_TOP - 74 JUMP_FORWARD 1 (to 78) - - 15 >> 76 JUMP_ABSOLUTE 24 (to 48) - >> 78 JUMP_ABSOLUTE 15 (to 30) - - 13 >> 80 LOAD_CONST 0 (None) - 82 RETURN_VALUE -``` - -Whereas the `in` variant only compiles to 54. - -```python -import dis - -def operatorSearch(): - ls = generateInputs() - ct = 0 - for i in range(0, int(N*M), M): - if i in ls: - ct += 1 - -dis.dis(operatorSearch) -``` -```output - 4 0 LOAD_GLOBAL 0 (generateInputs) - 2 CALL_FUNCTION 0 - 4 STORE_FAST 0 (ls) - - 5 6 LOAD_CONST 1 (0) - 8 STORE_FAST 1 (ct) - - 6 10 LOAD_GLOBAL 1 (range) - 12 LOAD_CONST 1 (0) - 14 LOAD_GLOBAL 2 (int) - 16 LOAD_GLOBAL 3 (N) - 18 LOAD_GLOBAL 4 (M) - 20 BINARY_MULTIPLY - 22 CALL_FUNCTION 1 - 24 LOAD_GLOBAL 4 (M) - 26 CALL_FUNCTION 3 - 28 GET_ITER - >> 30 FOR_ITER 10 (to 52) - 32 STORE_FAST 2 (i) - - 7 34 LOAD_FAST 2 (i) - 36 LOAD_FAST 0 (ls) - 38 CONTAINS_OP 0 - 40 POP_JUMP_IF_FALSE 25 (to 50) - - 8 42 LOAD_FAST 1 (ct) - 44 LOAD_CONST 2 (1) - 46 INPLACE_ADD - 48 STORE_FAST 1 (ct) - >> 50 JUMP_ABSOLUTE 15 (to 30) - - 6 >> 52 LOAD_CONST 0 (None) - 54 RETURN_VALUE -``` - -::::::::::::::::::::::::::::::::::::::::::::: - ## Example: Parsing data from a text file diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md index c8fc82d7..9da78374 100644 --- a/learners/technical-appendix.md +++ b/learners/technical-appendix.md @@ -6,11 +6,132 @@ The topics covered here exceed the level of knowledge required to benefit from t **Contents** -- []() +- [Viewing Python's ByteCode](#viewing-pythons-bytecode) - []() - []() -## +## Viewing Python's ByteCode + +You can use `dis` to view the bytecode generated by Python, the amount of bytecode more strongly correlates with how much code is being executed by the Python interpreter and hence how long it may take to execute. However, this is a crude proxy as it does not account for whether functions that are called and whether those functions are implemented using Python or C. + +The pure Python search compiles to 82 lines of byte-code. 
+ +```python +import dis + +def manualSearch(): + ls = generateInputs() + ct = 0 + for i in range(0, int(N*M), M): + for j in range(0, len(ls)): + if ls[j] == i: + ct += 1 + break + +dis.dis(manualSearch) +``` +```output + 11 0 LOAD_GLOBAL 0 (generateInputs) + 2 CALL_FUNCTION 0 + 4 STORE_FAST 0 (ls) + + 12 6 LOAD_CONST 1 (0) + 8 STORE_FAST 1 (ct) + + 13 10 LOAD_GLOBAL 1 (range) + 12 LOAD_CONST 1 (0) + 14 LOAD_GLOBAL 2 (int) + 16 LOAD_GLOBAL 3 (N) + 18 LOAD_GLOBAL 4 (M) + 20 BINARY_MULTIPLY + 22 CALL_FUNCTION 1 + 24 LOAD_GLOBAL 4 (M) + 26 CALL_FUNCTION 3 + 28 GET_ITER + >> 30 FOR_ITER 24 (to 80) + 32 STORE_FAST 2 (i) + + 14 34 LOAD_GLOBAL 1 (range) + 36 LOAD_CONST 1 (0) + 38 LOAD_GLOBAL 5 (len) + 40 LOAD_FAST 0 (ls) + 42 CALL_FUNCTION 1 + 44 CALL_FUNCTION 2 + 46 GET_ITER + >> 48 FOR_ITER 14 (to 78) + 50 STORE_FAST 3 (j) + + 15 52 LOAD_FAST 0 (ls) + 54 LOAD_FAST 3 (j) + 56 BINARY_SUBSCR + 58 LOAD_FAST 2 (i) + 60 COMPARE_OP 2 (==) + 62 POP_JUMP_IF_FALSE 38 (to 76) + + 16 64 LOAD_FAST 1 (ct) + 66 LOAD_CONST 2 (1) + 68 INPLACE_ADD + 70 STORE_FAST 1 (ct) + + 17 72 POP_TOP + 74 JUMP_FORWARD 1 (to 78) + + 15 >> 76 JUMP_ABSOLUTE 24 (to 48) + >> 78 JUMP_ABSOLUTE 15 (to 30) + + 13 >> 80 LOAD_CONST 0 (None) + 82 RETURN_VALUE +``` + +Whereas the `in` variant only compiles to 54. + +```python +import dis + +def operatorSearch(): + ls = generateInputs() + ct = 0 + for i in range(0, int(N*M), M): + if i in ls: + ct += 1 + +dis.dis(operatorSearch) +``` +```output + 4 0 LOAD_GLOBAL 0 (generateInputs) + 2 CALL_FUNCTION 0 + 4 STORE_FAST 0 (ls) + + 5 6 LOAD_CONST 1 (0) + 8 STORE_FAST 1 (ct) + + 6 10 LOAD_GLOBAL 1 (range) + 12 LOAD_CONST 1 (0) + 14 LOAD_GLOBAL 2 (int) + 16 LOAD_GLOBAL 3 (N) + 18 LOAD_GLOBAL 4 (M) + 20 BINARY_MULTIPLY + 22 CALL_FUNCTION 1 + 24 LOAD_GLOBAL 4 (M) + 26 CALL_FUNCTION 3 + 28 GET_ITER + >> 30 FOR_ITER 10 (to 52) + 32 STORE_FAST 2 (i) + + 7 34 LOAD_FAST 2 (i) + 36 LOAD_FAST 0 (ls) + 38 CONTAINS_OP 0 + 40 POP_JUMP_IF_FALSE 25 (to 50) + + 8 42 LOAD_FAST 1 (ct) + 44 LOAD_CONST 2 (1) + 46 INPLACE_ADD + 48 STORE_FAST 1 (ct) + >> 50 JUMP_ABSOLUTE 15 (to 30) + + 6 >> 52 LOAD_CONST 0 (None) + 54 RETURN_VALUE +``` ## From 5ceb38cf7eaaecb524f266668bc520179a0a290b Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sun, 30 Mar 2025 20:04:01 +0100 Subject: [PATCH 03/16] Understanding memory --- config.yaml | 2 +- episodes/optimisation-conclusion.md | 5 +- ...tion-memory.md => optimisation-latency.md} | 75 +++--------------- .../fig/annotated-motherboard.jpg | Bin {episodes => learners}/fig/hardware.ai | 0 {episodes => learners}/fig/hardware.png | Bin learners/technical-appendix.md | 50 +++++++++++- 7 files changed, 63 insertions(+), 69 deletions(-) rename episodes/{optimisation-memory.md => optimisation-latency.md} (69%) rename {episodes => learners}/fig/annotated-motherboard.jpg (100%) rename {episodes => learners}/fig/hardware.ai (100%) rename {episodes => learners}/fig/hardware.png (100%) diff --git a/config.yaml b/config.yaml index 8e11e735..e34cd0e1 100644 --- a/config.yaml +++ b/config.yaml @@ -70,7 +70,7 @@ episodes: - long-break1.md - optimisation-numpy.md - optimisation-use-latest.md -- optimisation-memory.md +- optimisation-latency.md - optimisation-conclusion.md # Information for Learners diff --git a/episodes/optimisation-conclusion.md b/episodes/optimisation-conclusion.md index 042a32f7..a194f886 100644 --- a/episodes/optimisation-conclusion.md +++ b/episodes/optimisation-conclusion.md @@ -54,10 +54,9 @@ Your feedback enables us to improve the course for future attendees! 
- Where feasible, the latest version of Python and packages should be used as they can include significant free improvements to the performance of your code.
- There is a risk that updating Python or packages will not be possible due to version incompatibilities or will require breaking changes to your code.
- Changes to packages may impact results output by your code; ensure you have a method of validation ready prior to attempting upgrades.
-- How the Computer Hardware Affects Performance
-  - Sequential accesses to memory (RAM or disk) will be faster than random or scattered accesses.
-    - This is not always natively possible in Python without the use of packages such as NumPy and Pandas
+- How Latency Affects Performance
- One large file is preferable to many small files.
+- Network requests can be parallelised to reduce the impact of fixed overheads.
- Memory allocation is not free, avoiding destroying and recreating objects can improve performance.

::::::::::::::::::::::::::::::::::::::::::::::::

diff --git a/episodes/optimisation-memory.md b/episodes/optimisation-latency.md
similarity index 69%
rename from episodes/optimisation-memory.md
rename to episodes/optimisation-latency.md
index 2650d9c8..41dee2ad 100644
--- a/episodes/optimisation-memory.md
+++ b/episodes/optimisation-latency.md
@@ -1,83 +1,33 @@
---
-title: "Understanding Memory"
+title: "Understanding Latency"
teaching: 30
exercises: 0
---

:::::::::::::::::::::::::::::::::::::: questions

-- How does a CPU look for a variable it requires?
-- What impact do cache lines have on memory accesses?
- Why is it faster to read/write a single 100 MB file, than 100 files of 1 MB each?
+- How many orders of magnitude slower are disk accesses than RAM?
+- What's the cost of creating a list?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

-- Able to explain, at a high-level, how memory accesses occur during computation and how this impacts optimisation considerations.
- Able to identify the relationship between different latencies relevant to software.
+- Demonstrate how to implement parallel network requests.
+- Justify the re-use of existing variables over creating new ones.

::::::::::::::::::::::::::::::::::::::::::::::::


## Accessing Variables

The storage and movement of data plays a large role in the performance of executing software.


Modern computer's typically have a single processor (CPU), within this processor there are multiple processing cores each capable of executing different code in parallel.

Data held in memory by running software is exists in RAM, this memory is faster to access than hard drives (and solid-state drives).
But the CPU has much smaller caches on-board, to make accessing the most recent variables even faster.

![An annotated photo of a computer's hardware.](episodes/fig/annotated-motherboard.jpg){alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and harddrive are labelled."}


When reading a variable, to perform an operation with it, the CPU will first look in its registers. These exist per core, they are the location that computation is actually performed. Accessing them is incredibly fast, but there only exists enough storage for around 32 variables (typical number, e.g. 4 bytes).
As the register file is so small, most variables won't be found and the CPU's caches will be searched.
-It will first check the current processing core's L1 (Level 1) cache, this small cache (typically 64 KB per physical core) is the smallest and fastest to access cache on a CPU. -If the variable is not found in the L1 cache, the L2 cache that is shared between multiple cores will be checked. This shared cache, is slower to access but larger than L1 (typically 1-3MB per core). -This process then repeats for the L3 cache which may be shared among all cores of the CPU. This cache again has higher latency to access, but increased size (typically slightly larger than the total L2 cache size). -If the variable has not been found in any of the CPU's cache, the CPU will look to the computer's RAM. This is an order of magnitude slower to access, with several orders of magnitude greater capacity (tens to hundreds of GB are now standard). - -Correspondingly, the earlier the CPU finds the variable the faster it will be to access. -However, to fully understand the cache's it's necessary to explain what happens once a variable has been found. - -If a variable is not found in the caches, it must be fetched from RAM. -The full 64 byte cache line containing the variable, will be copied first into the CPU's L3, then L2 and then L1. -Most variables are only 4 or 8 bytes, so many neighbouring variables are also pulled into the caches. -Similarly, adding new data to a cache evicts old data. -This means that reading 16 integers contiguously stored in memory, should be faster than 16 scattered integers - -Therefore, to **optimally** access variables they should be stored contiguously in memory with related data and worked on whilst they remain in caches. -If you add to a variable, perform large amount of unrelated processing, then add to the variable again it will likely have been evicted from caches and need to be reloaded from slower RAM again. - - -It's not necessary to remember this full detail of how memory access work within a computer, but the context perhaps helps understand why memory locality is important. - -![An abstract diagram showing the path data takes from disk or RAM to be used for computation.](episodes/fig/hardware.png){alt='An abstract representation of a CPU, RAM and Disk, showing their internal caches and the pathways data can pass.'} - -::::::::::::::::::::::::::::::::::::: callout - -Python as a programming language, does not give you enough control to carefully pack your variables in this manner (every variable is an object, so it's stored as a pointer that redirects to the actual data stored elsewhere). - -However all is not lost, packages such as `numpy` and `pandas` implemented in C/C++ enable Python users to take advantage of efficient memory accesses (when they are used correctly). - -::::::::::::::::::::::::::::::::::::::::::::: - - ## Accessing Disk -When accessing data on disk (or network), a very similar process is performed to that between CPU and RAM when accessing variables. +When reading data from a file, it is first transferred from the disk, to the disk cache, to the RAM (the computer's main memory, where variables are stored). +The latency to access files on disk is another order of magnitude higher than accessing normal variables. -When reading data from a file, it transferred from the disk, to the disk cache, to the RAM. -The latency to access files on disk is another order of magnitude higher than accessing RAM. - -As such, disk accesses similarly benefit from sequential accesses and reading larger blocks together rather than single variables. 
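As a rough sketch of the difference (this is not one of the lesson's benchmarks, and `example.bin` is a placeholder for any reasonably large file that already exists), the snippet below times a single whole-file read against requesting the same data one byte at a time. The many tiny reads would typically be noticeably slower, because a fixed overhead is paid on every call.

```python
import time

# One large sequential read of the whole file.
with open("example.bin", "rb") as f:
    start = time.perf_counter()
    data = f.read()
    print(f"single read:  {time.perf_counter() - start:.4f}s")

# The same bytes requested one at a time.
with open("example.bin", "rb") as f:
    start = time.perf_counter()
    while f.read(1):
        pass
    print(f"byte-by-byte: {time.perf_counter() - start:.4f}s")
```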
+As such, disk accesses benefit from sequential accesses and reading larger blocks together rather than single variables. Python's `io` package is already buffered, so automatically handles this for you in the background. However before a file can be read, the file system on the disk must be polled to transform the file path to its address on disk to initiate the transfer (or throw an exception). @@ -158,7 +108,7 @@ An even greater overhead would apply. ## Accessing the Network -When transfering files over a network, similar effects apply. There is a fixed overhead for every file transfer (no matter how big the file), so downloading many small files will be slower than downloading a single large file of the same total size. +When transferring files over a network, similar effects apply. There is a fixed overhead for every file transfer (no matter how big the file), so downloading many small files will be slower than downloading a single large file of the same total size. Because of this overhead, downloading many small files often does not use all the available bandwidth. It may be possible to speed things up by parallelising downloads. @@ -227,7 +177,9 @@ Latency can have a big impact on the speed that a program executes, the below gr ![A graph demonstrating the wide variety of latencies a programmer may experience when accessing data.](episodes/fig/latency.png){alt="A horizontal bar chart displaying the relative latencies for L1/L2/L3 cache, RAM, SSD, HDD and a packet being sent from London to California and back. These latencies range from 1 nanosecond to 140 milliseconds and are displayed with a log scale."} -The lower the latency typically the higher the effective bandwidth (L1 and L2 cache have 1 TB/s, RAM 100 GB/s, SSDs up to 32 GB/s, HDDs up to 150 MB/s), making large memory transactions even slower. +L1/L2/L3 caches are where your most recently accessed variables are stored inside the CPU, whereas RAM is where most of your variables will be found. + +The lower the latency typically the higher the effective bandwidth (L1 and L2 cache have 1 TB/s, RAM 100 GB/s, SSDs up to 32 GB/s, HDDs up to 150 MB/s), making large memory transactions even slower. ## Memory Allocation is not Free @@ -335,9 +287,8 @@ Line # Hits Time Per Hit % Time Line Contents ::::::::::::::::::::::::::::::::::::: keypoints -- Sequential accesses to memory (RAM or disk) will be faster than random or scattered accesses. - - This is not always natively possible in Python without the use of packages such as NumPy and Pandas - One large file is preferable to many small files. +- Network requests can be parallelised to reduce the impact of fixed overheads. - Memory allocation is not free, avoiding destroying and recreating objects can improve performance. 
:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/episodes/fig/annotated-motherboard.jpg b/learners/fig/annotated-motherboard.jpg similarity index 100% rename from episodes/fig/annotated-motherboard.jpg rename to learners/fig/annotated-motherboard.jpg diff --git a/episodes/fig/hardware.ai b/learners/fig/hardware.ai similarity index 100% rename from episodes/fig/hardware.ai rename to learners/fig/hardware.ai diff --git a/episodes/fig/hardware.png b/learners/fig/hardware.png similarity index 100% rename from episodes/fig/hardware.png rename to learners/fig/hardware.png diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md index 9da78374..acac908f 100644 --- a/learners/technical-appendix.md +++ b/learners/technical-appendix.md @@ -6,8 +6,8 @@ The topics covered here exceed the level of knowledge required to benefit from t **Contents** -- [Viewing Python's ByteCode](#viewing-pythons-bytecode) -- []() +- [Viewing Python's ByteCode](#viewing-pythons-bytecode): What the Python code you write compiles to and executes as. +- [Hardware Level Memory Accesses](#hardware-level-memory-accesses): A look at how memory accesses pass through a processor's caches. - []() ## Viewing Python's ByteCode @@ -133,6 +133,50 @@ dis.dis(operatorSearch) 54 RETURN_VALUE ``` -## +## Hardware Level Memory Accesses + +The storage and movement of data plays a large role in the performance of executing software. + + +Modern computers typically have a single processor (CPU), within this processor there are multiple processing cores each capable of executing different code in parallel. + +Data held in memory by running software is exists in RAM, this memory is faster to access than hard drives (and solid-state drives). +But the CPU has much smaller caches on-board, to make accessing the most recent variables even faster. + +![An annotated photo of a computer's hardware.](learners/fig/annotated-motherboard.jpg){alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and harddrive are labelled."} + + +When reading a variable, to perform an operation with it, the CPU will first look in its registers. These exist per core, they are the location that computation is actually performed. Accessing them is incredibly fast, but there only exists enough storage for around 32 variables (typical number, e.g. 4 bytes). +As the register file is so small, most variables won't be found and the CPU's caches will be searched. +It will first check the current processing core's L1 (Level 1) cache, this small cache (typically 64 KB per physical core) is the smallest and fastest to access cache on a CPU. +If the variable is not found in the L1 cache, the L2 cache that is shared between multiple cores will be checked. This shared cache, is slower to access but larger than L1 (typically 1-3MB per core). +This process then repeats for the L3 cache which may be shared among all cores of the CPU. This cache again has higher latency to access, but increased size (typically slightly larger than the total L2 cache size). +If the variable has not been found in any of the CPU's cache, the CPU will look to the computer's RAM. This is an order of magnitude slower to access, with several orders of magnitude greater capacity (tens to hundreds of GB are now standard). + +Correspondingly, the earlier the CPU finds the variable the faster it will be to access. +However, to fully understand the cache's it's necessary to explain what happens once a variable has been found. 
+ +If a variable is not found in the caches, it must be fetched from RAM. +The full 64 byte cache line containing the variable, will be copied first into the CPU's L3, then L2 and then L1. +Most variables are only 4 or 8 bytes, so many neighbouring variables are also pulled into the caches. +Similarly, adding new data to a cache evicts old data. +This means that reading 16 integers contiguously stored in memory, should be faster than 16 scattered integers + +Therefore, to **optimally** access variables they should be stored contiguously in memory with related data and worked on whilst they remain in caches. +If you add to a variable, perform large amount of unrelated processing, then add to the variable again it will likely have been evicted from caches and need to be reloaded from slower RAM again. + + +It's not necessary to remember this full detail of how memory access work within a computer, but the context perhaps helps understand why memory locality is important. + +![An abstract diagram showing the path data takes from disk or RAM to be used for computation.](learners/fig/hardware.png){alt='An abstract representation of a CPU, RAM and Disk, showing their internal caches and the pathways data can pass.'} + +::::::::::::::::::::::::::::::::::::: callout + +Python as a programming language, does not give you enough control to carefully pack your variables in this manner (every variable is an object, so it's stored as a pointer that redirects to the actual data stored elsewhere). + +However all is not lost, packages such as `numpy` and `pandas` implemented in C/C++ enable Python users to take advantage of efficient memory accesses (when they are used correctly). + +::::::::::::::::::::::::::::::::::::::::::::: + ## \ No newline at end of file From 04276725c419aa788947307c2af0cadffceff7cf Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Tue, 15 Apr 2025 14:43:39 +0100 Subject: [PATCH 04/16] Move hashing data structures to technical appendix. --- ...optimisation-data-structures-algorithms.md | 22 ++++------------- .../fig/hash_linear_probe.ai | 0 .../fig/hash_linear_probing.png | Bin learners/technical-appendix.md | 23 +++++++++++++++--- 4 files changed, 25 insertions(+), 20 deletions(-) rename {episodes => learners}/fig/hash_linear_probe.ai (100%) rename {episodes => learners}/fig/hash_linear_probing.png (100%) diff --git a/episodes/optimisation-data-structures-algorithms.md b/episodes/optimisation-data-structures-algorithms.md index 14caf8f1..94ef68f8 100644 --- a/episodes/optimisation-data-structures-algorithms.md +++ b/episodes/optimisation-data-structures-algorithms.md @@ -156,14 +156,13 @@ Since Python 3.6, the items within a dictionary will iterate in the order that t ### Hashing Data Structures -Python's dictionaries are implemented as hashing data structures. -Explaining how these work will get a bit technical, so let's start with an analogy: +Python's dictionaries are implemented as hashing data structures, we can understand where these at a high-level with an analogy: A Python list is like having a single long bookshelf. When you buy a new book (append a new element to the list), you place it at the far end of the shelf, right after all the previous books. ![A bookshelf corresponding to a Python list.](episodes/fig/bookshelf_list.jpg){alt="An image of a single long bookshelf, with a large number of books."} -A hashing data structure is more like a bookcase with several shelves, labelled by genre (sci-fi, romance, children's books, non-fiction, …) and author surname. 
When you buy a new book by Jules Verne, you might place it on the shelf labelled "Sci-Fi, V–Z". +A dictionary is more like a bookcase with several shelves, labelled by genre (sci-fi, romance, children's books, non-fiction, …) and author surname. When you buy a new book by Jules Verne, you might place it on the shelf labelled "Sci-Fi, V–Z". And if you keep adding more books, at some point you'll move to a larger bookcase with more shelves (and thus more fine-grained sorting), to make sure you don't have too many books on a single shelf. ![A bookshelf corresponding to a Python dictionary.](episodes/fig/bookshelf_dict.jpg){alt="An image of two bookcases, labelled "Sci-Fi" and "Romance". Each bookcase contains shelves labelled in alphabetical order, with zero or few books on each shelf."} @@ -186,25 +185,14 @@ In practice, therefore, this trade-off between memory usage and speed is usually :::::::::::::::::::::::::::::::::::::::::::::::: +When a value is inserted into a dictionary, its key is hashed to decide on which "shelf" it should be stored. Most items will have a unique shelf, allowing them to be accessed directly. This is typically much faster for locating a specific item than searching a list. -::::::::::::::::::::::::::::::::::::: callout - -### Technical explanation - -Within a hashing data structure each inserted key is hashed to produce a (hopefully unique) integer key. -The dictionary is pre-allocated to a default size, and the key is assigned the index within the dictionary equivalent to the hash modulo the length of the dictionary. -If that index doesn't already contain another key, the key (and any associated values) can be inserted. -When the index isn't free, a collision strategy is applied. CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c) both use a form of open addressing whereby a hash is mutated and corresponding indices probed until a free one is located. -When the hashing data structure exceeds a given load factor (e.g. 2/3 of indices have been assigned keys), the internal storage must grow. This process requires every item to be re-inserted which can be expensive, but reduces the average probes for a key to be found. - -![An visual explanation of linear probing, CPython uses an advanced form of this.](episodes/fig/hash_linear_probing.png){alt="A diagram demonstrating how the keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. This is followed by the insertion of 59, 80 and 39 which require linear probing to be inserted due to collisions."} - -To retrieve or check for the existence of a key within a hashing data structure, the key is hashed again and a process equivalent to insertion is repeated. However, now the key at each index is checked for equality with the one provided. If any empty index is found before an equivalent key, then the key must not be present in the data structure. +::::::::::::::::::::::::::::::::::::: callout ### Keys -Keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented. +A dictionary's keys will typically be a core Python type such as a number or string. 
However, multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented. You can implement `__hash__()` by utilising the ability for Python to hash tuples, avoiding the need to implement a bespoke hash function. diff --git a/episodes/fig/hash_linear_probe.ai b/learners/fig/hash_linear_probe.ai similarity index 100% rename from episodes/fig/hash_linear_probe.ai rename to learners/fig/hash_linear_probe.ai diff --git a/episodes/fig/hash_linear_probing.png b/learners/fig/hash_linear_probing.png similarity index 100% rename from episodes/fig/hash_linear_probing.png rename to learners/fig/hash_linear_probing.png diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md index acac908f..4dc8a8f6 100644 --- a/learners/technical-appendix.md +++ b/learners/technical-appendix.md @@ -8,7 +8,7 @@ The topics covered here exceed the level of knowledge required to benefit from t - [Viewing Python's ByteCode](#viewing-pythons-bytecode): What the Python code you write compiles to and executes as. - [Hardware Level Memory Accesses](#hardware-level-memory-accesses): A look at how memory accesses pass through a processor's caches. -- []() +- [Hashing Data-Structures](#hashing-data-structures): A deeper look at how data structures such as Dictionaries operate. ## Viewing Python's ByteCode @@ -143,7 +143,7 @@ Modern computers typically have a single processor (CPU), within this processor Data held in memory by running software is exists in RAM, this memory is faster to access than hard drives (and solid-state drives). But the CPU has much smaller caches on-board, to make accessing the most recent variables even faster. -![An annotated photo of a computer's hardware.](learners/fig/annotated-motherboard.jpg){alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and harddrive are labelled."} +![An annotated photo of a computer's hardware.](learners/fig/annotated-motherboard.jpg){alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and hard-drive are labelled."} When reading a variable, to perform an operation with it, the CPU will first look in its registers. These exist per core, they are the location that computation is actually performed. Accessing them is incredibly fast, but there only exists enough storage for around 32 variables (typical number, e.g. 4 bytes). @@ -179,4 +179,21 @@ However all is not lost, packages such as `numpy` and `pandas` implemented in C/ ::::::::::::::::::::::::::::::::::::::::::::: -## \ No newline at end of file +## Hashing Data-Structures + +Within a hashing data structure (such as a Dictionary or Set) each inserted key is hashed to produce a (preferably unique) integer key, which serves as the basis for indexing. Dictionaries are initialized with a default size, and the hash value of a key, modulo the dictionary's length, determines its initial index. If this index is available, the key and its associated value are stored there. If the index is already occupied, a collision occurs, and a resolution strategy is applied to find an alternate index. + +In CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c)implementations, a technique called open addressing is employed. 
This approach modifies the hash and probes subsequent indices until an empty one is found.

When a dictionary or hash table in Python grows, the underlying storage is resized, which necessitates re-inserting every existing item into the new structure. This process can be computationally expensive but is essential for maintaining efficient average probe times when searching for keys.

![A visual explanation of linear probing, CPython uses an advanced form of this.](learners/fig/hash_linear_probing.png){alt="A diagram showing how keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. The insertion of 59, 80, and 39 demonstrates linear probing to resolve collisions."}

To look up or verify the existence of a key in a hashing data structure, the key is re-hashed, and the process mirrors that of insertion. The corresponding index is probed to see if it contains the provided key. If the key at the index matches, the operation succeeds. If an empty index is reached before finding the key, it indicates that the key does not exist in the structure.

The above diagram shows a hash table of 5 elements within a block of 11 slots:

1. We try to add element k=59. Based on its hash, the intended position is p=4. However, slot 4 is already occupied by the element k=37. This results in a collision.
2. To resolve the collision, the linear probing mechanism is employed. The algorithm checks the next available slot, starting from position p=4. The first available slot is found at position 5.
3. The number of jumps (or steps) it took to find the available slot is represented by i=1 (since we moved from position 4 to 5).
In this case, the number of jumps i=1 indicates that the algorithm had to probe one slot to find an empty position at index 5.
\ No newline at end of file

From 637152506a40249cff06fd70e1842237e7dd89da Mon Sep 17 00:00:00 2001
From: Robert Chisholm
Date: Tue, 15 Apr 2025 14:52:55 +0100
Subject: [PATCH 05/16] Fix some dead links etc on review.

---
 learners/acknowledgements.md   | 6 +++---
 learners/technical-appendix.md | 3 +++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/learners/acknowledgements.md b/learners/acknowledgements.md
index 27c4506d..34d03368 100644
--- a/learners/acknowledgements.md
+++ b/learners/acknowledgements.md
@@ -16,8 +16,8 @@ Anastasiia Shcherbakova and Mira Sarkis of [ICR-RSE](https://github.com/ICR-RSE-

**Resources**

-Most of the content was drawn from the education and experience of the author, however the below resources provided inspiration:
+Most of the content was drawn from the education and experience of the authors, however the below resources provided inspiration:

-* [High Performance Python, 2nd Edition](https://www.oreilly.com/library/view/high-performance-python/9781492055013/): This excellent book goes far deeper than this short course in explaining how to maximise performance in Python, however it inspired the examples; [memory allocation is not free](optimisation-memory.html#memory-allocation-is-not-free) and [vectorisation](optimisation-memory.html#memory-allocation-is-not-free).
-* [What scientists must know about hardware to write fast code](https://viralinstruction.com/posts/hardware/): This notebook provides an array of hardware lessons relevant to programming for performance, which could be similarly found in most undergraduate Computer Science courses.
Although the notebook is grounded in Julia, a lower level language than Python, it is referring to hardware so many of same lessons are covered in the [lRWBXT episode](optimisation-latency).
* [Why Python is Slow: Looking Under the Hood](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/): This blog post looks under the hood of CPython to explain why Python is often slower than C (and NumPy). We reproduced two of its figures in the [optimisation introduction](optimisation-introduction.html) and [numpy](optimisation-numpy) episodes to explain how memory is laid out.

diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md
index 4dc8a8f6..4c0ff2a6 100644
--- a/learners/technical-appendix.md
+++ b/learners/technical-appendix.md
@@ -133,6 +133,9 @@ dis.dis(operatorSearch)
          54 RETURN_VALUE
 ```

+A naive assessment of how expensive two functions are can be carried out with this comparison.
+However, this method only shows bytecode for the requested function, so it does not reveal how expensive any called functions will be, nor does it reflect higher-level changes to an algorithm which could reduce the number of iterations required.
+
 ## Hardware Level Memory Accesses

From 5a1985bcc195a5af88770b34bd9db9948bd918af Mon Sep 17 00:00:00 2001
From: Robert Chisholm
Date: Sat, 10 May 2025 20:28:56 +0100
Subject: [PATCH 06/16] Update episodes/optimisation-data-structures-algorithms.md

Co-authored-by: Jost Migenda
---
 episodes/optimisation-data-structures-algorithms.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/episodes/optimisation-data-structures-algorithms.md b/episodes/optimisation-data-structures-algorithms.md
index 94ef68f8..b11ef0b8 100644
--- a/episodes/optimisation-data-structures-algorithms.md
+++ b/episodes/optimisation-data-structures-algorithms.md
@@ -156,7 +156,7 @@ Since Python 3.6, the items within a dictionary will iterate in the order that t

 ### Hashing Data Structures

-Python's dictionaries are implemented as hashing data structures, we can understand where these at a high-level with an analogy:
+Python's dictionaries are implemented as hashing data structures, we can understand these at a high-level with an analogy:

 A Python list is like having a single long bookshelf. When you buy a new book (append a new element to the list), you place it at the far end of the shelf, right after all the previous books.
From ea013ee68246b218cc4d064400162a3221fd0eaa Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sat, 10 May 2025 20:29:12 +0100 Subject: [PATCH 07/16] Update episodes/optimisation-data-structures-algorithms.md Co-authored-by: Jost Migenda --- episodes/optimisation-data-structures-algorithms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/optimisation-data-structures-algorithms.md b/episodes/optimisation-data-structures-algorithms.md index b11ef0b8..785f008f 100644 --- a/episodes/optimisation-data-structures-algorithms.md +++ b/episodes/optimisation-data-structures-algorithms.md @@ -162,7 +162,7 @@ A Python list is like having a single long bookshelf. When you buy a new book (a ![A bookshelf corresponding to a Python list.](episodes/fig/bookshelf_list.jpg){alt="An image of a single long bookshelf, with a large number of books."} -A dictionary is more like a bookcase with several shelves, labelled by genre (sci-fi, romance, children's books, non-fiction, …) and author surname. When you buy a new book by Jules Verne, you might place it on the shelf labelled "Sci-Fi, V–Z". +A Python dictionary is more like a bookcase with several shelves, labelled by genre (sci-fi, romance, children's books, non-fiction, …) and author surname. When you buy a new book by Jules Verne, you might place it on the shelf labelled "Sci-Fi, V–Z". And if you keep adding more books, at some point you'll move to a larger bookcase with more shelves (and thus more fine-grained sorting), to make sure you don't have too many books on a single shelf. ![A bookshelf corresponding to a Python dictionary.](episodes/fig/bookshelf_dict.jpg){alt="An image of two bookcases, labelled "Sci-Fi" and "Romance". Each bookcase contains shelves labelled in alphabetical order, with zero or few books on each shelf."} From a68d6ec5e0195d35bacaeb9b76026be752ba73ae Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sat, 10 May 2025 20:30:11 +0100 Subject: [PATCH 08/16] Update learners/technical-appendix.md Co-authored-by: Jost Migenda --- learners/technical-appendix.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md index 4c0ff2a6..2aa09d3e 100644 --- a/learners/technical-appendix.md +++ b/learners/technical-appendix.md @@ -184,7 +184,7 @@ However all is not lost, packages such as `numpy` and `pandas` implemented in C/ ## Hashing Data-Structures -Within a hashing data structure (such as a Dictionary or Set) each inserted key is hashed to produce a (preferably unique) integer key, which serves as the basis for indexing. Dictionaries are initialized with a default size, and the hash value of a key, modulo the dictionary's length, determines its initial index. If this index is available, the key and its associated value are stored there. If the index is already occupied, a collision occurs, and a resolution strategy is applied to find an alternate index. +Within a hashing data structure (such as a dictionary or set) each inserted key is hashed to produce a (preferably unique) integer key, which serves as the basis for indexing. Dictionaries are initialized with a default size, and the initial index of a key is determined by its hash value, modulo the dictionary's length. If this index is available, the key and its associated value are stored there. If the index is already occupied, a collision occurs, and a resolution strategy is applied to find an alternate index. 
In CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c)implementations, a technique called open addressing is employed. This approach modifies the hash and probes subsequent indices until an empty one is found. From 3425ca40db84c50e1f53b0bd7ed1b86e6f0c0d9d Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sat, 10 May 2025 20:30:20 +0100 Subject: [PATCH 09/16] Update episodes/optimisation-data-structures-algorithms.md Co-authored-by: Jost Migenda --- episodes/optimisation-data-structures-algorithms.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/optimisation-data-structures-algorithms.md b/episodes/optimisation-data-structures-algorithms.md index 785f008f..47fad6eb 100644 --- a/episodes/optimisation-data-structures-algorithms.md +++ b/episodes/optimisation-data-structures-algorithms.md @@ -192,7 +192,7 @@ When a value is inserted into a dictionary, its key is hashed to decide on which ### Keys -A dictionary's keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented. +A dictionary's keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented. You can implement `__hash__()` by utilising the ability for Python to hash tuples, avoiding the need to implement a bespoke hash function. From 1726004747ddb8f307928822833ed9d84f487623 Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sat, 10 May 2025 20:30:29 +0100 Subject: [PATCH 10/16] Update learners/technical-appendix.md Co-authored-by: Jost Migenda --- learners/technical-appendix.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md index 2aa09d3e..936d6533 100644 --- a/learners/technical-appendix.md +++ b/learners/technical-appendix.md @@ -186,7 +186,7 @@ However all is not lost, packages such as `numpy` and `pandas` implemented in C/ Within a hashing data structure (such as a dictionary or set) each inserted key is hashed to produce a (preferably unique) integer key, which serves as the basis for indexing. Dictionaries are initialized with a default size, and the initial index of a key is determined by its hash value, modulo the dictionary's length. If this index is available, the key and its associated value are stored there. If the index is already occupied, a collision occurs, and a resolution strategy is applied to find an alternate index. -In CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c)implementations, a technique called open addressing is employed. This approach modifies the hash and probes subsequent indices until an empty one is found. +In CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c) implementations, a technique called open addressing is employed. This approach modifies the hash and probes subsequent indices until an empty one is found. 
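As a minimal sketch of the idea (illustrative only, not CPython's actual code: the real implementation perturbs the hash rather than stepping linearly, and resizes before the table fills), simple linear probing can be written with the same keys and 11 slots used in the linear probing figure:

```python
table = [None] * 11  # 11 empty slots, as in the linear probing figure

def insert(key):
    i = hash(key) % len(table)   # initial index derived from the hash
    while table[i] is not None:  # collision: probe the next slot
        i = (i + 1) % len(table)
    table[i] = key               # assumes the table never completely fills

def contains(key):
    i = hash(key) % len(table)
    while table[i] is not None:  # reaching an empty slot means the key is absent
        if table[i] == key:
            return True
        i = (i + 1) % len(table)
    return False

for k in (37, 64, 14, 94, 67, 59, 80, 39):
    insert(k)

print(contains(59), contains(100))  # True False
```

Inserting 59 collides with 37 at index 4 and settles at index 5 after a single probe, which matches the first step worked through in the figure.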
When a dictionary or hash table in Python grows, the underlying storage is resized, which necessitates re-inserting every existing item into the new structure. This process can be computationally expensive but is essential for maintaining efficient average probe times when searching for keys. From b8114747858090f30eca8cb0ba86b19cbd95ede0 Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sat, 10 May 2025 20:30:52 +0100 Subject: [PATCH 11/16] Update episodes/optimisation-latency.md Co-authored-by: Jost Migenda --- episodes/optimisation-latency.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/optimisation-latency.md b/episodes/optimisation-latency.md index 41dee2ad..e89a4f21 100644 --- a/episodes/optimisation-latency.md +++ b/episodes/optimisation-latency.md @@ -24,7 +24,7 @@ exercises: 0 ## Accessing Disk -When reading data from a file, it is first transferred from the disk, to the disk cache, to the RAM (the computer's main memory, where variables are stored). +When reading data from a file, it is first transferred from the disk to the disk cache and then to the RAM (the computer's main memory, where variables are stored). The latency to access files on disk is another order of magnitude higher than accessing normal variables. As such, disk accesses benefit from sequential accesses and reading larger blocks together rather than single variables. From 0f7cab0bbfb60aa63cb91ad2018867f168e692f1 Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sat, 10 May 2025 20:31:04 +0100 Subject: [PATCH 12/16] Update learners/technical-appendix.md Co-authored-by: Jost Migenda --- learners/technical-appendix.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md index 936d6533..6e548b66 100644 --- a/learners/technical-appendix.md +++ b/learners/technical-appendix.md @@ -143,7 +143,7 @@ The storage and movement of data plays a large role in the performance of execut Modern computers typically have a single processor (CPU), within this processor there are multiple processing cores each capable of executing different code in parallel. -Data held in memory by running software is exists in RAM, this memory is faster to access than hard drives (and solid-state drives). +Data held in memory by running software exists in RAM, this memory is faster to access than hard drives (and solid-state drives). But the CPU has much smaller caches on-board, to make accessing the most recent variables even faster. ![An annotated photo of a computer's hardware.](learners/fig/annotated-motherboard.jpg){alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and hard-drive are labelled."} From 60b427be314bf5817f95e9cdc70f325fe4a284f2 Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sat, 10 May 2025 20:31:28 +0100 Subject: [PATCH 13/16] Update learners/technical-appendix.md Co-authored-by: Jost Migenda --- learners/technical-appendix.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md index 6e548b66..15ff1775 100644 --- a/learners/technical-appendix.md +++ b/learners/technical-appendix.md @@ -157,7 +157,7 @@ This process then repeats for the L3 cache which may be shared among all cores o If the variable has not been found in any of the CPU's cache, the CPU will look to the computer's RAM. 
This is an order of magnitude slower to access, with several orders of magnitude greater capacity (tens to hundreds of GB are now standard). Correspondingly, the earlier the CPU finds the variable the faster it will be to access. -However, to fully understand the cache's it's necessary to explain what happens once a variable has been found. +However, to fully understand the caches it's necessary to explain what happens once a variable has been found. If a variable is not found in the caches, it must be fetched from RAM. The full 64 byte cache line containing the variable, will be copied first into the CPU's L3, then L2 and then L1. From efa147f6e40dc5be87d9be35422c46fab60f2b86 Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sat, 10 May 2025 20:31:48 +0100 Subject: [PATCH 14/16] Update learners/technical-appendix.md Co-authored-by: Jost Migenda --- learners/technical-appendix.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md index 15ff1775..a74c09dc 100644 --- a/learners/technical-appendix.md +++ b/learners/technical-appendix.md @@ -163,8 +163,7 @@ If a variable is not found in the caches, it must be fetched from RAM. The full 64 byte cache line containing the variable, will be copied first into the CPU's L3, then L2 and then L1. Most variables are only 4 or 8 bytes, so many neighbouring variables are also pulled into the caches. Similarly, adding new data to a cache evicts old data. -This means that reading 16 integers contiguously stored in memory, should be faster than 16 scattered integers - +This means that reading 16 integers contiguously stored in memory should be faster than 16 scattered integers. Therefore, to **optimally** access variables they should be stored contiguously in memory with related data and worked on whilst they remain in caches. If you add to a variable, perform large amount of unrelated processing, then add to the variable again it will likely have been evicted from caches and need to be reloaded from slower RAM again. From f18ed506151c458deeb193b7be63ce2b15f028af Mon Sep 17 00:00:00 2001 From: Robert Chisholm Date: Sat, 10 May 2025 20:33:17 +0100 Subject: [PATCH 15/16] Update learners/acknowledgements.md Co-authored-by: Jost Migenda --- learners/acknowledgements.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/learners/acknowledgements.md b/learners/acknowledgements.md index 34d03368..a03b7834 100644 --- a/learners/acknowledgements.md +++ b/learners/acknowledgements.md @@ -19,5 +19,5 @@ Anastasiia Shcherbakova and Mira Sarkis of [ICR-RSE](https://github.com/ICR-RSE- Most of the content was drawn from the education and experience of the authors, however the below resources provided inspiration: * [High Performance Python, 2nd Edition](https://www.oreilly.com/library/view/high-performance-python/9781492055013/): This excellent book goes far deeper than this short course in explaining how to maximise performance in Python, however it inspired the examples; [memory allocation is not free](optimisation-latency.html#memory-allocation-is-not-free) and [vectorisation](optimisation-latency.html#memory-allocation-is-not-free). -* [What scientists must know about hardware to write fast code](https://viralinstruction.com/posts/hardware/): This notebook provides an array of hardware lessons relevant to programming for performance, which could be similarly found in most undergraduate Computer Science courses. 
Although the notebook is grounded in Julia, a lower level language than Python, it is referring to hardware, so many of the same lessons are covered in the [latency episode](optimisation-latency).
* [Why Python is Slow: Looking Under the Hood](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/): This blog post looks under the hood of CPython to explain why Python is often slower than C (and NumPy). We reproduced two of its figures in the [optimisation introduction](optimisation-introduction.html) and [numpy](optimisation-numpy) episodes to explain how memory is laid out.

From f45c94a61d5fe8aa7d0d031c009d6256afc97cc9 Mon Sep 17 00:00:00 2001
From: Robert Chisholm
Date: Sat, 10 May 2025 20:33:41 +0100
Subject: [PATCH 16/16] Update learners/technical-appendix.md

Co-authored-by: Jost Migenda
---
 learners/technical-appendix.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/learners/technical-appendix.md b/learners/technical-appendix.md
index a74c09dc..89310fe7 100644
--- a/learners/technical-appendix.md
+++ b/learners/technical-appendix.md
@@ -10,11 +10,11 @@ The topics covered here exceed the level of knowledge required to benefit from t
 - [Hardware Level Memory Accesses](#hardware-level-memory-accesses): A look at how memory accesses pass through a processor's caches.
 - [Hashing Data-Structures](#hashing-data-structures): A deeper look at how data structures such as Dictionaries operate.

-## Viewing Python's ByteCode
+## Viewing Python's Bytecode

-You can use `dis` to view the bytecode generated by Python, the amount of bytecode more strongly correlates with how much code is being executed by the Python interpreter and hence how long it may take to execute. However, this is a crude proxy as it does not account for whether functions that are called and whether those functions are implemented using Python or C.
+You can use `dis` to view the bytecode generated by Python. The amount of bytecode more strongly correlates with how much code is being executed by the Python interpreter and hence how long it may take to execute. However, this is a crude proxy as it does not account for functions that are called and whether those functions are implemented using Python or C.

-The pure Python search compiles to 82 lines of byte-code.
+The pure Python search compiles to 82 lines of bytecode.

 ```python
 import dis