Subinterpreters are supported. Each subinterpreter maintains its own
``sys.lazy_modules`` and import state, so lazy imports in one subinterpreter
do not affect others.

Performance
-----------

Lazy imports have **no measurable performance overhead**. The implementation
is designed to be performance-neutral for both code that uses lazy imports and
code that doesn't.

Runtime performance
~~~~~~~~~~~~~~~~~~~

After reification (first use), lazy imports have **zero overhead**. The
adaptive interpreter specializes the bytecode (typically after 2-3 accesses),
eliminating any remaining checks. For example, ``LOAD_GLOBAL`` becomes
``LOAD_GLOBAL_MODULE``, which accesses the module exactly as it would for a
normal import.

The `pyperformance suite`_ confirms that the implementation is
performance-neutral.

.. _pyperformance suite: https://github.com/facebookexperimental/
   free-threading-benchmarking/blob/main/results/bm-20250922-3.15.0a0-27836e5/
   bm-20250922-vultr-x86_64-DinoV-lazy_imports-3.15.0a0-27836e5-vs-base.svg

Filter function performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The filter function (set via ``sys.set_lazy_imports_filter()``) is called for
every *potentially lazy* import to determine whether it should actually be
lazy. When no filter is set, this is simply a NULL check (testing whether a
filter function has been registered), which is a highly predictable branch
that adds essentially no overhead. When a filter is installed, it is called
for each potentially lazy import, but this still has **almost no measurable
performance cost**. To measure this, we benchmarked importing all 278
top-level importable modules from the Python standard library (which
transitively loads 392 modules in total, including all submodules and
dependencies), then forced reification of every loaded module to ensure
everything was fully materialized.

Note that these measurements establish the baseline overhead of the filter
mechanism itself. Any user-defined filter function that performs additional
work beyond a trivial check will add overhead proportional to the complexity
of that work; in practice, we expect this overhead to be dwarfed by the
performance benefits of avoiding unnecessary imports. The benchmarks below
measure the minimal cost of the filter dispatch mechanism when the filter
function does essentially nothing, as in the sketch that follows.
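
A minimal sketch of such "do nothing" filters is shown below. The parameter
names, and the exact signature the filter is called with, are illustrative
assumptions here rather than a definitive description of the API:

.. code-block:: python

   import sys

   def force_eager(importer, name, fromlist):
       # Configuration 2 below: the filter is invoked for every potentially
       # lazy import but always answers "not lazy", so imports run eagerly.
       return False

   def allow_lazy(importer, name, fromlist):
       # Configuration 3 below: always answers "lazy"; reification happens
       # later, on first use.
       return True

   sys.set_lazy_imports_filter(force_eager)  # or allow_lazy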

We compared four different configurations:

.. list-table::
   :header-rows: 1
   :widths: 50 25 25

   * - Configuration
     - Mean ± Std Dev (ms)
     - Overhead vs Baseline
   * - **Eager imports** (baseline)
     - 161.2 ± 4.3
     - 0%
   * - **Lazy + filter forcing eager**
     - 161.7 ± 4.2
     - +0.3% ± 3.7%
   * - **Lazy + filter allowing lazy + reification**
     - 162.0 ± 4.0
     - +0.5% ± 3.7%
   * - **Lazy + no filter + reification**
     - 161.4 ± 4.3
     - +0.1% ± 3.8%

The four configurations:

1. **Eager imports (baseline)**: Normal Python imports with no lazy machinery;
   this is standard Python behavior.

2. **Lazy + filter forcing eager**: The filter function returns ``False`` for
   all imports, forcing eager execution, and all imports are reified at script
   end. This measures pure filter-calling overhead, since every import goes
   through the filter but executes eagerly.

3. **Lazy + filter allowing lazy + reification**: The filter function returns
   ``True`` for all imports, allowing lazy execution, and all imports are
   reified at script end. This measures filter overhead when imports are
   actually lazy.

4. **Lazy + no filter + reification**: No filter is installed; imports are
   lazy and reified at script end. This is the baseline for lazy behavior
   without a filter.

The benchmarks used `hyperfine <https://github.com/sharkdp/hyperfine>`_,
testing 278 standard library modules, with each run in a fresh Python process.
All configurations force the import of exactly the same set of modules (all
modules loaded by the eager baseline) to ensure a fair comparison; a sketch of
the generated benchmark script is shown below.
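
The benchmark script itself was generated ahead of time. The generator below
is an illustrative reconstruction, not the actual harness: it assumes the
``lazy import`` syntax proposed in this PEP and uses
``sys.stdlib_module_names`` to enumerate candidate modules:

.. code-block:: python

   import importlib
   import sys

   # Keep only top-level stdlib modules that are importable on this platform.
   importable = []
   for name in sorted(sys.stdlib_module_names):
       if name.startswith("_"):
           continue  # skip private/implementation modules
       try:
           importlib.import_module(name)
       except ImportError:
           continue
       importable.append(name)

   with open("bench_lazy.py", "w") as f:
       for name in importable:
           # PEP-proposed syntax; the eager baseline writes plain imports.
           f.write(f"lazy import {name}\n")
       for name in importable:
           # Touching each binding at the end of the script forces
           # reification, so every configuration loads the same modules.
           f.write(f"_ = {name}.__name__\n")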

The benchmark environment used CPU isolation with 32 logical CPUs (0-15 at
3200 MHz, 16-31 at 2400 MHz), the performance scaling governor, Turbo Boost
disabled, and full ASLR randomization. The overhead error bars are computed
using standard error propagation for the formula ``(value - baseline) /
baseline``, accounting for the uncertainties in both the measured value and
the baseline.
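
As a quick sanity check (not part of the benchmark harness), the
"+0.3% ± 3.7%" row can be reproduced directly from the table values:

.. code-block:: python

   from math import sqrt

   baseline, sigma_b = 161.2, 4.3   # eager imports, mean ± std dev (ms)
   value, sigma_v = 161.7, 4.2      # lazy + filter forcing eager (ms)

   overhead = (value - baseline) / baseline

   # First-order error propagation for r = (value - baseline) / baseline:
   # sigma_r**2 = (sigma_v / baseline)**2 + (value * sigma_b / baseline**2)**2
   sigma_r = sqrt((sigma_v / baseline) ** 2
                  + (value * sigma_b / baseline ** 2) ** 2)

   print(f"{overhead:+.1%} ± {sigma_r:.1%}")   # +0.3% ± 3.7%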

Startup time improvements
~~~~~~~~~~~~~~~~~~~~~~~~~

The primary performance benefit of lazy imports is reduced startup time, by
loading only the modules actually used at runtime rather than optimistically
loading entire dependency trees at startup.
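
For example, a command-line tool that needs a heavy dependency for only one of
its subcommands never pays for it on the common path. The sketch below uses
the ``lazy import`` syntax proposed in this PEP, with a hypothetical
``plotting`` module standing in for an expensive dependency:

.. code-block:: python

   import sys

   lazy import json      # binds a lazy object; nothing is loaded yet
   lazy import plotting  # hypothetical, expensive-to-import dependency

   def main(argv):
       if len(argv) > 1 and argv[1] == "plot":
           plotting.render()  # first use: ``plotting`` is imported here
       else:
           # ``plotting`` is never imported on this path.
           print(json.dumps({"status": "ok"}))

   if __name__ == "__main__":
       main(sys.argv)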

Real-world deployments at scale have demonstrated that the benefits can be
very large, though they naturally depend on the specific codebase and usage
patterns. Organizations with large, interconnected codebases have reported
substantial reductions in server reload times, ML training initialization
time, command-line tool startup time, and Jupyter notebook loading time.
Memory usage improvements have also been observed, as unused modules remain
unloaded.

For detailed case studies and performance data from production deployments,
see:

- `Python Lazy Imports With Cinder
  <https://developers.facebook.com/blog/post/2022/06/15/python-lazy-imports-with-cinder/>`__
  (Meta Instagram Server)
- `Lazy is the new fast: How Lazy Imports and Cinder accelerate machine
  learning at Meta
  <https://engineering.fb.com/2024/01/18/developer-tools/lazy-imports-cinder-machine-learning-meta/>`__
  (Meta ML Workloads)
- `Inside HRT's Python Fork
  <https://www.hudsonrivertrading.com/hrtbeat/inside-hrts-python-fork/>`__
  (Hudson River Trading)

The benefits scale with codebase complexity: the larger and more
interconnected the codebase, the more dramatic the improvements.

Typing and tools
----------------