Performance TODOs #144388

@markshannon

Important

This is a meta issue listing possible performance improvements that:

  • are not too hard, but not easy either: some knowledge of computer science is necessary.
  • do not involve original research or changes to multiple parts of the VM.
  • should produce a worthwhile performance improvement.
  • are self-contained:
    • Not increasing coupling or complexity in the code base
    • Can be worked on without troublesome merge conflicts

Since this is a meta issue, please make sure there is a dedicated issue for an item before you start working on it.

In no particular order:

Convert basic blocks to extended basic blocks in the bytecode compiler

Many local optimizations in the bytecode compiler are limited to a single basic block, but would be more effective and still correct applied to extended basic blocks.
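
As a rough illustration of the grouping involved, here is a minimal sketch over a toy control-flow graph. It is not the bytecode compiler's data structure: the dict-based `cfg` and the function name `extended_basic_blocks` are assumptions made for this example. It uses one common definition of an extended basic block: a leader block plus every block reachable from it that has exactly one predecessor.

```python
from collections import defaultdict


def extended_basic_blocks(cfg, entry):
    """Group the blocks of a toy CFG into extended basic blocks (EBBs).

    cfg maps a block name to the list of its successor block names.
    A block starts a new EBB if it is the entry or does not have exactly
    one predecessor; otherwise it joins its predecessor's EBB.
    """
    preds = defaultdict(list)
    for block, succs in cfg.items():
        for succ in succs:
            preds[succ].append(block)

    ebbs = {}   # leader -> member blocks, in discovery order
    owner = {}  # block -> leader of the EBB it belongs to

    # Depth-first walk from the entry, so a block's single predecessor
    # has always been assigned to an EBB before the block itself.
    stack = [entry]
    seen = set()
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        if block == entry or len(preds[block]) != 1:
            leader = block
        else:
            leader = owner[preds[block][0]]
        owner[block] = leader
        ebbs.setdefault(leader, []).append(block)
        stack.extend(reversed(cfg.get(block, [])))
    return ebbs


# Diamond-shaped CFG: A branches to B and C, which both join at D.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(extended_basic_blocks(diamond, "A"))
```

In the diamond, both arms of the branch land in the same extended basic block as the entry, so facts established in A remain valid when a local optimization reaches B or C. Only the join point D has to start afresh.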

Better conversion of LOAD_FAST to LOAD_FAST_BORROW in the bytecode compiler

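For example, the final load of b (at label L1) is left as LOAD_FAST instead of being converted to LOAD_FAST_BORROW:

>>> import dis
>>> def f(a, b):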
...     return a if a < b else b
>>> dis.dis(f)
  1           RESUME                   0

  2           LOAD_FAST_BORROW_LOAD_FAST_BORROW 1 (a, b)
              COMPARE_OP              18 (bool(<))
              POP_JUMP_IF_FALSE        3 (to L1)
              NOT_TAKEN
              LOAD_FAST_BORROW         0 (a)
              RETURN_VALUE
      L1:     LOAD_FAST                1 (b)
              RETURN_VALUE

It is possible that extended basic blocks would fix this, or it might be a separate problem.
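
To check whether a change helps, it can be handy to list the loads that were not converted. The helper below is only a sketch built on the standard `dis` module; the name `plain_load_fasts` is made up for this example, and the result depends on the Python version (versions without LOAD_FAST_BORROW will report every plain load).

```python
import dis


def plain_load_fasts(func):
    """Return (offset, local name) for each LOAD_FAST left in func's bytecode."""
    return [
        (instr.offset, instr.argval)
        for instr in dis.get_instructions(func)
        if instr.opname == "LOAD_FAST"
    ]


def f(a, b):
    return a if a < b else b


# Each entry is a load that was left as a plain LOAD_FAST.
print(plain_load_fasts(f))
```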

Replace _CHECK_STACK_SPACE with _CHECK_STACK_SPACE_OPERAND in the JIT

We removed the optimization that did this because it tried to convert multiple _CHECK_STACK_SPACEs into a single _CHECK_STACK_SPACE_OPERAND. Replacing them one by one should be much simpler.
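
The one-by-one replacement can be pictured with a toy model. The real optimizer is written in C and works on the JIT's own uop buffers, so everything here is an assumption for illustration: uops are modelled as `(name, operand)` tuples, and `framesize_of` is a hypothetical mapping from a trace position to the callee frame size when the optimizer happens to know it.

```python
def replace_stack_checks(trace, framesize_of):
    """Rewrite each _CHECK_STACK_SPACE independently of the others.

    trace is a list of (uop_name, operand) pairs. framesize_of maps a
    position in the trace to the callee's frame size when it is known.
    Checks whose frame size is unknown are left untouched.
    """
    result = []
    for index, (name, operand) in enumerate(trace):
        if name == "_CHECK_STACK_SPACE" and index in framesize_of:
            result.append(("_CHECK_STACK_SPACE_OPERAND", framesize_of[index]))
        else:
            result.append((name, operand))
    return result


trace = [
    ("_CHECK_STACK_SPACE", 0),
    ("_PUSH_FRAME", 0),
    ("_CHECK_STACK_SPACE", 0),
    ("_PUSH_FRAME", 0),
]
# Only the first callee's frame size is known in this example.
print(replace_stack_checks(trace, {0: 12}))
```

Because each check is rewritten on its own, no bookkeeping across frames is needed, which is where the earlier combined version became complicated.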

Function, and maybe code, watchers

We have class and dictionary watchers, and we use them effectively in the JIT. There are a number of optimizations we would like to do, but cannot because functions and code objects can change at runtime and we don't have watchers for them.

We might not need code watchers, as we do a complete de-optimization when any code object is instrumented. Having code watchers might allow more targeted de-optimizations. We should do function watchers first, though.
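
To show what a function watcher has to guard against, here is a pure-Python toy. It is not a proposal for the actual mechanism: the `FunctionWatchers` class and its `set_code` method are invented for this sketch, and a real watcher would be triggered by the VM itself rather than by an explicit call.

```python
class FunctionWatchers:
    """Toy registry of callbacks to run when a watched function changes."""

    def __init__(self):
        self._callbacks = {}  # id(function) -> list of callbacks

    def watch(self, func, callback):
        self._callbacks.setdefault(id(func), []).append(callback)

    def set_code(self, func, new_code):
        # Swap the code object, then tell every watcher of this function.
        func.__code__ = new_code
        for callback in self._callbacks.get(id(func), []):
            callback(func)


def one():
    return 1


def two():
    return 2


watchers = FunctionWatchers()
invalidated = []  # stands in for "traces that must be thrown away"
watchers.watch(one, lambda func: invalidated.append(func.__name__))

# The function object stays the same, but its behaviour changes at runtime.
watchers.set_code(one, two.__code__)
print(one(), invalidated)
```

Any optimization that assumed the old body of `one` is now wrong, which is why the JIT needs a notification before it can rely on a function staying unchanged.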

Track which locals are NULL/immortal/borrowed in the bytecode compiler

We could then use this information to speed up RETURN_VALUE, as it wouldn't need to DECREF those locals. This might make sense in the interpreter, but would probably only be of value in the JIT.

Reduce or eliminate the cost of updating the insertion order when initializing an object with STORE_ATTR_INSTANCE_VALUE

STORE_ATTR_INSTANCE_VALUE does three things:

  • Stores the new value
  • Decrefs the old value, if there was one
  • Updates the insertion order array

Updating the insertion order array is possibly the most expensive of the three, and could easily be optimised. We could:

  • Instead of recording the position, record the delta from the "natural" position. In many cases this would be zero and we could skip the write.
  • In the JIT, determine the cases where no write would be made and eliminate the code for it.
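
The delta idea can be sketched with a toy model. This is not CPython's actual layout: the class `InlineValues`, its fields, and the write counter are assumptions for illustration, the order array is assumed to start zero-filled, and attribute deletion is ignored.

```python
_NULL = object()  # stands in for a NULL slot


class InlineValues:
    """Toy model of an object's inline attribute values."""

    def __init__(self, capacity):
        self.values = [_NULL] * capacity
        self.deltas = [0] * capacity  # zero-initialised order array
        self.size = 0                 # number of attributes stored so far
        self.order_writes = 0         # counts writes to the order array

    def store(self, index, value):
        if self.values[index] is _NULL:
            # First store to this slot: it takes the next insertion position.
            delta = index - self.size
            if delta != 0:
                self.deltas[self.size] = delta
                self.order_writes += 1
            self.size += 1
        self.values[index] = value

    def insertion_order(self):
        # Recover the slot index stored at each insertion position.
        return [pos + self.deltas[pos] for pos in range(self.size)]


obj = InlineValues(3)
for slot, value in enumerate(["x", "y", "z"]):
    obj.store(slot, value)  # attributes initialised in their natural order
print(obj.insertion_order(), obj.order_writes)
```

When attributes are set in the same order as their slots, which is the usual case for an `__init__` that always assigns the same attributes, every delta is zero and the order array is never written. Only out-of-order initialisation pays for the write.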

Avoid refcount operation in LOAD_FAST_BORROW RETURN_VALUE sequence in non-generator functions.

The idea is that the return value outlives the frame, so the local reference can be borrowed from the return value. In cases where the local is not borrowed, we can put a borrowed reference in the local and move the non-borrowed reference to the stack. We can then omit the code to make the returned value heap safe as it already would be.

This could, in theory, be done in the bytecode compiler, but it isn't worth using up extra opcodes for, so this would be a JIT optimization.

This could work well with local tracking in the compiler, and it would mean fewer decrefs needed to clear the frame when returning.

Optimize _LOAD_SPECIAL to a type check and constant load.

The instruction LOAD_SPECIAL expands to the uop sequence _INSERT_NULL + _LOAD_SPECIAL, which can be optimized to _GUARD_TYPE_VERSION + _LOAD_CONST_INLINE + _SWAP 2.


Labels: interpreter-core, performance, type-feature
