
Cranelift: should we do branch optimizations in MachBuffer? #6818


Description

@cfallin

@alexcrichton brought up in #6810:

And again, to clarify: I'm not advocating for removing branch optimizations from the MachBuffer here and now. I realize it's a deep refactoring that would take quite some work, and that this design is additionally the result of historical work and balancing efforts. I still believe, though, that this probably isn't the best place to do this, because it's extremely late in the pipeline and we're already unable to perform optimizations such as constant-branch folding, with knock-on effects from that. For example, I'd imagine that if constant-branch folding were implemented, then many of the optimizations that the MachBuffer does would probably fall out of that, in which case the MachBuffer may not need to be as clever as it is today.

This issue is meant to answer the question: should we do this refactor, and why or why not?

First, addressing some specific points:

extremely late in the pipeline and we're already unable to perform optimizations such as constant-branch folding, with knock-on effects from that.

It's unclear to me why this is the case -- could you clarify? Constant-branch folding can be done on CLIF in earlier stages; nothing in MachBuffer precludes replacing a brif with a jump when we know the condition is constant (and we hope to do that optimization in the future!).
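
To make this concrete, here is a minimal sketch of such a fold over simplified stand-in types; the `Terminator` enum and `fold_constant_branch` function are illustrative inventions, not the real CLIF data structures:

```rust
// Simplified stand-ins for CLIF block terminators, for illustration only.
type Block = u32;

enum Terminator {
    // Two-target conditional branch (N-target form: both destinations explicit).
    Brif { then_block: Block, else_block: Block },
    // Unconditional jump.
    Jump { target: Block },
}

/// If the branch condition is known to be constant, replace the `brif`
/// with a `jump` to the taken side. This can run in the mid-end, long
/// before MachBuffer ever sees the code.
fn fold_constant_branch(term: Terminator, known_cond: Option<bool>) -> Terminator {
    match (term, known_cond) {
        (Terminator::Brif { then_block, else_block }, Some(c)) => Terminator::Jump {
            target: if c { then_block } else { else_block },
        },
        (t, _) => t,
    }
}
```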

For example, I'd imagine that if constant-branch folding were implemented, then many of the optimizations that the MachBuffer does would probably fall out of that

Unfortunately, I think this is also not quite the case, for several reasons:

  • The MachBuffer's rearranging of branches requires knowledge of block order, because it takes advantage of fallthroughs. We don't know block order when we're doing mid-end optimizations, and the flexibility is important because...
  • ... we also don't know whether some blocks will become empty, as a result of optimizations or regalloc. For example, we need to split critical edges so regalloc has a place to put spills/reloads in all cases; but many of those split edges end up as empty blocks, and MachBuffer can then eat them (chomp the one jump in the body and redirect the label; see the sketch below).

So basically, pass ordering and regalloc needs mean that block order is only known late, which in turn means that block-order-dependent changes cannot happen in the mid-end.
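
To illustrate the second bullet above, here is a minimal sketch of the label-aliasing idea, assuming a simplified label table; the `Labels` type and its methods are hypothetical, and MachBuffer's real alias machinery is more involved:

```rust
use std::collections::HashMap;

// Hypothetical label table for illustration; not MachBuffer's actual API.
struct Labels {
    alias: HashMap<u32, u32>,
}

impl Labels {
    /// Resolve a label through any chain of aliases.
    /// (A real implementation would also guard against alias cycles.)
    fn resolve(&self, mut label: u32) -> u32 {
        while let Some(&target) = self.alias.get(&label) {
            label = target;
        }
        label
    }

    /// A block that turned out empty (e.g. a critical-edge split that
    /// received no spills/reloads) has a body that is just `jump target`.
    /// The buffer can chomp that jump and make the block's label an alias
    /// of the jump's target.
    fn chomp_empty_block(&mut self, empty_block: u32, jump_target: u32) {
        self.alias.insert(empty_block, jump_target);
    }
}
```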


I think both of the above possibly spring from a slight misinterpretation of the purpose the MachBuffer's branch-folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step that takes us from N-target branches (every block terminator always specifies all destination labels) to branch-or-fallthrough "machine-style" branches. The former is the standard CFG representation in most SSA compilers and is convenient to work with, all the way down into regalloc2. The latter is known as "EBBs" (extended basic blocks) and was formerly used in Cranelift. Removing the lowering step means propagating the output side's invariants and properties (EBBs) upward into all VCode.
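
As a sketch of what this lowering step does, assuming simplified types (`MachBranch` and `lower_two_target` are illustrative, not MachBuffer's actual code): once the final block order is known, a two-target terminator collapses to at most one conditional branch plus an optional jump:

```rust
type Label = u32;

// Machine-style branches: at most one explicit target; the other successor,
// if any, is reached by falling through to the next block in the layout.
enum MachBranch {
    CondBr { taken: Label }, // branch condition elided for brevity
    Jump { target: Label },
}

/// Lower a two-target terminator `brif cond, then_l, else_l` into
/// branch-or-fallthrough form, given which block is laid out next.
/// This is only possible once the final block order is known.
fn lower_two_target(then_l: Label, else_l: Label, next: Label) -> Vec<MachBranch> {
    if else_l == next {
        // Conditional branch to `then_l`; fall through to `else_l`.
        vec![MachBranch::CondBr { taken: then_l }]
    } else if then_l == next {
        // Invert the condition (elided) and fall through to `then_l`.
        vec![MachBranch::CondBr { taken: else_l }]
    } else {
        // Neither successor is next: need a conditional branch plus an
        // unconditional jump.
        vec![
            MachBranch::CondBr { taken: then_l },
            MachBranch::Jump { target: else_l },
        ]
    }
}
```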

EBBs were removed 4 years or so ago (I think? it was before my time), for good reason: they require subtle and not-well-understood tweaks to common algorithms, they're hard to reason about, and they infect all of the compiler backend. Basically, moving back to EBBs would require a careful audit and refactor of regalloc2 and the ~30k lines of code in Cranelift that touch VCode, and that's a nonstarter.

EBBs also require knowledge of, and lock in, the block order, which is a weird nonlocal invariant that precludes a bunch of other optimizations, as noted above.

So both branch-folding in machine-independent code and the lowering from N-target form to fallthrough form are useful transforms, but they serve different purposes. The former is an optimization and is optional; the latter is fundamental to the IR design, in that it lets us reason at a slightly higher semantic level and avoid huge complexities.


Finally, the original motivation for this discussion came from a perceived inability to make the branch simplifications work with a more complex forward-reference label-resolution algorithm. I don't think this is necessarily or fundamentally the case. Label resolution (island/veneer insertion) and the branch optimizations interact in subtle ways, but they are separate, tied together by one invariant: the list of most-recent branches contiguous to the end of the buffer (this is the visibility window of the peephole opts) must correspond to the latest fixups in the fixup list.

Veneer insertion happens only at island generation, and islands clear the "most recent branches" list (because no branches are contiguous to the end of the buffer anymore), so we can actually do whatever we want with more complex data structures and algorithms without violating the invariant. The only requirement is that between islands we keep a linear list of the most recent fixups (and we can clear it whenever we emit a non-branch instruction).
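
Concretely, the shape of that invariant might look like the following sketch; field and method names here are hypothetical simplifications, not MachBuffer's actual API:

```rust
// A pending forward-label reference that must be patched later.
struct Fixup {
    offset: usize, // where in the buffer the label reference lives
    label: u32,
}

// A branch instruction occupying a byte range at the end of the buffer.
struct Branch {
    start: usize,
    end: usize,
}

struct Buffer {
    data: Vec<u8>,
    fixups: Vec<Fixup>,           // pending fixups, in emission order
    latest_branches: Vec<Branch>, // branches contiguous to the end of `data`
}

impl Buffer {
    /// Emitting a non-branch instruction breaks contiguity with the end of
    /// the buffer, so the peephole window must be cleared.
    fn emit_non_branch(&mut self, bytes: &[u8]) {
        self.latest_branches.clear();
        self.data.extend_from_slice(bytes);
    }

    /// Emitting an island (veneers, constant pools) likewise leaves no
    /// branch contiguous to the end of the buffer, so the window clears.
    /// Between islands, any label-resolution strategy is fair game, as
    /// long as `latest_branches` always mirrors the tail of `fixups`.
    fn emit_island(&mut self) {
        self.latest_branches.clear();
        // ... emit veneers for fixups about to go out of range ...
    }
}
```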


So from my perspective, there is a proposal for a large, high-risk refactor of the compiler backend, motivated by two premises that I don't think hold (does constant-branch folding get better? No -- these are different, orthogonal things. Does MachBuffer preclude a more advanced veneer-fixup algorithm? No -- the invariant still allows it). It's entirely possible I've missed something or misunderstood some part of the proposal, though -- curious what others think!
