Skip to content

[SPARK-57241][SQL] Skip statically-dead null checks in StringLocate codegen for non-nullable children#56292

Open
LuciferYang wants to merge 1 commit into
apache:masterfrom
LuciferYang:stringlocate-codegen
Open

[SPARK-57241][SQL] Skip statically-dead null checks in StringLocate codegen for non-nullable children#56292
LuciferYang wants to merge 1 commit into
apache:masterfrom
LuciferYang:stringlocate-codegen

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This is a sub-task of SPARK-56908 (reduce generated Java size in whole-stage codegen).

StringLocate.doGenCode always emitted a null check for each of its three children (substr, str, start), plus a boolean isNull = false declaration, regardless of whether a child can actually be null. For a non-nullable child the codegen isNull is statically false, so these checks are dead code. In particular the 2-arg locate(substr, str) / position(substr IN str) form defaults start to a non-null literal, so the if (!start.isNull) guard was an always-true dead branch on every call.

This PR emits each null check only when the corresponding child is nullable, and when the whole expression is non-nullable (i.e. both substr and str are non-nullable) it drops the isNull declaration and returns isNull = FalseLiteral so parent expressions can skip their own null check too.

The behavior is unchanged. eval and the existing nesting are preserved: a null start returns 0 without evaluating substr/str (to conform with Hive), and the result is null iff a non-null start meets a null substr or str. When all three children are nullable the generated code is identical to before.

For example, locate(a, b) on non-nullable columns goes from:

int value = 0;
boolean isNull = false;
/* a.code */
if (!aIsNull) {
  /* b.code */
  if (!bIsNull) {
    /* start.code */    // start = Literal(1), always non-null
    if (!startIsNull) {
      if (start > 0) { value = ...exec(...) + 1; }
    } else { isNull = true; }
  } else { isNull = true; }
}

to:

int value = 0;
/* start.code */
/* a.code */
/* b.code */
if (start > 0) { value = ...exec(...) + 1; }

This follows the same pattern as the merged sibling sub-tasks (e.g. If, NaNvl, AtLeastNNonNulls, CreateNamedStruct).

Why are the changes needed?

To reduce the size of the generated Java in whole-stage codegen, as tracked by SPARK-56908. The skipped guards are statically dead, and locate/position are common functions, so eliminating the always-true start guard and the dead null checks shrinks generated code on the hot path with no behavior change.

Does this PR introduce any user-facing change?

No. This is a codegen-only change; eval, nullable, dataType, and results are unchanged, so SQL output and golden files are unaffected.

How was this patch tested?

Existing StringExpressionsSuite "LOCATE" coverage plus new checkEvaluation cases covering every combination of nullable/non-nullable substr/str/start (so each generated codegen template is exercised; checkEvaluation runs interpreted and codegen, with whole-stage codegen both on and off). Also ran org.apache.spark.sql.StringFunctionsSuite (locate/position).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…odegen for non-nullable children

StringLocate.doGenCode always emitted a null check for each of its three
children (substr, str, start) plus a `boolean isNull = false` declaration,
even when a child is non-nullable and its codegen isNull is statically false.
This produces dead Java: the 2-arg `locate(substr, str)` form defaults start
to a non-null literal, so its guard was an always-true dead branch.

Emit each null check only when the corresponding child is nullable, and when
the whole expression is non-nullable drop the isNull declaration and return
isNull = FalseLiteral so parents can skip their own null checks. Behavior is
unchanged: eval and the nesting (a null start returns 0 without evaluating
substr/str) are preserved.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant