Skip to content

Conversation

@steven-aerts
Copy link
Contributor

Update the compiler to generate the implementation of the .equals() and `.hashCode() function, instead of relying on the
implementation of GenericData. This improves the performance of those functions significantly.

The generated implementations are factor 10 to 20 faster for .equals() and a factor 5 to 10 for .hashCode().

The implementation generates the same hashCode as the genericData, which is validated by existing tests

Result of Perf test before the change:

Benchmark              Mode  Cnt          Score             Error  Units
SpecficTest.equals    thrpt    3   12598610.194 +/-  11160265.279  ops/s
SpecficTest.hashCode  thrpt    3   24729446.862 +/-  29051332.794  ops/s

Results using generated functions:

Benchmark              Mode  Cnt          Score             Error  Units
SpecficTest.equals    thrpt    3  211314296.950 +/- 104154793.126  ops/s
SpecficTest.hashCode  thrpt    3  180349506.632 +/- 143639246.771  ops/s

Jira

  • My PR addresses the following: AVRO-3527 Generated equals() and hashCode() for SpecificRecords

Tests

  • My PR adds the following unit tests:
  • TestUtf8#testHashCodeSameAsString()
  • TestGeneratedCode#ignoredFields()
  • JMH test for SpecificRecords equals() and hashCode()

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@github-actions github-actions bot added build Java Pull Requests for Java binding labels Jun 3, 2022
Copy link
Contributor

@opwvhk opwvhk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I like the care that was given to CharSequence, and that it also works for Strings.

I would like to know though if these changes work equally well for CharSequence and String for string fields.

Copy link
Contributor Author

@steven-aerts steven-aerts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@opwvhk as you were curious, I locally updated the perf test to also benchmark strings as CharSequences.

Those results also look good, especially for .hashCode() we are on the higher end of what we saw up till now (factor 12).

Before:

Benchmark              Mode  Cnt         Score            Error  Units
SpecficTest.equals    thrpt    3   14115336.148 ±  15794798.446  ops/s
SpecficTest.hashCode  thrpt    3   10566693.395 ±  11566539.470  ops/s

After:

Benchmark              Mode  Cnt          Score           Error  Units
SpecficTest.equals    thrpt    3  191368870.430 ± 123807130.335  ops/s
SpecficTest.hashCode  thrpt    3  136284950.236 ± 113094363.351  ops/s

@RyanSkraba
Copy link
Contributor

Thanks for the PR, this looks really interesting! I set the target for 1.12.0 because it's a change in how classes are generated... but I'm curious whether that should really be the case. I really don't think there's any reason this could be a major change! What do you think?

@steven-aerts
Copy link
Contributor Author

@RyanSkraba there is one small corner case, which let me doubt.

When using CharSequence as stringType, the equals method uses a newly introduced Utf8#compareSequences. As we could not use java.lang.CharSequence#compare which is only available since java 11.

If we choose to introduce this change as a minor version. It would mean that for the Charsequence stringType, the code generated by an 1.11.1 compiler is incompatible with a 1.11.0 avro library. Everything else should work in that constellation.

Do you see that as a reason to introduce it in a major version only?

@RyanSkraba
Copy link
Contributor

Do you see that as a reason to introduce it in a major version only?

Thanks! I think it's a good idea to bump it to the next major version, but we usually do one later in the year.

I used to think that Avro required that the code-generator version must be in sync with the avro library version, because that's the way my company always did it, but I've learned since that it's not always the case.

@opwvhk
Copy link
Contributor

opwvhk commented Jun 9, 2022

I used to think that Avro required that the code-generator version must be in sync with the avro library version, because that's the way my company always did it, but I've learned since that it's not always the case.

This is the case for ANTLR, and is a common practice because some generated code is so tightly coupled to the library, that using the same version is a requirement to ensure compatibility. This is not the case for Avro.

@steven-aerts
Copy link
Contributor Author

@RyanSkraba is there still something I can do to help to get this change submitted?

@RyanSkraba
Copy link
Contributor

Hey, thanks so much for your patience! The RC1 for 1.11.1 is out for testing, so hopefully we should be getting around to PRs targetted for 1.12.0 in August.

Update the compiler to generate the implementation of the `.equals()`
and `.hashCode() function, instead of relying on the
implementation of GenericData.  This improves the performance of
those functions significantly.

The generated implementations are factor 10 to 20 faster for
`.equals()` and a factor 5 to 10 for `.hashCode()`.

Result of Perf test before the change:

```
Benchmark              Mode  Cnt          Score             Error  Units
SpecficTest.equals    thrpt    3   12598610.194 +/-  11160265.279  ops/s
SpecficTest.hashCode  thrpt    3   24729446.862 +/-  29051332.794  ops/s
```

Results using generated functions:

```
Benchmark              Mode  Cnt          Score             Error  Units
SpecficTest.equals    thrpt    3  211314296.950 +/- 104154793.126  ops/s
SpecficTest.hashCode  thrpt    3  180349506.632 +/- 143639246.771  ops/s
```

Signed-off-by: Steven Aerts <steven.aerts@gmail.com>
@steven-aerts
Copy link
Contributor Author

Rebased the patch on master, as it still causes a huge improvement gain when comparing SpecificRecord's

@opwvhk opwvhk merged commit c880f47 into apache:main Jun 10, 2025
8 checks passed
opwvhk pushed a commit that referenced this pull request Jun 13, 2025
Update the compiler to generate the implementation of the `.equals()`
and `.hashCode() function, instead of relying on the
implementation of GenericData.  This improves the performance of
those functions significantly.

The generated implementations are factor 10 to 20 faster for
`.equals()` and a factor 5 to 10 for `.hashCode()`.

Result of Perf test before the change:

```
Benchmark              Mode  Cnt          Score             Error  Units
SpecficTest.equals    thrpt    3   12598610.194 +/-  11160265.279  ops/s
SpecficTest.hashCode  thrpt    3   24729446.862 +/-  29051332.794  ops/s
```

Results using generated functions:

```
Benchmark              Mode  Cnt          Score             Error  Units
SpecficTest.equals    thrpt    3  211314296.950 +/- 104154793.126  ops/s
SpecficTest.hashCode  thrpt    3  180349506.632 +/- 143639246.771  ops/s
```

Signed-off-by: Steven Aerts <steven.aerts@gmail.com>
(cherry picked from commit c880f47)
opwvhk pushed a commit to opwvhk/avro that referenced this pull request Sep 5, 2025
Update the compiler to generate the implementation of the `.equals()`
and `.hashCode() function, instead of relying on the
implementation of GenericData.  This improves the performance of
those functions significantly.

The generated implementations are factor 10 to 20 faster for
`.equals()` and a factor 5 to 10 for `.hashCode()`.

Result of Perf test before the change:

```
Benchmark              Mode  Cnt          Score             Error  Units
SpecficTest.equals    thrpt    3   12598610.194 +/-  11160265.279  ops/s
SpecficTest.hashCode  thrpt    3   24729446.862 +/-  29051332.794  ops/s
```

Results using generated functions:

```
Benchmark              Mode  Cnt          Score             Error  Units
SpecficTest.equals    thrpt    3  211314296.950 +/- 104154793.126  ops/s
SpecficTest.hashCode  thrpt    3  180349506.632 +/- 143639246.771  ops/s
```

Signed-off-by: Steven Aerts <steven.aerts@gmail.com>
@benoitcanton
Copy link

Hi @steven-aerts @RyanSkraba

This PR introduced a bug in the Avro tools 1.12.1 when a schema includes a field with name result.

In the following example, the schema contains two fields:

  • A field result of type ["null", "string"]
  • A field anotherField of type` enum

The hashCode method generated based on the record.vm template will be as follow:

  @Override
  public int hashCode() {
    int result = 1;
    result = 31 * result + (eventHeader == null ? 0 : eventHeader.hashCode());
    result = 31 * result + (result == null ? 0 : result.hashCode());
    result = 31 * result + (anotherField == null ? 0 : ((java.lang.Enum) anotherField).ordinal());
    return result;
  }

Now since the schema has a field result of type ["null", "string"] and the generated hashCode method declares a result of type int. The result.hashCode() will fail since the local result takes precedence.

Trying to compile a project using the generated java class throws:

Compiling with JDK Java compiler API.
..../MyEvent.java:1744: error: bad operand types for binary operator '=='
    result = 31 * result + (result == null ? 0 : result.hashCode());
                                   ^
  first type:  int
  second type: <null>
..../MyEvent.java:1744: error: int cannot be dereferenced
    result = 31 * result + (result == null ? 0 : result.hashCode());
                                                       ^
2 errors
> Task :compileJava FAILED

For the sake of validating my findings, I have updated my schema and renamed my field result to res and sure enough that "fixed" the issue. Obviously, this is not a solution.

One workaround would be to use custom Velocity templates, to rename int result = 1 to int result2 = 1 (just an example) but that is unpractical to maintain.

@steven-aerts @RyanSkraba It would be great if we could rename the result in the record.vm template to be something less common.

If this Github thread is not the right place to report this bug, please let me know where to report it. Thanks!

@steven-aerts
Copy link
Contributor Author

steven-aerts commented Nov 13, 2025

@benoitcanton can you make a bug at https://issues.apache.org/?

I already have a patch ready which fixes your issue by generating:

result = 31 * result + (this.result == null ? 0 : this.result.hashCode());

@benoitcanton
Copy link

Hi @steven-aerts thanks for the quick answer. I created https://issues.apache.org/jira/browse/AVRO-4202

@steven-aerts
Copy link
Contributor Author

@benoitcanton #3547

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Java Pull Requests for Java binding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants