@belugabehr
Contributor

@belugabehr belugabehr commented Jan 4, 2025

As part of my earlier work for AVRO-4074, I introduced a buffer to store strings during serialization. I chose a buffer size of 128 bytes somewhat arbitrarily, because it is a power of 2. However, upon further reflection, 63 is a better boundary. Per the Avro specification, a string is encoded as two fields:

a string is encoded as a long followed by that many bytes of UTF-8 encoded character data.

For the binary format of Avro:

int and long values are written using variable-length zig-zag coding.

63 bytes is the longest ASCII string whose length prefix fits in a single byte: zig-zag coding maps the length 63 to 126, which fits in the 7 payload bits of one variable-length byte, while 64 maps to 128 and needs a second byte. This makes 63 a more sensible upper limit for the string buffer. With a string size of 128, two bytes are required for the variable-length value.
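To make the boundary concrete, here is a minimal sketch of the length-prefix arithmetic. It is not code from this PR or from Avro itself: the class `ZigZagLength` and its methods are hypothetical names used only for illustration, and it assumes the standard zig-zag plus 7-bits-per-byte variable-length scheme described in the specification.

```java
// Illustrative sketch only: ZigZagLength, zigZag and encodedLongSize are
// hypothetical names, not part of the Avro API.
final class ZigZagLength {

    /** Zig-zag maps signed values to unsigned ones: 0, -1, 1, -2 become 0, 1, 2, 3. */
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    /** Bytes needed to write n as a zig-zag varint (7 payload bits per byte). */
    static int encodedLongSize(long n) {
        long z = zigZag(n);
        int bytes = 1;
        // Any bit set above the low 7 means another byte is required.
        while ((z & ~0x7FL) != 0) {
            z >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        for (long len : new long[] {0, 63, 64, 127, 128}) {
            System.out.println("length " + len + " -> " + encodedLongSize(len) + " byte(s)");
        }
    }
}
```

Because zig-zag doubles every non-negative value, a single byte can only carry lengths up to 63 rather than 127, which is why the cutoff sits where it does.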

@github-actions github-actions bot added the Java Pull Requests for Java binding label Jan 4, 2025
@belugabehr belugabehr force-pushed the belugabehr/string-buff-size branch from c527f41 to 2d1a3b9 Compare January 4, 2025 20:02
@belugabehr belugabehr changed the title [Java] Reduce buffer size for ASCII string optimization to 127 bytes [Java] Reduce buffer size for ASCII string optimization to 63 bytes Jan 4, 2025
@belugabehr
Contributor Author

@martin-g Please take a look :)

Contributor

@Fokko Fokko left a comment

Great one @belugabehr, thanks for the review @martin-g

@Fokko Fokko merged commit fb1850b into apache:main Feb 3, 2025
8 checks passed
opwvhk pushed a commit to opwvhk/avro that referenced this pull request Sep 5, 2025