Skip to content

Conversation

@steven-aerts
Copy link
Contributor

What is the purpose of the change

Like java use the full type name when encoding the types for unions.
This to prevent ambiguities when decoding this json encoded field back to avro using for example the java library.
Without this patch you cannot use the Java JsonDecoder to decode union with records having a namespace entry generated by the avro C library.

Verifying this change

This change added tests and can be verified by running the test_avro_4135

Documentation

  • Does this pull request introduce a new feature? no

@github-actions github-actions bot added Java Pull Requests for Java binding build C labels May 6, 2025
@github-actions github-actions bot removed Java Pull Requests for Java binding build labels May 6, 2025
@KalleOlaviNiemitalo
Copy link
Contributor

Can you show an example of the ambiguous JSON data and the Avro schema with which was encoded?

https://avro.apache.org/docs/1.12.0/specification/#json-encoding stipulates:

otherwise it is encoded as a JSON object with one name/value pair whose name is the type’s name and whose value is the recursively encoded value. For Avro’s named types (record, fixed or enum) the user-specified name is used, for other types the type name is used.

That specifically says "name" rather than "fullname". If the JSON encoders now should output the fullname of the type as the property name, then the specification should be changed.

This likewise says "name" rather than "fullname":

Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum. For example, unions containing two array types or two map types are not permitted, but two types with different names are permitted. (Names permit efficient resolution when reading and writing unions.)

Moreover, the schema resolution part of the specification says that schemas match if:

both schemas are records with the same (unqualified) name

Suppose the specification were changed to allow multiple identically named record schemas as members of the same union, as long as they are in different namespaces. Then, if such a union schema is the reader's schema, then both record schemas can match the writer's schema, in which case the first matching schema must be used in schema resolution — even if the other record schema has the same namespace as in the writer's schema. That seems an undesirable result to me.

@steven-aerts
Copy link
Contributor Author

steven-aerts commented May 6, 2025

@KalleOlaviNiemitalo the ambiguity you describe is contained in AVRO-2287.

What we are seeing is that the C library goes for the name interpretation and the java library goes for the fullname description.

To reproduce this with an example, I used the schema at the bottom of this message.

To encode:

  • avro_value_to_json will output {"field": {"r2": {}}}.
  • While the JsonEncoder outputs {"field": {"space.r2": {}}}.

On the decoding side:

C does not have a implementation for decoding the json encoding.

However when you try with java to to read {"field": {"r2": {}}} with the JsonDecoder for this schema you get:

Exception in thread "main" org.apache.avro.AvroTypeException: Unknown union branch r2
	at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:434)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:282)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:188)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)

Which shows the implementation differences for both languages. And java is clearly expecting a fully qualified path for unions.

This patch proposes to line up the behavior for C with that of Java.

Would you prefer the other way around? Are you willing to accept a patch on the Java side which allows reading the C encoding?

Example schema:

{
    "type": "record",
    "name": "r",
    "fields": [
        {
            "name": "field",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "r2",
                    "namespace": "space",
                    "fields": []
                }
            ]
        }
    ]
}

Java code use to generate exception:

var decoder = DecoderFactory.get().jsonDecoder(schema, "{\"field\": {\"r2\": {}}}");
var data = new GenericDatumReader(schema).read(null, decoder);

@steven-aerts
Copy link
Contributor Author

@KalleOlaviNiemitalo meanwhile I checked all the other language bindings on how they interpret the spec.

All the language bindings which have a JSON encoder seem to fully qualify their types. It is only the C binding which deviates here. So this patch lines up this behavior with the other language bindings.

json union encoding json union decoding
C unqualified N/A
C++ qualified qualified
csharp qualified [1] qualified
java qualified qualified
js qualified qualified and unqualified [2]
perl N/A N/A
PHP N/A N/A
python N/A N/A
ruby N/A N/A
rust N/A? N/A?

[1]: possible to configure a non standard encoding where the union value is not wrapped in an object. This variant is not decodable.
[2]: where unqualified is commented in code as being less strict. I think this is done, as it is part of the standard flow to encode any javascript object to (binary) avro, giving developers who encode unions a little bit less to type.

Again, if you think decoders should, like the js decoder, be more lenient I do not mind to workout a patch for the Java bindings.

steven-aerts added a commit to steven-aerts/avro that referenced this pull request May 7, 2025
Support unqualified type references for unions when decoding them from
json.
AVRO-2287 makes it unclear if type reference for a JSON encoded union
needs to be qualified or not.
Today all encoders use the fully qualified types except the C JSON
encoder which uses the unqualified type.

In this patch we make the java JSON Decoder more lenient and let it
fallback to unqualified types names when no qualified type name
matches.  Which matches the behavior currently implemented in the
Javascript Json decoder.

This patch is an alternative for apache#3373 where it is proposed
to update the C library instead.
@steven-aerts
Copy link
Contributor Author

@KalleOlaviNiemitalo in #3374 I propose to add support for non qualified type resolution in the java JsonDecoder.

check_exit(avro_generic_value_new(iface, &val));

#define TEST_AVRO_4135 (1)
#if TEST_AVRO_4135
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this done here ?
Do you need to disable the rest sometimes ?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this style from test_avro_766.c I can remove it from both if desired.

if (json_object_set_new(result, branch_name, branch_json)) {
avro_set_error("Cannot append branch to union");
json_decref(result);
json_decref(full_name_buf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do the same for branch_json too ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No jansson guarantees to take ownership of branch_json and will decref on errors.

json_t *result = json_object();
if (result == NULL) {
avro_set_error("Cannot allocate JSON union");
json_decref(full_name_buf);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full_name_buf could be NULL if there is no namespace. Maybe we need to check before calling json_decref() ?!
Same issues below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json_decref supports dereferencing NULL.
So I do not think this is a problem.

branch_schema =
avro_schema_union_branch(union_schema, disc);

namespace = avro_schema_namespace(branch_schema);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/steven-aerts/avro/blob/46d87d9aa8964684e602f114292825b9fd5c2554/lang/c/src/schema.c#L1546-L1555

It seems avro_schema_namespace does not support schema reference, so it will return NULL even if the referenced schema has a namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly clear to me what you mean here. In the use cases I tested it with it works.
Not clear which use case is missing.

Like java use the full type name when encoding the types for unions.
This to prevent ambiguities when decoding this json encoded field back
to avro using for example the java library.
Copy link
Contributor Author

@steven-aerts steven-aerts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough review.

json_t *result = json_object();
if (result == NULL) {
avro_set_error("Cannot allocate JSON union");
json_decref(full_name_buf);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json_decref supports dereferencing NULL.
So I do not think this is a problem.

if (json_object_set_new(result, branch_name, branch_json)) {
avro_set_error("Cannot append branch to union");
json_decref(result);
json_decref(full_name_buf);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No jansson guarantees to take ownership of branch_json and will decref on errors.

check_exit(avro_generic_value_new(iface, &val));

#define TEST_AVRO_4135 (1)
#if TEST_AVRO_4135
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this style from test_avro_766.c I can remove it from both if desired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants