Skip to content

HIVE-29484: HMS.getFields() fails with 'SchemaReader not supported' for Avro and Hbase tables#6350

Open
rtrivedi12 wants to merge 2 commits intoapache:masterfrom
rtrivedi12:HIVE-29484
Open

HIVE-29484: HMS.getFields() fails with 'SchemaReader not supported' for Avro and Hbase tables#6350
rtrivedi12 wants to merge 2 commits intoapache:masterfrom
rtrivedi12:HIVE-29484

Conversation

@rtrivedi12
Copy link
Contributor

What changes were proposed in this pull request?

org.apache.hadoop.hive.serde2.avro.AvroSerDe and org.apache.hadoop.hive.hbase.HBaseSerDe are added to the default value of MetastoreConf.ConfVars.SERDES_USING_METASTORE_FOR_SCHEMA.

Why are the changes needed?

HiveMetaStore.get_fields_with_environment_context() checks whether a table's SerDe is listed in metastore.serdes.using.metastore.for.schema. If it is, columns are returned directly from tbl.getSd().getCols(). If not, HMS delegates to StorageSchemaReader, whose default implementation (DefaultStorageSchemaReader) unconditionally throws:

MetaException: java.lang.UnsupportedOperationException: Storage schema reading not supported

Does this PR introduce any user-facing change?

Yes. Previously, calling HMS.getFields() on a table using AvroSerDe or HBaseSerDe would fail with MetaException: Storage schema reading not supported. After this change, the call succeeds and returns the columns stored in the metastore, consistent with the behavior for all other built-in SerDes

How was this patch tested?

A new test testGetFieldsForStorageSerDes() is added to TestHiveMetaStore

mvn test -Dtest.groups= \
  -Dtest="TestEmbeddedHiveMetaStore#testGetFieldsForStorageSerDes+TestRemoteHiveMetaStore#testGetFieldsForStorageSerDes" \
  -pl standalone-metastore/metastore-server

Copy link
Contributor

@soumyakanti3578 soumyakanti3578 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please divide the test for each SerDe, otherwise the PR looks good!

* "Storage schema reading not supported".
*/
@Test
public void testGetFieldsForStorageSerDes() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably divide this into two tests for the two SerDes as they are completely different and not related to each other. Also, the comment above should be edited accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @soumyakanti3578 !
I have separated the tests for AvroSerDe and HBaseSerDe.

For the AvroSerDe fix, I've modified and kept it out of SERDES_USING_METASTORE_FOR_SCHEMA and handled it at the HMSHandler level instead. Adding AvroSerDe to that config list has two correctness problems:

Breaks schema evolution in HiveServer2 -Table.getColsInternal() uses hasMetastoreBasedSchema() which checks SERDES_USING_METASTORE_FOR_SCHEMA. If AvroSerDe is in the list, it short-circuits directly to SD columns for all Avro tables . This would cause HS2 to silently serve stale schema.

Permanently blocks ALTER TABLE <avro_tbl> UPDATE COLUMNS -AlterTableUpdateColumnsOperation throws UnsupportedOperationException for any SerDe in SERDES_USING_METASTORE_FOR_SCHEMA. This would prevent users from ever syncing an Avro table's SD columns from the live SerDe schema.

@sonarqubecloud
Copy link

Comment on lines +1996 to 1998
"org.apache.hadoop.hive.serde2.OpenCSVSerde," +
"org.apache.hadoop.hive.hbase.HBaseSerDe",
"SerDes retrieving schema from metastore. This is an internal parameter."),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config is deprecated, We don't need this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @saihemanth-cloudera ! One of the test was failing which is using HiveConf, hence I updated this config.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it is better to change the test to use metastoreconf instead of hive conf

"org.apache.hadoop.hive.serde2.OpenCSVSerde," +
"org.apache.iceberg.mr.hive.HiveIcebergSerDe",
"org.apache.iceberg.mr.hive.HiveIcebergSerDe," +
"org.apache.hadoop.hive.hbase.HBaseSerDe",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some ideas:
This white list expands as we move on, how about:

  • For native tables, we just get the column from Metastore, regardless of what the serde is;
  • For non-native tables, we try to get the column from serde first, if it fails, then get the column from Metastore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dengzhhu653! To clarify, do you think it is acceptable to return the 'last known' schema as a fallback for non-native tables (like Avro) from HMS ? If we agree on this 'best-effort' approach implemented in this PR, we can handle the task of removing the whitelist logic in a follow-up PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opt for the 'best-effort', so the Metastore cab get ride of serde lib.
For those tables which determine the columns from serde, we can move StorageSchemaReader#readSchema to the client side

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants