-
Notifications
You must be signed in to change notification settings - Fork 27
[PECOBLR-1377] SRID in geospatial column type name #1157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
| return parser.parseJsonStringToDbStruct(object.toString(), arrowMetadata); | ||
| } | ||
|
|
||
| private static AbstractDatabricksGeospatial convertToGeospatial( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing duplicate code, using the same logic present in GeospatialConverter
| long rowSize = executionResult.getRowCount(); | ||
| List<String> arrowMetadata = null; | ||
| if (executionResult instanceof ArrowStreamResult) { | ||
| arrowMetadata = ((ArrowStreamResult) executionResult).getArrowMetadata(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
((ArrowStreamResult) executionResult).getArrowMetadata() will always be null for cloud fetch mode at this point in the code flow. So getting arrow metadata from TGetResultSetMetadataResp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will getResultSetMetadata always be present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we should have TGetResultSetMetadataResp object in TFetchResultsResp. Anyways we are handling this case in getArrowMetadata method to check for null values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is arrow metadata consistently populated in TGetResultSetMetadataResp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not when inline result. But that is limitation of complex data types as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not when inline result
Arrow metadata is populated from thrift always when format is CSV/JSON_ARRAY or ARROW.
The only time arrow metadata is not present is for when format is HIVE.
Is inline result in hive format ?
| * @return true if the type name starts with ARRAY, MAP, STRUCT, GEOMETRY, or GEOGRAPHY, false | ||
| * otherwise | ||
| */ | ||
| private static boolean isComplexType(String typeName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moving this method to DatabricksTypeUtil for using in other classes
| return complexDatatypeSupport ? obj : obj.toString(); | ||
| } | ||
|
|
||
| private Object handleComplexDataTypesForSEAInline(Object obj, String columnName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating this method to always return complex datatype, if complex mode is disabled, caller will convert to string
| typeText = "STRING"; | ||
| } | ||
|
|
||
| // store base type eg. DECIMAL instead of DECIMAL(7,2) except for geospatial datatypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For geospatial, we want SRID info (in paranthesis) to be present. eg: GEOMETRY(4326)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be a case that SRID info might be missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, even for SRID 0, we get GEOMETRY(0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
srid will always be present
| columnIndex++) { | ||
| TColumnDesc columnDesc = resultManifest.getSchema().getColumns().get(columnIndex); | ||
|
|
||
| ColumnInfo columnInfo = getColumnInfoFromTColumnDesc(columnDesc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TColumnDesc doesn't have complete information about the column metadata, hence enhancing it using the arrowMetadata extracted from the TGetResultSetMetadataResp. This is required for methods which use ColumnInfo later in the flow (thrift + cloudfetch + complex)
| .columnTypeClassName("java.lang.String") | ||
| .columnType(Types.OTHER) | ||
| .columnTypeText(VARIANT); | ||
| } else if (isGeometryColumn(arrowMetadata, columnIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is handled with updates to the column info, no separate handling is required
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
|
|
||
| if (arrowMetadata != null && isComplexType(arrowMetadata)) { | ||
| typeText = arrowMetadata; | ||
| if (arrowMetadata.startsWith(GEOMETRY)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to check if Geospatial flag is enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an inner layer, so flag check is not required here. we are checking for geospatial flags in DatabricksResultSetMetaData which is the interface through which users get metadata info
| .map(e -> e.get(ARROW_METADATA_KEY)) | ||
| .collect(Collectors.toList()); | ||
| } catch (IOException e) { | ||
| throw new DatabricksSQLException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add logging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added error log
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
vikrantpuppala
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! can we also run the jdbc comparator in different modes (sea/arrow) across different column types to ensure things are consistent
NEXT_CHANGELOG.md
Outdated
|
|
||
| ### Fixed | ||
| - Fixed complex types not being returned as objects in SEA Inline mode when `EnableComplexDatatypeSupport=true`. | ||
| - Fixed errors with complex data types in Thrift CloudFetch mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we be more descriptive on what these errors are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the change log with more details
| long rowSize = executionResult.getRowCount(); | ||
| List<String> arrowMetadata = null; | ||
| if (executionResult instanceof ArrowStreamResult) { | ||
| arrowMetadata = ((ArrowStreamResult) executionResult).getArrowMetadata(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is arrow metadata consistently populated in TGetResultSetMetadataResp?
| return parser.parseJsonStringToDbStruct(obj.toString(), columnName); | ||
| } else if (columnName.startsWith(GEOMETRY)) { | ||
| return obj; | ||
| return new GeospatialConverter().toDatabricksGeometry(obj); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is toDatabricksGeometry not a static method? why do we need to init an object to use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the code to use cached converter similar to other datatype's converters
| long rowSize = executionResult.getRowCount(); | ||
| List<String> arrowMetadata = null; | ||
| if (executionResult instanceof ArrowStreamResult) { | ||
| arrowMetadata = ((ArrowStreamResult) executionResult).getArrowMetadata(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not when inline result
Arrow metadata is populated from thrift always when format is CSV/JSON_ARRAY or ARROW.
The only time arrow metadata is not present is for when format is HIVE.
Is inline result in hive format ?
| return parser.parseJsonStringToDbStruct(obj.toString(), columnName); | ||
| } else if (columnName.startsWith(GEOMETRY)) { | ||
| return obj; | ||
| return new GeospatialConverter().toDatabricksGeometry(obj); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why didnt we handle this before ?
what is the impact of not having this in 3.0.6 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently complex data types are not supported in SEA inline flow, so geospatial type is returned as a string (till 3.0.7). This is consistent with our documentation. With this change we are supporting complex types for SEA inline mode as well. Will update the documentation accordingly.
| typeText = "STRING"; | ||
| } | ||
|
|
||
| // store base type eg. DECIMAL instead of DECIMAL(7,2) except for geospatial datatypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
srid will always be present
| * and EnableComplexDatatypeSupport are enabled AND not in Thrift+Inline mode. Otherwise, returns as | ||
| * STRING. | ||
| */ | ||
| public class GeospatialTests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets test geometry any as well:
example query:
%sql
SELECT * FROM VALUES
(ST_GeomFromText('POINT(17 7)', 4326)),
(ST_GeomFromText('POINT(5 5)', 0)) AS t(geom)
expected type name: GEOMETRY(ANY) expected SRIDs per row: 4326 and 0. lets assert on everything, that both row level values are correct and column metadata is correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the tests for GEOMETRY(ANY)
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
Description
This PR enhances geospatial datatype handling to include SRID (Spatial Reference System Identifier) information in column type names and fixes multiple issues related to complex datatype handling across different result formats.
Key Changes
Geospatial Type Name Enhancement
GEOMETRY(4326)instead ofGEOMETRYSEA Inline Mode Complex Type Fix
EnableComplexDatatypeSupport=trueThrift CloudFetch Metadata Enhancement
INTfromARRAY<INT>) in Thrift CloudFetch modegetColumnInfoFromTColumnDesc()to use Arrow schema metadata alongsideTColumnDescARRAY<INT>) whileTColumnDesconly contains base type (e.g.,ARRAY)Arrow Metadata Extraction
DatabricksThriftUtil.getArrowMetadata()to deserialize Arrow schema fromTGetResultSetMetadataRespDatabricksResultSetconstructor for Thrift CloudFetch modeTesting
Unit Tests
Integration Tests
GeospatialTests.java- Comprehensive E2E integration testAdditional Notes to the Reviewer
Other required details are mentioned in comments in the diff