Skip to content

Conversation

@DrewMcArthur
Copy link

@DrewMcArthur DrewMcArthur commented Aug 2, 2024

Does your PR solve an issue?

fixes #3387
fixes #3390

todo

Copy link
Author

@DrewMcArthur DrewMcArthur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abonander @alu I'd love your feedback on this, whenever you get a chance 😄

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these changes are all a result of checking against SELECT ID, COLLATION_NAME FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID; locally - let me know if any of them should be thrown out.

  • There was an additional ~100 collations on my machine, which I added and then commented out, rather than implementing all the other pieces necessary. let me know if you'd like me to finish that out, I'd be happy to, if that's something we want. otherwise, we can remove the comments of those extra collations.
  • uft8_* was displayed as utf8mb3_* on my machine. let me know if those changes should be reverted
  • also added impl TryFrom<u16> for Collation for use here

Comment on lines 65 to +69
// https://dev.mysql.com/doc/internals/en/com-query-response.html#column-type
// dead link ^
// https://dev.mysql.com/doc/dev/mysql-server/9.0.0/page_protocol_com_query_response_text_resultset_column_definition.html
// the "type of the column as defined in enum_field_types"
// https://dev.mysql.com/doc/dev/mysql-server/9.0.0/field__types_8h.html
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feel free to suggest which of these links we should keep - the original link here was dead when i tried it.

@DrewMcArthur DrewMcArthur changed the title [3387] use collation to determine if string type is compatible fix: (3387) use collation to determine if string type is compatible Aug 2, 2024
@DrewMcArthur DrewMcArthur changed the title fix: (3387) use collation to determine if string type is compatible fix: (#3387) use collation to determine if string type is compatible Aug 2, 2024
MySqlTypeInfo {
r#type: ColumnType::VarString, // VARCHAR
flags: ColumnFlags::empty(),
collation: Collation::utf8mb3_bin,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DrewMcArthur

I think using utf8mb3_bin regardless of session state is misleading.
In my opinion, the only difference needed in this context is whether it is binary or string.
How about using an enum like this one?

CollationType {
    Binary,
    String
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was one of the pieces I wasn't sure about, what the default should be for some of these. I think you're right, _bin probably isn't right & I'll update this to utf8mb3_general_ci if you think that'd be correct.

As for changing the type of MySqlTypeInfo's collation field to an enum, I think the source of the issue on #3387 is that there are three cases being compressed into the ColumnFlag::binary flag:

  • a VARCHAR has a collation like utf8_general_ci, and the ColumnFlag::binary is false, so this is compatible with a Rust String
  • a BLOB has a collation of binary, ColumnFlag::binary is true, so it's not compatible with a Rust String, which is also correct
  • a VARCHAR has a collation of *_bin, so ColumnFlag::binary is true, so it's marked incompatible when we should be able to decode this into a String. This is the case that needs to be fixed.

This could be handled with that enum, if the value is String for both VARCHAR cases, but I feel like that could also be confusing to have a CollationType::String (as opposed to binary) while simultaneously having ColumnFlag::binary = true.

I think my preference is to keep track of each unique collation here, since it's only a byte, but I'll defer to you and @abonander to decide what makes more sense.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
collation: Collation::utf8mb3_bin,
collation: Collation::utf8mb3_general_ci,

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ @alu (sorry, I'm terrible at remembering to directly @ people in replies here)

@eirnym
Copy link

eirnym commented Oct 9, 2024

@alu This fixes only for MySQL, not for SQLite one. I understand that original question is for MySQL, but SQLite and presumably PostgreSQL are also affected.

Please, tell me if i need to create a separate issue for SQLite.

@alvico

This comment was marked as spam.

@DrewMcArthur
Copy link
Author

@eirnym oh good catch - my bandwidth is pretty low right now, so if you wanted to open a PR into my branch with those changes, or even just the tests for them that'd be awesome!

@alvico I'm not sure their release schedule, but you can target this branch in your Cargo.toml if you'd like to have the fix locally.

@alvico

This comment was marked as spam.

@daveajones

This comment was marked as spam.

@abonander
Copy link
Collaborator

Sorry it took so long to get to this. I really wanted to get #3383 done for 0.9.0 so this had to wait.

Conflating the method of sorting stored data and the character set being used over the wire is one of several design decisions that has made me come to really loathe MySQL. They could have sent the character set ID instead, for not much extra overhead (it's just one more lookup, probably entirely in-memory), and saved the rest of us a ton of headache.

On the surface, I don't see much point in cataloguing collations. We can't treat them as a closed set because new ones are being added all the time, and they're not always UTF-8 either, so we can't assume anything about a collation we don't recognize. The Collation enum is mostly vestigial.

I was thinking there's only one collation we should care about, and that's binary. Anything else we can treat as tentatively compatible with String. Cause ultimately, even technically incompatible character sets still have some amount of overlap (the lower half of latin1 is still ASCII), so if the actual data is compatible with UTF-8, who are we to tell the user they can't try to decode it?

Ultimately, it's a pretty clear logic error if the user tries to encode or decode a string using a completely incompatible encoding.

I think we could catch this if we really wanted to, but I'm wondering about the ratio of value to effort there.

I guess one thing we should worry about is when a given sequence of bytes is valid for two different encodings, but doesn't actually represent the same character. That's tantamount to silent memory corruption, which would be pretty bad.

We could probably do a similar thing to what we do in Postgres, and query the metadata for a given collation when we see it. Then we can actually check if the character set is compatible, since that's a significantly smaller set of possibilities, and doesn't change much.

We can't do that in every situation, however; during the simple query flow (not using a prepared statement), the column data immediately precedes the data we need to decode, so there's no opportunity to stop and issue another query first.

So ultimately, we probably need to catalogue collations anyway so we don't always have to query them.

But Collation still probably shouldn't be a simple enum over u8 since it's not a closed set. It should likely be closer to the internal representation of PgTypeInfo.

Given that this is a major departure from the current PR and that your last message mentioned you didn't have the bandwidth to finish this, I'm going to close. However, I'm going to follow up with my own PR cause this really needs to be fixed.

By the way, this is not an issue for Postgres or SQLite. SQLite defines strings to always be UTF-8, and Postgres transcodes to/from the character set requested by the client. Any issue in other databases which seems to be related is almost certainly not. Please open a search for an existing issue specific to that database or open a new one.

@abonander abonander closed this Jul 6, 2025
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 6, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 7, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 7, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 7, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 7, 2025
cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
abonander added a commit that referenced this pull request Jul 8, 2025
…r` (#3924)

cc #3400 (comment)

comment in `sqlx-mysql/src/collation.rs` for explanation

fixes #3200
fixes #3387
fixes #3390
fixes #3409
kriswuollett pushed a commit to kriswuollett/sqlx that referenced this pull request Jul 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sqlx think medium text as binary? MySql type VARCHAR with *_bin collation incorrectly interpreted as VARBINARY Column

6 participants