(I'll include the problem here, as the "problems" repo seems "done" now that PEP-733 is up).
Traditional C APIs take zero-terminated strings, which means that Python strings with embedded NUL bytes appear truncated. There are many ways to get such a char*: converted directly using PyUnicode_AsUTF8, encoded and accessed via PyBytes_AsString, or accessed with something like PyUnicode_AsUTF8AndSize while ignoring the size.
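To make the truncation concrete, here is a minimal sketch that calls the two functions from Python through ctypes.pythonapi, so it runs without compiling an extension. The helper names as_c_string and as_pointer_and_size are made up for illustration; only the PyUnicode_* functions are real API.

```python
import ctypes

api = ctypes.pythonapi

# c_char_p makes ctypes read the result as a NUL-terminated C string,
# which is what a traditional C API would do with the pointer.
api.PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
api.PyUnicode_AsUTF8.restype = ctypes.c_char_p

# c_void_p keeps the raw pointer so we can read exactly `size` bytes.
api.PyUnicode_AsUTF8AndSize.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
api.PyUnicode_AsUTF8AndSize.restype = ctypes.c_void_p


def as_c_string(s):
    """Hypothetical helper: pointer only, read up to the first NUL."""
    return api.PyUnicode_AsUTF8(s)


def as_pointer_and_size(s):
    """Hypothetical helper: pointer plus size, read the whole buffer."""
    size = ctypes.c_ssize_t()
    ptr = api.PyUnicode_AsUTF8AndSize(s, ctypes.byref(size))
    return ctypes.string_at(ptr, size.value)


name = "utf8\0spam"
print(as_c_string(name))          # stops at the embedded NUL
print(as_pointer_and_size(name))  # keeps all 9 bytes
```

The pointer is the same in both cases; only the caller's decision to use or ignore the size changes what it sees.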
Many APIs that convert to char* raise an error on embedded NUL bytes. On that:
- This is safe, but it needs an extra O(n) search, which is not necessary for all tasks.
- It is too late to change widely used existing API (PyUnicode_AsUTF8) to do this. See the reverted change "[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters" (python/cpython#111089).
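The extra O(n) search mentioned in the list above can be sketched as follows, again through ctypes.pythonapi. This is only an illustration of the idea, not how CPython implements it: the helper name utf8_checked and the error message are assumptions.

```python
import ctypes

api = ctypes.pythonapi
api.PyUnicode_AsUTF8AndSize.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
api.PyUnicode_AsUTF8AndSize.restype = ctypes.c_void_p


def utf8_checked(s):
    """Hypothetical helper: return the UTF-8 bytes, rejecting embedded NULs."""
    size = ctypes.c_ssize_t()
    ptr = api.PyUnicode_AsUTF8AndSize(s, ctypes.byref(size))
    data = ctypes.string_at(ptr, size.value)
    # This membership test is the O(n) scan: every byte has to be examined
    # before the pointer can be handed to a NUL-terminated C API.
    if b"\0" in data:
        raise ValueError("embedded null character")
    return data
```

A caller that already passes the size along to its consumer gets nothing from this scan, which is why it is not wanted for every task.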
We could:
- Soft-deprecate PyUnicode_AsUTF8, nudging people toward PyUnicode_AsUTF8AndSize. (It's still possible to ignore the size, but then it has to be done on purpose, which is much less likely, unless we encourage people to mechanically replace PyUnicode_AsUTF8(s) with PyUnicode_AsUTF8AndSize(s, NULL).)
  - See "Macro to hide deprecated functions" (#24) for making soft-deprecation more relevant.
- In CPython, use the "pointer+size" representation more, and only use "pointer only" for working with external APIs or for backwards compatibility. This might help find APIs we might want to expose.
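The caveat about mechanical replacement can be checked with a small sketch: passing NULL for the size (None in ctypes) leaves the caller with exactly the same truncated view. The helper name mechanically_replaced is hypothetical.

```python
import ctypes

as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8_and_size.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
# Read the result the way a NUL-terminated C API would.
as_utf8_and_size.restype = ctypes.c_char_p


def mechanically_replaced(s):
    """Hypothetical stand-in for PyUnicode_AsUTF8AndSize(s, NULL)."""
    # None is converted to a NULL pointer, so no size is reported back.
    return as_utf8_and_size(s, None)


print(mechanically_replaced("utf8\0spam"))  # still stops at the NUL
```

So the nudge only helps when the size is actually consumed, not when the call is swapped in as a drop-in.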
Notes on some of the issues @vstinner collected in python/cpython#111656 (comment):
- In APIs that look up names and take aliases (codec names, hash algorithm names, timezone names, etc.), the embedded NUL is not a security issue. For example, I don't see a problem with UTF-8, utf8 and utf8\0spamspamspaaam all naming the same encoding. (The fact that some APIs will reject the latter string, and others will not, is unfortunate but not terrible.)
- In error/warning messages, we might want to filter out newlines, backspaces, terminal escape sequences and the like. If we're not doing that, there's not much additional harm in allowing an “end of message” control character. (FWIW, PyObject_Repr is very useful for arbitrary strings, though we shouldn't call it “safe” as it still passes Unicode lookalikes or BIDI characters through.)
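As a rough illustration of the last point, Python's built-in repr() (the Python-level counterpart of PyObject_Repr for a str) escapes NUL, newlines and the escape character, but leaves printable lookalike letters alone. The helper name quote_for_message is made up for this sketch.

```python
def quote_for_message(s):
    """Hypothetical helper: quote an arbitrary string for an error message."""
    return repr(s)


# Control characters are escaped, so they cannot end or restyle the message.
print(quote_for_message("bad\0name\n\x1b[31m"))

# Cyrillic "р" and "а" are printable, so this lookalike passes through as-is.
print(quote_for_message("\u0440\u0430ypal"))
```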