Soft-deprecate PyUnicode_AsUTF8 #39

@encukou

(I'll include the problem here, as the "problems" repo seems "done" now that PEP-733 is up).

Traditional C APIs take zero-terminated strings, which means that Python strings with embedded NUL bytes appear truncated. There are many ways to get such a char*: converted directly using PyUnicode_AsUTF8, encoded and accessed via PyBytes_AsString, or accessed with something like PyUnicode_AsUTF8AndSize while ignoring the size.
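
To make the truncation concrete, here is a minimal sketch in plain C, so it builds without the Python headers. Both helper names are hypothetical. In an extension, `buf` and `size` would be what PyUnicode_AsUTF8AndSize hands back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the length a "pointer only" consumer sees,
   i.e. everything up to the first NUL byte. */
static size_t length_seen_by_c_api(const char *buf)
{
    assert(buf != NULL);
    return strlen(buf);
}

/* Hypothetical helper: nonzero if the first `size` bytes contain a NUL,
   meaning a consumer that ignores `size` would see a truncated string. */
static int has_embedded_nul(const char *buf, size_t size)
{
    assert(buf != NULL);
    return memchr(buf, '\0', size) != NULL;
}
```

For a 19-byte buffer holding `utf8\0spamspamspaaam`, the first helper reports 4 and the second reports that a NUL is present.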

Many APIs that convert to char* raise an error on embedded NUL bytes. On that:


We could:

  • Soft-deprecate PyUnicode_AsUTF8, nudging people toward PyUnicode_AsUTF8AndSize. (It's still possible to ignore the size, but that is much less likely to happen by accident -- unless we encourage people to mechanically replace PyUnicode_AsUTF8(s) with PyUnicode_AsUTF8AndSize(s, NULL)).
  • In CPython, use the "pointer+size" representation more --- only use "pointer only" for working with external APIs or for backwards compatibility. This might help find APIs we might want to expose.
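
A rough sketch of what "pointer+size inside, pointer only at the boundary" could look like. The `str_view` type and both functions are hypothetical and are not CPython APIs. The sketch assumes the buffer has a trailing NUL after `size` bytes, which is what PyUnicode_AsUTF8AndSize documents for its result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "pointer+size" string view; not a CPython type. */
typedef struct {
    const char *data;
    size_t size;
} str_view;

/* Compare a view against a NUL-terminated literal using the full size,
   so bytes after an embedded NUL are not silently ignored. */
static int view_equals(str_view v, const char *literal)
{
    assert(v.data != NULL && literal != NULL);
    size_t n = strlen(literal);
    return v.size == n && memcmp(v.data, literal, n) == 0;
}

/* Convert to "pointer only" at the boundary with an external API.
   Returns NULL if the view has an embedded NUL, so the caller can
   report an error instead of passing a truncated string along.
   Assumes v.data[v.size] is a NUL terminator. */
static const char *view_as_c_string(str_view v)
{
    assert(v.data != NULL);
    if (memchr(v.data, '\0', v.size) != NULL) {
        return NULL;
    }
    return v.data;
}
```

With this shape, the decision to drop the size is made in one visible place rather than wherever a char* happens to be passed around.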

Notes on some of the issues @vstinner collected in python/cpython#111656 (comment):

  • In APIs that look up names and take aliases (codec names, hash algorithm names, timezone names, etc.), the embedded NUL is not a security issue. For example, I don't see a problem with UTF-8, utf8 and utf8\0spamspamspaaam all naming the same encoding. (The fact that some APIs will reject the latter string, and others will not, is unfortunate but not terrible.)
  • In error/warning messages, we might want to filter out newlines, backspaces, terminal escape sequences and the like. If we're not doing that, there's not much additional harm in allowing an “end of message” control character. (FWIW, PyObject_Repr is very useful for arbitrary strings, though we shouldn't call it “safe” as it still passes Unicode lookalikes or BIDI characters through.)
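
As an illustration of the kind of filtering mentioned for messages, here is a byte-level sketch that escapes ASCII control bytes, with NUL treated like any other control character. The function name and the `\xNN` output format are assumptions for this example. It does nothing about Unicode lookalikes or BIDI characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy `size` bytes of `src` into `dst`, replacing
   ASCII control bytes (NUL, newline, backspace, ESC, DEL, ...) with \xNN
   escapes. Always NUL-terminates `dst` and returns the number of bytes
   written, excluding the terminator. Stops early if `dst` is too small. */
static size_t escape_control_bytes(const char *src, size_t size,
                                   char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL && dst_size > 0);
    size_t out = 0;
    for (size_t i = 0; i < size; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c < 0x20 || c == 0x7f) {
            if (dst_size - out < 5) {   /* 4 escape bytes + terminator */
                break;
            }
            snprintf(dst + out, 5, "\\x%02x", (unsigned)c);
            out += 4;
        }
        else {
            if (dst_size - out < 2) {   /* 1 byte + terminator */
                break;
            }
            dst[out++] = (char)c;
        }
    }
    dst[out] = '\0';
    return out;
}
```

Because the input is passed as pointer+size, an embedded NUL shows up in the output as `\x00` instead of silently ending the message.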
