diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index cb3558de2ab..0c90dfc9aa7 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -660,6 +660,7 @@ peps/pep-0777.rst @warsaw
 peps/pep-0779.rst @Yhg1s @colesbury @mpage
 peps/pep-0780.rst @lysnikolaou
 peps/pep-0781.rst @methane
+peps/pep-0782.rst @vstinner
 # ...
 peps/pep-0789.rst @njsmith
 # ...
diff --git a/peps/pep-0782.rst b/peps/pep-0782.rst
new file mode 100644
index 00000000000..b785882a1ed
--- /dev/null
+++ b/peps/pep-0782.rst
@@ -0,0 +1,372 @@
+PEP: 782
+Title: Add PyBytesWriter C API
+Author: Victor Stinner
+Status: Draft
+Type: Standards Track
+Created: 27-Mar-2025
+Python-Version: 3.14
+Post-History:
+  `18-Feb-2025 `__
+
+
+.. highlight:: c
+
+
+Abstract
+========
+
+Add a new ``PyBytesWriter`` C API to create ``bytes`` objects.
+
+Soft deprecate ``PyBytes_FromStringAndSize(NULL, size)`` and
+``_PyBytes_Resize()`` APIs. These APIs treat an immutable ``bytes``
+object as a mutable object. They remain available and maintained and do
+not emit a deprecation warning, but are no longer recommended for new
+code.
+
+
+Rationale
+=========
+
+Disallow creation of incomplete/inconsistent objects
+----------------------------------------------------
+
+Creating a Python :class:`bytes` object using
+``PyBytes_FromStringAndSize(NULL, size)`` and ``_PyBytes_Resize()``
+treats an immutable :class:`bytes` object as mutable. It goes against
+the principle that :class:`bytes` objects are immutable. It also creates
+an incomplete or "invalid" object since its bytes are not initialized. In
+Python, a :class:`bytes` object should always have its bytes fully
+initialized.
+
+* `Avoid creating incomplete/invalid objects api-evolution#36
+  `_
+* `Disallow mutating immutable objects api-evolution#20
+  `_
+* `Disallow creation of incomplete/inconsistent objects problems#56
+  `_
+
+Inefficient allocation strategy
+-------------------------------
+
+When creating a bytes string whose output size is unknown, one
+strategy is to allocate a short buffer and extend it (to the exact size)
+each time a larger write is needed.
+
+This strategy is inefficient because it requires enlarging the buffer
+multiple times. It's more efficient to overallocate the buffer the
+first time that a larger write is needed. It reduces the number of
+expensive ``realloc()`` operations, each of which can imply a memory copy.
+
+
+Specification
+=============
+
+API
+---
+
+.. c:type:: PyBytesWriter
+
+   A Python :class:`bytes` writer instance created by
+   :c:func:`PyBytesWriter_Create`.
+
+   The instance must be destroyed by :c:func:`PyBytesWriter_Finish` or
+   :c:func:`PyBytesWriter_Discard`.
+
+Create, Finish, Discard
+^^^^^^^^^^^^^^^^^^^^^^^
+
+.. c:function:: PyBytesWriter* PyBytesWriter_Create(Py_ssize_t size)
+
+   Create a :c:type:`PyBytesWriter` to write *size* bytes.
+
+   If *size* is greater than zero, allocate *size* bytes for the
+   returned buffer.
+
+   On error, set an exception and return ``NULL``.
+
+   *size* must be positive or zero.
+
+.. c:function:: PyObject* PyBytesWriter_Finish(PyBytesWriter *writer)
+
+   Finish a :c:type:`PyBytesWriter` created by
+   :c:func:`PyBytesWriter_Create`.
+
+   On success, return a Python :class:`bytes` object.
+   On error, set an exception and return ``NULL``.
+
+   The writer instance is invalid after the call in any case.
+
+.. c:function:: PyObject* PyBytesWriter_FinishWithSize(PyBytesWriter *writer, Py_ssize_t size)
+
+   Similar to :c:func:`PyBytesWriter_Finish`, but resize the writer
+   to *size* bytes before creating the :class:`bytes` object.
+
+.. c:function:: PyObject* PyBytesWriter_FinishWithPointer(PyBytesWriter *writer, void *buf)
+
+   Similar to :c:func:`PyBytesWriter_Finish`, but resize the writer
+   using the *buf* pointer before creating the :class:`bytes` object.
+
+   Pseudo-code::
+
+       Py_ssize_t size = (char*)buf - (char*)PyBytesWriter_GetData(writer);
+       return PyBytesWriter_FinishWithSize(writer, size);
+
+   Set an exception and return ``NULL`` if the *buf* pointer is outside the
+   internal buffer bounds.
+
+.. c:function:: void PyBytesWriter_Discard(PyBytesWriter *writer)
+
+   Discard a :c:type:`PyBytesWriter` created by :c:func:`PyBytesWriter_Create`.
+
+   Do nothing if *writer* is ``NULL``.
+
+   The writer instance is invalid after the call.
+
+High-level API
+^^^^^^^^^^^^^^
+
+.. c:function:: int PyBytesWriter_WriteBytes(PyBytesWriter *writer, const void *bytes, Py_ssize_t size)
+
+   Write *size* bytes of *bytes* into the *writer*.
+
+   If *size* is equal to ``-1``, call ``strlen(bytes)`` to get the
+   string length.
+
+   On success, return ``0``.
+   On error, set an exception and return ``-1``.
+
+.. c:function:: int PyBytesWriter_Format(PyBytesWriter *writer, const char *format, ...)
+
+   Similar to ``PyBytes_FromFormat()``, but write the output directly
+   into the writer.
+
+   On success, return ``0``.
+   On error, set an exception and return ``-1``.
+
+Getters
+^^^^^^^
+
+.. c:function:: Py_ssize_t PyBytesWriter_GetSize(PyBytesWriter *writer)
+
+   Get the writer size.
+
+.. c:function:: void* PyBytesWriter_GetData(PyBytesWriter *writer)
+
+   Get the writer data.
+
+   The pointer is valid until :c:func:`PyBytesWriter_Finish` or
+   :c:func:`PyBytesWriter_Discard` is called on *writer*.
+
+Low-level API
+^^^^^^^^^^^^^
+
+.. c:function:: int PyBytesWriter_Resize(PyBytesWriter *writer, Py_ssize_t size)
+
+   Resize the writer to *size* bytes. It can be used to enlarge or to
+   shrink the writer.
+
+   Newly allocated bytes are left uninitialized.
+
+   On success, return ``0``.
+   On error, set an exception and return ``-1``.
+
+   *size* must be positive or zero.
+
+.. c:function:: int PyBytesWriter_Grow(PyBytesWriter *writer, Py_ssize_t grow)
+
+   Resize the writer by adding *grow* bytes to the current writer size.
+
+   Newly allocated bytes are left uninitialized.
+
+   On success, return ``0``.
+   On error, set an exception and return ``-1``.
+
+   *size* must be positive or zero.
+
+.. c:function:: void* PyBytesWriter_GrowAndUpdatePointer(PyBytesWriter *writer, Py_ssize_t size, void *buf)
+
+   Similar to :c:func:`PyBytesWriter_Grow`, but also update the *buf*
+   pointer.
+
+   On error, set an exception and return ``NULL``.
+
+   Pseudo-code::
+
+       Py_ssize_t pos = (char*)buf - (char*)PyBytesWriter_GetData(writer);
+       if (PyBytesWriter_Grow(writer, size) < 0) {
+           return NULL;
+       }
+       return (char*)PyBytesWriter_GetData(writer) + pos;
+
+
+Overallocation
+--------------
+
+:c:func:`PyBytesWriter_Resize` and :c:func:`PyBytesWriter_Grow`
+overallocate the internal buffer to reduce the number of ``realloc()``
+calls and so reduce memory copies.
+
+
+Thread safety
+-------------
+
+The API is not thread safe: a writer should only be used by a single
+thread at a time.
+
+
+Soft deprecations
+-----------------
+
+Soft deprecate ``PyBytes_FromStringAndSize(NULL, size)`` and
+``_PyBytes_Resize()`` APIs. These APIs treat an immutable ``bytes``
+object as a mutable object. They remain available and maintained and do
+not emit a deprecation warning, but are no longer recommended for new
+code.
+
+``PyBytes_FromStringAndSize(str, size)`` is not soft deprecated. Only
+calls with ``NULL`` *str* are soft deprecated.
+
+
+Examples
+========
+
+High-level API
+--------------
+
+Create the bytes string ``b"Hello World!"``::
+
+    PyObject* hello_world(void)
+    {
+        PyBytesWriter *writer = PyBytesWriter_Create(0);
+        if (writer == NULL) {
+            goto error;
+        }
+        if (PyBytesWriter_WriteBytes(writer, "Hello", -1) < 0) {
+            goto error;
+        }
+        if (PyBytesWriter_Format(writer, " %s!", "World") < 0) {
+            goto error;
+        }
+        return PyBytesWriter_Finish(writer);
+
+    error:
+        PyBytesWriter_Discard(writer);
+        return NULL;
+    }
+
+
+Create the bytes string "abc"
+-----------------------------
+
+Example creating the bytes string ``b"abc"``, with a fixed size of 3 bytes::
+
+    PyObject* create_abc(void)
+    {
+        PyBytesWriter *writer = PyBytesWriter_Create(3);
+        if (writer == NULL) {
+            return NULL;
+        }
+
+        char *str = PyBytesWriter_GetData(writer);
+        memcpy(str, "abc", 3);
+        return PyBytesWriter_Finish(writer);
+    }
+
+GrowAndUpdatePointer() example
+------------------------------
+
+Example using a pointer to write bytes and to track the written size.
+
+Create the bytes string ``b"Hello World"``::
+
+    PyObject* grow_example(void)
+    {
+        // Allocate 10 bytes
+        PyBytesWriter *writer = PyBytesWriter_Create(10);
+        if (writer == NULL) {
+            return NULL;
+        }
+
+        // Write some bytes
+        char *buf = PyBytesWriter_GetData(writer);
+        memcpy(buf, "Hello ", strlen("Hello "));
+        buf += strlen("Hello ");
+
+        // Allocate 10 more bytes
+        buf = PyBytesWriter_GrowAndUpdatePointer(writer, 10, buf);
+        if (buf == NULL) {
+            PyBytesWriter_Discard(writer);
+            return NULL;
+        }
+
+        // Write more bytes
+        memcpy(buf, "World", strlen("World"));
+        buf += strlen("World");
+
+        // Truncate the string at 'buf' position
+        // and create a bytes object
+        return PyBytesWriter_FinishWithPointer(writer, buf);
+    }
+
+
+Reference Implementation
+========================
+
+`Pull request gh-131681 `__.
+
+The implementation internally allocates a :class:`bytes` object, so
+:c:func:`PyBytesWriter_Finish` just returns the object without having
+to copy memory.
+
+For strings up to 256 bytes, a small internal raw buffer of bytes is
+used. It avoids having to resize a :class:`bytes` object, which is
+inefficient. At the end, :c:func:`PyBytesWriter_Finish` creates the
+:class:`bytes` object from this small buffer.
+
+A free list is used to reduce the cost of allocating a
+:c:type:`PyBytesWriter` on the heap.
+
+
+Backwards Compatibility
+=======================
+
+There is no impact on backward compatibility: only new APIs are
+added.
+
+
+Prior Discussions
+=================
+
+* March 2025: Third public API attempt, using size rather than pointers:
+
+  * `Discussion `_
+  * `Pull request gh-131681 `__
+
+* February 2025: Second public API attempt:
+
+  * `Issue gh-129813 `_
+    and
+    `pull request gh-129814
+    `_
+
+* July 2024: First public API attempt:
+
+  * C API Working Group decision:
+    `Add PyBytes_Writer() API
+    `_
+    (August 2024)
+  * `Pull request gh-121726
+    `_:
+    first public API attempt (July 2024)
+
+* March 2016:
+  `Fast _PyAccu, _PyUnicodeWriter and _PyBytesWriter APIs to produce
+  strings in CPython `_:
+  Article on the original private ``_PyBytesWriter`` C API.
+
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.
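
Note on the "Inefficient allocation strategy" and "Overallocation" sections of the patch above: the difference between exact-size growth and overallocation can be sketched in plain C, independent of CPython. Everything below is hypothetical and for illustration only: the ``demo_buf`` type, the ``demo_*`` functions, and the 25% growth factor are assumptions, not the ``PyBytesWriter`` implementation or its actual growth policy. The sketch also omits integer-overflow checks that real code would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; NOT the PyBytesWriter implementation. */
typedef struct {
    char *data;
    size_t size;       /* bytes in use */
    size_t allocated;  /* bytes reserved */
    int overallocate;  /* non-zero: reserve extra room when growing */
    int reallocs;      /* number of realloc() calls made so far */
} demo_buf;

void demo_buf_init(demo_buf *b, int overallocate)
{
    memset(b, 0, sizeof(*b));
    b->overallocate = overallocate;
}

void demo_buf_free(demo_buf *b)
{
    free(b->data);
    b->data = NULL;
}

/* Append n bytes; return 0 on success, -1 on allocation failure. */
int demo_buf_append(demo_buf *b, const void *src, size_t n)
{
    assert(b != NULL && src != NULL);
    size_t needed = b->size + n;
    if (needed > b->allocated) {
        size_t new_alloc = needed;
        if (b->overallocate) {
            new_alloc += needed / 4;  /* assumed 25% headroom */
        }
        char *p = realloc(b->data, new_alloc);
        if (p == NULL) {
            return -1;
        }
        b->data = p;
        b->allocated = new_alloc;
        b->reallocs++;
    }
    memcpy(b->data + b->size, src, n);
    b->size = needed;
    return 0;
}

/* Append `count` single bytes; return the number of realloc() calls,
   or -1 on allocation failure. */
int demo_count_reallocs(int overallocate, int count)
{
    demo_buf b;
    demo_buf_init(&b, overallocate);
    for (int i = 0; i < count; i++) {
        if (demo_buf_append(&b, "x", 1) < 0) {
            demo_buf_free(&b);
            return -1;
        }
    }
    int n = b.reallocs;
    demo_buf_free(&b);
    return n;
}
```

With exact-size growth (``demo_count_reallocs(0, 1000)``), every one-byte append triggers a ``realloc()``. With the assumed 25% headroom (``demo_count_reallocs(1, 1000)``), far fewer calls are needed because most appends fit in already reserved space.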