diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index a4254d43e5a..038b2b35741 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -663,7 +663,7 @@ peps/pep-0782.rst @vstinner
 peps/pep-0783.rst @hoodmane @ambv
 peps/pep-0784.rst @gpshead
 peps/pep-0785.rst @gpshead
-# ...
+peps/pep-0786.rst @ncoghlan
 peps/pep-0787.rst @ncoghlan
 peps/pep-0788.rst @ZeroIntensity @vstinner
 # ...
diff --git a/peps/pep-0786.rst b/peps/pep-0786.rst
new file mode 100644
index 00000000000..20feeefe398
--- /dev/null
+++ b/peps/pep-0786.rst
@@ -0,0 +1,567 @@
+PEP: 786
+Title: Precision and modulo-precision flag format specifiers for integer fields
+Author: Jay Berry
+Sponsor: Alyssa Coghlan
+Status: Draft
+Type: Standards Track
+Created: 04-Apr-2025
+Python-Version: 3.15
+Post-History: `14-Feb-2025 `__,
+
+
+Abstract
+========
+
+This PEP proposes implementing the ``.`` format specifier of :pep:`3101` and
+the ``z`` flag of :pep:`682` for integer fields, as "precision" and
+"modulo-precision" respectively. Both are presented together in this PEP
+because the rejected alternative implementations entail intertwined
+combinations of the two.
+
+``.`` ("precision") shall format an integer to a specified *minimum* number of
+digits, identical to the behavior of old-style ``%`` formatting. This shall be
+implemented for all integer presentation types except ``'c'``.
+
+``z`` ("modulo-precision") shall be permitted as an optional "modulo" flag
+when formatting an integer with precision and one of the binary, octal, or
+hexadecimal presentation types (bases that are powers of two). This first
+reduces the integer into ``range(base ** precision)`` using the ``%`` operator.
+The result is a predictable two's complement style formatting with the *exact*
+number of digits equal to the precision.
+
+This PEP amends the clause of :pep:`3101` which states "The precision is
+ignored for integer conversions".
+
+
+Rationale
+=========
+
+When string formatting integers in binary, octal, and hexadecimal, one often
+desires the resulting string to contain a guaranteed minimum number of digits.
+For unsigned integers of known machine-width bounds (for example, 8-bit bytes)
+this minimum often also ends up being the exact number of resulting digits.
+This has previously been implemented in the old-style ``%`` formatting using
+the ``.`` "precision" format specifier, closely related to that of the C
+programming language.
+
+.. code-block:: python
+
+    >>> "0x%.2x" % 15
+    '0x0f'  # two hex digits, ideal for displaying an unsigned byte
+    >>> "0o%.3o" % 18
+    '0o022'  # three octal digits, ideal for displaying a umask or file permissions
+
+When :pep:`3101` new-style formatting was first introduced, used in
+``str.format`` and f-strings, the `format specification `_ was
+simple enough that the behavior of "precision" could be trivially emulated with
+the ``width`` format specifier. Precision therefore was left unimplemented and
+forbidden for ``int`` fields. However, as time has progressed, new format
+specifiers have been added whose interactions with ``width`` noticeably
+diverge its behavior away from emulating precision, so the readmission of
+precision as its own format specifier, ``.``, is sufficiently warranted.
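+
+For reference, the behavior this PEP amends can be seen in any current CPython
+REPL: old-style ``%`` formatting accepts a precision for integers, while
+new-style formatting rejects it outright (the exact exception message below
+may vary between CPython versions):
+
+.. code-block:: python
+
+    >>> "%.2x" % 255   # old-style formatting accepts precision
+    'ff'
+    >>> f"{255:.2x}"   # new-style formatting currently rejects it
+    Traceback (most recent call last):
+      ...
+    ValueError: Precision not allowed in integer format specifier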
+
+The ``width`` format specifier guarantees a minimum length of the entire
+replacement field, not just the number of digits in a formatted integer.
+For example, the wonderful ``#`` specifier that prepends the prefix of the
+corresponding presentation type consumes from ``width``:
+
+.. code-block:: python
+
+    >>> x = 12
+    >>> f"0x{x:02x}"  # manually specifying '0x' prefix
+    '0x0c'  # two hex digits :)
+    >>> f"{x:#02x}"  # use '#' format specifier to output '0x' automatically
+    '0xc'  # only one hex digit :(
+    >>> f"{x:#08b}"
+    '0b001100'  # we wanted 8 bits, not 6 :(
+
+One could attempt to argue that since the length of a prefix is known to
+always be 2, it can be accounted for manually by adding 2 to the desired
+number of digits. Consider, however, the following demonstrations of why this
+is a bad idea:
+
+* By correcting the second example to ``f"{x:#04x}"``, at a glance this looks
+  like it may produce four hex digits, but it only produces two. This is bad
+  for readability. ``4`` is thus too much of a 'magic number', and trying to
+  counter that by being overly explicit with ``f"{x:#0{2+2}x}"`` looks ridiculous.
+* In the future it is possible that a presentation type may be added with a prefix
+  not of length 2, meaning the programmer has to calculate the prefix length,
+  rather than Python's internal string formatting code handling that automatically.
+* Things get more complicated when using the ``sign`` format specifier, with
+  ``f"{x: #0{1+2+2}x}"`` required to produce ``' 0x0c'``.
+* Things get *even more* complicated when introducing a ``grouping_option``,
+  for example formatting an integer into ``k`` 'word' segments joined by ``_``:
+  ``x = 3735928559; k = 2; f"{x: #0{1+2+4*k+(k - 1)}_x}"`` is required to
+  produce ``' 0xdead_beef'``. Surely this would be easier to write
+  with precision as ``f"{x: #_.8x}"``?
+
+It is clear at this point that implementing precision for ``int`` fields would
+provide a beneficial reduction of complexity for any user. Nor is this proposal
+a new special-case behavior being demanded exclusively at the behest of ``int``
+fields: the precision token ``.`` is already implemented as prescribed in
+:pep:`3101` for ``str`` data to truncate the field's length, and for ``float``
+data to ensure that there are a fixed number of digits after the decimal point,
+e.g. ``f"{0.1+0.2: .4f}"`` producing ``' 0.3000'``. Thus no new tokens need
+adding to the `format specification `_ because of this proposal, maintaining
+its modest size.
+
+For the sake of completeness, and for lack of any reasonable objection, we
+propose that precision shall work also in decimal, base 10. Explicitly, the
+integer presentation types laid out in :pep:`3101` that are permitted to
+implement precision are ``'b'``, ``'d'``, ``'o'``, ``'x'``, ``'X'``, ``'n'``,
+and ``''`` (``None``). The only presentation type not permitted is
+``c`` ('character'), whose purpose is to format an integer to a single Unicode
+character, or an appropriate replacement for non-printable characters, for
+which it does not make sense to implement precision. In the event that new
+integer presentation types are added in the future, such as ``'B'`` and ``'O'``
+which mutatis-mutandis could provide the same behavior as ``'X'`` (that is a
+capitalized prefix and digits), their addition should appropriately consider
+whether precision should be implemented or not. In the case of ``'B'`` and ``'O'``
+as described here it would be correct to implement precision. A ``ValueError``
+shall be raised when precision is used with an invalid integer presentation
+type.
+
+
+Precision For Negative Numbers
+------------------------------
+
+So far in this PEP we have cautiously avoided talking about the formatting of
+negative numbers with precision, which we shall now discuss.
+
+
+Short Verdict
+'''''''''''''
+
+We desire two behaviors, summarized below and sketched in code after this
+list; this motivates the implementation of a flag ``z`` to toggle on the
+latter:
+
+* For precision without the ``z`` flag, a negative integer ``x`` shall be
+  formatted with a negative sign and the digits of ``-x``'s formatting. This is
+  the same friendly behavior as old-style ``%`` formatting.
+
+  For example ``f"{-12:#.2x}"`` shall produce ``'-0x0c'``, equivalent to ``"%#.2x" % -12``.
+
+* For precision with the ``z`` flag, ``r = x % base ** n`` is first taken when
+  formatting ``f"{x:z.{n}{base_char}}"``, and ``r`` is passed on to precision,
+  the resulting string being equivalent to ``f"{r:.{n}{base_char}}"``. Because
+  ``r`` is in ``range(base ** n)`` the number of digits will always be exactly
+  ``n``, resulting in a predictable two's complement style formatting, which is
+  useful to the end user in environments that deal with machine-width oriented
+  integers such as :mod:`struct`.
+
+  For example in formatting ``f"{-1:z#.2x}"``, ``-1`` is reduced modulo ``256``
+  via ``255 = -1 % 256``, the resulting string being equivalent to ``f"{255:#.2x}"``,
+  which is ``'0xff'``.
+
+  The ``z`` flag shall only be implemented for presentation types corresponding
+  to bases that are powers of two, specifically at present binary, octal, and
+  hexadecimal. Whilst reduction of integers modulo powers of ten is
+  computationally possible, a 'ten's complement' has no demand, and so the
+  ``z`` flag is unimplemented for the decimal presentation types. The ``z``
+  flag shall work for all integers, not just negatives.
+
+  The syntax choice of ``z`` is again out of respect for maintaining the modest
+  size of the `format specification `_. ``z`` was introduced to the
+  format specification in :pep:`682` as a flag for normalizing negative zero to
+  positive zero for the ``float`` and ``Decimal`` types. It is currently
+  unimplemented for the ``int`` type, and since integers never have a 'negative zero'
+  situation it seems uncontroversial to repurpose ``z``, again as a flag. If one
+  squints hard enough, the ``z`` looks like a ``2`` for two's complement!
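+
+The two behaviors can be captured in a rough pure-Python sketch (a
+hypothetical helper, not the proposed C implementation; the ``#`` prefix,
+sign options and grouping are ignored for brevity):
+
+.. code-block:: python
+
+    def format_int_precision(x: int, precision: int, type_char: str = "x",
+                             modulo: bool = False) -> str:
+        """Rough sketch of the proposed precision / modulo-precision semantics."""
+        base = {"b": 2, "o": 8, "x": 16, "X": 16, "d": 10}[type_char]
+        if modulo:
+            if base not in (2, 8, 16):
+                raise ValueError("'z' is only allowed for power-of-two bases")
+            x %= base ** precision           # modulo-precision: the 'z' flag
+        sign, magnitude = ("-", -x) if x < 0 else ("", x)
+        return sign + format(magnitude, type_char).rjust(precision, "0")
+
+    # format_int_precision(-12, 2)              -> '-0c'  (cf. proposed f"{-12:.2x}")
+    # format_int_precision(-1, 2, modulo=True)  -> 'ff'   (cf. proposed f"{-1:z.2x}")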
+
+
+Long Introspection
+''''''''''''''''''
+
+We first present some observations about the binary representations of *signed*
+integers in two's complement. This leads us to a couple of alternative formulations
+of formatting negative numbers.
+
+Observe that one can always extend a signed number's binary representation by
+repeating the leading digit as a prefix:
+
+.. code-block:: text
+
+    45  (8-bit)  00101101
+    45  (9-bit)  000101101
+    -19 (8-bit)  11101101
+    -19 (9-bit)  111101101
+
+For non-negative numbers this is obvious. For negative numbers this is because
+the erstwhile leading column of an ``n``\ -bit representation goes from having a
+value of ``-2 ** (n-1)`` to ``+2 ** (n-1)``, with a new ``n+1``\ th column of
+value ``-2 ** n`` prefixed on, leaving the overall sum unaffected.
+
+This is what C's ``printf`` does, working with machine widths (powers of two)
+as the numbers of binary digits:
+
+.. code-block:: C
+
+    printf("%#hhb\n", -19);  // 0b11101101
+    printf("%#hho\n", -19);  // 0355
+    printf("%#hhx\n", -19);  // 0xed
+
+    printf("%#b\n", -19);    // 0b11111111111111111111111111101101
+    printf("%#o\n", -19);    // 037777777755
+    printf("%#x\n", -19);    // 0xffffffed
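+
+The same fixed-width results can be reproduced in Python today by masking
+(reducing modulo a power of two) before formatting, which is exactly the
+reduction the proposed ``z`` flag would perform:
+
+.. code-block:: python
+
+    >>> format(-19 & 0xff, '#04x')         # mask to 8 bits, then format
+    '0xed'
+    >>> format(-19 & 0xffffffff, '#010x')  # mask to 32 bits
+    '0xffffffed'
+
+Under this PEP the same strings would be spelled ``f"{-19:z#.2x}"`` and
+``f"{-19:z#.8x}"`` respectively.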
+
+Conversely, it should be clear that one can losslessly truncate a signed number's
+binary representation to have only one leading ``0`` if it is non-negative, and
+one leading ``1`` if it is negative:
+
+.. code-block:: text
+
+    45  (8-bit)  00101101
+    45  (7-bit)  0101101
+    -19 (8-bit)  11101101
+    -19 (7-bit)  1101101
+
+If one were to truncate another digit off of these examples, then both would
+end up as ``101101``, 45 being indistinguishable from -19 when using only 6 binary
+digits because they are both the same modulo ``2 ** 6 = 64``. Therefore to
+losslessly and unambiguously represent a signed integer ``x`` as a binary string
+which is rendered to the end user, we have a de facto 'minimal width' representation
+convention, using ``n`` digits, where ``n`` is the smallest integer such that
+``x`` is in ``range(-2 ** (n-1), 2 ** (n-1))``.
+
+For rendering octal and hexadecimal strings one has to extend the definition of
+the 'minimal width' representation convention to be sufficiently unambiguous.
+383's minimal width binary string is ``0101111111``, and -129's is ``101111111``,
+a suffix of the former's. A naive, incorrect implementation of hexadecimal
+string formatting would render both as ``'0x17f'`` by *padding* both binary
+representations to ``000101111111``. The method was correct to desire a number
+of binary digits (12) that is divisible by the number of bits in the base
+(4 bits in base 16) so that the binary representation can be segmented up into
+(hex) digits, but it was incorrect in *padding*; the method should have instead
+*extended* as we have observed previously, 383 extending to ``000101111111``
+and -129 extending to ``111101111111``, whence 383 is rendered as ``'0x17f'``
+and -129 as ``'0xf7f'``.
+
+Thus the generalized definition of our 'minimal width' representation convention
+is: for an integer ``x`` to be rendered in base ``base``, produce ``n`` digits,
+where ``n`` is the smallest integer such that ``x`` is in
+``range(-base ** n / 2, base ** n / 2)``.
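+
+To make the definition concrete, here is a small sketch (hypothetical helper
+names, not part of the proposal) that computes the minimal width and the
+corresponding digit string, reproducing the 383 and -129 renderings above:
+
+.. code-block:: python
+
+    def minimal_width(x: int, base: int) -> int:
+        """Smallest n such that x is in range(-base**n // 2, base**n // 2)."""
+        n = 1
+        while not -(base ** n) // 2 <= x < (base ** n) // 2:
+            n += 1
+        return n
+
+    def minimal_width_repr(x: int, type_char: str = "x") -> str:
+        base = {"b": 2, "o": 8, "x": 16}[type_char]
+        n = minimal_width(x, base)
+        return format(x % base ** n, f"0{n}{type_char}")
+
+    # minimal_width_repr(383)       -> '17f'
+    # minimal_width_repr(-129)      -> 'f7f'
+    # minimal_width_repr(383, 'b')  -> '0101111111'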
+
+This leads onto the rejected alternatives.
+
+
+Rejected Alternatives
+=====================
+
+Behavior of ``z``
+-----------------
+
+The desired implementation of ``z``, the two's complement style formatting flag,
+has split opinion into two main camps, disagreeing over lossless vs. lossy
+presentation. The lossless camp believes that the formatted strings corresponding
+to integers should all be distinct from each other, uniqueness preserved by the
+minimal width representation convention; precision with ``z`` enabled should still
+be only a *minimum* number of digits requested, as it is without ``z``. The lossy
+camp believes that precision with ``z`` enabled should first reduce the integer
+using modular arithmetic, which then produces *exactly* the number of digits
+requested, equivalent to left-truncating the minimal width representation string.
+
+We endeavor to conclude in the following section that the former camp, lossless
+formatting, has no use cases, and is thus a rejected idea, whence this PEP
+proposes the latter, lossy, behavior.
+
+
+Minimal Width Representation Convention
+'''''''''''''''''''''''''''''''''''''''
+
+This idea was fiercely entertained only due to its lossless behavior; however,
+it is an obstacle to ergonomics in every candidate use case. These arguments about
+the aesthetics of string rendering are not irrational or about personal taste,
+but rather they are crucial in how information is communicated to the end user.
+
+In a program in which signed-ness of integers is critical to communicate, any
+implementation of ``z`` should not be used, as the average user will be expecting
+to see a negative sign ``-``. The alternative of using the minimal width
+representation convention requires one to be uncomfortably vigilant in looking
+for leading digits of numbers belonging to the upper half of the base's range
+whenever a negative number is present (``1`` for binary, ``4-7`` for octal, and
+``8-f`` for hex). Any end user that is not aware of this de facto convention,
+and even those who are but are not expecting it to be present in a program,
+would have a hard time:
+
+The formatting of 128 and -128 using ``f"{x:z#.2x}"`` would produce ``'0x080'``
+and ``'0x80'`` respectively. It is the PEP author's opinion that there is a 0%
+chance that ``'0x80'`` is being read as *negative* 128 under normal conditions.
+Furthermore, the hideous rendering of positive 128 as ``'0x080'`` is useless for
+a program that should produce a uniformly spaced hexdump of bytes, agnostic of
+whether they are signed or unsigned; all bytes should be rendered in the form
+``'0xNN'``. See the `examples <#modulo-precision>`__ section on how modulo-precision
+handles bytes in the correct sign-agnostic way.
+
+Contrapositively, therefore, ``z``'s purpose is to be used in environments where
+signed-ness is *not* critical, and more likely than not where it is even
+encouraged to treat the integers with respect to the modular arithmetic that
+arises in two's complement hardware of fixed register sizes. In the example above
+128 and -128 are the same modulo 256, and the respectable rendering is ``'0x80'``.
+In general the purpose of ``z`` is to treat integers modulo ``base ** precision``
+as the same. So too 255 and -1 should both be rendered as ``'0xff'``, not
+``'0x0ff'`` and ``'0xff'`` respectively; the truncation is not a hindrance, but
+the desired behavior. Formally we may say that the formatting should be a
+well-defined bijection between the equivalence classes of ``Z/(base ** precision)Z``
+and strings with ``precision`` digits.
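+
+In terms of today's tools the proposed behavior amounts to nothing more than
+reducing with ``%`` before formatting; for two hex digits (the ``f"{x:z#.2x}"``
+spelling is the proposed one, the ``%`` version below runs today):
+
+.. code-block:: python
+
+    >>> for x in (128, -128, 255, -1):
+    ...     print(x, f"{x % 256:#04x}")  # today's equivalent of f"{x:z#.2x}"
+    ...
+    128 0x80
+    -128 0x80
+    255 0xff
+    -1 0xff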
+
+The remaining question, "is there no chance to communicate this truncation to
+the user?", arises as a concern for the 'loss of information' in the effectively
+left-truncated strings. We reject this question's premise that there ever is such
+a case of unintentional loss of information, by considering the two cases of
+hardware-aware integers and otherwise:
+
+With respect to hardware-aware integers, we have so far played around with examples
+of integers in ``range(-128, 256)``, the union of the signed and unsigned ranges
+for bytes. The virtues of formatting ``x`` and ``x - 256`` as the same are clearly
+established. In the contexts where one expects to find ``z``, any erroneous integers
+corresponding to bytes that lie outside that range are likely a programming error.
+For example, if a library sets a pixel brightness integer to be 257 and prints out
+``'0x01'`` instead of ``'0x101'`` via ``f"{x:z#.2x}"``, that is not string
+formatting's problem or doing. String formatting shouldn't raise an exception, or
+even a ``SyntaxWarning`` as an invalid escape sequence ``"\y"`` would, because
+``ValueError: bytes must be in range(0, 256)`` will be raised by ``bytes`` when
+trying to serialize that integer via ``bytes([257])``; let that more appropriate
+'layer' of code raise the exception, as it is more indicative of a defect in the
+library than in our string formatting.
+
+In the case of non-hardware-aware integers, one would have to intentionally opt to
+use ``z``, in which case modular arithmetic is the chosen, desired effect. It is
+for this reason also that we shall not raise a ``SyntaxWarning`` or ``ValueError``
+for integers lying outside of ``range(-base ** precision / 2, base ** precision)``.
+
+Thus we have defended the lossy behavior of ``z`` implemented as modulo-precision,
+and we have found no reasonable use case for the lossless behavior.
+
+A final compromise to consider and reject is implementing ``z`` not as a flag
+*contingent* on ``.``, but as an independent flag that may optionally be
+*combined* with ``.``.
+Specifically: ``z`` without ``.`` would turn on two's complement mode to render
+the minimal width representation of the formatted integer, ``.`` without ``z``
+would implement precision as already explained, a minimum number of digits in the
+magnitude and a sign if necessary, and ``z`` combined with ``.`` would turn on the
+left-truncating modulo-precision. This labyrinth of combinations does not seem
+useful to anyone, as we have already discredited the ergonomics of the minimal
+width representation convention, whence ``z`` would rarely be used on its own,
+and this behavior of two options that individually render a *minimum* number of
+digits combining together to render an *exact* number of digits seems
+counterintuitive.
+
+
+Infinite Length Indication
+''''''''''''''''''''''''''
+
+Another, less popular, rejected alternative was for ``z`` to directly acknowledge
+the infinite prefix of ``0``\ s or ``1``\ s that precedes a non-negative or negative
+number respectively. For example:
+
+.. code-block:: python
+
+    >>> f"{-1:z#.8b}"
+    '0b[...1]11111111'
+    >>> f"{300:z#.8b}"
+    '0b[...0]100101100'
+
+This is effectively the minimal width representation convention with an 'infinite'
+prefix attached to it.
+
+In the C programming language the machine-width dependent two's complement
+formatting of ``int`` data with precision exhibits excessive lengths of prefixes
+that arise from negative numbers, even those with small magnitude:
+
+.. code-block:: C
+
+    printf("%#.2x\n", -19);                           // 0xffffffed
+    printf("%#.2llx\n", (long long unsigned int)-19); // 0xffffffffffffffed
+
+This prefix could continue on indefinitely if it were not limited by a maximum
+machine-width!
+
+Python's ``int`` type is indeed not limited by a maximum machine-width. Thus to
+avoid printing infinitely long two's complement strings we could use a similar
+approach to that of the builtin ``list``'s string formatting for printing a list
+that contains itself:
+
+.. code-block:: python
+
+    >>> l = []
+    >>> l.append(l)
+    >>> l
+    [[...]]
+
+    >>> y = -1
+    >>> f"{y:z#.8b}"
+    '0b[...1]11111111'
+
+This may have been useful to educate beginners on how bitwise binary operations
+work, for example showing how ``-1 & x`` is always trivially equal to ``x``, or
+how the binary representation of the negation of a number can be obtained by
+adding one to its bitwise complement:
+
+.. code-block:: python
+
+    >>> x = 42
+    >>> f"{x:z#.8b}"
+    '0b[...0]00101010'
+    >>> f"{~x:z#.8b}"
+    '0b[...1]11010101'
+    >>> f"{x|~x:z#.8b}"
+    '0b[...1]11111111'
+    # x | ~x == -1
+    # x | ~x == x + ~x because of their disjoint bitwise representations
+    # thus x + ~x == -1
+    # thus -x == ~x + 1
+    >>> y = ~x + 1
+    >>> f"{y:z#.8b}"
+    '0b[...1]11010110'
+    >>> y == -x
+    True
+
+Its use case is just too narrow, and modulo-precision outshines it.
+
+
+General
+-------
+
+* What about ones' complement, or other binary representations?
+
+  Two's complement is so dominant that no one really considers other
+  representations; C23 requires it for signed integers, and GCC only supports
+  two's complement.
+
+* Could we do nothing?
+ + Programmers continue to hobble on using the ``width`` format specifier with ad-hoc + corrections to mimic precision. This is intolerable, and the rationale of this PEP + makes conclusive arguments for the addition and implementation choices of precision. + + Refusing to implement precision for integer fields using ``.`` reserves ``.`` for + possible future uses. However in the ~20 year timespan since :pep:`3101` no + alternatives have been accepted, and any alternate use of ``.`` takes it further + out of sync with both old-style ``%`` formatting, and the C programming language. + + +Syntax +------ + +* ``!`` instead of ``z.`` for precision with modulo-precision, mutually exclusive with ``.``. + + Pros: + + - ``!`` is graphically related to ``.``, an extension if you will. Precision + with the modulo-precision flag set is indeed an extension of precision. + - ``!`` in the English language is often used for imperative, commanding sentences. + So too modulo-precision commands the *exact* number of digits to which its input + shall be formatted, whereas precision is the *minimum* number of digits. + This is idiomatic. + - ``!`` is only one symbol as opposed to ``z.``. This coupled with ``!`` being + mutually exclusive with ``.`` leaves the overall length of one's written code + unaffected when switching on modulo-precision. + - Using a new ``!`` symbol reserves ``z`` for other future uses, whatever that may be. + + Cons: + + - ``z.`` also conveys a sense of extension from ``.``, a flag attached to ``.``, + and lexicographically flows left to right as 'modulo' (``z``) 'precision' (``.``). + - ``.`` and ``!`` being mutually exclusive to each other may give a beginner + programmer analysis-paralysis over which to choose when looking at the + `format specification `_ documentation. + - ``!`` would be another addition to the format specification for a single purpose. + It would not have any implementation for ``str``, ``float``, or any other type. + - There also already exists a ``["!" conversion]`` "explicit conversion flag" + in the `format string syntax `_ as laid out in :pep:`3101`. + For example in ``f"{s!r}"`` the ``!r`` calls ``repr`` on ``s``. This would + *not* syntactically clash with a ``!`` format specifier, the format specifiers + ``[":" format_spec]`` being separated by a well-defined preceding colon, + however users unfamiliar with the new modulo-precision mode may glance over + format strings containing ``!`` and expect different behavior. + + Verdict: + + - Whilst graphically attractive, ``!`` would clutter the format specification for + a single purpose that can be achieved by overloading the preexisting ``z`` flag. + + +Backwards Compatibility +======================= + +To quote :pep:`682`: + + The new formatting behavior is opt-in, so numerical formatting of existing + programs will not be affected. + +unless someone out there is specifically relying upon ``.`` raising a ``ValueError`` +for integers as it currently does, but to quote :pep:`475`: + + The authors of this PEP don't think that such applications exist + + +Examples And Teaching +===================== + +Precision +--------- + +Documentation and tutorials in the Python sphere of influence should encourage +the adoption of ``.``, precision, as the default format specifier for formatting +``int`` fields as opposed to ``width``, when it is clear a minimum number of *digits* +is required, not a minimum length of the *whole replacement field*. 
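+
+For instance, a tutorial might contrast the two spellings as below (the
+``.8x`` examples assume the precision semantics proposed by this PEP; the
+``width`` examples run today):
+
+.. code-block:: python
+
+    >>> x = 0xbeef
+    >>> f"{x:#010x}"   # width counts the '0x' prefix towards the 10 characters
+    '0x0000beef'
+    >>> f"{x:#.8x}"    # precision (proposed): always eight *digits*
+    '0x0000beef'
+    >>> f"{-x:#010x}"  # width: the sign also eats into the field
+    '-0x000beef'
+    >>> f"{-x:#.8x}"   # precision (proposed): the digits stay eight wide
+    '-0x0000beef'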
+
+Since the concept of precision is common in other languages such as C, and was
+already present in Python's old-style ``%`` formatting, we don't need to go *too*
+overboard, but a decent few examples, as below, may demonstrate its uses.
+
+.. code-block:: python
+
+    >>> def hexdump(b: bytes) -> str:
+    ...     return " ".join(f"{c:#.2x}" for c in b)
+
+    >>> hexdump(b"GET /\r\n\r\n")
+    '0x47 0x45 0x54 0x20 0x2f 0x0d 0x0a 0x0d 0x0a'
+    # observe the CR and LF bytes padded to precision 2
+    # in this basic HTTP/0.9 request
+
+    >>> def unicode_dump(s: str) -> str:
+    ...     return " ".join(f"U+{ord(c):.4X}" for c in s)
+
+    >>> unicode_dump("USA 🦅")
+    'U+0055 U+0053 U+0041 U+0020 U+1F985'
+    # observe the last character's Unicode codepoint has 5 digits;
+    # precision is only the minimum number of digits
+
+
+Modulo-Precision
+----------------
+
+The clear area for encouraging the use of modulo-precision is when dealing with
+machine-width oriented integers such as those packed and unpacked by :mod:`struct`.
+We give an example of the consistent, predictable two's complement formatting of
+signed and unsigned integers.
+
+.. code-block:: python
+
+    >>> import struct
+
+    >>> my_struct = b"\xff"
+    >>> (t,) = struct.unpack('b', my_struct)  # signed char
+    >>> print(t, f"{t:#.2x}", f"{t:z#.2x}")
+    -1 -0x01 0xff
+    >>> (t,) = struct.unpack('B', my_struct)  # unsigned char
+    >>> print(t, f"{t:#.2x}", f"{t:z#.2x}")
+    255 0xff 0xff
+
+    # observe that in both the signed and unsigned unpacking the modulo-precision
+    # flag 'z' produces a predictable two's complement formatting
+
+
+Thanks
+======
+
+Thank you to:
+
+* Raymond Hettinger, for the initial suggestion of the two's complement behavior.
+
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.
+
+
+Footnotes
+=========
+
+.. _formatstrings: https://docs.python.org/3/library/string.html#formatstrings
+.. _formatspec: https://docs.python.org/3/library/string.html#formatspec