@@ -91,6 +91,12 @@ with a single seek to `file_size - 32`, without first reading the header.
 +--------+------+---------+----------------------------------------+
 ```
 
+The magic number `0x54414348` ("TACH" for Tachyon) identifies the file format
+and also serves as an **endianness marker**. When read on a system with
+different byte order than the writer, it appears as `0x48434154`. The reader
+uses this to detect cross-endian files and automatically byte-swap all
+multi-byte integer fields.
+
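As a rough illustration of the detection described in the added paragraph, the check could look something like the sketch below. The helper name `detect_byte_order`, the two macro names, and the return convention are assumptions made for this example; they are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TACHYON_MAGIC         0x54414348u  /* "TACH" as seen in matching byte order */
#define TACHYON_MAGIC_SWAPPED 0x48434154u  /* same bytes seen in the opposite order */

/* Hypothetical helper: inspects the first four header bytes.
 * Returns 0 and sets *needs_swap on success, or -1 if the bytes
 * are not a recognized magic number in either byte order. */
static int
detect_byte_order(const unsigned char *header, int *needs_swap)
{
    uint32_t magic;

    assert(header != NULL && needs_swap != NULL);
    /* memcpy avoids unaligned access; the value is read in host order. */
    memcpy(&magic, header, sizeof(magic));

    if (magic == TACHYON_MAGIC) {
        *needs_swap = 0;
        return 0;
    }
    if (magic == TACHYON_MAGIC_SWAPPED) {
        *needs_swap = 1;
        return 0;
    }
    return -1;
}
```

Because the magic is compared in host order against both constants, the same code works whether the reader is little-endian or big-endian.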
 The Python version field records the major, minor, and micro version numbers
 of the Python interpreter that generated the file. This allows analysis tools
 to detect version mismatches when replaying data collected on a different
@@ -399,14 +405,17 @@ enable O(1) lookup during interning.
 
 ### Reading
 
-1. Read the header and validate magic/version
-2. Seek to end − 32 and read the footer
-3. Allocate string array of `string_count` elements
-4. Parse the string table, populating the array
-5. Allocate frame array of `frame_count * 3` uint32 elements
-6. Parse the frame table, populating the array
-7. If compressed, decompress the sample data region
-8. Iterate through samples, resolving indices to strings/frames
+1. Read the header magic number to detect endianness (set `needs_swap` flag
+   if the magic appears byte-swapped)
+2. Validate version and read remaining header fields (byte-swapping if needed)
+3. Seek to end − 32 and read the footer (byte-swapping counts if needed)
+4. Allocate string array of `string_count` elements
+5. Parse the string table, populating the array
+6. Allocate frame array of `frame_count * 3` uint32 elements
+7. Parse the frame table, populating the array
+8. If compressed, decompress the sample data region
+9. Iterate through samples, resolving indices to strings/frames
+   (byte-swapping thread_id and interpreter_id if needed)
 
 The reader builds lookup arrays rather than dictionaries since it only needs
 index-to-value mapping, not value-to-index.
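The flat frame array from the steps above might be modeled roughly as follows. This is only a sketch: the `frame_table` type and its functions are hypothetical names, and the meaning of the three `uint32` values per frame is not stated in this part of the diff, so they are treated as opaque fields here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FRAME_FIELDS 3  /* three uint32 values per frame, as in the step list */

/* Hypothetical container for the index-to-value frame lookup array. */
typedef struct {
    uint32_t *data;        /* frame_count * FRAME_FIELDS elements */
    uint32_t  frame_count;
} frame_table;

/* Allocates a zeroed table. Returns 0 on success, -1 on failure. */
static int
frame_table_alloc(frame_table *t, uint32_t frame_count)
{
    assert(t != NULL);
    /* Guard the size computation against size_t overflow (matters on 32-bit). */
    if (frame_count > SIZE_MAX / (FRAME_FIELDS * sizeof(uint32_t))) {
        return -1;
    }
    size_t n = (size_t)frame_count * FRAME_FIELDS;
    t->data = calloc(n ? n : 1, sizeof(uint32_t));  /* never request 0 elements */
    if (t->data == NULL) {
        return -1;
    }
    t->frame_count = frame_count;
    return 0;
}

/* Returns a pointer to the FRAME_FIELDS values of frame `index`,
 * or NULL if the index is out of range. */
static const uint32_t *
frame_table_get(const frame_table *t, uint32_t index)
{
    if (index >= t->frame_count) {
        return NULL;
    }
    return &t->data[(size_t)index * FRAME_FIELDS];
}

static void
frame_table_free(frame_table *t)
{
    free(t->data);
    t->data = NULL;
    t->frame_count = 0;
}
```

A plain array is enough because the reader only ever maps an index to a value, so lookup is a bounds check plus a multiplication.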
@@ -420,22 +429,22 @@ fields when writing. However, the reader supports **cross-endian reading**:
 files written on a little-endian system (x86, ARM) can be read on a
 big-endian system (s390x, PowerPC), and vice versa.
 
-**Endianness Detection**: The magic number serves as an endianness marker.
-When read on a system with different byte order, it appears byte-swapped
-(`0x48434154` instead of `0x54414348`). The reader detects this and
-automatically byte-swaps all fixed-width integer fields during parsing.
-
-**Writer Requirements**: Fixed-width integer fields must be written using
-`memcpy()` from properly-sized integer types. When the source variable's
-type differs from the field width (e.g., `size_t` being written as 4 bytes),
-explicit casting to the correct type (e.g., `uint32_t`) is required before
-`memcpy()`. On big-endian systems, copying from an oversized type would
-copy the wrong bytes (high-order zeros instead of the actual value).
-
-**Reader Implementation**: The reader tracks whether byte-swapping is needed
-via a `needs_swap` flag set during header parsing. All fixed-width fields
-in the header, footer, and sample data are conditionally byte-swapped using
-inline swap functions (`bswap32`, `bswap64`).
+The magic number doubles as an endianness marker. When read on a system with
+different byte order, it appears byte-swapped (`0x48434154` instead of
+`0x54414348`). The reader detects this and automatically byte-swaps all
+fixed-width integer fields during parsing.
+
+Writers must use `memcpy()` from properly-sized integer types when writing
+fixed-width integer fields. When the source variable's type differs from the
+field width (e.g., `size_t` written as 4 bytes), explicit casting to the
+correct type (e.g., `uint32_t`) is required before `memcpy()`. On big-endian
+systems, copying from an oversized type would copy the wrong bytes—high-order
+zeros instead of the actual value.
+
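The writer rule above can be illustrated with a small sketch. The helper name `write_u32_field` and its range check are assumptions for this example, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: stores `value` in a 4-byte field at `dst`,
 * in the host's native byte order. Returns -1 if it does not fit. */
static int
write_u32_field(unsigned char *dst, size_t value)
{
    assert(dst != NULL);
    if (value > UINT32_MAX) {
        return -1;
    }
    /* Cast to the field's exact width first, then copy that object.
     * Writing memcpy(dst, &value, 4) instead would copy the first four
     * bytes of an 8-byte size_t, which on a big-endian host are the
     * high-order zeros rather than the value. */
    uint32_t narrowed = (uint32_t)value;
    memcpy(dst, &narrowed, sizeof(narrowed));
    return 0;
}
```

On little-endian hosts the uncast version happens to work, which is why this class of bug tends to surface only on big-endian platforms.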
+The reader tracks whether byte-swapping is needed via a `needs_swap` flag set
+during header parsing. All fixed-width fields in the header, footer, and
+sample data are conditionally byte-swapped using Python's internal byte-swap
+functions (`_Py_bswap32`, `_Py_bswap64` from `pycore_bitutils.h`).
 
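A conditional swap on read might look roughly like the sketch below. Since `pycore_bitutils.h` is internal to CPython and not available to a standalone program, `swap32` and `swap64` are portable stand-ins written for this example, and `read_u32_field` / `read_u64_field` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for _Py_bswap32; not the CPython implementation. */
static uint32_t
swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8)
         | ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
}

/* Portable stand-in for _Py_bswap64: swap each half, then exchange halves. */
static uint64_t
swap64(uint64_t x)
{
    return ((uint64_t)swap32((uint32_t)x) << 32)
         | (uint64_t)swap32((uint32_t)(x >> 32));
}

/* Hypothetical helper: reads a 4-byte field, swapping only if needed. */
static uint32_t
read_u32_field(const unsigned char *src, int needs_swap)
{
    uint32_t v;
    assert(src != NULL);
    memcpy(&v, src, sizeof(v));
    return needs_swap ? swap32(v) : v;
}

/* Hypothetical helper: reads an 8-byte field, swapping only if needed. */
static uint64_t
read_u64_field(const unsigned char *src, int needs_swap)
{
    uint64_t v;
    assert(src != NULL);
    memcpy(&v, src, sizeof(v));
    return needs_swap ? swap64(v) : v;
}
```

Passing `needs_swap` to every fixed-width read keeps the same-endian path free of swaps while letting a single parser handle both cases.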
 Variable-length integers (varints) are byte-order independent since they
 encode values one byte at a time using the LEB128 scheme, so they require