@Mythie Mythie commented Jan 28, 2026

Summary

Fixes corrupted rendering of accented/Latin-1 characters (á, é, ñ, ö, €, curly quotes, em dashes, etc.) when using Standard 14 fonts like Helvetica, Times-Roman, and Courier.

Problem

Four compounding bugs caused non-ASCII characters to render as mojibake:

  1. Wrong text encoding: encodeTextForFont() used PDFDocEncoding (a metadata encoding) instead of WinAnsiEncoding, producing incorrect bytes in the 0x80–0x9F range (€, curly quotes, em dash, etc.)
  2. UTF-8 round-trip corruption: The pipeline shuttled content stream bytes through Operator.toString() → TextDecoder (UTF-8) → TextEncoder (UTF-8), destroying any non-ASCII byte (e.g., 0xE9 for é became U+FFFD)
  3. Missing /Encoding in font dict: Without an explicit /Encoding WinAnsiEncoding entry, PDF viewers fell back to the font's built-in StandardEncoding, mapping bytes to the wrong glyphs
  4. Wrong width measurement: getGlyphName() only mapped ASCII, returning "space" for all accented characters, breaking text layout and measurement
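
The round-trip corruption in item 2 can be reproduced in isolation. This is a minimal sketch using only the standard TextDecoder and TextEncoder globals; it does not use the library's Operator class, and the `toHex` helper is a hypothetical name introduced for illustration.

```typescript
// Byte 0xE9 is "é" in WinAnsiEncoding, but on its own it is not valid UTF-8.
const original = new Uint8Array([0x63, 0x61, 0x66, 0xe9]); // "café" as WinAnsi bytes

// Decoding as UTF-8 replaces the invalid byte with U+FFFD (replacement character).
const decoded = new TextDecoder("utf-8").decode(original);

// Re-encoding writes U+FFFD as three bytes (EF BF BD); the original 0xE9 is lost.
const roundTripped = new TextEncoder().encode(decoded);

// Hypothetical helper: format bytes as space-separated uppercase hex.
const toHex = (bytes: Uint8Array): string =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0").toUpperCase()).join(" ");

console.log(toHex(original)); // 63 61 66 E9
console.log(toHex(roundTripped)); // 63 61 66 EF BF BD
```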

Solution

  • Use proper font encoding: encodeTextForFont() now uses WinAnsiEncoding for Helvetica/Times/Courier, SymbolEncoding for Symbol, and ZapfDingbatsEncoding for ZapfDingbats. Unencodable characters (CJK, emoji) substitute with .notdef (byte 0x00).
  • Hex-format PdfString: Standard 14 text is encoded as hex strings (<636166E9>) — pure ASCII that's immune to any encoding transformation. Matches pdf-lib's approach.
  • Bytes-first pipeline: appendOperators() uses Operator.toBytes() directly. createContentStream/appendContent/prependContent accept string | Uint8Array, eliminating the UTF-8 round-trip.
  • Font dict /Encoding: Standard 14 font dicts now include /Encoding /WinAnsiEncoding (omitted for Symbol/ZapfDingbats per PDF spec Table 5.15).
  • Extended glyph map: CHAR_TO_GLYPH now covers all WinAnsi non-ASCII characters (~95 entries), fixing width measurement for accented text.
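
As a rough illustration of the encoding and hex-string steps, the sketch below maps text to WinAnsi bytes and formats them as a PDF hex string. It is not the PR's implementation: `encodeWinAnsiSketch`, `toHexStringSketch`, and `WIN_ANSI_HIGH` are hypothetical names, and the 0x80–0x9F table is a deliberately partial subset.

```typescript
// Partial map from Unicode code points to WinAnsi bytes in the 0x80-0x9F range.
// Illustrative subset only; the full encoding has more entries.
const WIN_ANSI_HIGH: Record<number, number> = {
  0x20ac: 0x80, // euro sign
  0x2026: 0x85, // horizontal ellipsis
  0x2018: 0x91, // left single quotation mark
  0x2019: 0x92, // right single quotation mark
  0x201c: 0x93, // left double quotation mark
  0x201d: 0x94, // right double quotation mark
  0x2022: 0x95, // bullet
  0x2013: 0x96, // en dash
  0x2014: 0x97, // em dash
  0x2122: 0x99, // trade mark sign
};

// Hypothetical sketch: encode text as WinAnsi bytes, substituting 0x00 for
// anything that cannot be encoded (CJK, emoji, and unmapped characters).
function encodeWinAnsiSketch(text: string): Uint8Array {
  const out: number[] = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    if ((cp >= 0x20 && cp <= 0x7e) || (cp >= 0xa0 && cp <= 0xff)) {
      out.push(cp); // printable ASCII and the Latin-1 upper half map directly
    } else {
      out.push(WIN_ANSI_HIGH[cp] ?? 0x00);
    }
  }
  return Uint8Array.from(out);
}

// Format bytes as a PDF hex string, which contains only ASCII characters.
function toHexStringSketch(bytes: Uint8Array): string {
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0").toUpperCase());
  return `<${hex.join("")}>`;
}

console.log(toHexStringSketch(encodeWinAnsiSketch("café"))); // <636166E9>
console.log(toHexStringSketch(encodeWinAnsiSketch("€"))); // <80>
```

Because the output is plain ASCII, a later string or UTF-8 conversion cannot change the encoded bytes.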

Changes

| File | Change |
| --- | --- |
| src/fonts/standard-14.ts | Encoding helpers, extended CHAR_TO_GLYPH map |
| src/api/pdf-page.ts | Encoding fix, bytes pipeline, font dict /Encoding |
| src/api/drawing/path-builder.ts | ContentAppender type accepts string \| Uint8Array |
| src/api/drawing/latin1-encoding.test.ts | 29 new tests |

Test coverage

  • Font encoding selection (WinAnsi vs Symbol vs ZapfDingbats)
  • Glyph name mapping and width measurement for accented characters
  • Font dict /Encoding presence/absence verification
  • Hex string encoding in content streams (é → <E9>, not UTF-8 C3A9)
  • Unencodable character .notdef substitution
  • Round-trip PDF generation with all Standard 14 font families
  • Backward compatibility (shapes, paths, images still work)
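
The font dict check can be pictured with a small sketch of the presence/absence rule. The helper names `usesWinAnsiSketch` and `fontDictSketch` are hypothetical, and the dictionary is serialized as a plain string for illustration; the library's actual object model is not shown here.

```typescript
// Hypothetical helper: Symbol and ZapfDingbats keep their built-in encodings,
// every other Standard 14 font gets WinAnsiEncoding.
function usesWinAnsiSketch(baseFont: string): boolean {
  return baseFont !== "Symbol" && baseFont !== "ZapfDingbats";
}

// Serialize a minimal Type1 font dictionary for a Standard 14 font.
function fontDictSketch(baseFont: string): string {
  const entries = ["/Type /Font", "/Subtype /Type1", `/BaseFont /${baseFont}`];
  if (usesWinAnsiSketch(baseFont)) {
    entries.push("/Encoding /WinAnsiEncoding");
  }
  return `<< ${entries.join(" ")} >>`;
}

console.log(fontDictSketch("Helvetica"));
// << /Type /Font /Subtype /Type1 /BaseFont /Helvetica /Encoding /WinAnsiEncoding >>
console.log(fontDictSketch("Symbol"));
// << /Type /Font /Subtype /Type1 /BaseFont /Symbol >>
```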

…ndard 14 fonts

- Add getEncodingForStandard14() to select the correct encoding per font
- Add isWinAnsiStandard14() to distinguish Symbol/ZapfDingbats
- Extend CHAR_TO_GLYPH map with all WinAnsi non-ASCII characters
  (0x80-0x9F and 0xA0-0xFF ranges) fixing width measurement for
  accented text like é, ñ, ü, €, etc.
…d 14 fonts

Three compounding bugs caused accented characters (á, é, ñ, ö, €, etc.)
to render as mojibake with Standard 14 fonts like Helvetica:

1. Wrong text encoding: used PDFDocEncoding instead of WinAnsiEncoding
2. UTF-8 round-trip corruption: Operator.toString() → TextDecoder (UTF-8)
   destroyed non-ASCII bytes when re-encoded via TextEncoder
3. Missing /Encoding in font dict: viewers fell back to StandardEncoding

Fix:
- encodeTextForFont() now uses WinAnsiEncoding (or SymbolEncoding/
  ZapfDingbatsEncoding) with hex-format PdfString output
- Unencodable characters substitute with .notdef (byte 0x00)
- appendOperators() uses Operator.toBytes() directly, bypassing the
  string intermediate that caused UTF-8 corruption
- createContentStream/appendContent/prependContent accept string |
  Uint8Array for the broad bytes-first pipeline refactor
- addFontResource() adds /Encoding WinAnsiEncoding for Helvetica/
  Times/Courier families (omitted for Symbol/ZapfDingbats per spec)
- ContentAppender type updated to string | Uint8Array
…14 fonts

29 tests covering:
- Font encoding selection (WinAnsi vs Symbol vs ZapfDingbats)
- Glyph name mapping for accented/non-ASCII characters
- Width measurement correctness for accented text
- Font dict /Encoding verification
- Hex string encoding in content streams
- Unencodable character .notdef substitution
- Round-trip PDF generation with all font families
- Bytes pipeline backward compatibility (shapes, paths, images)

Copilot AI left a comment

Pull request overview

This PR fixes corrupted rendering of accented and Latin-1 characters (á, é, ñ, €, curly quotes, em dashes, etc.) when using Standard 14 fonts like Helvetica, Times-Roman, and Courier. The fix addresses four compounding bugs: wrong text encoding (PDFDocEncoding instead of WinAnsiEncoding), UTF-8 round-trip corruption through the content stream pipeline, missing /Encoding entries in font dictionaries, and incorrect width measurements for accented characters.

Changes:

  • Implements proper font encoding selection (WinAnsi for Helvetica/Times/Courier, Symbol/ZapfDingbats for those respective fonts)
  • Refactors content stream pipeline to work with Uint8Array throughout, eliminating UTF-8 corruption
  • Adds /Encoding /WinAnsiEncoding to Standard 14 font dictionaries (except Symbol/ZapfDingbats)
  • Extends CHAR_TO_GLYPH map to cover all WinAnsi non-ASCII characters for correct width measurement
  • Uses hex-format PdfStrings for Standard 14 text as defense-in-depth against encoding transformations
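
To see why the glyph map matters for measurement, here is a minimal sketch of width lookup by glyph name. The names `CHAR_TO_GLYPH_SKETCH`, `WIDTHS_SKETCH`, and `measureSketch` are hypothetical, and the width values are illustrative numbers in 1/1000 em units rather than data taken from the PR.

```typescript
// Tiny illustrative glyph-name map; the real CHAR_TO_GLYPH covers far more.
const CHAR_TO_GLYPH_SKETCH: Record<string, string> = {
  " ": "space",
  a: "a",
  c: "c",
  e: "e",
  f: "f",
  "é": "eacute",
};

// Illustrative advance widths in 1/1000 em units (assumed values).
const WIDTHS_SKETCH: Record<string, number> = {
  space: 278,
  a: 556,
  c: 500,
  e: 556,
  f: 278,
  eacute: 556,
};

// Sum glyph widths and scale by font size. Unmapped characters fall back to
// "space", which is the behaviour that mis-measured accented text before the fix.
function measureSketch(text: string, fontSize: number): number {
  let units = 0;
  for (const ch of text) {
    const glyph = CHAR_TO_GLYPH_SKETCH[ch] ?? "space";
    units += WIDTHS_SKETCH[glyph] ?? 0;
  }
  return (units * fontSize) / 1000;
}

console.log(measureSketch("café", 1000)); // 1890
```

Removing the "é" entry from the map makes the same call return 1612, because the last character is then measured as a space.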

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/fonts/standard-14.ts | Adds encoding helper functions (getEncodingForStandard14, isWinAnsiStandard14) and extends CHAR_TO_GLYPH map with ~95 WinAnsi non-ASCII entries |
| src/api/pdf-page.ts | Refactors encodeTextForFont() to use proper font encodings, updates appendContent/prependContent/createContentStream to accept bytes, implements bytes-first appendOperators(), adds /Encoding to font dicts |
| src/api/drawing/path-builder.ts | Updates ContentAppender type to accept string \| Uint8Array for backward-compatible bytes support |
| src/api/drawing/latin1-encoding.test.ts | Adds comprehensive test coverage with 29 tests covering encoding selection, glyph mapping, font dictionary structure, content stream encoding, and round-trip rendering |
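
One way to picture the string | Uint8Array change is a concatenation helper that never decodes bytes back into a string. This is only a sketch under that assumption: `ContentSketch`, `toBytesSketch`, and `concatSketch` are hypothetical names, not the signatures used in pdf-page.ts or path-builder.ts.

```typescript
// Hypothetical content type mirroring the string | Uint8Array union.
type ContentSketch = string | Uint8Array;

// Strings are assumed to hold ASCII operator text, for which UTF-8 encoding
// yields the same bytes. Byte arrays pass through untouched.
function toBytesSketch(content: ContentSketch): Uint8Array {
  return typeof content === "string" ? new TextEncoder().encode(content) : content;
}

// Concatenate chunks into one buffer without any decode step.
function concatSketch(chunks: ContentSketch[]): Uint8Array {
  const parts = chunks.map(toBytesSketch);
  const total = parts.reduce((n, p) => n + p.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}

// A literal string operand "(é)" carried as raw WinAnsi bytes between ASCII operators.
const stream = concatSketch([
  "BT /F1 12 Tf ",
  new Uint8Array([0x28, 0xe9, 0x29]),
  " Tj ET",
]);
console.log(stream.length); // 22
console.log(stream[14].toString(16)); // e9
```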

@Mythie Mythie merged commit 13bd3f4 into main Jan 28, 2026
5 checks passed