Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
201 changes: 201 additions & 0 deletions .agents/plans/044-latin1-standard14-encoding.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,201 @@
# 044: Fix Latin-1 / Accented Character Rendering with Standard 14 Fonts

## Problem

Drawing text with accented characters (á, é, ñ, ö, etc.) using Standard 14 fonts like Helvetica produces corrupted output. Characters render as mojibake because the content stream pipeline round-trips bytes through UTF-8, which destroys single-byte Latin-1 values.

**Root cause**: Three compounding issues in the content generation pipeline, plus a related width measurement bug:

1. **Wrong text encoding**: `encodeTextForFont()` (`pdf-page.ts:2433`) uses `PdfString.fromString()` which encodes via PDFDocEncoding (a metadata encoding), not WinAnsiEncoding (the font encoding Standard 14 fonts actually use). While the byte values happen to match for U+00A0–U+00FF, they diverge in the 0x80–0x9F range (€ is 0x80 in WinAnsi but 0xA0 in PDFDocEncoding; curly quotes, em dash, etc. all differ).

2. **UTF-8 round-trip corruption**: The pipeline converts `Operator` → `toString()` (UTF-8 decode via `TextDecoder`) → `appendContent(string)` → `TextEncoder.encode()` (UTF-8 encode). When a `PdfString` literal contains raw byte 0xE9 (WinAnsi `é`), the UTF-8 decode treats it as an invalid sequence and produces `U+FFFD`, destroying the original byte.

3. **Missing `/Encoding` in font dict**: The Standard 14 font dictionary (`pdf-page.ts:2392-2397`) is emitted without an `/Encoding` entry, so viewers fall back to the font's built-in encoding (typically StandardEncoding for Type1), not WinAnsiEncoding. Even if bytes were correct, the wrong encoding means wrong glyphs.

4. **Wrong width measurement**: `getGlyphName()` (`standard-14.ts:262`) only maps ASCII code points to glyph names. Any non-ASCII character (é, ñ, ü, etc.) falls through to return `"space"`, meaning `widthOfTextAtSize()` returns incorrect widths for accented text. This breaks text layout, line wrapping, and centering.

## Goals

- Accented Latin characters (á, é, ñ, ü, ß, €, etc.) render correctly with all Standard 14 fonts
- Symbol and ZapfDingbats fonts work correctly with their built-in encodings
- Text width measurement is correct for all WinAnsi characters
- Embedded fonts (Identity-H with GIDs) continue to work unchanged
- The content stream pipeline works with `Uint8Array` throughout, eliminating the UTF-8 round-trip
- Unencodable characters (CJK, emoji) produce `.notdef` by default with an option to throw

## Scope

### In scope

- Fix all four issues above
- Broad bytes-first refactor of the content stream pipeline (all callers move to bytes)
- Wire up WinAnsiEncoding for Standard 14 fonts (except Symbol/ZapfDingbats)
- Wire up SymbolEncoding and ZapfDingbatsEncoding for those two fonts
- Fix `getGlyphName()` to cover all WinAnsi non-ASCII glyph names
- Add tests for accented character rendering, width measurement, and all encoding paths

### Out of scope

- Custom encoding differences arrays
- Text extraction / parsing (already works correctly)

## Design

### The core insight

The content stream pipeline currently uses strings as an intermediate representation between operators and bytes. This is the fundamental problem — PDF content streams are binary, and shuttling them through JavaScript strings (which are UTF-16 internally) and then through UTF-8 TextEncoder/TextDecoder corrupts any non-ASCII bytes.

The fix makes the pipeline work with `Uint8Array` throughout, avoiding the string round-trip entirely. At the same time, we use `WinAnsiEncoding` (which already exists in the codebase but is only used for parsing) to properly encode text for Standard 14 fonts.

### Approach: bytes-first pipeline

The reporter's fix (converting string char-by-char via `charCodeAt & 0xFF`) works but is a band-aid that relies on JavaScript strings preserving Latin-1 byte values. Our approach is cleaner:

**1. `encodeTextForFont()` — use proper font encoding for all Standard 14 fonts**

Instead of `PdfString.fromString(text)` (PDFDocEncoding), select the correct encoding based on font name:

- **Helvetica, Times, Courier families** → `WinAnsiEncoding.instance`
- **Symbol** → `SymbolEncoding.instance`
- **ZapfDingbats** → `ZapfDingbatsEncoding.instance`

Call `encoding.encode(text)` to produce the correct byte values, then wrap in a hex-format `PdfString`. This properly handles the 0x80–0x9F range where PDFDocEncoding and WinAnsiEncoding differ.

**Unencodable characters**: By default, substitute with the `.notdef` glyph (byte 0x00). This matches PDF convention — the font's `.notdef` glyph typically renders as an empty box or blank space. Users who prefer a hard failure can pass an option to throw instead (see API below). The rationale: leniency by default matches the project's design principle of being tolerant, while the option gives strict users control.

**2. `appendContent()` / `appendOperators()` — broad bytes-first refactor**

Refactor the entire content pipeline to work with `Uint8Array`:

- `appendContent()` accepts `string | Uint8Array`. String inputs get `TextEncoder`'d (safe for ASCII-only callers like `drawPage` and `drawImage`). `Uint8Array` inputs pass through directly.
- `appendOperators()` uses `Operator.toBytes()` directly, concatenates into a `Uint8Array`, and passes bytes to `appendContent()`.
- `createContentStream()` gains a `Uint8Array` overload that skips the `TextEncoder` step.
- `prependContent()` gets the same `string | Uint8Array` treatment for consistency.
- `ContentAppender` type in `path-builder.ts` changes to `(content: string | Uint8Array) => void`.
- `PathBuilder.emitOps()` can migrate to bytes at its own pace — it only produces ASCII content (path operators, numbers), so the string path remains safe for it.

This is the principled fix: content streams are binary data, and the pipeline treats them as such. The `toString()` method on `Operator` remains for debugging/logging, but the serialization path uses `toBytes()`.

**3. `addFontResource()` — add `/Encoding` where appropriate**

| Font | `/Encoding` value | Reason |
| ---------------- | ----------------- | -------------------------------------------------------------------------------- |
| Helvetica family | `WinAnsiEncoding` | Explicit encoding ensures correct glyph mapping |
| Times family | `WinAnsiEncoding` | Same |
| Courier family | `WinAnsiEncoding` | Same |
| Symbol | _(omitted)_ | Uses built-in encoding; no valid `/Encoding` name exists per PDF spec Table 5.15 |
| ZapfDingbats | _(omitted)_ | Same as Symbol |

For Symbol and ZapfDingbats, `SymbolEncoding` / `ZapfDingbatsEncoding` are used only for Unicode → byte mapping in `encodeTextForFont()`. The font dict has no `/Encoding` entry because the PDF spec doesn't define named encodings for these fonts — their built-in encoding is implicit.

**4. Fix `getGlyphName()` for non-ASCII characters**

Extend the `CHAR_TO_GLYPH` map in `standard-14.ts` to cover all WinAnsi non-ASCII code points. The WinAnsiEncoding table maps Unicode code points to byte values, and the glyph width tables already have entries for all these glyphs (e.g., `eacute`, `ntilde`, `Euro`, `endash`, `Adieresis`). We need to bridge the gap: given a Unicode character, look up its glyph name so we can look up its width.

The approach: use `WinAnsiEncoding` to map Unicode → byte code, then use the Adobe Glyph List (already in `glyph-list.ts`) or a direct Unicode → glyph name mapping to find the glyph name. Alternatively, extend `CHAR_TO_GLYPH` with all the Latin-1 supplement and WinAnsi 0x80-0x9F entries directly.

### Why hex format for Standard 14 text?

Using hex format (`<E9>` instead of `(é)`) for Standard 14 font text strings is the most robust approach:

- **Defense-in-depth**: Even though the bytes pipeline is correct, hex format is immune to any future string-based manipulation of content streams
- **Precedent**: pdf-lib uses hex strings for all Standard 14 font text (see `StandardFontEmbedder.encodeText()`)
- **Simpler code**: No need to worry about escaping parentheses, backslashes, or non-ASCII bytes in literal strings
- **Trade-off**: Slightly larger output (2 hex chars per byte vs 1 byte in literal), but content streams are typically compressed anyway

### Changes summary

| File | Method/Area | Change |
| --------------------------------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `src/api/pdf-page.ts` | `encodeTextForFont()` | Use WinAnsi/Symbol/ZapfDingbats encoding + hex `PdfString`; `.notdef` substitution for unencodable chars |
| `src/api/pdf-page.ts` | `appendContent()` | Accept `string \| Uint8Array`; bytes pass through, strings get TextEncoder'd |
| `src/api/pdf-page.ts` | `prependContent()` | Same dual-type support |
| `src/api/pdf-page.ts` | `createContentStream()` | Accept `string \| Uint8Array`; skip TextEncoder for bytes |
| `src/api/pdf-page.ts` | `appendOperators()` | Use `Operator.toBytes()` directly, pass `Uint8Array` |
| `src/api/pdf-page.ts` | `addFontResource()` | Add `/Encoding WinAnsiEncoding` for non-Symbol/ZapfDingbats Standard 14 fonts |
| `src/api/drawing/path-builder.ts` | `ContentAppender` type | Change to `(content: string \| Uint8Array) => void` |
| `src/fonts/standard-14.ts` | `getGlyphName()` / `CHAR_TO_GLYPH` | Extend to cover all WinAnsi non-ASCII characters |

### Desired usage

From the user's perspective, nothing changes — the existing API just works:

```typescript
const page = pdf.addPage();

// Latin-1 accented characters work with Standard 14 fonts
page.drawText("Héllo café naïve résumé", {
font: "Helvetica",
x: 50,
y: 700,
size: 14,
});

// Characters in the 0x80-0x9F WinAnsi range also work
page.drawText("Price: €42 — "special" edition", {
font: "Times-Roman",
x: 50,
y: 650,
size: 14,
});

// Symbol and ZapfDingbats work with their own encodings
page.drawText("αβγδ", { font: "Symbol", x: 50, y: 600, size: 14 });

// Unencodable characters silently become .notdef (empty box) by default
page.drawText("Hello 世界", { font: "Helvetica", x: 50, y: 550, size: 14 });
// Renders: "Hello " followed by two empty boxes

// Width measurement is correct for accented text
const width = page.widthOfTextAtSize("café", "Helvetica", 12);
// Returns correct width using eacute glyph width, not space
```

## Test plan

### Rendering correctness

- Round-trip test: draw accented text ("café résumé naïve") with Helvetica, save, re-parse, extract text, verify it matches input
- Verify hex string encoding in content stream: `é` → byte `0xE9`, not UTF-8 `0xC3 0xA1`
- Test the full WinAnsi range including 0x80–0x9F characters (€, †, ‡, curly quotes, em dash, ellipsis)
- Test all Standard 14 font families (Helvetica, Times, Courier) with accented text

### Font dictionary

- Verify Helvetica/Times/Courier font dicts contain `/Encoding /WinAnsiEncoding`
- Verify Symbol font dict does **not** contain `/Encoding`
- Verify ZapfDingbats font dict does **not** contain `/Encoding`

### Symbol and ZapfDingbats

- Verify Symbol font correctly encodes Greek letters (α → correct Symbol byte)
- Verify ZapfDingbats correctly encodes decorative symbols

### Encoding edge cases

- Unencodable characters (CJK, emoji) produce `.notdef` byte (0x00) by default
- Embedded fonts continue to work unchanged (Identity-H path with GIDs)

### Width measurement

- `widthOfTextAtSize("é", "Helvetica", 1000)` returns `eacute` width (556), not `space` width (278)
- Width of "café" equals width of "caf" + width of "eacute" glyph
- Width correct for 0x80-0x9F characters (€ = Euro glyph width)

### Bytes pipeline

- Verify `appendContent(Uint8Array)` passes bytes through without TextEncoder transformation
- Verify `appendContent(string)` still works for ASCII content (drawImage, drawPage)
- PathBuilder operations still produce correct output

## Decisions made

1. **Bytes pipeline scope**: Broad — all content-producing paths move to `Uint8Array`, not just `appendOperators()`. The `ContentAppender` type becomes `string | Uint8Array` to allow gradual migration of callers.

2. **Hex vs literal format**: Always hex for Standard 14 text, as defense-in-depth. Even with the bytes pipeline fix, hex format provides immunity against any future string-based manipulation.

3. **Unencodable characters**: Default to `.notdef` glyph substitution (byte 0x00). The font's `.notdef` glyph typically renders as an empty box or blank. This is lenient-by-default per the project's design principles.

4. **Symbol and ZapfDingbats**: Wire up their proper encodings now. Use `SymbolEncoding.instance` and `ZapfDingbatsEncoding.instance` for Unicode → byte mapping. Omit `/Encoding` from the font dict (no valid named encoding exists per PDF spec Table 5.15 — the fonts use their built-in encoding implicitly).

5. **Width measurement**: Fix in this plan. Extend `CHAR_TO_GLYPH` in `standard-14.ts` to cover all WinAnsi non-ASCII characters. Without this, text layout (line wrapping, centering) would be broken for accented text even if rendering is fixed.
Loading