Standard 14 fonts corrupt accented characters (á, é, ñ) due to UTF-8 encoding

## **Problem**

Normal usage of the library renders accented characters such as á, é, and ö incorrectly. It happens when drawing text with a Standard 14 font such as Helvetica (the one I tried first).


<img width="360" height="207" alt="Image" src="https://github.com/user-attachments/assets/1212aa48-6e0a-4f73-85f8-f4b1ae67d46b" />


The screenshot above shows the result. I also prepared a reproducer, linked below. The example text is written entirely in Spanish and uses several of the affected characters in various ways.

Reproducer: https://github.com/Ozmah/libpdf-latin1-reproducer

## **Proposed Solution**

I got it working by changing three places in `src/api/pdf-page.ts`:

### _- Line ~2086 Content Stream Byte Conversion_

Changed

```typescript
const bytes = new TextEncoder().encode(content);
``` 

To

```typescript
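// Emit one byte per UTF-16 code unit instead of UTF-8 encoding the string.
const bytes = new Uint8Array(content.length);
for (let i = 0; i < content.length; i++) {
  // Keep only the low 8 bits; code units above 0xFF are truncated.
  bytes[i] = content.charCodeAt(i) & 0xff;
}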
``` 

**_Reason for change:_** `TextEncoder().encode()` converts strings to UTF-8. For characters in the Latin-1 range (like á, é, ñ), UTF-8 produces multi-byte sequences (e.g., á becomes 0xC3 0xA1), but Standard 14 fonts expect WinAnsi encoding, where á is the single byte 0xE1. The viewer then reads each UTF-8 byte as a separate character (á shows up as "Ã¡"), which is the corruption visible in the screenshot.
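The difference is easy to check outside the library. The snippet below is only a standalone sketch; `sample`, `toHex`, and the other names are made up for illustration and are not part of the library.

```typescript
// "á" written as an escape so the source file encoding does not matter.
const sample = "\u00e1";

// UTF-8: code points above 0x7F become multi-byte sequences.
const utf8Bytes = Array.from(new TextEncoder().encode(sample));

// One byte per UTF-16 code unit, keeping only the low 8 bits.
const singleBytes = sample.split("").map((ch) => ch.charCodeAt(0) & 0xff);

// Illustrative helper: format bytes as space-separated hex.
const toHex = (bytes: number[]): string =>
  bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(toHex(utf8Bytes)); // c3 a1
console.log(toHex(singleBytes)); // e1
```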

### _- Lines ~2277-2281 Method: appendOperators()_

Changed

```typescript
private appendOperators(ops: Operator[]): void {
  const content = ops.map(op => op.toString()).join("\n");
  this.appendContent(content);
}
``` 

To

```typescript
private appendOperators(ops: Operator[]): void {
  const parts: string[] = [];
  for (const op of ops) {
    const bytes = op.toBytes();
    let str = "";
    for (let i = 0; i < bytes.length; i++) {
      str += String.fromCharCode(bytes[i]);
    }
    parts.push(str);
  }
  this.appendContent(parts.join("\n"));
}
``` 

**_Reason for change:_** `op.toString()` loses the original byte values when the operator is converted to a JavaScript string, which corrupts Latin-1 characters. Using `op.toBytes()` and rebuilding the string one byte at a time with `String.fromCharCode()` keeps each byte as a single character code, so the exact byte values are preserved.
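A small round trip shows why the byte-per-character string is safe to hand to the content stream conversion from the first change. This is a sketch under assumptions: `opBytes` is a hand-written stand-in for what `op.toBytes()` might return, not actual library output.

```typescript
// Stand-in for operator bytes: "(ma\xF1ana) Tj", with ñ as the single byte 0xF1.
const opBytes = new Uint8Array([
  0x28, 0x6d, 0x61, 0xf1, 0x61, 0x6e, 0x61, 0x29, 0x20, 0x54, 0x6a,
]);

// Bytes -> string: one UTF-16 code unit per byte.
const asString = Array.from(opBytes, (b) => String.fromCharCode(b)).join("");

// String -> bytes: low 8 bits of each code unit, mirroring the first change.
const roundTripped = Uint8Array.from(
  asString.split(""),
  (ch) => ch.charCodeAt(0) & 0xff,
);

// For contrast: UTF-8 encoding the same string expands ñ to two bytes.
const viaUtf8 = new TextEncoder().encode(asString);

console.log(roundTripped.every((b, i) => b === opBytes[i])); // true
console.log(opBytes.length, viaUtf8.length); // 11 12
```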

### _- Lines ~2313-2318 Method: addFontResource()_

Changed

```typescript
const fontDict = PdfDict.of({
  Type: PdfName.of("Font"),
  Subtype: PdfName.of("Type1"),
  BaseFont: PdfName.of(font),
});
``` 

To

```typescript
const fontDict = PdfDict.of({
  Type: PdfName.of("Font"),
  Subtype: PdfName.of("Type1"),
  BaseFont: PdfName.of(font),
  Encoding: PdfName.of("WinAnsiEncoding"),
});
``` 

**_Reason for change:_** Without an explicit `Encoding` entry, PDF viewers may assume different default encodings depending on the platform or implementation. Stating `WinAnsiEncoding` explicitly ensures that Latin-1 bytes such as 0xE1 consistently map to "á" across all viewers.
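One caveat for anyone picking this up: WinAnsi only lines up with Unicode code points for ASCII and the 0xA0 to 0xFF range. Characters such as € sit at different byte values, so the `& 0xff` mask in the first change would turn € (U+20AC) into 0xAC, which WinAnsi renders as ¬. A possible direction is a small lookup, sketched below. `winAnsiByte` and `WIN_ANSI_SPECIALS` are hypothetical names, and the table is a deliberately partial subset of the 0x80 to 0x9F range.

```typescript
// Partial map from Unicode code point to WinAnsi byte for characters
// outside Latin-1. Illustrative subset only, not the full table.
const WIN_ANSI_SPECIALS: Record<number, number> = {
  0x20ac: 0x80, // euro sign
  0x2026: 0x85, // horizontal ellipsis
  0x2018: 0x91, // left single quotation mark
  0x2019: 0x92, // right single quotation mark
  0x201c: 0x93, // left double quotation mark
  0x201d: 0x94, // right double quotation mark
  0x2022: 0x95, // bullet
  0x2013: 0x96, // en dash
  0x2014: 0x97, // em dash
};

// Hypothetical helper: WinAnsi byte for one UTF-16 code unit, or
// undefined when this sketch has no mapping for it.
function winAnsiByte(codeUnit: number): number | undefined {
  // ASCII and 0xA0-0xFF use the same value as the Unicode code point.
  if (codeUnit < 0x80 || (codeUnit >= 0xa0 && codeUnit <= 0xff)) {
    return codeUnit;
  }
  return WIN_ANSI_SPECIALS[codeUnit];
}

console.log(winAnsiByte("á".charCodeAt(0))); // 225 (0xE1)
console.log(winAnsiByte("€".charCodeAt(0))); // 128 (0x80)
```

Returning `undefined` for unmapped characters leaves it to the caller to decide whether to substitute a placeholder or raise an error, rather than silently writing a wrong byte.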

This is a solution I implemented, and it worked for my specific use case. I would have opened a PR, but I don't have enough time right now to add proper tests and check the integration in detail, since this touches a fair amount of the internal pipeline. Please feel free to comment if you need any additional information.

You can also contact me via DM on X [here](https://x.com/OzmahG). I've already talked with @catalinmpit in a previous post about a possible issue.
