Standard 14 fonts corrupt accented characters (á, é, ñ) due to UTF-8 encoding #7

@Ozmah

Problem

Normal usage of the library corrupts accented characters such as á, é, and ö when they are rendered. This happens when using a Standard 14 font such as Helvetica (the one I tried first).

I've attached a screenshot and a link to a reproducer I prepared. The example is written entirely in Spanish and uses several of the affected characters in various ways.

Reproducer: https://github.com/Ozmah/libpdf-latin1-reproducer

Proposed Solution

I got it working by changing three places in src/api/pdf-page.ts:

- Line ~2086 Content Stream Byte Conversion

Changed

const bytes = new TextEncoder().encode(content);

To

const bytes = new Uint8Array(content.length);
for (let i = 0; i < content.length; i++) {
  bytes[i] = content.charCodeAt(i) & 0xff;
}

Reason for change: TextEncoder().encode() converts strings to UTF-8. For characters in the Latin-1 range (like á, é, ñ), UTF-8 produces multi-byte sequences (e.g., á becomes 0xC3 0xA1), but Standard 14 fonts expect WinAnsi encoding, where á is the single byte 0xE1. This corrupts the text in the PDF, producing characters like the ones in the screenshot.
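To make the byte difference concrete, here is a small standalone sketch that is independent of the library. The helper names `toLatin1Bytes` and `toHex` are made up for illustration; the loop is the same one as in the proposed change.

```typescript
// Compare UTF-8 encoding with one-byte-per-code-unit encoding for "á" (U+00E1).
const sample = "\u00e1";

// What TextEncoder produces: UTF-8, two bytes for this character.
const utf8Bytes = Array.from(new TextEncoder().encode(sample));

// Hypothetical helper mirroring the proposed fix: keep the low byte of each code unit.
function toLatin1Bytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    out[i] = s.charCodeAt(i) & 0xff;
  }
  return out;
}
const latin1Bytes = Array.from(toLatin1Bytes(sample));

// Format a byte list as lowercase hex pairs.
function toHex(bytes: number[]): string {
  return bytes.map((b) => ("0" + b.toString(16)).slice(-2)).join(" ");
}

console.log(toHex(utf8Bytes));   // c3 a1
console.log(toHex(latin1Bytes)); // e1
```

A viewer that reads the two UTF-8 bytes through a single-byte encoding shows two glyphs instead of one, which matches the corruption described above.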

- Lines ~2277-2281 Method: appendOperators()

Changed

private appendOperators(ops: Operator[]): void {
  const content = ops.map(op => op.toString()).join("\n");
  this.appendContent(content);
}

To

private appendOperators(ops: Operator[]): void {
  const parts: string[] = [];
  for (const op of ops) {
    const bytes = op.toBytes();
    let str = "";
    for (let i = 0; i < bytes.length; i++) {
      str += String.fromCharCode(bytes[i]);
    }
    parts.push(str);
  }
  this.appendContent(parts.join("\n"));
}

Reason for change: op.toString() loses the original byte values when the string goes through JavaScript's internal encoding, corrupting Latin-1 characters in the process. Using op.toBytes() and rebuilding the string piece by piece with String.fromCharCode() ensures the exact byte values are preserved.
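The round trip this change relies on can be checked in isolation. The sketch below does not use the library's `Operator` class. It starts from a hand-written byte array standing in for what `op.toBytes()` might return, and the helper names are hypothetical.

```typescript
// Stand-in for operator bytes: "(á) Tj" with á stored as the single byte 0xE1.
const operatorBytes = Uint8Array.from([0x28, 0xe1, 0x29, 0x20, 0x54, 0x6a]);

// bytes -> string, one UTF-16 code unit per byte (as in the proposed appendOperators).
function bytesToBinaryString(bytes: Uint8Array): string {
  let s = "";
  for (let i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]);
  }
  return s;
}

// string -> bytes, one byte per code unit (as in the proposed content stream conversion).
function binaryStringToBytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    out[i] = s.charCodeAt(i) & 0xff;
  }
  return out;
}

const asString = bytesToBinaryString(operatorBytes);
const roundTripped = binaryStringToBytes(asString);
const viaUtf8 = new TextEncoder().encode(asString);

console.log(roundTripped.length); // 6, identical to the input bytes
console.log(viaUtf8.length);      // 7, because 0xE1 became 0xC3 0xA1
```

Note that the round trip only holds when both halves are applied together: the string produced here must be turned back into bytes with the per-code-unit loop, not with TextEncoder.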

- Lines ~2313-2318 Method: addFontResource()

Changed

const fontDict = PdfDict.of({
  Type: PdfName.of("Font"),
  Subtype: PdfName.of("Type1"),
  BaseFont: PdfName.of(font),
});

To

const fontDict = PdfDict.of({
  Type: PdfName.of("Font"),
  Subtype: PdfName.of("Type1"),
  BaseFont: PdfName.of(font),
  Encoding: PdfName.of("WinAnsiEncoding"),
});

Reason for change: Without an explicit encoding entry, PDF viewers may assume different default encodings depending on the platform or implementation. Stating WinAnsiEncoding explicitly ensures that Latin-1 bytes such as 0xE1 consistently map to "á" across all viewers.
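One caveat worth keeping in mind: WinAnsiEncoding matches Latin-1 for 0xA0-0xFF, but it places some characters (for example the Euro sign at 0x80) in the 0x80-0x9F range, and their Unicode code points are above 0xFF. Masking with `& 0xff` would turn U+20AC into 0xAC rather than 0x80. The following is a minimal sketch of a stricter encoder, not part of the library. `encodeWinAnsi` and `WIN_ANSI_EXTRA` are hypothetical names, and the table is deliberately partial.

```typescript
// Partial map from Unicode code points to WinAnsiEncoding bytes in 0x80-0x9F.
// Only a few common entries are listed; the PDF specification has the full table.
const WIN_ANSI_EXTRA: Record<number, number> = {
  0x20ac: 0x80, // Euro sign
  0x2026: 0x85, // horizontal ellipsis
  0x2018: 0x91, // left single quote
  0x2019: 0x92, // right single quote
  0x201c: 0x93, // left double quote
  0x201d: 0x94, // right double quote
  0x2022: 0x95, // bullet
  0x2013: 0x96, // en dash
  0x2014: 0x97, // em dash
  0x2122: 0x99, // trademark sign
};

// Hypothetical helper: encode a string to WinAnsi bytes, one byte per UTF-16 code unit.
// Anything this sketch cannot map becomes "?" (0x3F) instead of being silently truncated.
function encodeWinAnsi(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    const mapped = WIN_ANSI_EXTRA[code];
    if (code < 0x80 || (code >= 0xa0 && code <= 0xff)) {
      out[i] = code; // ASCII and the Latin-1 upper range share their byte values
    } else if (mapped !== undefined) {
      out[i] = mapped;
    } else {
      out[i] = 0x3f;
    }
  }
  return out;
}

console.log(encodeWinAnsi("\u00e1")[0].toString(16)); // e1
console.log(encodeWinAnsi("\u20ac")[0].toString(16)); // 80
```

For the Spanish text in the reproducer, the simpler `& 0xff` loop and this encoder produce the same bytes, since all of those characters are in the 0xA0-0xFF range.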

This is a solution that I implemented and that worked for my specific use case. I would've opened a PR, but I don't have enough time right now to add proper tests and check the integration in detail, since this touches a fair amount of the internal pipeline. Please feel free to comment if you need any additional information.

You can also contact me via DM on X. I've already engaged with @catalinmpit in a previous post about a possible issue.
