Description
Problem
Normal usage of the library causes rendering problems with accented characters such as á, é, ö, etc. The specific scenario happens when using a Standard 14 font such as Helvetica (the first one I tried).
Attached is a screenshot, and below is a link to a reproducer I prepared; the example is written entirely in Spanish and uses several of the affected characters in various ways.
Reproducer: https://github.com/Ozmah/libpdf-latin1-reproducer
Proposed Solution
I got it working by changing three specific places in `src/api/pdf-page.ts`:
- Line ~2086 Content Stream Byte Conversion

Changed

```ts
const bytes = new TextEncoder().encode(content);
```

to

```ts
const bytes = new Uint8Array(content.length);
for (let i = 0; i < content.length; i++) {
  bytes[i] = content.charCodeAt(i) & 0xff;
}
```

Reason for change: `TextEncoder().encode()` converts strings to UTF-8. For characters in the Latin-1 range (like á, é, ñ), UTF-8 produces multi-byte sequences (e.g., á becomes `0xC3 0xA1`), but Standard 14 fonts expect WinAnsi encoding, where á is the single byte `0xE1`. The mismatch corrupts the PDF, producing garbled characters like the ones in the screenshot.
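To see the mismatch concretely, here is a small standalone sketch (not library code; `toLatin1Bytes` is a hypothetical helper name) comparing the UTF-8 bytes `TextEncoder` produces for "á" with the single byte the low-byte conversion yields:

```typescript
// UTF-8 expands "á" (U+00E1) into two bytes, 0xC3 0xA1.
const utf8 = new TextEncoder().encode("á");
console.log(Array.from(utf8).map(b => b.toString(16))); // ["c3", "a1"]

// Low-byte conversion: keep the low 8 bits of each UTF-16 code unit.
// For characters in the Latin-1 range this matches the single WinAnsi byte.
function toLatin1Bytes(s: string): Uint8Array {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    bytes[i] = s.charCodeAt(i) & 0xff;
  }
  return bytes;
}
console.log(toLatin1Bytes("á")[0].toString(16)); // "e1"
```

Note the low-byte trick is only safe for text restricted to U+0000–U+00FF; characters outside that range would be silently mangled.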
- Lines ~2277-2281 Method: appendOperators()

Changed

```ts
private appendOperators(ops: Operator[]): void {
  const content = ops.map(op => op.toString()).join("\n");
  this.appendContent(content);
}
```

to

```ts
private appendOperators(ops: Operator[]): void {
  const parts: string[] = [];
  for (const op of ops) {
    const bytes = op.toBytes();
    let str = "";
    for (let i = 0; i < bytes.length; i++) {
      str += String.fromCharCode(bytes[i]);
    }
    parts.push(str);
  }
  this.appendContent(parts.join("\n"));
}
```

Reason for change: `op.toString()` loses the original byte values when the string goes through JavaScript's internal encoding, corrupting Latin-1 characters in the process. Using `op.toBytes()` and rebuilding the string byte by byte with `String.fromCharCode()` preserves the exact byte values.
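The property this relies on is that a "binary string" built with `String.fromCharCode` round-trips every byte value 0–255 exactly. A minimal sketch with hypothetical helper names (not part of the library):

```typescript
// Convert raw bytes to a string where each char code equals one byte...
function bytesToBinaryString(bytes: Uint8Array): string {
  let s = "";
  for (let i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]);
  }
  return s;
}

// ...and back, taking the low byte of each UTF-16 code unit.
function binaryStringToBytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    out[i] = s.charCodeAt(i) & 0xff;
  }
  return out;
}

// Every possible byte value survives the round trip unchanged.
const all = new Uint8Array(256).map((_, i) => i);
const back = binaryStringToBytes(bytesToBinaryString(all));
console.log(back.every((b, i) => b === i)); // true
```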
- Lines ~2313-2318 Method: addFontResource()

Changed

```ts
const fontDict = PdfDict.of({
  Type: PdfName.of("Font"),
  Subtype: PdfName.of("Type1"),
  BaseFont: PdfName.of(font),
});
```

to

```ts
const fontDict = PdfDict.of({
  Type: PdfName.of("Font"),
  Subtype: PdfName.of("Type1"),
  BaseFont: PdfName.of(font),
  Encoding: PdfName.of("WinAnsiEncoding"),
});
```

Reason for change: without an explicit `Encoding` entry, PDF viewers may assume different default encodings depending on the platform or implementation. Stating `WinAnsiEncoding` explicitly ensures that Latin-1 bytes such as `0xE1` map consistently to "á" across all viewers.
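For reference, with this change the font resource written into the PDF should look roughly like the following (assuming Helvetica was requested; the exact serialization depends on the library's writer):

```
<< /Type /Font
   /Subtype /Type1
   /BaseFont /Helvetica
   /Encoding /WinAnsiEncoding >>
```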
This is a solution I implemented that worked for my specific use case. I would have opened a PR, but I don't have enough time right now to add proper tests and check the integration in detail, since this touches a fair amount of the internal pipeline. Please feel free to comment if you need any additional information.
You can also contact me via DM on X; I've previously engaged with @catalinmpit there in a post about a possible issue.