Event loop blocking when parsing large binary data as HTML #13

@TimDaub

Description

When open-graph-scraper-lite is given large amounts of binary data (e.g., PDF files) to parse as HTML, it blocks the Node.js event loop for extended periods, making the server unresponsive. This can happen when a URL is expected to return HTML but actually returns binary content like PDFs.

Steps to Reproduce

import ogs from 'open-graph-scraper-lite';

// Create 5MB of binary data (simulating PDF content)
const buffer = Buffer.alloc(5 * 1024 * 1024);
buffer.write('%PDF-1.4\n', 0); // PDF header
for (let i = 10; i < buffer.length; i++) {
  buffer[i] = Math.floor(Math.random() * 256);
}

// Convert to string (this happens when binary is treated as text)
const html = buffer.toString('utf8');

// This will block the event loop for several seconds
console.time('ogs-parse');
try {
  await ogs({ html });
} catch (err) {
  console.error('Parse error:', err.message);
}
console.timeEnd('ogs-parse');

Observed Behavior

Testing with different file sizes shows severe blocking:

  • 100KB binary: ~148ms blocking
  • 1MB binary: ~914ms blocking
  • 5MB binary: ~4.5 seconds blocking
  • 10MB binary: ~6.8 seconds blocking

During this time, the event loop is completely blocked (0% responsiveness), preventing the server from handling any other requests.
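Because the parse is fully synchronous, the blocking time equals the wall-clock time of the call. A minimal harness for reproducing the numbers above (generic timing code, not part of this library's API):

```javascript
// Measure how long a synchronous call blocks the event loop, in milliseconds.
// Any work done inside fn() is time during which no other request can be served.
function measureSyncBlocking(fn) {
  const start = process.hrtime.bigint();
  fn();
  return Number(process.hrtime.bigint() - start) / 1e6; // ns -> ms
}

// Example: measureSyncBlocking(() => JSON.parse(bigString));
```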

Real-world Impact

This issue was discovered when our server became completely unresponsive after attempting to fetch metadata from a URL that returned an 82MB PDF file instead of HTML. The server had to be manually restarted after the problematic entry was removed from the cache.

Expected Behavior

The library should:

  1. Fail fast when detecting binary/non-HTML content
  2. Not block the event loop for extended periods
  3. Have reasonable limits on input size or processing time
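Until such a limit exists in the library, a caller-side guard can at least cap the damage. A minimal sketch, assuming an arbitrary 1 MB cap (the limit and function name are illustrative, not ogs defaults):

```javascript
// Hypothetical caller-side guard: refuse to hand oversized payloads to the parser.
// MAX_HTML_BYTES is an arbitrary assumption, not a library default.
const MAX_HTML_BYTES = 1 * 1024 * 1024; // 1 MB

function assertParseableHtml(html) {
  const bytes = Buffer.byteLength(html, 'utf8');
  if (bytes > MAX_HTML_BYTES) {
    throw new Error(`Refusing to parse ${bytes} bytes as HTML (limit: ${MAX_HTML_BYTES})`);
  }
  return html;
}
```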

Suggestions

Some potential solutions:

  1. Add early detection for binary content (e.g., check for PDF headers, null bytes, or high proportion of non-printable characters)
  2. Implement size limits with early bailout
  3. Add timeout mechanisms to prevent indefinite blocking
  4. Use worker threads or other async processing for large inputs
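For suggestion 1, a cheap prefix sniff would likely catch most cases before any parsing happens. A rough sketch (the function name, magic-byte list, and thresholds are made up for illustration):

```javascript
// Illustrative binary-content sniff: checks known magic bytes, null bytes,
// and the proportion of control characters in a small prefix of the input.
function looksBinary(input, sampleSize = 1024) {
  const buf = Buffer.isBuffer(input) ? input : Buffer.from(input, 'utf8');
  const sample = buf.subarray(0, sampleSize);
  if (sample.length === 0) return false;

  // Common non-HTML magic numbers: PDF, gzip, PNG, ZIP.
  const magics = ['%PDF', '\x1f\x8b', '\x89PNG', 'PK\x03\x04'];
  const head = sample.toString('latin1');
  if (magics.some((m) => head.startsWith(m))) return true;

  // Null bytes never appear in valid HTML text.
  if (sample.includes(0)) return true;

  // A high ratio of control bytes (outside tab/newline/CR) suggests binary data.
  let control = 0;
  for (const b of sample) {
    if (b < 9 || (b > 13 && b < 32)) control++;
  }
  return control / sample.length > 0.1;
}
```

Running a check like this over only the first kilobyte keeps the cost constant regardless of input size, so even an 82 MB PDF would be rejected in microseconds.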

Environment

  • Node.js version: 20.x
  • open-graph-scraper-lite version: 4.0.3
  • OS: macOS/Linux

Test Case

I've created a test file that demonstrates this issue: https://gist.github.com/TimDaub/54a57bf2c4eaaf8003dcc0ef9396b34f

The test shows how parsing binary data causes complete event loop blocking, making the server unresponsive to all other requests.

Thank you for maintaining this library! Happy to provide more details or help with testing potential fixes.
