fix: sanitize malformed Unicode in MCP responses #39625

Open
furkankoykiran wants to merge 3 commits into microsoft:main from furkankoykiran:main

@furkankoykiran (Contributor) commented Mar 11, 2026

Summary

Fixes "invalid high surrogate in string" JSON serialization errors in MCP responses when page content contains malformed Unicode (lone surrogates).
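For readers unfamiliar with the failure mode, here is a minimal sketch of how a lone surrogate arises and what it looks like once serialized. The helper `hasLoneSurrogate` is hypothetical and not part of this PR. Which consumer raises the "invalid high surrogate" error is not stated here; the sketch only assumes that some stricter JSON parser downstream rejects the escaped half-pair.

```typescript
// U+1F600 is stored in UTF-16 as the surrogate pair D83D DE00.
const emoji = '\uD83D\uDE00';

// Slicing by code unit cuts the pair in half and leaves a lone high surrogate.
const lone = emoji.slice(0, 1);

// Hypothetical helper: reports whether a string contains an unpaired surrogate.
function hasLoneSurrogate(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      const next = text.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff)
        i++; // valid pair, skip the low half
      else
        return true;
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// JSON.stringify does not throw; it escapes the lone surrogate instead.
const json = JSON.stringify({ text: lone }); // {"text":"\ud83d"}
```

The serialized payload is syntactically valid JSON, but the `\ud83d` escape does not map to any Unicode scalar value, which is the kind of input a strict parser can refuse.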

Changes

  1. Add sanitizeUnicode() function (response.ts)

    • Uses String.prototype.toWellFormed() on Node 20+ for native Unicode sanitization
    • Falls back to manual surrogate replacement for Node 18 compatibility
    • Replaces lone surrogates with U+FFFD (replacement character)
  2. Integrate into response serialization

    • Applied alongside existing redactText() function in serialize() method
    • Sanitizes all outgoing MCP response text before JSON serialization
  3. Add comprehensive tests (unicode-serialization.spec.ts)

    • Lone high/low surrogates
    • Valid surrogate pairs (emoji, CJK)
    • Mixed content with malformed Unicode
    • Console messages with lone surrogates
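The first change above can be sketched roughly as follows. This is an illustration of the described approach, not the code from response.ts: the helper name `replaceLoneSurrogates` and the loop-based fallback are assumptions, and the actual fallback in the PR may be written differently (for example with a regular expression).

```typescript
const REPLACEMENT = '\uFFFD';

// Assumed fallback for runtimes without toWellFormed(): walk the UTF-16 code
// units and replace every surrogate that is not part of a valid pair.
function replaceLoneSurrogates(text: string): string {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      const next = text.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += text[i] + text[i + 1]; // keep the valid pair intact
        i++;
      } else {
        out += REPLACEMENT; // lone high surrogate
      }
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      out += REPLACEMENT; // lone low surrogate
    } else {
      out += text[i];
    }
  }
  return out;
}

function sanitizeUnicode(text: string): string {
  // The cast avoids depending on ES2024 string typings being enabled.
  const native = (text as unknown as { toWellFormed?: () => string }).toWellFormed;
  if (typeof native === 'function')
    return native.call(text);
  return replaceLoneSurrogates(text);
}
```

Both paths are meant to agree: `sanitizeUnicode('a\uD800b')` yields `'a\uFFFDb'`, while a valid pair such as `'\uD83D\uDE00'` passes through unchanged.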

Context

Closes microsoft/playwright-mcp#1447

The previous attempt (PR #1448 in playwright-mcp) was closed because it fixed the issue at the transport layer in cli.js. The right place for the fix is response.ts, where text processing already happens via redactText().
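To illustrate why the response layer is a natural place for the fix, here is a rough sketch of sanitizing each text part next to redaction, before JSON serialization. Everything in it is hypothetical: the `TextPart` shape, the `redactText` stand-in, and the `serialize` signature are simplified placeholders, not the actual code in response.ts. It assumes Node 20+, where `toWellFormed()` is available.

```typescript
// Hypothetical shape of one text entry in an MCP response.
type TextPart = { type: 'text'; text: string };

// Stand-in for the existing redactText(): masks known secret values.
function redactText(text: string, secrets: string[]): string {
  let result = text;
  for (const secret of secrets) {
    if (secret)
      result = result.split(secret).join('<redacted>');
  }
  return result;
}

// Node 20+ only; the cast avoids depending on ES2024 string typings.
function sanitizeUnicode(text: string): string {
  return (text as unknown as { toWellFormed(): string }).toWellFormed();
}

// Hypothetical serialize(): every outgoing text part is redacted and then
// sanitized, so nothing malformed reaches JSON.stringify.
function serialize(parts: string[], secrets: string[]): string {
  const content = parts.map((part): TextPart => ({
    type: 'text',
    text: sanitizeUnicode(redactText(part, secrets)),
  }));
  return JSON.stringify({ content });
}
```

Handling it here means every text part goes through one code path, whereas a transport-level fix would have to patch the payload after it has already been assembled.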

Verification

npm run ctest-mcp unicode-serialization
# 6 passed (35.3s)

All tests pass, confirming that:

  • MCP responses don't fail with JSON serialization errors
  • Lone surrogates are replaced with U+FFFD
  • Valid surrogate pairs (emoji, CJK) are preserved
  • Normal text remains unchanged

@pavelfeldman (Member) commented

Also looks like agent-generated blob that fails lint. Better place to handle it though. Assume Node 20 and hand-craft the tests please.

Add sanitizeUnicode() function to replace lone surrogates with U+FFFD
before JSON serialization. This prevents "invalid high surrogate in string"
errors when page content contains malformed Unicode.

Uses String.prototype.toWellFormed() on Node 20+, with fallback for
Node 18 compatibility. Integrated into response serialization pipeline
alongside existing redactText() function.

Add comprehensive tests for malformed Unicode handling in MCP responses:
- Lone high surrogates
- Lone low surrogates
- Valid surrogate pairs (emoji)
- Mixed CJK content with malformed Unicode
- Multiple consecutive lone surrogates
- Console messages with lone surrogates

All tests verify that MCP responses don't fail with JSON serialization
errors when encountering malformed Unicode from page content.

- Remove Node 18 fallback, use toWellFormed() only
- Rewrite tests to match MCP test patterns (3 focused tests)
- Use server.setContent() for simpler test setup
- Reduce test complexity while maintaining coverage

@furkankoykiran (Contributor, Author) commented

Thanks for the feedback. I've simplified the changes based on your suggestions:

Changes made:

  1. Removed Node 18 fallback - The sanitizeUnicode() function now only uses String.prototype.toWellFormed() as requested.

  2. Rewrote tests by hand - Reduced from 6 tests to 3 focused tests that match the style of existing MCP tests:

    • Uses server.setContent() for simpler setup
    • Cleaner, more idiomatic test structure
    • Removed excessive assertions
  3. All tests pass - Confirmed with npm run ctest-mcp unicode-serialization

    • handles lone surrogates in page content
    • preserves valid emoji and surrogate pairs
    • handles console messages with lone surrogates

The branch has also been rebased onto latest main and the PR is now up to date.

Development

Successfully merging this pull request may close these issues.

Fix malformed Unicode in outgoing MCP responses
