@waleedlatif1
Collaborator

@waleedlatif1 waleedlatif1 commented Jan 30, 2026

Summary

  • Fix multi-byte UTF-8 character corruption in SSE streaming (Turkish, emoji, CJK)
  • Pass { stream: true } to TextDecoder.decode() so the decoder keeps state across chunks
  • Add SSE message buffering for messages split across HTTP chunk boundaries
  • Add comprehensive tests for UTF-8 boundary conditions
  • Upgrade turborepo 2.7.4 → 2.8.0
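
To illustrate the corruption described above, here is a minimal sketch (not taken from the PR's code) that splits the two UTF-8 bytes of the Turkish letter "ş" across two decode calls. The variable names are illustrative only.

```typescript
// "ş" (U+015F) encodes to two bytes in UTF-8: 0xC5 0x9F.
const bytes = new TextEncoder().encode("ş");
const first = bytes.slice(0, 1);
const second = bytes.slice(1);

// Without streaming, each call is treated as a complete input, so the
// dangling lead byte and the orphaned continuation byte each decode to U+FFFD.
const oneShot = new TextDecoder();
const corrupted = oneShot.decode(first) + oneShot.decode(second);

// With { stream: true }, the decoder holds the incomplete sequence until
// the next call supplies the remaining byte.
const streaming = new TextDecoder();
const intact =
  streaming.decode(first, { stream: true }) +
  streaming.decode(second, { stream: true });

console.log(JSON.stringify(corrupted)); // two replacement characters
console.log(intact); // "ş"
```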

Fixes #3068

Type of Change

  • Bug fix

Testing

  • 25 new tests covering Turkish/CJK/emoji characters split at byte boundaries
  • All 3621 existing tests pass
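
The boundary tests themselves are not shown in this conversation. As a rough sketch of the idea, assuming a hypothetical helper `decodeInTwoChunks`, one could split each sample at every possible byte offset and check that streaming decoding reproduces the original string:

```typescript
// Hypothetical helper: decode `input` as two chunks split at `splitAt`,
// then flush the decoder with a final argument-less call.
function decodeInTwoChunks(input: Uint8Array, splitAt: number): string {
  const decoder = new TextDecoder();
  return (
    decoder.decode(input.slice(0, splitAt), { stream: true }) +
    decoder.decode(input.slice(splitAt), { stream: true }) +
    decoder.decode()
  );
}

// Samples with 2-byte (Turkish), 3-byte (CJK), and 4-byte (emoji) characters.
const samples = ["ğüşöçı", "日本語", "👋🌍"];
const failures: string[] = [];

for (const sample of samples) {
  const encoded = new TextEncoder().encode(sample);
  // Try every split point, including the trivial ones at 0 and at the end.
  for (let i = 0; i <= encoded.length; i++) {
    if (decodeInTwoChunks(encoded, i) !== sample) {
      failures.push(`${sample} split at byte ${i}`);
    }
  }
}

console.log(failures.length === 0 ? "all splits decoded intact" : failures);
```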

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel bot commented Jan 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
docs | Ready | Preview, Comment | Jan 30, 2026 7:29pm

@greptile-apps
Contributor

greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

This PR fixes a critical bug where multi-byte UTF-8 characters (Turkish characters, emojis, CJK characters) were being corrupted during SSE streaming when split across HTTP chunk boundaries.

Key Changes:

  • Added { stream: true } option to TextDecoder.decode() in sse.ts:66, enabling the decoder to maintain state across chunks and properly handle incomplete UTF-8 byte sequences
  • Implemented SSE message buffering using a buffer variable that accumulates partial SSE messages and splits on \n\n delimiters, keeping incomplete messages for the next chunk
  • Added logic to flush remaining bytes when the stream completes by calling decoder.decode() without arguments
  • Added comprehensive test suite with 25 tests covering various UTF-8 scenarios: Turkish characters, emojis, CJK characters split at 2/3/4-byte boundaries, and SSE message buffering edge cases
  • Upgraded turborepo from 2.7.4 to 2.8.0
  • Removed redundant comment in response-format.ts
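
The final flush matters when a stream ends mid-character. The sketch below is an illustration rather than the PR's code: it feeds only the first two bytes of the three-byte "€" and then calls decode() with no arguments, which ends the stream and surfaces the held bytes as U+FFFD instead of dropping them silently.

```typescript
// "€" (U+20AC) encodes to three bytes in UTF-8: 0xE2 0x82 0xAC.
const euroBytes = new TextEncoder().encode("€");
const flushDecoder = new TextDecoder();

// Only two of the three bytes arrive; in streaming mode nothing is emitted yet.
const held = flushDecoder.decode(euroBytes.slice(0, 2), { stream: true });

// Calling decode() without arguments signals end of input and flushes
// whatever the decoder was still holding.
const flushed = flushDecoder.decode();

console.log(JSON.stringify(held)); // ""
console.log(flushed.includes("\uFFFD")); // true
```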

The fix correctly handles both problems:

  1. UTF-8 byte boundary splits: The { stream: true } option tells TextDecoder to buffer incomplete multi-byte sequences internally
  2. SSE message boundary splits: The buffer keeps partial messages (e.g., "data: {"chu") until the rest arrives in the next chunk
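
The second point can be sketched as follows. The helper name `takeCompleteMessages` and the sample payloads are assumptions made for illustration; the actual code in sse.ts is not shown in this conversation.

```typescript
// Hypothetical helper: split the buffer on the SSE message delimiter and
// return the complete messages plus the trailing, possibly partial, remainder.
function takeCompleteMessages(buffer: string): { messages: string[]; rest: string } {
  const parts = buffer.split("\n\n");
  const rest = parts.pop() ?? "";
  return { messages: parts.filter((part) => part.length > 0), rest };
}

let buffer = "";
const seen: string[] = [];

// The first chunk ends in the middle of a JSON payload.
const incoming = ['data: {"chu', 'nk":"Merhaba"}\n\ndata: {"chunk":" dünya"}\n\n'];

for (const text of incoming) {
  buffer += text;
  const { messages, rest } = takeCompleteMessages(buffer);
  seen.push(...messages);
  buffer = rest; // keep the incomplete message for the next chunk
}

console.log(seen);
```

After the first chunk nothing is emitted, because no `\n\n` delimiter has arrived yet; both messages are emitted once the second chunk completes them.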

Confidence Score: 5/5

  • This PR is safe to merge - it fixes a real bug with a proper solution and comprehensive tests
  • The implementation correctly uses TextDecoder's streaming mode and adds SSE message buffering. The fix is minimal and focused, with 25 new tests covering edge cases, and all 3621 existing tests pass.
  • No files require special attention

Important Files Changed

Filename | Overview
apps/sim/lib/core/utils/sse.ts | Fixed multi-byte UTF-8 character corruption by adding { stream: true } to TextDecoder and implementing SSE message buffering
apps/sim/lib/core/utils/sse.test.ts | Comprehensive test suite with 25 tests covering UTF-8 boundary conditions and SSE message buffering

Sequence Diagram

sequenceDiagram
    participant Client
    participant readSSEStream
    participant TextDecoder
    participant Buffer
    participant Callbacks

    Client->>readSSEStream: body.getReader()
    readSSEStream->>TextDecoder: new TextDecoder()
    
    loop For each chunk
        readSSEStream->>readSSEStream: reader.read()
        
        alt Chunk received
            readSSEStream->>TextDecoder: decode(value, { stream: true })
            Note over TextDecoder: Maintains state for<br/>incomplete UTF-8 sequences
            TextDecoder-->>Buffer: Decoded text
            Buffer->>Buffer: Split by '\n\n'
            Buffer->>Buffer: Keep incomplete message
            
            loop For each complete SSE message
                Buffer->>Buffer: Extract "data: " content
                Buffer->>Buffer: JSON.parse(lineData)
                
                alt Valid chunk
                    Buffer->>Callbacks: onChunk(data.chunk)
                    Buffer->>Callbacks: onAccumulated(accumulatedContent)
                end
            end
        else Done
            readSSEStream->>TextDecoder: decode()
            Note over TextDecoder: Flush any remaining bytes
            TextDecoder-->>Buffer: Final text
        end
    end
    
    readSSEStream-->>Client: accumulatedContent
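
The flow in the diagram could be approximated by the sketch below. It is an assumption-laden reconstruction, not the contents of sse.ts: the function name `readSSEStreamSketch`, the `SSECallbacks` shape, and the handling of non-JSON payloads are all hypothetical, chosen to mirror the names that appear in the diagram.

```typescript
// Hypothetical callback shape, mirroring onChunk / onAccumulated in the diagram.
interface SSECallbacks {
  onChunk?: (chunk: string) => void;
  onAccumulated?: (accumulated: string) => void;
}

async function readSSEStreamSketch(
  body: ReadableStream<Uint8Array>,
  callbacks: SSECallbacks = {}
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let accumulatedContent = "";

  const handleMessage = (message: string): void => {
    for (const line of message.split("\n")) {
      if (!line.startsWith("data: ")) continue;
      const lineData = line.slice("data: ".length);
      try {
        const data = JSON.parse(lineData);
        if (typeof data.chunk === "string") {
          accumulatedContent += data.chunk;
          callbacks.onChunk?.(data.chunk);
          callbacks.onAccumulated?.(accumulatedContent);
        }
      } catch {
        // Assumption: payloads that are not valid JSON are skipped.
      }
    }
  };

  while (true) {
    const { done, value } = await reader.read();
    // Streaming decode for each chunk; an argument-less decode() flushes at the end.
    buffer += done ? decoder.decode() : decoder.decode(value, { stream: true });

    const messages = buffer.split("\n\n");
    // Keep the trailing partial message unless the stream has finished.
    buffer = done ? "" : messages.pop() ?? "";
    messages.forEach(handleMessage);

    if (done) break;
  }

  return accumulatedContent;
}
```

A caller would pass a response body, for example `await readSSEStreamSketch(response.body, { onChunk: console.log })`, assuming `response.body` is non-null.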

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@waleedlatif1 waleedlatif1 merged commit f7c3de0 into staging Jan 30, 2026
12 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/streaming branch January 30, 2026 19:39