Optimize source map character offset calculation with correct UTF-16 counting#2719
Conversation
Replace O(n) utf8.RuneCountInString with fast ASCII check + correct UTF-16 counting. For ASCII-only lines (the common case), character offset is simply the byte offset difference, avoiding any scanning. For non-ASCII content, properly count UTF-16 code units (surrogate pairs for supplementary characters) instead of runes. Also add BenchmarkEmit to measure emit performance with source maps enabled. Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
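The approach in the description can be sketched as follows. This is an illustrative reconstruction, not the PR's exact code: the names `utf16Length` and `utf16LengthSlow` follow the diff snippets below, and the single-pass fallback (counting the ASCII prefix as `i` rather than rescanning) is an assumption.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf16Length returns the number of UTF-16 code units in s.
// Fast path: scan for any non-ASCII byte; if none is found, every
// byte is one code unit, so the answer is simply len(s).
func utf16Length(s string) int {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			// Non-ASCII byte found. The prefix s[:i] is pure ASCII
			// (i code units); count only the remainder the slow way.
			return i + utf16LengthSlow(s[i:])
		}
	}
	return len(s)
}

// utf16LengthSlow counts UTF-16 code units rune by rune, giving
// supplementary characters the two units of a surrogate pair.
func utf16LengthSlow(s string) int {
	n := 0
	for _, r := range s {
		if r >= 0x10000 {
			n += 2 // supplementary character: surrogate pair
		} else {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(utf16Length("hello")) // ASCII-only: 5
	fmt.Println(utf16Length("café"))  // é is 2 UTF-8 bytes but 1 UTF-16 unit: 4
	fmt.Println(utf16Length("a😀b"))  // 😀 needs a surrogate pair: 4
}
```

Note the contrast with `utf8.RuneCountInString`: a rune count would report 3 for `"a😀b"`, but the source map v3 format expects UTF-16 columns, so the emoji must count as 2.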
```go
if r >= 0x10000 {
	n += 2 // Supplementary character requires a surrogate pair.
} else {
	n++
}
```
Don't we have code like this somewhere?
It was in my weird regex parsing PR; I don't think this PR is even correct, though, because it's still O(n); I was hoping it'd generate a nice data structure.
That's what I figured as well, since decoding probably has a fast path anyway?
I think a cache for the previously-requested file and line of interest would be fine.
The gotcha is that each time the emitter does something, it updates the line map, which then goes back through this code (more or less)
```go
for i := 0; i < n; i++ {
	if s[i] >= utf8.RuneSelf {
		// Non-ASCII byte found; fall back to full UTF-16 counting.
		return utf16LengthSlow(s)
```
The problem with this is that you're already part of the way through the string, so if you have any non-ASCII on a long line, you have to go back and reinterpret all the previous contents.

The least invasive way to partially fix this is to keep things mostly as-is, but instead return `len(s[:i]) + utf16LengthSlow(s[i:])` so we don't reiterate over the beginning of the string.
I think a better way to do this would be either:
- Keep 2 separate counters: one for the number of non-ASCII units, one for the number of bytes skipped to decode the current rune. At the end, you return `len(s) + numNonAscii - numBytesSkipped`.
- Keep 1 counter `nonAsciiAdjustments` that just combines the two concepts above.
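The single-counter variant the reviewer suggests could look like this sketch. Only the name `nonAsciiAdjustments` comes from the comment above; everything else is an assumed illustration:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf16LengthOnePass counts UTF-16 code units in a single pass over s.
// For each non-ASCII rune it accumulates (UTF-16 units for the rune)
// minus (UTF-8 bytes consumed), so the result is len(s) plus the
// accumulated adjustment. The ASCII prefix is never re-read.
func utf16LengthOnePass(s string) int {
	nonAsciiAdjustments := 0
	for i := 0; i < len(s); {
		if s[i] < utf8.RuneSelf {
			i++ // ASCII byte: 1 byte, 1 code unit, no adjustment
			continue
		}
		r, size := utf8.DecodeRuneInString(s[i:])
		units := 1
		if r >= 0x10000 {
			units = 2 // supplementary character: surrogate pair
		}
		nonAsciiAdjustments += units - size
		i += size
	}
	return len(s) + nonAsciiAdjustments
}

func main() {
	fmt.Println(utf16LengthOnePass("ascii only")) // 10
	fmt.Println(utf16LengthOnePass("naïve"))      // 5 (ï: 2 bytes, 1 unit)
	fmt.Println(utf16LengthOnePass("𝄞 clef"))     // 7 (𝄞: 4 bytes, 2 units)
}
```

This keeps the scan single-pass even when non-ASCII appears late in a long line, which is exactly the re-reading problem the comment identifies.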
I'm going to close this, I think it's just not helpful in isolation.
`GetECMALineAndCharacterOfPosition` called `utf8.RuneCountInString` from line start to position for every source map mapping — O(n) per call, O(n×m) per file. This is the root cause of the reported 325s compile times. Additionally, rune count is wrong per the source map v3 spec, which requires UTF-16 code units.

Fix

For ASCII-only lines (the common case), return `len(s)` directly.

Benchmark

Added `BenchmarkEmit`, exercising the full emit path with source maps enabled against the TypeScript compiler source.

Original prompt