feat: CLI telemetry for all commands + crash reporting #122
…amps, and new event types

Add environment fingerprint fields (OS, Node version, CI detection, shell) to `session.start` and `session.end` events. Add `startTimestamp` to step and `agent.tool` events for span reconstruction. Define command and crash event types with stub emission methods. Add discriminated-union Zod schema validation tests mirroring the API schema.
… persistence

Wire up yargs middleware that emits a provisional command event before each handler runs, then replaces it with actual duration/success on completion. This covers the ~25 `process.exit()` call sites without modifying them.

- Command telemetry middleware with canonical name resolution and flag extraction
- Crash reporter with sanitized stack traces (sync handlers, no async)
- Store-forward: persist unsent events to a temp file on exit, recover on next run
- Fix `flush()` to retain events until HTTP success (was clearing before fetch)
- Auto-wrap handlers in `registerSubcommand()` (single change point)
- Shared `COMMAND_ALIASES` map for telemetry and help-json
- `analytics.initForNonInstaller()` sets gatewayUrl + JWT from stored creds
- Enable debug output for non-installer commands via env var
- Log telemetry event details (type, name, duration, attributes) on flush
- Register in the debug command's env var catalog
- Wrap inline command handlers (seed, setup-org, doctor, etc.) with `wrapCommandHandler` so they report real duration/success
- Skip the provisional telemetry event for the install command (it has its own session telemetry)
- Add `claim` -> `env.claim` to the canonical alias map
- Defer store-forward file deletion until after a successful flush
Client errors (401, 403) are permanent failures that won't succeed on retry. Only retain events for 5xx (transient server errors) and network failures where store-forward retry is meaningful.
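The retention rule above can be sketched as a small classifier (the helper name is illustrative; the real logic lives inside the telemetry client's flush path):

```typescript
// Hypothetical classifier mirroring the retention rule described above:
// 2xx -> delivered, 4xx -> dropped permanently, 5xx or no response -> retained for retry.
type FlushOutcome = 'sent' | 'dropped' | 'retry';

function classifyFlushResult(status: number | null): FlushOutcome {
  if (status === null) return 'retry'; // network failure: no response at all
  if (status >= 200 && status < 300) return 'sent';
  if (status >= 400 && status < 500) return 'dropped'; // 401/403 won't succeed on retry
  return 'retry'; // 5xx: transient server error, keep for store-forward
}
```

Under this rule, a 401 drops the batch immediately while a 503 leaves it queued for the next invocation's store-forward pass.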
- `flush()` returns true (sent/dropped) or false (retryable) so callers can act on the result
- Use `splice(0, count)` instead of clearing all events, protecting events queued concurrently during the fetch
- `wrapCommandHandler` flushes in-process so events are sent immediately instead of always deferring to the next invocation via store-forward
- Store-forward recovery deletes files after loading into memory (events are re-persisted by the exit handler if flush fails)
- Skip provisional events for dashboard and $0 (installer entry points)
- Add 4xx drop test coverage
Add a section to CLAUDE.md explaining which commands auto-emit telemetry (registerSubcommand) versus which need manual wrapCommandHandler wrapping (inline top-level .command() calls). Add a pointer comment in bin.ts near the workflow commands block. Prevents new top-level commands from silently emitting duration=0 telemetry.
- Add `workos.user_id` to command and crash events (from stored credentials or unclaimed environment clientId) so dashboards can count unique users
- Add `cli.version` to command and crash events for release adoption tracking
- Support claim-token auth path on the telemetry client, so unclaimed environments' telemetry reaches the API (guard accepts this path too)
- Rename CrashEvent's `installer.version` to `cli.version` (crashes happen outside the installer too)
- `initForNonInstaller()` now wires up user_id and claim-token auth
Closes security-audit finding #1 on PR #122 (telemetry message sanitization). `error.message` was flowing into 4 capture sites unsanitized, leaking homedir paths (and rarely, credentials) to the WorkOS gateway.

- Add `sanitizeMessage()` in crash-reporter.ts: homedir strip + Bearer/sk_*/JWT redaction + 1KB truncation.
- Factor secret redaction into shared `redactSecrets()` used by both `sanitizeMessage` and `sanitizeStack` (Node echoes `.message` into the leading `Error.stack` line, so message-only sanitization was insufficient).
- Add private `extractErrorFields()` chokepoint on `Analytics`; route all 4 capture sites through it (`captureException`, `stepCompleted`, `commandExecuted`, `captureUnhandledCrash`). `replaceLastCommandEvent` inherits sanitization via its delegation to `commandExecuted`.
- `captureUnhandledCrash` now uses `sanitizeStack` instead of inline truncation, providing defense-in-depth for callers that bypass the crash-reporter wrapper.
- Add regression guard test (`telemetry-sanitize.spec.ts`): poisons every capture method with homedir + Bearer + sk_live_ + JWT, asserts no marker reaches the serialized queue.

Reviewed: ideation:reviewer cycle 1 PASS (0 critical, 0 high).
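A toy version of the layered redaction described above (the patterns here are assumptions for illustration; the real `redactSecrets` in crash-reporter.ts may differ):

```typescript
// Illustrative redaction pass: Bearer tokens, sk_* keys, and JWT-shaped
// three-segment tokens are masked before any text leaves the process.
function redactSecretsSketch(text: string): string {
  return text
    .replace(/Bearer\s+[A-Za-z0-9._~+/=-]+/g, 'Bearer [REDACTED]')
    .replace(/\bsk_[A-Za-z0-9_]+/g, 'sk_[REDACTED]')
    .replace(/\b[\w-]+\.[\w-]+\.[\w-]+\b/g, '[JWT_REDACTED]');
}
```

Running the same pass over both the message and the stack matters because, as noted above, Node copies `.message` into the first line of `Error.stack`.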
Introduces two additive telemetry signals requested by the signals spec:
- `device.id` — persistent per-install UUID stored at `~/.workos/device-id`. File IO failures fall through to a one-shot session UUID; never throws.
- `auth.mode` — 4-state enum ('jwt' | 'claim_token' | 'api_key' | 'none') derived during `initForNonInstaller` with JWT > claim_token > api_key priority. Installer flow sets it in run-with-core after credentials resolve.
Both fields are injected via getEnvFingerprint so they appear on every
event that already carries env context (session.start, session.end,
command, crash). No change to downstream gateway — existing deliveries
accept the new attrs as optional.
Implements Phase 1 of docs/ideation/telemetry-signals/spec-phase-1.md.
Gateway-side type mirroring is tracked separately in the gateway repo.
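The `device.id` read-or-create behavior described above can be sketched roughly like this (the demo path and helper name are illustrative; the real implementation stores the file at `~/.workos/device-id`):

```typescript
import fs from 'node:fs';
import path from 'node:path';
import os from 'node:os';
import crypto from 'node:crypto';

const UUID_V4ISH = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Read the persisted ID if valid, otherwise create one; on any file IO
// failure fall back to a one-shot per-session UUID instead of throwing.
function getDeviceIdSketch(filePath: string): string {
  try {
    if (fs.existsSync(filePath)) {
      const raw = fs.readFileSync(filePath, 'utf8').trim();
      if (UUID_V4ISH.test(raw)) return raw;
    }
    const id = crypto.randomUUID();
    fs.mkdirSync(path.dirname(filePath), { recursive: true, mode: 0o700 });
    fs.writeFileSync(filePath, id, { encoding: 'utf8', mode: 0o600 });
    return id;
  } catch {
    return crypto.randomUUID(); // session-only fallback; never throws
  }
}

const demoPath = path.join(os.tmpdir(), 'workos-device-id-demo');
```

Two calls against the same path return the same ID; a read failure degrades to a fresh session UUID rather than breaking the command.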
…and events

Replace the boolean `command.success` as the primary outcome dimension with a structured `termination.reason` enum (success | cancelled | auth_required | validation_error | api_error | crash) plus an `error.code` string.

- `exitWithError`, `exitWithCode`, `exitWithAuthRequired` now patch the provisional command event via `analytics.recordTermination` before calling `process.exit`. Previously the queued provisional event (success true, duration 0) was persisted as-is by store-forward, producing misleading dashboard data.
- `exitWithError` now honors the string error code for exit mapping: `auth_required` -> 4, `cancelled` -> 2, `http_*`/`not_found`/`unknown_error` -> 1 (api_error), everything else -> 1 (validation_error). Previously it always exited 1.
- New `TelemetryClient.patchLastEventOfType` helper mutates a queued event in place. Unlike `replaceLastEventOfType`, it preserves the event so multiple callers can update fields incrementally.
- `wrapCommandHandler` records `reason: 'success'` on clean completion and `reason: 'crash'` with `error.name` on uncaught throw.
- `command.success` remains on CommandEvent for backward compat.
- `api.status`/`api.code`/`api.resource` fields added to CommandEvent and plumbed through `recordTermination` for Phase 3 consumption.

Review passed (2 medium + 1 low non-blocking; low fixed in same commit for crash-path test symmetry).
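The error-code to exit-code mapping above can be sketched as a small pure function (names are illustrative; the real `resolveErrorCode` lives in exit-codes.ts):

```typescript
type TerminationReason =
  | 'success' | 'cancelled' | 'auth_required' | 'validation_error' | 'api_error' | 'crash';

// Mirrors the mapping described in the commit message:
// auth_required -> 4, cancelled -> 2, http_*/not_found/unknown_error -> 1 (api_error),
// everything else -> 1 (validation_error).
function resolveErrorCodeSketch(code: string): { reason: TerminationReason; exit: number } {
  if (code === 'auth_required') return { reason: 'auth_required', exit: 4 };
  if (code === 'cancelled') return { reason: 'cancelled', exit: 2 };
  if (code.startsWith('http_') || code === 'not_found' || code === 'unknown_error') {
    return { reason: 'api_error', exit: 1 };
  }
  return { reason: 'validation_error', exit: 1 };
}
```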
…mmand events

Wires `createApiErrorHandler` through to the provisional command event via an optional `apiContext` param on `exitWithError`, populating `api.status`, `api.code`, and `api.resource` attributes on API-failure command events.

Also treats `apiContext` presence as ground truth for the `api_error` termination reason — WorkOS error codes like `rate_limited` or `validation_error` would otherwise fall through `resolveErrorCode` to `validation_error`, hiding legitimate API failures from api_error dashboards.

The `CommandEvent` type already had the `api.*` attribute slots (added as forward-compat in Phase 2); only the wiring and tests are new.
📝 Walkthrough

This PR adds comprehensive telemetry infrastructure to the CLI, enabling command execution tracking, error context capture, device identification, crash reporting, and event persistence through transient network failures. It defines new telemetry event types for commands and crashes, instruments the Analytics class with command and auth-mode tracking, wires telemetry into command execution via middleware and handler wrapping, and integrates store-and-forward recovery into the CLI bootstrap.

Changes: Telemetry Infrastructure & Command Tracking
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 9
🧹 Nitpick comments (4)
src/utils/telemetry-types.ts (1)
58-58: ⚡ Quick win — Align `startTimestamp` optionality with backward-compat contract.

`telemetry-schema.spec.ts` explicitly accepts `step`/`agent.tool` without `startTimestamp`, but these interfaces currently require it. Making both optional avoids contract drift.

Suggested patch:

```diff
 export interface StepEvent extends TelemetryEvent {
   type: 'step';
   name: string;
-  startTimestamp: string;
+  startTimestamp?: string;
   durationMs: number;
   success: boolean;
   error?: {
@@
 export interface AgentToolEvent extends TelemetryEvent {
   type: 'agent.tool';
   toolName: string;
-  startTimestamp: string;
+  startTimestamp?: string;
   durationMs: number;
   success: boolean;
 }
```

Also applies to: 70-70
src/utils/telemetry-client.spec.ts (1)
60-105: ⚡ Quick win — Add coverage for claim-token auth headers.

`setClaimTokenAuth()` is new behavior in this PR, but there's no test asserting `x-workos-claim-token` + `x-workos-client-id` emission (and bearer-token precedence when both exist).

Also applies to: 124-218
src/utils/telemetry-store-forward.ts (1)
1-3: ⚡ Quick win — Use async fs APIs for startup recovery path.

`recoverPendingEvents()` (line 25+) is async but uses sync disk calls (lines 27-56), which blocks startup and conflicts with the project rule for TS files. As per coding guidelines: "Avoid Node-specific sync APIs (crypto, fs sync) unless necessary".

Also applies to: 27-56
src/utils/output.ts (1)
123-124: ⚡ Quick win — Protect process termination from telemetry failures.

`analytics.recordTermination(...)` on line 123 can throw and prevent line 124 from executing. Error exits should remain deterministic even when telemetry is unhealthy.

🔧 Proposed fix

```diff
   const reason = error.apiContext ? 'api_error' : codeReason;
-  analytics.recordTermination(reason, error.code, error.apiContext);
-  process.exit(exit);
+  try {
+    analytics.recordTermination(reason, error.code, error.apiContext);
+  } finally {
+    process.exit(exit);
+  }
 }
```
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c45afda9-659c-43a0-9646-cd08dd0cabe3
📒 Files selected for processing (27)
- CLAUDE.md
- src/bin.ts
- src/commands/debug.ts
- src/lib/api-error-handler.spec.ts
- src/lib/api-error-handler.ts
- src/lib/command-aliases.ts
- src/lib/device-id.spec.ts
- src/lib/device-id.ts
- src/lib/run-with-core.ts
- src/utils/analytics.spec.ts
- src/utils/analytics.ts
- src/utils/command-telemetry.spec.ts
- src/utils/command-telemetry.ts
- src/utils/crash-reporter.spec.ts
- src/utils/crash-reporter.ts
- src/utils/exit-codes.spec.ts
- src/utils/exit-codes.ts
- src/utils/output.spec.ts
- src/utils/output.ts
- src/utils/register-subcommand.ts
- src/utils/telemetry-client.spec.ts
- src/utils/telemetry-client.ts
- src/utils/telemetry-sanitize.spec.ts
- src/utils/telemetry-schema.spec.ts
- src/utils/telemetry-store-forward.spec.ts
- src/utils/telemetry-store-forward.ts
- src/utils/telemetry-types.ts
```ts
import os from 'node:os';
import crypto from 'node:crypto';

const UUID_REGEX = /^[0-9a-f-]{36}$/i;
```
Tighten UUID validation before accepting persisted IDs.
The current regex accepts many non-UUID strings with 36 hex/hyphen chars. Use a strict UUID v4/v1-5 pattern (or z.uuid() in validation layer) before reusing the file value.
Suggested patch:

```diff
-const UUID_REGEX = /^[0-9a-f-]{36}$/i;
+const UUID_REGEX =
+  /^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;
```

Also applies to: 34-35
```ts
try {
  if (fs.existsSync(filePath)) {
    const raw = fs.readFileSync(filePath, 'utf8').trim();
    if (UUID_REGEX.test(raw)) {
      cached = raw;
      return raw;
    }
  }

  const id = crypto.randomUUID();
  fs.mkdirSync(path.dirname(filePath), { recursive: true, mode: 0o700 });
  fs.writeFileSync(filePath, id, { encoding: 'utf8', mode: 0o600 });
  cached = id;
```
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Replace sync fs usage in the command path.
Line 31–43 and Line 49 use synchronous fs/crypto APIs on a hot CLI path. Please move this to async (node:fs/promises) and cache a pending promise to keep call sites simple.
As per coding guidelines src/**/*.ts: “Avoid Node-specific sync APIs (crypto, fs sync) unless necessary”.
Also applies to: 49-50
```ts
const env = getActiveEnvironment();
if (env && isUnclaimedEnvironment(env)) {
  telemetryClient.setClaimTokenAuth(env.clientId, env.claimToken);
  // Tag distinctId so unclaimed sessions are identifiable in analytics
  this.distinctId = this.distinctId ?? `unclaimed:${env.clientId}`;
  if (this.authMode === 'none') this.authMode = 'claim_token';
}
```
Honor declared auth priority when unclaimed env is present.
Line 95 currently sets claim-token auth unconditionally, even if JWT auth was already set on Line 83. That can override the documented priority and misclassify/downgrade authenticated sessions.
Suggested fix:

```diff
-    if (env && isUnclaimedEnvironment(env)) {
+    if (env && isUnclaimedEnvironment(env) && this.authMode === 'none') {
       telemetryClient.setClaimTokenAuth(env.clientId, env.claimToken);
       // Tag distinctId so unclaimed sessions are identifiable in analytics
       this.distinctId = this.distinctId ?? `unclaimed:${env.clientId}`;
-      if (this.authMode === 'none') this.authMode = 'claim_token';
+      this.authMode = 'claim_token';
     }
```

```ts
export function extractUserFlags(rawArgs: string[]): string[] {
  const passedFlags = rawArgs
    .filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2))
    .map((arg) => arg.replace(/^-+/, '').split('=')[0]);
  return [...new Set(passedFlags)];
```
Harden flag extraction to avoid invalid telemetry flags.
Line 32/33 currently treats -- and values like -1 as flags, which can pollute command.flags.
Suggested fix:

```diff
 export function extractUserFlags(rawArgs: string[]): string[] {
   const passedFlags = rawArgs
-    .filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2))
-    .map((arg) => arg.replace(/^-+/, '').split('=')[0]);
+    .filter((arg) => {
+      if (arg === '--') return false;
+      if (/^--[A-Za-z][\w-]*(=.*)?$/.test(arg)) return true;
+      if (/^-[A-Za-z]$/.test(arg)) return true;
+      return false;
+    })
+    .map((arg) => arg.replace(/^-+/, '').split('=')[0])
+    .filter(Boolean);
   return [...new Set(passedFlags)];
 }
```

```ts
const topLevelCommand = commandParts[0] ?? '';
if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;
```
Normalize the skip check using canonical top-level command.
Line 60/61 checks the raw command token. If a skipped command is invoked via alias, the exemption can be bypassed and telemetry still emitted.
Suggested fix:

```diff
-  const topLevelCommand = commandParts[0] ?? '';
+  const rawTopLevelCommand = commandParts[0] ?? '';
+  const topLevelCommand = COMMAND_ALIASES[rawTopLevelCommand] ?? rawTopLevelCommand;
   if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;
```

```ts
export function sanitizeStack(stack: string | undefined): string {
  if (!stack) return '';
  let sanitized = stack.replaceAll(HOME, '~');
  sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');
```
🧩 Analysis chain

🏁 Script executed:

```shell
find . -name "crash-reporter.ts" -type f
```

Repository: workos/cli
Length of output: 84

🏁 Script executed:

```shell
cat -n src/utils/crash-reporter.ts
```

Repository: workos/cli
Length of output: 3137

🏁 Script executed:

```shell
python3 << 'EOF'
import re

# Test the current regex against various path formats
current_pattern = r'/[^\s:]+/(node_modules|dist|src)/'
windows_pattern = r'[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\'

# Test cases
test_cases = [
    # POSIX paths
    "/home/user/project/node_modules/lib/index.js",
    "/var/app/dist/main.js",
    "/Users/dev/src/index.ts",
    # Windows paths
    "C:\\Users\\dev\\project\\node_modules\\lib\\index.js",
    "D:\\app\\dist\\main.js",
    "E:\\code\\src\\index.ts",
    # Stack trace examples
    "    at functionName (/home/user/project/node_modules/pkg/index.js:10:5)",
    "    at functionName (C:\\Users\\dev\\project\\node_modules\\pkg\\index.js:10:5)",
]

print("Testing CURRENT regex pattern:")
print(f"Pattern: {current_pattern}")
for test in test_cases:
    match = re.search(current_pattern, test)
    result = re.sub(current_pattern, r'\1/', test) if match else test
    print(f"  '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'")

print("\nTesting PROPOSED Windows pattern:")
print(f"Pattern: {windows_pattern}")
for test in test_cases:
    match = re.search(windows_pattern, test)
    result = re.sub(windows_pattern, r'\1\\', test) if match else test
    print(f"  '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'")

print("\nTesting COMBINED patterns:")
for test in test_cases:
    result = test
    result = re.sub(current_pattern, r'\1/', result)
    result = re.sub(windows_pattern, r'\1\\', result)
    print(f"  '{test[:60]}...' -> Result: '{result[:60]}...'")
EOF
```

Repository: workos/cli
Length of output: 2630
Windows absolute paths leak in stack trace sanitization
Line 26 only collapses POSIX-style paths (/path/to/node_modules/). Windows paths (e.g., C:\Users\...\node_modules\) bypass this regex entirely and leak full local filesystem paths into telemetry payloads.
Regex testing confirms the current pattern does not match Windows absolute paths:
- `C:\Users\dev\project\node_modules\pkg\index.js` → remains unsanitized
- `/home/user/project/node_modules/pkg/index.js` → correctly collapsed to `node_modules/pkg/index.js`
🔧 Proposed fix

```diff
 export function sanitizeStack(stack: string | undefined): string {
   if (!stack) return '';
   let sanitized = stack.replaceAll(HOME, '~');
-  sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');
+  sanitized = sanitized
+    .replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/')
+    .replace(/[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\/g, '$1\\');
   sanitized = redactSecrets(sanitized);
   return sanitized.length > MAX_STACK_LENGTH
     ? sanitized.slice(0, MAX_STACK_LENGTH) + '\n...[truncated]'
     : sanitized;
 }
```
```ts
export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
  if (error) {
    outputError(error);
  }
  analytics.recordTermination(reasonForExitCode(code), error?.code);
  process.exit(code);
```
Classify termination from error.code when available.
Line 76 derives reason only from numeric exit code, so all code-1 exits become validation_error. That misclassifies API failures like http_500 when they go through exitWithCode.
Suggested fix:

```diff
 export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
   if (error) {
     outputError(error);
   }
-  analytics.recordTermination(reasonForExitCode(code), error?.code);
+  const reason = error ? resolveErrorCode(error.code).reason : reasonForExitCode(code);
+  analytics.recordTermination(reason, error?.code);
   process.exit(code);
 }
```
```diff
 async flush(): Promise<boolean> {
   if (this.events.length === 0) return true;
   if (!this.gatewayUrl) {
     debug('[Telemetry] No gateway URL configured, skipping flush');
-    return;
+    return false;
   }

-  const payload: TelemetryRequest = { events: [...this.events] };
-  this.events = [];
+  const count = this.events.length;
+  const payload: TelemetryRequest = { events: this.events.slice(0, count) };
```
Guard against concurrent flush() calls to prevent event loss/duplication.
Line 85 currently allows overlapping flushes. Two concurrent calls can send the same snapshot twice, and later splice(0, count) can remove events queued after the first flush.
Suggested patch:

```diff
 export class TelemetryClient {
   private events: TelemetryEvent[] = [];
+  private flushInFlight: Promise<boolean> | null = null;
 @@
   async flush(): Promise<boolean> {
+    if (this.flushInFlight) return this.flushInFlight;
+    this.flushInFlight = this.flushInternal();
+    try {
+      return await this.flushInFlight;
+    } finally {
+      this.flushInFlight = null;
+    }
+  }
+
+  private async flushInternal(): Promise<boolean> {
     if (this.events.length === 0) return true;
 @@
 }
```

Also applies to: 135-137, 143-144
```ts
mkdirSync(dirname(filePath), { recursive: true });
writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');
this.events = [];
```
Persist telemetry with restrictive filesystem permissions.
persistToFile() currently relies on default permissions. Since payloads can include user/device identifiers, write with 0700 dir and 0600 file modes.
Suggested patch:

```diff
-    mkdirSync(dirname(filePath), { recursive: true });
-    writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');
+    mkdirSync(dirname(filePath), { recursive: true, mode: 0o700 });
+    writeFileSync(filePath, JSON.stringify(this.events), { encoding: 'utf-8', mode: 0o600 });
     this.events = [];
```
Greptile Summary

This PR wires full command-level telemetry across the CLI: every yargs command now emits a
Confidence Score: 4/5

The telemetry infrastructure is additive and fail-safe — every path is wrapped in try/catch so telemetry failures cannot crash the CLI. The main areas to watch are the concurrent-flush race (recovered events can be sent twice when previous-session files exist) and the circular import between output.ts and exit-codes.ts. All new code paths are purely telemetry — they don't affect command output, user-visible behavior, or data integrity. The circular import works at runtime due to ESM live bindings but is a maintenance risk. The concurrent flush race produces duplicate telemetry events for recovered data, which is a quality concern rather than a correctness one. Both issues are localized to the telemetry subsystem.

Files to watch: src/utils/output.ts and src/utils/exit-codes.ts (circular import), src/utils/telemetry-store-forward.ts and src/bin.ts (concurrent flush race), src/utils/command-telemetry.ts (SKIP_TELEMETRY_COMMANDS enforcement gap).

Important Files Changed
```ts
import chalk from 'chalk';
import { analytics } from './analytics.js';
import { resolveErrorCode } from './exit-codes.js';
```
Circular import between `output.ts` and `exit-codes.ts`

`output.ts` imports `resolveErrorCode` from `./exit-codes.js`, and `exit-codes.ts` (line 11) imports `outputError` from `./output.js`. This creates a mutual dependency between the two modules. In Node.js ESM this works at runtime because both references are only used inside function bodies (not during module initialization), so live bindings are fully populated by call time. However, it's a fragile antipattern — if either module is ever extended to reference the other's export at the top level (e.g., a default value, a module-level variable), it will silently receive `undefined`. The clean fix is to move `resolveErrorCode` into a dedicated module (e.g., `error-codes.ts`) that has no circular dependency on `output.ts`.
```ts
analytics.initForNonInstaller();
// Fire-and-forget: recover events from previous crashes/exits.
// NO await — must not block startup (flush timeout is 3s).
recoverPendingEvents();
```
Concurrent `flush()` calls can send recovered events twice

`recoverPendingEvents()` is intentionally called without `await` so it doesn't block startup. All of its synchronous work (reading files, queueing events, deleting files) completes before yargs starts, but the `await telemetryClient.flush()` at the end of `recoverPendingEvents` yields to the event loop. The command handler then runs and its `wrapCommandHandler` `finally` block calls `telemetryClient.flush()` independently. If a previous-session pending file existed, both flush calls will execute concurrently: flush A snapshots the recovered events [R1, R2], flush B snapshots [R1, R2, C] (the newly queued command event). Flush A sends R1/R2 and then `splice(0, 2)` removes them; flush B has already sent R1/R2 in its payload, resulting in duplicate delivery. For most fast CLI commands (sub-second), this race window is open. A simple guard — checking whether a flush is already in progress before starting another — would close it.
```ts
return async (argv) => {
  const commandName = String(argv.__telemetryCommandName ?? 'unknown');
  const startTime = Number(argv.__telemetryStartTime ?? Date.now());
  const flags = (argv.__telemetryFlags as string[]) ?? [];

  try {
    await handler(argv);
    // Replace the provisional event with the real one, then patch in
    // the structured termination reason. Order matters: replace re-queues,
    // then recordTermination mutates the re-queued event in place.
    analytics.replaceLastCommandEvent(commandName, Date.now() - startTime, true, { flags });
    analytics.recordTermination('success');
  } catch (error) {
    const err = error instanceof Error ? error : new Error(String(error));
    analytics.replaceLastCommandEvent(commandName, Date.now() - startTime, false, {
      error: err,
      flags,
    });
    // Uncaught throw = crash. Clean exits (exitWithError/exitWithCode)
    // already recorded their own termination reason before exiting.
    analytics.recordTermination('crash', err.name);
    throw error;
  } finally {
    // Flush in-process so events are sent immediately, not deferred to next invocation.
    // If flush fails, store-forward persists on exit.
    await telemetryClient.flush().catch(() => {});
  }
};
```
`SKIP_TELEMETRY_COMMANDS` only guards the provisional event, not the wrap-time event

`SKIP_TELEMETRY_COMMANDS` prevents the middleware from queuing a provisional event for commands like `install` and `dashboard`. However, `wrapCommandHandler` always calls `analytics.replaceLastCommandEvent(...)`, which internally calls `telemetryClient.replaceLastEventOfType('command')` (a no-op if no event exists) and then unconditionally calls `analytics.commandExecuted(...)`, queuing a brand-new command event. If a developer accidentally wraps `install` or `dashboard` with `wrapCommandHandler` — following the CLAUDE.md guidance for new commands — those commands would still emit command-level telemetry, duplicating their session-level telemetry. Adding a guard inside `wrapCommandHandler` (e.g., checking whether `SKIP_TELEMETRY_COMMANDS` includes the resolved command name) would make the exclusion robust at the code level rather than relying solely on documentation.
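One way to sketch that code-level guard (the set contents come from the PR description; the helper name is hypothetical):

```typescript
// Commands excluded from command-level telemetry because they carry their
// own session-based telemetry (per the PR description).
const SKIP_TELEMETRY_COMMANDS = new Set(['install', 'dashboard', '$0']);

// Guard usable inside wrapCommandHandler as well as the middleware, so the
// exclusion holds even if a skipped command is accidentally wrapped.
function shouldEmitCommandTelemetry(resolvedName: string): boolean {
  const topLevel = resolvedName.split('.')[0] ?? resolvedName;
  return !SKIP_TELEMETRY_COMMANDS.has(topLevel);
}
```

Checking the canonical (alias-resolved) name here also closes the alias-bypass gap raised earlier in the review.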
```ts
case 'step': return `step(${(e as any).name}, ${(e as any).durationMs}ms, success=${(e as any).success})`;
case 'agent.tool': return `agent.tool(${(e as any).toolName}, ${(e as any).durationMs}ms)`;
case 'agent.llm': return `agent.llm(${(e as any).model}, in=${(e as any).inputTokens}, out=${(e as any).outputTokens})`;
```
The switch arms cast to `any` to access event-specific fields (`name`, `durationMs`, `toolName`, etc.) even though the discriminated union in `telemetry-types.ts` fully types each branch. Use a typed assertion (e.g., `e as StepEvent`) in each case instead to preserve type safety.

Suggested change:

```ts
case 'step': { const ev = e as import('./telemetry-types.js').StepEvent; return `step(${ev.name}, ${ev.durationMs}ms, success=${ev.success})`; }
case 'agent.tool': { const ev = e as import('./telemetry-types.js').AgentToolEvent; return `agent.tool(${ev.toolName}, ${ev.durationMs}ms)`; }
case 'agent.llm': { const ev = e as import('./telemetry-types.js').AgentLLMEvent; return `agent.llm(${ev.model}, in=${ev.inputTokens}, out=${ev.outputTokens})`; }
```
Summary
Right now we only have telemetry on the installer flow. This PR adds telemetry coverage to every CLI command, plus crash reporting and a debug flag.
What changed:
**Event type enrichment** -- `session.start`/`session.end` now carry environment fingerprint (OS, Node version, CI detection, shell). Step and tool events get explicit `startTimestamp` for accurate span reconstruction on the backend. Two new event types: `command` and `crash`.

**Command telemetry for all commands** -- every yargs command (both `registerSubcommand` and inline handlers) now emits a `command` event with canonical name, duration, success/failure, and which flags were used. Alias resolution normalizes `org` -> `organization`, `claim` -> `env.claim` so metrics don't fragment.

**Crash reporting** -- global `uncaughtException`/`unhandledRejection` handlers capture unhandled crashes with sanitized stack traces (home dir stripped, truncated to 4KB). Handlers are synchronous to avoid Node's async-handler footgun.

**Store-and-forward** -- events are persisted to a PID-based temp file on process exit via `writeFileSync`. The next CLI invocation recovers and sends them. This handles the ~25 locations that call `process.exit()` directly by queuing a provisional event in the middleware before the handler runs.

**Flush improvements** -- `flush()` now returns a boolean (sent vs retryable), uses `splice` instead of clearing the whole queue (protects events queued during an in-flight fetch), drops events on 4xx (permanent failures like 401 won't accumulate), and retains on 5xx/network errors for store-forward. Commands flush in-process via the handler wrapper rather than always deferring to the next invocation.

**`WORKOS_DEBUG=1`** -- enables verbose debug logging for all commands (not just the installer's `--debug` flag). Shows telemetry event details on flush.

Design decisions worth noting: `install`, `dashboard`, and the default `$0` commands are excluded from command-level telemetry since they have their own session-based telemetry.

Summary by CodeRabbit
Release Notes
Documentation
New Features
- `WORKOS_DEBUG` environment variable for enhanced verbose debug logging across all commands.

Improvements