feat: CLI telemetry for all commands + crash reporting #122
…amps, and new event types

Add environment fingerprint fields (OS, Node version, CI detection, shell) to `session.start` and `session.end` events. Add `startTimestamp` to step and `agent.tool` events for span reconstruction. Define command and crash event types with stub emission methods. Add discriminated-union Zod schema validation tests mirroring the API schema.
… persistence

Wire up yargs middleware that emits a provisional command event before each handler runs, then replaces it with actual duration/success on completion. This covers the ~25 `process.exit()` call sites without modifying them.

- Command telemetry middleware with canonical name resolution and flag extraction
- Crash reporter with sanitized stack traces (sync handlers, no async)
- Store-forward: persist unsent events to a temp file on exit, recover on next run
- Fix `flush()` to retain events until HTTP success (was clearing before fetch)
- Auto-wrap handlers in `registerSubcommand()` (single change point)
- Shared `COMMAND_ALIASES` map for telemetry and help-json
- `analytics.initForNonInstaller()` sets gatewayUrl + JWT from stored creds
- Enable debug output for non-installer commands via env var
- Log telemetry event details (type, name, duration, attributes) on flush
- Register in the debug command's env var catalog
- Wrap inline command handlers (seed, setup-org, doctor, etc.) with `wrapCommandHandler` so they report real duration/success
- Skip the provisional telemetry event for the install command (it has its own session telemetry)
- Add `claim` -> `env.claim` to the canonical alias map
- Defer store-forward file deletion until after a successful flush
Client errors (401, 403) are permanent failures that won't succeed on retry. Only retain events for 5xx (transient server errors) and network failures where store-forward retry is meaningful.
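The retention rule above can be sketched as a small classifier (the helper name is illustrative; the real logic lives inside the telemetry client's flush path):

```typescript
// Hypothetical classifier mirroring the retention rule described above:
// 2xx -> delivered, 4xx -> dropped permanently, 5xx or no response -> retained for retry.
type FlushOutcome = 'sent' | 'dropped' | 'retry';

function classifyFlushResult(status: number | null): FlushOutcome {
  if (status === null) return 'retry'; // network failure: no response at all
  if (status >= 200 && status < 300) return 'sent';
  if (status >= 400 && status < 500) return 'dropped'; // 401/403 won't succeed on retry
  return 'retry'; // 5xx: transient server error, keep for store-forward
}
```

Under this rule, a 401 drops the batch immediately while a 503 leaves it queued for the next invocation's store-forward pass.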
- `flush()` returns true (sent/dropped) or false (retryable) so callers can act on the result
- Use `splice(0, count)` instead of clearing all events, protecting events queued concurrently during the fetch
- `wrapCommandHandler` flushes in-process so events are sent immediately instead of always deferring to the next invocation via store-forward
- Store-forward recovery deletes files after loading into memory (events are re-persisted by the exit handler if flush fails)
- Skip provisional events for dashboard and $0 (installer entry points)
- Add 4xx drop test coverage
Add a section to CLAUDE.md explaining which commands auto-emit telemetry (registerSubcommand) versus which need manual wrapCommandHandler wrapping (inline top-level .command() calls). Add a pointer comment in bin.ts near the workflow commands block. Prevents new top-level commands from silently emitting duration=0 telemetry.
- Add `workos.user_id` to command and crash events (from stored credentials or unclaimed environment clientId) so dashboards can count unique users
- Add `cli.version` to command and crash events for release adoption tracking
- Support claim-token auth path on the telemetry client, so unclaimed environments' telemetry reaches the API (guard accepts this path too)
- Rename CrashEvent's `installer.version` to `cli.version` (crashes happen outside the installer too)
- `initForNonInstaller()` now wires up user_id and claim-token auth
Closes security-audit finding #1 on PR #122 (telemetry message sanitization). `error.message` was flowing into 4 capture sites unsanitized, leaking homedir paths (and rarely, credentials) to the WorkOS gateway.

- Add `sanitizeMessage()` in crash-reporter.ts: homedir strip + Bearer/sk_*/JWT redaction + 1KB truncation.
- Factor secret redaction into shared `redactSecrets()` used by both `sanitizeMessage` and `sanitizeStack` (Node echoes `.message` into the leading `Error.stack` line, so message-only sanitization was insufficient).
- Add private `extractErrorFields()` chokepoint on `Analytics`; route all 4 capture sites through it (`captureException`, `stepCompleted`, `commandExecuted`, `captureUnhandledCrash`). `replaceLastCommandEvent` inherits sanitization via its delegation to `commandExecuted`.
- `captureUnhandledCrash` now uses `sanitizeStack` instead of inline truncation, providing defense-in-depth for callers that bypass the crash-reporter wrapper.
- Add regression guard test (`telemetry-sanitize.spec.ts`): poisons every capture method with homedir + Bearer + sk_live_ + JWT, asserts no marker reaches the serialized queue.

Reviewed: ideation:reviewer cycle 1 PASS (0 critical, 0 high).
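A toy version of the layered redaction described above (the patterns here are assumptions for illustration; the real `redactSecrets` in crash-reporter.ts may differ):

```typescript
// Illustrative redaction pass: Bearer tokens, sk_* keys, and JWT-shaped
// three-segment tokens are masked before any text leaves the process.
function redactSecretsSketch(text: string): string {
  return text
    .replace(/Bearer\s+[A-Za-z0-9._~+/=-]+/g, 'Bearer [REDACTED]')
    .replace(/\bsk_[A-Za-z0-9_]+/g, 'sk_[REDACTED]')
    .replace(/\b[\w-]+\.[\w-]+\.[\w-]+\b/g, '[JWT_REDACTED]');
}
```

Running the same pass over both the message and the stack matters because, as noted above, Node copies `.message` into the first line of `Error.stack`.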
Introduces two additive telemetry signals requested by the signals spec:
- `device.id` — persistent per-install UUID stored at `~/.workos/device-id`. File IO failures fall through to a one-shot session UUID; never throws.
- `auth.mode` — 4-state enum ('jwt' | 'claim_token' | 'api_key' | 'none') derived during `initForNonInstaller` with JWT > claim_token > api_key priority. Installer flow sets it in run-with-core after credentials resolve.
Both fields are injected via getEnvFingerprint so they appear on every
event that already carries env context (session.start, session.end,
command, crash). No change to downstream gateway — existing deliveries
accept the new attrs as optional.
Implements Phase 1 of docs/ideation/telemetry-signals/spec-phase-1.md.
Gateway-side type mirroring is tracked separately in the gateway repo.
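The `device.id` read-or-create behavior described above can be sketched roughly like this (the demo path and helper name are illustrative; the real implementation stores the file at `~/.workos/device-id`):

```typescript
import fs from 'node:fs';
import path from 'node:path';
import os from 'node:os';
import crypto from 'node:crypto';

const UUID_V4ISH = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Read the persisted ID if valid, otherwise create one; on any file IO
// failure fall back to a one-shot per-session UUID instead of throwing.
function getDeviceIdSketch(filePath: string): string {
  try {
    if (fs.existsSync(filePath)) {
      const raw = fs.readFileSync(filePath, 'utf8').trim();
      if (UUID_V4ISH.test(raw)) return raw;
    }
    const id = crypto.randomUUID();
    fs.mkdirSync(path.dirname(filePath), { recursive: true, mode: 0o700 });
    fs.writeFileSync(filePath, id, { encoding: 'utf8', mode: 0o600 });
    return id;
  } catch {
    return crypto.randomUUID(); // session-only fallback; never throws
  }
}

const demoPath = path.join(os.tmpdir(), 'workos-device-id-demo');
```

Two calls against the same path return the same ID; a read failure degrades to a fresh session UUID rather than breaking the command.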
…and events

Replace the boolean `command.success` as the primary outcome dimension with a structured `termination.reason` enum (success | cancelled | auth_required | validation_error | api_error | crash) plus an `error.code` string.

- `exitWithError`, `exitWithCode`, `exitWithAuthRequired` now patch the provisional command event via `analytics.recordTermination` before calling `process.exit`. Previously the queued provisional event (success true, duration 0) was persisted as-is by store-forward, producing misleading dashboard data.
- `exitWithError` now honors the string error code for exit mapping: `auth_required` -> 4, `cancelled` -> 2, `http_*`/`not_found`/`unknown_error` -> 1 (api_error), everything else -> 1 (validation_error). Previously it always exited 1.
- New `TelemetryClient.patchLastEventOfType` helper mutates a queued event in place. Unlike `replaceLastEventOfType`, it preserves the event so multiple callers can update fields incrementally.
- `wrapCommandHandler` records `reason: 'success'` on clean completion and `reason: 'crash'` with `error.name` on uncaught throw.
- `command.success` remains on CommandEvent for backward compat.
- `api.status`/`api.code`/`api.resource` fields added to CommandEvent and plumbed through `recordTermination` for Phase 3 consumption.

Review passed (2 medium + 1 low non-blocking; low fixed in same commit for crash-path test symmetry).
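The error-code to exit-code mapping above can be sketched as a small pure function (names are illustrative; the real `resolveErrorCode` lives in exit-codes.ts):

```typescript
type TerminationReason =
  | 'success' | 'cancelled' | 'auth_required' | 'validation_error' | 'api_error' | 'crash';

// Mirrors the mapping described in the commit message:
// auth_required -> 4, cancelled -> 2, http_*/not_found/unknown_error -> 1 (api_error),
// everything else -> 1 (validation_error).
function resolveErrorCodeSketch(code: string): { reason: TerminationReason; exit: number } {
  if (code === 'auth_required') return { reason: 'auth_required', exit: 4 };
  if (code === 'cancelled') return { reason: 'cancelled', exit: 2 };
  if (code.startsWith('http_') || code === 'not_found' || code === 'unknown_error') {
    return { reason: 'api_error', exit: 1 };
  }
  return { reason: 'validation_error', exit: 1 };
}
```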
…mmand events

Wires `createApiErrorHandler` through to the provisional command event via an optional `apiContext` param on `exitWithError`, populating `api.status`, `api.code`, and `api.resource` attributes on API-failure command events.

Also treats `apiContext` presence as ground truth for the `api_error` termination reason — WorkOS error codes like `rate_limited` or `validation_error` would otherwise fall through `resolveErrorCode` to `validation_error`, hiding legitimate API failures from api_error dashboards.

The `CommandEvent` type already had the `api.*` attribute slots (added as forward-compat in Phase 2); only the wiring and tests are new.
📝 Walkthrough

This PR adds comprehensive telemetry infrastructure to the CLI, enabling command execution tracking, error context capture, device identification, crash reporting, and event persistence through transient network failures. It defines new telemetry event types for commands and crashes, instruments the Analytics class with command and auth-mode tracking, wires telemetry into command execution via middleware and handler wrapping, and integrates store-and-forward recovery into the CLI bootstrap.

Changes: Telemetry Infrastructure & Command Tracking
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 9
🧹 Nitpick comments (4)
src/utils/telemetry-types.ts (1)
58-58: ⚡ Quick win — Align `startTimestamp` optionality with backward-compat contract.

`telemetry-schema.spec.ts` explicitly accepts `step`/`agent.tool` without `startTimestamp`, but these interfaces currently require it. Making both optional avoids contract drift.

Suggested patch:

```diff
 export interface StepEvent extends TelemetryEvent {
   type: 'step';
   name: string;
-  startTimestamp: string;
+  startTimestamp?: string;
   durationMs: number;
   success: boolean;
   error?: {
@@
 export interface AgentToolEvent extends TelemetryEvent {
   type: 'agent.tool';
   toolName: string;
-  startTimestamp: string;
+  startTimestamp?: string;
   durationMs: number;
   success: boolean;
 }
```

Also applies to: 70-70
src/utils/telemetry-client.spec.ts (1)
60-105: ⚡ Quick win — Add coverage for claim-token auth headers.

`setClaimTokenAuth()` is new behavior in this PR, but there's no test asserting `x-workos-claim-token` + `x-workos-client-id` emission (and bearer-token precedence when both exist).

Also applies to: 124-218
src/utils/telemetry-store-forward.ts (1)
1-3: ⚡ Quick win — Use async fs APIs for startup recovery path.

`recoverPendingEvents()` (line 25+) is async but uses sync disk calls (lines 27-56), which blocks startup and conflicts with the project rule for TS files. As per coding guidelines: "Avoid Node-specific sync APIs (crypto, fs sync) unless necessary".

Also applies to: 27-56
src/utils/output.ts (1)
123-124: ⚡ Quick win — Protect process termination from telemetry failures.

`analytics.recordTermination(...)` on line 123 can throw and prevent line 124 from executing. Error exits should remain deterministic even when telemetry is unhealthy.

🔧 Proposed fix

```diff
   const reason = error.apiContext ? 'api_error' : codeReason;
-  analytics.recordTermination(reason, error.code, error.apiContext);
-  process.exit(exit);
+  try {
+    analytics.recordTermination(reason, error.code, error.apiContext);
+  } finally {
+    process.exit(exit);
+  }
 }
```
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c45afda9-659c-43a0-9646-cd08dd0cabe3
📒 Files selected for processing (27)
- CLAUDE.md
- src/bin.ts
- src/commands/debug.ts
- src/lib/api-error-handler.spec.ts
- src/lib/api-error-handler.ts
- src/lib/command-aliases.ts
- src/lib/device-id.spec.ts
- src/lib/device-id.ts
- src/lib/run-with-core.ts
- src/utils/analytics.spec.ts
- src/utils/analytics.ts
- src/utils/command-telemetry.spec.ts
- src/utils/command-telemetry.ts
- src/utils/crash-reporter.spec.ts
- src/utils/crash-reporter.ts
- src/utils/exit-codes.spec.ts
- src/utils/exit-codes.ts
- src/utils/output.spec.ts
- src/utils/output.ts
- src/utils/register-subcommand.ts
- src/utils/telemetry-client.spec.ts
- src/utils/telemetry-client.ts
- src/utils/telemetry-sanitize.spec.ts
- src/utils/telemetry-schema.spec.ts
- src/utils/telemetry-store-forward.spec.ts
- src/utils/telemetry-store-forward.ts
- src/utils/telemetry-types.ts
```ts
import os from 'node:os';
import crypto from 'node:crypto';

const UUID_REGEX = /^[0-9a-f-]{36}$/i;
```
Tighten UUID validation before accepting persisted IDs.
The current regex accepts many non-UUID strings with 36 hex/hyphen chars. Use a strict UUID v4/v1-5 pattern (or z.uuid() in validation layer) before reusing the file value.
Suggested patch:

```diff
-const UUID_REGEX = /^[0-9a-f-]{36}$/i;
+const UUID_REGEX =
+  /^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;
```

Also applies to: 34-35
```ts
try {
  if (fs.existsSync(filePath)) {
    const raw = fs.readFileSync(filePath, 'utf8').trim();
    if (UUID_REGEX.test(raw)) {
      cached = raw;
      return raw;
    }
  }

  const id = crypto.randomUUID();
  fs.mkdirSync(path.dirname(filePath), { recursive: true, mode: 0o700 });
  fs.writeFileSync(filePath, id, { encoding: 'utf8', mode: 0o600 });
  cached = id;
```
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Replace sync fs usage in the command path.
Line 31–43 and Line 49 use synchronous fs/crypto APIs on a hot CLI path. Please move this to async (node:fs/promises) and cache a pending promise to keep call sites simple.
As per coding guidelines src/**/*.ts: “Avoid Node-specific sync APIs (crypto, fs sync) unless necessary”.
Also applies to: 49-50
```ts
const env = getActiveEnvironment();
if (env && isUnclaimedEnvironment(env)) {
  telemetryClient.setClaimTokenAuth(env.clientId, env.claimToken);
  // Tag distinctId so unclaimed sessions are identifiable in analytics
  this.distinctId = this.distinctId ?? `unclaimed:${env.clientId}`;
  if (this.authMode === 'none') this.authMode = 'claim_token';
}
```
Honor declared auth priority when unclaimed env is present.
Line 95 currently sets claim-token auth unconditionally, even if JWT auth was already set on Line 83. That can override the documented priority and misclassify/downgrade authenticated sessions.
Suggested fix:

```diff
-    if (env && isUnclaimedEnvironment(env)) {
+    if (env && isUnclaimedEnvironment(env) && this.authMode === 'none') {
       telemetryClient.setClaimTokenAuth(env.clientId, env.claimToken);
       // Tag distinctId so unclaimed sessions are identifiable in analytics
       this.distinctId = this.distinctId ?? `unclaimed:${env.clientId}`;
-      if (this.authMode === 'none') this.authMode = 'claim_token';
+      this.authMode = 'claim_token';
     }
```

```ts
export function extractUserFlags(rawArgs: string[]): string[] {
  const passedFlags = rawArgs
    .filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2))
    .map((arg) => arg.replace(/^-+/, '').split('=')[0]);
  return [...new Set(passedFlags)];
```
Harden flag extraction to avoid invalid telemetry flags.
Line 32/33 currently treats -- and values like -1 as flags, which can pollute command.flags.
Suggested fix:

```diff
 export function extractUserFlags(rawArgs: string[]): string[] {
   const passedFlags = rawArgs
-    .filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2))
-    .map((arg) => arg.replace(/^-+/, '').split('=')[0]);
+    .filter((arg) => {
+      if (arg === '--') return false;
+      if (/^--[A-Za-z][\w-]*(=.*)?$/.test(arg)) return true;
+      if (/^-[A-Za-z]$/.test(arg)) return true;
+      return false;
+    })
+    .map((arg) => arg.replace(/^-+/, '').split('=')[0])
+    .filter(Boolean);
   return [...new Set(passedFlags)];
 }
```

```ts
const topLevelCommand = commandParts[0] ?? '';
if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;
```
Normalize the skip check using canonical top-level command.
Line 60/61 checks the raw command token. If a skipped command is invoked via alias, the exemption can be bypassed and telemetry still emitted.
Suggested fix:

```diff
-  const topLevelCommand = commandParts[0] ?? '';
+  const rawTopLevelCommand = commandParts[0] ?? '';
+  const topLevelCommand = COMMAND_ALIASES[rawTopLevelCommand] ?? rawTopLevelCommand;
   if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;
```

```ts
export function sanitizeStack(stack: string | undefined): string {
  if (!stack) return '';
  let sanitized = stack.replaceAll(HOME, '~');
  sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');
```
🧩 Analysis chain

🏁 Script executed:

```shell
find . -name "crash-reporter.ts" -type f
```

Repository: workos/cli
Length of output: 84

🏁 Script executed:

```shell
cat -n src/utils/crash-reporter.ts
```

Repository: workos/cli
Length of output: 3137

🏁 Script executed:

```shell
python3 << 'EOF'
import re

# Test the current regex against various path formats
current_pattern = r'/[^\s:]+/(node_modules|dist|src)/'
windows_pattern = r'[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\'

# Test cases
test_cases = [
    # POSIX paths
    "/home/user/project/node_modules/lib/index.js",
    "/var/app/dist/main.js",
    "/Users/dev/src/index.ts",
    # Windows paths
    "C:\\Users\\dev\\project\\node_modules\\lib\\index.js",
    "D:\\app\\dist\\main.js",
    "E:\\code\\src\\index.ts",
    # Stack trace examples
    "    at functionName (/home/user/project/node_modules/pkg/index.js:10:5)",
    "    at functionName (C:\\Users\\dev\\project\\node_modules\\pkg\\index.js:10:5)",
]

print("Testing CURRENT regex pattern:")
print(f"Pattern: {current_pattern}")
for test in test_cases:
    match = re.search(current_pattern, test)
    result = re.sub(current_pattern, r'\1/', test) if match else test
    print(f"  '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'")

print("\nTesting PROPOSED Windows pattern:")
print(f"Pattern: {windows_pattern}")
for test in test_cases:
    match = re.search(windows_pattern, test)
    result = re.sub(windows_pattern, r'\1\\', test) if match else test
    print(f"  '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'")

print("\nTesting COMBINED patterns:")
for test in test_cases:
    result = test
    result = re.sub(current_pattern, r'\1/', result)
    result = re.sub(windows_pattern, r'\1\\', result)
    print(f"  '{test[:60]}...' -> Result: '{result[:60]}...'")
EOF
```

Repository: workos/cli
Length of output: 2630
Windows absolute paths leak in stack trace sanitization
Line 26 only collapses POSIX-style paths (/path/to/node_modules/). Windows paths (e.g., C:\Users\...\node_modules\) bypass this regex entirely and leak full local filesystem paths into telemetry payloads.
Regex testing confirms the current pattern does not match Windows absolute paths:
- `C:\Users\dev\project\node_modules\pkg\index.js` → remains unsanitized
- `/home/user/project/node_modules/pkg/index.js` → correctly collapsed to `node_modules/pkg/index.js`
🔧 Proposed fix

```diff
 export function sanitizeStack(stack: string | undefined): string {
   if (!stack) return '';
   let sanitized = stack.replaceAll(HOME, '~');
-  sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');
+  sanitized = sanitized
+    .replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/')
+    .replace(/[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\/g, '$1\\');
   sanitized = redactSecrets(sanitized);
   return sanitized.length > MAX_STACK_LENGTH
     ? sanitized.slice(0, MAX_STACK_LENGTH) + '\n...[truncated]'
     : sanitized;
 }
```
```ts
export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
  if (error) {
    outputError(error);
  }
  analytics.recordTermination(reasonForExitCode(code), error?.code);
  process.exit(code);
```
Classify termination from error.code when available.
Line 76 derives reason only from numeric exit code, so all code-1 exits become validation_error. That misclassifies API failures like http_500 when they go through exitWithCode.
Suggested fix:

```diff
 export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
   if (error) {
     outputError(error);
   }
-  analytics.recordTermination(reasonForExitCode(code), error?.code);
+  const reason = error ? resolveErrorCode(error.code).reason : reasonForExitCode(code);
+  analytics.recordTermination(reason, error?.code);
   process.exit(code);
 }
```
```diff
 async flush(): Promise<boolean> {
   if (this.events.length === 0) return true;
   if (!this.gatewayUrl) {
     debug('[Telemetry] No gateway URL configured, skipping flush');
-    return;
+    return false;
   }

-  const payload: TelemetryRequest = { events: [...this.events] };
-  this.events = [];
+  const count = this.events.length;
+  const payload: TelemetryRequest = { events: this.events.slice(0, count) };
```
Guard against concurrent flush() calls to prevent event loss/duplication.
Line 85 currently allows overlapping flushes. Two concurrent calls can send the same snapshot twice, and later splice(0, count) can remove events queued after the first flush.
Suggested patch:

```diff
 export class TelemetryClient {
   private events: TelemetryEvent[] = [];
+  private flushInFlight: Promise<boolean> | null = null;
 @@
   async flush(): Promise<boolean> {
+    if (this.flushInFlight) return this.flushInFlight;
+    this.flushInFlight = this.flushInternal();
+    try {
+      return await this.flushInFlight;
+    } finally {
+      this.flushInFlight = null;
+    }
+  }
+
+  private async flushInternal(): Promise<boolean> {
     if (this.events.length === 0) return true;
 @@
 }
```

Also applies to: 135-137, 143-144
```ts
mkdirSync(dirname(filePath), { recursive: true });
writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');
this.events = [];
```
Persist telemetry with restrictive filesystem permissions.
persistToFile() currently relies on default permissions. Since payloads can include user/device identifiers, write with 0700 dir and 0600 file modes.
Suggested patch:

```diff
-    mkdirSync(dirname(filePath), { recursive: true });
-    writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');
+    mkdirSync(dirname(filePath), { recursive: true, mode: 0o700 });
+    writeFileSync(filePath, JSON.stringify(this.events), { encoding: 'utf-8', mode: 0o600 });
     this.events = [];
```
Greptile Summary

This PR wires full command-level telemetry across the CLI: every yargs command now emits a
Confidence Score: 4/5

The telemetry infrastructure is additive and fail-safe — every path is wrapped in try/catch so telemetry failures cannot crash the CLI. The main areas to watch are the concurrent-flush race (recovered events can be sent twice when previous-session files exist) and the circular import between output.ts and exit-codes.ts. All new code paths are purely telemetry — they don't affect command output, user-visible behavior, or data integrity. The circular import works at runtime due to ESM live bindings but is a maintenance risk. The concurrent flush race produces duplicate telemetry events for recovered data, which is a quality concern rather than a correctness one. Both issues are localized to the telemetry subsystem.

Files to watch: src/utils/output.ts and src/utils/exit-codes.ts (circular import), src/utils/telemetry-store-forward.ts and src/bin.ts (concurrent flush race), src/utils/command-telemetry.ts (SKIP_TELEMETRY_COMMANDS enforcement gap).

Important Files Changed
```ts
import chalk from 'chalk';
import { analytics } from './analytics.js';
import { resolveErrorCode } from './exit-codes.js';
```
Circular import between `output.ts` and `exit-codes.ts`

`output.ts` imports `resolveErrorCode` from `./exit-codes.js`, and `exit-codes.ts` (line 11) imports `outputError` from `./output.js`. This creates a mutual dependency between the two modules. In Node.js ESM this works at runtime because both references are only used inside function bodies (not during module initialization), so live bindings are fully populated by call time. However, it's a fragile antipattern — if either module is ever extended to reference the other's export at the top level (e.g., a default value, a module-level variable), it will silently receive `undefined`. The clean fix is to move `resolveErrorCode` into a dedicated module (e.g., `error-codes.ts`) that has no circular dependency on `output.ts`.
```ts
analytics.initForNonInstaller();
// Fire-and-forget: recover events from previous crashes/exits.
// NO await — must not block startup (flush timeout is 3s).
recoverPendingEvents();
```
Concurrent `flush()` calls can send recovered events twice

`recoverPendingEvents()` is intentionally called without `await` so it doesn't block startup. All of its synchronous work (reading files, queueing events, deleting files) completes before yargs starts, but the `await telemetryClient.flush()` at the end of `recoverPendingEvents` yields to the event loop. The command handler then runs and its `wrapCommandHandler` `finally` block calls `telemetryClient.flush()` independently. If a previous-session pending file existed, both flush calls will execute concurrently: flush A snapshots the recovered events [R1, R2], flush B snapshots [R1, R2, C] (the newly queued command event). Flush A sends R1/R2 and then `splice(0, 2)` removes them; flush B has already sent R1/R2 in its payload, resulting in duplicate delivery. For most fast CLI commands (sub-second), this race window is open. A simple guard — checking whether a flush is already in progress before starting another — would close it.
```ts
return async (argv) => {
  const commandName = String(argv.__telemetryCommandName ?? 'unknown');
  const startTime = Number(argv.__telemetryStartTime ?? Date.now());
  const flags = (argv.__telemetryFlags as string[]) ?? [];

  try {
    await handler(argv);
    // Replace the provisional event with the real one, then patch in
    // the structured termination reason. Order matters: replace re-queues,
    // then recordTermination mutates the re-queued event in place.
    analytics.replaceLastCommandEvent(commandName, Date.now() - startTime, true, { flags });
    analytics.recordTermination('success');
  } catch (error) {
    const err = error instanceof Error ? error : new Error(String(error));
    analytics.replaceLastCommandEvent(commandName, Date.now() - startTime, false, {
      error: err,
      flags,
    });
    // Uncaught throw = crash. Clean exits (exitWithError/exitWithCode)
    // already recorded their own termination reason before exiting.
    analytics.recordTermination('crash', err.name);
    throw error;
  } finally {
    // Flush in-process so events are sent immediately, not deferred to next invocation.
    // If flush fails, store-forward persists on exit.
    await telemetryClient.flush().catch(() => {});
  }
};
```
`SKIP_TELEMETRY_COMMANDS` only guards the provisional event, not the wrap-time event

`SKIP_TELEMETRY_COMMANDS` prevents the middleware from queuing a provisional event for commands like `install` and `dashboard`. However, `wrapCommandHandler` always calls `analytics.replaceLastCommandEvent(...)`, which internally calls `telemetryClient.replaceLastEventOfType('command')` (a no-op if no event exists) and then unconditionally calls `analytics.commandExecuted(...)`, queuing a brand-new command event. If a developer accidentally wraps `install` or `dashboard` with `wrapCommandHandler` — following the CLAUDE.md guidance for new commands — those commands would still emit command-level telemetry, duplicating their session-level telemetry. Adding a guard inside `wrapCommandHandler` (e.g., checking whether `SKIP_TELEMETRY_COMMANDS` includes the resolved command name) would make the exclusion robust at the code level rather than relying solely on documentation.
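One way to sketch that code-level guard (the set contents come from the PR description; the helper name is hypothetical):

```typescript
// Commands excluded from command-level telemetry because they carry their
// own session-based telemetry (per the PR description).
const SKIP_TELEMETRY_COMMANDS = new Set(['install', 'dashboard', '$0']);

// Guard usable inside wrapCommandHandler as well as the middleware, so the
// exclusion holds even if a skipped command is accidentally wrapped.
function shouldEmitCommandTelemetry(resolvedName: string): boolean {
  const topLevel = resolvedName.split('.')[0] ?? resolvedName;
  return !SKIP_TELEMETRY_COMMANDS.has(topLevel);
}
```

Checking the canonical (alias-resolved) name here also closes the alias-bypass gap raised earlier in the review.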
```ts
case 'step': return `step(${(e as any).name}, ${(e as any).durationMs}ms, success=${(e as any).success})`;
case 'agent.tool': return `agent.tool(${(e as any).toolName}, ${(e as any).durationMs}ms)`;
case 'agent.llm': return `agent.llm(${(e as any).model}, in=${(e as any).inputTokens}, out=${(e as any).outputTokens})`;
```
The switch arms cast to `any` to access event-specific fields (`name`, `durationMs`, `toolName`, etc.) even though the discriminated union in `telemetry-types.ts` fully types each branch. Use a typed assertion (e.g., `e as StepEvent`) in each case instead to preserve type safety.

Suggested change:

```ts
case 'step': { const ev = e as import('./telemetry-types.js').StepEvent; return `step(${ev.name}, ${ev.durationMs}ms, success=${ev.success})`; }
case 'agent.tool': { const ev = e as import('./telemetry-types.js').AgentToolEvent; return `agent.tool(${ev.toolName}, ${ev.durationMs}ms)`; }
case 'agent.llm': { const ev = e as import('./telemetry-types.js').AgentLLMEvent; return `agent.llm(${ev.model}, in=${ev.inputTokens}, out=${ev.outputTokens})`; }
```
Summary
Right now we only have telemetry on the installer flow. This PR adds telemetry coverage to every CLI command, plus crash reporting and a debug flag.
What changed:
**Event type enrichment** -- `session.start`/`session.end` now carry environment fingerprint (OS, Node version, CI detection, shell). Step and tool events get explicit `startTimestamp` for accurate span reconstruction on the backend. Two new event types: `command` and `crash`.

**Command telemetry for all commands** -- every yargs command (both `registerSubcommand` and inline handlers) now emits a `command` event with canonical name, duration, success/failure, and which flags were used. Alias resolution normalizes `org` -> `organization`, `claim` -> `env.claim` so metrics don't fragment.

**Crash reporting** -- global `uncaughtException`/`unhandledRejection` handlers capture unhandled crashes with sanitized stack traces (home dir stripped, truncated to 4KB). Handlers are synchronous to avoid Node's async-handler footgun.

**Store-and-forward** -- events are persisted to a PID-based temp file on process exit via `writeFileSync`. The next CLI invocation recovers and sends them. This handles the ~25 locations that call `process.exit()` directly by queuing a provisional event in the middleware before the handler runs.

**Flush improvements** -- `flush()` now returns a boolean (sent vs retryable), uses `splice` instead of clearing the whole queue (protects events queued during an in-flight fetch), drops events on 4xx (permanent failures like 401 won't accumulate), and retains on 5xx/network errors for store-forward. Commands flush in-process via the handler wrapper rather than always deferring to the next invocation.

**`WORKOS_DEBUG=1`** -- enables verbose debug logging for all commands (not just the installer's `--debug` flag). Shows telemetry event details on flush.

Design decisions worth noting: `install`, `dashboard`, and the default `$0` commands are excluded from command-level telemetry since they have their own session-based telemetry.

Summary by CodeRabbit
Release Notes
Documentation
New Features
- `WORKOS_DEBUG` environment variable for enhanced verbose debug logging across all commands.

Improvements