Skip to content

feat: CLI telemetry for all commands + crash reporting#122

Open
nicknisi wants to merge 13 commits into
mainfrom
nicknisi/telemetry
Open

feat: CLI telemetry for all commands + crash reporting#122
nicknisi wants to merge 13 commits into
mainfrom
nicknisi/telemetry

Conversation

@nicknisi
Copy link
Copy Markdown
Member

@nicknisi nicknisi commented Apr 14, 2026

Summary

Right now we only have telemetry on the installer flow. This PR adds telemetry coverage to every CLI command, plus crash reporting and a debug flag.

What changed:

  • Event type enrichment -- session.start/end now carry environment fingerprint (OS, Node version, CI detection, shell). Step and tool events get explicit startTimestamp for accurate span reconstruction on the backend. Two new event types: command and crash.

  • Command telemetry for all commands -- every yargs command (both registerSubcommand and inline handlers) now emits a command event with canonical name, duration, success/failure, and which flags were used. Alias resolution normalizes org -> organization, claim -> env.claim so metrics don't fragment.

  • Crash reporting -- global uncaughtException/unhandledRejection handlers capture unhandled crashes with sanitized stack traces (home dir stripped, truncated to 4KB). Handlers are synchronous to avoid Node's async-handler footgun.

  • Store-and-forward -- events are persisted to a PID-based temp file on process exit via writeFileSync. Next CLI invocation recovers and sends them. Handles the ~25 locations that call process.exit() directly by queuing a provisional event in the middleware before the handler runs.

  • Flush improvements -- flush() now returns a boolean (sent vs retryable), uses splice instead of clearing the whole queue (protects events queued during an in-flight fetch), drops events on 4xx (permanent failures like 401 won't accumulate), and retains on 5xx/network errors for store-forward. Commands flush in-process via the handler wrapper rather than always deferring to next invocation.

  • WORKOS_DEBUG=1 -- enables verbose debug logging for all commands (not just the installer's --debug flag). Shows telemetry event details on flush.

Design decisions worth noting:

  • Non-installer telemetry only works for JWT-authenticated users. API-key-only users don't have a token the LLM gateway guard accepts, so their events are silently dropped. Acceptable for v1.
  • The install, dashboard, and default $0 commands are excluded from command-level telemetry since they have their own session-based telemetry.
  • The crash reporter is sync-only -- it queues the event and relies on store-forward's exit handler to persist it. No async flush attempts in the crash path.

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added telemetry documentation describing command handling and metrics collection.
  • New Features

    • Introduced WORKOS_DEBUG environment variable for enhanced verbose debug logging across all commands.
    • Added crash reporting and error recovery capabilities.
  • Improvements

    • Enhanced error reporting with structured context information for better diagnostics.
    • Improved command telemetry tracking and error classification.

Review Change Stack

nicknisi added 13 commits April 14, 2026 14:05
…amps, and new event types

Add environment fingerprint fields (OS, Node version, CI detection, shell)
to session.start and session.end events. Add startTimestamp to step and
agent.tool events for span reconstruction. Define command and crash event
types with stub emission methods. Add discriminated union Zod schema
validation tests mirroring the API schema.
… persistence

Wire up yargs middleware that emits a provisional command event before each
handler runs, then replaces it with actual duration/success on completion.
This covers the ~25 process.exit() call sites without modifying them.

- Command telemetry middleware with canonical name resolution and flag extraction
- Crash reporter with sanitized stack traces (sync handlers, no async)
- Store-forward: persist unsent events to temp file on exit, recover on next run
- Fix flush() to retain events until HTTP success (was clearing before fetch)
- Auto-wrap handlers in registerSubcommand() (single change point)
- Shared COMMAND_ALIASES map for telemetry and help-json
- analytics.initForNonInstaller() sets gatewayUrl + JWT from stored creds
- Enable debug output for non-installer commands via env var
- Log telemetry event details (type, name, duration, attributes) on flush
- Register in debug command's env var catalog
- Wrap inline command handlers (seed, setup-org, doctor, etc.) with
  wrapCommandHandler so they report real duration/success
- Skip provisional telemetry event for install command (has own session telemetry)
- Add claim -> env.claim to canonical alias map
- Defer store-forward file deletion until after successful flush
Client errors (401, 403) are permanent failures that won't succeed on
retry. Only retain events for 5xx (transient server errors) and network
failures where store-forward retry is meaningful.
- flush() returns true (sent/dropped) or false (retryable) so callers
  can act on the result
- Use splice(0, count) instead of clearing all events, protecting
  events queued concurrently during the fetch
- wrapCommandHandler flushes in-process so events are sent immediately
  instead of always deferring to next invocation via store-forward
- Store-forward recovery deletes files after loading into memory
  (events are re-persisted by exit handler if flush fails)
- Skip provisional events for dashboard and $0 (installer entry points)
- Add 4xx drop test coverage
Add a section to CLAUDE.md explaining which commands auto-emit telemetry
(registerSubcommand) versus which need manual wrapCommandHandler wrapping
(inline top-level .command() calls). Add a pointer comment in bin.ts near
the workflow commands block.

Prevents new top-level commands from silently emitting duration=0 telemetry.
- Add workos.user_id to command and crash events (from stored credentials
  or unclaimed environment clientId) so dashboards can count unique users
- Add cli.version to command and crash events for release adoption tracking
- Support claim-token auth path on the telemetry client, so unclaimed
  environments' telemetry reaches the API (guard accepts this path too)
- Rename CrashEvent's installer.version to cli.version (crashes happen
  outside the installer too)
- initForNonInstaller() now wires up user_id and claim-token auth
Closes security-audit finding #1 on PR #122 (telemetry message
sanitization). `error.message` was flowing into 4 capture sites
unsanitized, leaking homedir paths (and rarely, credentials) to the
WorkOS gateway.

- Add `sanitizeMessage()` in crash-reporter.ts: homedir strip + Bearer/
  sk_*/JWT redaction + 1KB truncation.
- Factor secret redaction into shared `redactSecrets()` used by both
  `sanitizeMessage` and `sanitizeStack` (Node echoes `.message` into the
  leading `Error.stack` line, so message-only sanitization was
  insufficient).
- Add private `extractErrorFields()` chokepoint on `Analytics`; route
  all 4 capture sites through it (`captureException`, `stepCompleted`,
  `commandExecuted`, `captureUnhandledCrash`). `replaceLastCommandEvent`
  inherits sanitization via its delegation to `commandExecuted`.
- `captureUnhandledCrash` now uses `sanitizeStack` instead of inline
  truncation, providing defense-in-depth for callers that bypass the
  crash-reporter wrapper.
- Add regression guard test (`telemetry-sanitize.spec.ts`): poisons
  every capture method with homedir + Bearer + sk_live_ + JWT, asserts
  no marker reaches the serialized queue.

Reviewed: ideation:reviewer cycle 1 PASS (0 critical, 0 high).
Introduces two additive telemetry signals requested by the signals spec:

- device.id — persistent per-install UUID stored at ~/.workos/device-id.
  File IO failures fall through to a one-shot session UUID; never throws.
- auth.mode — 4-state enum ('jwt' | 'claim_token' | 'api_key' | 'none')
  derived during initForNonInstaller with JWT > claim_token > api_key
  priority. Installer flow sets it in run-with-core after credentials
  resolve.

Both fields are injected via getEnvFingerprint so they appear on every
event that already carries env context (session.start, session.end,
command, crash). No change to downstream gateway — existing deliveries
accept the new attrs as optional.

Implements Phase 1 of docs/ideation/telemetry-signals/spec-phase-1.md.
Gateway-side type mirroring is tracked separately in the gateway repo.
…and events

Replace the boolean `command.success` as the primary outcome dimension with
a structured `termination.reason` enum (success | cancelled | auth_required
| validation_error | api_error | crash) plus a `error.code` string.

- `exitWithError`, `exitWithCode`, `exitWithAuthRequired` now patch the
  provisional command event via `analytics.recordTermination` before
  calling `process.exit`. Previously the queued provisional event (success
  true, duration 0) was persisted as-is by store-forward, producing
  misleading dashboard data.
- `exitWithError` now honors the string error code for exit mapping:
  `auth_required` -> 4, `cancelled` -> 2, `http_*`/`not_found`/`unknown_error`
  -> 1 (api_error), everything else -> 1 (validation_error). Previously it
  always exited 1.
- New `TelemetryClient.patchLastEventOfType` helper mutates a queued event
  in place. Unlike `replaceLastEventOfType`, it preserves the event so
  multiple callers can update fields incrementally.
- `wrapCommandHandler` records `reason: 'success'` on clean completion and
  `reason: 'crash'` with `error.name` on uncaught throw.
- `command.success` remains on CommandEvent for backward-compat.
- `api.status`/`api.code`/`api.resource` fields added to CommandEvent and
  plumbed through `recordTermination` for Phase 3 consumption.

Review passed (2 medium + 1 low non-blocking; low fixed in same commit for
crash-path test symmetry).
…mmand events

Wires `createApiErrorHandler` through to the provisional command event via
an optional `apiContext` param on `exitWithError`, populating `api.status`,
`api.code`, and `api.resource` attributes on API-failure command events.

Also treats `apiContext` presence as ground truth for `api_error` termination
reason — WorkOS error codes like `rate_limited` or `validation_error` would
otherwise fall through `resolveErrorCode` to `validation_error`, hiding
legitimate API failures from api_error dashboards.

The `CommandEvent` type already had the `api.*` attribute slots (added as
forward-compat in Phase 2); only the wiring and tests are new.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Walkthrough

This PR adds comprehensive telemetry infrastructure to the CLI, enabling command execution tracking, error context capture, device identification, crash reporting, and event persistence through transient network failures. It defines new telemetry event types for commands and crashes, instruments the Analytics class with command and auth-mode tracking, wires telemetry into command execution via middleware and handler wrapping, and integrates store-and-forward recovery into the CLI bootstrap.

Changes

Telemetry Infrastructure & Command Tracking

Layer / File(s) Summary
Telemetry event types and schema
src/utils/telemetry-types.ts, src/utils/telemetry-schema.spec.ts
Defines CommandEvent and CrashEvent interfaces, AuthMode and TerminationReason type unions, enriches SessionStartEvent/StepEvent/AgentToolEvent with device/auth/environment fields and startTimestamp, and validates all 7 event variants via Zod schema tests.
Device ID generation and crash reporting
src/lib/device-id.ts, src/lib/device-id.spec.ts, src/utils/crash-reporter.ts, src/utils/crash-reporter.spec.ts
Implements persistent per-user UUID stored in ~/.workos/device-id with fallback to session-scoped UUID, and installs global process handlers that sanitize error stacks/messages before capturing via telemetry to prevent leaking secrets and tokens.
Telemetry client with event queue mutations and persistence
src/utils/telemetry-client.ts, src/utils/telemetry-client.spec.ts
Extends client with setClaimTokenAuth for unclaimed environments, event-queue mutations (replaceLastEventOfType, patchLastEventOfType, queueEvents), persistToFile for disk persistence, and updated flush() to return boolean and retain events on transient (5xx) failures for retry.
Store-and-forward telemetry recovery
src/utils/telemetry-store-forward.ts, src/utils/telemetry-store-forward.spec.ts
Registers process-exit handler to persist pending events and provides recoverPendingEvents() to reload and batch-retry previous pending events from disk on startup.
Analytics class with command and auth-mode tracking
src/utils/analytics.ts, src/utils/analytics.spec.ts, src/utils/telemetry-sanitize.spec.ts
Adds commandExecuted, replaceLastCommandEvent, recordTermination for command lifecycle tracking; setAuthMode, initForNonInstaller for auth configuration; enriches session/step/tool/crash attributes with computed device/auth/environment fingerprinting; sanitizes error types/messages/stacks to redact secrets/tokens/JWTs from payloads.
Command telemetry middleware and handler wrapping
src/utils/command-telemetry.ts, src/utils/command-telemetry.spec.ts
Provides commandTelemetryMiddleware to queue provisional command events with extracted flags before handler execution, and wrapCommandHandler to update events post-execution with real duration, success/failure outcome, and termination reason (success or crash with error type).
Canonical command alias registry
src/lib/command-aliases.ts
Establishes single source of truth for user-facing aliases (e.g., orgorganization, claimenv.claim) mapping to canonical command names for consistent telemetry aggregation and help-JSON output.
Error handling with apiContext and exit-code mapping
src/lib/api-error-handler.ts, src/lib/api-error-handler.spec.ts, src/utils/exit-codes.ts, src/utils/exit-codes.spec.ts, src/utils/output.ts, src/utils/output.spec.ts
Enriches API error handlers to capture structured apiContext (status/code/resource), adds resolveErrorCode to map error codes to TerminationReason with exit codes, and updates error-exit paths to record termination telemetry and use resolved exit codes before process exit.
Auth-mode configuration integration
src/lib/run-with-core.ts
Integrates auth-mode tracking into checkAuthentication actor, setting claim_token for unclaimed environments, api_key for claimed environments, and jwt for credential-store flows.
Subcommand handler auto-wrapping
src/utils/register-subcommand.ts
Conditionally wraps subcommand handlers with wrapCommandHandler at registration time when WORKOS_TELEMETRY_ENABLED is true.
CLI bootstrap telemetry initialization
src/bin.ts
Initializes telemetry, crash reporting, and store-forward recovery at startup; dynamically enables debug logging when WORKOS_DEBUG=1; adds commandTelemetryMiddleware to yargs; wraps top-level command handlers (doctor, seed, setup-org, onboard-user, debug-sso, debug-sync, claim, emulate, dev); resolves help-JSON commands through COMMAND_ALIASES.
Documentation and environment variables
CLAUDE.md, src/commands/debug.ts
Adds WORKOS_DEBUG entry to environment variable catalog and documents telemetry wiring pattern: automatic wrapping for registerSubcommand(), required wrapCommandHandler() for top-level .command() handlers, telemetry exclusions via SKIP_TELEMETRY_COMMANDS, and guidance on registering command aliases.

Possibly related PRs

  • workos/cli#143: Modifies top-level help/JSON command resolution and command alias handling in src/bin.ts, related to canonical command name resolution.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: CLI telemetry for all commands + crash reporting' clearly and concisely summarizes the main change—extending telemetry coverage from installer-only to all commands and adding crash reporting functionality.
Docstring Coverage ✅ Passed Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch nicknisi/telemetry
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch nicknisi/telemetry

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (4)
src/utils/telemetry-types.ts (1)

58-58: ⚡ Quick win

Align startTimestamp optionality with backward-compat contract.

telemetry-schema.spec.ts explicitly accepts step/agent.tool without startTimestamp, but these interfaces currently require it. Making both optional avoids contract drift.

Suggested patch
 export interface StepEvent extends TelemetryEvent {
   type: 'step';
   name: string;
-  startTimestamp: string;
+  startTimestamp?: string;
   durationMs: number;
   success: boolean;
   error?: {
@@
 export interface AgentToolEvent extends TelemetryEvent {
   type: 'agent.tool';
   toolName: string;
-  startTimestamp: string;
+  startTimestamp?: string;
   durationMs: number;
   success: boolean;
 }

Also applies to: 70-70

src/utils/telemetry-client.spec.ts (1)

60-105: ⚡ Quick win

Add coverage for claim-token auth headers.

setClaimTokenAuth() is new behavior in this PR, but there’s no test asserting x-workos-claim-token + x-workos-client-id emission (and bearer-token precedence when both exist).

Also applies to: 124-218

src/utils/telemetry-store-forward.ts (1)

1-3: ⚡ Quick win

Use async fs APIs for startup recovery path.

recoverPendingEvents() (Line 25+) is async but uses sync disk calls (Lines 27-56), which blocks startup and conflicts with the project rule for TS files.

As per coding guidelines "Avoid Node-specific sync APIs (crypto, fs sync) unless necessary".

Also applies to: 27-56

src/utils/output.ts (1)

123-124: ⚡ Quick win

Protect process termination from telemetry failures

analytics.recordTermination(...) on Line 123 can throw and prevent Line 124 from executing. Error exits should remain deterministic even when telemetry is unhealthy.

🔧 Proposed fix
   const reason = error.apiContext ? 'api_error' : codeReason;
-  analytics.recordTermination(reason, error.code, error.apiContext);
-  process.exit(exit);
+  try {
+    analytics.recordTermination(reason, error.code, error.apiContext);
+  } finally {
+    process.exit(exit);
+  }
 }

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c45afda9-659c-43a0-9646-cd08dd0cabe3

📥 Commits

Reviewing files that changed from the base of the PR and between 524c709 and 31829ec.

📒 Files selected for processing (27)
  • CLAUDE.md
  • src/bin.ts
  • src/commands/debug.ts
  • src/lib/api-error-handler.spec.ts
  • src/lib/api-error-handler.ts
  • src/lib/command-aliases.ts
  • src/lib/device-id.spec.ts
  • src/lib/device-id.ts
  • src/lib/run-with-core.ts
  • src/utils/analytics.spec.ts
  • src/utils/analytics.ts
  • src/utils/command-telemetry.spec.ts
  • src/utils/command-telemetry.ts
  • src/utils/crash-reporter.spec.ts
  • src/utils/crash-reporter.ts
  • src/utils/exit-codes.spec.ts
  • src/utils/exit-codes.ts
  • src/utils/output.spec.ts
  • src/utils/output.ts
  • src/utils/register-subcommand.ts
  • src/utils/telemetry-client.spec.ts
  • src/utils/telemetry-client.ts
  • src/utils/telemetry-sanitize.spec.ts
  • src/utils/telemetry-schema.spec.ts
  • src/utils/telemetry-store-forward.spec.ts
  • src/utils/telemetry-store-forward.ts
  • src/utils/telemetry-types.ts

Comment thread src/lib/device-id.ts
import os from 'node:os';
import crypto from 'node:crypto';

const UUID_REGEX = /^[0-9a-f-]{36}$/i;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten UUID validation before accepting persisted IDs.

The current regex accepts many non-UUID strings with 36 hex/hyphen chars. Use a strict UUID v4/v1-5 pattern (or z.uuid() in validation layer) before reusing the file value.

Suggested patch
-const UUID_REGEX = /^[0-9a-f-]{36}$/i;
+const UUID_REGEX =
+  /^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

Also applies to: 34-35

Comment thread src/lib/device-id.ts
Comment on lines +31 to +43
try {
if (fs.existsSync(filePath)) {
const raw = fs.readFileSync(filePath, 'utf8').trim();
if (UUID_REGEX.test(raw)) {
cached = raw;
return raw;
}
}

const id = crypto.randomUUID();
fs.mkdirSync(path.dirname(filePath), { recursive: true, mode: 0o700 });
fs.writeFileSync(filePath, id, { encoding: 'utf8', mode: 0o600 });
cached = id;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Replace sync fs usage in the command path.

Line 31–43 and Line 49 use synchronous fs/crypto APIs on a hot CLI path. Please move this to async (node:fs/promises) and cache a pending promise to keep call sites simple.

As per coding guidelines src/**/*.ts: “Avoid Node-specific sync APIs (crypto, fs sync) unless necessary”.

Also applies to: 49-50

Comment thread src/utils/analytics.ts
Comment on lines +93 to +99
const env = getActiveEnvironment();
if (env && isUnclaimedEnvironment(env)) {
telemetryClient.setClaimTokenAuth(env.clientId, env.claimToken);
// Tag distinctId so unclaimed sessions are identifiable in analytics
this.distinctId = this.distinctId ?? `unclaimed:${env.clientId}`;
if (this.authMode === 'none') this.authMode = 'claim_token';
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor declared auth priority when unclaimed env is present.

Line 95 currently sets claim-token auth unconditionally, even if JWT auth was already set on Line 83. That can override the documented priority and misclassify/downgrade authenticated sessions.

Suggested fix
-      if (env && isUnclaimedEnvironment(env)) {
+      if (env && isUnclaimedEnvironment(env) && this.authMode === 'none') {
         telemetryClient.setClaimTokenAuth(env.clientId, env.claimToken);
         // Tag distinctId so unclaimed sessions are identifiable in analytics
         this.distinctId = this.distinctId ?? `unclaimed:${env.clientId}`;
-        if (this.authMode === 'none') this.authMode = 'claim_token';
+        this.authMode = 'claim_token';
       }

Comment on lines +30 to +34
export function extractUserFlags(rawArgs: string[]): string[] {
const passedFlags = rawArgs
.filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2))
.map((arg) => arg.replace(/^-+/, '').split('=')[0]);
return [...new Set(passedFlags)];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden flag extraction to avoid invalid telemetry flags.

Line 32/33 currently treats -- and values like -1 as flags, which can pollute command.flags.

Suggested fix
 export function extractUserFlags(rawArgs: string[]): string[] {
   const passedFlags = rawArgs
-    .filter((arg) => arg.startsWith('--') || (arg.startsWith('-') && arg.length === 2))
-    .map((arg) => arg.replace(/^-+/, '').split('=')[0]);
+    .filter((arg) => {
+      if (arg === '--') return false;
+      if (/^--[A-Za-z][\w-]*(=.*)?$/.test(arg)) return true;
+      if (/^-[A-Za-z]$/.test(arg)) return true;
+      return false;
+    })
+    .map((arg) => arg.replace(/^-+/, '').split('=')[0])
+    .filter(Boolean);
   return [...new Set(passedFlags)];
 }

Comment on lines +60 to +61
const topLevelCommand = commandParts[0] ?? '';
if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize the skip check using canonical top-level command.

Line 60/61 checks the raw command token. If a skipped command is invoked via alias, the exemption can be bypassed and telemetry still emitted.

Suggested fix
-    const topLevelCommand = commandParts[0] ?? '';
+    const rawTopLevelCommand = commandParts[0] ?? '';
+    const topLevelCommand = COMMAND_ALIASES[rawTopLevelCommand] ?? rawTopLevelCommand;
     if (SKIP_TELEMETRY_COMMANDS.has(topLevelCommand)) return;

export function sanitizeStack(stack: string | undefined): string {
if (!stack) return '';
let sanitized = stack.replaceAll(HOME, '~');
sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

find . -name "crash-reporter.ts" -type f

Repository: workos/cli

Length of output: 84


🏁 Script executed:

cat -n src/utils/crash-reporter.ts

Repository: workos/cli

Length of output: 3137


🏁 Script executed:

python3 << 'EOF'
import re

# Test the current regex against various path formats
current_pattern = r'/[^\s:]+/(node_modules|dist|src)/'
windows_pattern = r'[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\'

# Test cases
test_cases = [
    # POSIX paths
    "/home/user/project/node_modules/lib/index.js",
    "/var/app/dist/main.js",
    "/Users/dev/src/index.ts",
    # Windows paths
    "C:\\Users\\dev\\project\\node_modules\\lib\\index.js",
    "D:\\app\\dist\\main.js",
    "E:\\code\\src\\index.ts",
    # Stack trace examples
    "  at functionName (/home/user/project/node_modules/pkg/index.js:10:5)",
    "  at functionName (C:\\Users\\dev\\project\\node_modules\\pkg\\index.js:10:5)",
]

print("Testing CURRENT regex pattern:")
print(f"Pattern: {current_pattern}")
for test in test_cases:
    match = re.search(current_pattern, test)
    result = re.sub(current_pattern, r'\1/', test) if match else test
    print(f"  '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'")

print("\nTesting PROPOSED Windows pattern:")
print(f"Pattern: {windows_pattern}")
for test in test_cases:
    match = re.search(windows_pattern, test)
    result = re.sub(windows_pattern, r'\1\\', test) if match else test
    print(f"  '{test[:60]}...' -> Match: {bool(match)}, Result: '{result[:60]}...'")

print("\nTesting COMBINED patterns:")
for test in test_cases:
    result = test
    result = re.sub(current_pattern, r'\1/', result)
    result = re.sub(windows_pattern, r'\1\\', result)
    print(f"  '{test[:60]}...' -> Result: '{result[:60]}...'")
EOF

Repository: workos/cli

Length of output: 2630


Windows absolute paths leak in stack trace sanitization

Line 26 only collapses POSIX-style paths (/path/to/node_modules/). Windows paths (e.g., C:\Users\...\node_modules\) bypass this regex entirely and leak full local filesystem paths into telemetry payloads.

Regex testing confirms the current pattern does not match Windows absolute paths:

  • C:\Users\dev\project\node_modules\pkg\index.js → remains unsanitized
  • /home/user/project/node_modules/pkg/index.js → correctly collapsed to node_modules/pkg/index.js
🔧 Proposed fix
 export function sanitizeStack(stack: string | undefined): string {
   if (!stack) return '';
   let sanitized = stack.replaceAll(HOME, '~');
-  sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');
+  sanitized = sanitized
+    .replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/')
+    .replace(/[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\/g, '$1\\');
   sanitized = redactSecrets(sanitized);
   return sanitized.length > MAX_STACK_LENGTH
     ? sanitized.slice(0, MAX_STACK_LENGTH) + '\n...[truncated]'
     : sanitized;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sanitized = sanitized.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/');
export function sanitizeStack(stack: string | undefined): string {
if (!stack) return '';
let sanitized = stack.replaceAll(HOME, '~');
sanitized = sanitized
.replace(/\/[^\s:]+\/(node_modules|dist|src)\//g, '$1/')
.replace(/[A-Za-z]:\\[^\s:]+\\(node_modules|dist|src)\\/g, '$1\\');
sanitized = redactSecrets(sanitized);
return sanitized.length > MAX_STACK_LENGTH
? sanitized.slice(0, MAX_STACK_LENGTH) + '\n...[truncated]'
: sanitized;
}

Comment thread src/utils/exit-codes.ts
Comment on lines 72 to 77
export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
if (error) {
outputError(error);
}
analytics.recordTermination(reasonForExitCode(code), error?.code);
process.exit(code);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Classify termination from error.code when available.

Line 76 derives reason only from numeric exit code, so all code-1 exits become validation_error. That misclassifies API failures like http_500 when they go through exitWithCode.

Suggested fix
 export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
   if (error) {
     outputError(error);
   }
-  analytics.recordTermination(reasonForExitCode(code), error?.code);
+  const reason = error ? resolveErrorCode(error.code).reason : reasonForExitCode(code);
+  analytics.recordTermination(reason, error?.code);
   process.exit(code);
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
if (error) {
outputError(error);
}
analytics.recordTermination(reasonForExitCode(code), error?.code);
process.exit(code);
export function exitWithCode(code: ExitCodeValue, error?: { code: string; message: string }): never {
if (error) {
outputError(error);
}
const reason = error ? resolveErrorCode(error.code).reason : reasonForExitCode(code);
analytics.recordTermination(reason, error?.code);
process.exit(code);
}

Comment on lines +85 to 94
async flush(): Promise<boolean> {
if (this.events.length === 0) return true;
if (!this.gatewayUrl) {
debug('[Telemetry] No gateway URL configured, skipping flush');
return;
return false;
}

const payload: TelemetryRequest = { events: [...this.events] };
this.events = [];
const count = this.events.length;
const payload: TelemetryRequest = { events: this.events.slice(0, count) };

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Guard against concurrent flush() calls to prevent event loss/duplication.

Line 85 currently allows overlapping flushes. Two concurrent calls can send the same snapshot twice, and later splice(0, count) can remove events queued after the first flush.

Suggested patch
 export class TelemetryClient {
   private events: TelemetryEvent[] = [];
+  private flushInFlight: Promise<boolean> | null = null;
@@
   async flush(): Promise<boolean> {
+    if (this.flushInFlight) return this.flushInFlight;
+    this.flushInFlight = this.flushInternal();
+    try {
+      return await this.flushInFlight;
+    } finally {
+      this.flushInFlight = null;
+    }
+  }
+
+  private async flushInternal(): Promise<boolean> {
     if (this.events.length === 0) return true;
@@
-  }
+  }
 }

Also applies to: 135-137, 143-144

Comment on lines +164 to +166
mkdirSync(dirname(filePath), { recursive: true });
writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');
this.events = [];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist telemetry with restrictive filesystem permissions.

persistToFile() currently relies on default permissions. Since payloads can include user/device identifiers, write with 0700 dir and 0600 file modes.

Suggested patch
-      mkdirSync(dirname(filePath), { recursive: true });
-      writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');
+      mkdirSync(dirname(filePath), { recursive: true, mode: 0o700 });
+      writeFileSync(filePath, JSON.stringify(this.events), { encoding: 'utf-8', mode: 0o600 });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mkdirSync(dirname(filePath), { recursive: true });
writeFileSync(filePath, JSON.stringify(this.events), 'utf-8');
this.events = [];
mkdirSync(dirname(filePath), { recursive: true, mode: 0o700 });
writeFileSync(filePath, JSON.stringify(this.events), { encoding: 'utf-8', mode: 0o600 });
this.events = [];

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR wires full command-level telemetry across the CLI: every yargs command now emits a command event with duration, flags, and structured termination reason, plus a new crash event type for unhandled exceptions via synchronous global handlers.

  • New infrastructure: crash-reporter.ts (sync uncaughtException/unhandledRejection handlers), telemetry-store-forward.ts (PID-based pending-file persistence + recovery), device-id.ts (stable UUID at ~/.workos/device-id), and command-aliases.ts (single alias source of truth for telemetry and help-json).
  • Middleware + wrapper pattern: commandTelemetryMiddleware queues a provisional event before any handler runs (ensuring store-forward captures even commands that call process.exit() directly), and wrapCommandHandler replaces it with real duration/outcome on completion.
  • flush() improvements: uses splice to protect events queued during an in-flight fetch, drops on 4xx, retains on 5xx/network error, and returns a boolean to communicate retry vs. drop semantics to callers.

Confidence Score: 4/5

The telemetry infrastructure is additive and fail-safe — every path is wrapped in try/catch so telemetry failures cannot crash the CLI. The main areas to watch are the concurrent-flush race (recovered events can be sent twice when previous-session files exist) and the circular import between output.ts and exit-codes.ts.

All new code paths are purely telemetry — they don't affect command output, user-visible behavior, or data integrity. The circular import between output.ts and exit-codes.ts works at runtime due to ESM live bindings but is a maintenance risk. The concurrent flush race produces duplicate telemetry events for recovered data, which is a quality concern rather than a correctness one. Both issues are localized to the telemetry subsystem.

src/utils/output.ts and src/utils/exit-codes.ts (circular import), src/utils/telemetry-store-forward.ts and src/bin.ts (concurrent flush race), src/utils/command-telemetry.ts (SKIP_TELEMETRY_COMMANDS enforcement gap).

Important Files Changed

Filename Overview
src/bin.ts Installs crash reporter, store-forward, and gateway init at startup; adds commandTelemetryMiddleware; wraps inline handlers with wrapCommandHandler. recoverPendingEvents() is unawaited, enabling a concurrent-flush race with the command handler's flush.
src/utils/analytics.ts Adds commandExecuted, replaceLastCommandEvent, recordTermination, and captureUnhandledCrash methods; initializes telemetry for non-installer commands with JWT/claim-token/API-key priority. Logic is sound; sanitizeMessage/sanitizeStack are applied at every error-capture boundary.
src/utils/command-telemetry.ts New module: middleware queues provisional command events and wrapCommandHandler replaces them after execution. SKIP_TELEMETRY_COMMANDS only prevents the provisional event — wrapCommandHandler itself still emits an event if accidentally applied to excluded commands.
src/utils/telemetry-client.ts Adds replaceLastEventOfType, patchLastEventOfType, queueEvents, and persistToFile; improves flush() to use splice and return a boolean. Debug switch uses (e as any) casts that bypass the discriminated-union types.
src/utils/telemetry-store-forward.ts PID-based pending-file persistence with recovery on next invocation. Files are deleted before flush, relying on in-memory retention + exit handler for re-persistence if flush fails. Concurrent flush with wrapCommandHandler can duplicate recovered events.
src/utils/crash-reporter.ts Synchronous crash handlers for uncaughtException/unhandledRejection with re-entry guard. Sanitization of home dir, secrets, and stack truncation is correctly applied.
src/utils/output.ts Refactors exitWithError to resolve exit codes and record telemetry termination. Introduces a circular import with exit-codes.ts that works at runtime but is a structural antipattern.
src/utils/exit-codes.ts Adds resolveErrorCode and ERROR_CODE_MAP for structured termination reasons; records telemetry on exitWithCode. Part of the circular import with output.ts.
src/lib/device-id.ts New module: lazily creates and caches a stable device UUID at ~/.workos/device-id. IO failures fall through to a session-scoped UUID. Correct permissions (0o700 dir, 0o600 file).
src/utils/telemetry-types.ts Adds CommandEvent, CrashEvent, AuthMode, and TerminationReason types. Schema is clean and well-structured.

Sequence Diagram

sequenceDiagram
    participant bin as bin.ts (startup)
    participant mw as commandTelemetryMiddleware
    participant wrap as wrapCommandHandler
    participant ana as Analytics
    participant tc as TelemetryClient
    participant sf as StoreForward (exit)
    participant disk as Disk (tmp file)

    bin->>tc: installCrashReporter()
    bin->>sf: installStoreForward()
    bin->>ana: initForNonInstaller()
    bin-->>tc: recoverPendingEvents() [no await]
    tc->>disk: "read pending-*.json"
    disk-->>tc: recovered events
    tc->>disk: delete files
    tc-->>tc: flush() [async, concurrent]

    bin->>mw: middleware(rawArgs)
    mw->>ana: commandExecuted(name, 0, true) [provisional]
    ana->>tc: queueEvent(CommandEvent)

    mw->>wrap: handler(argv)
    wrap->>wrap: await handler(argv)
    wrap->>ana: replaceLastCommandEvent(name, duration, success)
    ana->>tc: replaceLastEventOfType + queueEvent
    wrap->>ana: recordTermination(reason)
    ana->>tc: patchLastEventOfType
    wrap->>tc: flush() [in-process]
    tc-->>tc: splice(0, count) on success

    Note over tc,disk: On any exit (including process.exit)
    sf->>tc: persistToFile(pending-PID.json)
    tc->>disk: writeFileSync (sync)
Loading

Reviews (1): Last reviewed commit: "feat(telemetry): api.status, api.code, a..." | Re-trigger Greptile

Comment thread src/utils/output.ts

import chalk from 'chalk';
import { analytics } from './analytics.js';
import { resolveErrorCode } from './exit-codes.js';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Circular import between output.ts and exit-codes.ts

output.ts imports resolveErrorCode from ./exit-codes.js, and exit-codes.ts (line 11) imports outputError from ./output.js. This creates a mutual dependency between the two modules. In Node.js ESM this works at runtime because both references are only used inside function bodies (not during module initialization), so live bindings are fully populated by call time. However, it's a fragile antipattern — if either module is ever extended to reference the other's export at the top level (e.g., a default value, a module-level variable), it will silently receive undefined. The clean fix is to move resolveErrorCode into a dedicated module (e.g., error-codes.ts) that has no circular dependency on output.ts.

Comment thread src/bin.ts
analytics.initForNonInstaller();
// Fire-and-forget: recover events from previous crashes/exits.
// NO await — must not block startup (flush timeout is 3s).
recoverPendingEvents();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Concurrent flush() calls can send recovered events twice

recoverPendingEvents() is intentionally called without await so it doesn't block startup. All of its synchronous work (reading files, queueing events, deleting files) completes before yargs starts, but the await telemetryClient.flush() at the end of recoverPendingEvents yields to the event loop. The command handler then runs and its wrapCommandHandler.finally block calls telemetryClient.flush() independently. If a previous-session pending file existed, both flush calls will execute concurrently: Flush A snapshots the recovered events [R1, R2], Flush B snapshots [R1, R2, C] (the newly-queued command event). Flush A sends R1/R2 and then splice(0, 2) removes them; Flush B has already sent R1/R2 in its payload, resulting in duplicate delivery. For most fast CLI commands (sub-second), this race window is open. A simple guard — checking whether a flush is already in progress before starting another — would close it.

Comment on lines +77 to +104
return async (argv) => {
const commandName = String(argv.__telemetryCommandName ?? 'unknown');
const startTime = Number(argv.__telemetryStartTime ?? Date.now());
const flags = (argv.__telemetryFlags as string[]) ?? [];

try {
await handler(argv);
// Replace the provisional event with the real one, then patch in
// the structured termination reason. Order matters: replace re-queues,
// then recordTermination mutates the re-queued event in place.
analytics.replaceLastCommandEvent(commandName, Date.now() - startTime, true, { flags });
analytics.recordTermination('success');
} catch (error) {
const err = error instanceof Error ? error : new Error(String(error));
analytics.replaceLastCommandEvent(commandName, Date.now() - startTime, false, {
error: err,
flags,
});
// Uncaught throw = crash. Clean exits (exitWithError/exitWithCode)
// already recorded their own termination reason before exiting.
analytics.recordTermination('crash', err.name);
throw error;
} finally {
// Flush in-process so events are sent immediately, not deferred to next invocation.
// If flush fails, store-forward persists on exit.
await telemetryClient.flush().catch(() => {});
}
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 SKIP_TELEMETRY_COMMANDS only guards the provisional event, not the wrap-time event

SKIP_TELEMETRY_COMMANDS prevents the middleware from queuing a provisional event for commands like install and dashboard. However, wrapCommandHandler always calls analytics.replaceLastCommandEvent(...), which internally calls telemetryClient.replaceLastEventOfType('command') (no-op if no event exists) and then unconditionally calls analytics.commandExecuted(...), queuing a brand-new command event. If a developer accidentally wraps install or dashboard with wrapCommandHandler — following the CLAUDE.md guidance for new commands — those commands would still emit command-level telemetry, duplicating their session-level telemetry. Adding a guard inside wrapCommandHandler (e.g., checking whether SKIP_TELEMETRY_COMMANDS includes the resolved command name) would make the exclusion robust at the code level rather than relying solely on documentation.

Comment on lines +118 to +120
case 'step': return `step(${(e as any).name}, ${(e as any).durationMs}ms, success=${(e as any).success})`;
case 'agent.tool': return `agent.tool(${(e as any).toolName}, ${(e as any).durationMs}ms)`;
case 'agent.llm': return `agent.llm(${(e as any).model}, in=${(e as any).inputTokens}, out=${(e as any).outputTokens})`;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The switch arms cast to any to access event-specific fields (name, durationMs, toolName, etc.) even though the discriminated union in telemetry-types.ts fully types each branch. Use a typed assertion (e.g., e as StepEvent) in each case instead to preserve type safety.

Suggested change
case 'step': return `step(${(e as any).name}, ${(e as any).durationMs}ms, success=${(e as any).success})`;
case 'agent.tool': return `agent.tool(${(e as any).toolName}, ${(e as any).durationMs}ms)`;
case 'agent.llm': return `agent.llm(${(e as any).model}, in=${(e as any).inputTokens}, out=${(e as any).outputTokens})`;
case 'step': { const ev = e as import('./telemetry-types.js').StepEvent; return `step(${ev.name}, ${ev.durationMs}ms, success=${ev.success})`; }
case 'agent.tool': { const ev = e as import('./telemetry-types.js').AgentToolEvent; return `agent.tool(${ev.toolName}, ${ev.durationMs}ms)`; }
case 'agent.llm': { const ev = e as import('./telemetry-types.js').AgentLLMEvent; return `agent.llm(${ev.model}, in=${ev.inputTokens}, out=${ev.outputTokens})`; }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant