@MSNev MSNev commented Sep 2, 2025

  • Initial Discussion

@MSNev MSNev requested a review from a team as a code owner September 2, 2025 00:21
@MSNev MSNev requested review from a team and Copilot and removed request for a team September 2, 2025 00:21
Copilot AI left a comment

Pull Request Overview

This PR introduces initial planning documentation for the OpenTelemetry Web SDK (OTelWebSdk), establishing the foundational architecture and specifications for a comprehensive web telemetry solution.

Purpose: Create detailed technical specifications and planning documents for implementing an OpenTelemetry-based web SDK that provides both standards compliance and performance-optimized features for JavaScript applications.

Key Changes:

  • Comprehensive component specifications covering core SDK, trace, log, metric, and context implementations
  • Detailed usage examples demonstrating multi-instance patterns and Application Insights compatibility
  • Performance strategy, testing framework, and migration planning documentation

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| docs/planning/otel/specs/README.md | Index document organizing all OTelWebSdk specifications with reading order guidance |
| docs/planning/otel/specs/OTelWebSdk-UsageExamples.md | Comprehensive usage examples showing multi-team patterns and API usage scenarios |
| docs/planning/otel/specs/OTelWebSdk-Trace.md | Detailed trace implementation specification using closures and DynamicProto patterns |
| docs/planning/otel/specs/OTelWebSdk-Testing.md | Testing strategy covering unit, integration, performance, and browser compatibility testing |
| docs/planning/otel/specs/OTelWebSdk-TelemetryInitializers.md | Telemetry processing specification with OpenTelemetry processors and lightweight initializers |
| docs/planning/otel/specs/OTelWebSdk-Performance.md | Performance optimization strategy with targets, monitoring, and best practices |
| docs/planning/otel/specs/OTelWebSdk-Migration.md | Migration planning framework for transitioning from existing telemetry solutions |
| docs/planning/otel/specs/OTelWebSdk-Metric.md | Basic metric implementation specification with simple counter, histogram, and gauge support |
| docs/planning/otel/specs/OTelWebSdk-Log.md | Log implementation specification with structured logging and level-based filtering |

- **Bundle Size Sensitivity**: Web applications must minimize JavaScript bundle size for performance
- **Tree-Shaking Requirements**: Dead code elimination is critical for production builds
- **Browser Compatibility**: Runtime requirement with graceful detection and fallback for unsupported browsers
- **Minimum Language Support**: This SDK will target ES2020 features (async/await, optional chaining, nullish coalescing, etc.)
Collaborator Author

async / await is problematic, especially during the page unload scenario, where we can't do ANYTHING asynchronously (and asynchronous execution is the default behavior of async / await). So it will need to be used sparingly, or with wrappers.
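
One way to read this concern is to keep the flush path fully synchronous and treat any async code as an optional layer on top. The sketch below is only an illustration: `SyncSender`, `IBatcher`, and `createBatcher` are hypothetical names, and the injected sender stands in for a synchronous hand-off such as `navigator.sendBeacon`.

```typescript
// Hypothetical names throughout; nothing here is taken from the PR's specs.
// `SyncSender` stands in for a synchronous hand-off such as navigator.sendBeacon.
type SyncSender = (payload: string) => boolean;

interface IBatcher {
    add(item: string): void;
    // Synchronous, so it can be called from an unload / pagehide handler.
    flushSync(): boolean;
}

function createBatcher(send: SyncSender): IBatcher {
    let queue: string[] = [];

    return {
        add(item: string): void {
            queue.push(item);
        },
        flushSync(): boolean {
            if (queue.length === 0) {
                return true;
            }
            const payload = JSON.stringify(queue);
            queue = [];
            // No await and no promise chain: the payload is handed off before
            // this function returns, so nothing is left pending at unload.
            // This sketch drops the batch if the sender reports failure.
            return send(payload);
        }
    };
}
```

An async convenience API could wrap `flushSync()`, but the unload handler would call the synchronous version directly.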


## Implementation Timeline

### Detailed Timeline with Milestones
Collaborator Author

This timeline was created by Copilot and is (IMHO) extremely aggressive. The team WILL need to own defining not just the timeline but also the above phases and how this work is split up across the team.

So treat this as a starting point on one possible way the work might be split / implemented.


### Anti-Patterns to Avoid

#### **CRITICAL: Never Import OpenTelemetry Packages Directly**
Collaborator Author

Putting this comment here to highlight that this is extremely important; we need to discuss why as a team.


**Note**: Performance targets will be validated through comprehensive benchmarking during implementation. Targets are based on web application requirements and Application Insights SDK performance analysis.

### Initialization Performance
Collaborator Author

Use these as goals that drive performant designs. The actual final values will depend on several factors, including the browser / runtime that we measure them with (using the IPerfManager interface and implementation). Once we have the baselines we can determine which parts of the code need work (or not). The existing Application Insights performance tests can be a guide for how we measure these scenarios.

│ │ │ - Access shared resources
│ │ │ - Setup providers
│ │ │
│ 7. Initialize SDK │ │
Collaborator Author

@MSNev MSNev Sep 3, 2025

Point 7 should not be needed, as this is done as part of point 6. But we can discuss, as this would keep a similar pattern for initialization whether using the shared or direct version.

- **Web Worker Support**: Complete functionality in Web Worker and Service Worker environments
- **Node.js Integration**: Seamless operation in Node.js environments for SSR and build tools

```typescript
```
Collaborator Author

These examples are OVERKILL, and this is not how we want to implement this. We don't need to "know" upfront which features do and don't exist. At the point of attempting to "use" a feature, we detect it there and then (via helper functions), and if the API / feature is not present we either don't initialize that part of the code or provide some graceful fallback.
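
A minimal sketch of what detection at the point of use could look like. The helper names (`getPerfNow`, `startTiming`) and the `IPerfHost` shape are assumptions made for illustration, not part of the proposed SDK.

```typescript
// Hypothetical helper; the shape of `host` is an assumption for illustration.
interface IPerfHost {
    performance?: { now?: () => number };
}

// Returns a timestamp if the host exposes performance.now(), otherwise null.
function getPerfNow(host: IPerfHost): number | null {
    const perf = host.performance;
    if (perf && typeof perf.now === "function") {
        return perf.now();
    }
    return null;
}

// The feature checks for the API only when it is about to use it, and simply
// does not start if the API is missing. There is no upfront capability table.
function startTiming(host: IPerfHost): { start: number } | null {
    const start = getPerfNow(host);
    return start === null ? null : { start: start };
}
```

In a browser the caller would pass the global object as `host`; passing it in also keeps the helper easy to test.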

```typescript
// Core SDK Interfaces
export interface IOTelWebSdk extends IOTelSdk {
  readonly traceProvider: ITraceProvider;
```
Collaborator Author

Unless we are providing an OpenTelemetry-compatible API where the name is already defined, we should be cognizant of the length of interface (and ultimately object / class) names, including properties and functions, as these cannot be minified and so directly increase the resulting bundle size.


The SDK follows a modular architecture that promotes maintainability, testability, and selective loading:

```bash
```
Collaborator Author

@MSNev MSNev Sep 3, 2025

I think this can be removed / changed, it seems overly prescriptive.

4. **Environment Detection**: Automatic detection of browser, framework, and deployment environment
5. **Secure Defaults**: All defaults chosen for security and performance
6. **Schema Validation**: JSON Schema validation for configuration objects

Collaborator Author

@MSNev MSNev Sep 3, 2025

There is another item that probably should be added here (and elsewhere):

  • NEVER crash the hosting application, whether that's caused by bad configuration, usage, or injected code (telemetry initializers).
    This means that internally we never explicitly use throw unless we have explicitly wrapped the function as a safe call (using try / catch / finally). However, exception handling also introduces performance overhead, so we should avoid it where possible and just gracefully abort any processing.
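
As a rough illustration of the "safe call" idea, the wrapper below guards only the boundary where injected code runs, so the try / catch cost is paid in one place. `Initializer` and `safeRun` are hypothetical names, and keeping the item when an initializer throws is an assumption of this sketch, not something stated in the specs.

```typescript
// Hypothetical helper names; an illustration only.
type Initializer<T> = (item: T) => boolean | void;

// Runs one injected initializer without letting an exception escape to the host.
// Returns false only if the initializer explicitly asked to drop the item.
function safeRun<T>(init: Initializer<T>, item: T, onError: (e: unknown) => void): boolean {
    try {
        return init(item) !== false;
    } catch (e) {
        // Report and carry on; the hosting application never sees the exception.
        onError(e);
        // Assumption of this sketch: a faulty initializer does not drop the item.
        return true;
    }
}
```

Internal code paths would skip the wrapper entirely and abort early with plain checks instead of throwing.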

### Error Categories and Handling

1. **Configuration Errors**: Validation failures, invalid settings
- Throw immediately during initialization
Collaborator Author

-> This one is probably the ONLY time where we might "throw" and let the exception propagate to the caller, BUT only if we have been told (via configuration) that we can. Otherwise, logging is your friend.
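
A sketch of that rule, assuming a hypothetical opt-in flag named `throwOnConfigError`; neither the flag nor `reportConfigError` comes from the specs.

```typescript
interface IConfigErrorOptions {
    // Hypothetical flag: the host opts in to exceptions during initialization.
    throwOnConfigError?: boolean;
}

// Logs by default; throws only when the host has explicitly allowed it.
function reportConfigError(
    opts: IConfigErrorOptions,
    log: (msg: string) => void,
    msg: string
): void {
    if (opts.throwOnConfigError === true) {
        throw new Error(msg);
    }
    log(msg);
}
```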

- Fallback to local storage when possible

3. **Runtime Errors**: Unexpected exceptions during telemetry collection
- Graceful degradation to no-op behavior
Collaborator Author

In the browser world a "no-op" behavior means do NOTHING: return null / undefined, and do NOT return a full no-op implementation of the interface.

There will be no "included" no-op implementation (unlike OpenTelemetry). All callers MUST check that something was returned, not just use the returned "instance".

We may "provide" a separately loadable SDK instance which is a no-op instance (as part of the graceful handling of older browsers) from the CDN, but it SHOULD NOT be included within the SDK, i.e. it MUST be tree-shakable.
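
The calling convention described here might look like the following. `ITracerLike`, `getTracer`, and `traceClick` are hypothetical and exist only to show the null check on the caller's side.

```typescript
// Hypothetical minimal tracer shape for illustration.
interface ITracerLike {
    startSpan(name: string): string;
}

// Returns null when tracing is unavailable, rather than a no-op tracer object.
function getTracer(available: boolean): ITracerLike | null {
    if (!available) {
        return null;
    }
    return {
        startSpan(name: string): string {
            return "span:" + name;
        }
    };
}

// Callers must check the result before using it.
function traceClick(tracer: ITracerLike | null): string | null {
    if (!tracer) {
        return null;
    }
    return tracer.startSpan("click");
}
```

The trade-off is an explicit check at each call site in exchange for not shipping a no-op implementation in every bundle.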


```typescript
// Plugin architecture for future extensibility
interface ISDKPlugin {
```
Collaborator Author

Treat these as possible examples only, to get your ideas flowing.

### Backward Compatibility Strategy

1. **Interface Versioning**: Semantic versioning for all public interfaces
2. **Deprecation Timeline**: 12-month deprecation period for breaking changes
Collaborator Author

This is aggressive and wrong.
The correct approach (depending on the thing being deprecated) is to mark the "thing" as @deprecated and then drop it in the next major release with the necessary documentation.
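
For reference, marking something as deprecated in a doc comment could look like this. The function names `sendItem` and `emitItem` are made up for the example.

```typescript
/**
 * Emits an item. Hypothetical replacement API used only for this example.
 */
function emitItem(name: string): string {
    return "item:" + name;
}

/**
 * @deprecated Use `emitItem` instead. To be removed in the next major release.
 */
function sendItem(name: string): string {
    // Delegate so both entry points behave identically until removal.
    return emitItem(name);
}
```

Editors and documentation tools that understand the `@deprecated` tag can then flag call sites of `sendItem` ahead of the major release that drops it.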

4. **Audit Trails**: Comprehensive audit logging for compliance
5. **Data Governance**: Fine-grained control over data collection and export

## Processing Pipeline Architecture
Collaborator Author

Examples derived from the OpenTelemetry specification, use as guidance only.

@MSNev MSNev marked this pull request as draft September 3, 2025 23:15
…h global registration

- Update JSDoc to TypeDoc
- Remove Enterprise named configuration
- Rename Open Telemetry interfaces
- Remove EnterpriseManager / Policy
- Remove IUnloadManager
- Clarify interface naming
- **Resource Efficiency**: Optimal resource sharing when teams use compatible multi-tenant instances, complete isolation when needed
- **Lifecycle Management**: Coordinated initialization and cleanup across components, SDK versions, and deployment modes

#### **4. Emerging Web Standards and Runtime Diversity**
Member

This is a big list with no specifics about what it means from a Web SDK perspective. Is this more about testing strategies or actual implementation?


#### **Primary Goals**

1. **Full OpenTelemetry API Support for Tracing and Logs**
Member

What about Metrics and other OpenTelemetry APIs? Is this about the current scope? I'm not seeing these in the secondary goals either. Should we track them or explicitly mention them?

Member

What about OTel semantic conventions?

- **Battery and CPU Conservation**: Prevents excessive CPU usage and allows runtime to sleep properly
- **Resource Impact Minimization**: Reduces background resource consumption and wake-up events

9. **Lightweight Telemetry Processors**
Member

The telemetry processors concept is something we are avoiding in other SDKs. It would be better to focus this section on SpanProcessors and LogRecordProcessors; there are differences in what could be achieved here.

- `trackPageView()`, `trackEvent()`, `trackException()`, `trackDependency()` methods, basically emitting events.
- Provides simplified Migration path for existing Application Insights implementations

### **Benefits Over Standard OpenTelemetry SDK**
Member

There is a lot of duplicated content here. This main document could be simplified by referencing the existing specs. I will remove this section entirely.

- Safe feature detection without syntax errors
- Graceful fallback mechanism activation

### **Browser Support Matrix**
Member

This is pretty important and could get lost among everything else in this document. This matrix defines the runtime requirements and several patterns; maybe add a supportability document and simplify here.


This browser compatibility strategy ensures the OTelWebSdk can run effectively across the diverse web ecosystem while providing optimal performance and features for modern browsers.

## Architecture Overview
Member

We already have an architecture spec; let's simplify here.


## Overview

The OpenTelemetry Web SDK is designed as a modern, modular implementation that follows the OpenTelemetry specification while providing enhanced flexibility and performance for web applications. It delivers a complete observability solution encompassing distributed tracing, structured logging, and basic metrics collection through a multi-instance SDK factory that enables multiple teams to efficiently share resources while maintaining isolation. This factory pattern is particularly beneficial for modern web applications where multiple teams or components need isolated telemetry contexts in the same runtime while sharing underlying infrastructure.
Member

We should mention why we are creating this OpenTelemetry-based SDK, what the benefits of OpenTelemetry are in general, and whether we get those covered here.


**Key Framework Requirements:**

The SDK must provide seamless integration across modern web frameworks and rendering patterns:


We may want to avoid the "supports" language here. Something more like "will not crash in..." could be more appropriate. Should discuss further.

- **Application Insights API Compatibility**: Backward compatibility support / shim to provide existing Application Insights API surface while processing through OTelWebSdk
- `trackPageView()`, `trackEvent()`, `trackException()`, `trackDependency()` methods, basically emitting events.
- Provides simplified Migration path for existing Application Insights implementations

Collaborator Author

Basic Metrics

- **Feature A/B Testing**: Built-in experimentation framework for feature rollouts and testing

4. **Complete Unload and Cleanup Support**
- Complete removal of SDK instances with all associated hooks and timers


Any special consideration about timers that need to run? What's this targeted at/do you think there are any specific engineering challenges related to this?

- **Automatic Anonymous Session Management**: Intelligent session boundary detection, anonymous session ID generation, and cross-tab session correlation
- **Anonymous User Tracking**: Privacy-compliant user identification across sessions without PII collection
- **Multi-Runtime Support**: Native support for edge computing (Cloudflare Workers, Vercel Edge, Deno Deploy)
- **Modern Framework Integration**: Purpose-built support for Next.js, SvelteKit, Remix, Astro, Qwik, and other modern frameworks


Are there any frameworks we explicitly don't provide support for here that we can call out?

- **IoC Pattern**: No global state, explicit dependency management
- **Closure-Based Implementation**: For bundle size optimization and true private state management
- **High-Performance Architecture**: Minimal overhead design with advanced batching, resource management, and bundle optimization
- **Modular Architecture**: Tree-shakable, plugin-based extensibility


Will the new plugins mimic what we currently have (React, Angular, etc.), or would we expect new ones to be necessary?

```typescript
function BaseProcessor() {
    // Stub for DynamicProto
}
```


Looks like this code block break got added a little too early.

### Technology Evolution Preparedness

1. **Web Standards**: Ready for new browser APIs and web standards
2. **Framework Integration**: Pluggable architecture for new JavaScript frameworks


Is the "pluggable" nature here just referring to creating a structure for our initial plugins that can be easily extensible to possible new frameworks?


### OpenTelemetry Specification Evolution

1. **Specification Tracking**: Automated monitoring of OpenTelemetry specification changes


This sounds pretty cool, but do we have anything like this in existence yet? Sounds like it could be useful across the board as well, not just for this web JS compatible OTel SDK.

Member

This is more about being aware of semantic convention updates and specification changes, not about having automated tools for it.

- **Development Infrastructure**: Create build tools and test infrastructure with interface validation
- Set up automated testing framework with interface mocking capabilities
- Create performance testing infrastructure for bundle size monitoring
- Implement continuous integration pipeline with cross-browser testing


How much of this can we inherit from what already exists? Or would this be a relatively fresh implementation?


**Key Differentiators:**
- **Multi-Instance Architecture**: Multiple SDK instances can coexist without conflicts, enabling team isolation
- **Web-Optimized Performance**: Minimal bundle size, superior browser compatibility (ES2020+ with ES5 fallback)
Contributor

@Karlie-777 Karlie-777 Sep 17, 2025

Future plan: we may load a loader first, and the loader then provides the current/correct version based on the environment. (This will prevent users from loading a specific version, to avoid version conflicts.) The loader will avoid overwriting globals.

**Key Differentiators:**
- **Multi-Instance Architecture**: Multiple SDK instances can coexist without conflicts, enabling team isolation
- **Web-Optimized Performance**: Minimal bundle size, superior browser compatibility (ES2020+ with ES5 fallback)
- **Enterprise Features**: Dynamic configuration, complete cleanup/unload, multi-tenant support
Contributor

After the loader is in place, a SharedWorker structure can be introduced.

- **Tree-Shaking Requirements**: Dead code elimination is critical for production builds
- **Browser Compatibility**: Runtime requirement with graceful detection and fallback for unsupported browsers
- **Minimum Language Support**: This SDK will target ES2020 features (async/await, optional chaining, nullish coalescing, etc.)
- **Graceful Degradation**: The SDK will provide a loader/initialization **that targets ES5 (or earlier)** to ensure successful loading, parsing, and execution in older runtimes so it can detect unsupported environments and provide fallback behavior (such as reporting why the SDK can't load or falling back to a basic non-OpenTelemetry based SDK)
Contributor

@Karlie-777 Karlie-777 Sep 17, 2025

For CDN only. Options:

  1. load older versions
  2. don't load anything
  3. no-op version
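
The three options above could be modelled as a small decision in the loader. This is only a sketch under assumptions: `FallbackMode`, `canParse`, and `chooseTarget` are hypothetical names, and probing syntax through the `Function` constructor is one possible detection technique, not necessarily the one the SDK will use.

```typescript
// Hypothetical names; the three fallback modes mirror the options listed above.
type FallbackMode = "legacy" | "none" | "noop";
type LoadTarget = "modern" | FallbackMode;

// Probes whether the runtime can parse a snippet without a syntax error
// escaping. A strict Content Security Policy can block the Function
// constructor, in which case this reports false.
function canParse(source: string): boolean {
    try {
        new Function(source);
        return true;
    } catch (e) {
        return false;
    }
}

// Picks what the loader should fetch: the modern bundle when the runtime
// supports it, otherwise whichever fallback the deployment is configured for.
function chooseTarget(supportsModern: boolean, fallback: FallbackMode): LoadTarget {
    return supportsModern ? "modern" : fallback;
}
```

A loader might call `chooseTarget(canParse("var a = null; return a?.b ?? 1;"), "noop")`. The probe string is only parsed, never executed, so an older runtime reports false instead of failing on the unsupported syntax.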

#### **2. Web-Specific Telemetry Needs**
- **Page View Tracking**: Navigate between SPA routes and traditional page loads
- **Dependency Tracking**: Track all Ajax style requests (XMLHttpRequest, fetch, sendBeacon) from the page
- **Browser Performance Metrics**: Navigation timing, resource timing, frame rates
Contributor

similar to pageviewperformance

- **Client-Side Error Tracking**: Unhandled exceptions, promise rejections, console errors
- **User Session Management**: Session boundaries, user identification, device context
- **Real User Monitoring (RUM)**: Actual user performance vs synthetic monitoring
- **User Experience Monitoring**: Core Web Vitals, paint timing, user interactions
Contributor

Breeze doesn't currently have an event type for Web Vitals.

- **Browser Performance Metrics**: Navigation timing, resource timing, frame rates
- **Client-Side Error Tracking**: Unhandled exceptions, promise rejections, console errors
- **User Session Management**: Session boundaries, user identification, device context
- **Real User Monitoring (RUM)**: Actual user performance vs synthetic monitoring
Contributor

user performance -> click analytics

@MSNev MSNev marked this pull request as ready for review September 22, 2025 21:41
@github-actions

This PR has been inactive for 30 days and has been marked as abandoned. You can remove this label by commenting or pushing new changes. If it remains inactive with the abandoned label, it will eventually also be marked as stale and closed.

@github-actions

github-actions bot commented Dec 5, 2025

This PR has been inactive for 30 days and has been marked as abandoned. You can remove this label by commenting or pushing new changes. If it remains inactive with the abandoned label, it will eventually also be marked as stale and closed.

@github-actions

This PR will be closed in 14 days. Please remove the "Stale" label or comment to avoid closure with no action.

@github-actions github-actions bot added the stale label Dec 20, 2025
@MSNev MSNev added keep Do not Mark as Stale and close and removed stale abandoned labels Dec 22, 2025