-
Notifications
You must be signed in to change notification settings - Fork 11.7k
DRAFT: RFC for DO tracing spans #27599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: production
Are you sure you want to change the base?
Conversation
|
This pull request requires reviews from CODEOWNERS as it changes files that match the following patterns:
|
|
Preview URL: https://08b20790.preview.developers.cloudflare.com Files with changes (up to 15) |
src/content/docs/workers/observability/traces/spans-and-attributes.mdx
Outdated
Show resolved
Hide resolved
src/content/docs/workers/observability/traces/spans-and-attributes.mdx
Outdated
Show resolved
Hide resolved
| - `cloudflare.durable_object.response.rows_read` | ||
| - `cloudflare.durable_object.response.rows_written` | ||
| - `cloudflare.durable_object.response.bytes_written` | ||
| - `cloudflare.durable_object.response.sql_duration_ms` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to understand whether this will always be 0 or not because it is a synchronous operation. If it's always 0, then I do not think we should include it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #### `durable_object_subrequest` | ||
|
|
||
| - `cloudflare.durable_object.startup_duration_ms` | ||
| - `cloudflare.durable_object.constructor_invoked` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's also worth adding constructor_time_ms? This would depend on the time resolution that we can get for potentially synchronous operations. Some constructors might make outbound requests and this would end up being very useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want both? or does constructor_tims_ms=0 indicate the constructor did not run
|
How should I read this PR? Is the goal just to list all the spans available? Will there be documentation explaining what the span means and expected duration for it? (similar to what we have in https://gitlab.cfdata.org/cloudflare/ew/edgeworker/-/blob/master/src/edgeworker/scheduling/jaeger-spans.c%2B%2B ? |
@shrima-cf We will certainly add explanation as part of releasing. I'd love to have all spans/attributes generated from code somehow to avoid syncing issues. Right now, read this PR as capturing all the DO related spans and attributes we want to add |
|
I find it odd that we're adding timing attributes to spans because spans themselves are supposed to represent the time it took something to run. In particular, I think things that are in the programming model that the programmer has control over should be spans. In particular, Additionally, things like CPU time are OK as an attribute because that's an orthogonal dimension than you get out of a span, but having wall time as an attribute makes no sense as that's exactly what the span's duration measures. |
@justin-mp I pretty strongly disagree with those assertions. Creating many spans for every possible timing is one of the most common tracing anti-patterns IMO. It makes querying across many operations far harder than it needs to be, and complicates the waterfall visualization unnecessarily. You need to design your data for how you want to query it.
Attributes are also a core part of the span model, and there is no reason you can't put timing information there. You should think of spans as capturing all of the information about a specific operation. For performance engineering profiling is usually the better tool. I address this attributes-vs-child-spans tradeoff in my guest chapter in Observability Engineering: Let's say in your system you've just shipped a new subsystem that prioritizes payload parsing for enterprise users, and you want to see the impact of that change on tail latencies across all of the regions where you have systems deployed. If all of those attributes are present on the wide event, then this is straightforward: However this may seem like we are duplicating data available in the child spans. Surely we can accomplish the same thing by wrapping the payload parsing method in its own span? Mature observability tooling is capable of running these types of queries at the cost of additional processing, however we should keep in mind how we will be using this data. We want to prioritize quick, iterative exploration, often while responding to active incidents, and we want any engineer on our team to be able to easily navigate our observability tooling. |
| - `cloudflare.durable_object.response.rows_read` | ||
| - `cloudflare.durable_object.response.rows_written` | ||
| - `cloudflare.durable_object.response.bytes_written` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these added after the returned cursor is iterated, so after the operation function returns, or before the cursor is iterated hence it will be zero for rows_read?
Similarly, could we also have "bytes_read"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these added after the returned cursor is iterated, so after the operation function returns, or before the cursor is iterated hence it will be zero for rows_read?
For usefulness, I would expect after the cursor is iterated. But we need to confirm.
Added bytes_read
|
@vy-ton In addition to spans, there are also tags that hold useful information, Alex added a couple for the input/output gate spans - cloudflare/workerd#5827 |
I would not expose the span attributes in cloudflare/workerd#5827 - those seem pretty internal-only to me. |
| - `cloudflare.durable_object.output_gate_lock_hold_ms` | ||
| - `cloudflare.durable_object.output_gate_lock_wait_ms` | ||
|
|
||
| #### For handlers[/workers/observability/traces/spans-and-attributes/#handlers] invoked on a Durable Object such as RPC or fetch(), these attributes exist: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lambrospetrou you mentioned having primary/replica context, what other attributes would you expect here?
Draft PR captures proposal for all DO related spans and attributes we should have for public traces.