Payload limit configuration and validation #1288
Conversation
temporalio/worker/_activity.py
Outdated
elif isinstance(
    err,
    temporalio.exceptions.PayloadsSizeError,
):
    temporalio.activity.logger.warning(
        "Activity task failed: payloads size exceeded the error limit. Size: %d bytes, Limit: %d bytes",
        err.payloads_size,
        err.payloads_limit,
        extra={"__temporal_error_identifier": "ActivityFailure"},
    )
    await data_converter.encode_failure(
        temporalio.exceptions.ApplicationError(
            type="PayloadsTooLarge",
            message="Payloads size has exceeded the error limit.",
        ),
        completion.result.failed.failure,
    )
    # TODO: Add force_cause to activity Failure bridge proto?
    # TODO: Add WORKFLOW_TASK_FAILED_CAUSE_PAYLOADS_TOO_LARGE to API
    # completion.result.failed.force_cause = WorkflowTaskFailedCause.WORKFLOW_TASK_FAILED_CAUSE_PAYLOADS_TOO_LARGE
I think we should let the traditional error path run here. Assuming a readable error message on the error, the only reason I can see for not doing so is to have the ApplicationError.type be PayloadsTooLarge instead of PayloadsSizeError, but we can either alter the failure converter or make just that slight specialization in the catch here.
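A minimal sketch of that "slight specialization in the catch" option, assuming err, data_converter, and completion are in scope as in the diff above; the message wording and the PayloadsTooLarge type are illustrative, not a final implementation:

import temporalio.exceptions

# Hypothetical sketch: keep the normal failure path, but re-type the error first
# so the converted failure reports "PayloadsTooLarge" instead of "PayloadsSizeError".
if isinstance(err, temporalio.exceptions.PayloadsSizeError):
    err = temporalio.exceptions.ApplicationError(
        f"Payloads size of {err.payloads_size} bytes exceeded the "
        f"{err.payloads_limit} byte limit",
        type="PayloadsTooLarge",
    )
# ...then fall through to the existing failure encoding.
await data_converter.encode_failure(err, completion.result.failed.failure)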
- Are you suggesting that PayloadsSizeError should extend from ApplicationError?
- We want a specialized warning message that better describes what the issue is and gives better guidance in the worker output as to what went wrong and how to fix it than the standard "Completing activity as failed" message. My understanding is that log messages from sdk-core won't be surfaced in the worker unless a logger is configured via the telemetry configuration. And even then, I haven't seen activity failures get routed through there (see https://github.com/temporalio/sdk-python/pull/1288/changes#diff-3162a4b842d45d546da93b825218f7f863b6b481684d2b9570a38b04facb266bR8833), but maybe I'm doing something wrong with that.
On extending from ApplicationError, note that PayloadsSizeError will be thrown from Client invocations if payload limits are configured. This is in contrast to the description of ApplicationError, which states "Error raised during workflow/activity execution."
> Are you suggesting that PayloadsSizeError should extend from ApplicationError?

No, in these situations all non-Temporal-failure exceptions automatically convert to an application error with their unqualified class name as the error type.

> We want a specialized warning message that better describes what the issue is and gives better guidance in the worker output as to what went wrong and how to fix it than the standard "Completing activity as failed" message.

It sounds like everyone should get this message, not just this log statement. Therefore, such a message should be part of the error, not the log.
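For illustration, a standalone sketch of that automatic conversion (assumed behavior of the default failure converter; not code from this PR):

import temporalio.converter
from temporalio.api.failure.v1 import Failure

# A plain (non-Temporal-failure) exception raised by application code...
class PayloadsSizeError(Exception):
    pass

converter = temporalio.converter.default()
failure = Failure()
converter.failure_converter.to_failure(
    PayloadsSizeError("Payloads size of 2097152 bytes exceeded the 1048576 byte limit"),
    converter.payload_converter,
    failure,
)
# ...comes out as an application failure whose type is the unqualified class name,
# carrying whatever message the exception was constructed with.
print(failure.application_failure_info.type)  # PayloadsSizeError
print(failure.message)  # Payloads size of 2097152 bytes exceeded the 1048576 byte limit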
| temporalio.exceptions.CancelledError("Cancelled"), | ||
| completion.result.cancelled.failure, | ||
| ) | ||
| elif isinstance( |
Note, this doesn't catch cases where encoding the failure itself produces payloads that are too large; you may want to handle this error in the outer except. This can happen when application error details are too large, or when the stack trace is too large and is moved to encoded attributes (see temporalio/features#597).
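A rough sketch of handling that in the outer except (hypothetical; assumes the names from the diff above and that the slimmed-down fallback failure fits within the limit):

try:
    await data_converter.encode_failure(err, completion.result.failed.failure)
except temporalio.exceptions.PayloadsSizeError as size_err:
    # The encoded failure itself (large error details, or a stack trace moved to
    # encoded attributes) exceeded the limit, so fall back to a minimal failure.
    completion.result.failed.failure.Clear()
    await data_converter.encode_failure(
        temporalio.exceptions.ApplicationError(
            f"Failure payloads of {size_err.payloads_size} bytes exceeded the "
            f"{size_err.payloads_limit} byte limit",
            type="PayloadsTooLarge",
        ),
        completion.result.failed.failure,
    )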
)

if warning_limit and warning_limit > 0 and total_size > warning_limit:
    # TODO: Use a context aware logger to log extra information about workflow/activity/etc
still TODO right?
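For what it's worth, one possible shape for that deferred context-aware logging (a hypothetical helper; temporalio.activity.info()/in_activity() and temporalio.workflow.info()/in_workflow() are existing SDK calls, but none of this wiring is part of the PR):

import temporalio.activity
import temporalio.workflow

def _payload_warning_context() -> str:
    # Describe where the oversized payloads are coming from, if we can tell.
    if temporalio.activity.in_activity():
        info = temporalio.activity.info()
        return f"activity {info.activity_type} in workflow {info.workflow_id}"
    if temporalio.workflow.in_workflow():
        info = temporalio.workflow.info()
        return f"workflow {info.workflow_type} (run {info.run_id})"
    return "a client call"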
    payloads,
    self.payload_limits.payload_upload_error_limit,
    self.payload_limits.payload_upload_warning_limit,
    "Payloads size exceeded the warning limit.",
- Context: of what? upload/download? Workflow task? Which endpoint?
- Actionability: suggest sending to a stable docs URL. Perhaps https://docs.temporal.io/troubleshooting/blob-size-limit-error, or a stable shortened URL, and then we can improve that page over time.
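Something along these lines might address both points (hypothetical helper; the docs URL is the one suggested above, not a confirmed stable link):

def _payload_warning_message(context: str, total_size: int, warning_limit: int) -> str:
    # Example output: "Payloads for activity input exceeded the warning limit
    # (2097152 bytes > 1048576 bytes). See https://..."
    return (
        f"Payloads for {context} exceeded the warning limit "
        f"({total_size} bytes > {warning_limit} bytes). "
        "See https://docs.temporal.io/troubleshooting/blob-size-limit-error "
        "for ways to reduce payload size."
    )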
        size: Actual payloads size in bytes.
        limit: Payloads size limit in bytes.
    """
    super().__init__("Payloads size exceeded the error limit")
Similar points apply here.
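Applied to the constructor above, that might look like the following (illustrative wording only; the payloads_size/payloads_limit attribute names come from this PR, while the base class and link are assumptions):

import temporalio.exceptions

class PayloadsSizeError(temporalio.exceptions.TemporalError):
    """Raised when encoded payloads exceed the configured error limit."""

    def __init__(self, size: int, limit: int) -> None:
        super().__init__(
            f"Payloads size of {size} bytes exceeded the {limit} byte error limit. "
            "See https://docs.temporal.io/troubleshooting/blob-size-limit-error."
        )
        self.payloads_size = size
        self.payloads_limit = limit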
tests/worker/test_workflow.py
Outdated
| config["data_converter"] = dataclasses.replace( | ||
| temporalio.converter.default(), | ||
| payload_limits=PayloadLimitsConfig( | ||
| payload_upload_error_limit=error_limit, payload_upload_warning_limit=1024 | ||
| ), | ||
| ) |
This seems like an elaborate way to override a config. It's not going to be super common but I wonder if we can make it easier.
In any case, can we document the correct procedure on the class so that people who go-to-definition on it can figure out how to override it?
> This seems like an elaborate way to override a config. It's not going to be super common but I wonder if we can make it easier.

The way I have it in the test (other than all of the property settings being on one line) is the documented way for specifying overrides using the default data converter as the basis. See examples in https://docs.temporal.io/develop/python/converters-and-encryption. Maybe it looks better when it's on separate lines:
data_converter = dataclasses.replace(
    temporalio.converter.default(),
    payload_limits=PayloadLimitsConfig(
        payload_upload_error_limit=4 * 1024 * 1024,
        payload_upload_warning_limit=1024,
    ),
)

If you are talking about the names of the fields on the PayloadLimitsConfig, note that there are four of them:
- memo_upload_error_limit
- memo_upload_warning_limit
- payload_upload_error_limit
- payload_upload_warning_limit
The memo_ settings are a specialization specifically for memo fields. The server validates these by summing the payload size of each key and checking whether the total is over the memo size limit (which is configurable separately from the general payload limit).
The upload part is there to disambiguate in case we need some type of limit for downloads in the future.
I'm up for changing any of the names; just explaining the current rationale.
> In any case, can we document the correct procedure on the class so that people who go-to-definition on it can figure out how to override it?
What would the user go-to-definition on, specifically, in their pursuit to (a) find that they can override something and (b) know how to override it? Starting with a basic client configuration, it's not obvious where that starting point is. Would you GTD on load_client_connect_config, then GTD on the return type (if you can figure it out), and then end up looking at ClientConfig (which is hardly documented)? It seems to me that samples or developer docs are what show the way for "how do I override this specific aspect", not the SDK code itself.
That being said, I would totally agree with an effort to make GTD more useful in terms of discovering what is possible and how to use all of the options, but I don't know if making that effort here would bear fruit if the top level parts of the discoverability chain don't support it.
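One way to meet in the middle would be a usage example in the class docstring, so that GTD from the field name at least shows the override pattern. A sketch only, showing two of the four fields, with defaults taken from the PR description (512 KiB warning limit, no error limit); the frozen dataclass shape is an assumption:

import dataclasses
from typing import Optional

@dataclasses.dataclass(frozen=True)
class PayloadLimitsConfig:
    """Payload size limits applied when encoding payloads.

    To override, replace the field on the default data converter::

        data_converter = dataclasses.replace(
            temporalio.converter.default(),
            payload_limits=PayloadLimitsConfig(
                payload_upload_error_limit=4 * 1024 * 1024,
            ),
        )
    """

    payload_upload_warning_limit: int = 512 * 1024
    """Warn when uploaded payloads exceed this many bytes."""

    payload_upload_error_limit: Optional[int] = None
    """Fail when uploaded payloads exceed this many bytes; no error limit by default."""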
I think the top of the funnel here is when they get the error message or warning: they can either read the message or find the line from the call stack and then grep for the threshold variable name.
What was changed
Update the SDK to check the size of payload collections and issue warnings and errors when the size exceeds configured limits.
This is done by:
- Adding a PayloadLimitsConfig class for configuring the warning and error limits. The error limit is not defined by default, whereas the warning limit is set to 512 KiB.
- Adding a payload_limits property of type PayloadLimitsConfig to DataConverter.
- Adding methods to DataConverter for encoding and decoding payloads in all forms. The encoding methods call _validate_payload_limits before returning.
- Updating call sites to go through the DataConverter instead of using payload_codec directly.
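As a rough sketch of how these pieces could fit together (hypothetical; only the names PayloadLimitsConfig, payload_limits, _validate_payload_limits, and PayloadsSizeError come from the PR, and the size calculation here is a guess):

import logging
from typing import Optional, Sequence

import temporalio.api.common.v1
import temporalio.exceptions

logger = logging.getLogger(__name__)

def _validate_payload_limits(
    payloads: Sequence[temporalio.api.common.v1.Payload],
    error_limit: Optional[int],
    warning_limit: Optional[int],
    warning_message: str,
) -> None:
    # Total size is approximated as the summed serialized size of each payload.
    total_size = sum(p.ByteSize() for p in payloads)
    if error_limit and error_limit > 0 and total_size > error_limit:
        raise temporalio.exceptions.PayloadsSizeError(total_size, error_limit)
    if warning_limit and warning_limit > 0 and total_size > warning_limit:
        logger.warning(
            "%s Size: %d bytes, Limit: %d bytes",
            warning_message,
            total_size,
            warning_limit,
        )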
Examples
Log output when an activity attempts to return a result that exceeds the error limit:
Log output when a workflow attempts to provide an activity input that exceeds the error limit:
Note that the above example is missing the extra context that the activity result failure example has. This is due to the logging infrastructure available where these errors are raised and can be fixed separately with some log refactoring (see the deferred items).
Log output when a workflow attempts to provide activity input that is above the warning threshold but below the error limit:
Same note about the missing extra context.
Deferred
Work that has been deferred to later PRs (unless requested to pull back in):
- A new WorkflowTaskFailedCause to indicate the specific cause of the failure scenario. Pending "Add a new workflow failure cause for oversized payloads" (api#697), integration into sdk-core, and of sdk-core into sdk-python.
- A context-aware logger in _validate_payload_limits to get rich information about the execution context when issuing a warning.
- A context-aware logger in _WorkflowWorker::_handle_activation to get rich information about the execution context when issuing a warning upon exceeding the payload error limit.
Why?
Users need to know when payload sizes are approaching or have exceeded size limits. This will help prevent workflow outages and inform users to adjust their workflows to make use of alternate storage methods or to break down their payloads more granularly.
Checklist
Closes Warn if the SDK tried to send a payload above a specific size #1284
Closes SDK should fail workflow task if payloads size is known to be too large #1285
How was this tested: Unit tests
Any docs updates needed? Yes