Description
We've been running an eternal orchestration pattern in Australia East for the last couple of weeks. We noticed that, occasionally, the orchestration would never reach or execute context.ContinueAsNew when we had "functionTimeout": -1 set.
To help diagnose the issue, we set "functionTimeout": "00:30:00", suspecting a runaway activity; however, I was surprised to see that the OrchestrationTrigger itself was timing out. Typically this orchestration runs for about 10-15 minutes per iteration and continues eternally for 5-8+ hours before hanging like this.
We're unsure how to diagnose or address this issue further.
Interestingly, when "functionTimeout": -1 was set and we restarted the Flex Consumption Plan app, the orchestration would continue as expected.
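For context, the pattern itself is the standard eternal-orchestration shape; below is a minimal sketch with hypothetical names (the real orchestrator is included under "Relevant source code snippets" further down):
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class EternalOrchestrationSketch
{
    // Hypothetical, simplified orchestrator illustrating the pattern:
    // do one iteration of work, then schedule the next via ContinueAsNew.
    [Function(nameof(RunEternally))]
    public static async Task RunEternally(
        [OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var lastRunUtc = context.GetInput<DateTime?>();

        // One iteration of work (hypothetical activity name).
        await context.CallActivityAsync("DoOneIteration", lastRunUtc);

        // Schedule the next iteration. The current execution is expected to
        // complete right after this call, not run until functionTimeout.
        context.ContinueAsNew(context.CurrentUtcDateTime);
    }
}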
Expected behavior
I expect that the OrchestrationTrigger shouldn't time out, as the timeout bypasses the recovery logic we have in place to self-heal the process.
Actual behavior
All activities run to completion; however, when the orchestrator attempts to continue via context.ContinueAsNew, the OrchestrationTrigger times out.
Relevant source code snippets
[Function(nameof(Orchestrator_SyncLogDataToDataLake))]
public async Task Orchestrator_SyncLogDataToDataLake(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    ILogger logger = context.CreateReplaySafeLogger(nameof(Orchestrator_SyncLogDataToDataLake));
    var lastRunInputUtc = context.GetInput<DateTime?>();
    var importLogId = -1;
    var runFailed = false;
    var instanceId = context.InstanceId;
    var startTimeUtc = context.CurrentUtcDateTime;
    // NOTE: lastRunUtc, maxLookBackUtc, newSessionsCount, activeTenantIds,
    // batchedSessionReferences and batchedTenantReferences are initialised in
    // code omitted from this snippet.
    try
    {
        importLogId = await context.CallActivityAsync<int>(nameof(SharedActivities.InternalActivity_CreateImportLogForRun),
            new SharedActivities.CreateImportLogForRunRequest($"{nameof(Orchestrator_SyncLogDataToDataLake)}_{instanceId}", lastRunUtc, startTimeUtc));

        // ***** SKIPPED CODE FOR BREVITY *****
        // ... Activities that create references to blob storage for the next phase ...
        // ***** SKIPPED CODE FOR BREVITY *****

        var exportActivities = new List<Task>();
        foreach (var batchReference in batchedSessionReferences)
        {
            exportActivities.Add(context.CallActivityAsync(nameof(SharedActivities.InternalActivity_ExportAppDataForSessions),
                new SharedActivities.ExportAppDataForSessionsRequest(importLogId, batchReference, lastRunUtc, startTimeUtc)));
        }

        foreach (var tenantBatch in batchedTenantReferences)
        {
            exportActivities.Add(context.CallActivityAsync(nameof(SharedActivities.InternalActivity_ExportApiDataForTenants),
                new SharedActivities.ExportApiDataForTenantsRequest(importLogId, tenantBatch, lastRunUtc, startTimeUtc)));
        }

        await Task.WhenAll(exportActivities);

        context.SetCustomStatus(new
        {
            instanceId,
            importLogId = importLogId,
            lastOperation = "Finished batch operation activities. Cleaning up now.",
            timeStartUtc = startTimeUtc,
            timeLookBackUtc = maxLookBackUtc,
            timeLastRunUtc = lastRunUtc,
            countNewSessions = newSessionsCount,
            countActiveTenants = activeTenantIds.Length,
            countSessionOperations = batchedSessionReferences.Length,
            countTenantOperations = batchedTenantReferences.Length,
            countExecutedActivities = exportActivities.Count
        });

        await context.CallActivityAsync(nameof(SharedActivities.InternalActivity_CleanupActivitiesForRun),
            new SharedActivities.UpdateImportLogForRunRequest(importLogId, true));
    }
    catch (Exception ex)
    {
        runFailed = true;
        try
        {
            await context.CallActivityAsync(nameof(SharedActivities.InternalActivity_CleanupActivitiesForRun),
                new SharedActivities.UpdateImportLogForRunRequest(importLogId, false,
                    $"{ex.Message}. {ex.InnerException?.Message}".Trim()));
        }
        catch (Exception ex2)
        {
            logger.LogCritical(ex2,
                "{instanceId} :: {importId} failed to close import session. Restarting orchestration.",
                instanceId, importLogId);
        }
    }
    finally
    {
        // Restart the process with the start time of this run. If faulted, pass the input in again.
        context.ContinueAsNew(runFailed ? lastRunInputUtc : startTimeUtc, false);
    }
}
host.json
{
  "version": "2.0",
  "functionTimeout": "00:30:00",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request;Exception;Trace"
      },
      "enableLiveMetricsFilters": true
    }
  },
  "extensions": {
    "durableTask": {
      "hubName": "DataIntegrationHub"
    }
  }
}
Known workarounds
Setting "functionTimeout": -1 and restarting the Flex Consumption Plan app when we detect that the process has frozen, not really a work around.
App Details
Note: Ranges added as we update day to day.
"Microsoft.Extensions.Azure": "1.7.4 - 1.7.5""Microsoft.Azure.Functions.Worker": "1.23.0""Microsoft.Azure.Functions.Worker.Sdk": "1.17.4""Microsoft.Azure.Functions.Worker.Extensions.DurableTask": "1.1.4 - 1.1.5"- Runtime:
.NET 8 Isolated - Language:
C# - Hosting:
Linux Flex Consumption Plan (Preview)
Screenshots
Instance History: Unsure if related; however, I have consistently seen the sentinel RowKey right before the OrchestratorCompleted event when the process has frozen like this...

If deployed to Azure
We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!
- Timeframe issue observed: 2024-08-13T15:47:23.7057963Z - 2024-08-13T16:47:23.7057963Z
- Function App name: pw-prod-dataintegration
- Function name(s): Orchestrator_SyncLogDataToDataLake
- Azure region: Australia East
- Orchestration instance ID(s): 71fff80ba19241a69235a8958082cf08
- Azure storage account name: can be provided privately
If you don't want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it's extremely difficult to look up information.