feat(core): add aiLocateAll #1848
Pull request overview
This PR adds a new aiLocateAll feature that enables locating multiple UI elements in a single operation. The implementation introduces two distinct capabilities: finding all instances of a single element description, and batch-locating multiple different elements for performance optimization.
Changes:
- Added an `aiLocateAll` method to the Agent API that accepts either a single prompt (to find all matching instances) or an array of prompts (to batch-locate multiple different elements)
- Implemented two new AI model functions: `AiLocateAllElements` for finding all instances and `AiLocateMultiElements` for batch operations
- Extended the service layer with `locateAll` and `locateMulti` methods and corresponding type definitions
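A minimal sketch of the two calling conventions described above. The dispatch on `Array.isArray` mirrors the PR's described behavior, but the stub body and returned data are hypothetical; the real method calls the AI model:

```typescript
// Hypothetical stub illustrating the aiLocateAll input/output contract only.
interface LocateResultElement {
  center: [number, number];
  description: string;
}

function aiLocateAllStub(
  prompts: string | string[],
): LocateResultElement[] | Array<LocateResultElement | null> {
  if (Array.isArray(prompts)) {
    // Batch mode: one slot per description, null when an element is not found.
    return prompts.map((p) =>
      p === 'missing' ? null : { center: [0, 0], description: p },
    );
  }
  // Find-all mode: zero or more matches for a single description.
  return [
    { center: [10, 10], description: prompts },
    { center: [20, 20], description: prompts },
  ];
}
```

Note the two modes return differently shaped results: find-all returns only matches, while batch mode stays index-aligned with the input prompts.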
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| packages/core/src/yaml/player.ts | Adds aiLocateAll to YAML player task handler mapping |
| packages/core/src/types.ts | Defines new result types for batch and multi-locate operations |
| packages/core/src/service/index.ts | Implements locateMulti and locateAll service methods |
| packages/core/src/device/index.ts | Updates defineAction signature to accept optional context parameter |
| packages/core/src/ai-model/prompt/llm-locator-multi.ts | New prompt template for batch-locating multiple different elements |
| packages/core/src/ai-model/prompt/llm-locator-all.ts | New prompt template for finding all instances of one element |
| packages/core/src/ai-model/inspect.ts | Implements AiLocateMultiElements and AiLocateAllElements functions |
| packages/core/src/ai-model/index.ts | Exports new AI model functions |
| packages/core/src/agent/ui-utils.ts | Adds display formatting for LocateMulti task parameters |
| packages/core/src/agent/task-builder.ts | Adds task handlers for LocateMulti and LocateAll plan types |
| packages/core/src/agent/agent.ts | Implements locateAll private method and aiLocateAll public API |
```ts
export async function AiLocateMultiElements(options: {
  context: UIContext;
  targetElementDescriptions: TUserPrompt[];
  callAIFn: typeof callAIWithObjectResponse<{
    elements: Array<{ bbox: [number, number, number, number] | [] }>;
    errors?: string[];
  }>;
  modelConfig: IModelConfig;
}): Promise<{
  parseResult: {
    elements: Array<LocateResultElement | null>;
    errors?: string[];
  };
  rawResponse: string;
  usage?: AIUsageInfo;
}> {
  const { context, targetElementDescriptions, callAIFn, modelConfig } = options;
  const { modelFamily } = modelConfig;
  const screenshotBase64 = context.screenshot.base64;

  const descriptionsText = targetElementDescriptions.map((d) =>
    extraTextFromUserPrompt(d),
  );
  const userInstructionPrompt = findMultiElementsPrompt(descriptionsText);
  const systemPrompt = systemPromptToLocateMultiElements(modelFamily);

  const msgs: AIArgs = [
    { role: 'system', content: systemPrompt },
    {
      role: 'user',
      content: [
        {
          type: 'image_url',
          image_url: {
            url: screenshotBase64,
            detail: 'high',
          },
        },
        {
          type: 'text',
          text: userInstructionPrompt,
        },
      ],
    },
  ];

  const { content, usage, contentString } = await callAIFn(msgs, modelConfig);

  const elements: Array<LocateResultElement | null> = [];
  if (content.elements && Array.isArray(content.elements)) {
    content.elements.forEach((item, index) => {
      const bbox = item.bbox;
      if (bbox && bbox.length === 4) {
        const rect = adaptBboxToRect(
          bbox,
          context.size.width,
          context.size.height,
          undefined,
          undefined,
          context.size.width,
          context.size.height,
          modelFamily,
        );
        const center = {
          x: Math.round(rect.left + rect.width / 2),
          y: Math.round(rect.top + rect.height / 2),
        };
        elements.push({
          rect,
          center: [center.x, center.y],
          description: descriptionsText[index] || '',
        });
      } else {
        elements.push(null);
      }
    });
  }

  // Ensure output length matches input length
  while (elements.length < targetElementDescriptions.length) {
    elements.push(null);
  }
  if (elements.length > targetElementDescriptions.length) {
    elements.length = targetElementDescriptions.length;
  }

  return {
    parseResult: {
      elements,
      errors: content.errors,
    },
    rawResponse: contentString,
    usage,
  };
}

export async function AiLocateAllElements(options: {
  context: UIContext;
  targetElementDescription: TUserPrompt;
  callAIFn: typeof callAIWithObjectResponse<{
    elements: Array<{ bbox: [number, number, number, number] | [] }>;
    errors?: string[];
  }>;
  modelConfig: IModelConfig;
}): Promise<{
  parseResult: {
    elements: LocateResultElement[];
    errors?: string[];
  };
  rawResponse: string;
  usage?: AIUsageInfo;
}> {
  const { context, targetElementDescription, callAIFn, modelConfig } = options;
  const { modelFamily } = modelConfig;
  const screenshotBase64 = context.screenshot.base64;

  const descriptionText = extraTextFromUserPrompt(targetElementDescription);
  const userInstructionPrompt = findAllElementsPrompt(descriptionText);
  const systemPrompt = systemPromptToLocateAllElements(modelFamily);

  const msgs: AIArgs = [
    { role: 'system', content: systemPrompt },
    {
      role: 'user',
      content: [
        {
          type: 'image_url',
          image_url: {
            url: screenshotBase64,
            detail: 'high',
          },
        },
        {
          type: 'text',
          text: userInstructionPrompt,
        },
      ],
    },
  ];

  const { content, usage, contentString } = await callAIFn(msgs, modelConfig);

  const elements: LocateResultElement[] = [];
  if (content.elements && Array.isArray(content.elements)) {
    content.elements.forEach((item) => {
      const bbox = item.bbox;
      if (bbox && bbox.length === 4) {
        const rect = adaptBboxToRect(
          bbox,
          context.size.width,
          context.size.height,
          undefined,
          undefined,
          context.size.width,
          context.size.height,
          modelFamily,
        );
        const center = {
          x: Math.round(rect.left + rect.width / 2),
          y: Math.round(rect.top + rect.height / 2),
        };
        elements.push({
          rect,
          center: [center.x, center.y],
          description: descriptionText,
        });
      }
    });
  }

  return {
    parseResult: {
      elements,
      errors: content.errors,
    },
    rawResponse: contentString,
    usage,
  };
}
```
There's significant code duplication between AiLocateMultiElements and AiLocateAllElements functions. Both share almost identical structure for building messages, calling AI, and parsing results. The primary differences are in the prompt construction and whether empty bboxes are filtered out. Consider refactoring to reduce duplication.
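One way the duplicated parsing could be factored out, as a sketch only: the two functions differ in whether misses produce `null` slots (batch contract) or are dropped (find-all contract). `parseLocatedElements` and `bboxToRect` are hypothetical names, and `bboxToRect` simplifies the real `adaptBboxToRect` to plain coordinate math:

```typescript
type Bbox = [number, number, number, number] | [];
interface Rect { left: number; top: number; width: number; height: number }
interface Located { rect: Rect; center: [number, number]; description: string }

// Simplified stand-in for adaptBboxToRect (no model-family adaptation).
function bboxToRect([x1, y1, x2, y2]: [number, number, number, number]): Rect {
  return { left: x1, top: y1, width: x2 - x1, height: y2 - y1 };
}

// Shared parser: keepNulls=true keeps the output index-aligned with the input
// (batch contract); keepNulls=false drops misses (find-all contract).
function parseLocatedElements(
  raw: Array<{ bbox: Bbox }>,
  descriptionFor: (index: number) => string,
  keepNulls: boolean,
): Array<Located | null> {
  const out: Array<Located | null> = [];
  for (const [index, item] of raw.entries()) {
    if (item.bbox.length === 4) {
      const rect = bboxToRect(item.bbox);
      out.push({
        rect,
        center: [
          Math.round(rect.left + rect.width / 2),
          Math.round(rect.top + rect.height / 2),
        ],
        description: descriptionFor(index),
      });
    } else if (keepNulls) {
      out.push(null);
    }
  }
  return out;
}
```

Each variant would then reduce to message construction plus one call to the shared parser.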
```ts
private createLocateMultiTask(
  plan: PlanningAction<PlanningLocateParam[]>,
  detailedLocateParams: DetailedLocateParam[],
  context: PlanBuildContext,
): ExecutionTaskApply {
  const { modelConfigForDefaultIntent } = context;

  // Use 'Planning' as main type, and 'LocateMulti' as subType
  const taskLocator: ExecutionTaskApply = {
    type: 'Planning',
    subType: 'LocateMulti',
    subTask: context.subTask || undefined,
    param: detailedLocateParams,
    thought: plan.thought,
    executor: async (param: DetailedLocateParam[], taskContext: any) => {
      const { task } = taskContext;
      let { uiContext } = taskContext;

      if (!uiContext) {
        uiContext = await this.service.contextRetrieverFn();
      }

      assert(uiContext, 'uiContext is required for Service task');

      const applyDump = (dump?: ServiceDump) => {
        if (!dump) {
          return;
        }
        task.log = {
          dump,
        };
        task.usage = dump.taskInfo?.usage;
      };

      // For now, we skip cache logic for LocateMulti to keep it simple and focused on LLM optimization

      let multiResult;
      try {
        multiResult = await this.service.locateMulti(
          param,
          { context: uiContext },
          modelConfigForDefaultIntent,
        );
        applyDump(multiResult.dump);
      } catch (error) {
        if (error instanceof ServiceError) {
          applyDump(error.dump);
        }
        throw error;
      }

      return {
        output: multiResult.results,
      };
    },
  };
  return taskLocator;
}

private createLocateAllTask(
  plan: PlanningAction<PlanningLocateParam>,
  detailedLocateParam: DetailedLocateParam,
  context: PlanBuildContext,
): ExecutionTaskApply {
  const { modelConfigForDefaultIntent } = context;

  const taskLocator: ExecutionTaskApply = {
    type: 'Planning',
    subType: 'LocateAll',
    subTask: context.subTask || undefined,
    param: detailedLocateParam,
    thought: plan.thought,
    executor: async (param: DetailedLocateParam, taskContext: any) => {
      const { task } = taskContext;
      let { uiContext } = taskContext;

      if (!uiContext) {
        uiContext = await this.service.contextRetrieverFn();
      }

      assert(uiContext, 'uiContext is required for Service task');

      const applyDump = (dump?: ServiceDump) => {
        if (!dump) {
          return;
        }
        task.log = {
          dump,
        };
        task.usage = dump.taskInfo?.usage;
      };

      let locateResult;
      try {
        locateResult = await this.service.locateAll(
          param,
          { context: uiContext },
          modelConfigForDefaultIntent,
        );
        applyDump(locateResult.dump);
      } catch (error) {
        if (error instanceof ServiceError) {
          applyDump(error.dump);
        }
        throw error;
      }

      return {
        output: locateResult.results,
      };
    },
  };

  return taskLocator;
}
```
There's significant code duplication between createLocateMultiTask and createLocateAllTask methods. Both share nearly identical structure for task creation, UI context retrieval, dump application, and error handling. The main difference is the service method called and the output structure. Consider refactoring to extract common logic.
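One possible shape for that extraction, as a sketch only: the two creators differ in `subType`, the param shape, and which service method runs, so a generic factory could carry the shared dump/error plumbing. All type names here are simplified stand-ins for the real ones:

```typescript
interface ServiceDump { taskInfo?: { usage?: number } }
interface LocateOutcome<T> { results: T; dump?: ServiceDump }
interface TaskLike { log?: { dump: ServiceDump }; usage?: number }

class ServiceErrorWithDump extends Error {
  constructor(message: string, public dump?: ServiceDump) {
    super(message);
  }
}

// Generic factory: both LocateMulti and LocateAll tasks become thin wrappers
// that pass their own param type and service call.
function createLocateTask<P, T>(
  subType: 'LocateMulti' | 'LocateAll',
  param: P,
  locate: (param: P) => Promise<LocateOutcome<T>>,
) {
  return {
    type: 'Planning' as const,
    subType,
    param,
    executor: async (p: P, taskContext: { task: TaskLike }) => {
      const { task } = taskContext;
      const applyDump = (dump?: ServiceDump) => {
        if (!dump) return;
        task.log = { dump };
        task.usage = dump.taskInfo?.usage;
      };
      try {
        const outcome = await locate(p);
        applyDump(outcome.dump);
        return { output: outcome.results };
      } catch (error) {
        // Mirror the shared error path: surface any dump attached to the error.
        if (error instanceof ServiceErrorWithDump) applyDump(error.dump);
        throw error;
      }
    },
  };
}
```

The UI-context retrieval and assertion could live in the same factory; they are omitted here for brevity.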
```ts
      task.usage = dump.taskInfo?.usage;
    };

    // For now, we skip cache logic for LocateMulti to keep it simple and focused on LLM optimization
```
The comment states "For now, we skip cache logic for LocateMulti to keep it simple and focused on LLM optimization" but this creates inconsistency with other locate operations that support caching. This could lead to performance issues when the same multi-locate queries are executed repeatedly. Consider documenting why caching was skipped and whether it should be added in the future.
Suggested change:
```diff
-// For now, we skip cache logic for LocateMulti to keep it simple and focused on LLM optimization
+// NOTE: We intentionally do not apply element/task caching for LocateMulti.
+// The multi-locate contract returns multiple potentially overlapping matches,
+// and reusing cached elements introduces additional complexity (e.g. partial
+// hits, ordering, and staleness across batches) that we have not finalized yet.
+// For now, we keep LocateMulti uncached so this path stays simple and focused
+// on LLM optimization experiments. This may be revisited once requirements
+// for multi-result caching are clearer.
+// TODO(core/agent): Evaluate adding cache support for LocateMulti to align
+// with other locate operations once a stable multi-result caching strategy
+// is defined.
```
```ts
const BATCH_SIZE = 10;
const results: any[] = [];

const runBatch = async (batchPrompts: TUserPrompt[]) => {
  const detailedParams = batchPrompts.map((p) =>
    buildDetailedLocateParam(p, locateOpt),
  );

  const plan = {
    type: 'LocateMulti',
    param: detailedParams,
    thought: '',
  };

  const defaultIntentModelConfig =
    this.modelConfigManager.getModelConfig('default');
  const modelConfigForPlanning =
    this.modelConfigManager.getModelConfig('planning');

  // Use runPlans to leverage TaskExecutor's pipeline (dump, log, etc)
  // Note: we are passing a custom plan type 'LocateMulti', which must be supported by TaskBuilder
  const { output } = await this.taskExecutor.runPlans(
    `Locate ${batchPrompts.length} elements`,
    [plan],
    modelConfigForPlanning,
    defaultIntentModelConfig,
  );

  return output;
};

for (let i = 0; i < prompts.length; i += BATCH_SIZE) {
  const batch = prompts.slice(i, i + BATCH_SIZE);
  const batchResults = await runBatch(batch);
  results.push(...batchResults);
}
```
The logic for handling array inputs is confusing. When an array of prompts is provided, the function batches them using 'LocateMulti' plan type, which locates multiple different elements. However, based on the function name 'locateAll', users would expect it to find all instances of each element in the array, not just one instance of each. This creates a semantic mismatch between the function name and behavior for array inputs.
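For contrast, a sketch of the semantics this comment argues users would expect from an array input: all instances of each description, i.e. one result list per prompt. `locateAllForEach` and its `findAll` callback are hypothetical, not part of this PR:

```typescript
// Hypothetical alternative: map each prompt through a single-prompt find-all
// call, so the result is index-aligned lists of ALL matches per description.
async function locateAllForEach<T>(
  prompts: string[],
  findAll: (prompt: string) => Promise<T[]>,
): Promise<T[][]> {
  const out: T[][] = [];
  for (const p of prompts) {
    // Sequential for clarity; real code might batch or parallelize.
    out.push(await findAll(p));
  }
  return out;
}
```

Under these semantics an array input returns `T[][]` (possibly empty lists), not one-element-or-null per prompt as the current batch path does.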
```ts
async aiLocateAll(
  prompts: TUserPrompt | TUserPrompt[],
  opt?: LocateOption & {
    concurrency?: number;
    freezeContext?: boolean;
  },
) {
  return this.locateAll(prompts, opt);
}
```
The concurrency option in the function signature is not implemented or used in the function body. This parameter should either be removed from the signature or implemented to control the concurrency of batch operations.
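If the option were implemented, a standard worker-pool pattern could bound parallel batch calls. This is a generic sketch (`mapWithConcurrency` is a hypothetical helper, not part of this PR):

```typescript
// Run `fn` over `items` with at most `concurrency` calls in flight,
// preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  const workers = Array.from(
    { length: Math.max(1, Math.min(concurrency, items.length)) },
    async () => {
      // Each worker pulls the next unclaimed index; JS is single-threaded,
      // so `next++` between awaits is race-free.
      while (next < items.length) {
        const i = next++;
        results[i] = await fn(items[i]);
      }
    },
  );
  await Promise.all(workers);
  return results;
}
```

The batch loop in `locateAll` could then call `mapWithConcurrency(batches, opt?.concurrency ?? 1, runBatch)` instead of awaiting each batch serially.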
```ts
) {
  return this.locateAll(prompts, opt);
}
```
The naming is confusing. The method is called aiLocateAll (suggesting it finds all instances of a single element description), but it also accepts an array of prompts (suggesting multiple different elements). When a single prompt is provided, it calls the 'LocateAll' plan type which finds all instances. When an array is provided, it uses 'LocateMulti' in batches. This creates ambiguity. Consider either:
- Renaming the method to `aiLocateMulti` when handling arrays, or
- Creating separate methods: `aiLocateAll` for finding all instances of one element, and `aiLocateMulti` for locating multiple different elements.
Suggested change:
```diff
-) {
-  return this.locateAll(prompts, opt);
-}
+) {
+  if (Array.isArray(prompts)) {
+    // Backwards compatibility: allow arrays but prefer using aiLocateMulti explicitly.
+    // TODO: Remove array support in a future major version.
+    // eslint-disable-next-line no-console
+    console.warn(
+      '[Agent.aiLocateAll] Passing an array of prompts is deprecated. Use aiLocateMulti(prompts, opt) instead.',
+    );
+    return this.aiLocateMulti(prompts, opt);
+  }
+  return this.locateAll(prompts, opt);
+}
+async aiLocateMulti(
+  prompts: TUserPrompt[],
+  opt?: LocateOption & {
+    concurrency?: number;
+    freezeContext?: boolean;
+  },
+) {
+  return this.locateAll(prompts, opt);
+}
```
```ts
call: (
  param: TRuntime,
  context?: ExecutorContext,
) => Promise<TReturn> | TReturn;
```
The signature change adds an optional context parameter to the call function. It is backward compatible at runtime (the parameter is optional, so existing actions defined via defineAction that only accept param continue to work), but it creates inconsistency: some actions may expect context while others don't. Consider documenting this parameter clearly or ensuring all action definitions are updated consistently.
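To make the compatibility claim concrete, here is a minimal sketch of the widened signature. `ExecutorContext` is a simplified stand-in for the real type; both action bodies are illustrative:

```typescript
interface ExecutorContext { task?: unknown }

interface ActionDef<TRuntime, TReturn> {
  // The widened signature from the PR: context is optional.
  call: (param: TRuntime, context?: ExecutorContext) => Promise<TReturn> | TReturn;
}

// Existing-style action: ignores context entirely; still assignable,
// because a function taking fewer parameters satisfies the type.
const legacyAction: ActionDef<number, number> = {
  call: (n) => n + 1,
};

// New-style action: reads context when present.
const contextAwareAction: ActionDef<number, string> = {
  call: (n, ctx) => (ctx ? `ctx:${n}` : `no-ctx:${n}`),
};
```

The assignability of `legacyAction` is what makes the change source-compatible; the inconsistency the comment raises is about callers that never pass context reaching a context-expecting action.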
```ts
async locateMulti(
  queries: DetailedLocateParam[],
  opt: LocateOpts,
  modelConfig: IModelConfig,
): Promise<LocateResultsWithDump<LocateResultElement | null>> {
  assert(queries.length > 0, 'queries must not be empty');
  const queryPrompts = queries.map((q) =>
    typeof q === 'string' ? q : q.prompt,
  );
  const context = opt?.context || (await this.contextRetrieverFn());

  const startTime = Date.now();
  const { parseResult, rawResponse, usage } = await AiLocateMultiElements({
    callAIFn: this.aiVendorFn,
    context,
    targetElementDescriptions: queryPrompts,
    modelConfig,
  });

  const timeCost = Date.now() - startTime;
  const taskInfo: ServiceTaskInfo = {
    ...(this.taskInfo ? this.taskInfo : {}),
    durationMs: timeCost,
    rawResponse: JSON.stringify(rawResponse),
    formatResponse: JSON.stringify(parseResult),
    usage,
  };

  let errorLog: string | undefined;
  if (parseResult.errors?.length) {
    errorLog = `failed to locate elements: \n${parseResult.errors.join('\n')}`;
  }

  const matchedElements = parseResult.elements.filter(
    (e): e is LocateResultElement => e !== null,
  );

  const dumpData: PartialServiceDumpFromSDK = {
    type: 'locate',
    userQuery: {
      element: JSON.stringify(queryPrompts),
    },
    matchedElement: matchedElements,
    data: null,
    taskInfo,
    deepThink: false,
    error: errorLog,
  };

  const dump = createServiceDump(dumpData);

  if (errorLog) {
    throw new ServiceError(errorLog, dump);
  }

  return {
    results: parseResult.elements,
    dump,
  };
}

async locateAll(
  query: DetailedLocateParam,
  opt: LocateOpts,
  modelConfig: IModelConfig,
): Promise<LocateAllResultWithDump> {
  const queryPrompt = typeof query === 'string' ? query : query.prompt;
  const context = opt?.context || (await this.contextRetrieverFn());

  const startTime = Date.now();
  const { parseResult, rawResponse, usage } = await AiLocateAllElements({
    callAIFn: this.aiVendorFn,
    context,
    targetElementDescription: queryPrompt,
    modelConfig,
  });

  const timeCost = Date.now() - startTime;
  const taskInfo: ServiceTaskInfo = {
    ...(this.taskInfo ? this.taskInfo : {}),
    durationMs: timeCost,
    rawResponse: JSON.stringify(rawResponse),
    formatResponse: JSON.stringify(parseResult),
    usage,
  };

  let errorLog: string | undefined;
  if (parseResult.errors?.length) {
    errorLog = `failed to locate elements: \n${parseResult.errors.join('\n')}`;
  }

  const dumpData: PartialServiceDumpFromSDK = {
    type: 'locate',
    userQuery: {
      element: JSON.stringify(queryPrompt),
    },
    matchedElement: parseResult.elements,
    data: null,
    taskInfo,
    deepThink: false,
    error: errorLog,
  };

  const dump = createServiceDump(dumpData);

  if (errorLog) {
    throw new ServiceError(errorLog, dump);
  }

  return {
    results: parseResult.elements,
    dump,
  };
}
```
There's significant code duplication between locateMulti and locateAll methods. Both methods share nearly identical structure for error handling, dump creation, and task info management. Consider extracting common logic into a shared helper function to improve maintainability and reduce the risk of inconsistencies.
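A sketch of one such helper, wrapping the shared timing, error-log, and dump-then-throw flow. The `Dump` shape and `ServiceErrorStub` are simplified stand-ins for the real `createServiceDump`/`ServiceError` machinery:

```typescript
interface Dump { durationMs: number; error?: string; payload: unknown }

class ServiceErrorStub extends Error {
  constructor(message: string, public dump: Dump) {
    super(message);
  }
}

// Shared wrapper: time the AI call, build a dump, and throw a dump-carrying
// error when the parse result reported failures (as both methods do today).
async function runLocateWithDump<T>(
  run: () => Promise<{ results: T; errors?: string[] }>,
): Promise<{ results: T; dump: Dump }> {
  const start = Date.now();
  const { results, errors } = await run();
  const errorLog = errors?.length
    ? `failed to locate elements: \n${errors.join('\n')}`
    : undefined;
  const dump: Dump = {
    durationMs: Date.now() - start,
    error: errorLog,
    payload: results,
  };
  if (errorLog) throw new ServiceErrorStub(errorLog, dump);
  return { results, dump };
}
```

With this shape, `locateMulti` and `locateAll` would each supply only their AI call and dump-specific fields, so future changes to the error/dump contract land in one place.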
Removed ExecutorContext from the call method signature.
… across multiple files
ab72d35 into web-infra-dev:feat/locate-all
No description provided.