Skip to content

feat(core): add aiLocateAll#1848

Merged
yuyutaotao merged 14 commits intoweb-infra-dev:feat/locate-allfrom
lhuanyu:locate-all
Feb 3, 2026
Merged

feat(core): add aiLocateAll#1848
yuyutaotao merged 14 commits intoweb-infra-dev:feat/locate-allfrom
lhuanyu:locate-all

Conversation

@lhuanyu
Copy link
Copy Markdown
Contributor

@lhuanyu lhuanyu commented Jan 22, 2026

No description provided.

Copilot AI review requested due to automatic review settings January 22, 2026 14:50
@netlify
Copy link
Copy Markdown

netlify Bot commented Jan 22, 2026

Deploy Preview for midscene ready!

Name Link
🔨 Latest commit 6e9ec29
🔍 Latest deploy log https://app.netlify.com/projects/midscene/deploys/697ce5adc0aeb400080f3b85
😎 Deploy Preview https://deploy-preview-1848--midscene.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new aiLocateAll feature that enables locating multiple UI elements in a single operation. The implementation introduces two distinct capabilities: finding all instances of a single element description, and batch-locating multiple different elements for performance optimization.

Changes:

  • Added aiLocateAll method to the Agent API that accepts either a single prompt (to find all matching instances) or an array of prompts (to batch-locate multiple different elements)
  • Implemented two new AI model functions: AiLocateAllElements for finding all instances and AiLocateMultiElements for batch operations
  • Extended the service layer with locateAll and locateMulti methods and corresponding type definitions

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
packages/core/src/yaml/player.ts Adds aiLocateAll to YAML player task handler mapping
packages/core/src/types.ts Defines new result types for batch and multi-locate operations
packages/core/src/service/index.ts Implements locateMulti and locateAll service methods
packages/core/src/device/index.ts Updates defineAction signature to accept optional context parameter
packages/core/src/ai-model/prompt/llm-locator-multi.ts New prompt template for batch-locating multiple different elements
packages/core/src/ai-model/prompt/llm-locator-all.ts New prompt template for finding all instances of one element
packages/core/src/ai-model/inspect.ts Implements AiLocateMultiElements and AiLocateAllElements functions
packages/core/src/ai-model/index.ts Exports new AI model functions
packages/core/src/agent/ui-utils.ts Adds display formatting for LocateMulti task parameters
packages/core/src/agent/task-builder.ts Adds task handlers for LocateMulti and LocateAll plan types
packages/core/src/agent/agent.ts Implements locateAll private method and aiLocateAll public API

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/core/src/ai-model/inspect.ts Outdated
Comment on lines +571 to +749
export async function AiLocateMultiElements(options: {
context: UIContext;
targetElementDescriptions: TUserPrompt[];
callAIFn: typeof callAIWithObjectResponse<{
elements: Array<{ bbox: [number, number, number, number] | [] }>;
errors?: string[];
}>;
modelConfig: IModelConfig;
}): Promise<{
parseResult: {
elements: Array<LocateResultElement | null>;
errors?: string[];
};
rawResponse: string;
usage?: AIUsageInfo;
}> {
const { context, targetElementDescriptions, callAIFn, modelConfig } = options;
const { modelFamily } = modelConfig;
const screenshotBase64 = context.screenshot.base64;

const descriptionsText = targetElementDescriptions.map((d) =>
extraTextFromUserPrompt(d),
);
const userInstructionPrompt = findMultiElementsPrompt(descriptionsText);
const systemPrompt = systemPromptToLocateMultiElements(modelFamily);

const msgs: AIArgs = [
{ role: 'system', content: systemPrompt },
{
role: 'user',
content: [
{
type: 'image_url',
image_url: {
url: screenshotBase64,
detail: 'high',
},
},
{
type: 'text',
text: userInstructionPrompt,
},
],
},
];

const { content, usage, contentString } = await callAIFn(msgs, modelConfig);

const elements: Array<LocateResultElement | null> = [];
if (content.elements && Array.isArray(content.elements)) {
content.elements.forEach((item, index) => {
const bbox = item.bbox;
if (bbox && bbox.length === 4) {
const rect = adaptBboxToRect(
bbox,
context.size.width,
context.size.height,
undefined,
undefined,
context.size.width,
context.size.height,
modelFamily,
);
const center = {
x: Math.round(rect.left + rect.width / 2),
y: Math.round(rect.top + rect.height / 2),
};
elements.push({
rect,
center: [center.x, center.y],
description: descriptionsText[index] || '',
});
} else {
elements.push(null);
}
});
}

// Ensure output length matches input length
while (elements.length < targetElementDescriptions.length) {
elements.push(null);
}
if (elements.length > targetElementDescriptions.length) {
elements.length = targetElementDescriptions.length;
}

return {
parseResult: {
elements,
errors: content.errors,
},
rawResponse: contentString,
usage,
};
}

export async function AiLocateAllElements(options: {
context: UIContext;
targetElementDescription: TUserPrompt;
callAIFn: typeof callAIWithObjectResponse<{
elements: Array<{ bbox: [number, number, number, number] | [] }>;
errors?: string[];
}>;
modelConfig: IModelConfig;
}): Promise<{
parseResult: {
elements: LocateResultElement[];
errors?: string[];
};
rawResponse: string;
usage?: AIUsageInfo;
}> {
const { context, targetElementDescription, callAIFn, modelConfig } = options;
const { modelFamily } = modelConfig;
const screenshotBase64 = context.screenshot.base64;

const descriptionText = extraTextFromUserPrompt(targetElementDescription);
const userInstructionPrompt = findAllElementsPrompt(descriptionText);
const systemPrompt = systemPromptToLocateAllElements(modelFamily);

const msgs: AIArgs = [
{ role: 'system', content: systemPrompt },
{
role: 'user',
content: [
{
type: 'image_url',
image_url: {
url: screenshotBase64,
detail: 'high',
},
},
{
type: 'text',
text: userInstructionPrompt,
},
],
},
];

const { content, usage, contentString } = await callAIFn(msgs, modelConfig);

const elements: LocateResultElement[] = [];
if (content.elements && Array.isArray(content.elements)) {
content.elements.forEach((item) => {
const bbox = item.bbox;
if (bbox && bbox.length === 4) {
const rect = adaptBboxToRect(
bbox,
context.size.width,
context.size.height,
undefined,
undefined,
context.size.width,
context.size.height,
modelFamily,
);
const center = {
x: Math.round(rect.left + rect.width / 2),
y: Math.round(rect.top + rect.height / 2),
};
elements.push({
rect,
center: [center.x, center.y],
description: descriptionText,
});
}
});
}

return {
parseResult: {
elements,
errors: content.errors,
},
rawResponse: contentString,
usage,
};
}
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's significant code duplication between AiLocateMultiElements and AiLocateAllElements functions. Both share almost identical structure for building messages, calling AI, and parsing results. The primary differences are in the prompt construction and whether empty bboxes are filtered out. Consider refactoring to reduce duplication.

Copilot uses AI. Check for mistakes.
Comment thread packages/core/src/agent/task-builder.ts Outdated
Comment on lines +601 to +715
private createLocateMultiTask(
plan: PlanningAction<PlanningLocateParam[]>,
detailedLocateParams: DetailedLocateParam[],
context: PlanBuildContext,
): ExecutionTaskApply {
const { modelConfigForDefaultIntent } = context;

// Use 'Planning' as main type, and 'LocateMulti' as subType
const taskLocator: ExecutionTaskApply = {
type: 'Planning',
subType: 'LocateMulti',
subTask: context.subTask || undefined,
param: detailedLocateParams,
thought: plan.thought,
executor: async (param: DetailedLocateParam[], taskContext: any) => {
const { task } = taskContext;
let { uiContext } = taskContext;

if (!uiContext) {
uiContext = await this.service.contextRetrieverFn();
}

assert(uiContext, 'uiContext is required for Service task');

const applyDump = (dump?: ServiceDump) => {
if (!dump) {
return;
}
task.log = {
dump,
};
task.usage = dump.taskInfo?.usage;
};

// For now, we skip cache logic for LocateMulti to keep it simple and focused on LLM optimization

let multiResult;
try {
multiResult = await this.service.locateMulti(
param,
{ context: uiContext },
modelConfigForDefaultIntent,
);
applyDump(multiResult.dump);
} catch (error) {
if (error instanceof ServiceError) {
applyDump(error.dump);
}
throw error;
}

return {
output: multiResult.results,
};
},
};
return taskLocator;
}

private createLocateAllTask(
plan: PlanningAction<PlanningLocateParam>,
detailedLocateParam: DetailedLocateParam,
context: PlanBuildContext,
): ExecutionTaskApply {
const { modelConfigForDefaultIntent } = context;

const taskLocator: ExecutionTaskApply = {
type: 'Planning',
subType: 'LocateAll',
subTask: context.subTask || undefined,
param: detailedLocateParam,
thought: plan.thought,
executor: async (param: DetailedLocateParam, taskContext: any) => {
const { task } = taskContext;
let { uiContext } = taskContext;

if (!uiContext) {
uiContext = await this.service.contextRetrieverFn();
}

assert(uiContext, 'uiContext is required for Service task');

const applyDump = (dump?: ServiceDump) => {
if (!dump) {
return;
}
task.log = {
dump,
};
task.usage = dump.taskInfo?.usage;
};

let locateResult;
try {
locateResult = await this.service.locateAll(
param,
{ context: uiContext },
modelConfigForDefaultIntent,
);
applyDump(locateResult.dump);
} catch (error) {
if (error instanceof ServiceError) {
applyDump(error.dump);
}
throw error;
}

return {
output: locateResult.results,
};
},
};

return taskLocator;
}
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's significant code duplication between createLocateMultiTask and createLocateAllTask methods. Both share nearly identical structure for task creation, UI context retrieval, dump application, and error handling. The main difference is the service method called and the output structure. Consider refactoring to extract common logic.

Copilot uses AI. Check for mistakes.
Comment thread packages/core/src/agent/task-builder.ts Outdated
task.usage = dump.taskInfo?.usage;
};

// For now, we skip cache logic for LocateMulti to keep it simple and focused on LLM optimization
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "For now, we skip cache logic for LocateMulti to keep it simple and focused on LLM optimization" but this creates inconsistency with other locate operations that support caching. This could lead to performance issues when the same multi-locate queries are executed repeatedly. Consider documenting why caching was skipped and whether it should be added in the future.

Suggested change
// For now, we skip cache logic for LocateMulti to keep it simple and focused on LLM optimization
// NOTE: We intentionally do not apply element/task caching for LocateMulti.
// The multi-locate contract returns multiple potentially overlapping matches,
// and reusing cached elements introduces additional complexity (e.g. partial
// hits, ordering, and staleness across batches) that we have not finalized yet.
// For now, we keep LocateMulti uncached so this path stays simple and focused
// on LLM optimization experiments. This may be revisited once requirements
// for multi-result caching are clearer.
// TODO(core/agent): Evaluate adding cache support for LocateMulti to align
// with other locate operations once a stable multi-result caching strategy
// is defined.

Copilot uses AI. Check for mistakes.
Comment thread packages/core/src/agent/agent.ts Outdated
Comment on lines +1234 to +1270

const BATCH_SIZE = 10;
const results: any[] = [];

const runBatch = async (batchPrompts: TUserPrompt[]) => {
const detailedParams = batchPrompts.map((p) =>
buildDetailedLocateParam(p, locateOpt),
);

const plan = {
type: 'LocateMulti',
param: detailedParams,
thought: '',
};

const defaultIntentModelConfig =
this.modelConfigManager.getModelConfig('default');
const modelConfigForPlanning =
this.modelConfigManager.getModelConfig('planning');

// Use runPlans to leverage TaskExecutor's pipeline (dump, log, etc)
// Note: we are passing a custom plan type 'LocateMulti', which must be supported by TaskBuilder
const { output } = await this.taskExecutor.runPlans(
`Locate ${batchPrompts.length} elements`,
[plan],
modelConfigForPlanning,
defaultIntentModelConfig,
);

return output;
};

for (let i = 0; i < prompts.length; i += BATCH_SIZE) {
const batch = prompts.slice(i, i + BATCH_SIZE);
const batchResults = await runBatch(batch);
results.push(...batchResults);
}
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for handling array inputs is confusing. When an array of prompts is provided, the function batches them using 'LocateMulti' plan type, which locates multiple different elements. However, based on the function name 'locateAll', users would expect it to find all instances of each element in the array, not just one instance of each. This creates a semantic mismatch between the function name and behavior for array inputs.

Copilot uses AI. Check for mistakes.
Comment thread packages/core/src/agent/agent.ts Outdated
Comment on lines +1369 to +1377
async aiLocateAll(
prompts: TUserPrompt | TUserPrompt[],
opt?: LocateOption & {
concurrency?: number;
freezeContext?: boolean;
},
) {
return this.locateAll(prompts, opt);
}
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concurrency option in the function signature is not implemented or used in the function body. This parameter should either be removed from the signature or implemented to control the concurrency of batch operations.

Copilot uses AI. Check for mistakes.
Comment thread packages/core/src/agent/agent.ts Outdated
Comment on lines +1375 to +1378
) {
return this.locateAll(prompts, opt);
}

Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming is confusing. The method is called aiLocateAll (suggesting it finds all instances of a single element description), but it also accepts an array of prompts (suggesting multiple different elements). When a single prompt is provided, it calls the 'LocateAll' plan type which finds all instances. When an array is provided, it uses 'LocateMulti' in batches. This creates ambiguity. Consider either:

  1. Renaming the method to aiLocateMulti when handling arrays, or
  2. Creating separate methods: aiLocateAll for finding all instances of one element, and aiLocateMulti for locating multiple different elements.
Suggested change
) {
return this.locateAll(prompts, opt);
}
) {
if (Array.isArray(prompts)) {
// Backwards compatibility: allow arrays but prefer using aiLocateMulti explicitly.
// TODO: Remove array support in a future major version.
// eslint-disable-next-line no-console
console.warn(
'[Agent.aiLocateAll] Passing an array of prompts is deprecated. Use aiLocateMulti(prompts, opt) instead.',
);
return this.aiLocateMulti(prompts, opt);
}
return this.locateAll(prompts, opt);
}
async aiLocateMulti(
prompts: TUserPrompt[],
opt?: LocateOption & {
concurrency?: number;
freezeContext?: boolean;
},
) {
return this.locateAll(prompts, opt);
}

Copilot uses AI. Check for mistakes.
Comment thread packages/core/src/device/index.ts Outdated
Comment on lines +74 to +77
call: (
param: TRuntime,
context?: ExecutorContext,
) => Promise<TReturn> | TReturn;
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature change adds an optional context parameter but breaks backward compatibility with existing action implementations. All existing actions that use defineAction only pass the param argument to their call functions. While this change is backward compatible at runtime (optional parameter), it creates inconsistency - some actions may expect context while others don't. Consider documenting this parameter clearly or ensuring all action definitions are updated consistently.

Copilot uses AI. Check for mistakes.
Comment on lines +220 to +333
async locateMulti(
queries: DetailedLocateParam[],
opt: LocateOpts,
modelConfig: IModelConfig,
): Promise<LocateResultsWithDump<LocateResultElement | null>> {
assert(queries.length > 0, 'queries must not be empty');
const queryPrompts = queries.map((q) =>
typeof q === 'string' ? q : q.prompt,
);
const context = opt?.context || (await this.contextRetrieverFn());

const startTime = Date.now();
const { parseResult, rawResponse, usage } = await AiLocateMultiElements({
callAIFn: this.aiVendorFn,
context,
targetElementDescriptions: queryPrompts,
modelConfig,
});

const timeCost = Date.now() - startTime;
const taskInfo: ServiceTaskInfo = {
...(this.taskInfo ? this.taskInfo : {}),
durationMs: timeCost,
rawResponse: JSON.stringify(rawResponse),
formatResponse: JSON.stringify(parseResult),
usage,
};

let errorLog: string | undefined;
if (parseResult.errors?.length) {
errorLog = `failed to locate elements: \n${parseResult.errors.join('\n')}`;
}

const matchedElements = parseResult.elements.filter(
(e): e is LocateResultElement => e !== null,
);

const dumpData: PartialServiceDumpFromSDK = {
type: 'locate',
userQuery: {
element: JSON.stringify(queryPrompts),
},
matchedElement: matchedElements,
data: null,
taskInfo,
deepThink: false,
error: errorLog,
};

const dump = createServiceDump(dumpData);

if (errorLog) {
throw new ServiceError(errorLog, dump);
}

return {
results: parseResult.elements,
dump,
};
}

async locateAll(
query: DetailedLocateParam,
opt: LocateOpts,
modelConfig: IModelConfig,
): Promise<LocateAllResultWithDump> {
const queryPrompt = typeof query === 'string' ? query : query.prompt;
const context = opt?.context || (await this.contextRetrieverFn());

const startTime = Date.now();
const { parseResult, rawResponse, usage } = await AiLocateAllElements({
callAIFn: this.aiVendorFn,
context,
targetElementDescription: queryPrompt,
modelConfig,
});

const timeCost = Date.now() - startTime;
const taskInfo: ServiceTaskInfo = {
...(this.taskInfo ? this.taskInfo : {}),
durationMs: timeCost,
rawResponse: JSON.stringify(rawResponse),
formatResponse: JSON.stringify(parseResult),
usage,
};

let errorLog: string | undefined;
if (parseResult.errors?.length) {
errorLog = `failed to locate elements: \n${parseResult.errors.join('\n')}`;
}

const dumpData: PartialServiceDumpFromSDK = {
type: 'locate',
userQuery: {
element: JSON.stringify(queryPrompt),
},
matchedElement: parseResult.elements,
data: null,
taskInfo,
deepThink: false,
error: errorLog,
};

const dump = createServiceDump(dumpData);

if (errorLog) {
throw new ServiceError(errorLog, dump);
}

return {
results: parseResult.elements,
dump,
};
}
Copy link

Copilot AI Jan 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's significant code duplication between locateMulti and locateAll methods. Both methods share nearly identical structure for error handling, dump creation, and task info management. Consider extracting common logic into a shared helper function to improve maintainability and reduce the risk of inconsistencies.

Copilot uses AI. Check for mistakes.
@yuyutaotao yuyutaotao changed the base branch from main to feat/locate-all February 3, 2026 02:38
@yuyutaotao yuyutaotao merged commit ab72d35 into web-infra-dev:feat/locate-all Feb 3, 2026
7 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants