From f70fcaa5a81d1c5f5cbe0023bf5d915ca8c5935d Mon Sep 17 00:00:00 2001
From: G26Karthik
Date: Wed, 12 Nov 2025 05:30:03 +0530
Subject: [PATCH 1/2] Add non-English language support to FinalResponseMatchV2Evaluator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Enhanced the LLM-as-judge prompt to explicitly handle non-English languages
including Chinese, Thai, Japanese, Korean, Arabic, Hebrew, Hindi, and other
non-Latin scripts. The evaluator now:
- Recognizes identical strings in any language as valid matches
- Handles Unicode and character encoding differences
- Accepts language-specific punctuation variations (e.g., 。 vs . in Chinese)
- Treats all languages with equal evaluation standards

Fixes #3111
Fixes #3162
---
 src/google/adk/evaluation/final_response_match_v2.py | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/google/adk/evaluation/final_response_match_v2.py b/src/google/adk/evaluation/final_response_match_v2.py
index ea90c37487..abce1c18cd 100644
--- a/src/google/adk/evaluation/final_response_match_v2.py
+++ b/src/google/adk/evaluation/final_response_match_v2.py
@@ -44,6 +44,12 @@
 _FINAL_RESPONSE_MATCH_V2_PROMPT = """You are an expert rater for an AI agent. The AI agent is going to call an API to answer the user query and generate API tool use code based for the choice of the API and API arguments. The ideal model response should be a function call that fulfills user query, or a natural language response hedges or asks users for further clarification if a function call does not apply.
 The primary focus of this rating task is to check correctness of the model responses.
 
+IMPORTANT: This evaluation supports ALL languages including but not limited to English, Chinese (Simplified/Traditional), Japanese, Korean, Thai, Arabic, Hebrew, Hindi, and other non-Latin scripts. When comparing text in any language:
+- Treat identical strings in ANY language as VALID, regardless of the script or character set used.
+- Pay attention to the semantic meaning in the language being evaluated.
+- Be aware that punctuation marks may vary across languages (e.g., 。vs. . in Chinese/Japanese, ؟ in Arabic).
+- Consider language-specific formatting conventions as valid variations unless explicitly contradicted by the reference.
+
 The data consists of:
 - A user query.
 - A model generated response for the prompt. The responses can consist of:
@@ -54,6 +60,9 @@
 When the agent response is provided in the form of tables/dataframes or should be best provided in the form of tables/dataframes: focus on the key entities and main components requested in the user query and check whether you can retrieve those from the agent response. Likewise, if you have the reference response, then find out the key entities and main components in them and check whether you can retrieve those from the agent response. If the prompt does not specify any format instructions and the main items/components are included in the response then tolerate the differences in the formatting of those tables/dataframes.
 
 You should follow the constitutions below very carefully to rate the model response:
+- **Language Equality**: Responses in ANY language (English, Chinese, Thai, Arabic, etc.) should be evaluated with the same standards. Identical text in non-English languages must be recognized as valid matches.
+- **Unicode and Character Encoding**: Be aware that different languages use different character sets and encodings. Identical strings in non-Latin scripts (e.g., 你好 vs 你好, สวัสดี vs สวัสดี) must match exactly when they are byte-for-byte identical.
+- **Script-Specific Punctuation**: Recognize language-specific punctuation as valid (e.g., 。in Chinese/Japanese equals . in English, ！equals !, ？equals ?).
 - Allow flexibility of format even when reference code only uses one of the possible format, unless API spec or user prompt has explicit format requirement
   - e.g. For state name, allow both abbreviation and full name unless API spec has explicit requirement. e.g. both 'tx' and 'Texas' should be allowed in the agent response even when reference code only uses one of them.
   - e.g. If a reference response list outputs in a list format, the agent response is allowed to use sentence format and vice versa unless user prompt explicitly asks for a specific format.

From 4b520aa93b33b2365df9d15c65c01738a4f34505 Mon Sep 17 00:00:00 2001
From: G Karthik Koundinya <144328549+G26karthik@users.noreply.github.com>
Date: Wed, 12 Nov 2025 05:38:09 +0530
Subject: [PATCH 2/2] Update src/google/adk/evaluation/final_response_match_v2.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 src/google/adk/evaluation/final_response_match_v2.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/google/adk/evaluation/final_response_match_v2.py b/src/google/adk/evaluation/final_response_match_v2.py
index abce1c18cd..5eb8ccee7c 100644
--- a/src/google/adk/evaluation/final_response_match_v2.py
+++ b/src/google/adk/evaluation/final_response_match_v2.py
@@ -61,7 +61,7 @@
 
 You should follow the constitutions below very carefully to rate the model response:
 - **Language Equality**: Responses in ANY language (English, Chinese, Thai, Arabic, etc.) should be evaluated with the same standards. Identical text in non-English languages must be recognized as valid matches.
-- **Unicode and Character Encoding**: Be aware that different languages use different character sets and encodings. Identical strings in non-Latin scripts (e.g., 你好 vs 你好, สวัสดี vs สวัสดี) must match exactly when they are byte-for-byte identical.
+- **Unicode and Character Encoding**: Be aware that different languages use different character sets and encodings. Identical strings in non-Latin scripts (e.g., "你好" vs "你好", "สวัสดี" vs "สวัสดี") must match exactly when they are character-for-character identical.
 - **Script-Specific Punctuation**: Recognize language-specific punctuation as valid (e.g., 。in Chinese/Japanese equals . in English, ！equals !, ？equals ?).
 - Allow flexibility of format even when reference code only uses one of the possible format, unless API spec or user prompt has explicit format requirement
   - e.g. For state name, allow both abbreviation and full name unless API spec has explicit requirement. e.g. both 'tx' and 'Texas' should be allowed in the agent response even when reference code only uses one of them.
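
Review note (not part of the patch): the second commit rewords "byte-for-byte identical" as "character-for-character identical". The sketch below illustrates why the two notions can differ and which punctuation variants a plain normalization step would or would not reconcile. It assumes a deterministic pre-check were wanted alongside the LLM judge, which the patch itself does not add. `compare_strings` is a hypothetical helper name and is not an ADK API; the only library call used is the standard-library `unicodedata.normalize`.

```python
import unicodedata


def compare_strings(expected: str, actual: str) -> dict[str, bool]:
    """Compare two strings under three progressively looser notions of identity.

    Hypothetical helper for illustration only; it is not part of ADK.
    """
    return {
        # Same sequence of Unicode code points (Python's default str equality).
        "exact": expected == actual,
        # Equal after canonical composition, e.g. "e" + combining acute vs "é".
        "nfc": unicodedata.normalize("NFC", expected)
        == unicodedata.normalize("NFC", actual),
        # Equal after compatibility folding, e.g. fullwidth "！" vs ASCII "!".
        "nfkc": unicodedata.normalize("NFKC", expected)
        == unicodedata.normalize("NFKC", actual),
    }


if __name__ == "__main__":
    samples = [
        ("你好", "你好"),        # identical text in a non-Latin script
        ("\u00e9", "e\u0301"),  # precomposed vs decomposed "é"
        ("\uff01", "!"),        # fullwidth vs ASCII exclamation mark
        ("\u3002", "."),        # ideographic full stop vs ASCII period
    ]
    for expected, actual in samples:
        print(repr(expected), repr(actual), compare_strings(expected, actual))
```

The ideographic full stop (U+3002) has no compatibility mapping to the ASCII period, so it stays unequal under all three checks. Treating it as equivalent is a judgment the prompt leaves to the LLM rater under the Script-Specific Punctuation rule.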