add armo rm verifier for stackexchange-tezos and stackexchange-overflow #19

sklisa wants to merge 4 commits into open-thoughts:main from
Conversation
Summary of Changes

Hello @sklisa, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces the ArmoRM verifier into the existing dataset generation workflows for StackExchange Tezos and StackExchange Overflow. The primary goal is to automatically evaluate agent trajectories using a pre-trained reward model, assigning a raw score that can be used for reinforcement learning or performance analysis. This integration streamlines the process of creating datasets with built-in reward mechanisms, enhancing the ability to assess and improve agent behavior.
Code Review
This pull request introduces an ArmoRM verifier and associated data generation scripts for stackexchange-tezos and stackexchange-overflow datasets. The changes are well-structured, adding a new verifier module and two new generation scripts.
My review focuses on improving the robustness and maintainability of the new scripts. I've suggested refactoring some duplicated code in the verifier template, making the TOML configuration update more resilient to changes, and ensuring the overflow data generation script correctly handles multiple parquet files. These changes will make the new data generation pipeline more reliable.
```python
# Use the converter directly to extract
tpc.from_parquet(
    parquet_path=str(parquet_files[0]),
    base=str(output_dir),
    on_exist="overwrite"
)
```
The current implementation only processes the first parquet file found (parquet_files[0]). If the source dataset mlfoundations-dev/stackexchange-overflow-sandboxes ever contains multiple parquet files, the others will be ignored, potentially leading to incomplete data. To make this more robust, you should iterate over all found parquet files.
Suggested change:

```python
# Use the converter directly to extract
for parquet_file in parquet_files:
    tpc.from_parquet(
        parquet_path=str(parquet_file),
        base=str(output_dir),
        on_exist="overwrite"
    )
```
This is consistent with how parquet is processed on the main branch: https://github.com/open-thoughts/OpenThoughts-Agent/blob/main/scripts/datagen/extract_tasks_from_parquet.py#L176

All 10k tasks are contained in one parquet file.
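If the single-file layout is an intentional invariant rather than a coincidence, a small guard could make it explicit so a future multi-file upload fails loudly instead of being silently truncated. A sketch (`resolve_single_parquet` is a hypothetical helper, not part of this PR):

```python
from pathlib import Path


def resolve_single_parquet(data_dir: str) -> Path:
    """Return the one expected parquet file, erroring if the layout changes."""
    parquet_files = sorted(Path(data_dir).glob("*.parquet"))
    if len(parquet_files) != 1:
        raise RuntimeError(
            f"Expected exactly one parquet file in {data_dir}, "
            f"found {len(parquet_files)}"
        )
    return parquet_files[0]
```

This keeps the single-file call site unchanged while documenting the assumption in code.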
```python
if not traj_path.exists():
    print(f"Error: {traj_path} not found.")
    reward_file.parent.mkdir(parents=True, exist_ok=True)
    reward_file.write_text("0")
    return

messages = parse_trajectory_authentic_multiturn(traj_path)

if not messages:
    print("Error: No valid steps found in trajectory.")
    reward_file.parent.mkdir(parents=True, exist_ok=True)
    reward_file.write_text("0")
    return
```
The error handling logic for when the trajectory is not found or is empty is duplicated. This can be refactored into a helper function to improve readability and maintainability. Defining a nested helper function can encapsulate this logic cleanly.
Suggested change:

```python
def _handle_verification_error(message: str):
    """Logs an error, writes a zero reward, and ensures the reward directory exists."""
    print(message)
    reward_file.parent.mkdir(parents=True, exist_ok=True)
    reward_file.write_text("0")

if not traj_path.exists():
    _handle_verification_error(f"Error: {traj_path} not found.")
    return

messages = parse_trajectory_authentic_multiturn(traj_path)
if not messages:
    _handle_verification_error("Error: No valid steps found in trajectory.")
    return
```
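On the consumer side, whatever reads the reward file back should tolerate the zero-reward fallback written here. A minimal, hypothetical reader sketch (`read_reward` is illustrative, not code from this PR):

```python
from pathlib import Path


def read_reward(reward_file: Path) -> float:
    """Read the raw ArmoRM score from disk; treat a missing or malformed file as 0.0."""
    if not reward_file.exists():
        return 0.0
    try:
        return float(reward_file.read_text().strip())
    except ValueError:
        return 0.0
```

Reading the sentinel "0" back as `0.0` keeps downstream aggregation uniform whether verification succeeded or hit an error path.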
```python
# Get the baseline TOML and update the timeout for ArmoRM loading
base_toml = create_standard_task_toml()
# Increase verifier timeout from 720 to 1200 seconds
updated_toml = base_toml.replace("timeout_sec = 720.0", "timeout_sec = 1200.0")
```
Using str.replace() to modify the TOML configuration is brittle. If the formatting in create_standard_task_toml() changes (e.g., extra space, or the value itself), this replacement will fail silently. It's more robust to parse the TOML, modify the value, and then write it back.
You could use a library like tomllib (for reading, standard in Python 3.11+) and tomli-w (for writing). This would require adding tomli-w as a dependency.
Example:

```python
import tomllib
import tomli_w

# ... in inject_armorm_verifier
base_toml_str = create_standard_task_toml()
toml_data = tomllib.loads(base_toml_str)
if 'verifier' in toml_data:
    toml_data['verifier']['timeout_sec'] = 1200.0
updated_toml = tomli_w.dumps(toml_data)
```
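The silent-failure mode described above can be seen in a tiny sketch: if the baseline TOML ever serializes the timeout as `720` instead of `720.0`, the `str.replace()` becomes a no-op with no error raised.

```python
# Formatting drift makes the textual replacement a silent no-op:
toml_text = 'timeout_sec = 720\n'  # note: integer form, no trailing ".0"
patched = toml_text.replace("timeout_sec = 720.0", "timeout_sec = 1200.0")
assert patched == toml_text  # nothing was replaced, and nothing complained
```

Parsing and re-serializing avoids this class of bug entirely.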
Adds the ArmoRM verifier to review the traces and assign a reward, as a test script. Generated datasets are at:
https://huggingface.co/datasets/DCAgent/stackexchange-tezos-sandboxes-armo-rm
https://huggingface.co/datasets/DCAgent/stackexchange-overflow-sandboxes-armo-rm