
Evaluation Difficulty of FeatureBench #2

@zhu-zhu-ding


Hello authors,

Thank you for this excellent work. I have a question regarding the current evaluation difficulty of FeatureBench.

I recently conducted experiments on the Lite subset using the latest DeepSeek-V4-Flash model (non-thinking mode) within the OpenHands framework, and observed a 100% Resolved rate (all tests passed). I carefully reviewed the execution traces and found no obvious signs of cheating, such as accessing the internet or directly reading the reference source code.


This makes me wonder whether FeatureBench may gradually lose its effectiveness for evaluating the ability of frontier LLMs to handle complex function generation, especially as stronger models such as GPT-5.4 and Claude 4.7 continue to emerge. More broadly, how do you view the long-term challenge of maintaining the benchmark's difficulty and discriminative power?

I’m also curious whether you plan to extend or evolve this benchmark in the future (e.g., harder tasks, dynamic environments, multi-turn settings, or repository-level dependencies).

I would greatly appreciate your thoughts. Thank you again for your insightful work!
