Ada Eval

Framework for evaluating LLM based tools for Ada/SPARK use cases.

Setup

To get started with this project, you will need uv installed on your system, which is used to manage the Python environment and dependencies.

You will also need an Ada toolchain and GNATprove installed on your system, which can be obtained through Alire.

Install Ada/SPARK Toolchain

The recommended way to set up an Ada/SPARK toolchain is via Alire. More information on installing Alire can be found on the Alire website.

Once you have Alire installed, you can install the toolchain and GNATprove with

alr install gnat_native gprbuild gnatprove gnatformat

By default, this will install the tools in ~/.alire/bin/, which you must ensure is in your PATH. Alternative installation locations can be specified with alr install's --prefix option.

Per-clone setup

Do this when you first clone the project. Use uv to install python dependencies (this will automatically create a virtual environment):

uv sync

By default, uv will create a virtual environment named .venv in your project. If you're using VS Code, you can use Python: Select Interpreter to select Python from the virtual environment generated by uv. This will enable code navigation.

If you don't use VS Code, you can use the standard source .venv/bin/activate (or equivalent) to activate the virtual environment.

Alternatively, you can run commands in the project's virtual environment without activating it by running uv run ... (e.g., uv run pytest).

Project Development Info

Project structure

At a high level, the project is structured as follows:

.
├── data  # Data directory containing various datasets
│   ├── base  # Base data directory
│   │   ├── compacted  # Contains datasets in single jsonl files
│   │   └── expanded  # Contains expanded datasets, which are much easier to read and modify than the jsonl files
│   ├── evaluated  # Datasets that contain generated completions and corresponding evaluation results
│   └── generated  # Datasets that contain generated completions
├── src  # Source code for ada-eval
├── tests  # Tests for ada-eval
├── tools  # Tools used by ada-eval during generation
├── Makefile  # Contains various commands for convenience
├── pyproject.toml  # Defines project dependencies and configuration
├── README.md  # This file
└── uv.lock  # Lock file for dependencies managed by uv

Common Commands

To format your code, run:

make format

To run linter/type checker, run:

make check

To run the testsuite, run:

make test

Adding or Updating Python Dependencies

To add a new Python dependency using uv, run

uv add <package-name>
uv sync

For more info, see the uv documentation.

Usage

The project has a simple CLI, as well as a Makefile that contains some useful commands for your convenience.

You can see the CLI options by running:

uv run ada-eval --help

If you look at the make target generate-spark-claude, you will see an example of how to generate completions for our existing challenges, using Claude Code.

make generate-spark-claude

Adding a new Sample (Challenge/Problem)

In data/base/expanded, you will find directories that each represent a dataset containing a number of samples (challenges/problems). A sample's structure depends on the type of its dataset. Currently we only have SPARK datasets, though there is some support for Explain and Ada datasets too.

The general idea of each of these types is:

  • Ada: Will require the modification of some Ada code to solve the challenge/problem
  • SPARK: Will require the modification of some SPARK code to solve the challenge/problem. This could include adding new SPARK contracts, modifying existing ones, or changing the implementation to satisfy the given challenge.
  • Explain: Contains some Ada/SPARK code and a question about it. The goal is to generate an explanation of the code that answers the question.
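
For example, a SPARK challenge might provide a subprogram whose contract is missing or incomplete, and solving it means adding the contract and getting gnatprove to discharge the resulting proof obligations. The sketch below is purely illustrative; the Integer_Utils package body shown here is invented to match the Integer_Utils.Increment name used in the metadata example later in this README:

--  Sketch of a solution spec: the base/ copy might lack the Post aspect,
--  and the challenge is to add it so that the proof goes through.
package Integer_Utils with SPARK_Mode is

   function Increment (X : Integer) return Integer
     with Pre  => X < Integer'Last,
          Post => Increment'Result = X + 1;

end Integer_Utils;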

The rough structure of a SPARK sample is as follows:

example_sample
├── base  # The project for the challenge. A copy of this is made elsewhere for the tool to modify when attempting to solve the challenge
│   ├── main.gpr
│   └── src
├── comments.md  # Comments about the sample for humans to better understand it
├── other.json  # Contains additional metadata for the sample. For SPARK samples, this will include the name of the subprogram of interest, and the relative path to the file that contains its specification.
├── prompt.md  # The prompt that is provided to the tool to specify the challenge/problem
├── solution  # A solution to the challenge/problem. For SPARK samples, this is a project that builds, passes the tests, and is accepted by gnatprove.
│   ├── main.gpr
│   ├── main.adc
│   └── src
└── tests  # Unit test project for the sample. Used in addition to gnatprove to verify that the solution is correct.
    ├── src
    └── tests.gpr

Adding a new SPARK sample

  1. Create a new directory in data/base/expanded/your_dataset_of_choice with the name of the sample.
  2. Add the following:
    • base/: A project that contains the code that needs to be modified to solve the challenge/problem

    • solution/: A complete project that contains the modified code with the solved problem

    • tests/: A project that contains unit tests for the sample. It should compile to an executable at tests/bin/tests which returns a zero exit code if (and only if) the tests pass (see the sketch after these steps).

    • comments.md: If the challenge isn't obvious, consider including a more detailed description of the challenge/problem in this file

    • prompt.md: A prompt that describes the challenge/problem. This will be provided to the tool to generate a solution.

    • other.json: A file that contains additional metadata about the sample. For SPARK samples, this should look like:

      {
          "location": {
              "path": "src/relative/path/to/file.ads",  // Should be relative to the base/solution directory of the sample
              "subprogram_name": "Name_Of_Subprogram"  // The name of the subprogram that is the focus of the challenge/problem
          },
          "required_checks": [
              {
                  "rule": "VC_POSTCONDITION",  // Rule name of a check that must be proved
                  "entity_name": "Integer_Utils.Increment",  // Optional name of the entity the check must be attached to
                  "src_pattern": "Increment'Result = X \\+ 1;"  // Optional regex pattern which must match the proved source code
              },
              ...
          ]
      }

      The "path" from the "location" is used as the argument to gnatprove, so only the corresponding unit, and any units it (recursively) depends on, are analysed during evaluation.

      The "required_checks" list is optional, and unnecessary for most samples, since the proof and/or tests will simply fail unless the problem is solved. However, for checks like postconditions, a desperate agent may "fix" the proof failure by simply removing or relaxing the condition. Where the proof of a postcondition or similar is a key part of a challenge/problem, a corresponding required_check should be added to ensure that such "fixes" are not erroneously evaluated as correct. The "src_pattern" must match starting from the location reported by gnatprove (run gnatprove on the solution with --report=all for an output which includes the line and column numbers of these locations).

  3. Verify that the solution builds, proves, etc. by running
    make evaluate-canonical
    to populate other.json and then
    make check-datasets
    to verify everything is correct.
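
As mentioned in step 2, the tests/ project must build an executable at tests/bin/tests whose exit code reflects the outcome. Below is a minimal sketch of such a test driver; the Integer_Utils.Increment check is a hypothetical example and the real checks depend on the sample:

with Ada.Command_Line;
with Integer_Utils;  --  Hypothetical unit under test from the sample's src/

procedure Tests is
   Failures : Natural := 0;
begin
   --  Example check: the solved Increment must add exactly one.
   if Integer_Utils.Increment (41) /= 42 then
      Failures := Failures + 1;
   end if;

   --  Return a zero exit code if and only if every check passed.
   if Failures = 0 then
      Ada.Command_Line.Set_Exit_Status (Ada.Command_Line.Success);
   else
      Ada.Command_Line.Set_Exit_Status (Ada.Command_Line.Failure);
   end if;
end Tests;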

Generating new completions

To generate new completions for one or multiple datasets, you can use the generate command of the CLI. For example, to generate completions for all datasets using Claude Code, you can run:

uv run ada-eval generate --tool shell_script --tool-config-file tools/configs/claude_code_no_mcp.json

This is also available via the shortcut make generate-spark-claude.

Currently you have to specify the type of tool you want to use and a configuration file for that tool. The configuration file provides a place for modifying settings. At the moment there are only shell-script tools, and the only values specify where the script is located and how long it is allowed to run.

This interface is not final.

Evaluating completions

To evaluate completions that were generated as described in the previous section, you can run

uv run ada-eval evaluate

which will run all available evaluations on the generations in data/generated/, and save the results to data/evaluated/.

For more options, such as specifying which evaluations are run, see the output of

uv run ada-eval evaluate --help

To print a summary of the results, run

uv run ada-eval report

Filtering options etc. can likewise be found via the --help switch.

The combination of both of these commands is available via the shortcut make evaluate.

This interface is not final.
