A framework for evaluating LLM-based tools for Ada/SPARK use cases.
To get started with this project, you will need the tools described below installed on your system.

You will also need an Ada toolchain and GNATprove, which can be obtained through Alire.
The recommended way to set up an Ada/SPARK toolchain is via Alire. More information on installing Alire can be found here.
Once you have Alire installed, you can install the toolchain and GNATprove with:

```sh
alr install gnat_native gprbuild gnatprove gnatformat
```

By default, this will install the tools in `~/.alire/bin/`, which you must ensure is in your `PATH`. Alternative installation locations can be specified with `alr install`'s `--prefix` option.
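For example, assuming the default install location, you could add something like the following to your shell's startup file (the exact file and syntax depend on your shell):

```sh
# Put the Alire-installed tools (gprbuild, gnatprove, gnatformat, ...) on PATH
export PATH="$HOME/.alire/bin:$PATH"
```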
Do this when you first clone the project. Use uv to install the Python dependencies (this will automatically create a virtual environment):

```sh
uv sync
```

By default, uv will create a virtual environment named `.venv` in your project. If you're using VS Code, you can use `Python: Select Interpreter` to select the Python interpreter from the virtual environment generated by uv. This will enable code navigation.

If you don't use VS Code, you can use the standard `source .venv/bin/activate` or equivalent to activate the virtual environment.

Alternatively, you can run commands in the project's virtual environment without activating it by prefixing them with `uv run` (e.g., `uv run pytest`).
At a high level, the project is structured as follows:

```
.
├── data                # Data directory containing various datasets
│   ├── base            # Base data directory
│   │   ├── compacted   # Contains datasets in single jsonl files
│   │   └── expanded    # Contains expanded datasets. Much easier to read and modify vs the jsonl files
│   ├── evaluated       # Datasets that contain generated completions and corresponding evaluation results
│   └── generated       # Datasets that contain generated completions
├── src                 # Source code for ada-eval
├── tests               # Tests for ada-eval
├── tools               # Tools used by ada-eval during generation
├── Makefile            # Contains various commands for convenience
├── pyproject.toml      # Defines project dependencies and configuration
├── README.md           # This file
└── uv.lock             # Lock file for dependencies managed by uv
```

To format your code, run:
```sh
make format
```

To run the linter/type checker, run:

```sh
make check
```

To run the test suite, run:

```sh
make test
```

To add a new Python dependency using uv, run:

```sh
uv add <package-name>
uv sync
```

For more info, see the docs.
The project has a simple CLI, as well as a Makefile that contains some useful commands for your convenience.

You can see the CLI options by running:

```sh
uv run ada-eval --help
```

If you look at the make target `generate-spark-claude`, you will see an example of how to generate completions for our existing challenges using Claude Code:

```sh
make generate-spark-claude
```

In `data/base/expanded`, you will find directories that each represent a dataset and contain a number of samples (challenges/problems). A sample's structure depends on the type of the dataset. Currently we only have SPARK datasets, though there is some support for Explain and Ada datasets too.
The general idea of each of these types is:
- Ada: Will require the modification of some Ada code to solve the challenge/problem
- SPARK: Will require the modification of some SPARK code to solve the challenge/problem. This could include adding new SPARK contracts, modifying existing ones, or changing the implementation to satisfy the given challenge.
- Explain: Contains some Ada/SPARK code and a question about it. The goal is to generate an explanation of the code that answers the question.
The rough structure of a SPARK sample is as follows:

```
example_sample
├── base            # The project for the challenge. This will be duplicated somewhere for a tool to attempt to modify to solve the challenge
│   ├── main.gpr
│   └── src
├── comments.md     # Comments about the sample for humans to better understand it
├── other.json      # Contains additional metadata for the sample. For SPARK samples, this will include the name of the subprogram of interest, and the relative path to the file that contains its specification.
├── prompt.md       # The prompt that is provided to the tool to specify the challenge/problem
├── solution        # A solution to the challenge/problem. For SPARK samples, this will be a project that can be built, passes the tests, and that gnatprove is happy with.
│   ├── main.gpr
│   ├── main.adc
│   └── src
└── tests           # Unit test project for the sample. Used in addition to gnatprove to verify that the solution is correct.
    ├── src
    └── tests.gpr
```

To add a new sample:

- Create a new directory in `data/base/expanded/your_dataset_of_choice` with the name of the sample.
- Add the following:
  - `base/`: A project that contains the code that needs to be modified to solve the challenge/problem
  - `solution/`: A complete project that contains the modified code with the solved problem
  - `tests/`: A project that contains unit tests for the sample. It should compile to an executable at `tests/bin/tests` which returns a zero exit code if (and only if) the tests pass.
  - `comments.md`: If the challenge isn't obvious, consider including a more detailed description of the challenge/problem in this file
  - `prompt.md`: A prompt that describes the challenge/problem. This will be provided to the tool to generate a solution.
  - `other.json`: A file that contains additional metadata about the sample. For SPARK samples, this should look like:

    ```json
    {
      "location": {
        "path": "src/relative/path/to/file.ads", // Should be relative to the base/solution directory of the sample
        "subprogram_name": "Name_Of_Subprogram"  // The name of the subprogram that is the focus of the challenge/problem
      },
      "required_checks": [
        {
          "rule": "VC_POSTCONDITION",                  // Rule name of a check that must be proved
          "entity_name": "Integer_Utils.Increment",    // Optional name of the entity the check must be attached to
          "src_pattern": "Increment'Result = X \\+ 1;" // Optional regex pattern which must match the proved source code
        },
        ...
      ]
    }
    ```

    The `"path"` from the `"location"` is used as the argument to `gnatprove`, so only the corresponding unit, and any it (recursively) depends on, are analysed during evaluation.

    The `"required_checks"` list is optional, and unnecessary for most samples, since the proof and/or tests will simply fail unless the problem is solved. However, for checks like postconditions, a desperate agent may "fix" the proof failure by simply removing or relaxing the condition. Where the proof of a postcondition or similar is a key part of a challenge/problem, a corresponding `required_checks` entry should be added to ensure that such "fixes" are not erroneously evaluated as correct. The `"src_pattern"` must match starting from the location reported by `gnatprove` (run `gnatprove` on the solution with `--report=all` for an output which includes the line and column numbers of these locations; see the sketch after this list).

- Verify that the solution builds, proves, etc. by running `make evaluate-canonical` to populate `other.json`, and then `make check-datasets` to verify everything is correct.
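When writing a `"src_pattern"`, you need the line and column that `gnatprove` reports for the check. A minimal sketch, assuming the sample layout shown above and that `gnatprove` is on your `PATH` (the framework's own invocation may differ):

```sh
# Run the proof on the sample's solution; --report=all also lists proved
# checks together with their line and column numbers, which is where the
# "src_pattern" regex must start matching
cd data/base/expanded/your_dataset_of_choice/example_sample/solution
gnatprove -P main.gpr --report=all
```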
To generate new completions for one or multiple datasets, you can use the `generate` command of the CLI. For example, to generate completions for all datasets using Claude Code, you can run:

```sh
uv run ada-eval generate --tool shell_script --tool-config-file tools/configs/claude_code_no_mcp.json
```

This is also available via the shortcut `make generate-spark-claude`.
Currently, you have to specify the type of tool you want to use and the configuration file for that tool. The configuration files provide a place to modify settings. At the moment there are only shell-script tools, and the only values specify where the script is located and how long it is allowed to run.
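For example, if you want to adjust those settings, one approach is to copy an existing config and pass your copy to the CLI (`my_tool.json` is just an illustrative name; see the files in `tools/configs/` for the actual fields):

```sh
# Copy an existing shell-script tool config and edit the script location
# and time limit in the copy (my_tool.json is a hypothetical name)
cp tools/configs/claude_code_no_mcp.json tools/configs/my_tool.json
uv run ada-eval generate --tool shell_script --tool-config-file tools/configs/my_tool.json
```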
This interface is not final.
To evaluate completions that were generated as described in the previous section, you can run:

```sh
uv run ada-eval evaluate
```

which will run all available evaluations on the generations in `data/generated/` and save the results to `data/evaluated/`.

For more options, such as specifying which evaluations are run, see the output of:

```sh
uv run ada-eval evaluate --help
```

To print a summary of the results, run:

```sh
uv run ada-eval report
```

Filtering options etc. can likewise be found via the `--help` switch.
The combination of both of these commands is available via the shortcut `make evaluate`.
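Putting it together, an end-to-end run with the commands shown above looks roughly like this:

```sh
# Generate completions with Claude Code, evaluate them, and print a summary
uv run ada-eval generate --tool shell_script --tool-config-file tools/configs/claude_code_no_mcp.json
uv run ada-eval evaluate
uv run ada-eval report
```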
This interface is not final.