Some notes and observations as I read through:
I think keeping this to lint-only is important; mypy has already pushed pre-commit wait times to the limit of what's acceptable, IMHO
I think this is well documented now, but I'm not sure everyone is consistently doing it, unsure the best way to solve that
This is a great point. I think the current expectation in the tests is to run with all extras, or else tests will break. If we want to add and enforce this, we would need to make sure tests skip properly for each extra
If we can safely parallelize this I would do so
If at all possible EVERY test should be run in nightlies with no exceptions, pre-release would just be a manual trigger to the nightlies.
If we can successfully get nightlies running every test, I think we should also have a flag/command that committers can run (or comment on a PR) to add that PR to the next nightly test run in addition to main (I think nightlies on every branch is overkill) and report the results back to the PR
We should definitely try to fix the root issues as much as possible (e.g., HFBackend), but if we have to add workarounds or exceptions we should clearly document them, both in the dev docs and in the test output
I like this, qualitative tests already require llm so there's theoretically not much extra overhead
Yes Yes Yes YES
We should update this to at least dump that output file into the CI artifacts for ref
I think the one label is good enough, but perhaps it should be paired with a label for whatever type of code is being tested (currently only
Meeting notes from March 23rd call with @ajbozarth @planetf1 @jakelorocco @avinash2692

On a high level, tests should be marked using two dimensions:
Potential work items:
Nightlies

@avinash2692 has been working on setting up nightlies on bluevela that will run the full test suite each night, then report failures by opening issues on the repo. For nightlies v2:
Action Items

@planetf1 will create a new EPIC issue (or convert this discussion to one) for us to iterate on further and to collect existing test issues as sub-issues for tracking. Further planning will loop in @psschwei in the next week or so.

Note: add other meeting notes I missed as response comments under this discussion comment for tracking
Thanks for all the discussion. I've tried to reflect our sync chat and the discussion here in the new epic and child issues. I suggest we edit the initial post in each created issue to tweak as needed
Test Strategy Discussion
Why
We have encountered some challenges in our test suite: resource conflicts (especially with GPUs) and recurring test breakage. Qualitative tests are excluded from CI entirely. Contributors have incomplete guidance on what to run, what to install, or how to categorize tests. GPU testing is partly manual and ad hoc. We need to get to consistent, reliable, expanding testing that runs the most appropriate tests for each environment.
Desired Outcome
1. Where We Are
Test classification
We have backend markers and resource markers (introduced in #326) but no clear unit vs integration vs end-to-end distinction. Are our markers appropriate? Do they overlap? We've had issues with missing markers — e.g., `pytest -m "not ollama"` still ran Ollama-dependent tests (#419, fixed), and `test_openai_vllm.py` is still missing GPU markers (#622, open). 30 of 80 test files have no markers at all — mostly plugins, helpers, formatters, core — which are effectively unmarked unit tests. Should we formalise this (e.g., a `unit` marker or a convention that "no marker = unit test")?

Current markers and usage (80 test files total): `ollama`, `huggingface`, `vllm`, `openai`, `watsonx`, `litellm`, `requires_gpu`, `requires_heavy_ram`, `requires_api_key`, `requires_gpu_isolation`, `qualitative`, `llm`, `slow`, `plugins`.

Notable: 98 individual `qualitative` usages across 25 files — these are all skipped in CI. Only 2 tests are marked `slow`. 30 files (plugins, helpers, formatters, core) have no markers and are effectively unit tests that need no external dependencies.

This also relates directly to the unit/integration/end-to-end question — the unmarked files are our de facto unit test tier, but we don't call them that or make it easy to run just those.
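If we adopt a "no marker = unit test" convention, one way to make it explicit is a collection hook. Below is a hypothetical `conftest.py` sketch (marker names taken from the list above), so `pytest -m unit` would select the fast, dependency-free tier:

```python
# Hypothetical conftest.py sketch: make "no marker = unit test" explicit.
import pytest

KNOWN_MARKERS = {
    "ollama", "huggingface", "vllm", "openai", "watsonx", "litellm",
    "requires_gpu", "requires_heavy_ram", "requires_api_key",
    "requires_gpu_isolation", "qualitative", "llm", "slow", "plugins",
}

def pytest_collection_modifyitems(config, items):
    for item in items:
        # A test carrying none of the known markers is treated as a unit test.
        if not KNOWN_MARKERS.intersection(m.name for m in item.iter_markers()):
            item.add_marker(pytest.mark.unit)
```

The `unit` marker would also need registering (in `pyproject.toml` or via `pytest_configure`) so pytest doesn't warn about it.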
Deprecation & experimental status
We have some deprecated functions. When do we remove them? Do we stop testing them? Similarly, do we need to mark experimental areas more clearly? These may need additional test classifications (e.g., `@pytest.mark.experimental`, `@pytest.mark.deprecated`).

Issue tracking
I've tagged all known test-related issues with the `testing` label — view them here. Do we need sub-categories (e.g., `testing:infra`, `testing:flaky`, `testing:coverage`, `testing:markers`) or is a single label sufficient? Do we need a project board?

2. Developer Environment & Local Testing
The first two tiers — pre-commit and local dev — run on whatever the developer has. Unlike CI, these environments are uncontrolled and vary widely.
Developer expectations
What do we explicitly ask contributors to do before opening a PR? Do we improve our PR template to help/recommend? Should every new function need a unit test, every new backend interaction an integration test? Is that enforced or just advisory?
Environment setup & onboarding
Developer machines vary widely (32GB Mac, 64GB Linux, GPU vs no GPU, Intel Mac vs Apple Silicon). The setup path is unclear in places:

- Which extras to install (`mellea[hf]` vs `mellea[all]` vs core).

Some contributors will not have LSF/GPU access. Is it clear to them which tests they can and should run? Do we make it easy to say "run everything that works on a Mac with Ollama" in one command?
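One-command local runs could be generated rather than memorised. A hypothetical helper that builds the `-m` expression for the current machine (marker names match the suite's existing markers):

```python
# Hypothetical sketch: build a pytest -m expression for "everything that
# works on this machine".
def local_marker_expr(has_gpu: bool = False, has_ollama: bool = True) -> str:
    excluded = ["qualitative"]  # skipped in CI today; skip locally by default too
    if not has_gpu:
        excluded += ["requires_gpu", "requires_gpu_isolation", "vllm"]
    if not has_ollama:
        excluded.append("ollama")
    return " and ".join(f"not {m}" for m in excluded)

# e.g. a "Mac with Ollama" run:
# subprocess.run(["pytest", "-m", local_marker_expr(has_gpu=False)])
```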
Local test tiers
Is `pytest -m "not qualitative"` the right default? Are our resource checks OK? Do we take too long? Do we have the markers and filtering needed so these tiers work reliably across different machines?
Dependency testing
Two concerns:
- We recommend `uv sync --all-extras --all-groups` for development, but is that clearly documented? Are there cases where a contributor only needs a subset?
- Users install specific extras (`mellea[hf]`, `mellea[vllm]`, `mellea[all]`, etc.). Do we test that each extra works in isolation? That features outside an installed extra produce helpful errors rather than cryptic import failures? Today we only test the full `--all-extras` install (feat: improve package import handling and testing #629).

3. CI, Automation & Controlled Environments
From PR CI onwards, environments are controlled — but we have gaps in what we run and where.
CI test tiers
GPU access
CI runs on GitHub Actions (no GPU). GPU testing mostly happens on LSF behind the firewall, batch-only, for contributors with access to GPU resources. vLLM tests on LSF fail intermittently (~25-50% of runs, #699). External contributors have no way to validate GPU-related changes before submitting. Can we get any form of integration here so we can run EVERY test at least nightly?
4. What's Broken
Flakiness & non-determinism
Recurring pattern: LLM output varies, a string assertion fails, the test gets marked `qualitative` or `xfail`. Examples: `test_kv` fails on model safety refusals (#398), astream tests flaked in CI until mocked variants were added (#562, fixed in #567), the react example exceeds its iteration budget intermittently (#684). Is there a better approach (semantic assertions (#692), prompt design guidelines, a mock-first policy)? We've also seen tests where too short a context window was used (#573, some fixed).

CUDA cleanup & platform differences
This is a recurring source of breakage:

We added GPU isolation (`--isolate-heavy`, fix: restore VSCode test discovery and make GPU isolation opt-in #605) as a workaround after Testing is generally broken #604 exposed that the original approach broke VSCode discovery and silently skipped 400+ tests. But the underlying backends (LocalHFBackend, vLLM) don't clean up after themselves — at least for HF (creating multiple LocalHFBackend in pytest caused a memory leak #630). This may be a product bug that also affects users (bug: `LocalHFBackend` seems to have a memory blow up if you repeatedly call `Instruct` #378).

Do we invest in fixing the backends' cleanup, or formalise the workarounds? How do we test across platforms systematically?
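If we formalise the workaround route, a cleanup helper shared across heavy tests might look like the sketch below. It assumes torch is installed and papers over the leak rather than fixing the backends:

```python
import gc

def release_gpu_memory() -> None:
    """Best-effort cleanup between heavy tests; does not fix the backend leak."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except ImportError:
        pass  # CPU-only environment or torch not installed

# Hypothetical conftest.py usage:
# @pytest.fixture(autouse=True)
# def _gpu_cleanup():
#     yield
#     release_gpu_memory()
```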
Resource cleanup & runtime
Beyond CUDA specifically — backend memory leaks cause OOM cascades. vLLM spins up redundant servers (#625). HF loads models that are never freed (#630), and the HF metrics test alone consumes 40GB+ on a 32GB Mac (#620). Do we fix root causes or keep building workarounds? Are we testing memory usage to catch leaks? Scanning source code for bugs?
Model & adapter coverage
Intrinsic and aLoRA tests depend on specific model+adapter combinations. Some adapter combinations don't exist yet (e.g., Granite 4 LoRA adapters, #359), blocking test migration. The `answer_relevance_classifier` adapter failed to resolve at collection time (#680, fixed). Are these appropriate for different machines/environments? Do we need a smaller or wider range of model configurations? Or backends? How about going beyond Ollama locally? Can we document and check that the right models are available, and clearly report when they are not?

5. How We Improve
Expanding test coverage
Where are the gaps? Do new features ship without tests? Are there areas of the codebase with no coverage at all that we should target? Should every LLM-dependent test have a mocked variant for CI? (The pattern in #567 — adding `test_astream_mock.py` alongside the live tests — is a good model.)

Making tests more flexible & reusable
Many tests are tightly coupled to a specific backend or model. Could we write more backend-agnostic tests that parameterise across whatever backends are available? Tests that work with any Ollama model rather than hardcoding `granite4:micro`? This would make tests more resilient to model changes and more useful across different developer environments.

Can we make it easier to write good tests — templates, example patterns, shared fixtures, clear "here's how to add a test for a new backend" guides?
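For the "any Ollama model" idea, a small helper could pick from whatever is installed instead of hardcoding one. A sketch — the preference list is a made-up example, and `available` would come from the Ollama API:

```python
def pick_model(available, preferred=("granite4:micro",)):
    """Choose a preferred Ollama model if present, else any available one."""
    for name in preferred:
        if name in available:
            return name
    if available:
        return sorted(available)[0]  # any model beats skipping the test
    raise RuntimeError("No Ollama models available; `ollama pull` one first")
```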
Examples and tutorials
What's our strategy on tutorials and examples? Currently we test some examples under pytest (#372 optimised discovery/execution), but this can only work for those that are non-interactive. Subprocess-based example tests have broken in CI due to a missing `PYTHONPATH` (#593, fixed in #594). A qiskit example was collected as a test and failed (#683, fixed in #686). Can we cover examples and tutorials completely, so they are always known-good?

Best practice tooling
We already have several pytest plugins installed but underused. Are they the right ones? Should we adopt more?
- `pytest-xdist` — parallel execution (`-n auto`). Installed, never used in CI. For non-LLM tests this could cut CI time significantly. For LLM tests the bottleneck is inference, so parallelism helps less unless we have multiple Ollama instances or API backends.
- `pytest-recording` — response recording/replay. Installed, used in one test. Could make LLM-dependent tests deterministic in CI by replaying recorded responses.
- `pytest-timeout` — per-test timeouts. Installed and configured globally at 900s. Should we have tiered timeouts (30s unit, 120s Ollama, 600s GPU)?
- `nbmake` — notebook testing. Installed, unused.
- Candidates to adopt: `pytest-rerunfailures` (auto-retry flaky tests), `pytest-memray` (memory profiling), `hypothesis` (property-based testing for non-LLM code paths).

What's the right balance between adopting more tooling and keeping the test infrastructure simple enough that everyone understands it?
Ideas worth exploring
These have come up in issues/PRs or are suggested by the tooling we already have:
Semantic assertions (Proposal: Semantic assertions for qualitative test validation #692) — Replace brittle string-matching on LLM output with meaning-based validators using LLM-as-judge. Tackles the root cause of qualitative flakiness rather than just skipping tests.
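A minimal sketch of the semantic-assertion idea, assuming some `judge` callable backed by an LLM. All names are hypothetical illustrations, not the #692 design:

```python
# Semantic assertion sketch: ask an LLM judge whether output satisfies a
# criterion, instead of brittle substring matching. `judge` is any callable
# that takes a prompt string and returns the model's text.
def assert_semantically(output: str, criterion: str, judge) -> None:
    verdict = judge(
        "Does the following text satisfy this criterion?\n"
        f"Criterion: {criterion}\n"
        f"Text: {output}\n"
        "Answer YES or NO."
    )
    assert verdict.strip().upper().startswith("YES"), (
        f"Output failed semantic check {criterion!r}: {output[:200]}"
    )
```

In tests, `judge` would wrap whichever backend is available; in CI it could be mocked or replayed.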
Response recording — `pytest-recording` is already a dev dependency but used in exactly one test. Recording real LLM responses and replaying them would make qualitative tests deterministic in CI without writing manual mocks. Could be the pragmatic middle ground between "skip in CI" and "build semantic assertions."

"What can I run?" diagnostic — A `pytest --check-env` or setup script that reports: Ollama status, available models, GPU type/memory, RAM, API keys, installed extras. Tells a developer exactly which test subsets will work on their machine.

Automated test suggestion from diffs — Given a PR diff, infer which backends/markers are affected and emit the right `pytest -m` expression. Could be a CI step or local tool.

Model pre-flight checks (Add model checks for pytest #574) — At collection time, verify required models are actually available and fail fast with actionable messages instead of cryptic mid-suite errors.
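A pre-flight check could be as small as querying Ollama's standard `/api/tags` endpoint at collection time. A sketch — `require_model` is a hypothetical name:

```python
import json
import urllib.request

def available_ollama_models(base_url: str = "http://localhost:11434") -> set:
    """Return the model names a local Ollama server reports, or empty if down."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return {m["name"] for m in data.get("models", [])}
    except OSError:
        return set()  # server not running / unreachable

def require_model(name: str) -> None:
    models = available_ollama_models()
    if name not in models:
        raise RuntimeError(
            f"Model {name!r} is not available; run `ollama pull {name}` "
            f"(server reported: {sorted(models) or 'no models / server down'})"
        )
```

Wired into a conftest hook, this would fail fast with an actionable message instead of a cryptic mid-suite error.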
Shared session-scoped backends (CI: backends for vllm testing are redundant and cause CUDA context lock #625) — One vLLM/HF server per pytest session instead of per-module. Eliminates CUDA fragmentation, cuts GPU test time significantly.
Backend contract tests — All backends implement the same interface. A single parameterised test suite that runs against each available backend would catch inconsistencies and make it obvious which backends are tested.
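The shape might be a single parameterised suite. Everything below is a stub to illustrate; the real version would construct actual backends and `pytest.skip()` unavailable ones:

```python
import pytest

class FakeEchoBackend:
    # Stand-in for the shared backend interface; real backends would go here.
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

def make_backend(name: str):
    # Hypothetical factory: in the real suite, construct Ollama/OpenAI/HF
    # backends and pytest.skip() when one isn't available on this machine.
    return FakeEchoBackend()

@pytest.mark.parametrize("name", ["ollama", "openai", "huggingface"])
def test_generate_returns_nonempty_text(name):
    backend = make_backend(name)
    result = backend.generate("Say OK")
    assert isinstance(result, str) and result
```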
Integration test matrix (Set up integration testing infrastructure for framework adapters #451) — Separate CI jobs per framework adapter (langchain, dspy, crewai) with isolated deps.
Notebook & doc code block testing (add automated testing for jupyter notebooks #89, CI: end-to-end execution of documentation code blocks #638) — `nbmake` is a dev dependency but unused. Doc code snippets are untested. Both catch doc rot.

Network isolation by default — `pytest-recording`'s `block_network` marker could enforce that unit tests never make real network calls, catching tests that silently depend on Ollama/APIs without declaring it via markers.

6. How We Track
Metrics & visibility
We don't track test results over time. We added code coverage (#352, #353) but removed it from screen output (#565) for readability — the data is generated but not surfaced anywhere. No visibility into skip counts, runtime trends, flakiness rates, or coverage trends. What's the minimum viable approach? Can we determine actions needed from this data — for example more coverage needed in certain areas? Do we need a dashboard?
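A minimum viable approach might not need a dashboard at all: pytest already emits machine-readable JUnit XML (`pytest --junitxml=report.xml`), and a few lines can trend skip counts and runtime per run. A sketch; field names follow the JUnit XML attributes pytest emits:

```python
import xml.etree.ElementTree as ET

def suite_stats(path: str) -> dict:
    """Pull headline numbers out of a pytest --junitxml report."""
    root = ET.parse(path).getroot()
    # pytest wraps results in <testsuites><testsuite .../></testsuites>
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    return {
        "tests": int(suite.get("tests", 0)),
        "skipped": int(suite.get("skipped", 0)),
        "failures": int(suite.get("failures", 0)),
        "time": float(suite.get("time", 0.0)),
    }
```

Appending these numbers to a CSV per nightly run would already show skip and runtime trends.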
Labelling & tracking test work
How do we organise and prioritise the work that comes out of this discussion? Discovery/investigation work isn't well-distinguished from implementation. Do we need sub-categories beyond `testing`, a project board, a consistent label scheme? Some items are quick wins that could land immediately; others are significant infrastructure investments — how do we make that distinction visible?