Some notes and observations as I read through:
I think keeping this to lint-only is important; mypy has already pushed pre-commit wait times to the limit of what's acceptable, IMHO
I think this is well documented now, but I'm not sure everyone is consistently doing it, unsure the best way to solve that
This is a great point. I think the current expectation in the tests is to run with all extras, or else tests will break. If we want to add and enforce this, we would need to make sure tests skip properly for each extra
If we can safely parallelize this I would do so
If at all possible EVERY test should be run in nightlies with no exceptions, pre-release would just be a manual trigger to the nightlies.
If we can successfully get nightlies running every test, I think we should also have a flag/command that committers can run (or comment on a PR) to add that PR to the next nightly test run in addition to main (I think nightlies on every branch is overkill) and report the results back to the PR
We should definitely try to fix the root issues as much as possible (e.g., HFBackend), but if we have to add workarounds or exceptions we should clearly document them, both in the dev docs and in the test output
I like this, qualitative tests already require llm so there's theoretically not much extra overhead
Yes Yes Yes YES
We should update this to at least dump that output file into the CI artifacts for ref
I think the one label is good enough, but perhaps it should be paired with a label for whatever type of code is being tested (currently only
Meeting notes from March 23rd call with @ajbozarth @planetf1 @jakelorocco @avinash2692

On a high level, tests should be marked using two dimensions:
Potential work items:
Nightlies

@avinash2692 has been working on setting up nightlies on bluevela that will run the full test suite each night, then report failures by opening issues on the repo. For nightlies v2:
Action Items

@planetf1 will create a new EPIC issue (or convert this discussion to one) for us to iterate on further and to collect existing test issues as sub-issues for tracking. Further planning will loop in @psschwei in the next week or so.

Note: add other meeting notes I missed as response comments under this discussion comment for tracking
Thanks for all the discussion. I've tried to reflect our sync chat and the discussion here in the new epic and child issues. I suggest we edit the initial post in each created issue to tweak as needed
Test Strategy Discussion
Why
We have encountered some challenges in our test suite: resource conflicts (especially with GPUs) and recurring test breakage. Qualitative tests are excluded from CI entirely. Contributors have incomplete guidance on what to run, what to install, or how to categorize tests. GPU testing is partly manual and ad hoc. We need to get to consistent, reliable, expanding testing that runs the most appropriate tests for each environment.
Desired Outcome
1. Where We Are
Test classification
We have backend markers and resource markers (introduced in #326) but no clear unit vs integration vs end-to-end distinction. Are our markers appropriate? Do they overlap? We've had issues with missing markers — e.g., `pytest -m "not ollama"` still ran Ollama-dependent tests (#419, fixed), and `test_openai_vllm.py` is still missing GPU markers (#622, open). 30 of 80 test files have no markers at all — mostly plugins, helpers, formatters, core — which are effectively unmarked unit tests. Should we formalise this (e.g., a `unit` marker or a convention that "no marker = unit test")?

Current markers and usage (80 test files total): `ollama`, `huggingface`, `vllm`, `openai`, `watsonx`, `litellm`, `requires_gpu`, `requires_heavy_ram`, `requires_api_key`, `requires_gpu_isolation`, `qualitative`, `llm`, `slow`, `plugins`.

Notable: 98 individual `qualitative` usages across 25 files — these are all skipped in CI. Only 2 tests are marked `slow`. 30 files (plugins, helpers, formatters, core) have no markers and are effectively unit tests that need no external dependencies.

This also relates directly to the unit/integration/end-to-end question — the unmarked files are our de facto unit test tier, but we don't call them that or make it easy to run just those.
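If we adopt a "no marker = unit test" convention, one way to make it explicit is a collection hook. Below is a hypothetical `conftest.py` sketch (marker names taken from the list above), so `pytest -m unit` would select the fast, dependency-free tier:

```python
# Hypothetical conftest.py sketch: make "no marker = unit test" explicit.
import pytest

KNOWN_MARKERS = {
    "ollama", "huggingface", "vllm", "openai", "watsonx", "litellm",
    "requires_gpu", "requires_heavy_ram", "requires_api_key",
    "requires_gpu_isolation", "qualitative", "llm", "slow", "plugins",
}

def pytest_collection_modifyitems(config, items):
    for item in items:
        # A test carrying none of the known markers is treated as a unit test.
        if not KNOWN_MARKERS.intersection(m.name for m in item.iter_markers()):
            item.add_marker(pytest.mark.unit)
```

The `unit` marker would also need registering (in `pyproject.toml` or via `pytest_configure`) so pytest doesn't warn about it.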
Deprecation & experimental status
We have some deprecated functions. When do we remove them? Do we stop testing them? Similarly, do we need to mark experimental areas more clearly? These may need additional test classifications (e.g., `@pytest.mark.experimental`, `@pytest.mark.deprecated`).

Issue tracking
I've tagged all known test-related issues with the `testing` label — view them here. Do we need sub-categories (e.g., `testing:infra`, `testing:flaky`, `testing:coverage`, `testing:markers`) or is a single label sufficient? Do we need a project board?

2. Developer Environment & Local Testing
The first two tiers — pre-commit and local dev — run on whatever the developer has. Unlike CI, these environments are uncontrolled and vary widely.
Developer expectations
What do we explicitly ask contributors to do before opening a PR? Do we improve our PR template to help/recommend? Should every new function need a unit test, every new backend interaction an integration test? Is that enforced or just advisory?
Environment setup & onboarding
Developer machines vary widely (32GB Mac, 64GB Linux, GPU vs no GPU, Intel Mac vs Apple Silicon). The setup path is unclear in places:

- Which extras to install (`mellea[hf]` vs `mellea[all]` vs core).

Some contributors will not have LSF/GPU access. Is it clear to them which tests they can and should run? Do we make it easy to say "run everything that works on a Mac with Ollama" in one command?
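One-command local runs could be generated rather than memorised. A hypothetical helper that builds the `-m` expression for the current machine (marker names match the suite's existing markers):

```python
# Hypothetical sketch: build a pytest -m expression for "everything that
# works on this machine".
def local_marker_expr(has_gpu: bool = False, has_ollama: bool = True) -> str:
    excluded = ["qualitative"]  # skipped in CI today; skip locally by default too
    if not has_gpu:
        excluded += ["requires_gpu", "requires_gpu_isolation", "vllm"]
    if not has_ollama:
        excluded.append("ollama")
    return " and ".join(f"not {m}" for m in excluded)

# e.g. a "Mac with Ollama" run:
# subprocess.run(["pytest", "-m", local_marker_expr(has_gpu=False)])
```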
Local test tiers
Is `pytest -m "not qualitative"` the right default? Are our resource checks OK? Do we take too long? Do we have the markers and filtering needed so these tiers work reliably across different machines?
Dependency testing
Two concerns:
- We recommend `uv sync --all-extras --all-groups` for development, but is that clearly documented? Are there cases where a contributor only needs a subset?
- Users install specific extras (`mellea[hf]`, `mellea[vllm]`, `mellea[all]`, etc.). Do we test that each extra works in isolation? That features outside an installed extra produce helpful errors rather than cryptic import failures? Today we only test the full `--all-extras` install (feat: improve package import handling and testing #629).

3. CI, Automation & Controlled Environments
From PR CI onwards, environments are controlled — but we have gaps in what we run and where.
CI test tiers
GPU access
CI runs on GitHub Actions (no GPU). GPU testing mostly happens on LSF behind the firewall, batch-only, for contributors with access to GPU resources. vLLM tests on LSF fail intermittently (~25-50% of runs, #699). External contributors have no way to validate GPU-related changes before submitting. Can we get any form of integration here so we can run EVERY test at least nightly?
4. What's Broken
Flakiness & non-determinism
Recurring pattern: LLM output varies, a string assertion fails, the test gets marked `qualitative` or `xfail`. Examples: `test_kv` fails on model safety refusals (#398), astream tests flaked in CI until mocked variants were added (#562, fixed in #567), the react example exceeds its iteration budget intermittently (#684). Is there a better approach (semantic assertions (#692), prompt design guidelines, a mock-first policy)? We've also seen tests where too short a context window was used (#573, some fixed).

CUDA cleanup & platform differences
This is a recurring source of breakage:

We added GPU isolation (`--isolate-heavy`, fix: restore VSCode test discovery and make GPU isolation opt-in #605) as a workaround after Testing is generally broken #604 exposed that the original approach broke VSCode discovery and silently skipped 400+ tests. But the underlying backends (LocalHFBackend, vLLM) don't clean up after themselves — at least for HF (creating multiple LocalHFBackend in pytest caused a memory leak #630). This may be a product bug that also affects users (bug: `LocalHFBackend` seems to have a memory blow up if you repeatedly call `Instruct` #378).

Do we invest in fixing the backends' cleanup, or formalise the workarounds? How do we test across platforms systematically?
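If we formalise the workaround route, a cleanup helper shared across heavy tests might look like the sketch below. It assumes torch is installed and papers over the leak rather than fixing the backends:

```python
import gc

def release_gpu_memory() -> None:
    """Best-effort cleanup between heavy tests; does not fix the backend leak."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except ImportError:
        pass  # CPU-only environment or torch not installed

# Hypothetical conftest.py usage:
# @pytest.fixture(autouse=True)
# def _gpu_cleanup():
#     yield
#     release_gpu_memory()
```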
Resource cleanup & runtime
Beyond CUDA specifically — backend memory leaks cause OOM cascades. vLLM spins up redundant servers (#625). HF loads models that are never freed (#630), and the HF metrics test alone consumes 40GB+ on a 32GB Mac (#620). Do we fix root causes or keep building workarounds? Are we testing memory usage to catch leaks? Scanning source code for bugs?
Model & adapter coverage
Intrinsic and aLoRA tests depend on specific model+adapter combinations. Some adapter combinations don't exist yet (e.g., Granite 4 LoRA adapters, #359), blocking test migration. The `answer_relevance_classifier` adapter failed to resolve at collection time (#680, fixed). Are these appropriate for different machines/environments? Do we need a smaller or wider range of model configurations? Or backends? How about going beyond Ollama locally? Can we document and check that the right models are available, and clearly report when they are not?

5. How We Improve
Expanding test coverage
Where are the gaps? Do new features ship without tests? Are there areas of the codebase with no coverage at all that we should target? Should every LLM-dependent test have a mocked variant for CI? (The pattern in #567 — adding `test_astream_mock.py` alongside the live tests — is a good model.)

Making tests more flexible & reusable
Many tests are tightly coupled to a specific backend or model. Could we write more backend-agnostic tests that parameterise across whatever backends are available? Tests that work with any Ollama model rather than hardcoding `granite4:micro`? This would make tests more resilient to model changes and more useful across different developer environments.

Can we make it easier to write good tests — templates, example patterns, shared fixtures, clear "here's how to add a test for a new backend" guides?
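For the "any Ollama model" idea, a small helper could pick from whatever is installed instead of hardcoding one. A sketch — the preference list is a made-up example, and `available` would come from the Ollama API:

```python
def pick_model(available, preferred=("granite4:micro",)):
    """Choose a preferred Ollama model if present, else any available one."""
    for name in preferred:
        if name in available:
            return name
    if available:
        return sorted(available)[0]  # any model beats skipping the test
    raise RuntimeError("No Ollama models available; `ollama pull` one first")
```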
Examples and tutorials
What's our strategy on tutorials and examples? Currently we test some examples under pytest (#372 optimised discovery/execution), but this can only work for those that are non-interactive. Subprocess-based example tests have broken in CI due to a missing `PYTHONPATH` (#593, fixed in #594). A qiskit example was collected as a test and failed (#683, fixed in #686). Can we cover examples and tutorials completely, so they are always known-good?

Best practice tooling
We already have several pytest plugins installed but underused. Are they the right ones? Should we adopt more?
- `pytest-xdist` — parallel execution (`-n auto`). Installed, never used in CI. For non-LLM tests this could cut CI time significantly. For LLM tests the bottleneck is inference, so parallelism helps less unless we have multiple Ollama instances or API backends.
- `pytest-recording` — response recording/replay. Installed, used in one test. Could make LLM-dependent tests deterministic in CI by replaying recorded responses.
- `pytest-timeout` — per-test timeouts. Installed and configured globally at 900s. Should we have tiered timeouts (30s unit, 120s Ollama, 600s GPU)?
- `nbmake` — notebook testing. Installed, unused.
- Candidates to adopt: `pytest-rerunfailures` (auto-retry flaky tests), `pytest-memray` (memory profiling), `hypothesis` (property-based testing for non-LLM code paths).

What's the right balance between adopting more tooling and keeping the test infrastructure simple enough that everyone understands it?
Ideas worth exploring
These have come up in issues/PRs or are suggested by the tooling we already have:
Semantic assertions (Proposal: Semantic assertions for qualitative test validation #692) — Replace brittle string-matching on LLM output with meaning-based validators using LLM-as-judge. Tackles the root cause of qualitative flakiness rather than just skipping tests.
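A minimal sketch of the semantic-assertion idea, assuming some `judge` callable backed by an LLM. All names are hypothetical illustrations, not the #692 design:

```python
# Semantic assertion sketch: ask an LLM judge whether output satisfies a
# criterion, instead of brittle substring matching. `judge` is any callable
# that takes a prompt string and returns the model's text.
def assert_semantically(output: str, criterion: str, judge) -> None:
    verdict = judge(
        "Does the following text satisfy this criterion?\n"
        f"Criterion: {criterion}\n"
        f"Text: {output}\n"
        "Answer YES or NO."
    )
    assert verdict.strip().upper().startswith("YES"), (
        f"Output failed semantic check {criterion!r}: {output[:200]}"
    )
```

In tests, `judge` would wrap whichever backend is available; in CI it could be mocked or replayed.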
Response recording — `pytest-recording` is already a dev dependency but used in exactly one test. Recording real LLM responses and replaying them would make qualitative tests deterministic in CI without writing manual mocks. Could be the pragmatic middle ground between "skip in CI" and "build semantic assertions."

"What can I run?" diagnostic — A `pytest --check-env` or setup script that reports: Ollama status, available models, GPU type/memory, RAM, API keys, installed extras. Tells a developer exactly which test subsets will work on their machine.

Automated test suggestion from diffs — Given a PR diff, infer which backends/markers are affected and emit the right `pytest -m` expression. Could be a CI step or local tool.

Model pre-flight checks (Add model checks for pytest #574) — At collection time, verify required models are actually available and fail fast with actionable messages instead of cryptic mid-suite errors.
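A pre-flight check could be as small as querying Ollama's standard `/api/tags` endpoint at collection time. A sketch — `require_model` is a hypothetical name:

```python
import json
import urllib.request

def available_ollama_models(base_url: str = "http://localhost:11434") -> set:
    """Return the model names a local Ollama server reports, or empty if down."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return {m["name"] for m in data.get("models", [])}
    except OSError:
        return set()  # server not running / unreachable

def require_model(name: str) -> None:
    models = available_ollama_models()
    if name not in models:
        raise RuntimeError(
            f"Model {name!r} is not available; run `ollama pull {name}` "
            f"(server reported: {sorted(models) or 'no models / server down'})"
        )
```

Wired into a conftest hook, this would fail fast with an actionable message instead of a cryptic mid-suite error.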
Shared session-scoped backends (CI: backends for vllm testing are redundant and cause CUDA context lock #625) — One vLLM/HF server per pytest session instead of per-module. Eliminates CUDA fragmentation, cuts GPU test time significantly.
Backend contract tests — All backends implement the same interface. A single parameterised test suite that runs against each available backend would catch inconsistencies and make it obvious which backends are tested.
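The shape might be a single parameterised suite. Everything below is a stub to illustrate; the real version would construct actual backends and `pytest.skip()` unavailable ones:

```python
import pytest

class FakeEchoBackend:
    # Stand-in for the shared backend interface; real backends would go here.
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

def make_backend(name: str):
    # Hypothetical factory: in the real suite, construct Ollama/OpenAI/HF
    # backends and pytest.skip() when one isn't available on this machine.
    return FakeEchoBackend()

@pytest.mark.parametrize("name", ["ollama", "openai", "huggingface"])
def test_generate_returns_nonempty_text(name):
    backend = make_backend(name)
    result = backend.generate("Say OK")
    assert isinstance(result, str) and result
```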
Integration test matrix (Set up integration testing infrastructure for framework adapters #451) — Separate CI jobs per framework adapter (langchain, dspy, crewai) with isolated deps.
Notebook & doc code block testing (add automated testing for jupyter notebooks #89, CI: end-to-end execution of documentation code blocks #638) — `nbmake` is a dev dependency but unused. Doc code snippets are untested. Both catch doc rot.

Network isolation by default — `pytest-recording`'s `block_network` marker could enforce that unit tests never make real network calls, catching tests that silently depend on Ollama/APIs without declaring it via markers.

6. How We Track
Metrics & visibility
We don't track test results over time. We added code coverage (#352, #353) but removed it from screen output (#565) for readability — the data is generated but not surfaced anywhere. No visibility into skip counts, runtime trends, flakiness rates, or coverage trends. What's the minimum viable approach? Can we determine actions needed from this data — for example more coverage needed in certain areas? Do we need a dashboard?
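A minimum viable approach might not need a dashboard at all: pytest already emits machine-readable JUnit XML (`pytest --junitxml=report.xml`), and a few lines can trend skip counts and runtime per run. A sketch; field names follow the JUnit XML attributes pytest emits:

```python
import xml.etree.ElementTree as ET

def suite_stats(path: str) -> dict:
    """Pull headline numbers out of a pytest --junitxml report."""
    root = ET.parse(path).getroot()
    # pytest wraps results in <testsuites><testsuite .../></testsuites>
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    return {
        "tests": int(suite.get("tests", 0)),
        "skipped": int(suite.get("skipped", 0)),
        "failures": int(suite.get("failures", 0)),
        "time": float(suite.get("time", 0.0)),
    }
```

Appending these numbers to a CSV per nightly run would already show skip and runtime trends.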
Labelling & tracking test work
How do we organise and prioritise the work that comes out of this discussion? Discovery/investigation work isn't well-distinguished from implementation. Do we need sub-categories beyond `testing`, a project board, a consistent label scheme? Some items are quick wins that could land immediately; others are significant infrastructure investments — how do we make that distinction visible?