CSHARP-5992: Add LINQ translation benchmark suite #2004
15 benchmarks covering filter, projection, and IQueryable composition translation paths. New LinqBench category added to AllCategories and the perf-test runner. BulkWriteBench also added to AllCategories.
Document benchmark inventory, code path coverage, interpretation guidance, and provisional thresholds. Record targeted injection test results validating selective sensitivity of each benchmark to its target translator code path.
PartialEvaluator injection test showed OrChainFilter is still affected (evaluator traverses all expressions). Reframed as a sensitivity amplifier rather than diagnostic isolator.
…methods LinqBench uses translations/second instead of MB/s since there is no data throughput to measure. Add Unit and MetricName to BenchmarkResult so exporters label scores correctly. Change all benchmark methods to return their values to prevent JIT dead-code elimination.
The composite score loop was using default MB/s labels for all categories including LinqBench. Now correctly labels LinqBench composites as translations_per_second.
…d-to-end benchmarks; update regression thresholds
- Redesign LinqTranslationBenchmark.cs: 15 individual feature benchmarks → 10 representative user queries covering distinct translator code paths (MultiFieldSearch, OrStatusFilter, BatchLookup, ArrayElementQuery, FieldSelection, AggregationProjection, ProjectionSentinel, UpdatePipeline, QueryablePipeline, GroupByAggregation)
- Add LinqEndToEndBenchmark.cs: one-off characterization of translation overhead vs pre-built BsonDocument queries on a live collection; not wired into CI
- Update README regression thresholds based on 7-run M1 Max drift characterization: tight bucket (15%) for MultiFieldSearch/UpdatePipeline/BatchLookup/ArrayElementQuery; wider bucket (30%) for OrStatusFilter/FieldSelection/AggregationProjection/QueryablePipeline/GroupByAggregation
BorisDog left a comment
Looks very good overall.
    {
        return _collection.Find(x =>
            x.Status == _statusFilter &&
            x.CustomerName.StartsWith(_prefix) &&
Does StartsWith create the exact same regex as in the raw version?
It might make sense to do a first translation somewhere in GlobalSetup and compare the produced MQL, so that if we change the translation in the future the benchmark will throw.
    {
        private const string DatabaseName = "linqbench";
        private const string CollectionName = "orders";
        private const int SeedCount = 500;
Consider creating indexes in the database so that server time is minimized and changes in translation time are more apparent.
    public List<BsonDocument> GroupByLinq()
    {
        return _collection.Aggregate()
            .Group(x => x.Status, g => new { Status = g.Key, Count = g.Count(), TotalRevenue = g.Sum(x => x.Total) })
There is a projection to an anonymous type here, followed by creation of the BSON document. The raw example does not project to an anonymous type.
…quivalence fixes; rename OrStatusFilter to OrFilter
Pull request overview
Adds a new LINQ-focused benchmark suite to the driver benchmarks project, enabling perf-job tracking and composite scoring for LINQ translation performance (plus an optional end-to-end comparison suite).
Changes:
- Introduces LinqTranslationBenchmark (translation-only, no query execution) and LinqEndToEndBenchmark (LINQ vs raw query plans with real DB execution).
- Wires new LinqBench category into perf-job filtering and composite-score export (including score units/metric names).
- Extends benchmark category constants and composite export output to include LinqBench (and now BulkWriteBench) in AllCategories.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| evergreen/run-perf-tests.sh | Adds LinqBench to Evergreen perf category filter. |
| benchmarks/MongoDB.Driver.Benchmarks/Linq/README.md | Documents the LINQ benchmark inventory and interpretation guidance. |
| benchmarks/MongoDB.Driver.Benchmarks/Linq/LinqTranslationBenchmark.cs | Adds translation-focused benchmarks across filter/field/projection/update/IQueryable entry points. |
| benchmarks/MongoDB.Driver.Benchmarks/Linq/LinqEndToEndBenchmark.cs | Adds end-to-end LINQ vs raw benchmarks that seed data and run against a live server. |
| benchmarks/MongoDB.Driver.Benchmarks/Exporters/LocalExporter.cs | Emits per-category and per-benchmark units (MB/s vs translations/s). |
| benchmarks/MongoDB.Driver.Benchmarks/Exporters/EvergreenExporter.cs | Emits per-category and per-benchmark metric names for Evergreen (MB/s vs translations/s). |
| benchmarks/MongoDB.Driver.Benchmarks/DriverBenchmarkCategory.cs | Adds LinqBench and includes it (and BulkWriteBench) in composite category list. |
| benchmarks/MongoDB.Driver.Benchmarks/BenchmarkResult.cs | Adds unit/metric metadata and computes translations/s scoring for LinqBench. |
…inqbench from LinqEndToEndBenchmark
…ed for IQueryable benchmarks
    };

    // ids.Contains(x.Id) translates to { "_id": { "$in": [...] } } since Id maps to _id by convention.
    _inFilter = new BsonDocument("_id", new BsonDocument("$in", new BsonArray(_lookupIds.Select(id => (BsonValue)id))));
Should _lookupIds.Select(...) be evaluated once, so it doesn't contribute to benchmark time?
Good instinct, but it's already setup-only. That _lookupIds.Select(...) is inside PreBuildQueries(), which runs from Setup() under [GlobalSetup] — so the $in array is materialized into _inFilter once before any measured iteration, not per-invocation.
…osite unit from members
Dispatch BenchmarkResult scoring on model rather than category: byte-throughput (BsonBench data or a present BenchmarkDataSetSize param) yields MB/s via a shared helper, everything else is time-throughput scored as operations/second. This gives the end-to-end LINQ benchmark an honest score and removes the NullReferenceException that hit any benchmark lacking BenchmarkDataSetSize.
Mark ProjectionSentinel ExcludeFromComposite so its ~17ns fast-path timing no longer dominates the averaged LinqBench composite while it still runs and is reported individually.
CalculateComposite now derives the composite's unit from its members instead of branching on category, dropping the duplicated unit ternary from both exporters.
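The model-based dispatch described in this commit message can be sketched roughly as follows. This is an illustration only, not the code in BenchmarkResult.cs: the names ScoreSketch, ScoreUnit and Compute are hypothetical, and the MB/s formula assumes 1 MB = 1,000,000 bytes and a mean time in nanoseconds.

```csharp
using System;

// Hypothetical unit labels; the real exporters may name these differently.
public enum ScoreUnit { MegabytesPerSecond, OperationsPerSecond }

public static class ScoreSketch
{
    // Picks the scoring model from whether a data-set size is known,
    // rather than from the benchmark's category.
    public static (double Score, ScoreUnit Unit) Compute(double meanNanoseconds, long? dataSetSizeBytes)
    {
        if (meanNanoseconds <= 0)
            throw new ArgumentOutOfRangeException(nameof(meanNanoseconds));

        double seconds = meanNanoseconds / 1e9;

        if (dataSetSizeBytes.HasValue)
        {
            // Byte-throughput model: megabytes processed per second
            // (assumes 1 MB = 1,000,000 bytes).
            return (dataSetSizeBytes.Value / 1e6 / seconds, ScoreUnit.MegabytesPerSecond);
        }

        // Time-throughput model: completed operations per second.
        // A missing size no longer needs to be dereferenced on this path.
        return (1.0 / seconds, ScoreUnit.OperationsPerSecond);
    }

    public static void Main()
    {
        // Made-up timings, chosen only to show both branches.
        var translation = Compute(meanNanoseconds: 20_000, dataSetSizeBytes: null);
        Console.WriteLine($"{translation.Score:F0} {translation.Unit}");

        var bson = Compute(meanNanoseconds: 1_000_000, dataSetSizeBytes: 5_000_000);
        Console.WriteLine($"{bson.Score:F0} {bson.Unit}");
    }
}
```

Keying the branch on the presence of a size rather than on the category is what lets a benchmark without BenchmarkDataSetSize still receive a score.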
Put LinqEndToEndBenchmark in the LinqBench category and mark it ExcludeFromComposite. It runs in the perf job (LinqBench is already selected) and is reported individually, but stays out of the LinqBench composite, whose translation-only members are orders of magnitude faster.
These benchmarks measure the translator's share of user-visible latency under realistic result sizes; tracking them over time shows how that share shifts as serialization and other round-trip costs change.
Remove ProjectionSentinel: the x => x fast path it guarded is already covered deterministically by CSharp4742Tests, which assert the identity projection renders a null document. A timing benchmark with a 100% threshold is a weaker guard of the same behavior.
…the README
Retitle to cover both suites and scope the no-queries statement to the translation suite.
Add an End-to-End Benchmarks section explaining what that suite measures (the translator's share of user-visible latency), that translator regressions are caught by the translation suite rather than here, and that it is tracked as a trend rather than gated.
Remove the Code Path Coverage section, including its claim of validation by targeted injection tests that do not exist.
Drop the ProjectionSentinel references and specific test-hardware names.
…trend wording
The repo enforces no regression thresholds (detection runs downstream on the emitted results), so a provisional, locally-characterized threshold table in the README documented nothing the code controls and duplicated the calibrated bands in the PR.
Reword the end-to-end note to recommend reading the series as a trend rather than asserting it is configured that way.
- Empty-category composites fall back to the category's normal unit (LinqBench → ops/s, otherwise MB/s) so a composite's metric name is stable across runs whether or not the category ran, instead of always reporting operations_per_second.
- Replace DateTime.UtcNow in the update-pipeline benchmark expression with a fixed DateTime, removing a per-invocation clock read from the translation measurement.
- Use the strongly-typed index-key overload for ShippingAddress.City instead of a string, matching the sibling keys.
Summary
Adds a 9-benchmark LinqBench suite that exercises the LINQ-to-aggregation translation layer in isolation (no queries executed at benchmark time), plus a LinqEndToEnd suite that compares LINQ vs hand-built BsonDocument query plans end-to-end. Wires LinqBench into the perf-job category filter and the composite-score path, giving us a defensible signal when the translator (especially SerializerFinder, AstSimplifier, and the method-call sub-translators) moves.
Cross-commit runs against pre-/post-SerializerFinder and an in-flight optimization PR (#1961) show the suite catches the regressions and improvements we'd want it to; see Validation.
Motivation
The driver has spec-driven benchmarks for I/O-heavy patterns (Find, BulkWrite, GridFS, BSON encode/decode) but nothing for the LINQ translator. As internal paths shift (SerializerFinder overhauls, AstSimplifier changes, new visitor support) we have no signal on translation-cost movement. When CSHARP-5572 introduced SerializerFinder (#1700), we couldn't quantify what it cost. This closes that gap.
What this adds
| File | Description |
|---|---|
| benchmarks/MongoDB.Driver.Benchmarks/Linq/LinqTranslationBenchmark.cs | |
| benchmarks/MongoDB.Driver.Benchmarks/Linq/LinqEndToEndBenchmark.cs | |
| benchmarks/MongoDB.Driver.Benchmarks/Linq/LinqBenchmarkDataTypes.cs | |
| benchmarks/MongoDB.Driver.Benchmarks/Linq/README.md | |
| benchmarks/MongoDB.Driver.Benchmarks/DriverBenchmarkCategory.cs | LinqBench and ExcludeFromComposite consts; LinqBench and BulkWriteBench added to AllCategories. ExcludeFromComposite keeps the end-to-end benchmarks out of the LinqBench composite while still running them |
| benchmarks/MongoDB.Driver.Benchmarks/BenchmarkResult.cs + Exporters/*.cs | …ExcludeFromComposite benchmarks |
| evergreen/run-perf-tests.sh | …LinqBench to the --anyCategories filter |

Cedar/SPS auto-discovers new composite categories; no dashboard config required.
Design decisions
- LinqTranslationBenchmark calls translator entry points directly (LinqProviderAdapter.TranslateExpressionTo*, ExpressionToExecutableQueryTranslator.Translate); no queries run at benchmark time, isolating translator regressions from network and serialization noise. (Caveat: QueryablePipeline and GroupByAggregation need a MongoQueryProvider<T>, obtained from collection.AsQueryable() on a real MongoClient in [GlobalSetup]; its background cluster monitor is the only DB-side activity, expected below the 2-4% drift floor.) End-to-end coverage lives in LinqEndToEndBenchmark; it shares the LinqBench category so it runs in the perf job, and is marked ExcludeFromComposite so it stays out of the composite.
- …SerializerFinderVisit*.cs file they exercised; reframed to "what users actually write." Matrix coverage stays an option for targeted gaps.
- LinqBench does not cross-tag with the spec categories (DriverBench, BsonBench, etc.), keeping LINQ numbers out of the spec composite averages. It lands in AllCategories so its own composite is emitted.
- BulkWriteBench composite bundled here. Previously excluded with a "not part of the benchmarking spec" comment; team wanted its composite tracked, done here to avoid a tiny standalone PR.
- [MemoryDiagnoser] on everything. Allocation regressions matter independently of time and surface earlier than time regressions on noisy hardware.

Benchmark inventory
Translation suite: 9 benchmarks across filter (4), field (1), projection (1), update (1), and IQueryable (2) entry points. End-to-end suite: 12 benchmarks, i.e. 6 patterns (MultiFieldSearch, OrFilter, GroupBy, Projection, InFilter, PagedQuery) each run as LINQ vs hand-built raw. The full per-benchmark breakdown (patterns, translator paths exercised) lives in Linq/README.md.
The e2e suite seeds 500 documents with secondary indexes on Status, CreatedAt, and ShippingAddress.City in [GlobalSetup]; each LINQ/Raw pair renders to byte-equivalent BSON (verified via LinqProviderAdapter), so the LINQ−Raw delta isolates translator + provider overhead from query-shape differences.
Validation
Five lines of evidence that the suite produces actionable signal.
1. Within-run noise
Across all perf-hw runs (n=10 each, multiple commits), BDN-reported within-run StdDev is <1% on most benchmarks (typically 0.2-0.9%), with the fastest micro-benchmarks at 0.3-1.4%. Each individual run is a well-converged measurement.
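For readers who want to reproduce the "StdDev <1%" figure from raw timings, a minimal sketch of the relative standard deviation follows. It assumes the sample (n−1) standard deviation divided by the mean; NoiseSketch and the timings in Main are hypothetical and are not taken from the runs above.

```csharp
using System;
using System.Linq;

public static class NoiseSketch
{
    // Sample standard deviation (n - 1 denominator) as a percentage of the mean.
    public static double RelativeStdDevPercent(double[] samples)
    {
        if (samples == null || samples.Length < 2)
            throw new ArgumentException("Need at least two samples.", nameof(samples));

        double mean = samples.Average();
        double sumOfSquares = samples.Sum(s => (s - mean) * (s - mean));
        double stdDev = Math.Sqrt(sumOfSquares / (samples.Length - 1));
        return 100.0 * stdDev / mean;
    }

    public static void Main()
    {
        // Made-up per-iteration timings in microseconds.
        double[] timings = { 99.0, 100.0, 101.0 };
        Console.WriteLine($"{RelativeStdDevPercent(timings):F2}%");
    }
}
```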
2. Selectivity: targeted regression injection
Thread.SpinWait(300) (~10 µs on M1) injected into four translator code paths in turn:
| Injection site | Benchmarks that moved |
|---|---|
| SerializerFinder.FindSerializers() | |
| GetItemMethodToFilterFieldTranslator | FieldSelection only |
| NotExpressionToFilterTranslator | MultiFieldSearch (Not), OrFilter (chains Comparison dispatch) |
| GroupByMethodToPipelineTranslator | GroupByAggregation only |

Each injection moved only the benchmarks that should have moved: clean per-path selectivity, so when something regresses, the benchmarks that move tell you which translator moved.
3. Cross-run drift on the perf-job hardware (n=10 on rhel90-dbx-perf-large)
.NET 8.0, X64 RyuJit, single perf-task invocation on the same host.
MultiFieldSearch, OrFilter, BatchLookup, ArrayElementQuery, FieldSelection, AggregationProjection, UpdatePipeline, QueryablePipeline, GroupByAggregation
Every benchmark but FieldSelection lands in the 2-4.5% range with CV ≤1.5%, a ~5× compression of the M1 drift bands characterized during development. FieldSelection (~6µs) drifts wider (7.1%): it is fast enough that small absolute drift looks proportionally large. Sub-10% deltas are individually resolvable, which matters for the optimization comparison below.
4. Cross-commit reality check
Transplanted onto pinned commits and run on rhel90-dbx-perf-large (n=10 per commit), the suite caught both a known historical regression and an in-flight optimization:
- …46640eac98) vs the merge (59c9d34180): every benchmark moved well above its drift band, most 2-3× slower. UpdatePipeline was a +626% time / +898% allocation outlier because TranslateExpressionToSetStage runs SerializerFinder on the un-preprocessed lambda, unlike the other entry points which preprocess first (root cause in the follow-up below). Confirms the suite surfaces real translator regressions.
- …SerializerFinderVisitMethodCall switch → MethodInfo-keyed lookup), base 66780341e7 vs head 54973d039a: measurable allocation wins on the method-call / IQueryable benchmarks (ArrayElementQuery -12.4%, QueryablePipeline -4.5%, BatchLookup/GroupByAggregation -3.3%) with smaller time effects at the edge of the drift bands. The suite resolves what kind of change this is (an allocation reduction), which was invisible at M1 noise levels.

5. End-to-end overhead: Atlas dev cluster, perf-hardware, n=10
Six patterns run twice each (LINQ-translated vs hand-built BsonDocument / pipeline), 500 docs with secondary indexes, against a live Atlas dev cluster from the perf host. Each LINQ/Raw pair renders to byte-equivalent BSON, so the LINQ−Raw delta reflects translator and provider overhead, not query-shape differences.
MultiFieldSearch, OrFilter, GroupBy, Projection, InFilter, PagedQuery
Translator share is the fraction of user-visible LINQ time that disappears if you write raw BsonDocument instead. It is an upper bound on translator cost (the delta also includes provider overhead like cursor construction and command serialization).
- …Projection 51%, MultiFieldSearch 40%): translator is ~40-50% of user-visible latency on indexed selective queries; a 10% translator regression is a ~5% user-visible one.
- …GroupBy 27%, InFilter 29%, PagedQuery 26%): translator is ~25-30%; meaningful server-side work ($group, $in, sort-skip-limit) partially offsets it.
- …OrFilter 1.2%): translator is ~1% because the 4-way OR matches many documents and serializes ~700 KB, so a 10% translator regression is invisible to users.
- …Projection 3.5×, GroupBy 3.2×: projected-type-serializer and IGroupingSerializer construction is allocation-heavy, and catches translator-side allocation regressions even when network time masks the time delta.

Caveat: Atlas dev cluster across the internet from the perf host; per-run absolute times range ±15-30% (up to ±50% where server time dominates), but the translator-share ratios are more stable because LINQ and Raw on the same iteration see correlated network noise. Single-batch result, 500 docs.
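The translator-share arithmetic can be worked through with a small sketch. It assumes share = (LINQ − Raw) / LINQ, as the definition above implies, and treats the whole delta as translator cost (the stated upper bound). TranslatorShareSketch and the timings are hypothetical, not measurements from this PR.

```csharp
using System;

public static class TranslatorShareSketch
{
    // Fraction of LINQ time that disappears when the query is hand-built.
    public static double Share(double linqMean, double rawMean)
    {
        if (linqMean <= 0)
            throw new ArgumentOutOfRangeException(nameof(linqMean));
        return (linqMean - rawMean) / linqMean;
    }

    // If only the translator portion slows down by translatorRegression
    // (0.10 = 10%), total latency grows by share * translatorRegression.
    public static double UserVisibleRegression(double share, double translatorRegression)
        => share * translatorRegression;

    public static void Main()
    {
        // Made-up mean latencies in microseconds.
        double share = Share(linqMean: 200.0, rawMean: 100.0);
        double visible = UserVisibleRegression(share, translatorRegression: 0.10);
        Console.WriteLine($"share={share:P0}, user-visible regression={visible:P1}");
    }
}
```

With a share near 50%, a 10% translator slowdown surfaces as roughly 5% for users; with a share near 1% the same slowdown is about 0.1%, which is why a pattern like OrFilter cannot reveal it.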
Regression-alert thresholds (perf-hardware-calibrated)
Calibrated to observed drift on rhel90-dbx-perf-large:
- MultiFieldSearch, BatchLookup, ArrayElementQuery, AggregationProjection, UpdatePipeline, QueryablePipeline, GroupByAggregation, OrFilter
- FieldSelection

Allocation thresholds should be even tighter: observed allocation drift is 0-1.2%.
Follow-ups (not in this PR)
- …TranslateExpressionToSetStage preprocessing asymmetry surfaced above (SerializerFinder runs on the un-preprocessed tree; UpdatePipeline +626% time, +898% alloc). The design is intentional: dispatch pattern-matches on NewExpression/MemberInitExpression, which PartialEvaluator would collapse if applied at the top, so any fix must preserve that dispatch shape. Not a one-line change.
- …Lookup/Join, Distinct, SelectMany, Cast, $expr fallback paths.
- …QueryablePipeline/GroupByAggregation (expected sub-1%, absorbed by the 2-4% drift floor).
- …[GlobalSetup], so a future translator shape change fails the benchmark loudly instead of silently shifting the share numbers.