## Executive Summary

I'm proposing a systematic optimisation effort for the Count-Min Sketch implementation in datasketches-cpp. The sketch is currently under-characterised, which limits both its use and its discoverability. I intend to begin with comprehensive benchmarking infrastructure and progress through targeted performance improvements, bringing the sketch to parity with the high standard of characterisation found elsewhere in the library. The ultimate goal is a high-quality, production-ready Count-Min Sketch implementation, with these learnings then applied to develop a Count Sketch implementation. Additionally, the website should be updated in lockstep with the characterisations so that more people can learn about and use the CountMin sketch.

## Background and Motivation

Count-Min Sketch is a fundamental probabilistic data structure for frequency estimation in streams, and its performance is critical for real-world deployments processing millions of updates per second. Whilst the current datasketches-cpp implementation is functional, there are several opportunities for optimisation that could significantly improve performance whilst maintaining the same theoretical guarantees.

## Current Observations

After reviewing the current Count-Min Sketch implementation, I've identified several areas that warrant investigation:

1. **Hash function construction**: Hash functions are manually constructed rather than leveraging modern, highly-optimised hash libraries (e.g., xxHash3, wyhash)
2. **Non-power-of-2 widths**: Sketch widths are arbitrary sizes, requiring expensive modulo operations on every hash computation rather than fast bitwise operations
3. **Missing batch operations**: No batch update API, forcing per-item function call overhead in streaming scenarios
4. **Limited benchmarking**: Insufficient performance characterisation to guide optimisation efforts

## Proposed Approach

I propose a phased approach that prioritises measurement before optimisation. The plan focuses initially on Phases 0 and 1, with later phases contingent on results and maintainer feedback.

---

### **Phase 0: Baseline Characterisation Infrastructure**

This phase establishes the testing infrastructure that will be used throughout all subsequent optimisation work.

**Issue 0.1: Benchmark Framework Setup**
- Create a modular benchmark harness with configurable parameters (width, depth)
- Support multiple data distributions (uniform, Zipfian)

**Issue 0.2: Baseline Performance Characterisation**
- Measure core metrics: accuracy, update time, query time, merge time
- Establish accuracy baselines across different error bounds (ε, δ)
- Test across multiple sketch sizes (width × depth combinations)
- Profile basic performance metrics (throughput, latency percentiles)

---

### **Phase 1: Fundamental Optimisations**

With benchmarks in place, we can tackle two fundamental inefficiencies. The first is the use of arbitrary-width hash tables, which require expensive modulo operations. The second is the manual hash function construction, which likely underperforms modern hash functions. However, Count-Min Sketch's theoretical guarantees require 2-wise independent hash functions ([Figure 2](https://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf)), so some investigation is needed to ensure that any hash functions introduced are satisfactory.
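To make the Issue 0.1 harness concrete, here is a minimal sketch of what its core could look like. Everything here is illustrative rather than the library's API: `stub_sketch` is a hypothetical stand-in for any type exposing an `update(key, weight)` method, and `bench_updates` times only the update loop (keys are pre-generated so RNG cost is excluded from the measurement).

```cpp
#include <chrono>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for datasketches::count_min_sketch: any type with
// update(key, weight). Name and shape are illustrative, not the library's API.
struct stub_sketch {
  std::vector<uint64_t> counts;
  explicit stub_sketch(std::size_t width) : counts(width, 0) {}
  void update(uint64_t key, uint64_t weight) {
    counts[key % counts.size()] += weight;
  }
};

// Generic update-throughput harness: feeds n keys drawn from `gen` into the
// sketch and returns updates per second. Pre-generating the keys keeps the
// generator's cost out of the timed region.
template <typename Sketch, typename KeyGen>
double bench_updates(Sketch& sk, KeyGen gen, std::size_t n) {
  std::vector<uint64_t> keys(n);
  for (auto& k : keys) k = gen();
  const auto t0 = std::chrono::steady_clock::now();
  for (uint64_t k : keys) sk.update(k, 1);
  const auto t1 = std::chrono::steady_clock::now();
  const double secs = std::chrono::duration<double>(t1 - t0).count();
  return static_cast<double>(n) / secs;
}
```

In a real harness the stub would be replaced by the actual `count_min_sketch` (adjusting for its real update signature), and a Zipfian generator can be dropped in as another `KeyGen` to cover the skewed-distribution case from Issue 0.1.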
**Issue 1.1: Power-of-2 Width Support**
- **Rationale**: Replace the expensive modulo with a bitwise AND (`hash % width` → `hash & (width - 1)`)
- **Implementation**: Add a power-of-2 constraint or automatic rounding in the constructor
- **Considerations**: This will be a **breaking change**, as existing code relies on exact width values
- **Measurement**: Update throughput and query latency before/after

**Issue 1.2: Modern Hash Function Integration**
- **Rationale**: The current manual hash construction is likely suboptimal
- **Options to evaluate**: xxHash3, wyhash, MurmurHash3
- **Critical investigation**: Count-Min Sketch requires 2-wise independent hash functions for its theoretical error bounds. This issue must investigate:
  - What hash construction does the current implementation use?
  - Are xxHash/MurmurHash/wyhash reasonable practical replacements despite not being provably 2-wise independent?
  - Empirical validation: do modern hash functions maintain the accuracy guarantees in practice? We can likely use the [smhasher](https://github.com/rurban/smhasher) repo to aid the investigation.
- **Measurement**: Hash computation time, collision rates, overall throughput, empirical error bounds vs theoretical predictions

---

## Questions for Maintainers

Before investing significant effort, I'd appreciate guidance on a few questions:

1. **Alignment with project goals**: Is this optimisation effort aligned with datasketches-cpp's roadmap and priorities?
2. **Breaking changes**: The power-of-2 width requirement could be a breaking change. What's the project's policy on this? Options:
   - Make it a new class variant (e.g., `count_min_sketch_pow2`)
   - Add it as a configurable template parameter
   - Make it the default, with automatic rounding
3. **Hash function philosophy**: What's the project's stance on the 2-wise independence requirement?
   - Strict adherence to theoretical guarantees?
   - Pragmatic acceptance of fast hashes if empirically validated?
   - Support both via template parameters?
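As background for questions 2 and 3, the two issues can work together: a classic 2-wise independent family is h(x) = ((a·x + b) mod p) mod w, and with a power-of-2 width the final `mod w` becomes a mask. The sketch below (all names illustrative, not the library's current construction) uses the Mersenne prime p = 2⁶¹ − 1 so that `mod p` needs no division. Standard caveats apply: exact 2-wise independence holds only for keys below p, and `mod p` followed by `mod w` has a small non-uniformity when w does not divide p.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Mersenne prime 2^61 - 1: allows reduction mod p via shifts and adds.
constexpr uint64_t P = (1ULL << 61) - 1;

// Round up to the next power of two so `h & (width - 1)` replaces `h % width`.
uint64_t next_pow2(uint64_t w) {
  uint64_t p = 1;
  while (p < w) p <<= 1;
  return p;
}

// Reduce a 128-bit value modulo P using the Mersenne identity
// x mod (2^61 - 1) == ((x >> 61) + (x & P)) mod (2^61 - 1), folded twice.
uint64_t mod_p(unsigned __int128 x) {
  x = (x >> 61) + (x & P);  // now < 2^68
  x = (x >> 61) + (x & P);  // now < 2^61 + 2^7, i.e. < 2P
  uint64_t r = static_cast<uint64_t>(x);
  if (r >= P) r -= P;
  return r;
}

// One member of the family h_{a,b}(x) = ((a*x + b) mod p) mod w, with the
// final mod replaced by a mask (requires power-of-2 width). Exact 2-wise
// independence assumes keys x < P, e.g. 61-bit digests of the raw items.
struct pairwise_hash {
  uint64_t a, b, mask;  // a in [1, P-1], b in [0, P-1], mask = width - 1
  uint64_t operator()(uint64_t x) const {
    return mod_p(static_cast<unsigned __int128>(a) * x + b) & mask;
  }
};

// Draw one row hash; each sketch row would get an independent (a, b) pair.
pairwise_hash make_row_hash(std::mt19937_64& rng, uint64_t width) {
  const uint64_t w = next_pow2(width);
  std::uniform_int_distribution<uint64_t> da(1, P - 1), db(0, P - 1);
  return pairwise_hash{da(rng), db(rng), w - 1};
}
```

This also illustrates the trade-off behind question 3: a family like this keeps the provable guarantees, whereas xxHash3/wyhash would be faster but only empirically validated. Benchmarking both against the Phase 0 baselines would quantify the gap.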
---

## Feedback Requested

I welcome feedback on:

1. Overall approach and phasing
2. Specific concerns about the proposed changes
3. Suggested alternatives or additional optimisations
4. Any considerations or complications regarding serialisation that I may have overlooked
5. Anything else I've missed

---