
Performance Optimization Guide

Comprehensive guide to optimizing the Mobile AI Orchestrator for constrained mobile platforms.


Table of Contents

  1. Performance Targets
  2. Profiling and Measurement
  3. CPU Optimization
  4. Memory Optimization
  5. Battery Optimization
  6. Network Optimization
  7. Platform-Specific Optimizations
  8. Advanced Techniques

Performance Targets {#targets}

Latency Goals

| Operation | Target | Acceptable | Notes |
|---|---|---|---|
| Simple query | <10ms | <50ms | Local routing |
| Complex query | <100ms | <500ms | May involve network |
| Context switch | <5ms | <20ms | Project switching |
| Reservoir update | <1ms | <5ms | Per conversation turn |
| MLP inference | <1ms | <5ms | Routing decision |
| SNN step | <100μs | <1ms | Wake detection |

Resource Limits

| Resource | Target | Maximum | Notes |
|---|---|---|---|
| Binary size | <1.5MB | <5MB | Stripped release |
| Memory (RSS) | <50MB | <128MB | Runtime footprint |
| CPU (idle) | <1% | <5% | Background operation |
| CPU (active) | <30% | <80% | During query processing |
| Battery (24h) | <100mAh | <300mAh | Background + active |

Throughput Goals

| Metric | Target | Notes |
|---|---|---|
| Queries per second | >100 | Sustained throughput |
| Reservoir updates/s | >1000 | Batch processing |
| MLP inferences/s | >5000 | Routing decisions |
| SNN steps/s | >10000 | Wake detection |

Profiling and Measurement {#profiling}

Built-in Benchmarks

# Run all benchmarks
cargo bench

# Specific component
cargo bench orchestrator
cargo bench reservoir
cargo bench mlp

# With profiling
cargo bench --bench orchestrator_bench -- --profile-time=10
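
For reference, a benchmark target such as benches/orchestrator_bench.rs might be structured roughly as follows. This is a sketch assuming the project uses Criterion and exposes the Orchestrator::process API used throughout this guide; the crate path and Query constructor name are placeholders.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical crate/module paths — adjust to the project's actual layout.
use mobile_ai::{Orchestrator, Query};

fn bench_simple_query(c: &mut Criterion) {
    let mut orchestrator = Orchestrator::new();

    c.bench_function("simple_query", |b| {
        b.iter(|| {
            // Build a fresh query each iteration; the constructor name is a placeholder.
            let query = Query::new_owned("what time is it".to_string());
            orchestrator.process(black_box(query))
        })
    });
}

criterion_group!(benches, bench_simple_query);
criterion_main!(benches);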

Platform-Specific Profiling

Android

1. Simpleperf (CPU profiling)

# On device
adb shell simpleperf record -p $(pidof mobile-ai) -o /data/local/tmp/perf.data
adb shell simpleperf report -i /data/local/tmp/perf.data

# Or with Rust symbols
adb pull /data/local/tmp/perf.data
simpleperf report-html -i perf.data -o report.html

2. Android Profiler (Android Studio)

1. Open Android Studio
2. View > Tool Windows > Profiler
3. Select device and process
4. Monitor CPU, Memory, Network

3. systrace (System-wide)

# Record trace
python systrace.py -o trace.html sched freq idle am wm gfx view binder_driver hal dalvik camera input res -a com.example.mobileai

# Open in Chrome
google-chrome trace.html

iOS

1. Instruments (CPU)

1. Xcode > Open Developer Tool > Instruments
2. Select "Time Profiler"
3. Choose device and app
4. Record and analyze hotspots

2. Instruments (Memory)

1. Instruments > Allocations
2. Look for memory growth
3. Check for leaks with "Leaks" template

3. Instruments (Energy)

1. Instruments > Energy Log
2. Measure power consumption
3. Identify energy spikes

Linux

1. perf (CPU profiling)

# Record
perf record -F 99 -g ./target/release/mobile-ai

# Report
perf report

# Flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

2. valgrind (Memory profiling)

# Memcheck (leaks)
valgrind --leak-check=full --show-leak-kinds=all ./target/release/mobile-ai

# Massif (heap profiling)
valgrind --tool=massif ./target/release/mobile-ai
ms_print massif.out.* > massif.txt

# Callgrind (call graph)
valgrind --tool=callgrind ./target/release/mobile-ai
kcachegrind callgrind.out.*

3. cargo-flamegraph

# Install
cargo install flamegraph

# Generate flamegraph
cargo flamegraph --bench orchestrator_bench

# Open flamegraph.svg

Custom Instrumentation

use std::time::Instant;

// Timing macro
macro_rules! time_it {
    ($name:expr, $code:block) => {{
        let start = Instant::now();
        let result = $code;
        let elapsed = start.elapsed();
        eprintln!("{}: {:?}", $name, elapsed);
        result
    }};
}

// Usage
pub fn process(&mut self, query: Query) -> Result<Response, String> {
    time_it!("total_query_processing", {
        let eval = time_it!("expert_evaluation", {
            self.expert.evaluate(&query)
        });

        let (route, confidence) = time_it!("routing_decision", {
            self.router.route(&query)
        });

        // ... rest of processing
    })
}

Conditional Compilation

#[cfg(feature = "profiling")]
macro_rules! time_it {
    ($name:expr, $code:block) => {{ /* instrumented version */ }};
}

#[cfg(not(feature = "profiling"))]
macro_rules! time_it {
    ($name:expr, $code:block) => {{ $code }};
}

Add to Cargo.toml:

[features]
profiling = []

Build with profiling:

cargo build --release --features profiling

CPU Optimization {#cpu-optimization}

Hotspot Identification

Based on profiling, typical hotspots:

  1. Reservoir update (~40% of CPU time)
  2. MLP forward pass (~30% of CPU time)
  3. Text encoding (~15% of CPU time)
  4. Routing logic (~10% of CPU time)
  5. Other (~5%)

Optimization Techniques

1. SIMD Vectorization

ARM NEON (Android, iOS, Linux ARM)

#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;

#[cfg(target_arch = "aarch64")]
unsafe fn dot_product_neon(a: &[f32], b: &[f32]) -> f32 {
    assert!(a.len() == b.len());
    assert!(a.len() % 4 == 0);

    let mut sum = vdupq_n_f32(0.0);

    for i in (0..a.len()).step_by(4) {
        let va = vld1q_f32(a.as_ptr().add(i));
        let vb = vld1q_f32(b.as_ptr().add(i));
        let vmul = vmulq_f32(va, vb);
        sum = vaddq_f32(sum, vmul);
    }

    // Horizontal sum
    let sum2 = vpadd_f32(vget_low_f32(sum), vget_high_f32(sum));
    let sum4 = vpadd_f32(sum2, sum2);
    vget_lane_f32(sum4, 0)
}

#[cfg(not(target_arch = "aarch64"))]
fn dot_product_neon(a: &[f32], b: &[f32]) -> f32 {
    // Fallback scalar implementation
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

Usage in Reservoir

// In src/reservoir.rs

pub fn update(&mut self, input: &[f32]) -> Vec<f32> {
    let mut input_activation = vec![0.0; self.reservoir_size];

    // Vectorized matrix-vector multiply
    for i in 0..self.reservoir_size {
        #[cfg(target_arch = "aarch64")]
        {
            input_activation[i] = unsafe {
                dot_product_neon(&self.input_weights[i], input)
            };
        }

        #[cfg(not(target_arch = "aarch64"))]
        {
            input_activation[i] = self.input_weights[i]
                .iter()
                .zip(input)
                .map(|(w, x)| w * x)
                .sum();
        }
    }

    // ... rest of update
}

Benchmark Impact:

  • Before: 200μs per update
  • After: 80μs per update (2.5x speedup)

2. Loop Unrolling

// Manual unrolling for small fixed-size loops
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max_val = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);

    // Compute exp (unrolled for common sizes)
    let mut exp_vals = Vec::with_capacity(logits.len());

    match logits.len() {
        3 => {
            // Routing decision (most common case)
            exp_vals.push((logits[0] - max_val).exp());
            exp_vals.push((logits[1] - max_val).exp());
            exp_vals.push((logits[2] - max_val).exp());
        }
        _ => {
            // General case
            for &logit in logits {
                exp_vals.push((logit - max_val).exp());
            }
        }
    }

    let sum: f32 = exp_vals.iter().sum();
    exp_vals.iter().map(|x| x / sum).collect()
}

3. Lazy Evaluation

// Only compute reservoir output when needed
pub fn snapshot(&self, last_n: usize) -> ContextSnapshot {
    ContextSnapshot {
        recent_turns: self.history.iter().take(last_n).cloned().collect(),
        // Only compute reservoir output if we have reservoir AND it will be saved
        reservoir_state: self.reservoir.as_ref().and_then(|esn| {
            if should_save_reservoir_state() {
                Some(esn.output())  // Expensive operation
            } else {
                None
            }
        }),
        metadata: SnapshotMetadata {
            timestamp: current_timestamp(),
            total_turns: self.history.len(),
        },
    }
}

4. Inlining

// Force inline for small hot functions
#[inline(always)]
pub fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

#[inline(always)]
fn current_timestamp() -> u64 {
    std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap()
        .as_secs()
}

// Prevent inline for large functions
#[inline(never)]
pub fn train(&mut self, inputs: &[Vec<f32>], targets: &[Vec<f32>], lambda: f32) -> Result<(), String> {
    // Large training function - don't bloat call sites
    // ...
    Ok(())
}

5. Branch Prediction Hints

// Use likely/unlikely macros (unstable, or use manual patterns)
#[inline(always)]
fn likely(b: bool) -> bool {
    #[cold]
    fn cold() {}

    if !b { cold(); }
    b
}

// Usage
if likely(query.text.len() < 200) {
    // Common case: short query, local routing
    return (RoutingDecision::Local, 0.75);
}

6. Fast Math

In Cargo.toml:

[profile.release]
# ... existing config
codegen-units = 1

Note: rustflags is not a valid key inside a Cargo.toml profile on stable Cargo. Pass target-CPU flags via the RUSTFLAGS environment variable (below) or in .cargo/config.toml (sketched after the build commands). Stable Rust has no blanket fast-math switch, so target-CPU tuning is the main lever here.

Build with target-specific optimizations:

# For specific CPU
RUSTFLAGS="-C target-cpu=cortex-a76" cargo build --release --target aarch64-linux-android

# For best available on build machine
RUSTFLAGS="-C target-cpu=native" cargo build --release

Memory Optimization {#memory-optimization}

Memory Usage Analysis

Current Memory Footprint:

Binary:               1.5MB
Stack (per thread):   ~2MB
Orchestrator:         ~500KB
  ├─ Expert rules:    ~10KB
  ├─ Router config:   ~1KB
  ├─ Context history: ~200KB (100 turns × ~2KB each)
  ├─ Reservoir:       ~250KB (1000 neurons × 384 inputs × 4 bytes)
  └─ MLP:             ~160KB (weights + biases)
SNN:                  ~50KB (sparse weights)
Total (approximate):  ~2.5MB RSS
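
To sanity-check the resident footprint against these figures during development, VmRSS can be read from /proc/self/status (a Linux-only sketch; Android, being Linux-based, exposes the same file):

// Read the current resident set size (in kB) from /proc/self/status (Linux).
#[cfg(target_os = "linux")]
fn current_rss_kb() -> Option<u64> {
    // The file contains a line like "VmRSS:      2480 kB"
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kb| kb.parse().ok())
}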

Optimization Techniques

1. Memory Pooling

use std::sync::Mutex;

// Pool of reusable buffers
pub struct VectorPool {
    pool: Mutex<Vec<Vec<f32>>>,
    size: usize,
}

impl VectorPool {
    pub fn new(capacity: usize, size: usize) -> Self {
        let mut pool = Vec::with_capacity(capacity);
        for _ in 0..capacity {
            pool.push(Vec::with_capacity(size));
        }
        VectorPool {
            pool: Mutex::new(pool),
            size,
        }
    }

    pub fn get(&self) -> Vec<f32> {
        self.pool.lock().unwrap().pop().unwrap_or_else(|| Vec::with_capacity(self.size))
    }

    pub fn return_vec(&self, mut v: Vec<f32>) {
        v.clear();
        if let Ok(mut pool) = self.pool.try_lock() {
            if pool.len() < pool.capacity() {
                pool.push(v);
            }
        }
    }
}

// Global pool for common sizes
lazy_static::lazy_static! {
    static ref VEC_384_POOL: VectorPool = VectorPool::new(10, 384);
    static ref VEC_1000_POOL: VectorPool = VectorPool::new(10, 1000);
}

// Usage
let mut embedding = VEC_384_POOL.get();
// ... use embedding
VEC_384_POOL.return_vec(embedding);

2. Small String Optimization

use std::borrow::Cow;

// Use Cow for strings that might be borrowed
pub struct Query<'a> {
    pub text: Cow<'a, str>,
    pub project_context: Option<Cow<'a, str>>,
    pub priority: u8,
    pub timestamp: u64,
}

impl<'a> Query<'a> {
    // Borrowed version (no allocation)
    pub fn new_borrowed(text: &'a str) -> Self {
        Query {
            text: Cow::Borrowed(text),
            project_context: None,
            priority: 5,
            timestamp: current_timestamp(),
        }
    }

    // Owned version (when needed)
    pub fn new_owned(text: String) -> Self {
        Query {
            text: Cow::Owned(text),
            project_context: None,
            priority: 5,
            timestamp: current_timestamp(),
        }
    }
}

3. Compact Data Structures

// Use smaller types where possible
pub struct CompactQuery {
    pub text: String,
    pub priority: u8,           // u8 instead of usize
    pub timestamp: u32,         // u32 for timestamps (year 2106 problem, but saves 4 bytes)
    pub flags: u8,              // Bit flags instead of multiple bools
}

// Bit flags
const FLAG_HAS_CONTEXT: u8 = 0b0001;
const FLAG_IS_URGENT: u8 = 0b0010;

impl CompactQuery {
    pub fn has_context(&self) -> bool {
        self.flags & FLAG_HAS_CONTEXT != 0
    }

    pub fn set_has_context(&mut self, val: bool) {
        if val {
            self.flags |= FLAG_HAS_CONTEXT;
        } else {
            self.flags &= !FLAG_HAS_CONTEXT;
        }
    }
}

4. Bounded Collections

// Use fixed-size circular buffer instead of unbounded Vec
pub struct BoundedHistory<T> {
    buffer: Vec<T>,
    capacity: usize,
    start: usize,
    len: usize,
}

impl<T> BoundedHistory<T> {
    pub fn new(capacity: usize) -> Self {
        BoundedHistory {
            buffer: Vec::with_capacity(capacity),
            capacity,
            start: 0,
            len: 0,
        }
    }

    pub fn push(&mut self, item: T) {
        if self.len < self.capacity {
            self.buffer.push(item);
            self.len += 1;
        } else {
            // Overwrite oldest
            self.buffer[self.start] = item;
            self.start = (self.start + 1) % self.capacity;
        }
    }

    pub fn iter(&self) -> impl Iterator<Item = &T> + '_ {
        // Iterate from newest to oldest
        (0..self.len).map(move |i| {
            // Index of the i-th newest element in the circular buffer
            let idx = (self.start + self.len - 1 - i) % self.capacity;
            &self.buffer[idx]
        })
    }
}

// Use in ContextManager
pub struct ContextManager {
    history: BoundedHistory<ConversationTurn>,  // Fixed size, no unbounded growth
    // ...
}

5. Lazy Serialization

// Don't serialize everything all the time
impl ContextManager {
    pub fn save_checkpoint(&self, path: &str) -> Result<(), String> {
        // Only serialize what's needed
        let checkpoint = Checkpoint {
            // Recent history only (not full 100 turns)
            recent_turns: self.history.iter().take(10).cloned().collect(),
            // Compressed reservoir state
            reservoir_state: self.reservoir.as_ref().map(|esn| esn.state.clone()),
            // Don't save full project contexts
            current_project: self.current_project.clone(),
        };

        let json = serde_json::to_string(&checkpoint)
            .map_err(|e| e.to_string())?;

        std::fs::write(path, json).map_err(|e| e.to_string())
    }
}
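
The Checkpoint type referenced above is not defined elsewhere in this guide; a minimal shape consistent with the fields being serialized might look like this (field types are assumptions, and ConversationTurn would need to derive Serialize/Deserialize):

use serde::{Deserialize, Serialize};

// Assumed checkpoint payload matching the fields written in save_checkpoint.
#[derive(Serialize, Deserialize)]
struct Checkpoint {
    recent_turns: Vec<ConversationTurn>,
    reservoir_state: Option<Vec<f32>>,
    current_project: Option<String>,
}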

Battery Optimization {#battery-optimization}

Power Consumption Sources

  1. CPU (40-60%): Active computation
  2. Network (20-30%): API calls
  3. Memory (10-15%): DRAM refresh, access
  4. Storage (5-10%): Disk I/O
  5. Sensors (0-5%): Wake detection

Optimization Strategies

1. Event-Driven Architecture

// Don't poll continuously, use event-driven wake
pub struct PowerEfficientOrchestrator {
    orchestrator: Orchestrator,
    wake_detector: SpikingNetwork,
    state: PowerState,
}

#[derive(PartialEq)]
enum PowerState {
    Sleep,      // No activity, minimal power
    Listening,  // SNN active, main system sleeping
    Active,     // Full system active
}

impl PowerEfficientOrchestrator {
    pub fn run(&mut self) {
        loop {
            match self.state {
                PowerState::Sleep => {
                    // Wait for external trigger (user input, notification)
                    std::thread::park();
                    self.state = PowerState::Listening;
                }

                PowerState::Listening => {
                    // Low-power wake detection
                    let audio_sample = get_audio_sample();  // Platform-specific
                    let spikes = audio_to_spikes(&audio_sample);
                    let output = self.wake_detector.step(&spikes, 1.0);

                    if output[0] {  // Wake word detected
                        self.state = PowerState::Active;
                    } else {
                        // Stay in low-power mode
                        std::thread::sleep(Duration::from_millis(10));
                    }
                }

                PowerState::Active => {
                    // Process queries
                    if let Some(query) = check_for_query() {
                        self.orchestrator.process(query);
                    } else {
                        // Timeout back to listening
                        std::thread::sleep(Duration::from_secs(5));
                        self.state = PowerState::Listening;
                    }
                }
            }
        }
    }
}

2. Batch Processing

// Batch queries to amortize wake-up cost
pub struct BatchProcessor {
    pending_queries: Vec<Query>,
    batch_timeout: Duration,
    last_process: Instant,
}

impl BatchProcessor {
    pub fn add_query(&mut self, query: Query) {
        self.pending_queries.push(query);

        // Process if batch full or timeout
        if self.pending_queries.len() >= 10 || self.last_process.elapsed() > self.batch_timeout {
            self.process_batch();
        }
    }

    fn process_batch(&mut self) {
        // Wake CPU to higher P-state once
        // Process all queries
        for query in self.pending_queries.drain(..) {
            process_query(query);
        }
        // CPU can return to low P-state
        self.last_process = Instant::now();
    }
}

3. Adaptive Frequency Scaling

// Hint CPU governor based on workload
#[cfg(target_os = "linux")]
fn set_cpu_governor(governor: &str) -> Result<(), std::io::Error> {
    // Requires appropriate permissions
    std::fs::write("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", governor)
}

pub fn process_with_power_hint(&mut self, query: Query) -> Result<Response, String> {
    // Simple query: use powersave governor
    if query.text.len() < 50 {
        #[cfg(target_os = "linux")]
        let _ = set_cpu_governor("powersave");

        let result = self.process(query);

        #[cfg(target_os = "linux")]
        let _ = set_cpu_governor("schedutil");  // Restore

        result
    } else {
        // Complex query: use performance governor
        #[cfg(target_os = "linux")]
        let _ = set_cpu_governor("performance");

        let result = self.process(query);

        #[cfg(target_os = "linux")]
        let _ = set_cpu_governor("schedutil");

        result
    }
}

4. Network Request Coalescing

// Batch network requests instead of making individual calls
pub struct NetworkBatcher {
    pending_requests: Vec<RemoteQuery>,
    timer: Option<Instant>,
}

impl NetworkBatcher {
    pub fn add_request(&mut self, query: RemoteQuery) {
        self.pending_requests.push(query);

        if self.timer.is_none() {
            self.timer = Some(Instant::now());
        }

        // Flush if batch size reached or timeout
        if self.pending_requests.len() >= 5 || self.timer.unwrap().elapsed() > Duration::from_millis(100) {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if self.pending_requests.is_empty() {
            return;
        }

        // Single network request with batch
        let responses = send_batch_request(&self.pending_requests);

        // Distribute responses
        for (query, response) in self.pending_requests.drain(..).zip(responses) {
            query.respond(response);
        }

        self.timer = None;
    }
}

5. Wake Lock Management

// Platform-specific wake lock (Android example)
#[cfg(target_os = "android")]
mod android_power {
    use jni::JNIEnv;
    use jni::objects::{JClass, JObject};

    pub fn acquire_wake_lock(env: &JNIEnv) {
        // Get PowerManager
        let pm = env.call_method(/* context */, "getSystemService", /* ... */);

        // Acquire wake lock
        let wake_lock = env.call_method(pm, "newWakeLock", /* ... */);
        env.call_method(wake_lock, "acquire", /* timeout */);
    }

    pub fn release_wake_lock(env: &JNIEnv) {
        // Release wake lock
        env.call_method(/* wake_lock */, "release", /* ... */);
    }
}

// Use wake locks only when necessary
pub fn process_critical_query(&mut self, query: Query) -> Result<Response, String> {
    #[cfg(target_os = "android")]
    android_power::acquire_wake_lock(&env);

    let result = self.process(query);

    #[cfg(target_os = "android")]
    android_power::release_wake_lock(&env);

    result
}

Network Optimization {#network-optimization}

Techniques

1. Request Caching

use std::collections::HashMap;
use std::time::{Duration, Instant};

pub struct ResponseCache {
    cache: HashMap<String, (Response, Instant)>,
    ttl: Duration,
}

impl ResponseCache {
    pub fn new(ttl: Duration) -> Self {
        ResponseCache {
            cache: HashMap::new(),
            ttl,
        }
    }

    pub fn get(&mut self, query_hash: &str) -> Option<Response> {
        if let Some((response, timestamp)) = self.cache.get(query_hash) {
            if timestamp.elapsed() < self.ttl {
                return Some(response.clone());
            } else {
                // Expired
                self.cache.remove(query_hash);
            }
        }
        None
    }

    pub fn insert(&mut self, query_hash: String, response: Response) {
        self.cache.insert(query_hash, (response, Instant::now()));
    }
}
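
The cache is keyed by a query hash. One simple way to derive it is the standard library hasher over the query text (a sketch; normalization such as lowercasing or whitespace collapsing can be layered on top to improve hit rates):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Derive a cache key from the raw query text.
fn query_hash(text: &str) -> String {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

// Usage: consult the cache before touching the network.
// let key = query_hash(&query.text);
// if let Some(response) = cache.get(&key) {
//     return Ok(response);
// }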

2. Compression

#[cfg(feature = "network")]
async fn send_compressed_request(query: &str) -> Result<Response, String> {
    use flate2::write::GzEncoder;
    use flate2::Compression;
    use std::io::Write;  // brings write_all into scope for the encoder

    // Compress request
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(query.as_bytes()).map_err(|e| e.to_string())?;
    let compressed = encoder.finish().map_err(|e| e.to_string())?;

    // Send compressed data
    let response = reqwest::Client::new()
        .post("https://api.example.com/v1/chat")
        .header("Content-Encoding", "gzip")
        .body(compressed)
        .send()
        .await
        .map_err(|e| e.to_string())?;

    // ... parse response
}

3. Timeout and Retry

#[cfg(feature = "network")]
async fn request_with_timeout(
    query: &str,
    timeout: Duration,
    max_retries: u32,
) -> Result<Response, String> {
    let mut retries = 0;

    loop {
        match tokio::time::timeout(timeout, send_request(query)).await {
            Ok(Ok(response)) => return Ok(response),
            Ok(Err(e)) => {
                retries += 1;
                if retries >= max_retries {
                    return Err(format!("Max retries exceeded: {}", e));
                }
                // Exponential backoff
                tokio::time::sleep(Duration::from_millis(100 * 2_u64.pow(retries))).await;
            }
            Err(_) => {
                retries += 1;
                if retries >= max_retries {
                    return Err("Request timeout".to_string());
                }
            }
        }
    }
}

Platform-Specific Optimizations {#platform-specific}

Android

1. Use Hardware Acceleration

// Leverage Android Neural Networks API (NNAPI)
#[cfg(target_os = "android")]
mod nnapi {
    // Wrapper for NNAPI
    // Requires JNI bindings to android.neuralnetworks

    pub fn run_mlp_on_nnapi(mlp: &MLP, input: &[f32]) -> Vec<f32> {
        // Convert MLP to NNAPI model
        // Run on NPU/GPU instead of CPU
        // Return results
        todo!()
    }
}

2. Doze Mode Handling

#[cfg(target_os = "android")]
pub fn handle_doze_mode() {
    // Use AlarmManager for critical tasks during Doze
    // Defer non-critical work
}

iOS

1. Core ML Integration

// Convert MLP to Core ML model
import CoreML

class MLPRouter {
    let model: MLModel

    init() {
        // Load compiled Core ML model
        model = try! MLModel(contentsOf: /* model URL */)
    }

    func route(query: [Float]) -> Int {
        // Run inference on Neural Engine
        let input = MLPInput(features: query)
        let output = try! model.prediction(from: input)
        return output.decision
    }
}

2. Background Processing

import BackgroundTasks

func scheduleBackgroundProcessing() {
    let request = BGProcessingTaskRequest(identifier: "com.example.mobileai.reservoir")
    request.requiresNetworkConnectivity = false
    request.requiresExternalPower = false

    try? BGTaskScheduler.shared.submit(request)
}

Linux ARM

1. DSP Offload

// Use Qualcomm Hexagon DSP (if available)
#[cfg(all(target_os = "linux", target_arch = "aarch64"))]
mod hexagon {
    // Offload SNN to DSP for ultra-low power
    pub fn run_snn_on_dsp(snn: &SpikingNetwork, input: &[bool]) -> Vec<bool> {
        // Requires Hexagon SDK
        todo!()
    }
}

Advanced Techniques {#advanced-techniques}

Model Quantization

// Convert f32 weights to int8
pub struct QuantizedMLP {
    weights: Vec<Vec<Vec<i8>>>,
    scale_factors: Vec<f32>,
    zero_points: Vec<i8>,
}

impl QuantizedMLP {
    pub fn from_mlp(mlp: &MLP) -> Self {
        // Affine quantization: w ≈ scale * (q - zero_point), so q = round(w / scale) + zero_point
        let mut quantized_weights = Vec::new();
        let mut scale_factors = Vec::new();
        let mut zero_points = Vec::new();

        for layer in &mlp.weights {
            let min = layer.iter().flat_map(|row| row.iter()).cloned().fold(f32::INFINITY, f32::min);
            let max = layer.iter().flat_map(|row| row.iter()).cloned().fold(f32::NEG_INFINITY, f32::max);

            // Extend the range to include zero so the zero point fits in i8
            let min = min.min(0.0);
            let max = max.max(0.0);

            // Affine map from [min, max] onto the signed i8 range [-128, 127]
            let scale = ((max - min) / 255.0).max(f32::MIN_POSITIVE);
            let zero_point = (-128.0 - min / scale).round();

            let quantized_layer: Vec<Vec<i8>> = layer
                .iter()
                .map(|row| {
                    row.iter()
                        .map(|&w| (w / scale + zero_point).round().clamp(-128.0, 127.0) as i8)
                        .collect()
                })
                .collect();

            quantized_weights.push(quantized_layer);
            scale_factors.push(scale);
            zero_points.push(zero_point as i8);
        }

        QuantizedMLP {
            weights: quantized_weights,
            scale_factors,
            zero_points,
        }
    }

    pub fn forward(&self, input: &[f32]) -> Vec<f32> {
        // Integer arithmetic forward pass
        // Dequantize at the end
        todo!()
    }
}

// Benefits:
// - 4x memory reduction (f32 → i8)
// - 2-4x faster on mobile (int8 SIMD)
// - <1% accuracy loss (typically)
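
forward is left as a todo above; as a reference point, a single dequantized layer multiplication might look like this (a naive sketch that dequantizes each weight on the fly — a production int8 kernel would accumulate in i32 and dequantize once per output):

// Sketch: one layer of QuantizedMLP::forward, dequantizing on the fly.
fn quantized_layer_forward(
    weights: &[Vec<i8>],  // one quantized layer (rows of weights)
    scale: f32,
    zero_point: i8,
    input: &[f32],
) -> Vec<f32> {
    weights
        .iter()
        .map(|row| {
            row.iter()
                .zip(input)
                // Recover the approximate f32 weight: w ≈ scale * (q - zero_point)
                .map(|(&q, &x)| scale * (q as f32 - zero_point as f32) * x)
                .sum()
        })
        .collect()
}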

Model Pruning

pub fn prune_mlp(mlp: &mut MLP, threshold: f32) -> usize {
    let mut pruned_count = 0;

    for layer in &mut mlp.weights {
        for row in layer {
            for weight in row {
                if weight.abs() < threshold {
                    *weight = 0.0;
                    pruned_count += 1;
                }
            }
        }
    }

    pruned_count
}

// Example: prune small-magnitude weights (threshold = 0.01)
let mut mlp = MLP::new(384, vec![100, 50], 3);
let total_weights: usize = mlp.weights.iter()
    .map(|layer| layer.iter().map(|row| row.len()).sum::<usize>())
    .sum();
let pruned = prune_mlp(&mut mlp, 0.01);
println!("Pruned {} weights ({:.1}%)", pruned, pruned as f32 / total_weights as f32 * 100.0);

// Convert to sparse representation
pub struct SparseMLP {
    weights: Vec<HashMap<(usize, usize), f32>>,  // Only store non-zero
}

Knowledge Distillation

// Train small model to mimic large model
pub fn distill_mlp(
    large_mlp: &MLP,
    small_mlp: &mut MLP,
    training_data: &[Vec<f32>],
    temperature: f32,
) {
    for input in training_data {
        // Get soft targets from large model
        let large_output = large_mlp.forward(input);
        let soft_targets = softmax_with_temperature(&large_output, temperature);

        // Train small model to match
        let small_output = small_mlp.forward(input);
        let loss = kl_divergence(&soft_targets, &softmax(&small_output));

        // Backprop and update
        // ...
    }
}

// Result: 10x smaller model with 95% accuracy of large model
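
The helpers referenced above are standard; minimal versions might look like this (a sketch — p is the soft-target distribution and q the student's, with a small epsilon to avoid log of zero):

// Softmax with temperature: higher T yields softer targets for distillation.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max_val = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exp_vals: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max_val) / temperature).exp())
        .collect();
    let sum: f32 = exp_vals.iter().sum();
    exp_vals.iter().map(|x| x / sum).collect()
}

// KL divergence D(p || q) between two discrete distributions.
fn kl_divergence(p: &[f32], q: &[f32]) -> f32 {
    p.iter()
        .zip(q)
        .map(|(&pi, &qi)| {
            if pi > 0.0 {
                pi * (pi / qi.max(1e-12)).ln()
            } else {
                0.0
            }
        })
        .sum()
}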

Monitoring and Metrics

Runtime Metrics

use std::sync::atomic::{AtomicU64, Ordering};

pub struct Metrics {
    queries_processed: AtomicU64,
    total_latency_ms: AtomicU64,
    cache_hits: AtomicU64,
    cache_misses: AtomicU64,
}

impl Metrics {
    pub fn record_query(&self, latency_ms: u64, cache_hit: bool) {
        self.queries_processed.fetch_add(1, Ordering::Relaxed);
        self.total_latency_ms.fetch_add(latency_ms, Ordering::Relaxed);

        if cache_hit {
            self.cache_hits.fetch_add(1, Ordering::Relaxed);
        } else {
            self.cache_misses.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn report(&self) {
        let queries = self.queries_processed.load(Ordering::Relaxed);
        let total_latency = self.total_latency_ms.load(Ordering::Relaxed);
        let avg_latency = if queries > 0 { total_latency / queries } else { 0 };

        let hits = self.cache_hits.load(Ordering::Relaxed);
        let misses = self.cache_misses.load(Ordering::Relaxed);
        let hit_rate = if hits + misses > 0 {
            hits as f32 / (hits + misses) as f32 * 100.0
        } else {
            0.0
        };

        eprintln!("Metrics:");
        eprintln!("  Queries: {}", queries);
        eprintln!("  Avg latency: {}ms", avg_latency);
        eprintln!("  Cache hit rate: {:.1}%", hit_rate);
    }
}

Checklist

  • Profile on target device (not just development machine)
  • Enable SIMD for matrix operations
  • Use memory pooling for frequent allocations
  • Implement request caching
  • Use event-driven architecture for battery savings
  • Quantize models (f32 → int8)
  • Prune unnecessary weights
  • Batch network requests
  • Set appropriate CPU governor hints
  • Monitor power consumption with platform tools
  • Optimize binary size (target <1.5MB)
  • Test on low-end devices
  • Measure battery impact over 24 hours

Quick reference:

  • Run benchmarks: cargo bench
  • Profile CPU: perf record && perf report
  • Check memory: valgrind --tool=massif