
Performance Optimization Guide

Comprehensive guide to optimizing the Mobile AI Orchestrator for constrained mobile platforms.


Table of Contents

  1. Performance Targets
  2. Profiling and Measurement
  3. CPU Optimization
  4. Memory Optimization
  5. Battery Optimization
  6. Network Optimization
  7. Platform-Specific Optimizations
  8. Advanced Techniques

Performance Targets {#targets}

Latency Goals

| Operation | Target | Acceptable | Notes |
|---|---|---|---|
| Simple query | <10ms | <50ms | Local routing |
| Complex query | <100ms | <500ms | May involve network |
| Context switch | <5ms | <20ms | Project switching |
| Reservoir update | <1ms | <5ms | Per conversation turn |
| MLP inference | <1ms | <5ms | Routing decision |
| SNN step | <100μs | <1ms | Wake detection |

Resource Limits

| Resource | Target | Maximum | Notes |
|---|---|---|---|
| Binary size | <1.5MB | <5MB | Stripped release |
| Memory (RSS) | <50MB | <128MB | Runtime footprint |
| CPU (idle) | <1% | <5% | Background operation |
| CPU (active) | <30% | <80% | During query processing |
| Battery (24h) | <100mAh | <300mAh | Background + active |

Throughput Goals

| Metric | Target | Notes |
|---|---|---|
| Queries per second | >100 | Sustained throughput |
| Reservoir updates/s | >1000 | Batch processing |
| MLP inferences/s | >5000 | Routing decisions |
| SNN steps/s | >10000 | Wake detection |

Profiling and Measurement {#profiling}

Built-in Benchmarks

# Run all benchmarks
cargo bench

# Specific component
cargo bench orchestrator
cargo bench reservoir
cargo bench mlp

# With profiling
cargo bench --bench orchestrator_bench -- --profile-time=10
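
For reference, a benchmark target such as benches/orchestrator_bench.rs might be structured roughly as follows. This is a sketch assuming the project uses Criterion and exposes the Orchestrator::process API used throughout this guide; the crate path and Query constructor name are placeholders.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical crate/module paths — adjust to the project's actual layout.
use mobile_ai::{Orchestrator, Query};

fn bench_simple_query(c: &mut Criterion) {
    let mut orchestrator = Orchestrator::new();

    c.bench_function("simple_query", |b| {
        b.iter(|| {
            // Build a fresh query each iteration; the constructor name is a placeholder.
            let query = Query::new_owned("what time is it".to_string());
            orchestrator.process(black_box(query))
        })
    });
}

criterion_group!(benches, bench_simple_query);
criterion_main!(benches);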

Platform-Specific Profiling

Android

1. Simpleperf (CPU profiling)

# On device
adb shell simpleperf record -p $(pidof mobile-ai) -o /data/local/tmp/perf.data
adb shell simpleperf report -i /data/local/tmp/perf.data

# Or with Rust symbols
adb pull /data/local/tmp/perf.data
simpleperf report-html -i perf.data -o report.html

2. Android Profiler (Android Studio)

1. Open Android Studio
2. View > Tool Windows > Profiler
3. Select device and process
4. Monitor CPU, Memory, Network

3. systrace (System-wide)

# Record trace
python systrace.py -o trace.html sched freq idle am wm gfx view binder_driver hal dalvik camera input res -a com.example.mobileai

# Open in Chrome
google-chrome trace.html

iOS

1. Instruments (CPU)

1. Xcode > Open Developer Tool > Instruments
2. Select "Time Profiler"
3. Choose device and app
4. Record and analyze hotspots

2. Instruments (Memory)

1. Instruments > Allocations
2. Look for memory growth
3. Check for leaks with "Leaks" template

3. Instruments (Energy)

1. Instruments > Energy Log
2. Measure power consumption
3. Identify energy spikes

Linux

1. perf (CPU profiling)

# Record
perf record -F 99 -g ./target/release/mobile-ai

# Report
perf report

# Flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

2. valgrind (Memory profiling)

# Memcheck (leaks)
valgrind --leak-check=full --show-leak-kinds=all ./target/release/mobile-ai

# Massif (heap profiling)
valgrind --tool=massif ./target/release/mobile-ai
ms_print massif.out.* > massif.txt

# Callgrind (call graph)
valgrind --tool=callgrind ./target/release/mobile-ai
kcachegrind callgrind.out.*

3. cargo-flamegraph

# Install
cargo install flamegraph

# Generate flamegraph
cargo flamegraph --bench orchestrator_bench

# Open flamegraph.svg

Custom Instrumentation

use std::time::Instant;

// Timing macro
macro_rules! time_it {
    ($name:expr, $code:block) => {{
        let start = Instant::now();
        let result = $code;
        let elapsed = start.elapsed();
        eprintln!("{}: {:?}", $name, elapsed);
        result
    }};
}

// Usage
pub fn process(&mut self, query: Query) -> Result<Response, String> {
    time_it!("total_query_processing", {
        let eval = time_it!("expert_evaluation", {
            self.expert.evaluate(&query)
        });

        let (route, confidence) = time_it!("routing_decision", {
            self.router.route(&query)
        });

        // ... rest of processing
    })
}

Conditional Compilation

#[cfg(feature = "profiling")]
macro_rules! time_it {
    ($name:expr, $code:block) => {{ /* instrumented version */ }};
}

#[cfg(not(feature = "profiling"))]
macro_rules! time_it {
    ($name:expr, $code:block) => {{ $code }};
}

Add to Cargo.toml:

[features]
profiling = []

Build with profiling:

cargo build --release --features profiling

CPU Optimization {#cpu-optimization}

Hotspot Identification

Based on profiling, typical hotspots:

  1. Reservoir update (~40% of CPU time)
  2. MLP forward pass (~30% of CPU time)
  3. Text encoding (~15% of CPU time)
  4. Routing logic (~10% of CPU time)
  5. Other (~5%)

Optimization Techniques

1. SIMD Vectorization

ARM NEON (Android, iOS, Linux ARM)

#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;

#[cfg(target_arch = "aarch64")]
unsafe fn dot_product_neon(a: &[f32], b: &[f32]) -> f32 {
    assert!(a.len() == b.len());
    assert!(a.len() % 4 == 0);

    let mut sum = vdupq_n_f32(0.0);

    for i in (0..a.len()).step_by(4) {
        let va = vld1q_f32(a.as_ptr().add(i));
        let vb = vld1q_f32(b.as_ptr().add(i));
        let vmul = vmulq_f32(va, vb);
        sum = vaddq_f32(sum, vmul);
    }

    // Horizontal sum
    let sum2 = vpadd_f32(vget_low_f32(sum), vget_high_f32(sum));
    let sum4 = vpadd_f32(sum2, sum2);
    vget_lane_f32(sum4, 0)
}

#[cfg(not(target_arch = "aarch64"))]
fn dot_product_neon(a: &[f32], b: &[f32]) -> f32 {
    // Fallback scalar implementation
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

Usage in Reservoir

// In src/reservoir.rs

pub fn update(&mut self, input: &[f32]) -> Vec<f32> {
    let mut input_activation = vec![0.0; self.reservoir_size];

    // Vectorized matrix-vector multiply
    for i in 0..self.reservoir_size {
        #[cfg(target_arch = "aarch64")]
        {
            input_activation[i] = unsafe {
                dot_product_neon(&self.input_weights[i], input)
            };
        }

        #[cfg(not(target_arch = "aarch64"))]
        {
            input_activation[i] = self.input_weights[i]
                .iter()
                .zip(input)
                .map(|(w, x)| w * x)
                .sum();
        }
    }

    // ... rest of update
}

Benchmark Impact:

  • Before: 200μs per update
  • After: 80μs per update (2.5x speedup)

2. Loop Unrolling

// Manual unrolling for small fixed-size loops
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max_val = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);

    // Compute exp (unrolled for common sizes)
    let mut exp_vals = Vec::with_capacity(logits.len());

    match logits.len() {
        3 => {
            // Routing decision (most common case)
            exp_vals.push((logits[0] - max_val).exp());
            exp_vals.push((logits[1] - max_val).exp());
            exp_vals.push((logits[2] - max_val).exp());
        }
        _ => {
            // General case
            for &logit in logits {
                exp_vals.push((logit - max_val).exp());
            }
        }
    }

    let sum: f32 = exp_vals.iter().sum();
    exp_vals.iter().map(|x| x / sum).collect()
}

3. Lazy Evaluation

// Only compute reservoir output when needed
pub fn snapshot(&self, last_n: usize) -> ContextSnapshot {
    ContextSnapshot {
        recent_turns: self.history.iter().take(last_n).cloned().collect(),
        // Only compute reservoir output if we have reservoir AND it will be saved
        reservoir_state: self.reservoir.as_ref().and_then(|esn| {
            if should_save_reservoir_state() {
                Some(esn.output())  // Expensive operation
            } else {
                None
            }
        }),
        metadata: SnapshotMetadata {
            timestamp: current_timestamp(),
            total_turns: self.history.len(),
        },
    }
}

4. Inlining

// Force inline for small hot functions
#[inline(always)]
pub fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

#[inline(always)]
fn current_timestamp() -> u64 {
    std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap()
        .as_secs()
}

// Prevent inline for large functions
#[inline(never)]
pub fn train(&mut self, inputs: &[Vec<f32>], targets: &[Vec<f32>], lambda: f32) -> Result<(), String> {
    // Large training function - don't bloat call sites
    // ...
    Ok(())
}

5. Branch Prediction Hints

// Use likely/unlikely macros (unstable, or use manual patterns)
#[inline(always)]
fn likely(b: bool) -> bool {
    #[cold]
    fn cold() {}

    if !b { cold(); }
    b
}

// Usage
if likely(query.text.len() < 200) {
    // Common case: short query, local routing
    return (RoutingDecision::Local, 0.75);
}

6. Fast Math

In Cargo.toml:

[profile.release]
# ... existing config
codegen-units = 1

Note: rustflags is not a valid key inside a Cargo.toml profile on stable Cargo. Pass target-CPU flags via the RUSTFLAGS environment variable (below) or in .cargo/config.toml (sketched after the build commands). Stable Rust has no blanket fast-math switch, so target-CPU tuning is the main lever here.

Build with target-specific optimizations:

# For specific CPU
RUSTFLAGS="-C target-cpu=cortex-a76" cargo build --release --target aarch64-linux-android

# For best available on build machine
RUSTFLAGS="-C target-cpu=native" cargo build --release

Memory Optimization {#memory-optimization}

Memory Usage Analysis

Current Memory Footprint:

Binary:               1.5MB
Stack (per thread):   ~2MB
Orchestrator:         ~500KB
  ├─ Expert rules:    ~10KB
  ├─ Router config:   ~1KB
  ├─ Context history: ~200KB (100 turns × ~2KB each)
  ├─ Reservoir:       ~250KB (1000 neurons × 384 inputs × 4 bytes)
  └─ MLP:             ~160KB (weights + biases)
SNN:                  ~50KB (sparse weights)
Total (approximate):  ~2.5MB RSS
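
To sanity-check the resident footprint against these figures during development, VmRSS can be read from /proc/self/status (a Linux-only sketch; Android, being Linux-based, exposes the same file):

// Read the current resident set size (in kB) from /proc/self/status (Linux).
#[cfg(target_os = "linux")]
fn current_rss_kb() -> Option<u64> {
    // The file contains a line like "VmRSS:      2480 kB"
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kb| kb.parse().ok())
}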

Optimization Techniques

1. Memory Pooling

use std::sync::Mutex;

// Pool of reusable buffers
pub struct VectorPool {
    pool: Mutex<Vec<Vec<f32>>>,
    size: usize,
}

impl VectorPool {
    pub fn new(capacity: usize, size: usize) -> Self {
        let mut pool = Vec::with_capacity(capacity);
        for _ in 0..capacity {
            pool.push(Vec::with_capacity(size));
        }
        VectorPool {
            pool: Mutex::new(pool),
            size,
        }
    }

    pub fn get(&self) -> Vec<f32> {
        self.pool.lock().unwrap().pop().unwrap_or_else(|| Vec::with_capacity(self.size))
    }

    pub fn return_vec(&self, mut v: Vec<f32>) {
        v.clear();
        if let Ok(mut pool) = self.pool.try_lock() {
            if pool.len() < pool.capacity() {
                pool.push(v);
            }
        }
    }
}

// Global pool for common sizes
lazy_static::lazy_static! {
    static ref VEC_384_POOL: VectorPool = VectorPool::new(10, 384);
    static ref VEC_1000_POOL: VectorPool = VectorPool::new(10, 1000);
}

// Usage
let mut embedding = VEC_384_POOL.get();
// ... use embedding
VEC_384_POOL.return_vec(embedding);

2. Small String Optimization

use std::borrow::Cow;

// Use Cow for strings that might be borrowed
pub struct Query<'a> {
    pub text: Cow<'a, str>,
    pub project_context: Option<Cow<'a, str>>,
    pub priority: u8,
    pub timestamp: u64,
}

impl<'a> Query<'a> {
    // Borrowed version (no allocation)
    pub fn new_borrowed(text: &'a str) -> Self {
        Query {
            text: Cow::Borrowed(text),
            project_context: None,
            priority: 5,
            timestamp: current_timestamp(),
        }
    }

    // Owned version (when needed)
    pub fn new_owned(text: String) -> Self {
        Query {
            text: Cow::Owned(text),
            project_context: None,
            priority: 5,
            timestamp: current_timestamp(),
        }
    }
}

3. Compact Data Structures

// Use smaller types where possible
pub struct CompactQuery {
    pub text: String,
    pub priority: u8,           // u8 instead of usize
    pub timestamp: u32,         // u32 for timestamps (year 2106 problem, but saves 4 bytes)
    pub flags: u8,              // Bit flags instead of multiple bools
}

// Bit flags
const FLAG_HAS_CONTEXT: u8 = 0b0001;
const FLAG_IS_URGENT: u8 = 0b0010;

impl CompactQuery {
    pub fn has_context(&self) -> bool {
        self.flags & FLAG_HAS_CONTEXT != 0
    }

    pub fn set_has_context(&mut self, val: bool) {
        if val {
            self.flags |= FLAG_HAS_CONTEXT;
        } else {
            self.flags &= !FLAG_HAS_CONTEXT;
        }
    }
}

4. Bounded Collections

// Use fixed-size circular buffer instead of unbounded Vec
pub struct BoundedHistory<T> {
    buffer: Vec<T>,
    capacity: usize,
    start: usize,
    len: usize,
}

impl<T> BoundedHistory<T> {
    pub fn new(capacity: usize) -> Self {
        BoundedHistory {
            buffer: Vec::with_capacity(capacity),
            capacity,
            start: 0,
            len: 0,
        }
    }

    pub fn push(&mut self, item: T) {
        if self.len < self.capacity {
            self.buffer.push(item);
            self.len += 1;
        } else {
            // Overwrite oldest
            self.buffer[self.start] = item;
            self.start = (self.start + 1) % self.capacity;
        }
    }

    pub fn iter(&self) -> impl Iterator<Item = &T> + '_ {
        // Iterate from newest to oldest
        (0..self.len).map(move |i| {
            // Index of the i-th newest element in the circular buffer
            let idx = (self.start + self.len - 1 - i) % self.capacity;
            &self.buffer[idx]
        })
    }
}

// Use in ContextManager
pub struct ContextManager {
    history: BoundedHistory<ConversationTurn>,  // Fixed size, no unbounded growth
    // ...
}

5. Lazy Serialization

// Don't serialize everything all the time
impl ContextManager {
    pub fn save_checkpoint(&self, path: &str) -> Result<(), String> {
        // Only serialize what's needed
        let checkpoint = Checkpoint {
            // Recent history only (not full 100 turns)
            recent_turns: self.history.iter().take(10).cloned().collect(),
            // Compressed reservoir state
            reservoir_state: self.reservoir.as_ref().map(|esn| esn.state.clone()),
            // Don't save full project contexts
            current_project: self.current_project.clone(),
        };

        let json = serde_json::to_string(&checkpoint)
            .map_err(|e| e.to_string())?;

        std::fs::write(path, json).map_err(|e| e.to_string())
    }
}
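
The Checkpoint type referenced above is not defined elsewhere in this guide; a minimal shape consistent with the fields being serialized might look like this (field types are assumptions, and ConversationTurn would need to derive Serialize/Deserialize):

use serde::{Deserialize, Serialize};

// Assumed checkpoint payload matching the fields written in save_checkpoint.
#[derive(Serialize, Deserialize)]
struct Checkpoint {
    recent_turns: Vec<ConversationTurn>,
    reservoir_state: Option<Vec<f32>>,
    current_project: Option<String>,
}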

Battery Optimization {#battery-optimization}

Power Consumption Sources

  1. CPU (40-60%): Active computation
  2. Network (20-30%): API calls
  3. Memory (10-15%): DRAM refresh, access
  4. Storage (5-10%): Disk I/O
  5. Sensors (0-5%): Wake detection

Optimization Strategies

1. Event-Driven Architecture

// Don't poll continuously, use event-driven wake
pub struct PowerEfficientOrchestrator {
    orchestrator: Orchestrator,
    wake_detector: SpikingNetwork,
    state: PowerState,
}

#[derive(PartialEq)]
enum PowerState {
    Sleep,      // No activity, minimal power
    Listening,  // SNN active, main system sleeping
    Active,     // Full system active
}

impl PowerEfficientOrchestrator {
    pub fn run(&mut self) {
        loop {
            match self.state {
                PowerState::Sleep => {
                    // Wait for external trigger (user input, notification)
                    std::thread::park();
                    self.state = PowerState::Listening;
                }

                PowerState::Listening => {
                    // Low-power wake detection
                    let audio_sample = get_audio_sample();  // Platform-specific
                    let spikes = audio_to_spikes(&audio_sample);
                    let output = self.wake_detector.step(&spikes, 1.0);

                    if output[0] {  // Wake word detected
                        self.state = PowerState::Active;
                    } else {
                        // Stay in low-power mode
                        std::thread::sleep(Duration::from_millis(10));
                    }
                }

                PowerState::Active => {
                    // Process queries
                    if let Some(query) = check_for_query() {
                        self.orchestrator.process(query);
                    } else {
                        // Timeout back to listening
                        std::thread::sleep(Duration::from_secs(5));
                        self.state = PowerState::Listening;
                    }
                }
            }
        }
    }
}

2. Batch Processing

// Batch queries to amortize wake-up cost
pub struct BatchProcessor {
    pending_queries: Vec<Query>,
    batch_timeout: Duration,
    last_process: Instant,
}

impl BatchProcessor {
    pub fn add_query(&mut self, query: Query) {
        self.pending_queries.push(query);

        // Process if batch full or timeout
        if self.pending_queries.len() >= 10 || self.last_process.elapsed() > self.batch_timeout {
            self.process_batch();
        }
    }

    fn process_batch(&mut self) {
        // Wake CPU to higher P-state once
        // Process all queries
        for query in self.pending_queries.drain(..) {
            process_query(query);
        }
        // CPU can return to low P-state
        self.last_process = Instant::now();
    }
}

3. Adaptive Frequency Scaling

// Hint CPU governor based on workload
#[cfg(target_os = "linux")]
fn set_cpu_governor(governor: &str) -> Result<(), std::io::Error> {
    // Requires appropriate permissions
    std::fs::write("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", governor)
}

pub fn process_with_power_hint(&mut self, query: Query) -> Result<Response, String> {
    // Simple query: use powersave governor
    if query.text.len() < 50 {
        #[cfg(target_os = "linux")]
        let _ = set_cpu_governor("powersave");

        let result = self.process(query);

        #[cfg(target_os = "linux")]
        let _ = set_cpu_governor("schedutil");  // Restore

        result
    } else {
        // Complex query: use performance governor
        #[cfg(target_os = "linux")]
        let _ = set_cpu_governor("performance");

        let result = self.process(query);

        #[cfg(target_os = "linux")]
        let _ = set_cpu_governor("schedutil");

        result
    }
}

4. Network Request Coalescing

// Batch network requests instead of making individual calls
pub struct NetworkBatcher {
    pending_requests: Vec<RemoteQuery>,
    timer: Option<Instant>,
}

impl NetworkBatcher {
    pub fn add_request(&mut self, query: RemoteQuery) {
        self.pending_requests.push(query);

        if self.timer.is_none() {
            self.timer = Some(Instant::now());
        }

        // Flush if batch size reached or timeout
        if self.pending_requests.len() >= 5 || self.timer.unwrap().elapsed() > Duration::from_millis(100) {
            self.flush();
        }
    }

    fn flush(&mut self) {
        if self.pending_requests.is_empty() {
            return;
        }

        // Single network request with batch
        let responses = send_batch_request(&self.pending_requests);

        // Distribute responses
        for (query, response) in self.pending_requests.drain(..).zip(responses) {
            query.respond(response);
        }

        self.timer = None;
    }
}

5. Wake Lock Management

// Platform-specific wake lock (Android example)
#[cfg(target_os = "android")]
mod android_power {
    use jni::JNIEnv;
    use jni::objects::{JClass, JObject};

    pub fn acquire_wake_lock(env: &JNIEnv) {
        // Get PowerManager
        let pm = env.call_method(/* context */, "getSystemService", /* ... */);

        // Acquire wake lock
        let wake_lock = env.call_method(pm, "newWakeLock", /* ... */);
        env.call_method(wake_lock, "acquire", /* timeout */);
    }

    pub fn release_wake_lock(env: &JNIEnv) {
        // Release wake lock
        env.call_method(/* wake_lock */, "release", /* ... */);
    }
}

// Use wake locks only when necessary
pub fn process_critical_query(&mut self, query: Query) -> Result<Response, String> {
    #[cfg(target_os = "android")]
    android_power::acquire_wake_lock(&env);

    let result = self.process(query);

    #[cfg(target_os = "android")]
    android_power::release_wake_lock(&env);

    result
}

Network Optimization {#network-optimization}

Techniques

1. Request Caching

use std::collections::HashMap;
use std::time::{Duration, Instant};

pub struct ResponseCache {
    cache: HashMap<String, (Response, Instant)>,
    ttl: Duration,
}

impl ResponseCache {
    pub fn new(ttl: Duration) -> Self {
        ResponseCache {
            cache: HashMap::new(),
            ttl,
        }
    }

    pub fn get(&mut self, query_hash: &str) -> Option<Response> {
        if let Some((response, timestamp)) = self.cache.get(query_hash) {
            if timestamp.elapsed() < self.ttl {
                return Some(response.clone());
            } else {
                // Expired
                self.cache.remove(query_hash);
            }
        }
        None
    }

    pub fn insert(&mut self, query_hash: String, response: Response) {
        self.cache.insert(query_hash, (response, Instant::now()));
    }
}
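
The cache is keyed by a query hash. One simple way to derive it is the standard library hasher over the query text (a sketch; normalization such as lowercasing or whitespace collapsing can be layered on top to improve hit rates):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Derive a cache key from the raw query text.
fn query_hash(text: &str) -> String {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

// Usage: consult the cache before touching the network.
// let key = query_hash(&query.text);
// if let Some(response) = cache.get(&key) {
//     return Ok(response);
// }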

2. Compression

#[cfg(feature = "network")]
async fn send_compressed_request(query: &str) -> Result<Response, String> {
    use flate2::write::GzEncoder;
    use flate2::Compression;
    use std::io::Write;  // brings write_all into scope for the encoder

    // Compress request
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(query.as_bytes()).map_err(|e| e.to_string())?;
    let compressed = encoder.finish().map_err(|e| e.to_string())?;

    // Send compressed data
    let response = reqwest::Client::new()
        .post("https://api.example.com/v1/chat")
        .header("Content-Encoding", "gzip")
        .body(compressed)
        .send()
        .await
        .map_err(|e| e.to_string())?;

    // ... parse response
}

3. Timeout and Retry

#[cfg(feature = "network")]
async fn request_with_timeout(
    query: &str,
    timeout: Duration,
    max_retries: u32,
) -> Result<Response, String> {
    let mut retries = 0;

    loop {
        match tokio::time::timeout(timeout, send_request(query)).await {
            Ok(Ok(response)) => return Ok(response),
            Ok(Err(e)) => {
                retries += 1;
                if retries >= max_retries {
                    return Err(format!("Max retries exceeded: {}", e));
                }
                // Exponential backoff
                tokio::time::sleep(Duration::from_millis(100 * 2_u64.pow(retries))).await;
            }
            Err(_) => {
                retries += 1;
                if retries >= max_retries {
                    return Err("Request timeout".to_string());
                }
            }
        }
    }
}

Platform-Specific Optimizations {#platform-specific}

Android

1. Use Hardware Acceleration

// Leverage Android Neural Networks API (NNAPI)
#[cfg(target_os = "android")]
mod nnapi {
    // Wrapper for NNAPI
    // Requires JNI bindings to android.neuralnetworks

    pub fn run_mlp_on_nnapi(mlp: &MLP, input: &[f32]) -> Vec<f32> {
        // Convert MLP to NNAPI model
        // Run on NPU/GPU instead of CPU
        // Return results
        todo!()
    }
}

2. Doze Mode Handling

#[cfg(target_os = "android")]
pub fn handle_doze_mode() {
    // Use AlarmManager for critical tasks during Doze
    // Defer non-critical work
}

iOS

1. Core ML Integration

// Convert MLP to Core ML model
import CoreML

class MLPRouter {
    let model: MLModel

    init() {
        // Load compiled Core ML model
        model = try! MLModel(contentsOf: /* model URL */)
    }

    func route(query: [Float]) -> Int {
        // Run inference on Neural Engine
        let input = MLPInput(features: query)
        let output = try! model.prediction(from: input)
        return output.decision
    }
}

2. Background Processing

import BackgroundTasks

func scheduleBackgroundProcessing() {
    let request = BGProcessingTaskRequest(identifier: "com.example.mobileai.reservoir")
    request.requiresNetworkConnectivity = false
    request.requiresExternalPower = false

    try? BGTaskScheduler.shared.submit(request)
}

Linux ARM

1. DSP Offload

// Use Qualcomm Hexagon DSP (if available)
#[cfg(all(target_os = "linux", target_arch = "aarch64"))]
mod hexagon {
    // Offload SNN to DSP for ultra-low power
    pub fn run_snn_on_dsp(snn: &SpikingNetwork, input: &[bool]) -> Vec<bool> {
        // Requires Hexagon SDK
        todo!()
    }
}

Advanced Techniques {#advanced-techniques}

Model Quantization

// Convert f32 weights to int8
pub struct QuantizedMLP {
    weights: Vec<Vec<Vec<i8>>>,
    scale_factors: Vec<f32>,
    zero_points: Vec<i8>,
}

impl QuantizedMLP {
    pub fn from_mlp(mlp: &MLP) -> Self {
        // Affine quantization: w ≈ scale * (q - zero_point), so q = round(w / scale) + zero_point
        let mut quantized_weights = Vec::new();
        let mut scale_factors = Vec::new();
        let mut zero_points = Vec::new();

        for layer in &mlp.weights {
            let min = layer.iter().flat_map(|row| row.iter()).cloned().fold(f32::INFINITY, f32::min);
            let max = layer.iter().flat_map(|row| row.iter()).cloned().fold(f32::NEG_INFINITY, f32::max);

            // Extend the range to include zero so the zero point fits in i8
            let min = min.min(0.0);
            let max = max.max(0.0);

            // Affine map from [min, max] onto the signed i8 range [-128, 127]
            let scale = ((max - min) / 255.0).max(f32::MIN_POSITIVE);
            let zero_point = (-128.0 - min / scale).round();

            let quantized_layer: Vec<Vec<i8>> = layer
                .iter()
                .map(|row| {
                    row.iter()
                        .map(|&w| (w / scale + zero_point).round().clamp(-128.0, 127.0) as i8)
                        .collect()
                })
                .collect();

            quantized_weights.push(quantized_layer);
            scale_factors.push(scale);
            zero_points.push(zero_point as i8);
        }

        QuantizedMLP {
            weights: quantized_weights,
            scale_factors,
            zero_points,
        }
    }

    pub fn forward(&self, input: &[f32]) -> Vec<f32> {
        // Integer arithmetic forward pass
        // Dequantize at the end
        todo!()
    }
}

// Benefits:
// - 4x memory reduction (f32 → i8)
// - 2-4x faster on mobile (int8 SIMD)
// - <1% accuracy loss (typically)
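
forward is left as a todo above; as a reference point, a single dequantized layer multiplication might look like this (a naive sketch that dequantizes each weight on the fly — a production int8 kernel would accumulate in i32 and dequantize once per output):

// Sketch: one layer of QuantizedMLP::forward, dequantizing on the fly.
fn quantized_layer_forward(
    weights: &[Vec<i8>],  // one quantized layer (rows of weights)
    scale: f32,
    zero_point: i8,
    input: &[f32],
) -> Vec<f32> {
    weights
        .iter()
        .map(|row| {
            row.iter()
                .zip(input)
                // Recover the approximate f32 weight: w ≈ scale * (q - zero_point)
                .map(|(&q, &x)| scale * (q as f32 - zero_point as f32) * x)
                .sum()
        })
        .collect()
}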

Model Pruning

pub fn prune_mlp(mlp: &mut MLP, threshold: f32) -> usize {
    let mut pruned_count = 0;

    for layer in &mut mlp.weights {
        for row in layer {
            for weight in row {
                if weight.abs() < threshold {
                    *weight = 0.0;
                    pruned_count += 1;
                }
            }
        }
    }

    pruned_count
}

// Example: prune small-magnitude weights (threshold = 0.01)
let mut mlp = MLP::new(384, vec![100, 50], 3);
let total_weights: usize = mlp.weights.iter()
    .map(|layer| layer.iter().map(|row| row.len()).sum::<usize>())
    .sum();
let pruned = prune_mlp(&mut mlp, 0.01);
println!("Pruned {} weights ({:.1}%)", pruned, pruned as f32 / total_weights as f32 * 100.0);

// Convert to sparse representation
pub struct SparseMLP {
    weights: Vec<HashMap<(usize, usize), f32>>,  // Only store non-zero
}

Knowledge Distillation

// Train small model to mimic large model
pub fn distill_mlp(
    large_mlp: &MLP,
    small_mlp: &mut MLP,
    training_data: &[Vec<f32>],
    temperature: f32,
) {
    for input in training_data {
        // Get soft targets from large model
        let large_output = large_mlp.forward(input);
        let soft_targets = softmax_with_temperature(&large_output, temperature);

        // Train small model to match
        let small_output = small_mlp.forward(input);
        let loss = kl_divergence(&soft_targets, &softmax(&small_output));

        // Backprop and update
        // ...
    }
}

// Result: 10x smaller model with 95% accuracy of large model
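
The helpers referenced above are standard; minimal versions might look like this (a sketch — p is the soft-target distribution and q the student's, with a small epsilon to avoid log of zero):

// Softmax with temperature: higher T yields softer targets for distillation.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max_val = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exp_vals: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max_val) / temperature).exp())
        .collect();
    let sum: f32 = exp_vals.iter().sum();
    exp_vals.iter().map(|x| x / sum).collect()
}

// KL divergence D(p || q) between two discrete distributions.
fn kl_divergence(p: &[f32], q: &[f32]) -> f32 {
    p.iter()
        .zip(q)
        .map(|(&pi, &qi)| {
            if pi > 0.0 {
                pi * (pi / qi.max(1e-12)).ln()
            } else {
                0.0
            }
        })
        .sum()
}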

Monitoring and Metrics

Runtime Metrics

use std::sync::atomic::{AtomicU64, Ordering};

pub struct Metrics {
    queries_processed: AtomicU64,
    total_latency_ms: AtomicU64,
    cache_hits: AtomicU64,
    cache_misses: AtomicU64,
}

impl Metrics {
    pub fn record_query(&self, latency_ms: u64, cache_hit: bool) {
        self.queries_processed.fetch_add(1, Ordering::Relaxed);
        self.total_latency_ms.fetch_add(latency_ms, Ordering::Relaxed);

        if cache_hit {
            self.cache_hits.fetch_add(1, Ordering::Relaxed);
        } else {
            self.cache_misses.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn report(&self) {
        let queries = self.queries_processed.load(Ordering::Relaxed);
        let total_latency = self.total_latency_ms.load(Ordering::Relaxed);
        let avg_latency = if queries > 0 { total_latency / queries } else { 0 };

        let hits = self.cache_hits.load(Ordering::Relaxed);
        let misses = self.cache_misses.load(Ordering::Relaxed);
        let hit_rate = if hits + misses > 0 {
            hits as f32 / (hits + misses) as f32 * 100.0
        } else {
            0.0
        };

        eprintln!("Metrics:");
        eprintln!("  Queries: {}", queries);
        eprintln!("  Avg latency: {}ms", avg_latency);
        eprintln!("  Cache hit rate: {:.1}%", hit_rate);
    }
}

Checklist

  • Profile on target device (not just development machine)
  • Enable SIMD for matrix operations
  • Use memory pooling for frequent allocations
  • Implement request caching
  • Use event-driven architecture for battery savings
  • Quantize models (f32 → int8)
  • Prune unnecessary weights
  • Batch network requests
  • Set appropriate CPU governor hints
  • Monitor power consumption with platform tools
  • Optimize binary size (target <1.5MB)
  • Test on low-end devices
  • Measure battery impact over 24 hours

Quick reference:

  • Run benchmarks: cargo bench
  • Profile CPU: perf record && perf report
  • Check memory: valgrind --tool=massif