Comprehensive guide to optimizing the Mobile AI Orchestrator for constrained mobile platforms.
- Performance Targets
- Profiling and Measurement
- CPU Optimization
- Memory Optimization
- Battery Optimization
- Network Optimization
- Platform-Specific Optimizations
- Advanced Techniques
| Operation | Target | Acceptable | Notes |
|---|---|---|---|
| Simple query | <10ms | <50ms | Local routing |
| Complex query | <100ms | <500ms | May involve network |
| Context switch | <5ms | <20ms | Project switching |
| Reservoir update | <1ms | <5ms | Per conversation turn |
| MLP inference | <1ms | <5ms | Routing decision |
| SNN step | <100μs | <1ms | Wake detection |
| Resource | Target | Maximum | Notes |
|---|---|---|---|
| Binary size | <1.5MB | <5MB | Stripped release |
| Memory (RSS) | <50MB | <128MB | Runtime footprint |
| CPU (idle) | <1% | <5% | Background operation |
| CPU (active) | <30% | <80% | During query processing |
| Battery (24h) | <100mAh | <300mAh | Background + active |
| Metric | Target | Notes |
|---|---|---|
| Queries per second | >100 | Sustained throughput |
| Reservoir updates/s | >1000 | Batch processing |
| MLP inferences/s | >5000 | Routing decisions |
| SNN steps/s | >10000 | Wake detection |
# Run all benchmarks
cargo bench
# Specific component
cargo bench orchestrator
cargo bench reservoir
cargo bench mlp
# With profiling
cargo bench --bench orchestrator_bench -- --profile-time=101. Simpleperf (CPU profiling)
# On device
adb shell simpleperf record -p $(pidof mobile-ai) -o /data/local/tmp/perf.data
adb shell simpleperf report -i /data/local/tmp/perf.data
# Or with Rust symbols
adb pull /data/local/tmp/perf.data
simpleperf report-html -i perf.data -o report.html2. Android Profiler (Android Studio)
1. Open Android Studio
2. View > Tool Windows > Profiler
3. Select device and process
4. Monitor CPU, Memory, Network
3. systrace (System-wide)
# Record trace
python systrace.py -o trace.html sched freq idle am wm gfx view binder_driver hal dalvik camera input res -a com.example.mobileai
# Open in Chrome
google-chrome trace.html1. Instruments (CPU)
1. Xcode > Open Developer Tool > Instruments
2. Select "Time Profiler"
3. Choose device and app
4. Record and analyze hotspots
2. Instruments (Memory)
1. Instruments > Allocations
2. Look for memory growth
3. Check for leaks with "Leaks" template
3. Instruments (Energy)
1. Instruments > Energy Log
2. Measure power consumption
3. Identify energy spikes
1. perf (CPU profiling)
# Record
perf record -F 99 -g ./target/release/mobile-ai
# Report
perf report
# Flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg2. valgrind (Memory profiling)
# Memcheck (leaks)
valgrind --leak-check=full --show-leak-kinds=all ./target/release/mobile-ai
# Massif (heap profiling)
valgrind --tool=massif ./target/release/mobile-ai
ms_print massif.out.* > massif.txt
# Callgrind (call graph)
valgrind --tool=callgrind ./target/release/mobile-ai
kcachegrind callgrind.out.*3. cargo-flamegraph
# Install
cargo install flamegraph
# Generate flamegraph
cargo flamegraph --bench orchestrator_bench
# Open flamegraph.svguse std::time::Instant;
// Timing macro
macro_rules! time_it {
($name:expr, $code:block) => {{
let start = Instant::now();
let result = $code;
let elapsed = start.elapsed();
eprintln!("{}: {:?}", $name, elapsed);
result
}};
}
// Usage
pub fn process(&mut self, query: Query) -> Result<Response, String> {
time_it!("total_query_processing", {
let eval = time_it!("expert_evaluation", {
self.expert.evaluate(&query)
});
let (route, confidence) = time_it!("routing_decision", {
self.router.route(&query)
});
// ... rest of processing
})
}Conditional Compilation
#[cfg(feature = "profiling")]
macro_rules! time_it {
($name:expr, $code:block) => {{ /* instrumented version */ }};
}
#[cfg(not(feature = "profiling"))]
macro_rules! time_it {
($name:expr, $code:block) => {{ $code }};
}Add to Cargo.toml:
[features]
profiling = []Build with profiling:
cargo build --release --features profilingBased on profiling, typical hotspots:
- Reservoir update (~40% of CPU time)
- MLP forward pass (~30% of CPU time)
- Text encoding (~15% of CPU time)
- Routing logic (~10% of CPU time)
- Other (~5%)
ARM NEON (Android, iOS, Linux ARM)
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;
#[cfg(target_arch = "aarch64")]
unsafe fn dot_product_neon(a: &[f32], b: &[f32]) -> f32 {
assert!(a.len() == b.len());
assert!(a.len() % 4 == 0);
let mut sum = vdupq_n_f32(0.0);
for i in (0..a.len()).step_by(4) {
let va = vld1q_f32(a.as_ptr().add(i));
let vb = vld1q_f32(b.as_ptr().add(i));
let vmul = vmulq_f32(va, vb);
sum = vaddq_f32(sum, vmul);
}
// Horizontal sum
let sum2 = vpadd_f32(vget_low_f32(sum), vget_high_f32(sum));
let sum4 = vpadd_f32(sum2, sum2);
vget_lane_f32(sum4, 0)
}
#[cfg(not(target_arch = "aarch64"))]
fn dot_product_neon(a: &[f32], b: &[f32]) -> f32 {
// Fallback scalar implementation
a.iter().zip(b).map(|(x, y)| x * y).sum()
}Usage in Reservoir
// In src/reservoir.rs
pub fn update(&mut self, input: &[f32]) -> Vec<f32> {
let mut input_activation = vec![0.0; self.reservoir_size];
// Vectorized matrix-vector multiply
for i in 0..self.reservoir_size {
#[cfg(target_arch = "aarch64")]
{
input_activation[i] = unsafe {
dot_product_neon(&self.input_weights[i], input)
};
}
#[cfg(not(target_arch = "aarch64"))]
{
input_activation[i] = self.input_weights[i]
.iter()
.zip(input)
.map(|(w, x)| w * x)
.sum();
}
}
// ... rest of update
}Benchmark Impact:
- Before: 200μs per update
- After: 80μs per update (2.5x speedup)
// Manual unrolling for small fixed-size loops
pub fn softmax(logits: &[f32]) -> Vec<f32> {
let max_val = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
// Compute exp (unrolled for common sizes)
let mut exp_vals = Vec::with_capacity(logits.len());
match logits.len() {
3 => {
// Routing decision (most common case)
exp_vals.push((logits[0] - max_val).exp());
exp_vals.push((logits[1] - max_val).exp());
exp_vals.push((logits[2] - max_val).exp());
}
_ => {
// General case
for &logit in logits {
exp_vals.push((logit - max_val).exp());
}
}
}
let sum: f32 = exp_vals.iter().sum();
exp_vals.iter().map(|x| x / sum).collect()
}// Only compute reservoir output when needed
pub fn snapshot(&self, last_n: usize) -> ContextSnapshot {
ContextSnapshot {
recent_turns: self.history.iter().take(last_n).cloned().collect(),
// Only compute reservoir output if we have reservoir AND it will be saved
reservoir_state: self.reservoir.as_ref().and_then(|esn| {
if should_save_reservoir_state() {
Some(esn.output()) // Expensive operation
} else {
None
}
}),
metadata: SnapshotMetadata {
timestamp: current_timestamp(),
total_turns: self.history.len(),
},
}
}// Force inline for small hot functions
#[inline(always)]
pub fn estimate_tokens(text: &str) -> usize {
text.split_whitespace().count()
}
#[inline(always)]
fn current_timestamp() -> u64 {
std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_secs()
}
// Prevent inline for large functions
#[inline(never)]
pub fn train(&mut self, inputs: &[Vec<f32>], targets: &[Vec<f32>], lambda: f32) -> Result<(), String> {
// Large training function - don't bloat call sites
}// Use likely/unlikely macros (unstable, or use manual patterns)
#[inline(always)]
fn likely(b: bool) -> bool {
#[cold]
fn cold() {}
if !b { cold(); }
b
}
// Usage
if likely(query.text.len() < 200) {
// Common case: short query, local routing
return (RoutingDecision::Local, 0.75);
}In Cargo.toml:
[profile.release]
# ... existing config
codegen-units = 1
# Optionally enable fast math (less precise but faster)
# rustflags = ["-C", "target-cpu=native", "-C", "opt-level=3"]Build with target-specific optimizations:
# For specific CPU
RUSTFLAGS="-C target-cpu=cortex-a76" cargo build --release --target aarch64-linux-android
# For best available on build machine
RUSTFLAGS="-C target-cpu=native" cargo build --releaseCurrent Memory Footprint:
Binary: 1.5MB
Stack (per thread): ~2MB
Orchestrator: ~500KB
├─ Expert rules: ~10KB
├─ Router config: ~1KB
├─ Context history: ~200KB (100 turns × ~2KB each)
├─ Reservoir: ~250KB (1000 neurons × 384 inputs × 4 bytes)
└─ MLP: ~160KB (weights + biases)
SNN: ~50KB (sparse weights)
Total (approximate): ~2.5MB RSS
use std::sync::Mutex;
// Pool of reusable buffers
pub struct VectorPool {
pool: Mutex<Vec<Vec<f32>>>,
size: usize,
}
impl VectorPool {
pub fn new(capacity: usize, size: usize) -> Self {
let mut pool = Vec::with_capacity(capacity);
for _ in 0..capacity {
pool.push(Vec::with_capacity(size));
}
VectorPool {
pool: Mutex::new(pool),
size,
}
}
pub fn get(&self) -> Vec<f32> {
self.pool.lock().unwrap().pop().unwrap_or_else(|| Vec::with_capacity(self.size))
}
pub fn return_vec(&self, mut v: Vec<f32>) {
v.clear();
if let Ok(mut pool) = self.pool.try_lock() {
if pool.len() < pool.capacity() {
pool.push(v);
}
}
}
}
// Global pool for common sizes
lazy_static::lazy_static! {
static ref VEC_384_POOL: VectorPool = VectorPool::new(10, 384);
static ref VEC_1000_POOL: VectorPool = VectorPool::new(10, 1000);
}
// Usage
let mut embedding = VEC_384_POOL.get();
// ... use embedding
VEC_384_POOL.return_vec(embedding);use std::borrow::Cow;
// Use Cow for strings that might be borrowed
pub struct Query<'a> {
pub text: Cow<'a, str>,
pub project_context: Option<Cow<'a, str>>,
pub priority: u8,
pub timestamp: u64,
}
impl<'a> Query<'a> {
// Borrowed version (no allocation)
pub fn new_borrowed(text: &'a str) -> Self {
Query {
text: Cow::Borrowed(text),
project_context: None,
priority: 5,
timestamp: current_timestamp(),
}
}
// Owned version (when needed)
pub fn new_owned(text: String) -> Self {
Query {
text: Cow::Owned(text),
project_context: None,
priority: 5,
timestamp: current_timestamp(),
}
}
}// Use smaller types where possible
pub struct CompactQuery {
pub text: String,
pub priority: u8, // u8 instead of usize
pub timestamp: u32, // u32 for timestamps (year 2106 problem, but saves 4 bytes)
pub flags: u8, // Bit flags instead of multiple bools
}
// Bit flags
const FLAG_HAS_CONTEXT: u8 = 0b0001;
const FLAG_IS_URGENT: u8 = 0b0010;
impl CompactQuery {
pub fn has_context(&self) -> bool {
self.flags & FLAG_HAS_CONTEXT != 0
}
pub fn set_has_context(&mut self, val: bool) {
if val {
self.flags |= FLAG_HAS_CONTEXT;
} else {
self.flags &= !FLAG_HAS_CONTEXT;
}
}
}// Use fixed-size circular buffer instead of unbounded Vec
pub struct BoundedHistory<T> {
buffer: Vec<T>,
capacity: usize,
start: usize,
len: usize,
}
impl<T> BoundedHistory<T> {
pub fn new(capacity: usize) -> Self {
BoundedHistory {
buffer: Vec::with_capacity(capacity),
capacity,
start: 0,
len: 0,
}
}
pub fn push(&mut self, item: T) {
if self.len < self.capacity {
self.buffer.push(item);
self.len += 1;
} else {
// Overwrite oldest
self.buffer[self.start] = item;
self.start = (self.start + 1) % self.capacity;
}
}
pub fn iter(&self) -> impl Iterator<Item = &T> {
// Iterate from newest to oldest
// ... implementation
}
}
// Use in ContextManager
pub struct ContextManager {
history: BoundedHistory<ConversationTurn>, // Fixed size, no unbounded growth
// ...
}// Don't serialize everything all the time
impl ContextManager {
pub fn save_checkpoint(&self, path: &str) -> Result<(), String> {
// Only serialize what's needed
let checkpoint = Checkpoint {
// Recent history only (not full 100 turns)
recent_turns: self.history.iter().take(10).cloned().collect(),
// Compressed reservoir state
reservoir_state: self.reservoir.as_ref().map(|esn| esn.state.clone()),
// Don't save full project contexts
current_project: self.current_project.clone(),
};
let json = serde_json::to_string(&checkpoint)
.map_err(|e| e.to_string())?;
std::fs::write(path, json).map_err(|e| e.to_string())
}
}- CPU (40-60%): Active computation
- Network (20-30%): API calls
- Memory (10-15%): DRAM refresh, access
- Storage (5-10%): Disk I/O
- Sensors (0-5%): Wake detection
// Don't poll continuously, use event-driven wake
pub struct PowerEfficientOrchestrator {
orchestrator: Orchestrator,
wake_detector: SpikingNetwork,
state: PowerState,
}
#[derive(PartialEq)]
enum PowerState {
Sleep, // No activity, minimal power
Listening, // SNN active, main system sleeping
Active, // Full system active
}
impl PowerEfficientOrchestrator {
pub fn run(&mut self) {
loop {
match self.state {
PowerState::Sleep => {
// Wait for external trigger (user input, notification)
std::thread::park();
self.state = PowerState::Listening;
}
PowerState::Listening => {
// Low-power wake detection
let audio_sample = get_audio_sample(); // Platform-specific
let spikes = audio_to_spikes(&audio_sample);
let output = self.wake_detector.step(&spikes, 1.0);
if output[0] { // Wake word detected
self.state = PowerState::Active;
} else {
// Stay in low-power mode
std::thread::sleep(Duration::from_millis(10));
}
}
PowerState::Active => {
// Process queries
if let Some(query) = check_for_query() {
self.orchestrator.process(query);
} else {
// Timeout back to listening
std::thread::sleep(Duration::from_secs(5));
self.state = PowerState::Listening;
}
}
}
}
}
}// Batch queries to amortize wake-up cost
pub struct BatchProcessor {
pending_queries: Vec<Query>,
batch_timeout: Duration,
last_process: Instant,
}
impl BatchProcessor {
pub fn add_query(&mut self, query: Query) {
self.pending_queries.push(query);
// Process if batch full or timeout
if self.pending_queries.len() >= 10 || self.last_process.elapsed() > self.batch_timeout {
self.process_batch();
}
}
fn process_batch(&mut self) {
// Wake CPU to higher P-state once
// Process all queries
for query in self.pending_queries.drain(..) {
process_query(query);
}
// CPU can return to low P-state
self.last_process = Instant::now();
}
}// Hint CPU governor based on workload
#[cfg(target_os = "linux")]
fn set_cpu_governor(governor: &str) -> Result<(), std::io::Error> {
// Requires appropriate permissions
std::fs::write("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", governor)
}
pub fn process_with_power_hint(&mut self, query: Query) -> Result<Response, String> {
// Simple query: use powersave governor
if query.text.len() < 50 {
#[cfg(target_os = "linux")]
let _ = set_cpu_governor("powersave");
let result = self.process(query);
#[cfg(target_os = "linux")]
let _ = set_cpu_governor("schedutil"); // Restore
result
} else {
// Complex query: use performance governor
#[cfg(target_os = "linux")]
let _ = set_cpu_governor("performance");
let result = self.process(query);
#[cfg(target_os = "linux")]
let _ = set_cpu_governor("schedutil");
result
}
}// Batch network requests instead of making individual calls
pub struct NetworkBatcher {
pending_requests: Vec<RemoteQuery>,
timer: Option<Instant>,
}
impl NetworkBatcher {
pub fn add_request(&mut self, query: RemoteQuery) {
self.pending_requests.push(query);
if self.timer.is_none() {
self.timer = Some(Instant::now());
}
// Flush if batch size reached or timeout
if self.pending_requests.len() >= 5 || self.timer.unwrap().elapsed() > Duration::from_millis(100) {
self.flush();
}
}
fn flush(&mut self) {
if self.pending_requests.is_empty() {
return;
}
// Single network request with batch
let responses = send_batch_request(&self.pending_requests);
// Distribute responses
for (query, response) in self.pending_requests.drain(..).zip(responses) {
query.respond(response);
}
self.timer = None;
}
}// Platform-specific wake lock (Android example)
#[cfg(target_os = "android")]
mod android_power {
use jni::JNIEnv;
use jni::objects::{JClass, JObject};
pub fn acquire_wake_lock(env: &JNIEnv) {
// Get PowerManager
let pm = env.call_method(/* context */, "getSystemService", /* ... */);
// Acquire wake lock
let wake_lock = env.call_method(pm, "newWakeLock", /* ... */);
env.call_method(wake_lock, "acquire", /* timeout */);
}
pub fn release_wake_lock(env: &JNIEnv) {
// Release wake lock
env.call_method(/* wake_lock */, "release", /* ... */);
}
}
// Use wake locks only when necessary
pub fn process_critical_query(&mut self, query: Query) -> Result<Response, String> {
#[cfg(target_os = "android")]
android_power::acquire_wake_lock(&env);
let result = self.process(query);
#[cfg(target_os = "android")]
android_power::release_wake_lock(&env);
result
}use std::collections::HashMap;
use std::time::{Duration, Instant};
pub struct ResponseCache {
cache: HashMap<String, (Response, Instant)>,
ttl: Duration,
}
impl ResponseCache {
pub fn new(ttl: Duration) -> Self {
ResponseCache {
cache: HashMap::new(),
ttl,
}
}
pub fn get(&mut self, query_hash: &str) -> Option<Response> {
if let Some((response, timestamp)) = self.cache.get(query_hash) {
if timestamp.elapsed() < self.ttl {
return Some(response.clone());
} else {
// Expired
self.cache.remove(query_hash);
}
}
None
}
pub fn insert(&mut self, query_hash: String, response: Response) {
self.cache.insert(query_hash, (response, Instant::now()));
}
}#[cfg(feature = "network")]
async fn send_compressed_request(query: &str) -> Result<Response, String> {
use flate2::write::GzEncoder;
use flate2::Compression;
// Compress request
let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
encoder.write_all(query.as_bytes()).map_err(|e| e.to_string())?;
let compressed = encoder.finish().map_err(|e| e.to_string())?;
// Send compressed data
let response = reqwest::Client::new()
.post("https://api.example.com/v1/chat")
.header("Content-Encoding", "gzip")
.body(compressed)
.send()
.await
.map_err(|e| e.to_string())?;
// ... parse response
}#[cfg(feature = "network")]
async fn request_with_timeout(
query: &str,
timeout: Duration,
max_retries: u32,
) -> Result<Response, String> {
let mut retries = 0;
loop {
match tokio::time::timeout(timeout, send_request(query)).await {
Ok(Ok(response)) => return Ok(response),
Ok(Err(e)) => {
retries += 1;
if retries >= max_retries {
return Err(format!("Max retries exceeded: {}", e));
}
// Exponential backoff
tokio::time::sleep(Duration::from_millis(100 * 2_u64.pow(retries))).await;
}
Err(_) => {
retries += 1;
if retries >= max_retries {
return Err("Request timeout".to_string());
}
}
}
}
}// Leverage Android Neural Networks API (NNAPI)
#[cfg(target_os = "android")]
mod nnapi {
// Wrapper for NNAPI
// Requires JNI bindings to android.neuralnetworks
pub fn run_mlp_on_nnapi(mlp: &MLP, input: &[f32]) -> Vec<f32> {
// Convert MLP to NNAPI model
// Run on NPU/GPU instead of CPU
// Return results
todo!()
}
}#[cfg(target_os = "android")]
pub fn handle_doze_mode() {
// Use AlarmManager for critical tasks during Doze
// Defer non-critical work
}// Convert MLP to Core ML model
import CoreML
class MLPRouter {
let model: MLModel
init() {
// Load compiled Core ML model
model = try! MLModel(contentsOf: /* model URL */)
}
func route(query: [Float]) -> Int {
// Run inference on Neural Engine
let input = MLPInput(features: query)
let output = try! model.prediction(from: input)
return output.decision
}
}import BackgroundTasks
func scheduleBackgroundProcessing() {
let request = BGProcessingTaskRequest(identifier: "com.example.mobileai.reservoir")
request.requiresNetworkConnectivity = false
request.requiresExternalPower = false
try? BGTaskScheduler.shared.submit(request)
}// Use Qualcomm Hexagon DSP (if available)
#[cfg(all(target_os = "linux", target_arch = "aarch64"))]
mod hexagon {
// Offload SNN to DSP for ultra-low power
pub fn run_snn_on_dsp(snn: &SpikingNetwork, input: &[bool]) -> Vec<bool> {
// Requires Hexagon SDK
todo!()
}
}// Convert f32 weights to int8
pub struct QuantizedMLP {
weights: Vec<Vec<Vec<i8>>>,
scale_factors: Vec<f32>,
zero_points: Vec<i8>,
}
impl QuantizedMLP {
pub fn from_mlp(mlp: &MLP) -> Self {
// Quantize weights: w_int8 = round((w_float - zero_point) / scale)
let mut quantized_weights = Vec::new();
let mut scale_factors = Vec::new();
let mut zero_points = Vec::new();
for layer in &mlp.weights {
let min = layer.iter().flat_map(|row| row.iter()).cloned().fold(f32::INFINITY, f32::min);
let max = layer.iter().flat_map(|row| row.iter()).cloned().fold(f32::NEG_INFINITY, f32::max);
let scale = (max - min) / 255.0;
let zero_point = -min / scale;
let quantized_layer: Vec<Vec<i8>> = layer
.iter()
.map(|row| {
row.iter()
.map(|&w| ((w / scale + zero_point).round() as i8))
.collect()
})
.collect();
quantized_weights.push(quantized_layer);
scale_factors.push(scale);
zero_points.push(zero_point as i8);
}
QuantizedMLP {
weights: quantized_weights,
scale_factors,
zero_points,
}
}
pub fn forward(&self, input: &[f32]) -> Vec<f32> {
// Integer arithmetic forward pass
// Dequantize at the end
todo!()
}
}
// Benefits:
// - 4x memory reduction (f32 → i8)
// - 2-4x faster on mobile (int8 SIMD)
// - <1% accuracy loss (typically)pub fn prune_mlp(mlp: &mut MLP, threshold: f32) -> usize {
let mut pruned_count = 0;
for layer in &mut mlp.weights {
for row in layer {
for weight in row {
if weight.abs() < threshold {
*weight = 0.0;
pruned_count += 1;
}
}
}
}
pruned_count
}
// Example: Prune 50% of weights
let mlp = MLP::new(384, vec![100, 50], 3);
let pruned = prune_mlp(&mut mlp, 0.01); // Threshold = 0.01
println!("Pruned {} weights ({:.1}%)", pruned, pruned as f32 / total_weights * 100.0);
// Convert to sparse representation
pub struct SparseMLP {
weights: Vec<HashMap<(usize, usize), f32>>, // Only store non-zero
}// Train small model to mimic large model
pub fn distill_mlp(
large_mlp: &MLP,
small_mlp: &mut MLP,
training_data: &[Vec<f32>],
temperature: f32,
) {
for input in training_data {
// Get soft targets from large model
let large_output = large_mlp.forward(input);
let soft_targets = softmax_with_temperature(&large_output, temperature);
// Train small model to match
let small_output = small_mlp.forward(input);
let loss = kl_divergence(&soft_targets, &softmax(&small_output));
// Backprop and update
// ...
}
}
// Result: 10x smaller model with 95% accuracy of large modeluse std::sync::atomic::{AtomicU64, Ordering};
pub struct Metrics {
queries_processed: AtomicU64,
total_latency_ms: AtomicU64,
cache_hits: AtomicU64,
cache_misses: AtomicU64,
}
impl Metrics {
pub fn record_query(&self, latency_ms: u64, cache_hit: bool) {
self.queries_processed.fetch_add(1, Ordering::Relaxed);
self.total_latency_ms.fetch_add(latency_ms, Ordering::Relaxed);
if cache_hit {
self.cache_hits.fetch_add(1, Ordering::Relaxed);
} else {
self.cache_misses.fetch_add(1, Ordering::Relaxed);
}
}
pub fn report(&self) {
let queries = self.queries_processed.load(Ordering::Relaxed);
let total_latency = self.total_latency_ms.load(Ordering::Relaxed);
let avg_latency = if queries > 0 { total_latency / queries } else { 0 };
let hits = self.cache_hits.load(Ordering::Relaxed);
let misses = self.cache_misses.load(Ordering::Relaxed);
let hit_rate = if hits + misses > 0 {
hits as f32 / (hits + misses) as f32 * 100.0
} else {
0.0
};
eprintln!("Metrics:");
eprintln!(" Queries: {}", queries);
eprintln!(" Avg latency: {}ms", avg_latency);
eprintln!(" Cache hit rate: {:.1}%", hit_rate);
}
}- Profile on target device (not just development machine)
- Enable SIMD for matrix operations
- Use memory pooling for frequent allocations
- Implement request caching
- Use event-driven architecture for battery savings
- Quantize models (f32 → int8)
- Prune unnecessary weights
- Batch network requests
- Set appropriate CPU governor hints
- Monitor power consumption with platform tools
- Optimize binary size (target <1.5MB)
- Test on low-end devices
- Measure battery impact over 24 hours
Run benchmarks: cargo bench
Profile CPU: perf record && perf report
Check memory: valgrind --tool=massif