Performance Tuning

When to use this runbook: tuning Powernode's throughput, latency, and resource usage when capacity planning, investigating slow queries, or hardening for production load.

Prerequisites
When to use this
Overview
Key Performance Targets
Application Performance Monitoring
Database Performance
Caching Strategy
Redis Tuning
Background Worker Tuning
AI Provider Throughput
Frontend Performance
Load Testing
Capacity Planning
Procedure — investigate a slow endpoint
Procedure — investigate a queue backlog
Verification
Rollback
Troubleshooting

Prerequisites

Backend, worker, and frontend services running.
APM provider configured (Sentry / New Relic / DataDog / AppSignal) with credentials in place.
Postgres EXPLAIN ANALYZE access for the production / staging DB.
Redis client (redis-cli) accessible from the worker host.
For load tests: a non-production environment that mirrors production sizing.

When to use this

Sustained P95 latency above target on a hot endpoint.
Sidekiq queue backlog (any queue > 1000 enqueued).
Memory pressure or OOM-kills on backend / worker pods.
Pre-launch capacity planning for a new wave of AI traffic.
After adding a new agent / mission flow that materially shifts load.

Overview

Performance work at Powernode spans three layers:

Synchronous request path — Rails controllers serving /api/v1/* and /api/v1/ai/*. Targets: low P95 latency, low error rate, no N+1 queries.
Asynchronous worker path — Standalone Sidekiq processing AI orchestration, DevOps, Docker sync, and maintenance. Targets: short queue latency, bounded retry depth, predictable nightly maintenance windows.
External provider path — Calls into AI providers (OpenAI / Anthropic / Ollama / etc.) and managed Docker / Vault / Stripe APIs. Targets: circuit-breaker discipline, per-provider quota awareness, predictable timeouts.

Most tuning work begins by identifying which layer is the bottleneck (APM is the primary tool), then applying the targeted patterns in the rest of this runbook.

Key Performance Targets

API Latency (P95)

| Endpoint Type | Target | Maximum Acceptable |
| --- | --- | --- |
| Authentication | < 100ms | 200ms |
| Read-heavy CRUD | < 150ms | 300ms |
| Mutation CRUD | < 200ms | 400ms |
| AI orchestration (synchronous) | < 500ms | 1000ms |
| AI orchestration (LLM call) | < 30s | 60s |
| Analytics / reporting | < 500ms | 1000ms |
| Webhook ingest | < 250ms | 500ms |

Database

| Metric | Target |
| --- | --- |
| Connection pool size | 25-50 |
| Max query time | < 100ms |
| Index coverage on hot tables | > 95% |
| N+1 queries | 0 |

Worker / Background Jobs

| Metric | Target |
| --- | --- |
| Queue latency (steady state) | < 60s |
| Job processing time (avg) | < 5 seconds |
| Failed job rate | < 1% |
| Worker concurrency per instance | 5-25 |

AI Provider Calls

| Metric | Target |
| --- | --- |
| Circuit breaker open events | < 3 active |
| Provider P95 round-trip | < 2 × baseline |
| Retry depth | < 3 per call |

Application Performance Monitoring

Configure APM via a Rails initializer (e.g., <server>/config/initializers/performance_monitoring.rb). The pattern below assumes New Relic but applies equally to DataDog / AppSignal — switch the SDK calls accordingly.

```ruby
# config/initializers/performance_monitoring.rb
#
# record_metric and log_slow are application-level helpers that must be
# defined before these subscribers fire (they are not shown here).
# Durations are converted from seconds to milliseconds.
if Rails.env.production? || Rails.env.staging?
  Rails.application.config.after_initialize do
    # Mission lifecycle
    ActiveSupport::Notifications.subscribe('ai.mission.completed') do |_n, start, finish, _id, payload|
      duration = (finish - start) * 1000
      record_metric('AI/Mission/Completed', duration, account_id: payload[:account_id])
      log_slow('AI mission', duration, threshold: 60_000)
    end

    # Agent execution
    ActiveSupport::Notifications.subscribe('ai.agent.execution.completed') do |_n, start, finish, _id, payload|
      duration = (finish - start) * 1000
      record_metric('AI/Agent/Execution', duration,
                    agent_id: payload[:agent_id],
                    provider: payload[:provider])
      log_slow('Agent execution', duration, threshold: 30_000)
    end

    # Background job timing
    ActiveSupport::Notifications.subscribe('job.performed') do |_n, start, finish, _id, payload|
      duration = (finish - start) * 1000
      job_class = payload[:job].class.name
      record_metric("Worker/Job/#{job_class}", duration)
      log_slow('Background job', duration, threshold: 30_000, context: { job: job_class })
    end
  end
end
```
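The initializer calls `record_metric` and `log_slow` without defining them. The sketch below shows one shape those helpers could take; the in-memory `METRICS` array and the `PERF_LOGGER` constant are hypothetical stand-ins for the real APM SDK call and the Rails logger, not the platform's actual implementation.

```ruby
require 'logger'

# Hypothetical in-process metric sink. A real deployment would forward to the
# APM SDK (New Relic, DataDog, AppSignal) instead of appending to an array.
METRICS = []
PERF_LOGGER = Logger.new($stdout)

# Record a named timing in milliseconds with optional tags.
def record_metric(name, value, **tags)
  METRICS << { name: name, value: value, tags: tags }
end

# Warn when a duration in milliseconds exceeds the threshold.
# Returns true if a warning was logged, false otherwise.
def log_slow(label, duration_ms, threshold:, context: {})
  return false if duration_ms <= threshold

  PERF_LOGGER.warn("Slow #{label}: #{duration_ms.round(2)}ms (threshold #{threshold}ms) #{context}")
  true
end

record_metric('AI/Mission/Completed', 61_250.0, account_id: 'acct_123')
log_slow('AI mission', 61_250.0, threshold: 60_000)
```

The keyword signatures mirror how the subscribers call the helpers, so swapping the array for an SDK call should not require touching the subscribers.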

Per-Request Tracking Middleware

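```ruby
class PerformanceTrackingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Use the monotonic clock so wall-clock adjustments cannot skew durations.
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, response = @app.call(env)
    duration = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000

    request = Rack::Request.new(env)
    # Rack::Request exposes the HTTP verb as #request_method, not #method.
    endpoint = "#{request.request_method} #{request.path_info}"

    if duration > 1000
      # The stock Rails logger has no #performance level, so log at warn.
      Rails.logger.warn "Slow request: #{endpoint} took #{duration.round(2)}ms"
      record_metric("SlowRequest/#{request.request_method}", duration)
    end

    if Rails.env.development?
      headers['X-Response-Time'] = "#{duration.round(2)}ms"
      # Reports the number of query-cache entries, which only approximates
      # the number of queries the request issued.
      headers['X-DB-Query-Count'] = ActiveRecord::Base.connection.query_cache.size.to_s
    end

    [status, headers, response]
  end
end
```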

Database Performance

Slow Query Logging

```ruby
# record_metric and extract_sql_type are assumed to be defined elsewhere in
# the application (they are not shown here).
class DatabasePerformanceMonitor
  def self.track_slow_queries
    ActiveSupport::Notifications.subscribe('sql.active_record') do |_n, start, finish, _id, payload|
      duration = (finish - start) * 1000
      next if duration < 500

      # The stock Rails logger has no #performance level, so log at warn.
      Rails.logger.warn "Slow query: #{payload[:sql]} (#{duration.round(2)}ms)"
      record_metric('Database/SlowQuery', duration, sql_type: extract_sql_type(payload[:sql]))
    end
  end

  # Samples the connection pool every 30 seconds from a background thread.
  def self.monitor_connection_pool
    Thread.new do
      loop do
        pool = ActiveRecord::Base.connection_pool
        record_metric('Database/Pool/Size', pool.size)
        record_metric('Database/Pool/Available', pool.available_connection_count)
        record_metric('Database/Pool/Active', pool.size - pool.available_connection_count)

        if pool.available_connection_count < 3
          Rails.logger.warn "DB pool nearly exhausted: #{pool.available_connection_count} available"
        end

        sleep 30
      end
    end
  end
end
```

Query Optimization Heuristics

Common red flags in EXPLAIN ANALYZE:

Seq Scan on a hot table — add a covering index on the WHERE / JOIN column.
Loop over an association without includes() — classic N+1; always use .includes(:assoc) when iterating.
ORDER BY without LIMIT — paginate or chunk.
Large IN (...) lists — prefer a join against a temporary table or batched queries.

The repo enforces eager loading via lint patterns and CI; see ../guides/backend.md for the canonical patterns.
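For the "Large IN (...) lists" heuristic, batching can be sketched as below. The helper name `id_batches` and the batch size of 1,000 are illustrative assumptions, not repository conventions.

```ruby
# Split a large ID list into bounded batches so each query carries a
# fixed-size IN (...) list. Duplicate IDs are dropped first.
def id_batches(ids, batch_size: 1_000)
  ids.uniq.each_slice(batch_size).to_a
end

batches = id_batches((1..2_500).to_a)

# In a Rails app each batch would feed one query, e.g. Model.where(id: batch),
# where Model is a placeholder for the table being read.
batches.each do |batch|
  puts "would query #{batch.size} ids (#{batch.first}..#{batch.last})"
end
```

The right batch size depends on row width and planner behaviour, so it is worth confirming with EXPLAIN ANALYZE rather than taking 1,000 as given.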

Index Health

All foreign keys are indexed automatically via t.references (never add_index separately).
pgvector tables use HNSW indexes for embedding columns.
Run quarterly to list the tables with the most dead tuples: `SELECT schemaname, relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 25;`. Then run VACUUM ANALYZE on heavy-write tables during a maintenance window.

Caching Strategy

Multi-Layer Cache Configuration

```ruby
class PerformanceCacheService
  CACHE_CONFIGS = {
    user_session:         { ttl: 15.minutes, compress: false },
    user_profile:         { ttl: 1.hour,     compress: true  },
    account_settings:     { ttl: 30.minutes, compress: true  },
    ai_provider_models:   { ttl: 6.hours,    compress: true  },
    ai_pricing:           { ttl: 24.hours,   compress: true  },
    analytics_dashboard:  { ttl: 1.hour,     compress: true  },
    system_configuration: { ttl: 24.hours,   compress: true  }
  }.freeze

  def self.cache_with_performance(key, cache_type: :default, &block)
    config = CACHE_CONFIGS[cache_type] || { ttl: 1.hour, compress: false }
    Rails.cache.fetch(key, expires_in: config[:ttl], compress: config[:compress], &block)
  end

  def self.warm_critical_caches
    %w[system:configuration navigation:menu_items features:enabled_features].each do |key|
      Thread.new { warm_cache(key) }
    end
  end

  def self.invalidate_related_caches(pattern)
    case pattern
    when /user_(\d+)/    then invalidate_user_caches($1)
    when /account_(\w+)/ then invalidate_account_caches($1)
    when /provider_/     then invalidate_provider_caches
    end
  end
end
```

Cache Hit-Rate Monitoring

Track cache_hits / total_lookups per cache_type. Alert when:

Overall hit rate < 70%.
Individual cache_type hit rate < 50% (indicates TTL too short or eviction pressure).
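The two alert conditions can be evaluated with a small counter. This is a minimal sketch: the `CacheHitRateTracker` class is hypothetical, and a production version would likely keep its counters in the APM or in Redis rather than in process memory.

```ruby
# Tracks hits and lookups per cache_type and reports which alert
# conditions (overall < 70%, per-type < 50%) currently hold.
class CacheHitRateTracker
  OVERALL_MIN = 0.70
  PER_TYPE_MIN = 0.50

  def initialize
    @stats = {}
  end

  def record(cache_type, hit:)
    stat = (@stats[cache_type] ||= { hits: 0, lookups: 0 })
    stat[:lookups] += 1
    stat[:hits] += 1 if hit
  end

  # Returns nil when the cache_type has never been looked up.
  def hit_rate(cache_type)
    stat = @stats[cache_type]
    return nil if stat.nil? || stat[:lookups].zero?

    stat[:hits].to_f / stat[:lookups]
  end

  def overall_hit_rate
    lookups = @stats.values.sum { |s| s[:lookups] }
    return nil if lookups.zero?

    @stats.values.sum { |s| s[:hits] }.to_f / lookups
  end

  # :overall plus every cache_type whose own hit rate is below the floor.
  def alerts
    result = []
    overall = overall_hit_rate
    result << :overall if overall && overall < OVERALL_MIN
    @stats.each_key { |type| result << type if hit_rate(type) < PER_TYPE_MIN }
    result
  end
end

tracker = CacheHitRateTracker.new
8.times { tracker.record(:user_profile, hit: true) }
2.times { tracker.record(:user_profile, hit: false) }
tracker.record(:ai_pricing, hit: true)
3.times { tracker.record(:ai_pricing, hit: false) }

p tracker.alerts # => [:overall, :ai_pricing]
```

Here `user_profile` sits at 80% and `ai_pricing` at 25%, which drags the overall rate to 9 of 14 lookups (about 64%), so both conditions fire.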

Redis Tuning

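```ruby
require 'connection_pool'

class RedisPerformanceOptimizer
  class << self
    attr_reader :pool
  end

  # Redis.current was deprecated in redis-rb 4.6 and removed in 5.0, and
  # `size` is a ConnectionPool option rather than a Redis.new option, so the
  # client is wrapped in an explicit pool. Use it as:
  #   RedisPerformanceOptimizer.pool.with { |redis| redis.ping }
  def self.configure_optimal_connection
    @pool = ConnectionPool.new(size: ENV.fetch('REDIS_POOL_SIZE', 25).to_i) do
      # tcp_keepalive, reconnect_delay*, and driver follow redis-rb 4.x;
      # check them against the installed client version before relying on them.
      Redis.new(
        url: ENV['REDIS_URL'],
        timeout: ENV.fetch('REDIS_TIMEOUT', 5).to_i,
        tcp_keepalive: 60,
        reconnect_attempts: 3,
        reconnect_delay: 1.5,
        reconnect_delay_max: 10,
        driver: :hiredis,
        connect_timeout: 2,
        read_timeout: 1,
        write_timeout: 1
      )
    end
  end
end
```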

Monitor:

used_memory / maxmemory — alert at > 80%.
connected_clients — alert if persistently > 80% of maxclients.
instantaneous_ops_per_sec — establish a baseline, alert on > 2× sustained.
Slow log (SLOWLOG GET 10) — review weekly.
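A single INFO snapshot can be checked against the first three thresholds as sketched below. The function name `redis_alerts` and its parameters are hypothetical. `maxclients` and the baseline ops rate are passed in explicitly because they come from configuration and historical data, not from the snapshot. One snapshot cannot show "persistent" or "sustained" conditions, so a real check would require several consecutive breaches before alerting.

```ruby
# Evaluate an INFO-style hash (string values, as Redis returns them)
# against the memory, client, and ops-rate thresholds.
def redis_alerts(info, maxclients:, baseline_ops:)
  alerts = []

  # maxmemory of 0 means "no limit", so the memory ratio is undefined.
  maxmemory = info.fetch('maxmemory', '0').to_f
  if maxmemory.positive? && info.fetch('used_memory').to_f / maxmemory > 0.8
    alerts << :memory
  end

  alerts << :clients if info.fetch('connected_clients').to_f / maxclients > 0.8
  alerts << :ops if info.fetch('instantaneous_ops_per_sec').to_f > 2 * baseline_ops
  alerts
end

# Sample values are made up for illustration.
snapshot = {
  'used_memory' => '900000000',
  'maxmemory' => '1000000000',
  'connected_clients' => '100',
  'instantaneous_ops_per_sec' => '5000'
}

p redis_alerts(snapshot, maxclients: 10_000, baseline_ops: 2_000) # => [:memory, :ops]
```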

Optimisation actions:

Large keys (> 1 MB) → compress at the application layer or restructure the value.
Stale keys without TTL → audit producers and add expires_in: to all Rails.cache.write calls.
High eviction count → increase maxmemory or reduce TTLs on the noisiest cache types.

Background Worker Tuning

The worker is the platform's most variable performance dimension. See worker-operations.md for queue layout, schedules, and service management.

Tuning levers (in order of likely impact):

Concurrency — bump WORKER_CONCURRENCY env var on the systemd unit. Default 5; raise to 10-25 for AI-heavy workloads.
Dedicated worker instances — sudo scripts/systemd/powernode-installer.sh add-instance worker ai-heavy then set per-instance config in /etc/powernode/worker-ai-heavy.conf.
Queue weighting — increase the weight of starving queues in worker/config/sidekiq.yml (current default: critical=3, standard=2, low=1).
Circuit breaker thresholds — adjust failure_threshold / recovery_timeout for noisy providers in the relevant circuit-breaker config.

Watch:

Queue latency per queue (Sidekiq dashboard / enqueued_at vs processed_at).
Retry depth (Sidekiq::RetrySet).
Failed-job rate per job class.
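To reason about queue weighting, it helps to turn the weights into approximate polling shares. The sketch below assumes Sidekiq's weighted ordering, where a queue's chance of being checked first is proportional to its weight. Treat the result as a rough guide, since actual throughput also depends on job duration and queue depth. The helper name `poll_shares` is hypothetical.

```ruby
# Convert queue weights into the approximate share of polls in which each
# queue is checked first.
def poll_shares(weights)
  total = weights.values.sum.to_f
  weights.transform_values { |weight| (weight / total).round(3) }
end

p poll_shares('critical' => 3, 'standard' => 2, 'low' => 1)
# => {"critical"=>0.5, "standard"=>0.333, "low"=>0.167}
```

Raising `low` from 1 to 2 would lift its share from roughly 17% to 2 of 7 (about 29%), at the expense of the other two queues.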

AI Provider Throughput

Each AI provider call goes through LlmProxyClient, which wraps the request in a circuit breaker:

Failure threshold: 5
Recovery timeout: 120s
Request timeout: 600s
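The interaction between the failure threshold and the recovery timeout can be illustrated with a minimal breaker. This is a sketch under stated assumptions, not the LlmProxyClient implementation: the class name `SimpleCircuitBreaker` is hypothetical, it is not thread-safe, and it leaves the 600s request timeout to the HTTP client.

```ruby
# Opens after `failure_threshold` consecutive failures, rejects calls while
# open, and allows a trial call once `recovery_timeout` seconds have passed.
class SimpleCircuitBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 5, recovery_timeout: 120,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @recovery_timeout = recovery_timeout
    @clock = clock # injectable so the timeout can be exercised without sleeping
    @failures = 0
    @opened_at = nil
  end

  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @recovery_timeout
  end

  def call
    raise OpenError, 'circuit open' if open?

    result = yield
    # Success closes the breaker and clears the failure count.
    @failures = 0
    @opened_at = nil
    result
  rescue OpenError
    raise # rejections are not counted as provider failures
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @failure_threshold
    raise
  end
end

breaker = SimpleCircuitBreaker.new
puts breaker.call { 'provider response' }
```

A failed trial call after the timeout reopens the breaker immediately, because the failure count is only reset on success.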

Common tuning patterns:

Hot-spot provider → register a second credential on the provider and split traffic via Ai::AgentModelSelector.
Provider rate-limit churn → respect the provider's recommended request rate; lower concurrency for the relevant queue.
Cost vs. latency trade-off → use the ModelRouter to route latency-sensitive flows to a faster but more expensive model and batch / non-interactive flows to a cheaper model.

Reference: ai-operations.md for SLO definitions and incident runbooks.

Frontend Performance

Frontend is React + TypeScript + Vite.

Targets:

Initial bundle (gzipped) < 300 KB.
Route-level code splitting on every top-level feature.
Largest Contentful Paint (LCP) < 2.5s on a cold load.
No layout shift after first paint.

Patterns:

Use route-level React.lazy() for every feature directory.
Prefer useInfiniteResourceList over Previous/Next pagination on list pages.
Use data-testid for E2E selectors instead of class / role matchers.
Run npx --prefix frontend tsc --noEmit after TS changes.
Bundle analysis: cd frontend && npm run build && npx vite-bundle-visualizer.

For component-level patterns and conventions, see ../guides/frontend.md.

Load Testing

Realistic Scenarios

Tune the scenarios below to your fleet's actual traffic mix. They're a starting point — not gospel.

| Scenario | Endpoint | Method | Concurrent users | Duration | Ramp-up |
| --- | --- | --- | --- | --- | --- |
| Auth burst | `/api/v1/auth/login` | POST | 50 | 5m | 1m |
| Agent execute (sync) | `/api/v1/ai/agents/:id/execute` | POST | 20 | 10m | 2m |
| Mission create | `/api/v1/ai/missions` | POST | 10 | 5m | 1m |
| Dashboard load | `/api/v1/ai/monitoring/dashboard` | GET | 100 | 5m | 1m |
| WebSocket subscribe | `/cable` | WS | 200 | 5m | 1m |

Tooling

k6 (k6.io) for HTTP + WebSocket scenarios.
Apache Bench (ab) for one-off endpoint smoke checks.
Locust if you need Python-defined ramp logic.

Sample k6 script skeleton:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 50 }, // ramp up
    { duration: '5m', target: 50 }, // hold
    { duration: '1m', target: 0 },  // ramp down
  ],
  // Percentiles are aggregate metrics, so P95 is asserted as a threshold
  // rather than inside a per-request check.
  thresholds: {
    http_req_duration: ['p(95)<500'],
  },
};

const jwt = __ENV.JWT;

export default function () {
  const res = http.get('https://api.staging.powernode.example.com/api/v1/ai/monitoring/dashboard', {
    headers: { Authorization: `Bearer ${jwt}` },
  });
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}
```

Thresholds

| Metric | Threshold | Action if breached |
| --- | --- | --- |
| P95 response time | < 2000ms | Investigate slow query / serializer |
| P99 response time | < 5000ms | Investigate provider timeout or queue contention |
| Error rate | < 1% | Investigate failing endpoints |
| Throughput | ≥ baseline × 0.8 | Investigate regression vs. last run |
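When load-test output is post-processed outside the tool, the thresholds above can be checked as sketched here. The helper names and the sample numbers are hypothetical, and the percentile uses the nearest-rank method, which may differ slightly from the interpolation a given load-testing tool applies.

```ruby
# Nearest-rank percentile: the smallest sample with at least pct% of the
# samples at or below it.
def percentile(samples, pct)
  raise ArgumentError, 'no samples' if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Returns the thresholds from the table that the run failed to meet.
def threshold_breaches(durations_ms:, errors:, total:, throughput:, baseline_throughput:)
  breaches = []
  breaches << :p95 unless percentile(durations_ms, 95) < 2000
  breaches << :p99 unless percentile(durations_ms, 99) < 5000
  breaches << :error_rate unless errors.to_f / total < 0.01
  breaches << :throughput unless throughput >= baseline_throughput * 0.8
  breaches
end

# 94 fast requests, 5 slow ones, 1 very slow one (made-up data).
durations = [300] * 94 + [2500] * 5 + [6000]

p threshold_breaches(durations_ms: durations, errors: 0, total: 100,
                     throughput: 90, baseline_throughput: 100)
# => [:p95]
```

In this sample the 95th-ranked request took 2500ms, so only the P95 threshold is breached. The single 6000ms outlier sits above the 99th rank and does not trip P99.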

Capacity Planning

When sizing for a known growth target (new account onboarding, mission burst, etc.):

Establish baseline. Use APM dashboards to capture P50 / P95 / P99 latency and throughput at current load.
Project growth. Multiply baseline traffic by the expected growth factor.
Identify saturation. Run a graduated load test against staging until you hit the first SLO violation. Note the limiting resource (CPU, memory, DB pool, Redis, provider quota).
Plan capacity. Add the appropriate dimension headroom:
- CPU saturation → scale backend / worker horizontally (Docker Swarm / Compose --scale).
- DB pool saturation → bump pool size or add PgBouncer.
- Redis saturation → bump maxmemory or split Sidekiq DBs.
- Provider quota → register additional credentials or upgrade vendor tier.
Re-test. Confirm the SLO violation moves out beyond the projected growth.
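The baseline, projection, and saturation steps reduce to simple arithmetic. A minimal sketch follows, with a hypothetical helper name and made-up request rates.

```ruby
# Compare projected load (baseline x growth) with the saturation point
# measured in the graduated load test.
def capacity_check(baseline_rps:, growth_factor:, saturation_rps:)
  projected = baseline_rps * growth_factor
  {
    projected_rps: projected,
    headroom: (saturation_rps / projected.to_f).round(2), # > 1.0 means spare capacity
    sufficient: saturation_rps > projected
  }
end

p capacity_check(baseline_rps: 120, growth_factor: 3, saturation_rps: 300)
# => {:projected_rps=>360, :headroom=>0.83, :sufficient=>false}
```

A headroom of 0.83 means the first SLO violation arrives at 83% of the projected load, so the limiting resource noted during the load test needs to grow before launch.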

Procedure — investigate a slow endpoint

Identify the endpoint in APM. Confirm sustained P95 latency above target for the relevant endpoint class.
Inspect the trace breakdown — pinpoint whether the time is in DB, Ruby code, or downstream HTTP.
DB-bound: run EXPLAIN ANALYZE on the offending query; look for seq scans, large sorts, or missing indexes.
Ruby-bound: profile the controller with rack-mini-profiler locally; check for N+1 queries via bullet gem output.
External call-bound: check circuit breaker state and provider quota.
Apply the fix (index, .includes(), caching, provider-side change) and confirm the trace breakdown improves.

Procedure — investigate a queue backlog

Check the Sidekiq dashboard: which queue is backed up, and what's the latency?
Inspect retry / failed sets for systemic errors.
Capacity issue → bump WORKER_CONCURRENCY or add a worker instance.
Slow job → profile the job; bring the long path under a circuit breaker or split into smaller jobs.
External provider stalling → check provider status; consider failing-fast via the breaker.
Verify the queue depth returns to steady state within the next 15 minutes.
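Whether a backlog can clear inside the 15-minute window depends on the gap between processing rate and arrival rate. The helper below is a rough estimate that assumes both rates stay constant. Its name and the sample rates are hypothetical.

```ruby
# Estimate how long a backlog takes to drain at constant rates.
# Returns Float::INFINITY when jobs arrive at least as fast as they finish.
def drain_time_seconds(backlog:, arrival_per_sec:, processed_per_sec:)
  net_rate = processed_per_sec - arrival_per_sec
  return Float::INFINITY unless net_rate.positive?

  backlog / net_rate.to_f
end

seconds = drain_time_seconds(backlog: 3_000, arrival_per_sec: 20, processed_per_sec: 25)
puts "drains in #{(seconds / 60).round(1)} minutes" # prints "drains in 10.0 minutes"
```

If the estimate exceeds 15 minutes, or comes back infinite, the capacity step (more concurrency or another worker instance) is the one to revisit.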

Verification

After applying a tuning change:

APM dashboard shows the relevant percentile within target for at least one full peak window.
Sidekiq dashboard shows queue latency within SLO and no growing retry set.
SELECT * FROM pg_stat_activity WHERE state = 'active' does not show stuck queries.
No new error spikes in Sentry / Rollbar.

Rollback

If a tuning change degrades performance:

Revert the Git commit and redeploy the affected service (backend, worker, or frontend).
Confirm the previous baseline restores within one peak window.
If the change was a config flag (env var / sidekiq.yml weighting), restore via systemd unit edit:
```
sudo nano /etc/powernode/worker-default.conf
sudo systemctl restart powernode-worker@default
```
Capture the failure mode in platform.create_learning (category failure_mode) so we don't repeat it.

Troubleshooting

| Symptom | Likely cause | First action |
| --- | --- | --- |
| P95 latency creeping up | Cache hit rate degraded | Inspect Redis stats + `cache_hits/total_lookups` |
| Sidekiq retry set growing | Failing job class | Inspect `Sidekiq::RetrySet`; investigate top retried job |
| Backend OOM-kill | Memory leak in long-running process | Restart backend with `scripts/reload-backend.sh`; profile with `derailed_benchmarks` |
| Frontend bundle ballooned | New dep without code-split | Run bundle visualiser; lazy-load the new feature |
| Provider P95 doubled | Provider-side incident | Check provider status page; lower call rate via worker concurrency |
| Mission completion rate dropped | AI provider quota / circuit breaker | Check provider quota + open breakers |

Related runbooks

production-deployment.md — Initial service install + scaling commands
worker-operations.md — Worker job catalogue + queue config
ai-operations.md — AI-specific operational procedures
docker-swarm.md — Multi-host scaling

Materials previously at

docs/infrastructure/PERFORMANCE_OPTIMIZER.md (2025-08-24; rewritten 2026-05-17 for current platform identity)

Last verified: 2026-05-17
