Observability in Production: Lessons from Trading System Outages
How proper monitoring, logging, and tracing prevented millions in losses and reduced MTTR from hours to minutes
At 2:47 AM, our trading system stopped processing orders. By 2:52 AM, we had identified the root cause. By 2:55 AM, the system was back online. Total downtime: 8 minutes. Total loss: $12,000.
Three years earlier, a similar incident took 6 hours to resolve and cost $4.2 million.
What changed? Observability.
Before observability:
Total downtime: 6 hours
Lost revenue: $4.2M
Root cause: Database connection leak
After observability:
Total downtime: 8 minutes
Lost revenue: $12K
Root cause: Same issue, caught immediately
Metrics
What: Numerical measurements over time
```typescript
// Counter: things that only go up
orderCounter.inc({ status: 'filled', venue: 'NYSE' });

// Gauge: current value
activeConnectionsGauge.set(pool.getActiveCount());

// Histogram: distribution of values
orderLatencyHistogram.observe(processingTime, {
  symbol: 'AAPL',
  orderType: 'MARKET'
});
```

Key metrics for trading systems:
```typescript
const metrics = {
  // Business metrics
  ordersPerSecond: new Counter('orders_total'),
  fillRate: new Gauge('order_fill_rate'),
  averageSlippage: new Histogram('order_slippage_dollars'),

  // System metrics
  apiLatencyP99: new Histogram('api_latency_ms'),
  databaseConnections: new Gauge('db_connections_active'),
  kafkaLag: new Gauge('kafka_consumer_lag'),

  // Error metrics
  orderRejections: new Counter('orders_rejected', ['reason']),
  apiErrors: new Counter('api_errors', ['endpoint', 'status_code']),

  // Resource metrics
  cpuUsage: new Gauge('cpu_usage_percent'),
  memoryUsage: new Gauge('memory_usage_bytes'),
  diskIOPS: new Counter('disk_operations_total')
};
```

Logs
What: Discrete events with context
```typescript
// ❌ Bad logging
console.log('Order failed');

// ✅ Good logging
logger.error('Order validation failed', {
  orderId: order.id,
  accountId: order.accountId,
  symbol: order.symbol,
  quantity: order.quantity,
  reason: 'insufficient_margin',
  requiredMargin: 50000,
  availableMargin: 30000,
  timestamp: new Date().toISOString(),
  traceId: getCurrentTraceId()
});
```

Structured logging:
```typescript
interface LogContext {
  service: string;
  environment: string;
  version: string;
  hostname: string;
  traceId: string;
  spanId: string;
  userId?: string;
  accountId?: string;
}

class Logger {
  // Static context is set once at construction, merged into every line
  constructor(private context: LogContext) {}

  error(message: string, metadata: object = {}) {
    console.log(JSON.stringify({
      level: 'ERROR',
      message,
      ...this.context,
      ...metadata,
      timestamp: Date.now()
    }));
  }
}
```
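To make the output format concrete, here is a standalone sketch that mirrors the `error` method body above; the field values and the `formatLog` helper are invented for illustration, not part of the Logger API:

```typescript
// Standalone sketch of the structured-log output format. formatLog mirrors
// the Logger.error body above; all field values here are illustrative.
interface Ctx { service: string; environment: string; version: string; }

function formatLog(ctx: Ctx, level: string, message: string, metadata: object = {}): string {
  // Merge static context with per-event metadata into one JSON object per line
  return JSON.stringify({ level, message, ...ctx, ...metadata, timestamp: Date.now() });
}

const line = formatLog(
  { service: 'order-service', environment: 'production', version: '1.4.2' },
  'ERROR',
  'Order validation failed',
  { orderId: 'ord-123', reason: 'insufficient_margin' }
);

// Every log line is one JSON object: machine-queryable and trace-correlatable
console.log(JSON.parse(line).reason); // → "insufficient_margin"
```

One JSON object per line is what lets the log store index and query individual fields later, instead of grepping free text.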
Traces
What: Request flow through distributed system
```typescript
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

async function processOrder(order: Order) {
  const tracer = trace.getTracer('order-service');

  return await tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.symbol': order.symbol,
      'order.quantity': order.quantity
    });

    try {
      // Each step gets its own span
      const riskCheck = await tracer.startActiveSpan('validateRisk',
        async (riskSpan) => {
          const result = await riskService.validate(order);
          riskSpan.setAttributes({
            'risk.approved': result.approved,
            'risk.required_margin': result.requiredMargin
          });
          riskSpan.end();
          return result;
        }
      );

      if (!riskCheck.approved) {
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw new Error('Risk check failed');
      }

      const execution = await tracer.startActiveSpan('executeOrder',
        async (execSpan) => {
          const result = await executionService.execute(order);
          execSpan.setAttributes({
            'execution.price': result.price,
            'execution.venue': result.venue
          });
          execSpan.end();
          return result;
        }
      );

      span.setStatus({ code: SpanStatusCode.OK });
      return execution;

    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Trace output:
```
Trace ID: 7a8f3d2e1b4c9f6a
├─ processOrder (250ms)
│  ├─ validateRisk (100ms)
│  │  ├─ database.query (45ms)
│  │  └─ redis.get (5ms)
│  ├─ executeOrder (120ms)
│  │  ├─ kafka.send (80ms) ← SLOW!
│  │  └─ database.update (30ms)
│  └─ auditLog.write (10ms)
```

The four golden signals, the metrics that matter most:
```typescript
class GoldenSignals {
  // Latency: How long does it take?
  measureLatency(operation: string, duration: number) {
    latencyHistogram.observe(duration, { operation });
  }

  // Traffic: How much demand?
  measureTraffic(operation: string) {
    requestCounter.inc({ operation });
  }

  // Errors: How many failures?
  measureErrors(operation: string, error: Error) {
    errorCounter.inc({ operation, type: error.constructor.name });
  }

  // Saturation: How full is the service?
  measureSaturation() {
    cpuGauge.set(process.cpuUsage().user / 1000000);
    memoryGauge.set(process.memoryUsage().heapUsed);
    connectionPoolGauge.set(pool.getActiveCount() / pool.getMaxSize());
  }
}
```

Define what "good" means with an SLI, and alert when it is at risk:

```typescript
// Define what "good" means
const orderProcessingSLI = new SLI({
  name: 'order_processing_success_rate',
  target: 0.9999, // 99.99% success rate
  window: '30d',

  goodEvents: metrics.ordersProcessed.labels({ status: 'success' }),
  totalEvents: metrics.ordersProcessed
});

// Alert when SLI is at risk
if (orderProcessingSLI.current() < 0.9999) {
  alert('SLI violation: order processing below target');
}
```
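A 99.99% target over a 30-day window leaves a very small error budget. A quick sketch of the arithmetic (the `errorBudgetMinutes` helper is illustrative, not part of any SLI library):

```typescript
// Error budget: the downtime a window allows at a given success target.
// errorBudgetMinutes is an illustrative helper, not a library function.
function errorBudgetMinutes(target: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - target);
}

// 99.99% over 30 days leaves roughly 4.3 minutes of allowed downtime
console.log(errorBudgetMinutes(0.9999, 30).toFixed(2)); // → "4.32"
```

That budget is why the 8-minute outage in the opening story still counts as an SLO miss for the month, even though it was a huge improvement over 6 hours.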
```typescript
// Propagate context across service boundaries
import { propagation, context, trace } from '@opentelemetry/api';

// Service A: Set context
async function createOrder(order: Order) {
  const span = tracer.startSpan('createOrder');
  const ctx = trace.setSpan(context.active(), span);

  // Inject context into HTTP headers
  const headers = {};
  propagation.inject(ctx, headers);

  await fetch('http://risk-service/validate', {
    headers,
    body: JSON.stringify(order)
  });
}

// Service B: Extract context
app.post('/validate', async (req, res) => {
  // Extract context from headers
  const ctx = propagation.extract(context.active(), req.headers);

  // Continue the trace
  const span = tracer.startSpan('validateRisk', undefined, ctx);
  // ... processing
});
```

Symptom: Order latency increased from 50ms to 500ms
Investigation:
```typescript
// 1. Check metrics
// latency_p99{operation="processOrder"} jumped from 50ms to 500ms

// 2. Find slow traces
const slowTraces = await tracing.query({
  service: 'order-service',
  operation: 'processOrder',
  minDuration: '400ms',
  limit: 100
});

// 3. Analyze common pattern
// All slow traces show kafka.send taking 450ms

// 4. Check Kafka metrics
// kafka_consumer_lag = 1.2M messages ← PROBLEM!

// 5. Root cause
// Kafka cluster degraded, consumer lag building up
```

Solution: Scale the Kafka cluster, reduce consumer lag
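Scaling fixed the immediate problem; the lasting improvement is alerting on lag before it compounds. A minimal sketch, assuming a known sustained consume rate (the helper name and the 5-minute threshold are illustrative, not from any Kafka client):

```typescript
// Illustrative lag check; shouldAlertOnLag and the 300s threshold are our
// own choices, not part of any Kafka client API.
function shouldAlertOnLag(lagMessages: number, consumeRatePerSec: number): boolean {
  // Alert when the backlog would take more than 5 minutes to drain
  const drainSeconds = lagMessages / consumeRatePerSec;
  return drainSeconds > 300;
}

// 1.2M messages at 1,000 msg/s is a 20-minute backlog → alert
console.log(shouldAlertOnLag(1_200_000, 1000)); // → true
```

Expressing the threshold as time-to-drain rather than a raw message count keeps the alert meaningful as throughput grows.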
Symptom: 0.1% of orders fail with "connection timeout"
Investigation:
```typescript
// 1. Query error logs
const errors = await logs.query({
  level: 'ERROR',
  message: '*connection timeout*',
  timeRange: 'last 1h'
});

// 2. Group by trace ID
const traces = errors.map(e => e.traceId);

// 3. Analyze traces
// Common pattern: timeout after exactly 30 seconds
// Happens only for specific database queries

// 4. Check database
// Long-running queries on `positions` table
// Missing index on `positions(account_id, symbol)`

// 5. Verify with query plan
const plan = await db.explain(
  'SELECT * FROM positions WHERE account_id = $1 AND symbol = $2'
);
// Seq Scan on positions (cost=0.00..10000.00)
```

Solution: Add the missing index
```sql
CREATE INDEX CONCURRENTLY idx_positions_account_symbol
ON positions(account_id, symbol);
```

Symptom: Service crashes every 3 days with OOM
Investigation:
```typescript
// 1. Check memory metrics over time
// memory_usage_bytes steadily increasing
// from 500MB to 4GB over 3 days

// 2. Take heap snapshots
const snapshot1 = await heapSnapshot();
await sleep(3600000); // 1 hour
const snapshot2 = await heapSnapshot();

// 3. Compare snapshots
const diff = compareSnapshots(snapshot1, snapshot2);
/*
Largest growth:
- Array: +2.5GB
- Object: +500MB
- String: +200MB
*/

// 4. Find what's holding references
const retainerTree = diff.getRetainerTree('Array');
/*
Array (2.5GB)
└─ OrderCache._orders
   └─ OrderService.cache
*/

// 5. Check code
class OrderCache {
  private _orders: Map<string, Order> = new Map();

  add(order: Order) {
    this._orders.set(order.id, order);
    // ❌ Never removes old orders!
  }
}
```

Solution: Implement an LRU cache with eviction
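One way to sketch that fix is a bounded cache that leans on a Map's insertion order for least-recently-used eviction (the class name and capacity handling are our own, not from the original service):

```typescript
// Minimal LRU cache sketch: a Map preserves insertion order, so the first
// key is always the least recently used entry.
class LRUCache<K, V> {
  private entries = new Map<K, V>();

  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark this key as most recently used
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Evict the least recently used entry (first key in the Map)
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Swapping this in for the unbounded Map keeps memory flat, at the cost of occasional cache misses on evicted orders.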
❌ Bad alerting:

```typescript
// Too sensitive
if (errorCount > 0) {
  alert('ERRORS DETECTED');
}

// Too vague
if (cpuUsage > 80) {
  alert('HIGH CPU');
}

// Alert fatigue
if (diskUsage > 70) {
  alert('DISK ALMOST FULL'); // Every 5 minutes for weeks
}
```

✅ Good alerting:

```typescript
// Alert on rate, not count
if (errorRate > 0.01 && requestCount > 100) {
  alert('Error rate above 1%', {
    current: errorRate,
    threshold: 0.01,
    runbook: 'https://wiki/runbooks/high-error-rate'
  });
}

// Alert on business impact
if (orderFillRate < 0.95) {
  alert('Order fill rate degraded', {
    current: orderFillRate,
    target: 0.99,
    impact: 'Customer orders not executing',
    severity: 'critical'
  });
}

// Alert with context
if (p99Latency > 1000 && avgLatency < 100) {
  alert('Latency tail degraded', {
    p99: p99Latency,
    p50: p50Latency,
    avg: avgLatency,
    suggestion: 'Check for slow queries or GC pauses',
    dashboard: 'https://grafana/d/latency'
  });
}
```

Our production setup:
```yaml
# Metrics
metrics:
  collector: Prometheus
  storage: Thanos (long-term)
  visualization: Grafana
  alerting: Alertmanager

# Logs
logs:
  shipper: Fluentd
  storage: Elasticsearch
  visualization: Kibana

# Traces
tracing:
  library: OpenTelemetry
  collector: OTEL Collector
  storage: Jaeger

# Unified
unified:
  platform: Datadog
  backup: Grafana Cloud
```

Observability can get expensive:
```
Logs:    10TB/day × $0.50/GB = $5,000/day
Metrics: 100M series × $0.01/1K series = $1,000/day
Traces:  100% sampling = $2,000/day

Total: $8,000/day = $240,000/month
```
Four strategies brought the bill down:

```typescript
// 1. Sample traces intelligently
const sampler = {
  // Always sample errors
  shouldSample(span) {
    if (span.status === 'ERROR') return true;

    // Sample slow requests
    if (span.duration > 1000) return true;

    // Sample 1% of normal requests
    return Math.random() < 0.01;
  }
};

// 2. Aggregate metrics
// Instead of per-account metrics, use percentiles
latency.observe(value); // Don't add account_id label

// 3. Drop noisy logs
if (log.level === 'DEBUG' && env === 'production') {
  return; // Don't ship debug logs to prod
}

// 4. Use log levels strategically
logger.info('Order processed', { orderId }); // Structured, shipped
logger.debug('Cache hit', { key }); // Local only, not shipped
```
```
Logs:    1TB/day × $0.50/GB = $500/day
Metrics: 10M series × $0.01/1K series = $100/day
Traces:  5% sampling = $100/day

Total: $700/day = $21,000/month
Savings: 91%
```
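The savings figure checks out; a quick arithmetic sketch (the helper name is ours):

```typescript
// Percentage saved when daily spend drops; costSavingsPercent is an
// illustrative helper for the arithmetic above.
function costSavingsPercent(beforePerDay: number, afterPerDay: number): number {
  return (1 - afterPerDay / beforePerDay) * 100;
}

// $8,000/day down to $700/day is just over 91% saved
console.log(Math.round(costSavingsPercent(8000, 700))); // → 91
```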
Build observability in from the start:

```typescript
// Add observability to every function
function withObservability<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  const span = tracer.startSpan(name);

  return fn()
    .then(result => {
      const duration = Date.now() - start;

      // Metrics
      operationCounter.inc({ operation: name, status: 'success' });
      operationLatency.observe(duration, { operation: name });

      // Trace
      span.setStatus({ code: SpanStatusCode.OK });
      span.end();

      return result;
    })
    .catch(error => {
      const duration = Date.now() - start;

      // Metrics
      operationCounter.inc({ operation: name, status: 'error' });
      operationLatency.observe(duration, { operation: name });

      // Logs
      logger.error(`${name} failed`, {
        error: error.message,
        stack: error.stack,
        traceId: span.spanContext().traceId
      });

      // Trace
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.end();

      throw error;
    });
}

// Usage
const order = await withObservability('processOrder', () =>
  processOrder(orderData)
);
```
49);
50Observability isn't optional for production systems—it's essential:
The investment in observability (time, money, complexity) pays for itself many times over in:
We help clients implement:
Contact us to improve your observability.
The NordVarg Team builds software specializing in high-performance financial systems and type-safe programming.