Observability in Production: Lessons from Trading System Outages
How proper monitoring, logging, and tracing prevented millions in losses and reduced MTTR from hours to minutes
At 2:47 AM, our trading system stopped processing orders. By 2:52 AM, we had identified the root cause. By 2:55 AM, the system was back online. Total downtime: 8 minutes. Total loss: $12,000.
Three years earlier, a similar incident took 6 hours to resolve and cost $4.2 million.
What changed? Observability.
Before observability:
Total downtime: 6 hours
Lost revenue: $4.2M
Root cause: Database connection leak
After observability:
Total downtime: 8 minutes
Lost revenue: $12K
Root cause: Same issue, caught immediately
Metrics
What: Numerical measurements over time
```typescript
// Counter: things that only go up
orderCounter.inc({ status: 'filled', venue: 'NYSE' });

// Gauge: current value
activeConnectionsGauge.set(pool.getActiveCount());

// Histogram: distribution of values
orderLatencyHistogram.observe(processingTime, {
  symbol: 'AAPL',
  orderType: 'MARKET'
});
```

Key metrics for trading systems:
```typescript
const metrics = {
  // Business metrics
  ordersPerSecond: new Counter('orders_total'),
  fillRate: new Gauge('order_fill_rate'),
  averageSlippage: new Histogram('order_slippage_dollars'),

  // System metrics
  apiLatencyP99: new Histogram('api_latency_ms'),
  databaseConnections: new Gauge('db_connections_active'),
  kafkaLag: new Gauge('kafka_consumer_lag'),

  // Error metrics
  orderRejections: new Counter('orders_rejected', ['reason']),
  apiErrors: new Counter('api_errors', ['endpoint', 'status_code']),

  // Resource metrics
  cpuUsage: new Gauge('cpu_usage_percent'),
  memoryUsage: new Gauge('memory_usage_bytes'),
  diskIOPS: new Counter('disk_operations_total')
};
```

Logs
What: Discrete events with context
```typescript
// ❌ Bad logging
console.log('Order failed');

// ✅ Good logging
logger.error('Order validation failed', {
  orderId: order.id,
  accountId: order.accountId,
  symbol: order.symbol,
  quantity: order.quantity,
  reason: 'insufficient_margin',
  requiredMargin: 50000,
  availableMargin: 30000,
  timestamp: new Date().toISOString(),
  traceId: getCurrentTraceId()
});
```

Structured logging:
```typescript
interface LogContext {
  service: string;
  environment: string;
  version: string;
  hostname: string;
  traceId: string;
  spanId: string;
  userId?: string;
  accountId?: string;
}

class Logger {
  // Static context is set once at construction, merged into every line
  constructor(private context: LogContext) {}

  error(message: string, metadata: object = {}) {
    console.log(JSON.stringify({
      level: 'ERROR',
      message,
      ...this.context,
      ...metadata,
      timestamp: Date.now()
    }));
  }
}
```
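To make the output format concrete, here is a standalone sketch that mirrors the `error` method body above; the field values and the `formatLog` helper are invented for illustration, not part of the Logger API:

```typescript
// Standalone sketch of the structured-log output format. formatLog mirrors
// the Logger.error body above; all field values here are illustrative.
interface Ctx { service: string; environment: string; version: string; }

function formatLog(ctx: Ctx, level: string, message: string, metadata: object = {}): string {
  // Merge static context with per-event metadata into one JSON object per line
  return JSON.stringify({ level, message, ...ctx, ...metadata, timestamp: Date.now() });
}

const line = formatLog(
  { service: 'order-service', environment: 'production', version: '1.4.2' },
  'ERROR',
  'Order validation failed',
  { orderId: 'ord-123', reason: 'insufficient_margin' }
);

// Every log line is one JSON object: machine-queryable and trace-correlatable
console.log(JSON.parse(line).reason); // → "insufficient_margin"
```

One JSON object per line is what lets the log store index and query individual fields later, instead of grepping free text.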
Traces
What: Request flow through distributed system
```typescript
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

async function processOrder(order: Order) {
  const tracer = trace.getTracer('order-service');

  return await tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.symbol': order.symbol,
      'order.quantity': order.quantity
    });

    try {
      // Each step gets its own span
      const riskCheck = await tracer.startActiveSpan('validateRisk',
        async (riskSpan) => {
          const result = await riskService.validate(order);
          riskSpan.setAttributes({
            'risk.approved': result.approved,
            'risk.required_margin': result.requiredMargin
          });
          riskSpan.end();
          return result;
        }
      );

      if (!riskCheck.approved) {
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw new Error('Risk check failed');
      }

      const execution = await tracer.startActiveSpan('executeOrder',
        async (execSpan) => {
          const result = await executionService.execute(order);
          execSpan.setAttributes({
            'execution.price': result.price,
            'execution.venue': result.venue
          });
          execSpan.end();
          return result;
        }
      );

      span.setStatus({ code: SpanStatusCode.OK });
      return execution;

    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Trace output:
```
Trace ID: 7a8f3d2e1b4c9f6a
├─ processOrder (250ms)
│  ├─ validateRisk (100ms)
│  │  ├─ database.query (45ms)
│  │  └─ redis.get (5ms)
│  ├─ executeOrder (120ms)
│  │  ├─ kafka.send (80ms) ← SLOW!
│  │  └─ database.update (30ms)
│  └─ auditLog.write (10ms)
```

The four golden signals, the metrics that matter most:
```typescript
class GoldenSignals {
  // Latency: How long does it take?
  measureLatency(operation: string, duration: number) {
    latencyHistogram.observe(duration, { operation });
  }

  // Traffic: How much demand?
  measureTraffic(operation: string) {
    requestCounter.inc({ operation });
  }

  // Errors: How many failures?
  measureErrors(operation: string, error: Error) {
    errorCounter.inc({ operation, type: error.constructor.name });
  }

  // Saturation: How full is the service?
  measureSaturation() {
    cpuGauge.set(process.cpuUsage().user / 1000000);
    memoryGauge.set(process.memoryUsage().heapUsed);
    connectionPoolGauge.set(pool.getActiveCount() / pool.getMaxSize());
  }
}
```

Define what "good" means with an SLI, and alert when it is at risk:

```typescript
// Define what "good" means
const orderProcessingSLI = new SLI({
  name: 'order_processing_success_rate',
  target: 0.9999, // 99.99% success rate
  window: '30d',

  goodEvents: metrics.ordersProcessed.labels({ status: 'success' }),
  totalEvents: metrics.ordersProcessed
});

// Alert when SLI is at risk
if (orderProcessingSLI.current() < 0.9999) {
  alert('SLI violation: order processing below target');
}
```
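A 99.99% target over a 30-day window leaves a very small error budget. A quick sketch of the arithmetic (the `errorBudgetMinutes` helper is illustrative, not part of any SLI library):

```typescript
// Error budget: the downtime a window allows at a given success target.
// errorBudgetMinutes is an illustrative helper, not a library function.
function errorBudgetMinutes(target: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - target);
}

// 99.99% over 30 days leaves roughly 4.3 minutes of allowed downtime
console.log(errorBudgetMinutes(0.9999, 30).toFixed(2)); // → "4.32"
```

That budget is why the 8-minute outage in the opening story still counts as an SLO miss for the month, even though it was a huge improvement over 6 hours.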
```typescript
// Propagate context across service boundaries
import { propagation, context, trace } from '@opentelemetry/api';

// Service A: Set context
async function createOrder(order: Order) {
  const span = tracer.startSpan('createOrder');
  const ctx = trace.setSpan(context.active(), span);

  // Inject context into HTTP headers
  const headers = {};
  propagation.inject(ctx, headers);

  await fetch('http://risk-service/validate', {
    headers,
    body: JSON.stringify(order)
  });
}

// Service B: Extract context
app.post('/validate', async (req, res) => {
  // Extract context from headers
  const ctx = propagation.extract(context.active(), req.headers);

  // Continue the trace
  const span = tracer.startSpan('validateRisk', undefined, ctx);
  // ... processing
});
```

Symptom: Order latency increased from 50ms to 500ms
Investigation:
```typescript
// 1. Check metrics
// latency_p99{operation="processOrder"} jumped from 50ms to 500ms

// 2. Find slow traces
const slowTraces = await tracing.query({
  service: 'order-service',
  operation: 'processOrder',
  minDuration: '400ms',
  limit: 100
});

// 3. Analyze common pattern
// All slow traces show kafka.send taking 450ms

// 4. Check Kafka metrics
// kafka_consumer_lag = 1.2M messages ← PROBLEM!

// 5. Root cause
// Kafka cluster degraded, consumer lag building up
```

Solution: Scale the Kafka cluster, reduce consumer lag
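Scaling fixed the immediate problem; the lasting improvement is alerting on lag before it compounds. A minimal sketch, assuming a known sustained consume rate (the helper name and the 5-minute threshold are illustrative, not from any Kafka client):

```typescript
// Illustrative lag check; shouldAlertOnLag and the 300s threshold are our
// own choices, not part of any Kafka client API.
function shouldAlertOnLag(lagMessages: number, consumeRatePerSec: number): boolean {
  // Alert when the backlog would take more than 5 minutes to drain
  const drainSeconds = lagMessages / consumeRatePerSec;
  return drainSeconds > 300;
}

// 1.2M messages at 1,000 msg/s is a 20-minute backlog → alert
console.log(shouldAlertOnLag(1_200_000, 1000)); // → true
```

Expressing the threshold as time-to-drain rather than a raw message count keeps the alert meaningful as throughput grows.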
Symptom: 0.1% of orders fail with "connection timeout"
Investigation:
```typescript
// 1. Query error logs
const errors = await logs.query({
  level: 'ERROR',
  message: '*connection timeout*',
  timeRange: 'last 1h'
});

// 2. Group by trace ID
const traces = errors.map(e => e.traceId);

// 3. Analyze traces
// Common pattern: timeout after exactly 30 seconds
// Happens only for specific database queries

// 4. Check database
// Long-running queries on `positions` table
// Missing index on `positions(account_id, symbol)`

// 5. Verify with query plan
const plan = await db.explain(
  'SELECT * FROM positions WHERE account_id = $1 AND symbol = $2'
);
// Seq Scan on positions (cost=0.00..10000.00)
```

Solution: Add the missing index
```sql
CREATE INDEX CONCURRENTLY idx_positions_account_symbol
ON positions(account_id, symbol);
```

Symptom: Service crashes every 3 days with OOM
Investigation:
```typescript
// 1. Check memory metrics over time
// memory_usage_bytes steadily increasing
// from 500MB to 4GB over 3 days

// 2. Take heap snapshots
const snapshot1 = await heapSnapshot();
await sleep(3600000); // 1 hour
const snapshot2 = await heapSnapshot();

// 3. Compare snapshots
const diff = compareSnapshots(snapshot1, snapshot2);
/*
Largest growth:
- Array: +2.5GB
- Object: +500MB
- String: +200MB
*/

// 4. Find what's holding references
const retainerTree = diff.getRetainerTree('Array');
/*
Array (2.5GB)
└─ OrderCache._orders
   └─ OrderService.cache
*/

// 5. Check code
class OrderCache {
  private _orders: Map<string, Order> = new Map();

  add(order: Order) {
    this._orders.set(order.id, order);
    // ❌ Never removes old orders!
  }
}
```

Solution: Implement an LRU cache with eviction
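One way to sketch that fix is a bounded cache that leans on a Map's insertion order for least-recently-used eviction (the class name and capacity handling are our own, not from the original service):

```typescript
// Minimal LRU cache sketch: a Map preserves insertion order, so the first
// key is always the least recently used entry.
class LRUCache<K, V> {
  private entries = new Map<K, V>();

  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark this key as most recently used
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Evict the least recently used entry (first key in the Map)
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Swapping this in for the unbounded Map keeps memory flat, at the cost of occasional cache misses on evicted orders.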
❌ Bad alerting:

```typescript
// Too sensitive
if (errorCount > 0) {
  alert('ERRORS DETECTED');
}

// Too vague
if (cpuUsage > 80) {
  alert('HIGH CPU');
}

// Alert fatigue
if (diskUsage > 70) {
  alert('DISK ALMOST FULL'); // Every 5 minutes for weeks
}
```

✅ Good alerting:

```typescript
// Alert on rate, not count
if (errorRate > 0.01 && requestCount > 100) {
  alert('Error rate above 1%', {
    current: errorRate,
    threshold: 0.01,
    runbook: 'https://wiki/runbooks/high-error-rate'
  });
}

// Alert on business impact
if (orderFillRate < 0.95) {
  alert('Order fill rate degraded', {
    current: orderFillRate,
    target: 0.99,
    impact: 'Customer orders not executing',
    severity: 'critical'
  });
}

// Alert with context
if (p99Latency > 1000 && avgLatency < 100) {
  alert('Latency tail degraded', {
    p99: p99Latency,
    p50: p50Latency,
    avg: avgLatency,
    suggestion: 'Check for slow queries or GC pauses',
    dashboard: 'https://grafana/d/latency'
  });
}
```

Our production setup:
```yaml
# Metrics
metrics:
  collector: Prometheus
  storage: Thanos (long-term)
  visualization: Grafana
  alerting: Alertmanager

# Logs
logs:
  shipper: Fluentd
  storage: Elasticsearch
  visualization: Kibana

# Traces
tracing:
  library: OpenTelemetry
  collector: OTEL Collector
  storage: Jaeger

# Unified
unified:
  platform: Datadog
  backup: Grafana Cloud
```

Observability can get expensive:
```
Logs:    10TB/day × $0.50/GB = $5,000/day
Metrics: 100M series × $0.01/1K series = $1,000/day
Traces:  100% sampling = $2,000/day

Total: $8,000/day = $240,000/month
```
Four strategies brought the bill down:

```typescript
// 1. Sample traces intelligently
const sampler = {
  // Always sample errors
  shouldSample(span) {
    if (span.status === 'ERROR') return true;

    // Sample slow requests
    if (span.duration > 1000) return true;

    // Sample 1% of normal requests
    return Math.random() < 0.01;
  }
};

// 2. Aggregate metrics
// Instead of per-account metrics, use percentiles
latency.observe(value); // Don't add account_id label

// 3. Drop noisy logs
if (log.level === 'DEBUG' && env === 'production') {
  return; // Don't ship debug logs to prod
}

// 4. Use log levels strategically
logger.info('Order processed', { orderId }); // Structured, shipped
logger.debug('Cache hit', { key }); // Local only, not shipped
```
```
Logs:    1TB/day × $0.50/GB = $500/day
Metrics: 10M series × $0.01/1K series = $100/day
Traces:  5% sampling = $100/day

Total: $700/day = $21,000/month
Savings: 91%
```
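The savings figure checks out; a quick arithmetic sketch (the helper name is ours):

```typescript
// Percentage saved when daily spend drops; costSavingsPercent is an
// illustrative helper for the arithmetic above.
function costSavingsPercent(beforePerDay: number, afterPerDay: number): number {
  return (1 - afterPerDay / beforePerDay) * 100;
}

// $8,000/day down to $700/day is just over 91% saved
console.log(Math.round(costSavingsPercent(8000, 700))); // → 91
```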
Build observability in from the start:

```typescript
// Add observability to every function
function withObservability<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  const span = tracer.startSpan(name);

  return fn()
    .then(result => {
      const duration = Date.now() - start;

      // Metrics
      operationCounter.inc({ operation: name, status: 'success' });
      operationLatency.observe(duration, { operation: name });

      // Trace
      span.setStatus({ code: SpanStatusCode.OK });
      span.end();

      return result;
    })
    .catch(error => {
      const duration = Date.now() - start;

      // Metrics
      operationCounter.inc({ operation: name, status: 'error' });
      operationLatency.observe(duration, { operation: name });

      // Logs
      logger.error(`${name} failed`, {
        error: error.message,
        stack: error.stack,
        traceId: span.spanContext().traceId
      });

      // Trace
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.end();

      throw error;
    });
}

// Usage
const order = await withObservability('processOrder', () =>
  processOrder(orderData)
);
```
49);
50Observability isn't optional for production systems—it's essential:
The investment in observability (time, money, complexity) pays for itself many times over in:
We help clients implement:
Contact us to improve your observability.
The NordVarg Team builds software specializing in high-performance financial systems and type-safe programming.