
October 25, 2024 • NordVarg Team
Observability in Production: Lessons from Trading System Outages

How proper monitoring, logging, and tracing prevented millions in losses and reduced MTTR from hours to minutes

DevOps • Observability • Monitoring • Production • Incident Response
9 min read

Introduction#

At 2:47 AM, our trading system stopped processing orders. By 2:52 AM, we had identified the root cause. By 2:55 AM, the system was back online. Total downtime: 8 minutes. Total loss: $12,000.

Three years earlier, a similar incident took 6 hours to resolve and cost $4.2 million.

What changed? Observability.

The Cost of Poor Observability#

Incident #1: The Dark Ages (2021)#

Timeline:

  • 02:47 - First customer reports order failures
  • 03:15 - On-call engineer woken up (28 minutes)
  • 03:45 - Engineer logs into systems (30 minutes)
  • 04:30 - Checking logs across 15 servers (45 minutes)
  • 05:15 - Found nothing in logs (45 minutes)
  • 06:00 - Started checking database (45 minutes)
  • 07:30 - Discovered connection pool exhaustion (90 minutes)
  • 08:00 - Fixed by restarting services (30 minutes)
  • 08:47 - System fully recovered

Total: 6 hours
Lost revenue: $4.2M
Root cause: Database connection leak

Incident #2: The Modern Era (2024)#

Timeline:

  • 02:47 - Alert fires before customers notice
  • 02:48 - On-call engineer receives alert with context
  • 02:50 - Engineer opens trace showing exact issue
  • 02:52 - Root cause identified: connection pool leak
  • 02:55 - Automated failover triggered
  • 02:55 - System recovered

Total: 8 minutes
Lost revenue: $12K
Root cause: Same issue, but caught immediately
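The alert that made the difference can be sketched as a simple saturation check on the connection pool (the names, threshold, and runbook URL here are illustrative, not our actual config):

```typescript
interface PoolStats {
  active: number;
  max: number;
}

// Hypothetical check evaluated by the metrics pipeline on each scrape.
// It fires while the pool is merely saturated, before requests start failing.
function checkConnectionPool(stats: PoolStats): string | null {
  const utilization = stats.active / stats.max;
  if (utilization > 0.9) {
    return JSON.stringify({
      alert: 'db_connection_pool_near_exhaustion',
      utilization,
      runbook: 'https://wiki/runbooks/connection-pool' // illustrative URL
    });
  }
  return null; // healthy
}
```

Paging on saturation rather than on downstream errors is what buys the lead time over waiting for customer reports.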

The Three Pillars of Observability#

1. Metrics#

What: Numerical measurements over time

typescript
// Counter: things that only go up
orderCounter.inc({ status: 'filled', venue: 'NYSE' });

// Gauge: current value
activeConnectionsGauge.set(pool.getActiveCount());

// Histogram: distribution of values
orderLatencyHistogram.observe(processingTime, {
  symbol: 'AAPL',
  orderType: 'MARKET'
});

Key metrics for trading systems:

typescript
const metrics = {
  // Business metrics
  ordersPerSecond: new Counter('orders_total'),
  fillRate: new Gauge('order_fill_rate'),
  averageSlippage: new Histogram('order_slippage_dollars'),

  // System metrics
  apiLatencyP99: new Histogram('api_latency_ms'),
  databaseConnections: new Gauge('db_connections_active'),
  kafkaLag: new Gauge('kafka_consumer_lag'),

  // Error metrics
  orderRejections: new Counter('orders_rejected', ['reason']),
  apiErrors: new Counter('api_errors', ['endpoint', 'status_code']),

  // Resource metrics
  cpuUsage: new Gauge('cpu_usage_percent'),
  memoryUsage: new Gauge('memory_usage_bytes'),
  diskIOPS: new Counter('disk_operations_total')
};

2. Logs#

What: Discrete events with context

typescript
// ❌ Bad logging
console.log('Order failed');

// ✅ Good logging
logger.error('Order validation failed', {
  orderId: order.id,
  accountId: order.accountId,
  symbol: order.symbol,
  quantity: order.quantity,
  reason: 'insufficient_margin',
  requiredMargin: 50000,
  availableMargin: 30000,
  timestamp: new Date().toISOString(),
  traceId: getCurrentTraceId()
});

Structured logging:

typescript
interface LogContext {
  service: string;
  environment: string;
  version: string;
  hostname: string;
  traceId: string;
  spanId: string;
  userId?: string;
  accountId?: string;
}

class Logger {
  private context: LogContext;

  constructor(context: LogContext) {
    this.context = context;
  }

  error(message: string, metadata: object = {}) {
    console.log(JSON.stringify({
      level: 'ERROR',
      message,
      ...this.context,
      ...metadata,
      timestamp: Date.now()
    }));
  }
}

3. Traces#

What: Request flow through a distributed system

typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

async function processOrder(order: Order) {
  const tracer = trace.getTracer('order-service');

  return await tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({
      'order.id': order.id,
      'order.symbol': order.symbol,
      'order.quantity': order.quantity
    });

    try {
      // Each step gets its own span
      const riskCheck = await tracer.startActiveSpan('validateRisk',
        async (riskSpan) => {
          const result = await riskService.validate(order);
          riskSpan.setAttributes({
            'risk.approved': result.approved,
            'risk.required_margin': result.requiredMargin
          });
          riskSpan.end();
          return result;
        }
      );

      if (!riskCheck.approved) {
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw new Error('Risk check failed');
      }

      const execution = await tracer.startActiveSpan('executeOrder',
        async (execSpan) => {
          const result = await executionService.execute(order);
          execSpan.setAttributes({
            'execution.price': result.price,
            'execution.venue': result.venue
          });
          execSpan.end();
          return result;
        }
      );

      span.setStatus({ code: SpanStatusCode.OK });
      return execution;

    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}

Trace output:

plaintext
Trace ID: 7a8f3d2e1b4c9f6a
├─ processOrder (250ms)
│  ├─ validateRisk (100ms)
│  │  ├─ database.query (45ms)
│  │  └─ redis.get (5ms)
│  ├─ executeOrder (120ms)
│  │  ├─ kafka.send (80ms)  ← SLOW!
│  │  └─ database.update (30ms)
│  └─ auditLog.write (10ms)

Advanced Observability Patterns#

1. Golden Signals#

The four metrics that matter most:

typescript
class GoldenSignals {
  // Latency: How long does it take?
  measureLatency(operation: string, duration: number) {
    latencyHistogram.observe(duration, { operation });
  }

  // Traffic: How much demand?
  measureTraffic(operation: string) {
    requestCounter.inc({ operation });
  }

  // Errors: How many failures?
  measureErrors(operation: string, error: Error) {
    errorCounter.inc({ operation, type: error.constructor.name });
  }

  // Saturation: How full is the service?
  measureSaturation() {
    cpuGauge.set(process.cpuUsage().user / 1000000);
    memoryGauge.set(process.memoryUsage().heapUsed);
    connectionPoolGauge.set(pool.getActiveCount() / pool.getMaxSize());
  }
}

2. Service Level Indicators (SLIs)#

typescript
// Define what "good" means
const orderProcessingSLI = new SLI({
  name: 'order_processing_success_rate',
  target: 0.9999, // 99.99% success rate
  window: '30d',

  goodEvents: metrics.ordersProcessed.labels({ status: 'success' }),
  totalEvents: metrics.ordersProcessed
});

// Alert when SLI is at risk
if (orderProcessingSLI.current() < 0.9999) {
  alert('SLI violation: order processing below target');
}

3. Distributed Context Propagation#

typescript
// Propagate context across service boundaries
import { propagation, context, trace } from '@opentelemetry/api';

// Service A: Set context
async function createOrder(order: Order) {
  const span = tracer.startSpan('createOrder');
  const ctx = trace.setSpan(context.active(), span);

  // Inject context into HTTP headers
  const headers = {};
  propagation.inject(ctx, headers);

  await fetch('http://risk-service/validate', {
    headers,
    body: JSON.stringify(order)
  });
}

// Service B: Extract context
app.post('/validate', async (req, res) => {
  // Extract context from headers
  const ctx = propagation.extract(context.active(), req.headers);

  // Continue the trace
  const span = tracer.startSpan('validateRisk', undefined, ctx);
  // ... processing
});

Real-World Debugging Scenarios#

Scenario 1: Slow Orders#

Symptom: Order latency increased from 50ms to 500ms

Investigation:

typescript
// 1. Check metrics
// latency_p99{operation="processOrder"} jumped from 50ms to 500ms

// 2. Find slow traces
const slowTraces = await tracing.query({
  service: 'order-service',
  operation: 'processOrder',
  minDuration: '400ms',
  limit: 100
});

// 3. Analyze common pattern
// All slow traces show kafka.send taking 450ms

// 4. Check Kafka metrics
// kafka_consumer_lag = 1.2M messages ← PROBLEM!

// 5. Root cause
// Kafka cluster degraded, consumer lag building up

Solution: Scale Kafka cluster, reduce lag
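To catch this earlier next time, a lag-based guard along these lines can page before latency degrades (the thresholds here are illustrative, not tuned values):

```typescript
type Severity = 'ok' | 'warn' | 'critical';

// Map consumer lag (and how fast it is growing) to an alert severity.
// Growth rate matters: a fast-growing backlog can page even while the
// absolute lag is still small.
function lagSeverity(lagMessages: number, lagGrowthPerSec: number): Severity {
  if (lagMessages > 1_000_000 || lagGrowthPerSec > 10_000) return 'critical';
  if (lagMessages > 100_000 || lagGrowthPerSec > 1_000) return 'warn';
  return 'ok';
}
```

At the 1.2M-message lag seen in this incident, this check returns 'critical'.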

Scenario 2: Intermittent Failures#

Symptom: 0.1% of orders fail with "connection timeout"

Investigation:

typescript
// 1. Query error logs
const errors = await logs.query({
  level: 'ERROR',
  message: '*connection timeout*',
  timeRange: 'last 1h'
});

// 2. Group by trace ID
const traces = errors.map(e => e.traceId);

// 3. Analyze traces
// Common pattern: timeout after exactly 30 seconds
// Happens only for specific database queries

// 4. Check database
// Long-running queries on `positions` table
// Missing index on `positions(account_id, symbol)`

// 5. Verify with query plan
const plan = await db.explain(
  'SELECT * FROM positions WHERE account_id = $1 AND symbol = $2'
);
// Seq Scan on positions (cost=0.00..10000.00)

Solution: Add index

sql
CREATE INDEX CONCURRENTLY idx_positions_account_symbol
ON positions(account_id, symbol);

Scenario 3: Memory Leak#

Symptom: Service crashes every 3 days with OOM

Investigation:

typescript
// 1. Check memory metrics over time
// memory_usage_bytes steadily increasing
// from 500MB to 4GB over 3 days

// 2. Take heap snapshots
const snapshot1 = await heapSnapshot();
await sleep(3600000); // 1 hour
const snapshot2 = await heapSnapshot();

// 3. Compare snapshots
const diff = compareSnapshots(snapshot1, snapshot2);
/*
Largest growth:
- Array: +2.5GB
- Object: +500MB
- String: +200MB
*/

// 4. Find what's holding references
const retainerTree = diff.getRetainerTree('Array');
/*
Array (2.5GB)
└─ OrderCache._orders
   └─ OrderService.cache
*/

// 5. Check code
class OrderCache {
  private _orders: Map<string, Order> = new Map();

  add(order: Order) {
    this._orders.set(order.id, order);
    // ❌ Never removes old orders!
  }
}

Solution: Implement LRU cache with eviction
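A minimal sketch of the fix, using Map's insertion-order iteration guarantee to track recency (the capacity and types are illustrative):

```typescript
// Bounded cache that evicts the least recently used entry at capacity.
// JavaScript's Map iterates in insertion order, so deleting and
// re-inserting a key on access keeps the first key the least recent.
class LruCache<K, V> {
  private entries = new Map<K, V>();

  constructor(private readonly capacity: number) {}

  get(key: K): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      this.entries.delete(key); // refresh recency
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      const oldest = this.entries.keys().next().value as K; // least recent
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Swapping OrderCache's unbounded Map for something like this caps memory at `capacity` entries; TTL-based eviction is a reasonable alternative when recency is a poor proxy for relevance.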

Alerting Strategy#

❌ Bad Alerts#

typescript
// Too sensitive
if (errorCount > 0) {
  alert('ERRORS DETECTED');
}

// Too vague
if (cpuUsage > 80) {
  alert('HIGH CPU');
}

// Alert fatigue
if (diskUsage > 70) {
  alert('DISK ALMOST FULL'); // Every 5 minutes for weeks
}

✅ Good Alerts#

typescript
// Alert on rate, not count
if (errorRate > 0.01 && requestCount > 100) {
  alert('Error rate above 1%', {
    current: errorRate,
    threshold: 0.01,
    runbook: 'https://wiki/runbooks/high-error-rate'
  });
}

// Alert on business impact
if (orderFillRate < 0.95) {
  alert('Order fill rate degraded', {
    current: orderFillRate,
    target: 0.99,
    impact: 'Customer orders not executing',
    severity: 'critical'
  });
}

// Alert with context
if (p99Latency > 1000 && avgLatency < 100) {
  alert('Latency tail degraded', {
    p99: p99Latency,
    p50: p50Latency,
    avg: avgLatency,
    suggestion: 'Check for slow queries or GC pauses',
    dashboard: 'https://grafana/d/latency'
  });
}

Observability Stack#

Our production setup:

yaml
# Metrics
metrics:
  collector: Prometheus
  storage: Thanos (long-term)
  visualization: Grafana
  alerting: Alertmanager

# Logs
logs:
  shipper: Fluentd
  storage: Elasticsearch
  visualization: Kibana

# Traces
tracing:
  library: OpenTelemetry
  collector: OTEL Collector
  storage: Jaeger

# Unified
unified:
  platform: Datadog
  backup: Grafana Cloud

Cost Optimization#

Observability can get expensive:

Before Optimization#

plaintext
Logs: 10TB/day × $0.50/GB = $5,000/day
Metrics: 100M series × $0.01 per 1K series = $1,000/day
Traces: 100% sampling = $2,000/day

Total: $8,000/day = $240,000/month

The Optimizations#

typescript
// 1. Sample traces intelligently
const sampler = {
  // Always sample errors
  shouldSample(span) {
    if (span.status === 'ERROR') return true;

    // Sample slow requests
    if (span.duration > 1000) return true;

    // Sample 1% of normal requests
    return Math.random() < 0.01;
  }
};

// 2. Aggregate metrics
// Instead of per-account metrics, use percentiles
latency.observe(value); // Don't add account_id label

// 3. Drop noisy logs
if (log.level === 'DEBUG' && env === 'production') {
  return; // Don't ship debug logs to prod
}

// 4. Use log levels strategically
logger.info('Order processed', { orderId }); // Structured, shipped
logger.debug('Cache hit', { key }); // Local only, not shipped

After Optimization#

plaintext
Logs: 1TB/day × $0.50/GB = $500/day
Metrics: 10M series × $0.01 per 1K series = $100/day
Traces: 5% sampling = $100/day

Total: $700/day = $21,000/month
Savings: 91%

Observability-Driven Development#

Build observability in from the start:

typescript
// Add observability to every function
function withObservability<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  const span = tracer.startSpan(name);

  return fn()
    .then(result => {
      const duration = Date.now() - start;

      // Metrics
      operationCounter.inc({ operation: name, status: 'success' });
      operationLatency.observe(duration, { operation: name });

      // Trace
      span.setStatus({ code: SpanStatusCode.OK });
      span.end();

      return result;
    })
    .catch(error => {
      const duration = Date.now() - start;

      // Metrics
      operationCounter.inc({ operation: name, status: 'error' });
      operationLatency.observe(duration, { operation: name });

      // Logs
      logger.error(`${name} failed`, {
        error: error.message,
        stack: error.stack,
        traceId: span.spanContext().traceId
      });

      // Trace
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.end();

      throw error;
    });
}

// Usage
const order = await withObservability('processOrder', () =>
  processOrder(orderData)
);

Conclusion#

Observability isn't optional for production systems—it's essential:

  • Reduced MTTR from hours to minutes
  • Proactive detection before customers notice
  • Root cause analysis in seconds, not days
  • Cost savings from preventing outages

The investment in observability (time, money, complexity) pays for itself many times over in:

  • Prevented outages
  • Faster incident response
  • Better customer experience
  • Lower operational costs

Our Approach#

We help clients implement:

  • Observability architectures (metrics, logs, traces)
  • Cost-effective monitoring strategies
  • SLI/SLO frameworks
  • Incident response runbooks
  • On-call training

Contact us to improve your observability.
