
November 12, 2025 • NordVarg Team

FIX in Production: Hardening Sessions, Recovery and Resilience

Trading · fix · infrastructure · resilience · ops
7 min read

This post focuses on making FIX sessions reliable in production. The protocol gives you primitives for robust delivery (sequencing, resend, gap-fill), but you must build durable storage, replay, and observability around those primitives to survive network failures, crashes, and operator error.

The operational goals

When hardening FIX sessions, target these operational goals:

  • Durability: no lost outbound messages after a crash.
  • Recoverability: quick, correct recovery from sequence gaps and disconnects.
  • Observability: metrics and logs to detect and diagnose session problems.
  • Security: authenticated, encrypted connections and controlled access.
  • Scalability: support high message rates and multiple venues with predictable behavior.

Architecture patterns

There are three common deployment patterns for production FIX handling:

  1. Embedded session inside trading application
    • Simple to implement with a FIX engine library but harder to scale and observe.
  2. Separate FIX gateway process
    • Gateway handles sessions, journaling, replay, TLS and exposes a clean API to the trading app (TCP, gRPC, or shared memory).
  3. Managed gateway + router
    • Gateway for session concerns, router for multi-venue routing and stickiness. Useful for multi-tenant or high-availability setups.

For production we prefer (2) or (3): offload session concerns to a purpose-built gateway that can be restarted independently and provides durable journaling.

Journaling and persistence (the single most important detail)

Durable journaling is the core of correctness. The basic rule: persist outbound messages, sequence numbers and any state required for replay to durable storage before acknowledging them to internal systems.

Implementation notes:

  • Use an append-only journal file per counterparty (rotate daily), written with fsync guarantees.
  • Store both the wire-format FIX message (the exact bytes) and a small metadata record (seqnum, timestamp, client id).
  • When sending, write to journal first, then write the bytes to the socket. On success, mark as sent. Periodically truncate or compress old journal segments after confirming counterparty has acknowledged via sequence numbers or administrative resets.
  • Ensure the journal write path is fast: use pre-allocated files, O_DIRECT or write-behind with careful fsync boundaries (fsync after a configurable window or on critical messages).

Pseudocode (simplified):

```python
# durable_journal.append returns the persisted record id
record_id = durable_journal.append(msg_bytes, seqnum)
# only after the append is durable do we write to the socket
socket.send(msg_bytes)
mark_sent(record_id)
```

Edge cases:

  • If the process crashes after journal append but before socket send, the restart logic must attempt replay.
  • If a message was marked sent but not acknowledged by the counterparty (e.g., due to crash), replay logic must be able to resend safely.
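Both edge cases reduce to the same restart rule: resend everything journaled beyond the last acknowledged seqnum. A minimal sketch, assuming a journal read back as `(seqnum, msg_bytes)` pairs (the layout and names are illustrative, not any particular engine's API):

```python
def replay_unsent(journal_records, last_acked_seq, send):
    """Resend every journaled message the counterparty has not acknowledged.

    journal_records: iterable of (seqnum, msg_bytes) pairs read back from
                     the durable journal on restart.
    last_acked_seq:  highest outgoing seqnum the peer has confirmed.
    send:            transport callable taking (seqnum, msg_bytes).
    Returns the number of messages resent.
    """
    resent = 0
    for seqnum, msg_bytes in journal_records:
        if seqnum > last_acked_seq:
            # A real implementation would set PossDupFlag (tag 43=Y) on the
            # retransmitted copy so the peer can discard duplicates it
            # already processed.
            send(seqnum, msg_bytes)
            resent += 1
    return resent
```

Because retransmissions carry PossDupFlag, resending a message the peer did in fact receive is safe; resending too little is not.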

Replay and gap-fill handling

A robust replay implementation must follow FIX semantics carefully:

  • On reconnect, determine local and peer sequence numbers.
  • If the peer's next outgoing seqnum is higher than the seqnum we next expect (we are missing messages), send a ResendRequest for the missing range.
  • If peer requests replays for messages we still have, read journal and resend.
  • Implement Sequence Reset / GapFill semantics when appropriate (e.g., when messages are supplanted or intentionally skipped).

Replay algorithm outline:

  1. Load last persisted incoming/outgoing seqnums from stable store.
  2. On connect, exchange logon and sequence information.
  3. If the peer's expected incoming seqnum is lower than our next outgoing seqnum, the peer is missing our messages: resend from the peer's expected seqnum using journal records.
  4. If the peer requests a resend range, stream the saved messages. Where messages are logically irrelevant, send a Gap Fill via Sequence Reset (GapFillFlag, tag 123, with NewSeqNo, tag 36).
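The reconciliation decision at logon can be sketched as a pure function over the four sequence numbers involved; the parameter names and action tuples are illustrative, not a specific engine's API:

```python
def reconcile_on_logon(our_next_out, peer_expected_in, our_expected_in, peer_next_out):
    """Decide recovery actions after exchanging sequence numbers at logon.

    our_next_out:     next seqnum we will assign to an outgoing message.
    peer_expected_in: seqnum the peer expects to receive from us next.
    our_expected_in:  seqnum we expect to receive from the peer next.
    peer_next_out:    next seqnum the peer will assign outgoing.
    Returns a list of (action, start_seq, end_seq) tuples.
    """
    actions = []
    # Peer received fewer of our messages than we sent: replay from journal.
    if peer_expected_in < our_next_out:
        actions.append(("replay_outgoing", peer_expected_in, our_next_out - 1))
    # We are missing incoming messages: ask the peer to resend.
    if peer_next_out > our_expected_in:
        actions.append(("send_resend_request", our_expected_in, peer_next_out - 1))
    return actions
```

If the peer expects a seqnum *higher* than our next outgoing one, our durable state has been lost or corrupted; that case needs operator intervention, not automatic recovery.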

High-availability designs

HA is often handled at two levels:

  • Active / Passive gateway pair with shared storage for journal and leader election.
  • Stateless gateways fronting a stateful process pool (less common; requires sticky sessions or an external shard map).

Active/Passive pattern:

  • Gateways share a durable store (NFS, clustered filesystem, or object store) for journals and sequence metadata.
  • A lightweight lease (e.g., using Consul or ZooKeeper) indicates which gateway owns the session.
  • When active fails, passive takes the lease, replays unsent journal records, and reestablishes sessions.

Caveats:

  • Shared filesystems can introduce latency; prefer fast local SSD with replication and a small, well-tested takeover protocol that guarantees only one active writer.
  • Ensure the takeover flow replays only what was not acknowledged; otherwise you risk duplicates.
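The "only one active writer" guarantee hinges on lease semantics: ownership must expire if the active gateway stops renewing. A toy in-memory lease illustrating those semantics (a real deployment would use Consul, ZooKeeper, or etcd; this stub is illustrative only):

```python
import threading
import time


class InMemoryLease:
    """Stand-in for a distributed lease store; illustrates acquire/renew/expire."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None
        self._expires = 0.0

    def try_acquire(self, owner, ttl, now=None):
        """Acquire or renew the lease. Succeeds if the lease is free,
        expired, or already held by `owner`. `now` is injectable for tests."""
        now = time.monotonic() if now is None else now
        with self._lock:
            if self._owner is None or now >= self._expires or self._owner == owner:
                self._owner = owner
                self._expires = now + ttl
                return True
            return False
```

The passive gateway polls `try_acquire` on a short interval; it only wins once the active's lease has expired, i.e. the active has genuinely stopped renewing. Only after winning does it replay unacknowledged journal records and reestablish sessions.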

Example architecture diagram (Mermaid):

```mermaid
flowchart LR
  client[Counterparty]
  client -->|TCP/TLS| fw[Firewall / LB]
  fw --> gwActive["FIX Gateway (active)"]
  fw --> gwPassive["FIX Gateway (passive)"]
  gwActive --> journal["Durable Journal (local SSD + replication)"]
  gwPassive --> journal
  gwActive --> app[Trading App / OMS]
  gwPassive --> app
  %% Journal stores raw FIX bytes + metadata
```

Load balancing and sticky sessions

Many venues require a stable TCP connection and will not permit arbitrary reconnection from a different source IP/port during a session. Common options:

  • Use a single dedicated connection per membership (no LB).
  • Use a gateway that handles many client sessions and multiplexes them to the trading app (internal sticky mapping).
  • If you must use a load balancer, configure it for source IP affinity, or better, use a TCP passthrough that preserves client addresses.

Security best practices

  • Use TLS with mutual authentication for all remote FIX links when possible. Configure certificate rotation and monitoring.
  • Use IP allowlists and restrict management interfaces to internal networks.
  • Store credentials (keys, passwords) in a vault (HashiCorp Vault, AWS Secrets Manager) and inject at runtime.
  • Audit access to the gateway and journal files — post-trade reconstruction depends on unmodified logs.

Observability and SLOs

Metrics to track per session:

  • Message rates (in/out) and average latency per message type.
  • Retransmit rate (ResendRequest frequency), retransmit duration.
  • Session uptime, number of reconnects per hour/day.
  • Journal lag (how many messages are persisted and not acknowledged by counterparty).
  • Heartbeat misses and test requests.
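A bare-bones per-session counter set covering the list above (a real deployment would export these through a metrics library such as a Prometheus client; the names here are illustrative):

```python
import time
from collections import defaultdict


class SessionMetrics:
    """Minimal per-session FIX metrics; export to your metrics backend."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.connected_since = None

    def on_message(self, direction, msg_type):
        # e.g. direction="in"/"out", msg_type="D" (NewOrderSingle)
        self.counters[f"msgs_{direction}_{msg_type}"] += 1

    def on_resend_request(self):
        self.counters["resend_requests"] += 1

    def on_reconnect(self):
        self.counters["reconnects"] += 1
        self.connected_since = time.monotonic()

    def journal_lag(self, persisted_seq, acked_seq):
        # Messages durably written but not yet acknowledged by the peer.
        return persisted_seq - acked_seq
```

Alerting on `journal_lag` growing while `reconnects` stays flat is a useful early signal: the session is up but the counterparty has stopped making progress.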

Set SLOs such as:

  • Average time to recover (from disconnect) < X seconds.
  • Retransmit rate < Y%.

Logging: log both wire-format messages (for replay) and structured events (for dashboards). Keep indexes for quick lookup by ClOrdID, OrderID, ExecID and time ranges.

Testing and chaos exercises

Automate tests for the following scenarios:

  • Network partition: drop packets/blackhole and verify replay on reconnect.
  • Gateway crash: ensure passive gateway takes over without losing messages.
  • Resend storms: simulate massive ResendRequest flood and ensure the gateway rate-limits replay to protect peers.

A scheduled chaos suite for FIX sessions might include periodic short disconnects and sequence gaps to exercise replay paths.
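The rate-limiting needed to survive a resend storm can be as simple as a token bucket in front of the journal reader; the rate and capacity here are illustrative, tune them per venue:

```python
import time


class TokenBucket:
    """Token bucket capping replay throughput during resend storms."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate            # tokens (messages) added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True if one message may be replayed now."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The replay loop calls `allow()` before each resent message and sleeps briefly when refused, so a flood of ResendRequests degrades into a bounded, steady replay rather than a burst that overwhelms the peer.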

Operational playbook (runbook)

  1. Check network and firewall logs for packet drops.
  2. Inspect journal: confirm last persisted outgoing seqnum and last acknowledged seq from broker.
  3. If sequence gap exists, issue ResendRequest for the missing range and monitor resend throughput.
  4. If takeover occurs, validate no duplicate messages were accepted by the counterparty (compare ExecIDs and order state).
  5. If broker requests sequence reset, follow broker guidance but coordinate with ops to avoid inconsistent state.

Performance considerations

  • Journaling and fsyncs add latency. Use batching strategies: flush every N messages or every M milliseconds. Tune based on risk tolerance.
  • Use direct I/O and pre-allocated files for predictable latency.
  • For very high message rates, consider a two-tier approach: memory-first journal with frequent snapshots to disk.
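A sketch of the batching strategy: fsync after every N appends, or immediately for critical messages. The time-based flush window (every M milliseconds) is omitted for brevity, and the threshold is illustrative:

```python
import os


class BatchedJournalWriter:
    """Append-only journal writer that batches fsyncs to amortize latency."""

    def __init__(self, path, batch_size=64):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.batch_size = batch_size
        self.pending = 0

    def append(self, msg_bytes, critical=False):
        os.write(self.fd, msg_bytes)
        self.pending += 1
        if critical or self.pending >= self.batch_size:
            os.fsync(self.fd)   # durability point: data is on stable media
            self.pending = 0

    def close(self):
        os.fsync(self.fd)
        os.close(self.fd)
```

The risk tradeoff is explicit: up to `batch_size - 1` non-critical messages can be lost in a crash between fsyncs, so anything order-related should pass `critical=True`.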

Example: minimal replay handler (conceptual)

```python
def handle_resend_request(start_seq, end_seq):
    for seq in range(start_seq, end_seq + 1):
        record = journal.read(seq)
        if record is None:
            # No replayable record (e.g., an administrative message):
            # send Sequence Reset with GapFillFlag=Y. A production version
            # would coalesce consecutive gaps into a single Gap Fill whose
            # NewSeqNo points past the whole run.
            send_gap_fill(from_seq=seq)
            continue
        send(record.msg_bytes)
```

Closing thoughts

Hardening FIX in production is mostly about operational rigor: durable journaling, tested replay, clear takeover semantics, and metrics that surface problems early. Start by implementing a small durable journal and replay flow; then iterate on HA and performance only after correctness is proven.
