
Feb 27, 2025 · 1 min read

SwipeX Latency Cushion

How we keep SwipeX execution smooth when CEX APIs misbehave and Layer-2 sequencers sneeze.

SwipeX SRE

Why a latency cushion?

SwipeX sits between impatient traders and slow-moving venues. When a centralized exchange throttles order books or a Layer-2 sequencer pauses for upgrades, our trading stack needs to keep filling orders without spraying errors at clients. The latency cushion is a safety layer that soaks up those hiccups.

Architecture snapshot

  • Dual-feed market data: every venue we connect to has two transports (primary WebSocket + REST heartbeat). We reconcile them in <50ms.
  • Latency-aware router: fills are routed through a queue that measures RTT on every client connection, and slow venues automatically get smaller clips (a simplified sizing sketch follows the health loop below).
  • Stateful failover: if a venue goes >2s without a heartbeat, we freeze existing orders, clone state into the backup region, and re-issue resting liquidity from there.
import time

# Simplified health-check loop: degrade per venue instead of halting everything.
while True:
    refresh_venue_health()
    for venue in venues:
        if venue.is_degraded():
            shrink_order_size(venue)    # smaller clips while the venue is slow
            reroute_to_backup(venue)    # move resting liquidity to the backup region
        else:
            run_normal_flow(venue)
    time.sleep(1.0)  # pacing between sweeps; the interval here is illustrative
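
Here is a minimal sketch of how the latency-aware router's clip sizing could work. The names (VenueStats, BASE_CLIP, RTT_BUDGET_MS) and the linear scaling curve are illustrative assumptions, not our production API; only the 50ms budget and the 2-second heartbeat cutoff come from the notes above.

from dataclasses import dataclass

# Illustrative only: VenueStats, BASE_CLIP, RTT_BUDGET_MS and the scaling
# curve are assumptions; the 50ms budget and 2s heartbeat cutoff mirror the
# architecture notes above.

@dataclass
class VenueStats:
    rtt_p95_ms: float       # 95th-percentile round-trip time to the venue
    heartbeat_age_s: float  # seconds since the venue's last heartbeat

BASE_CLIP = 1.0       # normalized full-size clip
RTT_BUDGET_MS = 50.0  # below this RTT a venue gets the full clip

def clip_size(stats: VenueStats) -> float:
    """Shrink a venue's clip as its RTT grows past the budget."""
    if stats.heartbeat_age_s > 2.0:
        return 0.0  # stale venue: send nothing and let failover take over
    scale = RTT_BUDGET_MS / max(stats.rtt_p95_ms, RTT_BUDGET_MS)
    return BASE_CLIP * scale

# Example: a venue at 150ms p95 gets roughly a third of the normal clip.
print(clip_size(VenueStats(rtt_p95_ms=150.0, heartbeat_age_s=0.4)))  # ~0.33

The exact curve matters less than the principle: sizing decisions are driven by measured latency, not static venue tiers.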

Risk knobs we tune daily

  1. Latency budgets — a trading pod's latency can't exceed 1.3× its 7-day moving average; when it does, automation trims size.
  2. Queue depth — if our relayer queue grows past 80 ops we instantly pause all non-critical strategies.
  3. Ops alerts — every failover is announced in Slack + PagerDuty with the impacted instruments, so humans can help.
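
For the first two knobs, the automation boils down to threshold checks. The sketch below is hypothetical (PodHealth and the action labels are made up for illustration); the 1.3× multiplier and 80-op queue limit are the real numbers from the list above.

from dataclasses import dataclass

# Hypothetical sketch: PodHealth and the action labels are illustrative;
# the 1.3x latency multiplier and 80-op queue limit match the knobs above.

@dataclass
class PodHealth:
    latency_ms: float         # current pod latency
    latency_7d_avg_ms: float  # 7-day moving average latency
    queue_depth: int          # outstanding ops in the relayer queue

LATENCY_MULTIPLIER = 1.3  # knob 1: budget relative to the 7-day average
QUEUE_DEPTH_LIMIT = 80    # knob 2: ops before non-critical strategies pause

def risk_actions(pod: PodHealth) -> list[str]:
    """Return the automated actions the knobs would trigger for this pod."""
    actions = []
    if pod.latency_ms > LATENCY_MULTIPLIER * pod.latency_7d_avg_ms:
        actions.append("trim_size")           # knob 1: latency budget breached
    if pod.queue_depth > QUEUE_DEPTH_LIMIT:
        actions.append("pause_non_critical")  # knob 2: relayer queue backlog
    return actions

# Example: a pod running hot with a backlog trips both knobs.
print(risk_actions(PodHealth(latency_ms=140, latency_7d_avg_ms=100, queue_depth=95)))
# -> ['trim_size', 'pause_non_critical']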

Results so far

  • 0 emergency halts in the last four venue incidents.
  • Clients experienced <200ms extra latency even when a major CEX API throttled us for a minute.
  • Engineering cut investigation time in half thanks to the unified telemetry dashboards we built for this system.

If you’re building your own latency cushion, start with visible metrics, practice failovers weekly, and never assume a single venue will stay healthy when you need it most.