# Pipeline Hardening: Reliability & Speed

**Date:** 2026-02-18 08:50 PST
**Branch:** `streamrift-0` (v0.4)
**Author:** COO (Claude) — deep exploration session
**Status:** PROPOSAL — awaiting review + approval

---

## Executive Summary

A comprehensive audit of the Galactus data pipeline — from external API to rendered pixel — reveals a system that is **fast in the happy path** but **fragile under failure**. WebSocket prices arrive in <100ms, but a single relay hiccup requires a manual page reload. Cold start takes 5-30 seconds due to sequential discovery phases. Errors are silently swallowed, leaving the user staring at stale data with no indication anything is wrong.

This report documents every layer of the current architecture, identifies specific failure modes, and proposes a phased hardening plan optimized for **reliability first, speed second**.

### Key Decisions Made

| Decision         | Choice                       | Rationale                                                                  |
| ---------------- | ---------------------------- | -------------------------------------------------------------------------- |
| Order signing    | Client-side (unchanged)      | Auth works; RSA-PSS signing already runs in the browser; no migration risk |
| Read data        | Relay cache + fan-out        | One upstream connection serves N clients, eliminates per-browser discovery |
| WS failure UX    | Grey/fade stale prices       | User always knows data freshness at a glance                               |
| Credential model | No changes                   | Single-user terminal, server-side auth is a separate future project        |
| Sequencing       | Reliability → Speed → Polish | Broken fast is worse than slow reliable                                    |

---

## Table of Contents

1. [Current Architecture](#1-current-architecture)
2. [Data Flow Audit](#2-data-flow-audit)
3. [Performance Profile](#3-performance-profile)
4. [Reliability Audit](#4-reliability-audit)
5. [Security & Credential Model](#5-security--credential-model)
6. [Relay Server Deep Dive](#6-relay-server-deep-dive)
7. [Identified Gaps](#7-identified-gaps)
8. [Architecture Options Considered](#8-architecture-options-considered)
9. [Chosen Approach: Hybrid Model](#9-chosen-approach-hybrid-model)
10. [Phase 1: Stop the Bleeding (Reliability)](#10-phase-1-stop-the-bleeding)
11. [Phase 2: Warm the Cache (Speed)](#11-phase-2-warm-the-cache)
12. [Phase 3: Harden the Edges (Polish)](#12-phase-3-harden-the-edges)
13. [Dependency Graph & Work Order](#13-dependency-graph--work-order)
14. [Open Questions & Blockers](#14-open-questions--blockers)
15. [Appendix: File Reference](#15-appendix-file-reference)

---

## 1. Current Architecture

```
┌─────────────────────────────────────────────────────┐
│  Browser (Galactus Dashboard)                       │
│  ├─ RSA-PSS signing (WebCrypto) — all requests      │
│  ├─ Private keys in localStorage (plaintext PEM)     │
│  ├─ REST discovery → parallel orderbook fetch        │
│  ├─ WS stream for live prices (via relay)            │
│  ├─ Polls for order fills every 2s (OrderMonitor)    │
│  └─ Polymarket: Gamma API resolution + WS stream     │
└──────────────────────┬──────────────────────────────┘
                       │
          ┌────────────┴────────────────┐
          ▼                             ▼
┌──────────────────┐          ┌──────────────────┐
│  Relay Server    │          │  Direct (bypass)  │
│  Express + WS    │          │  Browser → API    │
│  PM2 port 8789   │          │  (CORS issues)    │
│  ├─ HTTP forward │          └──────────────────┘
│  ├─ WS multiplex │
│  ├─ GET caching  │
│  └─ Smart Relay  │
│     (optional)   │
└────────┬─────────┘
         │
    ┌────┴──────────────────────────┐
    ▼                ▼              ▼
┌────────┐    ┌────────────┐  ┌─────────┐
│ Kalshi │    │ Polymarket │  │ Odds API│
│ REST+WS│    │ Gamma+CLOB │  │ REST    │
└────────┘    └────────────┘  └─────────┘
```

### Component Roles

| Component                     | Role                                     | State                             |
| ----------------------------- | ---------------------------------------- | --------------------------------- |
| `sportsDiscovery/discover.ts` | REST market discovery (one-time on init) | Working                           |
| `sportsStream/stream.ts`      | Orchestrates 5-phase init pipeline       | Working, slow                     |
| `marketStream.ts`             | Kalshi WS orderbook stream               | Working, no reconnect on frontend |
| `kalshiApi.ts`                | REST client with RSA-PSS auth + 3x retry | Working                           |
| `orderMonitor.ts`             | Polls for order fills every 2s           | Working, laggy                    |
| `relayHttp.ts`                | HTTP relay client (frontend)             | Working, no retry                 |
| `relayWs.ts`                  | WS relay client (frontend)               | **No auto-reconnect**             |
| `relay/index.ts`              | Express + WS relay server                | Working                           |
| `relay/wsRelay.ts`            | WS multiplexer with upstream reconnect   | Working                           |
| `relay/responseCache.ts`      | GET response caching (TTL-based)         | Working                           |
| `relay/smart-relay/`          | Optional server-side market aggregation  | Built, not enabled                |
| `oddsApi.ts`                  | External sportsbook odds client          | Built, not wired to UI            |
| `polymarket/`                 | Polymarket integration (Gamma + CLOB)    | Working                           |

---

## 2. Data Flow Audit

### 2.1 Sports Stream Initialization (Current)

The current cold start is a **5-phase sequential pipeline**:

```
Phase 1: discovering-markets (REST)
  └─ getMarkets() → up to 200 markets, cursor-paginated
  └─ 500ms delay between pages (rate limit respect)
  └─ Duration: 1-5s

Phase 2: fetching-orderbooks (REST, parallel)
  └─ getOrderbook() for every unique market ticker
  └─ Duration: 5-30s depending on market count
  └─ No per-orderbook retry on failure

Phase 3: fetching-events (REST, parallel)
  └─ getEvent() for every unique event ticker
  └─ Extracts start times (Pacific/Vegas)
  └─ Duration: 2-10s

Phase 4: hydrating-polymarket (REST + WS, optional)
  └─ Slug resolution via Gamma API (fuzzy name matching)
  └─ Token ID discovery + WS subscription
  └─ Duration: 5-15s per sport type
  └─ HIGHEST LATENCY PHASE

Phase 5: connecting-stream (WS)
  └─ Subscribe to orderbook_delta channel
  └─ Duration: 100-500ms
  └─ Real-time from here on

TOTAL COLD START: 5-30 seconds
```

### 2.2 Live Data Flow (After Init)

```
Kalshi WS ──→ Relay WS ──→ Frontend marketStream
  orderbook_delta              ├─ Updates orderbook Map
  ticker (bid/ask)             ├─ Fires callbacks
  trade (last price)           └─ UI re-renders

Polymarket WS ──→ Frontend polyStream
  market updates        ├─ Updates token prices
                        └─ UI re-renders

OrderMonitor (polling, 2s) ──→ getOpenOrders()
  ├─ Compare with previous state
  ├─ Detect fills (order disappears → check filled orders)
  └─ Emit onFill / onStatusChange
```
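
A minimal sketch of the "order disappears" step in the OrderMonitor flow above. The `OpenOrder` shape and the function name are illustrative assumptions, not the actual `orderMonitor.ts` types. A missing order is only a fill candidate, since it could also have been cancelled, which is why the flow still checks filled orders afterwards.

```typescript
// Hypothetical minimal order shape; the real order object carries more fields.
interface OpenOrder {
  order_id: string;
  ticker: string;
}

// Returns orders that were open in the previous poll but are missing from the
// current one. The caller still has to confirm each candidate against filled orders.
function findDisappearedOrders(previous: OpenOrder[], current: OpenOrder[]): OpenOrder[] {
  const stillOpen = new Set(current.map((o) => o.order_id));
  return previous.filter((o) => !stillOpen.has(o.order_id));
}

// Two consecutive polls: order "b" is gone in the second snapshot.
const pollOne: OpenOrder[] = [
  { order_id: "a", ticker: "EXAMPLE-1" },
  { order_id: "b", ticker: "EXAMPLE-2" },
];
const pollTwo: OpenOrder[] = [{ order_id: "a", ticker: "EXAMPLE-1" }];
const fillCandidates = findDisappearedOrders(pollOne, pollTwo);
```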

### 2.3 Order Execution Flow

```
User clicks odds cell / submits order form
  └─ kalshiApi.placeOrder()
      └─ buildAuthHeaders() — RSA-PSS sign in browser
          └─ relayHttp.forward() — POST to /relay/http
              └─ Relay forwards to Kalshi REST API
                  └─ Response returned to browser
                      └─ OrderMonitor picks up fill (up to 2s later)
```

---

## 3. Performance Profile

### 3.1 Where We're Fast

| Operation                      | Latency   | Path                                        |
| ------------------------------ | --------- | ------------------------------------------- |
| WS price update (Kalshi)       | <100ms    | Kalshi → relay → browser (real-time stream) |
| Relay HTTP forwarding overhead | ~5ms      | Same-server proxying                        |
| Orderbook delta processing     | <1ms      | In-memory Map update                        |
| Response cache hit (relay)     | <1ms      | In-memory lookup                            |
| Order placement (happy path)   | 200-500ms | REST round-trip through relay               |

### 3.2 Where We're Slow

| Operation                  | Latency     | Root Cause                          |
| -------------------------- | ----------- | ----------------------------------- |
| Cold start (total)         | 5-30s       | Sequential 5-phase init             |
| Polymarket slug resolution | 5-15s/sport | External Gamma API + fuzzy matching |
| Orderbook batch fetch      | 5-30s       | Parallel REST but 100+ tickers      |
| Market page pagination     | 500ms × N   | Rate limit sleep between pages      |
| Order fill detection       | Up to 2s    | Polling interval                    |

### 3.3 Polling Intervals & Timers

| Component               | Interval            | Type                                    |
| ----------------------- | ------------------- | --------------------------------------- |
| OrderMonitor            | 2,000ms             | setInterval — order fill polling        |
| WS relay ping/pong      | 10,000ms            | setInterval — keepalive per Kalshi docs |
| Response cache prune    | 60,000ms            | setInterval — remove expired entries    |
| Market cache prune      | 60,000ms            | setInterval — smart relay only          |
| Version check           | 60,000ms            | setInterval — polls `/version.json`     |
| HTTP request timeout    | 10,000ms            | setTimeout — relay client abort         |
| WS connect timeout      | 10,000ms            | setTimeout — both relay + direct        |
| Relay reconnect backoff | 1s → 30s max        | setTimeout — exponential with jitter    |
| Kalshi API retry        | 2^n × 1000 + jitter | setTimeout — up to 3 attempts           |

---

## 4. Reliability Audit

### 4.1 Reconnection Matrix

| Component                   | Auto-Reconnect       | Retries               | Timeout  | Max Backoff      | Status      |
| --------------------------- | -------------------- | --------------------- | -------- | ---------------- | ----------- |
| `relayWs.ts` (frontend)     | **NO**               | No                    | 10s      | N/A              | **FRAGILE** |
| `wsRelay.ts` (relay server) | YES                  | Exponential           | N/A      | 30s              | Robust      |
| `marketStream.ts`           | NO (relies on relay) | No                    | 10s      | N/A              | **FRAGILE** |
| `smartRelayStream.ts`       | YES                  | Exponential           | N/A      | 30s, 20 attempts | Good        |
| `kalshiApi.ts`              | No                   | YES (3×)              | 10s/call | ~7s total        | Decent      |
| Polymarket WS               | YES                  | Exponential, infinite | 15s      | 30s              | Good        |
| `relayHttp.ts`              | No                   | No                    | 10s      | N/A              | **FRAGILE** |
| `orderMonitor.ts`           | Implicit (polling)   | No                    | 2s/poll  | N/A              | Fair        |

### 4.2 Error Handling Patterns

**Silent suppression (BAD):**

- Most `kalshiApi.ts` methods return `[]` on error — UI cannot distinguish "no data" from "error occurred"
- `getMarkets()`, `getPositions()`, `getOpenOrders()` all silently swallow errors
- User sees empty table, doesn't know the API is down

**Auth error handling (GOOD):**

- `OrderMonitor` detects 401/403, stops polling, sets `stoppedDueToAuthError`
- `kalshiApi` re-throws auth errors to callers (not retried)

**Missing patterns:**

- No circuit breaker — failed endpoints get hammered indefinitely
- No 429/rate-limit detection — treated as generic errors
- No request deduplication — multiple components can flood same endpoint
- No staleness tracking — prices show full opacity even when data is minutes old
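
One way the missing 429 handling could look, sketched under assumptions. The function names and the 250ms jitter ceiling are hypothetical; the `2^n × 1000` schedule and the "auth errors are not retried" rule come from this report. Only the delta-seconds form of `Retry-After` is parsed here; an HTTP-date value falls back to the generic schedule.

```typescript
type FailureKind = "auth" | "rate-limited" | "retryable";

// 401/403 are surfaced immediately; 429 is split out so it can back off
// differently from other failures.
function classifyStatus(httpStatus: number): FailureKind {
  if (httpStatus === 401 || httpStatus === 403) return "auth";
  if (httpStatus === 429) return "rate-limited";
  return "retryable";
}

// Generic schedule: 2^n x 1000 ms plus jitter (n is the 0-based attempt).
// maxJitterMs = 250 is an assumed value, not taken from the codebase.
function genericRetryDelayMs(
  attempt: number,
  maxJitterMs = 250,
  rand: () => number = Math.random,
): number {
  return 2 ** attempt * 1000 + Math.floor(rand() * maxJitterMs);
}

// For 429, prefer the server's Retry-After (seconds) and fall back to the
// generic schedule when the header is absent or not a plain number.
function rateLimitDelayMs(retryAfter: string | null, attempt: number): number {
  const text = retryAfter === null ? "" : retryAfter.trim();
  const seconds = text === "" ? NaN : Number(text);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  return genericRetryDelayMs(attempt);
}
```

With jitter set to zero, attempts 0 through 2 wait 1s, 2s, and 4s, which is the "~7s total" listed for `kalshiApi.ts` in the reconnection matrix.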

### 4.3 Failure Scenarios

| Scenario                  | Current Behavior                                           | User Experience                                     |
| ------------------------- | ---------------------------------------------------------- | --------------------------------------------------- |
| Relay server crashes      | Frontend WS dies, no reconnect                             | **Frozen prices, no indication. Must reload page.** |
| Kalshi API returns 500    | 3 retries with backoff, then silent `[]`                   | Empty table, no error message                       |
| Kalshi WS drops           | Relay reconnects (good), but frontend doesn't re-subscribe | **Prices stop updating, no indication**             |
| Kalshi rate limits (429)  | Retried as generic error; no rate-limit-aware backoff      | Potential cascading failures                        |
| Internet drops            | All connections fail, timeouts fire                        | Multiple error states, inconsistent UI              |
| Polymarket Gamma API slow | Init hangs for 15+ seconds                                 | Long loading spinner, no partial data               |
| Order fill while WS down  | OrderMonitor polling still works (2s)                      | Fill detected with delay, but works                 |

### 4.4 Race Conditions

| Condition                                     | Risk                                                | Mitigation                                 |
| --------------------------------------------- | --------------------------------------------------- | ------------------------------------------ |
| OrderMonitor + WS both report fill            | Double notification possible                        | None — UI should deduplicate by `order_id` |
| Multiple components call `refreshPositions()` | API flood                                           | None — no debounce/coalescing              |
| Subscribe before upstream WS ready            | Subscription queued but not retried after reconnect | Stored in `subscribedTickers` Set          |
| Clock skew (client vs server)                 | Stale data appears "fresh"                          | None — `Date.now()` used client-side       |

---

## 5. Security & Credential Model

### 5.1 Current State

```
┌─────────────────────────────────────────┐
│  Browser                                │
│  ├─ Kalshi private key (plaintext PEM   │
│  │   in localStorage)                   │
│  ├─ Polymarket private key (plaintext   │
│  │   0x hex in localStorage)            │
│  ├─ RSA-PSS signing via WebCrypto       │
│  │   (CryptoKey marked non-extractable) │
│  └─ Gate password in sessionStorage     │
└─────────────────────────────────────────┘
```

**Trust model:**

- Kalshi private key never leaves the browser (by design: Kalshi requires each request to be signed with the private key, and that signing happens in the browser)
- Relay server sees signed requests but cannot forge new ones
- Polymarket: Ethereum key stored for CLOB API credential derivation
- Gate password: SHA-256 hash comparison, build-time env var

**Strengths:**

- Private key never reaches any server (browser signs all requests)
- CryptoKey object is non-extractable (can't be read back via WebCrypto API)
- Multiple Kalshi profiles supported with quick switching

**Weaknesses (accepted for single-user terminal):**

- Private keys are plaintext in localStorage (XSS risk)
- No session timeout or idle logout
- No encryption at rest for stored credentials
- Gate password hash is compiled into the JS bundle

**Decision:** No changes to credential model in this project. This is a single-user trading terminal. Server-side auth is a separate future project if we go multi-user.

### 5.2 Relay Server Auth

The relay server has three auth layers:

1. **Transport auth:** Client signs requests, relay forwards byte-faithfully. Relay never validates signatures.
2. **Admin auth:** `/admin/keys` endpoints protected by `ADMIN_SECRET` bearer token.
3. **Stream auth:** `/stream/markets` WS protected by `STREAM_TOKEN` (optional).

The Smart Relay's `ApiKeyStore` can hold server-side API keys for read-only operations:

- Round-robin rotation across healthy keys
- Auto-disable on 401
- Rate-limit tracking with `Retry-After` header parsing
- Admin endpoints to manage keys at runtime
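
The rotation rules above can be illustrated with a small sketch. This is not the actual `ApiKeyStore` code; `KeyRotation`, `ManagedKey`, and the method signatures are hypothetical names chosen for the example, and the caller is assumed to have already parsed `Retry-After` into seconds.

```typescript
interface ManagedKey {
  id: string;
  disabled: boolean;    // set to true after a 401
  limitedUntil: number; // epoch ms; 0 when not rate limited
}

class KeyRotation {
  private readonly keys: ManagedKey[];
  private cursor = 0;

  constructor(ids: string[]) {
    this.keys = ids.map((id) => ({ id, disabled: false, limitedUntil: 0 }));
  }

  // Round-robin over keys that are neither disabled nor cooling down.
  next(now: number): string | undefined {
    for (let i = 0; i < this.keys.length; i++) {
      const index = (this.cursor + i) % this.keys.length;
      const key = this.keys[index];
      if (key && !key.disabled && key.limitedUntil <= now) {
        this.cursor = (index + 1) % this.keys.length;
        return key.id;
      }
    }
    return undefined; // no healthy key available right now
  }

  // Feed the upstream status back so the rotation can react.
  report(id: string, httpStatus: number, now: number, retryAfterSeconds = 0): void {
    const key = this.keys.find((k) => k.id === id);
    if (!key) return;
    if (httpStatus === 401) key.disabled = true;
    if (httpStatus === 429) key.limitedUntil = now + retryAfterSeconds * 1000;
  }
}
```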

---

## 6. Relay Server Deep Dive

### 6.1 Architecture

```
apps/relay/src/
├── index.ts                    # RelayServer class, Express setup, routes
├── config.ts                   # Environment-based configuration
├── types.ts                    # StreamMetadata, ReconnectConfig
├── httpRelay.ts                # HTTP byte-faithful forwarding
├── wsRelay.ts                  # WebSocket multiplexing (1 client → N upstream)
├── responseCache.ts            # GET response caching (TTL-based)
├── logger.ts                   # Structured logging with secret sanitization
├── middleware/
│   ├── cors.ts                 # CORS middleware
│   ├── errorHandler.ts         # Global error handler
│   ├── requestValidator.ts     # Request structure validation
│   └── adminAuth.ts            # Bearer token auth for admin endpoints
└── smart-relay/
    ├── index.ts                # SmartRelay orchestrator
    ├── MarketCache.ts          # In-memory market data store (event-driven)
    ├── KalshiFetcher.ts        # Kalshi REST + WS client (server-side)
    ├── PolymarketFetcher.ts    # Polymarket REST + WS client (server-side)
    ├── StreamBroadcaster.ts    # Client WS management + fan-out broadcast
    └── ApiKeyStore.ts          # API key rotation + rate limit tracking
```

### 6.2 HTTP Relay

- **Model:** Byte-faithful forwarding. Client signs, relay transports, API validates.
- **Production safety:** URLs restricted to the `KALSHI_BASE_URL` hostname (prevents the relay from being used as an open proxy to arbitrary hosts)
- **Hop-by-hop headers removed:** `connection`, `keep-alive`, `transfer-encoding`
- **Timeout:** Configurable `HTTP_TIMEOUT_MS` (default 30s)
- **Error classes:** `ValidationError` (400), `TimeoutError` (504), `NetworkError` (502)

### 6.3 WebSocket Relay

- **Model:** Multiplexed streams — one client WS can host N independent upstream WS connections
- **Protocol:** JSON frames with `op` field: `connect`, `subscribe`, `send`, `close`
- **Ping/pong:** Every 10s to keep Kalshi connections alive
- **Reconnection:** Exponential backoff (1s → 30s max, configurable max attempts)
- **Non-retryable codes:** 1000, 1001, 1002, 1003, 1008, 4000-4999 (auth/policy)
- **Resource cleanup:** All upstream connections closed on client disconnect
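
The non-retryable code list above reduces to a small predicate. A sketch, with `isRetryableClose` as a hypothetical name rather than the function used in `wsRelay.ts`:

```typescript
// Close codes the relay treats as final, per the list above.
const NON_RETRYABLE_CODES = new Set<number>([1000, 1001, 1002, 1003, 1008]);

function isRetryableClose(code: number): boolean {
  if (NON_RETRYABLE_CODES.has(code)) return false;
  if (code >= 4000 && code <= 4999) return false; // application range: auth/policy
  return true; // e.g. 1006 (abnormal closure) or 1011 (server error)
}
```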

### 6.4 Response Cache

| Pattern                       | TTL        | Scope               |
| ----------------------------- | ---------- | ------------------- |
| `/trade-api/v2/markets?`      | 5 minutes  | Market discovery    |
| `/trade-api/v2/events/`       | 30 minutes | Event metadata      |
| `/trade-api/v2/orderbook/v2/` | 30 seconds | Orderbook snapshots |

- GET requests only
- Cache key: URL + sorted query params
- Prunes expired entries every 60s
- Hit/miss stats exposed on `/health`
- Configurable: `RELAY_CACHE_ENABLED` env var
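
A sketch of how a key built from "URL + sorted query params" and the TTL table could fit together. This is illustrative rather than the `responseCache.ts` implementation: `cacheKey`, `ttlFor`, and the `api.example.com` host in the usage lines are assumptions. The prefixes are matched against path plus query string so that the trailing `?` in the markets pattern keeps its meaning.

```typescript
// TTLs per pattern, taken from the table above (milliseconds).
const TTL_RULES: Array<{ prefix: string; ttlMs: number }> = [
  { prefix: "/trade-api/v2/markets?", ttlMs: 5 * 60 * 1000 },
  { prefix: "/trade-api/v2/events/", ttlMs: 30 * 60 * 1000 },
  { prefix: "/trade-api/v2/orderbook/v2/", ttlMs: 30 * 1000 },
];

// Normalizes query order so ?b=2&a=1 and ?a=1&b=2 share one cache entry.
function cacheKey(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.searchParams.sort();
  return url.origin + url.pathname + url.search;
}

// Returns the TTL for a GET URL, or undefined when the URL is not cacheable.
function ttlFor(rawUrl: string): number | undefined {
  const url = new URL(rawUrl);
  const target = url.pathname + url.search;
  return TTL_RULES.find((rule) => target.startsWith(rule.prefix))?.ttlMs;
}

const keyA = cacheKey("https://api.example.com/trade-api/v2/markets?status=open&limit=200");
const keyB = cacheKey("https://api.example.com/trade-api/v2/markets?limit=200&status=open");
// keyA and keyB are the same string, so both requests hit one cache entry.
```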

### 6.5 Smart Relay (Currently Disabled)

When `SMART_RELAY_ENABLED=true`:

1. **KalshiFetcher** discovers NBA markets, subscribes to all orderbooks via WS
2. **PolymarketFetcher** discovers matching Polymarket markets
3. **MarketCache** aggregates data from both venues, emits update events
4. **StreamBroadcaster** fans out updates to clients on `/stream/markets` WS
5. **ApiKeyStore** rotates API keys, handles rate limits, auto-disables on 401

**Backpressure:** `StreamBroadcaster` has `CLIENT_HIGH_WATER_MARK = 100` — disconnects slow clients with close code 4002.

**Current state:** Code is built and compiles. Has not been tested in production. Estimated ~70% complete for full production readiness.

### 6.6 What the Relay Doesn't Do

| Missing Capability         | Impact                                               |
| -------------------------- | ---------------------------------------------------- |
| Rate limiting (per-client) | Vulnerable to client-side abuse                      |
| Request deduplication      | Concurrent identical requests all forwarded          |
| Circuit breaking           | Cascading failures when upstream is down             |
| Response compression       | Larger payloads than necessary                       |
| Distributed tracing        | Hard to debug production issues                      |
| Multi-region failover      | Single upstream target per connection                |
| Backpressure (basic relay) | Slow clients can accumulate unbounded message queues |

---

## 7. Identified Gaps

### 7.1 Critical (Must Fix)

| ID      | Gap                                    | Impact                                          | Component          |
| ------- | -------------------------------------- | ----------------------------------------------- | ------------------ |
| GAP-001 | Frontend WS has NO auto-reconnect      | Relay crash = page reload required              | `relayWs.ts`       |
| GAP-002 | No stale data indication               | User sees frozen prices and thinks they're live | All price displays |
| GAP-003 | Silent error suppression in API client | "No data" indistinguishable from "API down"     | `kalshiApi.ts`     |

### 7.2 High (Should Fix)

| ID      | Gap                           | Impact                                       | Component                |
| ------- | ----------------------------- | -------------------------------------------- | ------------------------ |
| GAP-004 | No circuit breaker            | Dead API gets hammered every 2s forever      | `orderMonitor.ts`        |
| GAP-005 | 5-30s cold start              | User waits before seeing any data            | `sportsStream/stream.ts` |
| GAP-006 | No 429/rate-limit detection   | Cascading failures on rate limit             | `kalshiApi.ts`           |
| GAP-007 | Order fill detection lag (2s) | User doesn't know fill happened for up to 2s | `orderMonitor.ts`        |

### 7.3 Medium (Nice to Fix)

| ID      | Gap                                      | Impact                                  | Component         |
| ------- | ---------------------------------------- | --------------------------------------- | ----------------- |
| GAP-008 | No request deduplication                 | Multiple components flood same endpoint | Various           |
| GAP-009 | No connection health visibility          | User can't see data source status       | UI                |
| GAP-010 | Unbounded WS message queue (basic relay) | Memory growth with slow clients         | `wsRelay.ts`      |
| GAP-011 | Polling without jitter                   | Thundering herd with multiple tabs      | `orderMonitor.ts` |
| GAP-012 | No observability/metrics                 | Hard to diagnose production issues      | Relay server      |

---

## 8. Architecture Options Considered

Three options were evaluated:

### Option A: "Supercharged Relay"

Keep current model, make relay smarter. Frontend still signs everything, relay caches reads.

- **Pros:** Simple, low risk, builds on existing code
- **Cons:** Still per-browser discovery if smart relay is down

### Option B: "Full Server-Side"

Move all credentials to server. Users log in with username/password. Server signs requests.

- **Pros:** Multi-user, proper security, server-managed connections
- **Cons:** Large effort (2-3 weeks), new auth infrastructure, database required

### Option C: "Hybrid" (CHOSEN)

Server handles ALL reads (discovery, streaming, caching). Client only signs writes (orders).

- **Pros:** Maximum speed win, minimum disruption, no auth changes, Smart Relay already 70% built
- **Cons:** Still single-user, still plaintext keys in browser for orders

### Decision Matrix

| Factor          | A: Supercharged | B: Full Server | C: Hybrid             |
| --------------- | --------------- | -------------- | --------------------- |
| Cold start      | <1s             | <1s            | Instant               |
| Read latency    | ~50ms (cache)   | ~50ms          | ~10ms (push)          |
| Write latency   | Same            | Same           | Same                  |
| Multi-user      | No              | Yes            | No                    |
| Private keys    | Browser         | Server         | Browser (orders only) |
| Database needed | No              | Yes            | No                    |
| Auth changes    | None            | Major          | None                  |
| Effort          | Medium          | Large          | Medium-low            |
| Risk            | Low             | Medium         | Low                   |

**Chosen: Option C (Hybrid)** — solves the actual pain points (cold start, read reliability) without touching the auth model or requiring a database.

---

## 9. Chosen Approach: Hybrid Model

### Target Architecture

```
┌─────────────────────────────────────────┐
│  Browser (Dashboard)                    │
│  ├─ READS: Single WS to /stream/markets │◄── Instant data, push model
│  ├─ WRITES: RSA-PSS signed orders       │──► Through /relay/http (unchanged)
│  ├─ Connection state tracking            │
│  └─ Stale data visual indicators         │
└──────────────────────┬──────────────────┘
                       │
              ┌────────┴────────┐
              ▼                 ▼
┌──────────────────┐   ┌───────────────┐
│  Relay Server    │   │  /relay/http  │
│  Smart Relay ON  │   │  (unchanged)  │
│  ├─ KalshiFetcher│   │  byte-forward │
│  ├─ PolyFetcher  │   └───────────────┘
│  ├─ MarketCache  │
│  ├─ Broadcaster  │
│  └─ ApiKeyStore  │
│     (rate limits,│
│      rotation)   │
└────────┬─────────┘
         │
    ┌────┴──────────────────────────┐
    ▼                ▼              ▼
┌────────┐    ┌────────────┐  ┌─────────┐
│ Kalshi │    │ Polymarket │  │ Odds API│
│ REST+WS│    │ Gamma+CLOB │  │ REST    │
└────────┘    └────────────┘  └─────────┘
```

### Key Differences from Current

| Aspect                | Current                    | Target                                  |
| --------------------- | -------------------------- | --------------------------------------- |
| Read data source      | Browser fetches directly   | Relay streams to browser                |
| Cold start            | 5-30s (5-phase sequential) | <1s (relay already warm)                |
| Upstream connections  | 1 per browser per stream   | 1 total on relay, fan-out to N browsers |
| WS failure handling   | Silent freeze              | Auto-reconnect + grey stale prices      |
| Error visibility      | Silent `[]` returns        | Connection state UI + error surfacing   |
| Rate limit management | None                       | ApiKeyStore with rotation + backoff     |

---

## 10. Phase 1: Stop the Bleeding

**North star:** No more silent failures. User always knows the state of their data.

### 1A — Frontend WS Auto-Reconnect

**Problem:** `relayWs.ts` has zero reconnection logic. One relay hiccup = dead prices, manual page reload.

**Solution:** Add exponential backoff reconnection, matching the pattern already used in `smartRelayStream.ts`.

**Scope:**

- Track connection state: `connected | reconnecting | disconnected`
- On close (retryable codes): start reconnect timer with exponential backoff
- Backoff: `min(1000 * 2^attempt, 30000)` — 1s, 2s, 4s, 8s, 16s, 30s cap
- Max attempts: 20 (then give up, show "Connection lost" permanently)
- On reconnect success: re-subscribe all active tickers from `subscribedTickers` Set
- Expose connection state via callback so UI layer can react
- Non-retryable codes (1000, 4000+): don't retry, surface error
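
The backoff and give-up rules in this scope can be sketched as pure functions, leaving the timer and the socket to the caller. Names such as `reconnectDelayMs` and `planAfterClose` are hypothetical; the constants come from the bullets above.

```typescript
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 30000;
const MAX_ATTEMPTS = 20;

// attempt is 0-based: 0 -> 1s, 1 -> 2s, 2 -> 4s, 3 -> 8s, 4 -> 16s, then capped at 30s.
function reconnectDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

type ReconnectPlan =
  | { status: "reconnecting"; delayMs: number }
  | { status: "disconnected" };

// Decides what happens after a close event. Per the scope above, code 1000
// and codes 4000+ are final, and so is running out of attempts.
function planAfterClose(closeCode: number, attempt: number): ReconnectPlan {
  const isFinal = closeCode === 1000 || closeCode >= 4000;
  if (isFinal || attempt >= MAX_ATTEMPTS) return { status: "disconnected" };
  return { status: "reconnecting", delayMs: reconnectDelayMs(attempt) };
}
```

Keeping the decision separate from the timer makes the schedule easy to unit test without a real WebSocket.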

**Files to modify:**

- `apps/dashboard/src/lib/relayWs.ts` — add reconnection logic
- `apps/dashboard/src/lib/marketStream.ts` — re-subscribe on reconnect

**Dependencies:** None. Can start immediately.
**Risk:** Low — additive code, doesn't change happy-path behavior.
**Addresses:** GAP-001

---

### 1B — Connection State Tracking + Stale Data UI

**Problem:** When data stops flowing, the user has no visual indication. Numbers look live but are minutes old.

**Solution:** Track connection state per data source. When disconnected or stale, apply visual treatment to all live-data cells.

**Scope:**

**Connection state hook:**

```ts
type SourceState = {
  status: 'connected' | 'reconnecting' | 'disconnected';
  lastMessageAt: number; // assumed epoch ms, as from Date.now()
};
declare function useConnectionState(): {
  kalshi: SourceState;
  polymarket: SourceState;
  overall: 'healthy' | 'degraded' | 'disconnected';
};
```

**Stale detection rules:**

- If WS status is `disconnected` or `reconnecting` → immediately stale
- If WS status is `connected` but no message received for >5 seconds → stale
- Per-source staleness (Kalshi can be stale while Polymarket is live)
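
These rules fit in a few lines. In the sketch below, `isStale` follows the rules as written; `overallHealth` is an assumed roll-up, since the proposal names the three `overall` values but does not define how they are derived.

```typescript
type SourceStatus = "connected" | "reconnecting" | "disconnected";

interface SourceState {
  status: SourceStatus;
  lastMessageAt: number; // epoch ms of the last message from this source
}

const STALE_AFTER_MS = 5000;

// Stale immediately when not connected; otherwise stale after 5s of silence.
function isStale(source: SourceState, now: number): boolean {
  if (source.status !== "connected") return true;
  return now - source.lastMessageAt > STALE_AFTER_MS;
}

// Assumed roll-up rule: all sources down -> disconnected, any stale -> degraded.
function overallHealth(
  sources: SourceState[],
  now: number,
): "healthy" | "degraded" | "disconnected" {
  if (sources.every((s) => s.status === "disconnected")) return "disconnected";
  return sources.some((s) => isStale(s, now)) ? "degraded" : "healthy";
}
```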

**Visual treatment for stale data:**

- All price/odds cells: `opacity-40` (dramatically faded, clearly "off")
- Subtle desaturation on colored elements (green/red P&L → grey)
- Apply via CSS class toggled by connection state, NOT per-cell logic
- When connection recovers: snap back to full opacity immediately (no animation delay)

**Connection banner:**

- Fixed position banner below header
- Yellow: "Reconnecting... (attempt 3/20)" — visible but not alarming
- Red: "Connection lost. Retrying..." during an extended outage while attempts remain; "Connection lost" (no retry text) once the 20 attempts from 1A are exhausted
- Green flash: "Reconnected" — shows for 2 seconds on recovery, then disappears
- Banner does NOT block interaction (user can still place orders via REST)

**Components to apply stale treatment:**

- `NbaOddsTable` — all odds cells
- `ValueOddsTable` — all odds cells
- `OrderBookPanel` — all price/quantity cells
- Any `<Money>` component showing live-streamed data
- Spread/total displays
- Polymarket odds columns

**Files to create:**

- `apps/dashboard/src/hooks/useConnectionState.ts` — connection state tracking
- `apps/dashboard/src/components/ConnectionBanner.tsx` — reconnection banner

**Files to modify:**

- `apps/dashboard/src/components/` — odds table components (add stale class)
- `apps/dashboard/src/globals.css` — stale data CSS utilities

**Dependencies:** 1A (needs connection state events from WS layer)
**Risk:** Low — pure UI layer, no data logic changes.
**Addresses:** GAP-002

---

### 1C — Circuit Breaker + Error Surfacing

**Problem:** When Kalshi is down, OrderMonitor hammers it every 2s. When API returns errors, they're silently swallowed as empty arrays.

**Solution:** Circuit breaker utility + replace silent error suppression.

**Circuit breaker spec:**

```
States: CLOSED (normal) → OPEN (failing) → HALF_OPEN (testing)

CLOSED → OPEN: After 5 consecutive failures
OPEN: All requests immediately fail (no API call) for 30s cooldown
OPEN → HALF_OPEN: After cooldown, allow one test request
HALF_OPEN → CLOSED: If test succeeds
HALF_OPEN → OPEN: If test fails, reset cooldown
```
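
A minimal sketch of this state machine, assuming a synchronous gate that callers consult before each request and report back to afterwards. The class shape and method names are illustrative, not a committed API for `circuitBreaker.ts`; the clock is injectable so the cooldown can be tested without waiting.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 5, cooldownMs = 30000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True if a request may go out. In OPEN, the first call after the cooldown
  // moves to HALF_OPEN and is let through as the single test request.
  canRequest(): boolean {
    if (this.state === "CLOSED") return true;
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "HALF_OPEN";
      return true;
    }
    return false; // OPEN within cooldown, or HALF_OPEN with a test in flight
  }

  onSuccess(): void {
    this.state = "CLOSED";
    this.failures = 0; // only consecutive failures count
  }

  onFailure(): void {
    this.failures += 1;
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = this.now(); // start or reset the cooldown
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `canRequest()` before each poll or REST call, then call `onSuccess()` or `onFailure()` with the outcome.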

**Apply to:**

- `OrderMonitor` polling — when circuit opens, pause polling, surface "Order monitoring paused" to UI
- `kalshiApi` REST calls — when circuit opens, fail fast instead of waiting for timeout
- Circuit state exposed to connection state hook (feeds into stale UI from 1B)

**Error surfacing:**

- Replace `return []` patterns with proper error propagation in critical paths
- Add `{ data: T[], error?: string }` return type where silent failure currently exists
- At minimum: `getOpenOrders()`, `getPositions()`, `getMarkets()` should surface errors
- Non-critical paths (discovery, metadata) can remain lenient
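
To show why the `{ data, error }` shape matters to the UI, here is a small sketch of a consumer. `ApiResult` mirrors the return type proposed above; `tableState` and its three labels are hypothetical names for illustration.

```typescript
interface ApiResult<T> {
  data: T[];
  error?: string;
}

type TableState = "error" | "empty" | "rows";

// With the old `return []` pattern, "error" and "empty" were indistinguishable.
function tableState<T>(result: ApiResult<T>): TableState {
  if (result.error !== undefined) return "error";
  return result.data.length === 0 ? "empty" : "rows";
}

const apiDown = tableState({ data: [], error: "Kalshi API returned 500" });
const noOrders = tableState({ data: [] });
```

Here `apiDown` and `noOrders` both carry an empty `data` array, but only the first one should render an error message.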

**Files to create:**

- `apps/dashboard/src/lib/circuitBreaker.ts` — reusable circuit breaker utility

**Files to modify:**

- `apps/dashboard/src/lib/kalshiApi.ts` — wrap key methods with circuit breaker
- `apps/dashboard/src/lib/orderMonitor.ts` — use circuit breaker, expose state

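**Dependencies:** None for the breaker and error surfacing, which can run in parallel with 1A/1B. Feeding circuit state into the connection state hook needs 1B.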
**Risk:** Medium — modifying error paths in `kalshiApi.ts` requires care not to break callers.
**Addresses:** GAP-003, GAP-004

---

## 11. Phase 2: Warm the Cache

**North star:** Eliminate cold start. Data is already warm when user opens the page.

### 2A — Smart Relay as Default Read Path

**Problem:** Every browser does its own market discovery, orderbook fetching, and WS subscription. Wasteful and slow.

**Solution:** Enable Smart Relay in production. Relay subscribes to all markets once, caches everything, serves clients from memory.

**Scope:**

- Set `SMART_RELAY_ENABLED=true` in production relay config
- Configure `KALSHI_API_KEYS` with at least one API key for server-side market reads
- Verify `KalshiFetcher` discovers markets correctly (NBA, CBB, etc.)
- Verify `MarketCache` stores and emits updates properly
- Verify `StreamBroadcaster` fans out to connected clients on `/stream/markets`
- Test `PolymarketFetcher` integration with Gamma API
- Add health monitoring: cache size, last update time, upstream connection status
- Test relay restart behavior: does it re-warm cache automatically?

**Deployment changes:**

- PM2 config: increase memory limit (cache will use more RAM)
- Environment variables to add:
  - `SMART_RELAY_ENABLED=true`
  - `KALSHI_API_KEYS=<accessKey>:<base64PEM>`
  - `STREAM_TOKEN=<random token for client auth>`
- Verify `/health` endpoint reports smart relay status
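
A startup check for these variables might look like the sketch below. `validateRelayEnv` is a hypothetical helper, and it treats `KALSHI_API_KEYS` as a single `<accessKey>:<base64PEM>` entry because the separator for multiple keys is not specified in this report.

```typescript
interface RelayEnvCheck {
  ok: boolean;
  problems: string[];
}

// Validates the three variables listed above. Pass process.env (or a test object).
function validateRelayEnv(env: Record<string, string | undefined>): RelayEnvCheck {
  const problems: string[] = [];

  if (env["SMART_RELAY_ENABLED"] !== "true") {
    problems.push("SMART_RELAY_ENABLED is not set to true");
  }

  const rawKey = env["KALSHI_API_KEYS"] ?? "";
  const sep = rawKey.indexOf(":");
  if (sep <= 0 || sep === rawKey.length - 1) {
    problems.push("KALSHI_API_KEYS must look like <accessKey>:<base64PEM>");
  } else if (!/^[A-Za-z0-9+/]+={0,2}$/.test(rawKey.slice(sep + 1))) {
    // Shape check only: the PEM is not decoded or parsed here.
    problems.push("KALSHI_API_KEYS key material is not base64");
  }

  if (!env["STREAM_TOKEN"]) {
    problems.push("STREAM_TOKEN is not set");
  }

  return { ok: problems.length === 0, problems };
}
```

Failing fast at boot on a non-empty `problems` list keeps a misconfigured relay from appearing healthy while serving an empty cache.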

**Blocker:** Needs a Kalshi API key configured on the server. Can use the same key currently in the browser, or create a dedicated read-only key.

**Dependencies:** API key configuration (StreamRift action).
**Risk:** Medium — smart relay code exists but hasn't been battle-tested in production.
**Addresses:** GAP-005

---

### 2B — Frontend Reads from Relay Stream

**Problem:** Frontend runs a 5-phase sequential init that takes 5-30 seconds.

**Solution:** Frontend connects to `/stream/markets` WS and receives pre-cached data instantly.

**Scope:**

**New hook: `useSmartRelayData()`**

- Connects to `/stream/markets` WS with `STREAM_TOKEN` auth
- On connect: relay sends full market snapshot (instant hydration)
- On updates: relay pushes deltas
- Exposes same data shape as current `sportsStream` output
- Falls back to current 5-phase init if smart relay is unavailable
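
The snapshot-then-deltas behaviour can be kept as a pure function that the hook calls for every incoming message. The wire format below (`snapshot` / `delta` messages, `MarketState` fields) is an assumption made for illustration; the real relay format should be taken from `StreamBroadcaster`.

```typescript
// Hypothetical per-market state and wire messages, for illustration only.
interface MarketState {
  ticker: string;
  yesBid: number;
  yesAsk: number;
}

type RelayMessage =
  | { type: "snapshot"; markets: MarketState[] }
  | { type: "delta"; ticker: string; patch: Partial<Omit<MarketState, "ticker">> };

// Returns the next state without mutating the current one, so it can be used
// directly as a React state updater.
function applyRelayMessage(
  current: ReadonlyMap<string, MarketState>,
  msg: RelayMessage,
): ReadonlyMap<string, MarketState> {
  if (msg.type === "snapshot") {
    // A snapshot replaces everything (instant hydration on connect or reconnect).
    const next = new Map<string, MarketState>();
    msg.markets.forEach((m) => next.set(m.ticker, m));
    return next;
  }
  const existing = current.get(msg.ticker);
  if (existing === undefined) {
    // Delta for a market we have never seen: ignore it and wait for a snapshot.
    return current;
  }
  const next = new Map(current);
  next.set(msg.ticker, { ...existing, ...msg.patch });
  return next;
}
```

Keeping this logic outside the hook makes the fallback path easier to reason about: whichever source is active, downstream components only ever see a map of market state.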

**Migration strategy:**

- Add feature flag: `VITE_USE_SMART_RELAY=true|false`
- When true: use `useSmartRelayData()` for reads
- When false: use current `sportsStream` init (unchanged)
- Both paths produce the same data shape for downstream components
- Current discovery code stays as fallback, not deleted
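
The flag and the availability fallback reduce to one small decision. In a Vite app the flag would be read from `import.meta.env.VITE_USE_SMART_RELAY`, which is always a string or undefined; `chooseReadPath` and the `ReadPath` names are hypothetical.

```typescript
type ReadPath = "smartRelay" | "legacyInit";

// flag: raw value of VITE_USE_SMART_RELAY. relayAvailable: result of whatever
// reachability check the app performs before committing to the relay stream.
function chooseReadPath(flag: string | undefined, relayAvailable: boolean): ReadPath {
  return flag === "true" && relayAvailable ? "smartRelay" : "legacyInit";
}
```

Anything other than the exact string `"true"` selects the current init path, so a missing or mistyped variable fails safe.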

**Data format alignment:**

- Smart relay must emit data in format compatible with current table components
- Map relay's `CachedMarket` + `CachedOrderbook` → existing `GameData` shape
- Adapter layer handles any format differences

**Files to create:**

- `apps/dashboard/src/hooks/useSmartRelayData.ts` — relay stream consumer
- `apps/dashboard/src/lib/smartRelayAdapter.ts` — data format adapter (may already exist partially)

**Files to modify:**

- `apps/dashboard/src/components/pages/NbaValueDashboardView.tsx` — consume new hook
- `apps/dashboard/src/components/pages/ValueDashboardView.tsx` — consume new hook

**Dependencies:** 2A (relay must be serving data)
**Risk:** Medium — largest code change in the project. Need to ensure data format compatibility.
**Addresses:** GAP-005

---

### 2C — Kill Cold Start

**Problem:** Even with Smart Relay, the relay itself needs time to warm up after restart.

**Solution:** Relay pre-warms cache on boot. First client gets full snapshot in <1 second.

**Scope:**

- Relay boot sequence: discover → fetch → subscribe → cache warm → ready
- `/health` reports `{ cacheWarmed: true/false, marketsLoaded: N, lastUpdate: timestamp }`
- `StreamBroadcaster` holds latest full snapshot in memory
- First message to new client = full snapshot (not just "subscribe and wait for deltas")
- Loading spinner in frontend: shown only until first snapshot received (should be <1s if relay is warm)
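
The "first message is a full snapshot" rule can be sketched as below. This is not the existing `StreamBroadcaster`; the class name, frame format, and the choice to park early clients until the cache is warm are all assumptions. `marketsLoaded` is omitted because it depends on the snapshot type.

```typescript
type Send = (frame: string) => void;

class SnapshotBroadcaster<T> {
  private snapshot: T | null = null;
  private lastUpdate: number | null = null;
  private readonly clients = new Set<Send>();
  private readonly waiting = new Set<Send>(); // connected before the cache was warm

  /** Called by the cache owner after every change; delta is whatever changed. */
  publish(fullSnapshot: T, delta: unknown, nowMs: number = Date.now()): void {
    this.snapshot = fullSnapshot;
    this.lastUpdate = nowMs;
    const deltaFrame = JSON.stringify({ type: "delta", delta });
    this.clients.forEach((send) => send(deltaFrame));
    if (this.waiting.size > 0) {
      // Clients that arrived during warm-up get the full snapshot as their first frame.
      const snapshotFrame = JSON.stringify({ type: "snapshot", data: fullSnapshot });
      this.waiting.forEach((send) => {
        send(snapshotFrame);
        this.clients.add(send);
      });
      this.waiting.clear();
    }
  }

  /** Registers a client and returns an unsubscribe function. */
  addClient(send: Send): () => void {
    if (this.snapshot === null) {
      this.waiting.add(send);
    } else {
      send(JSON.stringify({ type: "snapshot", data: this.snapshot }));
      this.clients.add(send);
    }
    return () => {
      this.clients.delete(send);
      this.waiting.delete(send);
    };
  }

  /** Subset of the /health payload described above. */
  health(): { cacheWarmed: boolean; lastUpdate: number | null } {
    return { cacheWarmed: this.snapshot !== null, lastUpdate: this.lastUpdate };
  }
}
```

Because a client never receives a delta before its snapshot, the frontend spinner can be tied to a single condition: "no snapshot frame received yet".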

**Dependencies:** 2A + 2B
**Risk:** Low — natural result of 2A + 2B working correctly.
**Addresses:** GAP-005

---

## 12. Phase 3: Harden the Edges

**North star:** Production-grade resilience for sustained operation.

### 3A — Rate Limit Awareness

**Scope:**

- Detect 429 responses in `kalshiApi.ts`, parse `Retry-After` header
- Client-side request queue with backoff when rate limited
- Wire `ApiKeyStore` rate limit tracking into relay HTTP path (not just smart relay)
- Log rate limit events for monitoring
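
`Retry-After` can carry either a number of seconds or an HTTP date, so the parser needs to handle both. The sketch below uses hypothetical helper names and a 1s fallback when the header is missing or unparseable, which is an assumption rather than documented Kalshi behaviour.

```typescript
// Minimal structural type so this works with fetch's Response or a test double.
interface ResponseLike {
  status: number;
  headers: { get(name: string): string | null };
}

// Returns the delay in milliseconds, or null if the header is absent or invalid.
function parseRetryAfterMs(headerValue: string | null, nowMs: number = Date.now()): number | null {
  if (headerValue === null) return null;
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000; // delay-seconds form
  const dateMs = Date.parse(trimmed); // HTTP-date form
  if (isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

// Returns how long to back off for a 429, or null if the response is not rate limited.
function rateLimitDelayMs(res: ResponseLike, fallbackMs = 1000, nowMs: number = Date.now()): number | null {
  if (res.status !== 429) return null;
  const parsed = parseRetryAfterMs(res.headers.get("Retry-After"), nowMs);
  return parsed === null ? fallbackMs : parsed;
}
```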

**Addresses:** GAP-006

### 3B — Request Deduplication

**Scope:**

- Pending request map keyed by URL + method
- If an identical request is in flight, return the same Promise (coalesce)
- Window: 100ms. Requests to the same endpoint within 100ms of each other share one API call
- Apply to: `getOrderbook()`, `getOpenOrders()`, `getPositions()`
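
A minimal sketch of the coalescing map follows. `RequestDeduper` is a hypothetical name, and the window is interpreted here as "share the call while it is in flight and for at least 100ms after it started"; failed requests are released immediately so a retry is never served a cached rejection.

```typescript
class RequestDeduper {
  private readonly pending = new Map<string, Promise<unknown>>();
  private readonly windowMs: number;

  constructor(windowMs = 100) {
    this.windowMs = windowMs;
  }

  /** Runs send() unless an identical method + URL is already pending; then shares it. */
  run<T>(method: string, url: string, send: () => Promise<T>): Promise<T> {
    const key = `${method.toUpperCase()} ${url}`;
    const existing = this.pending.get(key);
    if (existing !== undefined) return existing as Promise<T>;

    const startedAt = Date.now();
    const promise = send();
    this.pending.set(key, promise);

    const release = (): void => {
      // Only delete our own entry, in case the key was reused in the meantime.
      if (this.pending.get(key) === promise) this.pending.delete(key);
    };
    promise.then(
      () => {
        const remaining = this.windowMs - (Date.now() - startedAt);
        if (remaining > 0) setTimeout(release, remaining);
        else release();
      },
      release, // on failure, drop the entry right away
    );
    return promise;
  }
}
```

Keying on method plus full URL (including query string) means two orderbook requests for different tickers are never merged.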

**Addresses:** GAP-008

### 3C — Connection Health UI

**Scope:**

- Status indicator in sidebar or header: green/yellow/red dot per data source
- Tooltip: last update timestamp, connection attempt count
- Optional: relay cache hit rate from `/health` endpoint
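
The dot colour can be derived from two inputs per data source. The mapping below (red when disconnected, yellow when connected but stale, green otherwise) and the 5s default threshold are assumptions; the threshold in particular is still an open question for 1B.

```typescript
type HealthColor = "green" | "yellow" | "red";

interface SourceHealth {
  connected: boolean;
  lastUpdateMs: number | null; // epoch ms of the last message, null if none yet
}

function healthColor(source: SourceHealth, nowMs: number, staleAfterMs = 5000): HealthColor {
  if (!source.connected) return "red";
  if (source.lastUpdateMs === null || nowMs - source.lastUpdateMs > staleAfterMs) {
    return "yellow"; // connected, but nothing fresh to show
  }
  return "green";
}
```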

**Addresses:** GAP-009

### 3D — Polling Jitter

**Scope:**

- Add random ±500ms jitter to OrderMonitor's 2s interval
- Prevents thundering herd with multiple tabs
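
A sketch of the jittered schedule, assuming the poller is re-armed with `setTimeout` after each tick instead of using a fixed `setInterval`. Function names are hypothetical.

```typescript
// Uniform delay in [baseMs - jitterMs, baseMs + jitterMs); 1500..2500ms with the defaults.
function jitteredDelayMs(
  baseMs = 2000,
  jitterMs = 500,
  random: () => number = Math.random,
): number {
  return baseMs + (random() * 2 - 1) * jitterMs;
}

// Re-arms itself after every poll so each gap gets fresh jitter.
// Returns a stop function. Errors are handed to onError, not swallowed.
function startJitteredPolling(
  poll: () => Promise<void>,
  onError: (err: unknown) => void,
): () => void {
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tick = (): void => {
    poll()
      .catch(onError)
      .then(() => {
        if (!stopped) timer = setTimeout(tick, jitteredDelayMs());
      });
  };

  timer = setTimeout(tick, jitteredDelayMs());
  return () => {
    stopped = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

Scheduling the next tick only after the previous poll settles also prevents overlapping polls when a request takes longer than the interval.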

**Addresses:** GAP-011

---

## 13. Dependency Graph & Work Order

### Dependency Graph

```
1A (WS reconnect) ──────→ 1B (stale UI) ──────┐
                                              ▼
2A (smart relay) ──────────────────────→ 2B (frontend reads relay)
       │                                             │
       └─────────────────────────────────────────┐   │
                                                 ▼   ▼
                                         2C (kill cold start)
                                                   │
                                                   ▼
                                        3A, 3B, 3C, 3D (polish)

1C (circuit breaker): no blocking edges; runs in parallel with 1A
```

### Parallelism

- **1A and 1C** can run simultaneously (no shared code)
- **2A** (server config) can start during Phase 1 (no frontend changes)
- **3A, 3B, 3C, 3D** are all independent of each other

### Work Order

| Order | Task     | Description                           | Blocks | Can Parallel With      |
| ----- | -------- | ------------------------------------- | ------ | ---------------------- |
| 1     | **1A**   | WS auto-reconnect                     | 1B     | 1C, 2A (server config) |
| 2     | **1C**   | Circuit breaker + error surfacing     | —      | 1A                     |
| 3     | **1B**   | Stale data UI (grey/fade treatment)   | 2B     | 2A                     |
| 4     | **2A**   | Smart Relay enabled in production     | 2B, 2C | 1B                     |
| 5     | **2B**   | Frontend relay stream adapter         | 2C     | —                      |
| 6     | **2C**   | Kill cold start (pre-warm + snapshot) | 3\*    | —                      |
| 7     | **3A-D** | Rate limits, dedup, health UI, jitter | —      | Each other             |

### Estimated Complexity (Not Time)

| Task | New Files | Modified Files | Complexity | Risk                           |
| ---- | --------- | -------------- | ---------- | ------------------------------ |
| 1A   | 0         | 2              | Low        | Low                            |
| 1B   | 2         | 4-6            | Medium     | Low                            |
| 1C   | 1         | 2-3            | Medium     | Medium                         |
| 2A   | 0         | Config only    | Low        | Medium (first production test) |
| 2B   | 2         | 2-3            | High       | Medium                         |
| 2C   | 0         | 1-2            | Low        | Low                            |
| 3A   | 0         | 1-2            | Low        | Low                            |
| 3B   | 1         | 1-2            | Low        | Low                            |
| 3C   | 1         | 1-2            | Low        | Low                            |
| 3D   | 0         | 1              | Trivial    | None                           |

---

## 14. Open Questions & Blockers

### Blocker: Server-Side API Key

**Phase 2A requires a Kalshi API key configured on the relay server** for market discovery and orderbook subscriptions. Two options:

1. **Reuse existing key** — same key in browser + server. Simple but key is in two places.
2. **Create dedicated read-only key** — better isolation, separate rate limit pool.

**Action required:** StreamRift to decide approach and configure `KALSHI_API_KEYS` env var on server.

### Open Questions

| #   | Question                                                          | Impact            | Decision Needed By |
| --- | ----------------------------------------------------------------- | ----------------- | ------------------ |
| 1   | Stale threshold: 5 seconds or configurable?                       | 1B implementation | Phase 1 start      |
| 2   | Should stale treatment apply to Polymarket columns independently? | 1B scope          | Phase 1 start      |
| 3   | Smart Relay: NBA only or all sports?                              | 2A configuration  | Phase 2 start      |
| 4   | Feature flag for smart relay reads, or hard cutover?              | 2B implementation | Phase 2 start      |
| 5   | Should circuit breaker state persist across page reloads?         | 1C design         | Phase 1 start      |
| 6   | Connection banner: always visible or only on failure?             | 1B design         | Phase 1 start      |

---

## 15. Appendix: File Reference

### Frontend — Data Pipeline

| File                              | Purpose                    | Polling | Caching               | Reconnect      |
| --------------------------------- | -------------------------- | ------- | --------------------- | -------------- |
| `lib/sportsDiscovery/discover.ts` | Market discovery REST      | No      | No                    | N/A            |
| `lib/sportsStream/stream.ts`      | 5-phase init orchestrator  | No      | Poly cache (session)  | No             |
| `lib/marketStream.ts`             | Kalshi WS orderbook stream | No      | In-memory (levels)    | **No**         |
| `lib/kalshiApi.ts`                | REST client + RSA-PSS auth | No      | No                    | 3× retry       |
| `lib/orderMonitor.ts`             | Order fill detection       | **2s**  | Implicit (prev state) | N/A            |
| `lib/relayHttp.ts`                | HTTP relay client          | No      | No                    | No             |
| `lib/relayWs.ts`                  | WS relay client            | No      | No                    | **No**         |
| `lib/smartRelayStream.ts`         | Smart relay WS client      | No      | No                    | Yes (20×)      |
| `lib/oddsApi.ts`                  | External sportsbook odds   | No      | **15min TTL**         | N/A            |
| `lib/polymarket/marketStream.ts`  | Polymarket WS stream       | No      | No                    | Yes (infinite) |

### Relay Server

| File                                   | Purpose                         | Key Feature                            |
| -------------------------------------- | ------------------------------- | -------------------------------------- |
| `src/index.ts`                         | Server setup, routes, lifecycle | Graceful shutdown (5s)                 |
| `src/httpRelay.ts`                     | HTTP forwarding                 | Byte-faithful, domain validation       |
| `src/wsRelay.ts`                       | WS multiplexing                 | 1:N streams, upstream reconnect        |
| `src/responseCache.ts`                 | GET caching                     | TTL-based (5m/30m/30s)                 |
| `src/smart-relay/index.ts`             | SmartRelay orchestrator         | Concurrent fetcher startup             |
| `src/smart-relay/MarketCache.ts`       | Market data store               | Event-driven updates, 24h TTL          |
| `src/smart-relay/KalshiFetcher.ts`     | Kalshi server-side client       | REST + WS, full lifecycle              |
| `src/smart-relay/PolymarketFetcher.ts` | Polymarket server-side client   | Gamma API + WS                         |
| `src/smart-relay/StreamBroadcaster.ts` | Client fan-out                  | Backpressure (100 msg HWM)             |
| `src/smart-relay/ApiKeyStore.ts`       | API key management              | Round-robin, auto-disable, rate limits |

### Hooks & State

| File                                | Purpose                                              |
| ----------------------------------- | ---------------------------------------------------- |
| `hooks/useKalshiConnection.ts`      | Credential management, connect/disconnect            |
| `hooks/useVersionCheck.ts`          | Polls `/version.json` every 60s for deploy detection |
| `hooks/usePolymarketIntegration.ts` | Polymarket settings + key derivation                 |

---

_End of report. Ready for cross-thread review._
