Lesson 8

When Things Go Wrong

The previous lessons assumed caches were warm, coordinates were current, and links had enough room for every packet. Real meshes do not behave like that. Nodes join and leave, the tree reconverges, caches expire, and link MTUs drop when a transport switches from Ethernet to UDP-over-DSL. FIPS has to keep a session usable through all of it without tearing down the crypto state or falling back to manual routing.

This lesson covers the two mechanisms that keep the mesh self-healing: the coordinate warmup window at the start of every session, and the three error signals transit routers emit when they cannot forward a packet.

Why warmup exists

Multi-hop routing needs two things at every transit node: a reachability hint (bloom filter) and tree coordinates for the destination (coordinate cache). Without the coordinate cache, find_next_hop() returns None before bloom filters are even consulted. Discovering coordinates on every cache miss would be slow and chatty, so FIPS warms the caches along the path during session setup.
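The lookup order described above can be sketched as follows. This is only an illustration, not the FIPS implementation: it assumes coordinates are root-to-node paths stored as tuples, uses a plain Python set as a stand-in for each neighbor's bloom filter, and the helper tree_distance() and the argument layout of find_next_hop() are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Coords = Tuple[str, ...]  # assumed: path from the tree root down to the node


def tree_distance(a: Coords, b: Coords) -> int:
    """Hops between two nodes via their deepest common ancestor."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def find_next_hop(
    dest: str,
    my_coords: Coords,
    coord_cache: Dict[str, Coords],
    neighbors: Dict[str, Tuple[Coords, Set[str]]],
) -> Optional[str]:
    """Pick the neighbor strictly closer to dest, or None if there is none."""
    dest_coords = coord_cache.get(dest)
    if dest_coords is None:
        # Cache miss: give up before any bloom filter is consulted.
        return None
    best, best_dist = None, tree_distance(my_coords, dest_coords)
    for name, (coords, bloom) in neighbors.items():
        if dest not in bloom:  # reachability hint says "not via this peer"
            continue
        d = tree_distance(coords, dest_coords)
        if d < best_dist:
            best, best_dist = name, d
    return best


neighbors = {
    "up": (("root",), {"dst"}),
    "side": (("root", "a", "c"), {"dst"}),
}
cache = {"dst": ("root", "b", "dst")}
print(find_next_hop("dst", ("root", "a"), cache, neighbors))
print(find_next_hop("dst", ("root", "a"), {}, neighbors))
```

With a warm cache the call picks the neighbor that is closer to the destination; with an empty cache it returns None without looking at any neighbor, which is the case the warmup window exists to avoid.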

The warmup has three complementary sources:

  • SessionSetup carries cleartext coordinates in the routing envelope. Every transit node extracts source and destination coordinates as the handshake passes through, and caches both.
  • CP-flagged data packets. The first 5 data packets of a session set the CP flag in the FSP header and carry the same cleartext coordinates between the header and the ciphertext. Transit routers parse them without decrypting anything and refresh their caches.
  • Standalone CoordsWarmup (0x14). If piggy-backing coordinates onto a data packet would exceed the transport MTU, FSP emits a bare CoordsWarmup first. It uses the same CP format with an empty inner body and exists only to deliver coordinates to transit nodes.

Once the 5-packet window is over, FSP clears the CP flag and sends minimal 12-byte headers. At that point every router on the path should have current coordinates for both endpoints.
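A minimal sketch of the sender-side decision during the warmup window follows. The 5-packet window and the 12-byte minimal header come from the text; everything else is an assumption made for illustration: the class name SessionTx, the plan_send() method, the idea that a CP packet costs header plus coordinates plus payload, and the choice to count a standalone CoordsWarmup against the window. Encryption overhead is ignored.

```python
from dataclasses import dataclass
from typing import List

CP_WINDOW = 5          # data packets that carry the CP flag
BASE_HEADER_LEN = 12   # minimal FSP header once warmup is over


@dataclass
class SessionTx:
    coords_len: int        # bytes of cleartext coordinates (hypothetical)
    transport_mtu: int
    cp_remaining: int = CP_WINDOW

    def plan_send(self, payload_len: int) -> List[str]:
        """Return the message kinds to emit for one application payload."""
        if self.cp_remaining == 0:
            return ["data"]  # window over: minimal header only
        self.cp_remaining -= 1
        if BASE_HEADER_LEN + self.coords_len + payload_len <= self.transport_mtu:
            return ["data+CP"]  # coordinates piggy-back on the data packet
        # Coordinates do not fit: send them on their own first.
        return ["CoordsWarmup", "data"]


tx = SessionTx(coords_len=40, transport_mtu=1200)
for _ in range(7):
    print(tx.plan_send(100))
```

The first five calls yield a CP-flagged packet and the last two yield plain data packets. A payload close to the MTU would instead produce a standalone CoordsWarmup followed by an unflagged data packet.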

The three error signals

After warmup finishes, a transit router can still fail to forward a packet. Instead of dropping silently (the classic IP behavior), FIPS defines three explicit error signals. Each one travels back to the source as a new SessionDatagram, routed via find_next_hop(src_addr). The cryptographic session keeps running; only routing state is repaired.

CoordsRequired (0x20)

The transit node has no cached coordinates for the destination, so it has nothing to route on.

PathBroken (0x21)

Coordinates are cached, but no neighbor is closer to the destination than the transit node itself. Greedy routing hit a local minimum.

MtuExceeded (0x22)

The packet does not fit the next-hop link MTU. The mesh layer will not fragment, so the source has to shrink its payloads.
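The three conditions can be read as an ordered check at the transit node. The sketch below uses the message codes given above; the enum, the function classify_failure(), and its parameters are hypothetical names chosen for illustration, and the check order is inferred from the descriptions rather than taken from the FIPS source.

```python
from enum import IntEnum
from typing import Optional


class ErrorSignal(IntEnum):
    COORDS_REQUIRED = 0x20
    PATH_BROKEN = 0x21
    MTU_EXCEEDED = 0x22


def classify_failure(
    has_dest_coords: bool,
    next_hop: Optional[str],
    packet_len: int,
    next_hop_mtu: int,
) -> Optional[ErrorSignal]:
    """Return the signal to send back to the source, or None if forwardable."""
    if not has_dest_coords:
        return ErrorSignal.COORDS_REQUIRED  # nothing to route on
    if next_hop is None:
        return ErrorSignal.PATH_BROKEN      # no neighbor is closer: local minimum
    if packet_len > next_hop_mtu:
        return ErrorSignal.MTU_EXCEEDED     # no fragmentation at the mesh layer
    return None


print(classify_failure(False, None, 500, 1400))
print(classify_failure(True, None, 500, 1400))
print(classify_failure(True, "R3", 1500, 1400))
print(classify_failure(True, "R3", 500, 1400))
```

Each failure maps to exactly one signal, which is what lets the source choose a targeted repair instead of restarting the session.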

Watch a recovery unfold

Each recovery scenario starts from the same five-node path (Src → R1 → R2 → R3 → Dst) in steady state. A problem at R2 triggers the error, the signal walks back to Src, and Src drives the repair.

Warmup has finished. A transit router lost the destination's coordinates (cache expired, or the router just joined). It has no way to forward the packet.

Frame 1 of 12 (CP counter: 0 / 5): Steady state, then R2 drops its cache
Path state: Src (warm) → R1 (warm) → R2 (cold) → R3 (warm) → Dst (warm)

The session has been running. Warmup is done (CP counter is 0). R2's coordinate cache for Dst has just expired. R2 still has all its FMP link sessions, it just no longer knows where Dst sits in the tree.

Rate limits and why they matter

Two rate limits keep recovery from becoming a storm when the topology changes:

  • Transit side: at most one error signal per destination per 100 ms. If a transit router sees a hundred packets to the same unreachable destination in a burst, it emits one error signal, not a hundred.
  • Source side: at most one standalone CoordsWarmup per destination per 2000 ms in response to errors. This prevents a burst of errors from triggering a burst of warmup messages.
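Both limits follow the same pattern: remember when something was last emitted for a destination and suppress anything that arrives too soon after it. The sketch below shows that pattern with the two intervals from the list above. The class PerDestRateLimiter and its allow() method are hypothetical, and timestamps are passed in explicitly as milliseconds to keep the example deterministic.

```python
from typing import Dict


class PerDestRateLimiter:
    """Allow at most one event per destination per min_interval_ms."""

    def __init__(self, min_interval_ms: int) -> None:
        self.min_interval_ms = min_interval_ms
        self._last_sent: Dict[str, int] = {}

    def allow(self, dest: str, now_ms: int) -> bool:
        last = self._last_sent.get(dest)
        if last is not None and now_ms - last < self.min_interval_ms:
            return False  # too soon after the previous event: suppress
        self._last_sent[dest] = now_ms
        return True


transit_errors = PerDestRateLimiter(100)     # error signals at a transit node
source_warmups = PerDestRateLimiter(2000)    # standalone CoordsWarmup at the source

# A burst of 100 packets, one per millisecond, to the same unreachable destination.
emitted = sum(transit_errors.allow("dst", t) for t in range(100))
print(emitted)
```

Only the first packet of the burst produces a signal. A real implementation would also need to evict old entries so the table does not grow without bound; that is left out here.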

Error signals are processed asynchronously, outside the packet-receive loop. A session getting hammered by PathBroken responses does not stall normal forwarding.

The blind spot

Error signals rely on find_next_hop(src_addr) at the transit node. After the warmup window, a transit node may no longer have the source's coordinates cached. In that case the error is silently dropped. Two things limit the damage:

  • CP-flagged warmup packets seed transit caches with the source's coordinates too, so the first few seconds of a session are always covered.
  • Sessions tear down after 90s of no application data. A session old enough to have lost all transit caches will expire soon anyway, and the next attempt re-runs SessionSetup from scratch.
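The blind spot is easy to see once the return path is written out. In this sketch, send_error_signal() and the outbox list are hypothetical, and find_next_hop is passed in as a callable so that a plain dictionary lookup can stand in for the real routing decision.

```python
from typing import Callable, List, Optional, Tuple


def send_error_signal(
    src_addr: str,
    signal_code: int,
    find_next_hop: Callable[[str], Optional[str]],
    outbox: List[Tuple[str, str, int]],
) -> bool:
    """Queue an error signal toward the source; False means it was dropped."""
    next_hop = find_next_hop(src_addr)
    if next_hop is None:
        # No route back to the source: the error is dropped silently.
        return False
    outbox.append((next_hop, src_addr, signal_code))
    return True


routes_back = {"src": "R1"}  # stand-in for cached source coordinates
outbox: List[Tuple[str, str, int]] = []
print(send_error_signal("src", 0x20, routes_back.get, outbox))
print(send_error_signal("forgotten-src", 0x21, routes_back.get, outbox))
print(outbox)
```

The second call fails without notifying anyone, which is why the source cannot treat the absence of error signals as proof that the path is healthy.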

State kept alive through all of this

Recovery never rebuilds the crypto session. What gets refreshed is:

  • The source's local coordinate cache (on PathBroken).
  • The transit caches along the current path (via CP-flagged packets and CoordsWarmup).
  • The source's path-MTU estimate (on MtuExceeded).

The Noise XK session, the replay counter, the sender and receiver MMP state: all of it persists. From the application's view the session keeps running. A few packets may be lost during the repair, which looks the same to TCP as any other drop.

Putting the signals together

First packet to a new destination
→ SessionSetup carries coords, transit caches warm
→ First 5 data packets set CP, reinforcing caches
Cache evicted at a transit node
← CoordsRequired back to source
→ Standalone CoordsWarmup, CP counter resets
Tree moved, cached coords stale
← PathBroken back to source
→ Drop local entry, LookupRequest, CP counter resets
Packet too big for next-hop MTU
← MtuExceeded back to source
→ Clamp session path_mtu, resend smaller
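The source-side reactions summarized above can be sketched as a single dispatch function. All names here (SourceSession, handle_error, the actions list) are hypothetical. The sketch also assumes that MtuExceeded reports the MTU of the link that rejected the packet, which the text implies but does not state, and it leaves out the 2000 ms source-side rate limit for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

CP_WINDOW = 5
COORDS_REQUIRED, PATH_BROKEN, MTU_EXCEEDED = 0x20, 0x21, 0x22


@dataclass
class SourceSession:
    dest: str
    path_mtu: int
    cp_remaining: int = 0                       # 0 means warmup has finished
    actions: List[str] = field(default_factory=list)


def handle_error(
    session: SourceSession,
    coord_cache: Dict[str, Tuple[str, ...]],
    code: int,
    reported_mtu: Optional[int] = None,
) -> None:
    """Repair routing state only; the crypto session is never touched."""
    if code == COORDS_REQUIRED:
        session.actions.append("CoordsWarmup")   # re-seed transit caches
        session.cp_remaining = CP_WINDOW
    elif code == PATH_BROKEN:
        coord_cache.pop(session.dest, None)      # local coordinates are stale
        session.actions.append("LookupRequest")  # rediscover the destination
        session.cp_remaining = CP_WINDOW
    elif code == MTU_EXCEEDED and reported_mtu is not None:
        # Only ever clamp downward; later packets are sized to fit.
        session.path_mtu = min(session.path_mtu, reported_mtu)


session = SourceSession(dest="dst", path_mtu=1400)
cache = {"dst": ("root", "b", "dst")}
handle_error(session, cache, PATH_BROKEN)
handle_error(session, cache, MTU_EXCEEDED, reported_mtu=1280)
print(session.actions, session.cp_remaining, session.path_mtu, cache)
```

Note the one difference between the first two branches: CoordsRequired keeps the source's own cache entry, while PathBroken discards it before looking the destination up again.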

Lesson quiz

1. What does the CP flag on an FSP data packet tell a transit router?

2. How many data packets at the start of a session carry the CP flag by default?

3. Which error signal means 'I have coordinates for the destination, but no peer is closer to it than I am'?

4. On receiving PathBroken, what does the source do differently from when it receives CoordsRequired?

5. Why does FIPS rate-limit error signals at the transit node?

6. What does MtuExceeded cause the source to do?