Lesson 8
When Things Go Wrong
The previous lessons assumed caches were warm, coordinates were current, and links had enough room for every packet. Real meshes do not behave like that. Nodes join and leave, the tree reconverges, caches expire, and link MTUs drop when a transport switches from Ethernet to UDP-over-DSL. FIPS has to keep a session usable through all of it without tearing down the crypto state or falling back to manual routing.
This lesson covers the two mechanisms that keep the mesh self-healing: the coordinate warmup window at the start of every session, and the three error signals transit routers emit when they cannot forward a packet.
Why warmup exists
Multi-hop routing needs two things at every transit node: a reachability hint (bloom filter) and tree coordinates for the destination (coordinate cache). Without the coordinate cache, find_next_hop() returns None before bloom filters are even consulted. Discovering coordinates on every cache miss would be slow and chatty, so FIPS warms the caches along the path during session setup.
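The ordering described above (coordinate lookup first, bloom filter second) can be sketched as follows. This is a minimal illustration, not the FIPS implementation: the argument layout, the use of a plain set in place of a bloom filter, and the tree_distance metric (coordinates modelled as root-to-node paths) are all assumptions made for the example.

```python
def tree_distance(a, b):
    """Hops between two tree nodes, each given as its path from the root."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def find_next_hop(dest, my_coords, coord_cache, neighbors):
    """Return the name of the neighbor to forward to, or None.

    neighbors maps name -> (coords, reachable); `reachable` is a plain set
    standing in for that neighbor's bloom filter (hypothetical layout).
    """
    dest_coords = coord_cache.get(dest)
    if dest_coords is None:
        return None  # cache miss: bloom filters are never consulted
    best, best_dist = None, tree_distance(my_coords, dest_coords)
    for name, (coords, reachable) in neighbors.items():
        if dest not in reachable:
            continue  # reachability hint says this neighbor cannot reach dest
        dist = tree_distance(coords, dest_coords)
        if dist < best_dist:
            best, best_dist = name, dist
    return best


me = ("root", "a")
neighbors = {
    "up": (("root",), {"dst"}),
    "down": (("root", "a", "b"), {"dst"}),
}
warm = {"dst": ("root", "a", "b", "c")}
print(find_next_hop("dst", me, warm, neighbors))  # down
print(find_next_hop("dst", me, {}, neighbors))    # None
```

With a cold cache the function returns None even though both neighbors advertise the destination, which is the situation warmup is meant to prevent.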
The warmup has three complementary sources:
- SessionSetup carries cleartext coordinates in the routing envelope. Every transit node extracts source and destination coordinates as the handshake passes through, and caches both.
- CP-flagged data packets. The first 5 data packets of a session set the CP flag in the FSP header and carry the same cleartext coordinates between the header and the ciphertext. Transit routers parse them without decrypting anything and refresh their caches.
- Standalone CoordsWarmup (0x14). If piggy-backing coordinates onto a data packet would exceed the transport MTU, FSP emits a bare CoordsWarmup first: same CP format, empty inner body, exists only to deliver coordinates to transit nodes.
Once the 5-packet window is over, FSP clears the CP flag and sends minimal 12-byte headers. At that point every router on the path should have current coordinates for both endpoints.
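A sender-side sketch of the warmup window may help make the three sources concrete. The 5-packet window and the 12-byte header come from the text; everything else is an assumption for illustration: the WarmupSender class, the coords_len parameter, the simplified size arithmetic (no ciphertext overhead), and the choice to count a standalone CoordsWarmup against the window.

```python
from dataclasses import dataclass

CP_WINDOW = 5     # from the lesson: first 5 data packets carry coordinates
MIN_HEADER = 12   # from the lesson: minimal header size in bytes


@dataclass
class WarmupSender:
    coords_len: int                 # size of the cleartext coordinate block (assumed)
    mtu: int                        # transport MTU for the path (assumed known)
    cp_remaining: int = CP_WINDOW   # packets left in the warmup window

    def plan(self, payload_len):
        """Return the (kind, size) frames to emit for one application payload."""
        if self.cp_remaining == 0:
            return [("data", MIN_HEADER + payload_len)]
        self.cp_remaining -= 1
        piggybacked = MIN_HEADER + self.coords_len + payload_len
        if piggybacked <= self.mtu:
            return [("data+CP", piggybacked)]
        # Coordinates do not fit next to the payload: send a bare
        # CoordsWarmup first, then the data packet without coordinates.
        return [
            ("CoordsWarmup", MIN_HEADER + self.coords_len),
            ("data", MIN_HEADER + payload_len),
        ]


sender = WarmupSender(coords_len=40, mtu=100)
print(sender.plan(30))  # [('data+CP', 82)]
print(sender.plan(60))  # [('CoordsWarmup', 52), ('data', 72)]
```

After five calls to plan() the sender falls back to plain data frames with the minimal header.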
The three error signals
After warmup finishes, a transit router can still fail to forward a packet. Instead of dropping silently (the classic IP behavior), FIPS defines three explicit error signals. Each one travels back to the source as a new SessionDatagram, routed via find_next_hop(src_addr). The cryptographic session keeps running; only routing state is repaired.
CoordsRequired (0x20)
No cached coordinates for the destination. Transit node has nothing to forward with.
PathBroken (0x21)
Coordinates are cached, but no neighbor is closer to the destination than the transit node itself. Greedy routing hit a local minimum.
MtuExceeded (0x22)
Packet does not fit the next-hop link MTU. The mesh layer will not fragment, so the source has to shrink its payloads.
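The three failure cases can be read as successive checks at a transit router. The sketch below uses the message codes given above; the check_forward function, its arguments (precomputed tree distances and per-neighbor MTUs), and the order of the checks are assumptions made for this example rather than the actual FIPS forwarding code.

```python
from enum import IntEnum


class ErrorSignal(IntEnum):
    COORDS_REQUIRED = 0x20
    PATH_BROKEN = 0x21
    MTU_EXCEEDED = 0x22


def check_forward(dest_known, my_dist, neighbor_dists, neighbor_mtus, packet_len):
    """Return (next_hop, None) on success or (None, ErrorSignal) on failure."""
    if not dest_known:
        # No cached coordinates: nothing to forward with.
        return None, ErrorSignal.COORDS_REQUIRED
    closer = {n: d for n, d in neighbor_dists.items() if d < my_dist}
    if not closer:
        # Greedy routing hit a local minimum.
        return None, ErrorSignal.PATH_BROKEN
    next_hop = min(closer, key=closer.get)
    if packet_len > neighbor_mtus[next_hop]:
        # The mesh layer will not fragment.
        return None, ErrorSignal.MTU_EXCEEDED
    return next_hop, None


mtus = {"a": 1500, "b": 1200}
print(check_forward(False, 3, {"a": 2}, mtus, 100))          # CoordsRequired
print(check_forward(True, 2, {"a": 3, "b": 2}, mtus, 100))   # PathBroken
print(check_forward(True, 3, {"b": 1}, mtus, 1400))          # MtuExceeded
print(check_forward(True, 3, {"a": 2, "b": 1}, mtus, 1000))  # forwards to "b"
```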
Watch a recovery unfold
Each recovery scenario starts from the same five-node path (Src → R1 → R2 → R3 → Dst) in steady state. A problem at R2 triggers the error, the signal walks back to Src, and Src drives the repair.
Warmup has finished. A transit router lost the destination's coordinates (cache expired, or the router just joined). It has no way to forward the packet.
Steady state, then R2 drops its cache
The session has been running. Warmup is done (CP counter is 0). R2's coordinate cache for Dst has just expired. R2 still has all its FMP link sessions, it just no longer knows where Dst sits in the tree.
Rate limits and why they matter
Two rate limits keep recovery from becoming a storm when the topology changes:
- Transit side: at most one error signal per destination per 100ms. If a transit router sees a hundred packets to the same unreachable destination in a burst, it emits one error signal, not a hundred.
- Source side: at most one standalone CoordsWarmup per destination per 2000ms in response to errors. This prevents a burst of errors from triggering a burst of warmup messages.
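Both limits can be modelled with the same per-destination limiter. The 100ms and 2000ms intervals come from the list above; the PerDestinationLimiter class and its minimum-spacing interpretation of "at most one per interval" are assumptions for illustration. The clock is passed in explicitly so the behavior is easy to check.

```python
class PerDestinationLimiter:
    """Allow at most one event per destination per interval (minimum spacing)."""

    def __init__(self, interval_ms):
        self.interval_ms = interval_ms
        self._last = {}  # destination -> time of last allowed event, in ms

    def allow(self, dest, now_ms):
        last = self._last.get(dest)
        if last is not None and now_ms - last < self.interval_ms:
            return False  # still inside the quiet period for this destination
        self._last[dest] = now_ms
        return True


transit_errors = PerDestinationLimiter(100)    # error signals at a transit router
source_warmups = PerDestinationLimiter(2000)   # standalone CoordsWarmup at the source

# A burst of 100 packets, one per millisecond, to the same unreachable destination.
emitted = sum(transit_errors.allow("dst", t) for t in range(100))
print(emitted)  # 1
```

Because the state is keyed by destination, a burst toward one unreachable node does not suppress error signals for another.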
Error signals are processed asynchronously, outside the packet-receive loop. A session getting hammered by PathBroken responses does not stall normal forwarding.
The blind spot
Error signals rely on find_next_hop(src_addr) at the transit node. After the warmup window, a transit node may no longer have the source's coordinates cached. In that case the error is silently dropped. Two things limit the damage:
- CP-flagged warmup packets seed transit caches with the source's coordinates too, so the first few seconds of a session are always covered.
- Sessions tear down after 90s of no application data. A session old enough to have lost all transit caches will expire soon anyway, and the next attempt re-runs SessionSetup from scratch.
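The blind spot comes down to a cache entry for the source expiring before an error needs to travel back. The sketch below models that with a simple expiring cache. The CoordCache class, the can_return_error helper, and the 30-second lifetime are hypothetical; the lesson does not state the actual cache lifetime.

```python
class CoordCache:
    """Coordinate cache whose entries expire after ttl_ms (hypothetical)."""

    def __init__(self, ttl_ms):
        self.ttl_ms = ttl_ms
        self._entries = {}  # addr -> (coords, stored_at_ms)

    def put(self, addr, coords, now_ms):
        self._entries[addr] = (coords, now_ms)

    def get(self, addr, now_ms):
        entry = self._entries.get(addr)
        if entry is None:
            return None
        coords, stored_at = entry
        if now_ms - stored_at >= self.ttl_ms:
            del self._entries[addr]  # expired: forget the entry
            return None
        return coords


def can_return_error(cache, src_addr, now_ms):
    """An error signal can only be routed back if the source is still cached."""
    return cache.get(src_addr, now_ms) is not None


cache = CoordCache(ttl_ms=30_000)              # assumed lifetime, for illustration
cache.put("src", ("root", "a"), now_ms=0)      # seeded by a CP-flagged packet
print(can_return_error(cache, "src", 5_000))   # True: still inside the lifetime
print(can_return_error(cache, "src", 30_000))  # False: error would be dropped
```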
State kept alive through all of this
Recovery never rebuilds the crypto session. What gets refreshed is:
- The source's local coordinate cache (on PathBroken).
- The transit caches along the current path (via CP-flagged packets and CoordsWarmup).
- The source's path-MTU estimate (on MtuExceeded).
The Noise XK session, the replay counter, the sender and receiver MMP state: all of it persists. From the application's view the session keeps running. A few packets may be lost during the repair, which looks the same to TCP as any other drop.
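A source-side handler shows how little state each signal touches. This is a sketch under stated assumptions, not the FIPS source: the SourceSession class and its field names are hypothetical, the actions are recorded as strings instead of being sent, and MtuExceeded is assumed to report the offending link MTU.

```python
from dataclasses import dataclass, field

COORDS_REQUIRED, PATH_BROKEN, MTU_EXCEEDED = 0x20, 0x21, 0x22


@dataclass
class SourceSession:
    path_mtu: int
    session_key_id: int = 1   # stand-in for the Noise XK session
    replay_counter: int = 0   # stand-in for the replay state
    actions: list = field(default_factory=list)

    def on_error(self, code, link_mtu=None):
        """Repair routing state only; crypto fields are never modified here."""
        if code == PATH_BROKEN:
            # The source's own view of the destination is stale too.
            self.actions.append("refresh local coordinates for destination")
            self.actions.append("send CoordsWarmup")
        elif code == COORDS_REQUIRED:
            # Transit caches need re-seeding; local coordinates are still good.
            self.actions.append("send CoordsWarmup")
        elif code == MTU_EXCEEDED and link_mtu is not None:
            # Assumes the signal carries the link MTU; shrink the estimate.
            self.path_mtu = min(self.path_mtu, link_mtu)


session = SourceSession(path_mtu=1400, replay_counter=42)
session.on_error(COORDS_REQUIRED)
session.on_error(PATH_BROKEN)
session.on_error(MTU_EXCEEDED, link_mtu=1200)
print(session.path_mtu)                                # 1200
print(session.session_key_id, session.replay_counter)  # 1 42
```

In a real sender the CoordsWarmup action would also pass through the source-side rate limit before anything is emitted.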
Putting the signals together
Lesson quiz
1. What does the CP flag on an FSP data packet tell a transit router?
2. How many data packets at the start of a session carry the CP flag by default?
3. Which error signal means 'I have coordinates for the destination, but no peer is closer to it than I am'?
4. On receiving PathBroken, what does the source do differently from when it receives CoordsRequired?
5. Why does FIPS rate-limit error signals at the transit node?
6. What does MtuExceeded cause the source to do?