Group-Based Tunnel Load Balancing With Built-in Health Checks (How rustunnel Does What FRP Does, Explained)
rustunnel's group-LB feature lets two or more tunnel clients claim the same hostname and share traffic via round-robin or weighted-random, with active health checks that pull dead backends out of rotation. A complete walkthrough vs frp's group_health_check, with diagrams and a worked example.
Tunnel load balancing is a feature most users discover the day they need it. You're running two copies of your local app for redundancy, or you're A/B-ing two backends on the same hostname, or you're rolling out a deploy without dropping requests. All of those need the same hostname fronting multiple tunnels with health checks that yank a backend out of rotation when it dies.
frp ships this as group_health_check. rustunnel ships it as the group-LB feature, complete with active health checks. This post is the long-form explanation — the docs reference is at docs/reference/load-balancing.
Status: the mechanism is shipped (TUNNEL-7 / TUNNEL-8 — see the rustunnel changelog) and running in production.
What "group" means here
The mental model:
- A tunnel is a client process holding open a control connection to the relay. One tunnel = one process.
- A group is a named bag of tunnels that all claim the same public hostname. The relay round-robins or weighted-randomly across the bag for incoming requests.
- A health check is the relay pinging each member of the bag on a path you specify, on a cadence you specify, and removing it from rotation if N consecutive checks fail.
This is L7 traffic distribution — the relay terminates the inbound connection, picks a member, and proxies. It is not L4 IP-LB; the relay always sees the request before forwarding. That trade-off (slight extra latency, much better observability) is intentional.
The frp baseline (for context)
frp's group_health_check config:
# frpc.ini on backend A
[web-a]
type = http
local_port = 8080
custom_domains = api.example.com
group = api-group
group_key = shared-secret
health_check_type = http
health_check_url = /health
health_check_interval_s = 5
health_check_max_failed = 2
# frpc.ini on backend B (same group_key, same custom_domain)
[web-b]
type = http
local_port = 8080
custom_domains = api.example.com
group = api-group
group_key = shared-secret
health_check_type = http
health_check_url = /health
The relay rejects mismatched group_keys on the same domain (so a third party can't join your group by guessing the name).
How rustunnel does it
The same shape, but configured on the rustunnel client with flags or a TOML config:
# Backend A
rustunnel http 8080 \
--subdomain api \
--group api-group \
--group-key shared-secret \
--health-path /health \
--health-interval 5s \
--health-max-fails 2
# Backend B (different machine, same group)
rustunnel http 8080 \
--subdomain api \
--group api-group \
--group-key shared-secret \
--health-path /health
Both clients open separate control connections to the relay; the relay reconciles them by (subdomain, group, group_key) and starts active probing.
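To make the reconciliation step concrete, here is a minimal sketch of how a relay could admit members by (subdomain, group, group_key). The `GroupRegistry` class and its `join` method are hypothetical names for illustration, not rustunnel's actual code.

```javascript
// Hypothetical relay-side registry; illustrative only, not rustunnel's implementation.
class GroupRegistry {
  constructor() {
    // Maps "subdomain/group" to { groupKey, members }.
    this.groups = new Map()
  }

  // Admit a peer if it is the first member or presents the matching group key.
  join(subdomain, group, groupKey, peerId) {
    const id = `${subdomain}/${group}`
    const existing = this.groups.get(id)
    if (!existing) {
      this.groups.set(id, { groupKey, members: new Set([peerId]) })
      return { ok: true, members: 1 }
    }
    if (existing.groupKey !== groupKey) {
      return { ok: false, reason: 'group_key mismatch' }
    }
    existing.members.add(peerId)
    return { ok: true, members: existing.members.size }
  }
}

const registry = new GroupRegistry()
console.log(registry.join('api', 'api-group', 'shared-secret', 'a1b2c3')) // { ok: true, members: 1 }
console.log(registry.join('api', 'api-group', 'shared-secret', 'd4e5f6')) // { ok: true, members: 2 }
console.log(registry.join('api', 'api-group', 'wrong-guess', 'intruder')) // { ok: false, reason: 'group_key mismatch' }
```

A production relay would presumably compare keys in constant time and drop a member when its control connection closes; both are left out here to keep the sketch short.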
What an active health check looks like
┌─────────────┐                  ┌──────────────┐
│ Relay       │   GET /health    │ Backend A    │
│ (rustunnel- │ ───────────────▶ │ (200 OK)     │
│ server)     │ ◀─────────────── │              │
│             │   HTTP/1.1 200   └──────────────┘
│             │   Content-Length:0
│             │
│             │   GET /health    ┌──────────────┐
│             │ ───────────────▶ │ Backend B    │
│             │ ◀─────────────── │ (502)        │
│             │   HTTP/1.1 502   └──────────────┘
│             │
│ State:      │
│ A: healthy  │
│ B: failing  │
│   (1/2)     │
└─────────────┘
After two consecutive 502s, B is marked unhealthy and removed from rotation. New requests go only to A. The relay continues probing B; the moment it returns 200, B re-enters rotation.
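The removal and re-entry rule described above can be sketched as a small per-member state machine. `HealthTracker` is a hypothetical name, and treating only a 200 as healthy is an assumption based on the description here; the real relay may accept a wider range of statuses.

```javascript
// Hypothetical per-member tracker mirroring the rule described above.
class HealthTracker {
  constructor(maxFails) {
    this.maxFails = maxFails // consecutive failures before removal
    this.fails = 0
    this.inRotation = true
  }

  // Record one probe result and return whether the member is in rotation.
  record(status) {
    if (status === 200) {
      // A single success resets the counter and restores the member.
      this.fails = 0
      this.inRotation = true
    } else {
      this.fails += 1
      if (this.fails >= this.maxFails) this.inRotation = false
    }
    return this.inRotation
  }
}

const backendB = new HealthTracker(2)
console.log([200, 502, 502, 502, 200].map((s) => backendB.record(s)))
// [ true, true, false, false, true ]
```

The second 502 is the one that removes the member, and the first 200 afterwards brings it back without any warm-up period.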
Worked example: two backends + an alert webhook
Let's build this end-to-end: two machines, one public hostname, automatic failover, and a Slack alert within seconds of a backend dying.
Setup
Two machines:
- Machine A (home office, macOS) — running a Node.js Express app on port 8080
- Machine B (co-working space, Ubuntu) — same app, same port
// server.js (identical on both machines)
const express = require('express')
const app = express()
let healthy = true
app.get('/health', (_req, res) =>
res.status(healthy ? 200 : 503).json({ ok: healthy })
)
app.get('/', (_req, res) =>
res.json({ machine: process.env.MACHINE_ID || 'unknown' })
)
// POST /break to simulate a health-check failure
app.post('/break', (_req, res) => {
healthy = false
res.json({ ok: false })
})
app.post('/recover', (_req, res) => {
healthy = true
res.json({ ok: true })
})
app.listen(8080)
Starting both tunnels
Machine A:
$ MACHINE_ID=A node server.js &
$ rustunnel http 8080 \
--subdomain api \
--group api-group \
--group-key s3cr3t \
--health-path /health \
--health-interval 5s \
--health-max-fails 2
rustunnel v0.7.1 relay: eu.edge.rustunnel.com
✓ connected control channel established
✓ registered api.eu.edge.rustunnel.com → localhost:8080
✓ group joined api-group (1 member, waiting for peers)
✓ health GET /health → 200 [4.1ms]
Machine B (a few seconds later):
$ MACHINE_ID=B node server.js &
$ rustunnel http 8080 \
--subdomain api \
--group api-group \
--group-key s3cr3t \
--health-path /health \
--health-interval 5s \
--health-max-fails 2
rustunnel v0.7.1 relay: eu.edge.rustunnel.com
✓ connected control channel established
✓ registered api.eu.edge.rustunnel.com → localhost:8080
✓ group joined api-group (2 of 2 members healthy)
✓ health GET /health → 200 [3.8ms]
Machine A's terminal updates the moment B joins:
ℹ group api-group: 2 of 2 members healthy
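With both members in rotation, a quick way to see the distribution is to hit the public hostname a few times and count which machine answered. The sketch below relies on the `machine` field that server.js returns from `/`; the `https://` scheme and the helper names `tally` and `sample` are assumptions for illustration.

```javascript
// Count how many responses came from each machine.
function tally(bodies) {
  const counts = {}
  for (const body of bodies) {
    counts[body.machine] = (counts[body.machine] || 0) + 1
  }
  return counts
}

// Fetch the URL n times in sequence and tally the `machine` field (Node 18+ for global fetch).
async function sample(url, n) {
  const bodies = []
  for (let i = 0; i < n; i++) {
    const res = await fetch(url)
    bodies.push(await res.json())
  }
  return tally(bodies)
}

// Usage: sample('https://api.eu.edge.rustunnel.com/', 10).then(console.log)
console.log(tally([{ machine: 'A' }, { machine: 'B' }, { machine: 'A' }])) // { A: 2, B: 1 }
```

Under round-robin with two healthy members, ten sequential requests should split roughly evenly; other traffic on the same hostname can shift the counts.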
What the dashboard shows
Open /dashboard/groups after both clients connect. The groups table shows one row per group with columns for group name, member count, healthy count, and the subdomain it's bound to. Click the group name for a per-member breakdown:
| Member | Peer ID | Status | Last probe | Consecutive fails |
|---|---|---|---|---|
| A | a1b2c3 | healthy | 200ms ago | 0 |
| B | d4e5f6 | healthy | 180ms ago | 0 |
The relay actively polls both backends every 5 seconds. The "consecutive fails" counter increments in real time as probes fail — you can watch it tick up when you break a backend, then watch it reset to zero when the backend recovers. No page refresh needed; the table streams updates via SSE.
Simulating a partial outage
Break Machine B's /health without stopping the process (the rest of the app keeps serving):
$ curl -X POST http://localhost:8080/break
{"ok":false}
The health endpoint now returns 503. Two failed probes later (up to ten seconds at --health-interval 5s --health-max-fails 2), Machine A's terminal shows:
⚠ group api-group: B (d4e5f6) removed — 2 consecutive fails
⚠ group api-group: 1 of 2 members healthy, routing to A only
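How long removal takes depends on where in the probe cycle the backend breaks. The helper below is a back-of-the-envelope sketch that assumes probes fire exactly on schedule and ignores probe latency and timeouts; `detectionWindowSeconds` is a hypothetical name.

```javascript
// Rough bounds on the time between a backend breaking and its removal from rotation.
function detectionWindowSeconds(intervalSeconds, maxFails) {
  return {
    // Backend breaks just before a probe: the first failure is immediate.
    best: (maxFails - 1) * intervalSeconds,
    // Backend breaks just after a probe: a full interval passes before the first failure.
    worst: maxFails * intervalSeconds,
  }
}

console.log(detectionWindowSeconds(5, 2)) // { best: 5, worst: 10 }
```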
The dashboard updates to show B as "unhealthy (2/2 fails)". Every inbound request to api.eu.edge.rustunnel.com routes exclusively to A. Recover B:
$ curl -X POST http://localhost:8080/recover
{"ok":true}
Within one probe interval (at most 5 seconds):
✓ group api-group: B (d4e5f6) re-entered rotation
✓ group api-group: 2 of 2 members healthy
Wiring up Prometheus → Alertmanager → Slack
rustunnel-server exposes a rustunnel_group_member_unhealthy_total counter on its Prometheus metrics endpoint (:9090/metrics by default — see the self-hosting guide if you're running your own relay). Each time a member is pulled from rotation the counter increments, labelled with group and peer_id.
1. Scrape config
# prometheus.yml
scrape_configs:
  - job_name: rustunnel
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
2. Alert rule
# rules/rustunnel.yml
groups:
  - name: tunnel_lb
    rules:
      - alert: TunnelGroupMemberUnhealthy
        expr: increase(rustunnel_group_member_unhealthy_total[2m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Group {{ $labels.group }} lost a backend"
          description: >
            Backend {{ $labels.peer_id }} was pulled from rotation in group
            {{ $labels.group }}. Check the tunnel client on that machine.
3. Alertmanager receiver
# alertmanager.yml
receivers:
  - name: slack-tunnel-alerts
    slack_configs:
      - channel: '#infra-alerts'
        api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        title: '{{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts -}}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}
route:
  receiver: slack-tunnel-alerts
  group_by: [alertname, group]
  # The root route cannot carry matchers, so the match lives on a child route.
  routes:
    - receiver: slack-tunnel-alerts
      matchers:
        - alertname = TunnelGroupMemberUnhealthy
When Machine B drops, Prometheus picks up the counter increment on the next scrape (at most 15 seconds) and fires the alert rule, Alertmanager routes the alert to Slack, and your #infra-alerts channel receives a message within roughly 30 seconds of the second failed probe. Tighten the scrape interval to 5s if you need faster alerting, but be aware of the additional scrape load.
Round-robin vs weighted-random vs sticky
These are the three dispatch modes available to a group. The relay picks the mode based on flags the members register with.
Round-robin (default)
With no --weight or --sticky flag the relay cycles through healthy members in strict order: request 1 goes to A, request 2 to B, request 3 to A, and so on. This is the right default for homogeneous backends — same hardware, same app version, equal traffic.
Round-robin keeps no per-client state on the relay. The only state is an atomic counter per group, which the relay increments on every accepted request. There is no session affinity and no memory between requests; even long-polling connections land wherever the counter points at the moment the request arrives.
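A minimal sketch of counter-based round-robin over the healthy set follows. The `RoundRobin` class and the member shape (`name`, `healthy`) are assumptions for illustration; the relay's internal structures are not documented here.

```javascript
// Hypothetical round-robin picker: one counter per group, no per-client state.
class RoundRobin {
  constructor() {
    this.counter = 0
  }

  // Return the next healthy member, or null if none are healthy.
  pick(members) {
    const healthy = members.filter((m) => m.healthy)
    if (healthy.length === 0) return null
    const member = healthy[this.counter % healthy.length]
    this.counter += 1
    return member
  }
}

const members = [
  { name: 'A', healthy: true },
  { name: 'B', healthy: true },
]
const rr = new RoundRobin()
console.log([1, 2, 3].map(() => rr.pick(members).name)) // [ 'A', 'B', 'A' ]

members[1].healthy = false // B fails its health checks
console.log([1, 2].map(() => rr.pick(members).name)) // [ 'A', 'A' ]
```

Because the index is taken over the healthy subset, a member leaving or rejoining shifts who receives the next request, but traffic never lands on an unhealthy member.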
Weighted-random
When backends are not equal in capacity, use --weight to tilt the distribution:
# Machine A: larger box, absorbs 70% of traffic
rustunnel http 8080 \
--group api-group --group-key s3cr3t \
--weight 0.7
# Machine B: smaller box, absorbs 30%
rustunnel http 8080 \
--group api-group --group-key s3cr3t \
--weight 0.3
Weights are relative ratios, not absolute percentages. Registering both machines at --weight 2.0 is identical to both at --weight 1.0. The relay normalises weights across the healthy member set only — so if Machine B fails health checks and leaves the pool, all traffic goes to Machine A regardless of the configured ratio.
This makes weighted-random useful for gradual rollouts: deploy the new build on Machine B with --weight 0.1, verify it handles the 10% slice cleanly, then bump the weight incrementally. If B breaks, the weight normalisation means 100% of traffic falls back to A automatically — no manual intervention needed.
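The normalisation over healthy members can be sketched with a cumulative-weight draw. `pickWeighted` and the member shape are hypothetical, and the injectable `random` parameter exists only to make the example deterministic.

```javascript
// Hypothetical weighted-random picker that normalises over healthy members only.
function pickWeighted(members, random = Math.random) {
  const healthy = members.filter((m) => m.healthy)
  const total = healthy.reduce((sum, m) => sum + m.weight, 0)
  if (total <= 0) return null
  // Draw a point in [0, total) and walk the members until it is covered.
  let threshold = random() * total
  for (const m of healthy) {
    threshold -= m.weight
    if (threshold < 0) return m
  }
  return healthy[healthy.length - 1] // guard against floating-point rounding
}

const pool = [
  { name: 'A', weight: 0.7, healthy: true },
  { name: 'B', weight: 0.3, healthy: true },
]
console.log(pickWeighted(pool, () => 0.5).name) // A
console.log(pickWeighted(pool, () => 0.8).name) // B

pool[1].healthy = false // B leaves the pool, so A takes every draw
console.log(pickWeighted(pool, () => 0.99).name) // A
```

Since the draw is scaled by the sum of healthy weights, only the ratio between weights matters, which is why 2.0 and 2.0 behaves the same as 1.0 and 1.0.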
Sticky sessions (cookie hash)
When requests for the same user session must land on the same backend — shopping carts, in-progress file uploads, WebSocket upgrades handled at the application layer — use --sticky cookie:
# Both backends register with cookie stickiness
rustunnel http 8080 \
--group api-group --group-key s3cr3t \
--sticky cookie
The relay sets a __rt_sticky session cookie on the first response from the group. Subsequent requests from the same browser include that cookie, and the relay routes them back to the same member as long as it is healthy. If the pinned backend fails health checks, the relay re-hashes to the next available member, re-sets the cookie, and the user loses their sticky slot but does not receive a 503.
The canonical use case in development is rolling-deploy testing: run the new build on Machine B with --sticky cookie, share a staging URL with your QA team, and they consistently land on the new version because their browsers carry the sticky cookie. Every other user hits round-robin across both versions normally. When QA signs off, switch to --weight to cut traffic over gradually.
For non-browser HTTP clients (curl, SDKs, API consumers) that don't preserve cookies, --sticky ip hashes on the client IP address instead. See the docs/reference/load-balancing reference for the full flag list including per-group TOML config equivalents.
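Both sticky modes come down to mapping a stable key (the cookie value or the client IP) to a member, with a fallback when that member is unhealthy. The sketch below is one plausible way to do that; `pickSticky`, the SHA-256 hash, and the walk-forward fallback are assumptions, not a description of rustunnel's internals.

```javascript
const { createHash } = require('node:crypto')

// Hypothetical sticky picker: hash the key to a slot, then walk forward to a healthy member.
function pickSticky(members, stickyKey) {
  if (members.length === 0) return null
  const digest = createHash('sha256').update(stickyKey).digest()
  const start = digest.readUInt32BE(0) % members.length
  for (let i = 0; i < members.length; i++) {
    const candidate = members[(start + i) % members.length]
    if (candidate.healthy) return candidate
  }
  return null // every member is unhealthy
}

const group = [
  { name: 'A', healthy: true },
  { name: 'B', healthy: true },
]
const pinned = pickSticky(group, '203.0.113.7')
console.log(pickSticky(group, '203.0.113.7') === pinned) // true

pinned.healthy = false // the pinned member fails its health checks
console.log(pickSticky(group, '203.0.113.7') !== pinned) // true
```

Hashing modulo the member count is simple but remaps many keys when the group size changes; a consistent-hashing ring would reduce that churn at the cost of more code.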
Caveats
Group-LB is stable and ships in the open-source server, but four things are worth knowing before you lean on it in production.
WebSocket upgrades pin to one backend for the connection lifetime. Round-robin and weighted-random operate at HTTP request granularity — each GET /api/data can land on a different backend. A WebSocket upgrade is a single HTTP request followed by a protocol switch; the relay picks a backend at upgrade time and holds that connection on that backend until it closes. There is no mid-stream migration. If most of your traffic is long-lived WebSocket connections, you will see natural stickiness even without --sticky — and if a backend fails while connections are open, those connections drop (the client is responsible for reconnecting). WebSocket pinning applies equally to P2P tunnels in relayed mode; see the P2P tunnel reference for the distinction between direct and relayed paths.
Health checks consume real bandwidth at high polling frequency. Each probe is a full HTTP round-trip over the tunnel's control channel. At the default 10-second interval across a handful of members the overhead is negligible. At 1-second intervals across a 50-member group you are sending 50 probes per second per relay; at roughly 200 bytes per round-trip that is around 10 KB/s of pure probe traffic, before any user traffic arrives. Default to 10-second intervals in production. Drop to 2 seconds if your recovery-time SLA requires it. Sub-second intervals are available but rarely justified outside synthetic testing.
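The probe-overhead arithmetic above is easy to redo for your own group size. This helper uses the roughly 200 bytes per round-trip quoted above as its default; `probeBytesPerSecond` is a hypothetical name and the real per-probe size will vary with headers and response body.

```javascript
// Estimate steady-state probe traffic for one group on one relay.
function probeBytesPerSecond(memberCount, intervalSeconds, bytesPerProbe = 200) {
  return (memberCount * bytesPerProbe) / intervalSeconds
}

console.log(probeBytesPerSecond(50, 1)) // 10000 (about 10 KB/s)
console.log(probeBytesPerSecond(50, 10)) // 1000
console.log(probeBytesPerSecond(5, 10)) // 100
```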
If all members fail health checks, the relay returns 503. There is no fallthrough to a static error page or a secondary group yet. The 503 Service Unavailable response body carries a JSON payload with the group name and a timestamp so your monitoring can distinguish a "no healthy backend" 503 from a generic upstream failure. A configurable fallback target — either a static URL or a secondary group name — is tracked in TUNNEL-9; if that matters for your use case, follow that issue.
The group_key is a shared secret, not an ACL. Any client that knows the (subdomain, group, group_key) triple can join the group. On the managed cloud, group keys are scoped to your account and tenant-isolated at the relay. On self-hosted relays, treat group_key the same way you would treat a shared HMAC secret: rotate it if it leaks, and keep it out of public dotfiles and source control.
How this stacks up vs frp / cloudflared / ngrok
| Feature | frp | cloudflared | ngrok | rustunnel |
|---|---|---|---|---|
| Group LB | ✅ | ❌ (Cloudflare-side LB only) | Paid (Pro+) | ✅ |
| Active health checks | ✅ | ✅ | ✅ | ✅ |
| Weighted distribution | ❌ | ✅ | ✅ | ✅ |
| Sticky sessions | ❌ | ✅ | ✅ | ✅ |
| Self-hostable | ✅ | ❌ | ❌ | ✅ |
Next steps
- Read the load-balancing reference — full flag list and TOML config.
- Set up your own relay so you can A/B groups freely on a self-hosted edge.
- Compare rustunnel and frp for the broader managed-cloud-vs-DIY decision.