Group-Based Tunnel Load Balancing With Built-in Health Checks (How rustunnel Does What FRP Does, Explained)

rustunnel's group-LB feature lets two or more tunnel clients claim the same hostname and share traffic via round-robin or weighted-random, with active health checks that pull dead backends out of rotation. A complete walkthrough vs frp's group_health_check, with diagrams and a worked example.

João Henrique · 12 min read

Tunnel load balancing is a feature most users discover the day they need it. You're running two copies of your local app for redundancy, or you're A/B-ing two backends on the same hostname, or you're rolling out a deploy without dropping requests. All of those need the same hostname fronting multiple tunnels with health checks that yank a backend out of rotation when it dies.

frp ships this as group_health_check. rustunnel ships it as the group-LB feature, complete with active health checks. This post is the long-form explanation — the docs reference is at docs/reference/load-balancing.

Status: the mechanism is shipped (TUNNEL-7 / TUNNEL-8 — see the rustunnel changelog) and running in production.

What "group" means here

The mental model:

  • A tunnel is a client process holding open a control connection to the relay. One tunnel = one process.
  • A group is a named bag of tunnels that all claim the same public hostname. The relay distributes incoming requests across the bag by round-robin or weighted-random selection.
  • A health check is the relay pinging each member of the bag on a path you specify, on a cadence you specify, and removing it from rotation if N consecutive checks fail.

This is L7 traffic distribution — the relay terminates the inbound connection, picks a member, and proxies. It is not L4 IP-LB; the relay always sees the request before forwarding. That trade-off (slight extra latency, much better observability) is intentional.

The frp baseline (for context)

frp's group_health_check config:

# frpc.ini on backend A
[web-a]
type = http
local_port = 8080
custom_domains = api.example.com
group = api-group
group_key = shared-secret
health_check_type = http
health_check_url = /health
health_check_interval_s = 5
health_check_max_failed = 2

# frpc.ini on backend B (same group_key, same custom_domains)
[web-b]
type = http
local_port = 8080
custom_domains = api.example.com
group = api-group
group_key = shared-secret
health_check_type = http
health_check_url = /health

The relay rejects mismatched group_keys on the same domain (so a third party can't join your group by guessing the name).

How rustunnel does it

The same shape, but configured on the rustunnel client with flags or a TOML config:

# Backend A
rustunnel http 8080 \
  --subdomain api \
  --group api-group \
  --group-key shared-secret \
  --health-path /health \
  --health-interval 5s \
  --health-max-fails 2
 
# Backend B (different machine, same group)
rustunnel http 8080 \
  --subdomain api \
  --group api-group \
  --group-key shared-secret \
  --health-path /health

Both clients open separate control connections to the relay; the relay reconciles them by (subdomain, group, group_key) and starts active probing.
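
The reconciliation step can be pictured as a small registry keyed by subdomain. This is a minimal sketch, not rustunnel's implementation: the GroupRegistry class, its join method, and the choice to reject a join whose group or key does not match the existing claim are all assumptions made for illustration.

```javascript
// Hypothetical relay-side registry: one group per subdomain.
// GroupRegistry and join() are illustrative names, not rustunnel's API.
class GroupRegistry {
  constructor() {
    this.bySubdomain = new Map()
  }

  // Returns the member count after joining, or throws on a mismatch.
  join({ subdomain, group, groupKey, peerId }) {
    const existing = this.bySubdomain.get(subdomain)
    if (!existing) {
      this.bySubdomain.set(subdomain, { group, groupKey, members: new Set([peerId]) })
      return 1
    }
    // Assumption: a second client must present the same group name and key.
    if (existing.group !== group || existing.groupKey !== groupKey) {
      throw new Error(`subdomain ${subdomain} is claimed by another group or key`)
    }
    existing.members.add(peerId)
    return existing.members.size
  }
}

const registry = new GroupRegistry()
const base = { subdomain: 'api', group: 'api-group', groupKey: 'shared-secret' }
console.log(registry.join({ ...base, peerId: 'a1b2c3' })) // 1
console.log(registry.join({ ...base, peerId: 'd4e5f6' })) // 2
```

A production relay would likely compare keys in constant time and drop members when their control connection closes; both are left out here to keep the shape visible.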

What an active health check looks like

        ┌─────────────┐                   ┌──────────────┐
        │   Relay     │  GET /health      │  Backend A   │
        │ (rustunnel- │ ───────────────▶  │ (200 OK)     │
        │  server)    │ ◀───────────────  │              │
        │             │  HTTP/1.1 200     └──────────────┘
        │             │  Content-Length:0
        │             │
        │             │  GET /health      ┌──────────────┐
        │             │ ───────────────▶  │  Backend B   │
        │             │ ◀───────────────  │ (502)        │
        │             │  HTTP/1.1 502     └──────────────┘
        │             │
        │ State:      │
        │  A: healthy │
        │  B: failing │
        │     (1/2)   │
        └─────────────┘

After two consecutive 502s, B is marked unhealthy and removed from rotation. New requests go only to A. The relay continues probing B; the moment it returns 200, B re-enters rotation.
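
The state machine behind that diagram is small enough to sketch. The version below is an assumption-laden illustration, not the relay's code: HealthTracker is a hypothetical name, and it treats only a 200 response as a passing probe because that is the status this post describes.

```javascript
// Illustrative consecutive-failure tracker; not rustunnel's actual code.
class HealthTracker {
  constructor(maxFails) {
    this.maxFails = maxFails
    this.fails = 0
    this.healthy = true
  }

  // Record one probe result (an HTTP status code) and return the new state.
  record(status) {
    if (status === 200) {
      // A single passing probe resets the counter and restores the member.
      this.fails = 0
      this.healthy = true
    } else {
      this.fails += 1
      if (this.fails >= this.maxFails) this.healthy = false
    }
    return this.healthy
  }
}

const b = new HealthTracker(2)
console.log(b.record(502)) // true  (failing, 1/2)
console.log(b.record(502)) // false (removed from rotation)
console.log(b.record(200)) // true  (back in rotation)
```

Requiring consecutive failures rather than a failure ratio means one slow probe never evicts a backend on its own, at the cost of a detection delay of roughly interval times max-fails.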

Worked example: two backends + an alert webhook

Let's build this end-to-end: two machines, one public hostname, automatic failover, and a Slack alert shortly after a backend dies.

Setup

Two machines:

  • Machine A (home office, macOS) — running a Node.js Express app on port 8080
  • Machine B (co-working space, Ubuntu) — same app, same port

// server.js (identical on both machines)
const express = require('express')
const app = express()
 
let healthy = true
 
app.get('/health', (_req, res) =>
  res.status(healthy ? 200 : 503).json({ ok: healthy })
)
 
app.get('/', (_req, res) =>
  res.json({ machine: process.env.MACHINE_ID || 'unknown' })
)
 
// POST /break to simulate a health-check failure
app.post('/break', (_req, res) => {
  healthy = false
  res.json({ ok: false })
})
 
app.post('/recover', (_req, res) => {
  healthy = true
  res.json({ ok: true })
})
 
app.listen(8080)

Starting both tunnels

Machine A:

$ MACHINE_ID=A node server.js &

$ rustunnel http 8080 \
    --subdomain api \
    --group api-group \
    --group-key s3cr3t \
    --health-path /health \
    --health-interval 5s \
    --health-max-fails 2

rustunnel v0.7.1  relay: eu.edge.rustunnel.com
✓  connected    control channel established
✓  registered   api.eu.edge.rustunnel.com → localhost:8080
✓  group        joined api-group (1 member, waiting for peers)
✓  health       GET /health → 200  [4.1ms]

Machine B (a few seconds later):

$ MACHINE_ID=B node server.js &

$ rustunnel http 8080 \
    --subdomain api \
    --group api-group \
    --group-key s3cr3t \
    --health-path /health \
    --health-interval 5s \
    --health-max-fails 2

rustunnel v0.7.1  relay: eu.edge.rustunnel.com
✓  connected    control channel established
✓  registered   api.eu.edge.rustunnel.com → localhost:8080
✓  group        joined api-group (2 of 2 members healthy)
✓  health       GET /health → 200  [3.8ms]

Machine A's terminal updates the moment B joins:

ℹ  group        api-group: 2 of 2 members healthy

What the dashboard shows

Open /dashboard/groups after both clients connect. The groups table shows one row per group with columns for group name, member count, healthy count, and the subdomain it's bound to. Click the group name for a per-member breakdown:

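| Member | Peer ID | Status  | Last probe | Consecutive fails |
|--------|---------|---------|------------|-------------------|
| A      | a1b2c3  | healthy | 200ms ago  | 0                 |
| B      | d4e5f6  | healthy | 180ms ago  | 0                 |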

The relay actively polls both backends every 5 seconds. The "consecutive fails" counter increments in real time as probes fail — you can watch it tick up when you break a backend, then watch it reset to zero when the backend recovers. No page refresh needed; the table streams updates via SSE.

Simulating a partial outage

Break Machine B's /health without stopping the process (the rest of the app keeps serving):

$ curl -X POST http://localhost:8080/break
{"ok":false}

The health endpoint now returns 503. Two failed probes later (between 5 and 10 seconds at --health-interval 5s --health-max-fails 2, depending on where the break lands in the probe cycle), Machine A's terminal shows:

⚠  group        api-group: B (d4e5f6) removed — 2 consecutive fails
⚠  group        api-group: 1 of 2 members healthy, routing to A only

The dashboard updates to show B as "unhealthy (2/2 fails)". Every inbound request to api.eu.edge.rustunnel.com routes exclusively to A. Recover B:

$ curl -X POST http://localhost:8080/recover
{"ok":true}

Within one probe interval (at most 5 seconds):

✓  group        api-group: B (d4e5f6) re-entered rotation
✓  group        api-group: 2 of 2 members healthy

Wiring up Prometheus → Alertmanager → Slack

rustunnel-server exposes a rustunnel_group_member_unhealthy_total counter on its Prometheus metrics endpoint (:9090/metrics by default — see the self-hosting guide if you're running your own relay). Each time a member is pulled from rotation the counter increments, labelled with group and peer_id.

1. Scrape config

# prometheus.yml
scrape_configs:
  - job_name: rustunnel
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

2. Alert rule

# rules/rustunnel.yml
groups:
  - name: tunnel_lb
    rules:
      - alert: TunnelGroupMemberUnhealthy
        expr: increase(rustunnel_group_member_unhealthy_total[2m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Group {{ $labels.group }} lost a backend"
          description: >
            Backend {{ $labels.peer_id }} was pulled from rotation in group
            {{ $labels.group }}. Check the tunnel client on that machine.

3. Alertmanager receiver

# alertmanager.yml
receivers:
  - name: slack-tunnel-alerts
    slack_configs:
      - channel: '#infra-alerts'
        api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        title: '{{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts -}}
          *{{ .Annotations.summary }}*
          {{ .Annotations.description }}
          {{ end }}
 
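route:
  receiver: slack-tunnel-alerts
  group_by: [alertname, group]
  routes:
    # Alertmanager rejects matchers on the root route, so the filter lives on a child route.
    - matchers:
        - alertname = TunnelGroupMemberUnhealthy
      receiver: slack-tunnel-alerts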

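When Machine B drops, Prometheus picks up the counter increment on its next scrape (at most 15 seconds later) and fires the alert on its next rule evaluation. Alertmanager then routes the alert to Slack once group_wait has elapsed. Your #infra-alerts channel receives a message within roughly 30 seconds of the second failed probe only if evaluation_interval and group_wait are set to a few seconds each; with the defaults (1m and 30s) expect up to about two minutes. Tighten the scrape interval to 5s if you need faster alerting, but be aware of the additional scrape load.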
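
To see where the time goes, it helps to add the stages up. The function below is a simplified worst-case model, not a measurement: worstCaseAlertDelay is a hypothetical helper, and it assumes each stage waits its full interval. The 60 and 30 in the example are the Prometheus evaluation_interval and Alertmanager group_wait defaults, which apply when those settings are not overridden.

```javascript
// Worst-case seconds from a backend breaking to the Slack notification being sent.
// Parameter names mirror relay, Prometheus, and Alertmanager settings; the model is a simplification.
function worstCaseAlertDelay({ probeInterval, maxFails, scrapeInterval, evaluationInterval, groupWait }) {
  const detection = probeInterval * maxFails // relay pulls the member from rotation
  return detection + scrapeInterval + evaluationInterval + groupWait
}

// This walkthrough's probe and scrape settings, with default evaluation_interval and group_wait.
console.log(worstCaseAlertDelay({
  probeInterval: 5, maxFails: 2, scrapeInterval: 15, evaluationInterval: 60, groupWait: 30,
})) // 115

// The same setup with evaluation_interval and group_wait tightened to 15s and 10s.
console.log(worstCaseAlertDelay({
  probeInterval: 5, maxFails: 2, scrapeInterval: 15, evaluationInterval: 15, groupWait: 10,
})) // 50
```

The scrape interval is only one of four terms, so lowering it alone has limited effect when the evaluation interval and group_wait are left at their defaults.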

Round-robin vs weighted-random vs sticky

These are the three dispatch modes available to a group. The relay picks the mode based on flags the members register with.

Round-robin (default)

With no --weight or --sticky flag the relay cycles through healthy members in strict order: request 1 goes to A, request 2 to B, request 3 to A, and so on. This is the right default for homogeneous backends — same hardware, same app version, equal traffic.

Round-robin keeps no per-client state on the relay side. The only state is an atomic counter per group, which the relay increments on every accepted request. There is no session affinity and no memory between requests; even long-polling connections land wherever the counter points at the moment the request arrives.
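
A counter-based picker along those lines might look like the sketch below. It is illustrative only: RoundRobin and pick are hypothetical names, and indexing the counter into the currently healthy subset is an assumption about how unhealthy members are skipped.

```javascript
// Illustrative round-robin picker; a single counter per group is the only state.
class RoundRobin {
  constructor() {
    this.counter = 0
  }

  // members: [{ id, healthy }]; returns the chosen member or null if none are healthy.
  pick(members) {
    const healthy = members.filter((m) => m.healthy)
    if (healthy.length === 0) return null
    const chosen = healthy[this.counter % healthy.length]
    this.counter += 1
    return chosen
  }
}

const rr = new RoundRobin()
const members = [{ id: 'A', healthy: true }, { id: 'B', healthy: true }]
console.log([1, 2, 3].map(() => rr.pick(members).id)) // [ 'A', 'B', 'A' ]
```

JavaScript is single-threaded, so a plain number stands in for the atomic counter a multi-threaded relay would need.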

Weighted-random

When backends are not equal in capacity, use --weight to tilt the distribution:

# Machine A: larger box, absorbs 70% of traffic
rustunnel http 8080 \
  --group api-group --group-key s3cr3t \
  --weight 0.7
 
# Machine B: smaller box, absorbs 30%
rustunnel http 8080 \
  --group api-group --group-key s3cr3t \
  --weight 0.3

Weights are relative ratios, not absolute percentages. Registering both machines at --weight 2.0 is identical to both at --weight 1.0. The relay normalises weights across the healthy member set only — so if Machine B fails health checks and leaves the pool, all traffic goes to Machine A regardless of the configured ratio.

This makes weighted-random useful for gradual rollouts: deploy the new build on Machine B with --weight 0.1 (against --weight 0.9 on Machine A), verify it handles the 10% slice cleanly, then bump the weight incrementally. If B breaks, weight normalisation sends 100% of traffic back to A automatically, with no manual intervention.
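
Normalising over the healthy set falls out naturally if the picker sums only healthy weights before drawing. The sketch below shows one common way to do weighted-random selection; pickWeighted is a hypothetical name, and the injectable rand parameter exists only to make the example deterministic.

```javascript
// Illustrative weighted-random picker. `rand` is injectable so the result is reproducible.
function pickWeighted(members, rand = Math.random) {
  const healthy = members.filter((m) => m.healthy)
  const total = healthy.reduce((sum, m) => sum + m.weight, 0)
  if (total <= 0) return null
  // Draw a point in [0, total) and walk the members until it falls inside one band.
  let threshold = rand() * total
  for (const m of healthy) {
    threshold -= m.weight
    if (threshold < 0) return m
  }
  return healthy[healthy.length - 1] // guard against floating-point drift
}

const pool = [
  { id: 'A', weight: 0.7, healthy: true },
  { id: 'B', weight: 0.3, healthy: true },
]
console.log(pickWeighted(pool, () => 0.5).id) // A (0.5 falls in A's 0.0 to 0.7 band)
console.log(pickWeighted(pool, () => 0.9).id) // B (0.9 falls in B's 0.7 to 1.0 band)
```

Because the draw is scaled by the sum of healthy weights, doubling every weight changes nothing, and an unhealthy member's share is redistributed without any extra bookkeeping.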

Sticky sessions

When requests for the same user session must land on the same backend (shopping carts, in-progress file uploads, WebSocket upgrades handled at the application layer), use --sticky cookie:

# Both backends register with cookie stickiness
rustunnel http 8080 \
  --group api-group --group-key s3cr3t \
  --sticky cookie

The relay sets a __rt_sticky session cookie on the first response from the group. Subsequent requests from the same browser include that cookie, and the relay routes them back to the same member as long as it is healthy. If the pinned backend fails health checks, the relay re-hashes to the next available member, re-sets the cookie, and the user loses their sticky slot but does not receive a 503.
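
The routing decision described above can be sketched in a few lines. Only the cookie name __rt_sticky comes from this post; the function names, the use of the peer ID as the cookie value, and the cookie attributes are assumptions for illustration.

```javascript
// Illustrative sticky routing. Storing the peer ID in the cookie is an assumption.
const COOKIE = '__rt_sticky'

function readCookie(header, name) {
  for (const part of (header || '').split(';')) {
    const [key, ...rest] = part.trim().split('=')
    if (key === name) return rest.join('=')
  }
  return null
}

// Returns the chosen member and, when the pin changed, a Set-Cookie value.
function routeSticky(cookieHeader, members, fallbackPick) {
  const healthy = members.filter((m) => m.healthy)
  if (healthy.length === 0) return { member: null, setCookie: null }
  const pinnedId = readCookie(cookieHeader, COOKIE)
  const pinned = healthy.find((m) => m.id === pinnedId)
  if (pinned) return { member: pinned, setCookie: null }
  // No usable pin: pick a healthy member and re-set the cookie.
  const member = fallbackPick(healthy)
  return { member, setCookie: `${COOKIE}=${member.id}; Path=/; HttpOnly` }
}

const group = [{ id: 'a1b2c3', healthy: true }, { id: 'd4e5f6', healthy: false }]
const first = (list) => list[0]
console.log(routeSticky('__rt_sticky=d4e5f6', group, first))
// The pinned backend is unhealthy, so the request moves to a1b2c3 and the cookie is re-set.
```

Omitting Max-Age and Expires is what makes this a session cookie: the pin disappears when the browser session ends.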

The canonical use case in development is rolling-deploy testing: run the new build on Machine B, register both members with --sticky cookie, and share a staging URL with your QA team. Each tester stays pinned to whichever build served their first request, so a session never flips between versions mid-test. When QA signs off, use --weight to cut traffic over gradually.

For non-browser HTTP clients (curl, SDKs, API consumers) that don't preserve cookies, --sticky ip hashes on the client IP address instead. See the docs/reference/load-balancing reference for the full flag list including per-group TOML config equivalents.
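
IP-based stickiness amounts to hashing the client address into the healthy member list. The sketch below uses FNV-1a purely because it fits in a few lines; the hash rustunnel actually uses is not stated in this post, and pickByIp is a hypothetical name.

```javascript
// FNV-1a 32-bit hash; chosen here for brevity, not because rustunnel uses it.
function fnv1a(text) {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // 32-bit multiply, kept unsigned
  }
  return hash
}

// The same IP always maps to the same member while the healthy set is unchanged.
function pickByIp(clientIp, members) {
  const healthy = members.filter((m) => m.healthy)
  if (healthy.length === 0) return null
  return healthy[fnv1a(clientIp) % healthy.length]
}

const backends = [{ id: 'A', healthy: true }, { id: 'B', healthy: true }]
const one = pickByIp('203.0.113.7', backends)
const two = pickByIp('203.0.113.7', backends)
console.log(one.id === two.id) // true
```

With plain modulo hashing, a member joining or leaving can remap clients that were not pinned to it; consistent hashing reduces that churn if it matters for your workload.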

Caveats

Group-LB is stable and ships in the open-source server, but four things are worth knowing before you lean on it in production.

WebSocket upgrades pin to one backend for the connection lifetime. Round-robin and weighted-random operate at HTTP request granularity — each GET /api/data can land on a different backend. A WebSocket upgrade is a single HTTP request followed by a protocol switch; the relay picks a backend at upgrade time and holds that connection on that backend until it closes. There is no mid-stream migration. If most of your traffic is long-lived WebSocket connections, you will see natural stickiness even without --sticky — and if a backend fails while connections are open, those connections drop (the client is responsible for reconnecting). WebSocket pinning applies equally to P2P tunnels in relayed mode; see the P2P tunnel reference for the distinction between direct and relayed paths.

Health checks consume real bandwidth at high polling frequency. Each probe is a full HTTP round-trip over the tunnel's control channel. At the default 10-second interval across a handful of members the overhead is negligible. At 1-second intervals across a 50-member group you are sending 50 probes per second per relay; at roughly 200 bytes per round-trip that is around 10 KB/s of pure probe traffic, before any user traffic arrives. Default to 10-second intervals in production. Drop to 2 seconds if your recovery-time SLA requires it. Sub-second intervals are available but rarely justified outside synthetic testing.
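
The arithmetic in that paragraph generalises to any group size and interval. The helper below is a back-of-envelope calculator under the same assumption of roughly 200 bytes per probe round-trip; probeBandwidth is a hypothetical name, not a rustunnel API.

```javascript
// Estimate probe load for one group on one relay.
// bytesPerProbe defaults to the rough 200-byte round-trip figure used above.
function probeBandwidth({ members, intervalSeconds, bytesPerProbe = 200 }) {
  const probesPerSecond = members / intervalSeconds
  return { probesPerSecond, bytesPerSecond: probesPerSecond * bytesPerProbe }
}

console.log(probeBandwidth({ members: 50, intervalSeconds: 1 }))
// { probesPerSecond: 50, bytesPerSecond: 10000 }
console.log(probeBandwidth({ members: 5, intervalSeconds: 10 }))
// { probesPerSecond: 0.5, bytesPerSecond: 100 }
```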

If all members fail health checks, the relay returns 503. There is no fallthrough to a static error page or a secondary group yet. The 503 Service Unavailable response body carries a JSON payload with the group name and a timestamp so your monitoring can distinguish a "no healthy backend" 503 from a generic upstream failure. A configurable fallback target — either a static URL or a secondary group name — is tracked in TUNNEL-9; if that matters for your use case, follow that issue.
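
On the monitoring side, telling the two kinds of 503 apart comes down to inspecting the body. The check below is a sketch under a stated assumption: the post says the payload carries the group name and a timestamp but does not give field names, so group and timestamp are hypothetical and should be verified against the real response.

```javascript
// Field names `group` and `timestamp` are assumptions; confirm them against a real 503 body.
function isNoHealthyBackend(status, bodyText) {
  if (status !== 503) return false
  try {
    const body = JSON.parse(bodyText)
    return body !== null && typeof body === 'object' &&
      typeof body.group === 'string' && 'timestamp' in body
  } catch {
    return false // not JSON, so treat it as a generic upstream failure
  }
}

console.log(isNoHealthyBackend(503, '{"group":"api-group","timestamp":"2024-01-01T00:00:00Z"}')) // true
console.log(isNoHealthyBackend(503, '<html>Service Unavailable</html>')) // false
```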

The group_key is a shared secret, not an ACL. Any client that knows the (subdomain, group, group_key) triple can join the group. On the managed cloud, group keys are scoped to your account and tenant-isolated at the relay. On self-hosted relays, treat group_key the same way you would treat a shared HMAC secret: rotate it if it leaks, and keep it out of public dotfiles and source control.

How this stacks up vs frp / cloudflared / ngrok

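| Feature               | frp | cloudflared                  | ngrok       | rustunnel |
|-----------------------|-----|------------------------------|-------------|-----------|
| Group LB              |     | ❌ (Cloudflare-side LB only) | Paid (Pro+) |           |
| Active health checks  |     |                              |             |           |
| Weighted distribution |     |                              |             |           |
| Sticky sessions       |     |                              |             |           |
| Self-hostable         |     |                              |             |           |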

Next steps