No More Hop Limits: What if Every Hop Cost Just 1 TX Instead of n? #9936

ClemensSimon · 2026-03-18T05:45:16Z

Context: What Meshtastic Already Does Well

Meshtastic's routing (v2.6/2.7) is already substantially better than naive flooding:

Managed Flooding suppresses ~40-50% of rebroadcasts via SNR-based contention windows
Next-Hop Routing (v2.6) learns relay nodes for direct messages, reducing DM cost to roughly one TX per hop after the first flood
ROUTER/ROUTER_LATE roles ensure backbone coverage
Congestion Scaling automatically stretches intervals for 40+ node networks

This proposal doesn't replace these — it builds on the same principles and asks: can we extend directed routing to all message types, not just DMs?

The Remaining Bottleneck

Both managed flooding and next-hop still scale as O(n) per broadcast message. The hop limit (3-7) remains necessary because each hop multiplies transmissions proportional to network size. This caps effective range.

Proposal: System 5 — O(hops) for Everything

A routing approach that achieves ~1 TX per hop for all traffic types:

Geo-Clustering — Nodes self-organize by GPS geohash prefix. Full topology within clusters, border nodes between.
Multi-Path Routing — 2-3 cached paths per destination with instant failover (no rediscovery flood).
Weighted Load Balancing — W(r) = α·Q + β·(1-Load) + γ·Batt distributes traffic proportionally.
Adaptive QoS — Network Health Score per cluster throttles low-priority traffic. SOS always passes.
Fallback — Scoped cluster flooding (not full network) when all routes fail.
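
To make the geo-clustering item above more concrete, here is a minimal sketch of grouping nodes by geohash prefix. The encoder follows the standard geohash scheme; the `cluster_id` helper and the prefix length of 4 are assumptions for illustration, not something taken from the MeshRoute simulator.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a position as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)


def cluster_id(lat: float, lon: float, prefix_len: int = 4) -> str:
    # Hypothetical rule: nodes sharing the first `prefix_len` characters form one cluster.
    return geohash(lat, lon, prefix_len)
```

A shorter prefix gives larger cells, so the prefix length would be the knob for cluster size. Nodes close to a cell edge can land in different clusters even when they hear each other well, which is presumably where border nodes come in.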

Simulation: System 5 vs. Managed Flooding

Python simulator with an EU868 LoRa model, tested on identical networks with four routing algorithms (Naive Flood, Managed Flood, Next-Hop, System 5):

Scenario	Managed Flood TX	System 5 TX	S5 Delivery	Savings vs Managed
Small (20 nodes, 1km)	17,045	112	100%	99.3%
City (100 nodes, 5km)	155,774	196	100%	99.9%
Regional (500 nodes, 20km)	708,720	497	100%	99.9%
Dense Urban (200 nodes, 3km)	1,239,692	136	100%	100%
1000 nodes (40km)	1,462,489	10,635	99%	99.3%
1500 nodes (50km)	2,119,189	38,850	97%	98.2%
30% degraded links	170,214	4,701	100%	97.2%
50% degraded links	183,019	16,055	73%	91.2%
20% nodes killed	108,413	2,853	85%	97.4%

The key metric: max load on the busiest node drops from 4,500-19,900 (managed flood) to 6-80 (System 5).

Biggest Practical Consequence

The hop limit becomes irrelevant. Each hop costs ~1 TX regardless of network size. 20 hops cost less than managed flooding costs for 1. This means:

No artificial range cap
SHORT_FAST with more hops works as well as LONG_SLOW with fewer
Battery nodes far from the path don't transmit at all

Try It

Live interactive demo: https://clemenssimon.github.io/MeshRoute/
GitHub (simulator + docs): https://github.com/ClemensSimon/MeshRoute
MIT licensed

The demo shows side-by-side animations of all four routing approaches on identical topology, simulation results with interactive charts, and resilience testing.

Questions for the Community

Does this approach address real pain points you're seeing in larger meshes?
What scenarios should the simulator test that I'm missing?
Is a firmware prototype (as optional routing module alongside managed flooding) worth pursuing?
The GPS requirement for geo-clustering — dealbreaker or acceptable tradeoff?

shalberd · 2026-03-18T11:28:27Z

Congestion Scaling automatically stretches intervals for 40+ node networks

not anymore ... I do not agree with it, but in any case, it seems that in firmware 2.7.20, scaling does not apply anymore for ROUTER_LATE and other device roles #9818

I don't know why that is ... my only explanation is: comfortable US frequency slot / number of slots situation and disregard for situation in other areas of the world in terms of LoRa. It'd be nice if there were feature gates for stuff like 0-cost routing (was suggested once by @GUVWAF) and exemptions to scaling .. gating those features to short presets only or to any region except EU_868 ... but it is as it is.

2 replies

thebentern Mar 18, 2026
Maintainer

ROUTER_LATE should be taking on the coarse-grained interval minimums ROUTER had previously. I will double check that though, because if not, this is a missed spot.

shalberd Mar 18, 2026

cool, thank you very much, @thebentern do you mean those defaults for IF_ROUTER https://github.com/meshtastic/firmware/blob/master/src/mesh/Default.h#L17

and those applying, or not yet applying, to ROUTER_LATE? I think @h3lix1 fixed this in https://github.com/meshtastic/firmware/pull/9815/changes

It is good to see that, despite not scaling up the intervals even more, at least those two roles have very coarse default intervals of a day / half a day.

It'd be good if, for certain regions like EU_868 (mostly a single frequency slot, see https://meshtastic.org/docs/overview/radio-settings/#frequency-slot-calculator) and for non-short presets, there were feature gates for the exemptions, even for ROUTER and ROUTER_LATE, but especially so for certain sensor / tracker roles.
Why feature-gate the exemption from scaling and 0-cost hops so that they apply everywhere except EU_868 and non-short presets? Because the traffic situation in EU_868 is very, very congested, so anything that adds to congestion in large networks or does not take it into account, in our case on the MEDIUM_FAST preset, is bad. All presets except LONG_MODERATE and LONG_SLOW use the same frequency; thanks, regulator ;-)

Lastly, thank you for your great work in the project, we appreciate it very much. As mentioned ... we've just got different regulatory and frequency slot situation contexts 'round here. Most of our EU_868 presets work on frequency 869.525 MHz and, to add to it, _TURBO presets with 500 kHz bandwidth are not allowed, either. Still .. we like Meshtastic and the spirit of your project and do our best here to work within our constraints. Cheerio / have a good day.

ClemensSimon · 2026-03-19T05:12:39Z

Suggested labels: mesh, enhancement

Keywords for discoverability: range extension, performance, scalability, bandwidth efficiency, hop limit

ClemensSimon · 2026-03-19T06:10:17Z

Update: Realistic Hop Limits Reveal Delivery Collapse

The original simulation used TTL=30, which masked a critical problem. With Meshtastic's actual hop limits (3, 5, 7), managed flooding's delivery rate collapses at scale:

Scenario	Nodes	3-hop	5-hop	7-hop	System 5
Small Local	20	100%	100%	100%	100%
Medium City	100	92%	100%	100%	100%
Large Regional	500	14%	31%	51%	76%
1000 Nodes	1000	2%	6%	6%	45%
1500 Nodes	1500	2%	4%	5%	51%
Rural Long Range	50	77%	91%	88%	100%
Maritime	30	69%	82%	84%	100%
Disaster Relief	80	62%	87%	84%	78%

Key Insight

The hop limit is not just a range cap — it is a delivery ceiling. At 1000+ nodes, managed flooding delivers fewer than 1 in 10 messages regardless of hop limit setting. System 5 delivers 7.5x more messages with fewer total transmissions.

This means the current routing doesn't just waste bandwidth — it fails to deliver in the exact scenarios where mesh networking matters most (large, spread-out, disaster relief).

Updated demo and results: https://clemenssimon.github.io/MeshRoute/

shalberd · 2026-03-20T11:52:00Z

can we extend directed routing to all message types, not just DMs

that'd be awesome and a real differentiator from MeshCore

The GPS requirement for geo-clustering — dealbreaker or acceptable tradeoff?

Would not call it a GPS requirement. More a position info requirement. At least for major mountain role CLIENT, ROUTER_LATE nodes, we already use coordinates / position info. A little fuzzed in terms of precision, and at times also just manually input as fixed coordinates when no GPS module is present or when activating it would consume too much energy. Personally, I'd be fine with taking into account position information of nodes.

h3lix1 · 2026-03-21T05:42:36Z

My question with System 5 is how does it work with asynchronous paths?

It looks clean in a simulated setup like this, but as paths change there is an increase of out of order messaging. (i.e. messages sent A B C can be received C B A if sent via 3 different paths), and meshtastic doesn't really do a good job of keeping sequence in the app like TCP does.

The second part is async routing - the path back can vary greatly from the path sent. The path to get to a mountain top router might take 3 hops, but the path back is a single hop. Trying to route the traffic back through the same 3 hops to get to the destination really isn't efficient.

For bay mesh, the best way to describe it is that the flood routing mesh all happens above 2000 feet. The challenge we see today with chutil is that for 5% actual utilization, the router mountain nodes hear about 10 different roof nodes sending packets at the same time, 4 more other long distance router nodes, etc. The result is a mess of collisions that result in 5% utilization turning into 50% utilization.

Let's take for example, SUNL - my node in the bay area.

The challenge is all the clients repeating packets cause a mess at high altitudes. We try our best to remove this by having enough routers to make sure most clients hear each message at least twice (suppressing responses), but that itself also causes higher utilization.

It would be interesting to see more simulations where a single node can send to 100 nodes, but only hears 10. Where the circles for mountain top nodes can send far and wide (30+ miles) to other mountain top nodes and be heard by those mountain top nodes, but can also hear hilltop nodes, and then those hilltop/roof nodes hear both valley nodes and in-building nodes. The fun part is the mountain nodes also have a good chance of sending traffic directly to the valley and in-building nodes without needing the hilltop or roof nodes at all.

Keep in mind, while the hilltop and roof nodes are sending, the mountain nodes are blocked from sending. Somehow this also needs to be added to the simulation to get past the real crux of the issue, which is half-duplex communication means the inability to send due to listening to a bunch of other nodes sending.

It also doesn't really solve the problem of how a client knows it missed a message. For example, in your example "How the Network Builds Itself", when it gets to the "load balancing" part, there is only one path that works, A-C-F-L-M-K-O. It's not very load balanced, but it also means there are 3 nodes that all make a path of failure. If any of those 3 nodes fails due to intermittent issues, the message will not be received.

Lastly, routing tables require memory. Some vendors are moving to using nrf52 devices for 1 watt nodes, and many love using solar nodes (also, nrf52) as routers. Many of these nodes can only hold 80-100 nodes in the nodedb. What is the memory expectation for something like this with 1000-10000 nodes?

ClemensSimon · 2026-03-21T10:05:18Z

Hey @h3lix1,

Thank you for the detailed feedback! Your questions about half-duplex blocking at SUNL, out-of-order messaging, and
nRF52 memory directly led to 5 new features being implemented. Here's what changed and what the numbers look like now.

What Your Feedback Built

Half-Duplex Model — You said: "mountaintop nodes are blocked from sending."
Done. Per-node radio state machine (IDLE/TX/RX) in simulator. Nodes can't TX while receiving.

Node Silencing — You said: "clients repeating packets at high elevations cause a mess."
Done. Redundant valley nodes are muted — they still listen, but don't rebroadcast. Battery-fair rotation every 10 min.
128 of 193 valley nodes silenced, TX cost halved.

Sequence Numbers — You said: "messages A B C can be received C B A."
Done. 2-byte per-(src,dst) sequence counter in packet header. App can detect gaps and reorder. Zero extra TX.

Emergency Re-Route — You said: "only one path works, 3 single points of failure."
Done. Fresh BFS excluding failed nodes before corridor flooding. 7 failover layers total.

Bay Area 3-Tier Topology — You said: "mountaintop routers hear 10 rooftop nodes simultaneously."
Done. 7 mountain (45km range) + 35 hill (10km) + 193 valley (2.5km) nodes with asymmetric links.
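
For readers who want to picture the half-duplex item above, this is a minimal sketch of a per-node IDLE/TX/RX state machine. It is not the simulator's code; the class and method names are hypothetical, and collisions are reduced to "a busy radio misses the packet".

```python
from enum import Enum


class RadioState(Enum):
    IDLE = "idle"
    TX = "tx"
    RX = "rx"


class HalfDuplexRadio:
    """Toy half-duplex radio: it can either send or receive, never both."""

    def __init__(self):
        self.state = RadioState.IDLE
        self.busy_until = 0.0  # simulated time at which the radio returns to IDLE

    def _refresh(self, now):
        # Fall back to IDLE once the current TX or RX has finished.
        if self.state is not RadioState.IDLE and now >= self.busy_until:
            self.state = RadioState.IDLE

    def start_rx(self, now, airtime):
        """A neighbor starts transmitting; an idle radio locks onto it."""
        self._refresh(now)
        if self.state is RadioState.IDLE:
            self.state = RadioState.RX
            self.busy_until = now + airtime
            return True
        return False  # already busy: the packet is missed

    def try_tx(self, now, airtime):
        """Transmit only if idle; otherwise the caller has to defer."""
        self._refresh(now)
        if self.state is not RadioState.IDLE:
            return False
        self.state = RadioState.TX
        self.busy_until = now + airtime
        return True
```

A mountaintop radio that hears ten rooftop nodes spends most of its time in `RX`, so `try_tx` keeps returning `False`. That is the blocking effect described for SUNL.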

For the technical deep-dive, see https://clemenssimon.github.io/MeshRoute/how-it-works.html — sections on
https://clemenssimon.github.io/MeshRoute/how-it-works.html#silencing,
https://clemenssimon.github.io/MeshRoute/how-it-works.html#halfduplex, and
https://clemenssimon.github.io/MeshRoute/how-it-works.html#seqnums.

Bay Area Results (235 nodes, half-duplex)

Metric	Managed Flood	System 5	S5 + Silencing
Delivery Rate	6.0%	77.5%	74.5%
Total TX	6,752	540,780	267,927
Under Stress	4.0%	52.0%	51.0%
Nodes Silenced	0	0	134 (57%)

Key finding: Half-duplex collapses managed flooding from 87.5% to 6% delivery — your SUNL problem exactly. Mountaintop
stuck in RX from 10+ simultaneous rebroadcasts. System 5 holds at 77.5% because directed routing sends 1 packet
instead of 14. Node Silencing halves TX by muting 128 valley nodes. All 7 mountain nodes stay active.

Your Questions — Quick Answers

Async paths / out-of-order: 2-byte sequence counter, zero extra TX. App detects gaps.
Asymmetric return paths (3 up, 1 down): Already works — per-direction link qualities.
SUNL collision cascade: 3 layers — directed routing + node silencing + backpressure.
Missing messages: Sequence numbers for gap detection. Full ACKs too expensive for LoRa.
Single point of failure: 5 cached routes + emergency BFS + corridor flood = 7 layers.
nRF52 memory at 10K nodes: ~30KB with clustering, ~15KB with reduced params. Seq counters: 128 bytes via LRU.
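
The sequence-number answer above can be sketched as follows: a receiver keeps the next expected 16-bit counter per (src, dst) pair and reports which numbers were skipped. The class name and the "half the number space" rule for late packets are assumptions, and the LRU eviction mentioned above is left out.

```python
SEQ_MOD = 1 << 16  # a 2-byte counter wraps at 65536


class GapDetector:
    """Tracks the next expected sequence number for each (src, dst) pair."""

    def __init__(self):
        self.expected = {}

    def on_receive(self, src, dst, seq):
        """Return the sequence numbers skipped before `seq` (empty if in order)."""
        key = (src, dst)
        if key not in self.expected:
            self.expected[key] = (seq + 1) % SEQ_MOD
            return []
        exp = self.expected[key]
        ahead = (seq - exp) % SEQ_MOD
        if ahead >= SEQ_MOD // 2:
            # Treated as an old or duplicate packet arriving late; keep the window.
            return []
        self.expected[key] = (seq + 1) % SEQ_MOD
        return [(exp + i) % SEQ_MOD for i in range(ahead)]
```

This only detects gaps. Whether the app reorders, waits, or asks for a resend is a separate decision that the counter alone does not settle.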

Try It

https://clemenssimon.github.io/MeshRoute/simulator.html — select "Bay Area Mesh" or "Bay Area + Silencing"
https://clemenssimon.github.io/MeshRoute/how-it-works.html — step-by-step technical deep dive
https://clemenssimon.github.io/MeshRoute/ — all 26 scenarios with category filters.
https://github.com/ClemensSimon/MeshRoute — MIT license

Your feedback genuinely made this better. The half-duplex insight alone was worth the entire conversation — it
revealed that the real problem isn't routing efficiency but radio physics at elevated nodes.
Clemens

2 replies

h3lix1 Mar 21, 2026

This seems good for unicast, and potentially could be accomplished with bloom filters, but meshtastic is currently 98% broadcast traffic for us - all nodes should receive the same message. (For example, position packets, nodeinfo packets, telemetry packets)

For the simulators, I can only see examples of unicast messaging. Is there an example of a message that is broadcast to all nodes?

shalberd Mar 22, 2026

Hi, thank you @h3lix1 for having documented the Bloom filter concept on GitHub back then. @ClemensSimon FYI #8592

Also, have a look at this comment back then on how to improve things by @fifieldt #6199 (reply in thread)

He said in answer to someone from Seattle WA Puget sound area (similar in terms of lateral valleys, high elevation mountains) regarding regional aspects / regions clustering and a mix of Autobahn style directed routing from key location to key location, e.g. mountain to mountain vs. normal flood routing for regional distribution:

Purely theoretical at the moment and only in my head. Just an idea from the old days of community wireless networks. For large meshes, separate the mesh into "interior" and "exterior" routing. "Interior" routing works just like right now. "Exterior" routing would be used to join disparate clusters of "interior" routed mesh that are separated by geography. I'll have to get out a diagram and more words than I can type on a cellphone to explain properly. Haven't even thought through the concept fully yet :)

Big thanks from me, too, to @ClemensSimon for the new ideas and trains of thought.

ClemensSimon · 2026-03-21T23:54:31Z

Hey @h3lix1,

Good question -- but I want to make sure I understand the requirement correctly.

You say 98% of traffic is broadcast. But broadcast to whom exactly? If there's no hop limit anymore, does that mean every position packet from every node should reach all nodes in the entire network? For a 1000+ node mesh, that seems like it would be the root cause of the collision problem you described, not the solution.

Could you help me understand the intent behind these broadcasts?

- Position/Telemetry: Does every node really need to know the position of every other node? Or is it more like "nodes within my region" or "nodes I've recently communicated with"?
- NodeInfo: Same question -- is this needed network-wide or just locally?
- Groups: If the use case is "send to a defined group of nodes" (e.g., all nodes in the Bay Area, or all members of a channel), that's something I could implement as a group routing feature -- not full broadcast, but targeted multicast.

Right now the simulator handles unicast (point-to-point) and managed flood (full broadcast). If the real need is something in between -- like group-based multicast -- I'd need to build that. Happy to do so if that's the actual use case.

1 reply

h3lix1 Mar 22, 2026

Hi @ClemensSimon -

Most of the state in Meshtastic (nodedb entries, etc.) is carried by push-based broadcast messages. This includes position packets, nodeinfo packets, telemetry packets, and channel messages. The only things unicast today are DMs and traceroute packets. There are also a few on-demand information-gathering requests that a node can make, but the majority of the time things are flood routed.

There could be a push in future to do pull-based information gathering instead of the push-based approach used today, but for now almost all messages are network-wide. The channels aren't split by locality today (i.e. "Bay Area") but are just the default channel, "MediumFast" for example. I could see a world where channels are more localized, but it's not configured that way today.

I can't seem to find how system 5 will handle the flood option (if there is an option to do flood routing with a system 5 configuration)

I do appreciate the effort here though - if we can crack how to make flood routing with as little airtime as possible, we might be on to something.

ClemensSimon · 2026-03-23T09:30:28Z

Draft Response to @h3lix1 and @shalberd — Broadcast Routing in System 5

STATUS: DRAFT (for review before posting)

Hey @h3lix1, @shalberd,

Thank you for the clarity on the 98% broadcast reality — that's the critical piece I needed. You're right that System 5 as demonstrated primarily optimizes unicast. Let me lay out how the architecture can handle broadcast traffic, and where @h3lix1's Bloom Filter idea fits in.

The Broadcast Problem, Precisely

In a 235-node Bay Area mesh with managed flooding:

1 position packet → 235+ TX (every node rebroadcasts)
100 nodes sending position every 15 min → ~94,000 TX/hour
With half-duplex collisions, most of those TX are wasted

The question isn't "should every node hear every position?" — it's "can we deliver the same broadcast reach with fewer TX?"

Three Approaches That Could Work Together

1. Cluster-Scoped Broadcast (System 5 native)

System 5 already has geo-clusters. Use them for broadcast scope:

Intra-cluster: Flood normally within your cluster (small, ~10-30 nodes, manageable)
Inter-cluster: Only border nodes relay to adjacent clusters — 1 TX per cluster boundary instead of N
Result: Broadcast cost goes from O(n) to O(clusters × cluster_size), roughly O(√n)

For Bay Area (7 mountain + 35 hill + 193 valley):

Valley nodes broadcast only within their cluster (~15-20 nodes each)
Hill/mountain border nodes relay between clusters
Estimated: ~15-25 TX per broadcast instead of ~235

2. Bloom Filter Hybrid (@h3lix1's RBF from #8592)

Your Bloom Filter approach and System 5 are complementary:

System 5 knows the cluster topology and border nodes → where to route
Bloom Filters track which nodes have already seen a packet → who to skip

Combined: Border nodes carry a Bloom filter in the broadcast packet. When relaying to the next cluster, nodes already in the filter don't rebroadcast. This handles the overlap zones where clusters share radio range.

The 11-35 byte filter cost is negligible vs. saving dozens of redundant TX at cluster boundaries.
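
As a rough illustration of the "who to skip" half of this idea, here is a small Bloom filter keyed by node ID. It is a generic sketch, not the RBF design from #8592: the 16-byte size, the three hash functions, and the SHA-256 based hashing are all assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over integer node IDs."""

    def __init__(self, size_bytes=16, num_hashes=3):
        self.bits = bytearray(size_bytes)
        self.m = size_bytes * 8  # number of bit positions
        self.k = num_hashes

    def _positions(self, node_id):
        # Derive k bit positions from k salted hashes of the node ID.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{node_id}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.m

    def add(self, node_id):
        for p in self._positions(node_id):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, node_id):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(node_id))


def should_rebroadcast(my_id, seen):
    """Skip the rebroadcast if the filter says this node was already covered."""
    return my_id not in seen
```

A Bloom filter never forgets an ID that was added, but it can report an ID as present when it was not. With a filter this small, a false positive means a node wrongly stays silent, so the filter size would have to be weighed against lost coverage.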

3. @fifieldt's Interior/Exterior Split — Already Built

@shalberd, great catch. System 5's geo-clustering is the interior/exterior split that @fifieldt described:

Interior = intra-cluster routing (flood within cluster, small scope)
Exterior = inter-cluster routing via border nodes (directed, 1 TX per hop)

The only missing piece is applying this to broadcast traffic, not just unicast. The cluster infrastructure is already there.

What I'll Build Next

Cluster-scoped broadcast mode in the simulator — measure TX savings vs. delivery rate
Bloom filter integration at cluster boundaries for overlap deduplication
Broadcast scenario benchmarks — 100 nodes all sending position packets, System 5 cluster-broadcast vs. managed flooding

Honest Limitations

Latency: Cluster-scoped broadcast adds relay hops → slightly higher latency than direct flooding for nearby nodes
Consistency: Not all nodes will have the same view at the same time (but that's already true with hop limits)
OGM overhead: Neighbor discovery still needs some flooding — can't route what you haven't discovered

Would this address your use case? Specifically: if position/telemetry packets reached all nodes within ~5-10 seconds instead of ~1-3 seconds, but used 90% less airtime — would that tradeoff work for Bay Mesh?

— Clemens

ClemensSimon · 2026-03-23T10:29:26Z

Hey @h3lix1, @shalberd,

Your feedback on broadcast traffic being 98% of Meshtastic's workload was the key insight I was missing. I've now implemented and benchmarked a broadcast-specific routing mode that directly addresses this.

The Problem You Identified

System 5 optimized unicast brilliantly (1 TX per hop), but had no answer for broadcast packets (position, nodeinfo, telemetry, channel messages). Managed flooding costs O(n) per broadcast. For Bay Area's 235 nodes, that's 4,301 TX per single position packet.

Solution: Cluster-Distributor Broadcast

Instead of flooding the entire network, broadcast propagates as a wave through clusters:

Elect a Distributor per cluster -- a valley node with high local reach but low signal leakage to other clusters
Source's cluster: distributor does a scoped mini-flood (only within cluster)
Border nodes relay to the next cluster's distributor (1 directed TX to cross boundary)
Next distributor floods its cluster, border nodes relay further
Repeat until all clusters covered

This is essentially @fifieldt's interior/exterior routing concept -- interior = flood within cluster, exterior = directed relay between clusters.
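
The wave described in the steps above can be sketched at cluster level as a breadth-first walk over the cluster graph. This is a simplified model, not the `ClusterDistributorBroadcast` class: it assumes every cluster member transmits once during the mini-flood and that each boundary crossing costs exactly one directed TX.

```python
from collections import deque


def wave_broadcast(cluster_sizes, cluster_links, source_cluster):
    """Return (clusters in the order they are reached, estimated TX count)."""
    order = [source_cluster]
    seen = {source_cluster}
    queue = deque([source_cluster])
    tx = 0
    while queue:
        cluster = queue.popleft()
        # Assumption: scoped mini-flood, each member of the cluster transmits once.
        tx += cluster_sizes[cluster]
        for nxt in cluster_links.get(cluster, ()):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                tx += 1  # one directed border relay into the next cluster
                queue.append(nxt)
    return order, tx


sizes = {"A": 10, "B": 8, "C": 12}
links = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
order, tx = wave_broadcast(sizes, links, "A")
print(order, tx)
```

Under these assumptions the cost is roughly one TX per node plus one per boundary crossing. Passive mountain nodes and free spillover would lower the count further.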

Key Design Decisions

Valley nodes as distributors, not mountain nodes. A mountaintop node broadcasting reaches 10+ clusters simultaneously, causing a collision storm (your SUNL problem exactly). A valley node broadcasting stays contained by terrain -- only its cluster hears it. The distributor election scores:

Score = coverage * (0.3 * containment + 0.4 * elevation_bonus + 0.3 * tier_bonus)

Where valley nodes score ~1.0 and mountain nodes score ~0.1.
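
A quick way to sanity-check the election formula is to plug in example values. In this sketch the formula is the one quoted above, while the candidate names and all input values are made up, assuming each factor is normalized to the range 0 to 1.

```python
def distributor_score(coverage, containment, elevation_bonus, tier_bonus):
    """Score = coverage * (0.3*containment + 0.4*elevation_bonus + 0.3*tier_bonus)."""
    return coverage * (0.3 * containment + 0.4 * elevation_bonus + 0.3 * tier_bonus)


# Hypothetical candidates; every input is an illustrative value in [0, 1].
candidates = {
    "valley-17": dict(coverage=0.95, containment=0.9, elevation_bonus=1.0, tier_bonus=1.0),
    "hill-03": dict(coverage=0.8, containment=0.5, elevation_bonus=0.5, tier_bonus=0.5),
    "mountain-02": dict(coverage=1.0, containment=0.1, elevation_bonus=0.1, tier_bonus=0.1),
}

scores = {name: distributor_score(**factors) for name, factors in candidates.items()}
distributor = max(scores, key=scores.get)
print(scores, distributor)
```

With these inputs the valley node scores about 0.92 and the mountain node 0.10, which matches the ~1.0 versus ~0.1 split stated above. Because coverage is a multiplier, a well-contained node with poor local reach would still lose the election.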

Mountain nodes receive but don't relay. During intra-cluster mini-flood, mountain nodes hear the broadcast (they hear everything) but don't rebroadcast -- their TX range is too large and would leak to other clusters. They're passive receivers, not active relays.

Natural signal spillover is free. When a valley distributor floods its cluster, nearby nodes in adjacent clusters often hear it too -- counted as reached with zero extra TX cost.

Benchmark Results

Tested with 20 broadcasts per scenario, averaged:

Scenario	Managed Flood Reach	Managed Flood TX/msg	Cluster-Dist Reach	Cluster-Dist TX/msg	TX Savings
Small (20 nodes)	71.8%	55	100.0%	22	61%
Medium (50 nodes)	100.0%	430	100.0%	52	88%
Large (100 nodes)	100.0%	2,130	99.8%	116	95%
Dense (200 nodes)	100.0%	8,731	99.9%	220	97%
Regional (500 nodes)	91.5%	5,869	100.0%	517	91%
Bay Area (235 nodes, 3-tier)	90.0%	4,301	96.0%	220	95%

Bay Area: 96% reach with 95% fewer transmissions -- and 6 percentage points MORE reach than managed flooding, because directed routing avoids the collision cascades that kill flooding at scale.

What This Means for Real Traffic

If 100 nodes each send position every 15 minutes:

Managed Flooding: 100 x 4,301 = 430,100 TX per 15 min
Cluster-Distributor: 100 x 220 = 22,000 TX per 15 min

That's the difference between network congestion collapse and comfortable headroom.
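
The arithmetic behind the two totals above, using only the Bay Area TX/msg figures from the benchmark table. The function name is hypothetical, and the model simply multiplies senders by TX per broadcast without modelling collisions or retries.

```python
def tx_per_window(senders, tx_per_broadcast):
    """Total transmissions if every sender emits one broadcast per window."""
    return senders * tx_per_broadcast


managed = tx_per_window(100, 4301)   # 430,100 TX per 15 min
cluster = tx_per_window(100, 220)    # 22,000 TX per 15 min
savings = 1 - cluster / managed      # fraction of transmissions avoided

print(managed, cluster, f"{savings:.1%}")
print("per hour:", managed * 4, cluster * 4)
```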

Bloom Filter Integration

@h3lix1 your Bloom Filter approach from #8592 fits naturally at the cluster boundaries. When a border node relays to the next cluster, it can carry an RBF of which nodes already received the broadcast. The next cluster's distributor checks the filter before relaying to nodes that might already have it from signal spillover. This would reduce the remaining redundancy even further.

Honest About Limitations

Half-duplex remains brutal -- both approaches collapse to single-digit delivery rates. This is radio physics (nodes stuck in RX), not a routing problem. The cluster-distributor does produce 95% fewer TX though, which means fewer collisions.
Latency is slightly higher -- the wave propagates cluster-by-cluster instead of flooding simultaneously. For position packets (not time-critical), this should be acceptable.
Distributor failure requires re-election. Currently not implemented but straightforward -- second-best valley node takes over.

Try It

The live simulator lets you compare all routing approaches side-by-side. Select any scenario including "Bay Area Mesh" and step through hop-by-hop. Source code: simulator/routing.py -- classes ClusterDistributorBroadcast and ManagedFloodBroadcast.

Your question "if we can crack how to make flood routing with as little airtime as possible, we might be on to something" -- I think this is that something. The trick is: don't flood the whole network. Flood small clusters, relay between them.

-- Clemens

korbinianbauer · 2026-03-23T14:34:00Z

Since this is clearly AI-generated, I'll feel free, too:

Non-local information requirements
- Cluster-level Network Health Score (NHS), path-wide battery, and load-aware routing require data from multiple nodes, which is not locally available.
- Example: To compute NHS for an 8-node cluster, each node would need link-quality and queue data from all neighbors every 30 s, creating extra traffic.
Memory constraints
- Multi-path route tables (5 routes × ~70 destinations × ~410 bytes per route) could use ~143 KB, exceeding nRF52 usable RAM (~64 KB).
- Neighbor table (16 entries × 80 B = 1.3 KB) and cluster metadata (~800 B) further add to memory pressure.
Compute overhead
- BFS-based multi-path route computation 5× per destination, route decay updates every 30 s, and emergency reroutes may overload the 64 MHz nRF52840 CPU.
- Example: For a 70-node cluster with 5 routes per destination, BFS complexity is O(V+E) ≈ several hundred operations per route, repeated every update cycle.
Radio / airtime limitations
- OGMs every 30 s per node with rich metadata (~20–40 B) on a 100-node network → ~100 messages per 30 s.
- LoRa SF12 airtime: 500 ms–2 s per packet → channel may be occupied >50 s in a 30 s window if multiple nodes transmit simultaneously.
- Multi-hop retries (3–5 per hop) for 3–5 hop paths multiply transmissions, increasing collisions and duty-cycle risk.
Topology propagation
- Directed routing assumes partial multi-hop topology knowledge beyond immediate neighbors.
- Example: To route across 3 clusters with 2 border nodes each, nodes need ~16 extra entries in routing tables, which requires repeated propagation of neighbor and border info.

Btw. the simulator doesn't do anything when I open the page

ClemensSimon · 2026-03-23T20:50:37Z

Thanks for the detailed review -- these are valid engineering concerns that deserve concrete answers. I'll go point by point.

Re: "clearly AI-generated" -- yes, Claude helped with the writeup and the simulator code. The routing concepts and the constraints analysis are mine though. Speaking of which:

Simulator fix

The simulator was broken due to orphaned code fragments from a bad file split (leftover lines from roundRect polyfill in sim-scenarios.js and RNG class methods in sim-network.js that caused TypeError/SyntaxError on load). Fixed now -- should work if you reload. Sorry about that.

1. Non-local information requirements

Fair point, but the critique assumes more global knowledge than the design requires.

The weight formula W(r) = a*Q + b*(1-Load) + g*Batt uses local data only: Q = link quality to next hop (measured via SNR/RSSI), Load = own queue depth, Batt = own battery. The only "remote" value is the next hop's battery level, which piggybacks on the OGM it already sends.
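
A minimal sketch of how that weight could drive proportional next-hop selection. The coefficient values and the `pick_next_hop` helper are assumptions for illustration; the thread does not state which coefficients the simulator uses.

```python
import random


def route_weight(link_quality, load, battery, alpha=0.5, beta=0.3, gamma=0.2):
    """W(r) = alpha*Q + beta*(1 - Load) + gamma*Batt, with all inputs in [0, 1]."""
    return alpha * link_quality + beta * (1.0 - load) + gamma * battery


def pick_next_hop(candidates, rng=random):
    """Pick a next hop with probability proportional to its weight.

    `candidates` maps next-hop id -> (link_quality, load, next_hop_battery).
    """
    hops = list(candidates)
    weights = [route_weight(*candidates[h]) for h in hops]
    # random.choices raises ValueError if every weight is zero.
    return rng.choices(hops, weights=weights, k=1)[0]


routes = {"relay-a": (0.9, 0.2, 0.8), "relay-b": (0.6, 0.2, 0.3)}
print(pick_next_hop(routes, random.Random(42)))
```

Drawing proportionally instead of always taking the best weight is what spreads traffic across relays. A deterministic "best route only" rule would funnel everything through one node until its weight drops.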

NHS is not a global aggregate -- it's a local average of what a node sees from its direct neighbors' OGMs. An 8-node cluster doesn't need extra polling; the OGMs that maintain neighbor tables already carry this data.

Where you're right: "path-wide" battery and load awareness (across multiple hops) is not feasible on LoRa. The implementation should evaluate only the next hop, not the full path. I'll clarify this in the proposal.

2. Memory constraints

The math is correct but the assumptions are worst-case:

5 routes x 70 destinations x 410 bytes -- in practice, a node in a clustered network knows its own cluster members (~20-30) plus border nodes to adjacent clusters (~4-8). That's ~35 destinations, not 70. And 2 routes (primary + backup) suffice, not 5.
410 bytes per route is too high. A route entry needs: dst_id (4B) + next_hop (4B) + quality (1B) + age (2B) + hop_count (1B) = 12 bytes. Even with a 4-node path hint: ~20 bytes.

Realistic calculation: 2 routes x 35 destinations x 20 bytes = 1.4 KB. Plus neighbor table (16 x 20B = 320B) and cluster metadata (~200B). Total: ~2 KB -- fits comfortably in nRF52 RAM.
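
The two budgets can be checked with a few lines. The packed layout below is an assumption that mirrors the field list given above; the 20-byte entry, neighbor table, and metadata sizes are the figures from this reply, and the worst case is the reviewer's 5 x 70 x 410.

```python
import struct

# Assumed packed layout: dst_id (u32), next_hop (u32), quality (u8), age (u16), hop_count (u8).
ROUTE_ENTRY = struct.Struct("<IIBHB")


def table_bytes(routes_per_dst, destinations, entry_bytes):
    """Size of a routing table in bytes."""
    return routes_per_dst * destinations * entry_bytes


entry = ROUTE_ENTRY.size                  # 12 bytes when packed without padding
routes = table_bytes(2, 35, 20)           # 20 B per entry including the path hint
neighbors = 16 * 20                       # neighbor table
cluster_meta = 200                        # cluster metadata
total = routes + neighbors + cluster_meta
worst_case = table_bytes(5, 70, 410)      # the reviewer's assumptions

print(entry, routes, total, worst_case)
```

One caveat for firmware: with this field order a C compiler would typically pad the struct to 16 bytes unless it is declared packed or the fields are reordered, so the 12-byte figure depends on the layout chosen.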

You're right that explicit memory budgets should be in the proposal. I'll add a table.

3. Compute overhead

BFS on a 30-node cluster with ~100 edges is ~130 operations -- microseconds on a 64 MHz Cortex-M4. Even 3x per destination for 35 destinations = ~14,000 operations, well under 1ms.

But more importantly: BFS doesn't need to run on the node at all. Routes are built incrementally via distance-vector updates (like RIP/AODV): when a neighbor's OGM says "I can reach node X in 3 hops with quality 0.8", the node updates one table entry. That's a single comparison + write, not a graph traversal.
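
A minimal sketch of that table update, assuming each OGM carries a map of destination to (hops, quality) and that path quality is the product of the advertised quality and the local link quality. Both the message shape and the multiplication rule are assumptions, and loop prevention (count-to-infinity) is not handled here.

```python
from dataclasses import dataclass


@dataclass
class Route:
    next_hop: int
    hops: int
    quality: float


def on_ogm(table, my_id, neighbor, link_quality, advertised):
    """Merge a neighbor's advertised routes into `table` (dst -> Route).

    Returns the list of destinations whose entry changed.
    """
    changed = []
    # The neighbor itself is reachable in one hop with the measured link quality.
    entries = {**advertised, neighbor: (0, 1.0)}
    for dst, (hops, quality) in entries.items():
        if dst == my_id:
            continue  # never install a route to ourselves
        candidate = Route(neighbor, hops + 1, quality * link_quality)
        current = table.get(dst)
        # Accept new destinations, refreshes from the current next hop, or better paths.
        if (current is None or current.next_hop == neighbor
                or candidate.quality > current.quality):
            table[dst] = candidate
            changed.append(dst)
    return changed
```

For example, `on_ogm(table, 1, 2, 0.9, {7: (3, 0.8)})` installs a 4-hop route to node 7 via neighbor 2 with quality of about 0.72, and a later OGM from another neighbor only replaces it if the resulting quality is higher.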

Route decay (quality *= 0.95 per entry every 30s) for 70 entries is trivial. Emergency reroutes only fire on link failure -- not a periodic cost.

I should describe the routing table mechanism as distance-vector rather than BFS in the proposal. The simulator uses BFS for clarity, but a real implementation wouldn't.

4. Radio / airtime -- this is the strongest objection

Your math is correct for the naive case: 100 nodes x 1 OGM/30s x 500ms-2s airtime = channel saturation. But OGMs don't flood globally in System 5. They stay cluster-local (1 hop only):

A cluster of 25 nodes: 25 OGMs/30s x ~500ms = 12.5s airtime -- feasible on a single channel
Border nodes exchange condensed cluster summaries: 1 message per border-pair per cycle
Total for a 4-cluster network: ~100 local OGMs + ~12 border summaries = manageable

Still, I acknowledge this needs more work:

OGM interval should be adaptive: few neighbors -> 30s, many neighbors -> 120s+
OGM payload can be reduced to ~8 bytes (node ID + battery + quality summary)
EU868 duty cycle (1% = 360ms per 36s) is the hard constraint -- need explicit airtime budgets
Retries per hop need quantification against duty cycle limits

The retry concern (3-5 per hop x 3-5 hops) is valid but less severe than it sounds: System 5 sends unicast (1 TX per hop), not broadcast. A 5-hop message with 2 retries per hop costs ~15 TX in total (5 hops x 3 attempts). Managed flooding for the same message costs hundreds of TX. The per-message efficiency is real even with retries.
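
For the explicit airtime budget, a sketch like the following could serve as a starting point. `time_on_air_s` implements the packet-duration formula from the Semtech SX127x datasheet; the default preamble length and the helper names are assumptions, and the radio parameters of any given Meshtastic preset have to be filled in by the reader.

```python
import math

def time_on_air_s(payload_bytes, sf, bw_hz, coding_rate=1, preamble_symbols=8,
                  explicit_header=True, crc=True, low_data_rate_opt=False):
    """LoRa packet duration in seconds.

    coding_rate is 1..4 for 4/5..4/8. payload_bytes is the full PHY payload,
    including any protocol header.
    """
    t_sym = (2 ** sf) / bw_hz
    ih = 0 if explicit_header else 1
    de = 1 if low_data_rate_opt else 0
    numerator = 8 * payload_bytes - 4 * sf + 28 + 16 * int(crc) - 20 * ih
    payload_symbols = 8 + max(
        math.ceil(numerator / (4 * (sf - 2 * de))) * (coding_rate + 4), 0)
    return (preamble_symbols + 4.25 + payload_symbols) * t_sym

def min_interval_s(airtime_s, duty_cycle=0.01):
    """Shortest repeat interval one node may use without exceeding the duty cycle."""
    return airtime_s / duty_cycle

def channel_occupancy(nodes, airtime_s, interval_s):
    """Fraction of channel time used if `nodes` all send once per interval."""
    return nodes * airtime_s / interval_s
```

With the ~500 ms OGM airtime assumed above, `min_interval_s(0.5)` gives 50 s per node under a 1% duty cycle, and `channel_occupancy(25, 0.5, 30)` comes to roughly 0.42 of the channel for a single 25-node cluster.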

5. Topology propagation

System 5 does not require multi-hop topology knowledge. Each node knows:

Its neighbors (1-hop, from OGMs)
Its cluster members (from local OGMs)
Border nodes (locally detectable -- has neighbors in other clusters)
Which adjacent clusters are reachable via which border nodes (from border-to-border OGM exchange)

The "16 extra entries" for routing across 3 clusters is correct and trivial: 16 x 20B = 320 bytes. Propagation cost: ~2-4 border summary messages per cluster-pair per 30s cycle.

What's missing from the proposal is an explicit diagram showing what data lives where and how it propagates. I'll add that.

Summary of what I'll improve based on your feedback:

Explicit memory budget table for nRF52
Clarify that routing uses distance-vector, not BFS-on-device
Add adaptive OGM intervals and airtime budget calculation for EU868
Add data flow diagram for topology propagation
Remove "path-wide" language -- next-hop metrics only

Good feedback overall. The airtime point is the one that needs the most engineering work before this could be real.

0 replies

ClemensSimon · 2026-03-23T21:03:43Z

ClemensSimon
Mar 23, 2026
Author

Hey @h3lix1, @shalberd,

Quick update -- based on @korbinianbauer's feedback (and your earlier points about broadcast traffic being 98% of the network), I've made significant revisions to the proposal and the documentation:

What changed

1. Distance-Vector instead of BFS
Routes are now built incrementally from OGM data (like RIP/B.A.T.M.A.N.), not by running graph algorithms on-device. Each route entry is 12 bytes (dst + next-hop + quality + age + hops). Total routing table: ~1.5 KB. This directly addresses the nRF52 memory concern -- 64 KB RAM is more than enough.

2. Next-hop metrics only
The weight formula W(r) = a*Q(r) + b*(1-Load) + g*Batt now explicitly uses next-hop node data only (from its last OGM). No more "path-wide battery" or "minimum battery along route" -- those require multi-hop state propagation that creates exactly the traffic overhead we're trying to avoid.
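
A minimal sketch of the formula with next-hop data only. The coefficient values (0.5 / 0.3 / 0.2) and the proportional random pick are assumptions for illustration; the proposal does not fix them here.

```python
import random

def next_hop_weight(quality, load, battery, alpha=0.5, beta=0.3, gamma=0.2):
    """W(r) = alpha*Q + beta*(1 - Load) + gamma*Batt, all inputs in 0..1."""
    return alpha * quality + beta * (1.0 - load) + gamma * battery

def pick_next_hop(candidates, rng=random):
    """Pick a next hop with probability proportional to its weight.

    candidates maps next_hop_id -> (quality, load, battery), taken from that
    neighbor's last OGM. Assumes at least one candidate has a positive weight.
    """
    hops = list(candidates)
    weights = [next_hop_weight(*candidates[h]) for h in hops]
    return rng.choices(hops, weights=weights, k=1)[0]
```

Picking proportionally rather than always taking the maximum spreads traffic across the cached routes, at the cost of sometimes using a slightly worse next hop.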

3. Adaptive OGM interval
Fixed 30s replaced with density-adaptive intervals: 30s (sparse, <8 neighbors), 60s (moderate), 120s (dense), 180s (very dense). Includes an explicit EU868 duty cycle airtime budget table in the docs.
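
As a sketch, the interval selection is a simple threshold lookup. Only the "<8 neighbors" boundary is stated above; the boundaries between moderate, dense and very dense (16 and 32 here) are placeholder assumptions.

```python
# (upper bound on neighbor count, OGM interval in seconds).
# Only the first threshold (8) comes from the text; 16 and 32 are assumed.
OGM_TIERS = [(8, 30), (16, 60), (32, 120)]
VERY_DENSE_INTERVAL_S = 180

def ogm_interval_s(neighbor_count):
    """Return the OGM interval for the current number of 1-hop neighbors."""
    for upper_bound, interval in OGM_TIERS:
        if neighbor_count < upper_bound:
            return interval
    return VERY_DENSE_INTERVAL_S
```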

4. Cluster-Distributor Broadcast (new -- directly from @h3lix1's point about 98% broadcast traffic)
This is the big one. Broadcasts no longer flood the entire network. Instead:

Each cluster elects a distributor (valley node with high local coverage, low cross-cluster leakage)
Broadcast source sends to its cluster distributor via unicast (1-3 TX)
Distributor does a mini-flood within the cluster only (~20-30 TX for a 25-node cluster)
Border nodes relay to the next cluster's distributor
Wave propagates cluster-by-cluster

Results: Bay Area (235 nodes): 4,301 TX with managed flooding vs 220 TX with cluster-distributor = 95% savings. Regional (500 nodes): 95,869 vs 517 TX = 99.5% savings.

5. Simulator fixed
@korbinianbauer noted it wasn't working -- turned out two JS files had orphaned code fragments from a bad file split. Fixed and pushed. Should work now: https://clemenssimon.github.io/MeshRoute/simulator.html

@h3lix1 -- re: your Bay Area concerns

The broadcast routing directly addresses your point about position/nodeinfo/telemetry dominating traffic. With cluster-distributors, a position beacon from one node costs ~30 TX to reach the whole 235-node Bay Area mesh, instead of ~4,000 TX with flooding. The half-duplex mountaintop blocking issue is also less severe because the distributor model generates far fewer simultaneous transmissions.

Out-of-order delivery (your A-B-C -> C-B-A concern) is handled by the 2-byte sequence counter in the packet header. Gap detection is cheap and doesn't add TX overhead.
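
Gap detection on a 2-byte counter has to handle wraparound at 65535. A minimal sketch using serial-number arithmetic (the function name and the interpretation of the result are illustrative assumptions):

```python
def seq_gap(last_seq, new_seq):
    """Signed distance from last_seq to new_seq on a 16-bit counter.

     1 -> next packet in order
    >1 -> (gap - 1) packets are missing in between
    <=0 -> duplicate or late (out-of-order) packet
    """
    diff = (new_seq - last_seq) & 0xFFFF
    return diff - 0x10000 if diff >= 0x8000 else diff
```

For example, `seq_gap(65535, 0)` is 1, so a wrap from 65535 to 0 is treated as in-order rather than as a huge jump backwards.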

@shalberd -- re: EU868 and GPS

The adaptive OGM interval now explicitly accounts for EU868's 1% duty cycle. At 60s intervals (moderate density), a node uses ~0.8% of its duty budget for maintenance traffic. The airtime budget table is in the updated How It Works page.

GPS remains a soft requirement -- nodes without GPS can use pre-set coordinates or inherit cluster assignment from a GPS-capable neighbor.

All changes are live on the site. The How It Works page has the full technical details including the new broadcast section.

0 replies

korbinianbauer · 2026-03-24T07:59:31Z

korbinianbauer
Mar 24, 2026

But OGMs don't flood globally in System 5. They stay cluster-local (1 hop only):

They may not flood beyond 1 hop, but that doesn't mean they just stop at the borders of your geo-cluster. Every node in range or even slightly beyond it will detect a busy channel and cannot use this airtime.

1 reply

shalberd Mar 24, 2026

They may not flood beyond 1 hop, but that doesn't mean they just stop at the borders of your geo-cluster.

correct, we get 60-80 km range for a hop in every direction on preset MEDIUM_FAST.

ClemensSimon · 2026-03-24T18:08:55Z

ClemensSimon
Mar 24, 2026
Author

The 60-80km Elephant: Why Geo-Clusters Can't Be Radio-Isolated

@korbinianbauer and @shalberd -- you're absolutely right, and this is the most important feedback so far. Let me address it head-on.

The core problem: At MEDIUM_FAST, a single OGM "meant" for a 5km cluster occupies the channel for every node within 60-80km. Geographic clustering provides logical isolation but zero radio isolation. The airtime cost is real regardless of the intended scope.

I've been thinking about this since your comments, and I see three viable paths forward:

1. Power-Controlled Routing Packets
OGMs could be sent at reduced TX power (e.g. -12dB from normal), shrinking their radio footprint to match the intended cluster radius. A 5km cluster doesn't need routing packets transmitted at 60km range. This is the most direct fix -- the OGM is physically inaudible beyond the cluster. Trade-off: requires per-packet power control support in firmware.
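
To get a feel for how much a power reduction shrinks the footprint, here is a back-of-the-envelope sketch using a log-distance path loss model. The model and the path loss exponent are assumptions (2.0 is free space; real terrain behaves differently), so treat the numbers as orders of magnitude only.

```python
import math

def range_scale(delta_db, path_loss_exponent=2.0):
    """Factor by which range changes when TX power changes by delta_db."""
    return 10 ** (delta_db / (10 * path_loss_exponent))

def required_reduction_db(current_km, target_km, path_loss_exponent=2.0):
    """Power reduction (dB) needed to shrink range from current_km to target_km."""
    return 10 * path_loss_exponent * math.log10(current_km / target_km)
```

Under the free-space assumption, `range_scale(-12)` is about 0.25, so -12 dB would cut a 60 km footprint to roughly 15 km, and `required_reduction_db(60, 5)` suggests closer to 22 dB to get down to a 5 km cluster radius.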

2. Connectivity-Based Clustering Instead of Geo-Clustering
Rather than clustering by GPS coordinates, cluster by who can actually hear whom (the connectivity graph). Nodes that share strong bidirectional links form a cluster organically. This sidesteps the "overlapping radio range" problem entirely -- the cluster IS the radio neighborhood. No OGMs needed for cluster formation; the neighbor table (which Meshtastic already builds from received packets) defines the cluster implicitly.

3. Piggyback Routing on Existing Traffic
Instead of dedicated OGMs, embed routing metadata (next-hop, link quality, cluster-ID) into packets that are already being sent -- position broadcasts, telemetry, nodeinfo. Since these packets are transmitted anyway (and consume the same airtime), the routing overhead becomes effectively zero additional airtime. The trade-off is slower convergence (routing updates only happen when regular traffic flows), but for a mesh that already sends position every 15 minutes, this may be sufficient.

My honest assessment: Option 3 (piggybacking) combined with Option 2 (connectivity-based clusters) is probably the most realistic path. It adds zero airtime overhead, works within existing packet structures, and doesn't require hardware-level changes. The 60-80km range actually helps here -- it means a node's natural radio neighborhood IS a meaningful routing cluster.

I'll update the simulator to model connectivity-based clustering with piggybacked routing metadata and post results. The key metric will be: how much routing convergence time do we sacrifice vs. dedicated OGMs, and is the delivery rate still acceptable?

Updated airtime budget analysis for EU868 coming as well -- with explicit accounting for the shared channel problem you've identified.

0 replies

ClemensSimon · 2026-03-25T05:22:48Z

ClemensSimon
Mar 25, 2026
Author

@h3lix1 -- What is your opinion?
--Clemens

0 replies

h3lix1 · 2026-03-25T06:31:57Z

h3lix1
Mar 25, 2026

It feels like I'm talking a lot with Claude :)

The details of implementation will be the difficult part. The bay mesh has about 1200 nodes online today, so the scope is a little greater than first anticipated. For most routing protocols, there are challenges around node movement and making sure that delivery isn't impacted if a node moves between 'clusters' set up by System 5.

OGM beacons every 30 seconds (or 120 seconds or more), for example, are a lot of excess airtime, which is very much at a premium today, but I think that was already discussed above. One aspect I don't think was discussed is out-of-order message delivery when using adaptive load balancing.

I guess I'm still confused about what we're trying to accomplish with this though. The simulations all assume the nodes will repeat the traffic, whereas we're already suppressing nodes repeating traffic by trying to make sure each node hears two routers. 1200 nodes, 5% are routers. The current setup already seems to meet what System 5 is proposing. It doesn't seem to change that channel utilization is still 40%.

The things that worry me most about this proposal are:

It's pretty complex and will make troubleshooting quite difficult. There is a UX part of this solution that still needs to be addressed, and I can't see how clusters will make explaining this to the typical meshtastic user any easier.
I'm not convinced that simpler solutions wouldn't work just as well. For example, bloom filters seem like they would define a relative path without OGM packet overhead. I personally would like to know how this is a considerable improvement over much simpler mechanisms.
Of all people, I'm one of the last to be critical about this, but I'd be highly concerned this will become a vibe-coded mess without a much more defined scope of the problem it is solving, and that manageability of the code becomes an issue. There is a certain elegance and difficulty in simplicity.

I'm a big believer in Occam’s razor. I like novel ideas sometimes, but there are many different mechanisms out there, for example:

MPR / selective flooding (OLSR-style) — only chosen relay nodes forward, so broadcast coverage with much less airtime.
AODV — discovers routes only when needed; less steady overhead, good for changing meshes.
DSR — on-demand like AODV, but puts the full route in the packet header; strong route caching, more header overhead.
OLSR — proactive shortest-path routing; routes are already known before sending, faster forwarding but more control chatter.
RPL — builds a tree/DAG, best for low-power networks sending toward a gateway/root.
Opportunistic routing (ExOR-like) — multiple possible forwarders compete/cooperate; good on lossy links, more complex coordination.
Geographic routing — forwards toward the destination’s location; low state, but needs location knowledge.
Source routing — sender picks the whole path; simple forwarding in the middle, but packet headers grow with hop count.

I guess what I would like to see is an RFC-like document describing what sets System 5 apart from everything else out there for ad-hoc mesh networks.

Personally, I see a move to caching state/messages for pull-based requests instead of everything push-based today.

Also, a way to establish trust between a router node and end nodes to provide authoritative data, for example for store-and-forward type packets or cached entries. Right now everything operates in a very high-trust environment, which introduces problems.

Before moving to something like this, personally I'd like to see Meshtastic move some traffic to pull-based methods, or to a pub/sub model where nodes can subscribe to a publisher to get updates directly (without needing to broadcast everything everywhere). It's a more in-depth breaking change, but one that I think is ultimately necessary and worth more effort than fixing the current flood routing mechanisms. If we can reduce the amount of flood routing required, it will accomplish much the same goals.

I do appreciate the effort on trying to find the most optimal routing mechanism, but I don't think it's the biggest problem the mesh is facing today.

0 replies

ClemensSimon · 2026-03-26T17:16:15Z

ClemensSimon
Mar 26, 2026
Author

Hey @h3lix1,

First — I owe you an apology. Yes, you've been talking mostly to Claude, and that's on me. My English isn't strong enough for technical discussions at this level, so I use it as a translation tool. The ideas and direction are mine. I understand if that's frustrating, and I'll be upfront about it going forward.

Second — thank you. Your critique killed System 5, and what emerged from the ashes is dramatically better. I mean it.

System 5 Is Dead. Meet WalkFlood.

You said three things that stuck:

"It's pretty complex and will make troubleshooting quite difficult"
"I'm not convinced that simpler solutions wouldn't work just as well"
"There is a certain elegance and difficulty in simplicity"

So I threw away the geo-clustering, the OGM beacons, the multi-path weighted selection, the QoS gating, the proactive probes — all of it. I researched every protocol you mentioned (AODV, DSR, OLSR, RPL, bloom filters), plus BATMAN, CTP, goTenna's ECHO/VINE, MeshCore, Reticulum, fountain codes, ant colony optimization, and real deployment reports from BayMesh, Wellington NZ, and Austin TX. I also dug into the LoRa radio physics (half-duplex cascade, SF orthogonality, time-on-air math) and — most importantly — I read the actual Meshtastic firmware routing code (Router -> FloodingRouter -> NextHopRouter -> ReliableRouter).

The result is WalkFlood — four phases, each a fallback for the previous:

1. LEARN:      Listen to all traffic. Learn routes passively. Zero overhead.
2. DIRECT:     Route known? -> Forward hop-by-hop, 1 TX per hop.
3. WALK:       Stuck? -> Step toward the neighbor most likely to know a route.
4. MINI-FLOOD: Still stuck? -> Tiny selective flood from current position
               (2 best neighbors per node, max 4 hops deep, ~30 TX).

No GPS. No beacons. No control packets. No geo-clustering. 12 bytes per route entry. ~3KB RAM for 235 nodes.
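
The four phases can be sketched as a per-node decision. This is an illustration of the fallback order only, not the code in the repository: the `Node` class, the `neighbor_score` field (a stand-in for "most likely to know a route") and the walk budget `MAX_WALK_STEPS` are assumptions.

```python
from dataclasses import dataclass, field

MAX_WALK_STEPS = 3   # assumption: the post does not give a walk budget
FLOOD_FANOUT = 2     # "2 best neighbors per node"
FLOOD_MAX_HOPS = 4   # "max 4 hops deep"; enforced by the forwarder, not shown here

@dataclass
class Node:
    routes: dict = field(default_factory=dict)          # dst -> (next_hop, hops)
    neighbor_score: dict = field(default_factory=dict)  # neighbor -> score (hypothetical)

    def learn(self, origin, relay, hops_so_far):
        """LEARN: a packet from `origin`, overheard via neighbor `relay`,
        reveals a route back to `origin`. Keep the shortest one seen."""
        known = self.routes.get(origin)
        if known is None or hops_so_far + 1 < known[1]:
            self.routes[origin] = (relay, hops_so_far + 1)

    def forward(self, dst, walk_steps_used=0):
        """Return (phase, list of neighbors to transmit to)."""
        if dst in self.routes:
            return ("DIRECT", [self.routes[dst][0]])
        ranked = sorted(self.neighbor_score, key=self.neighbor_score.get, reverse=True)
        if ranked and walk_steps_used < MAX_WALK_STEPS:
            return ("WALK", ranked[:1])
        return ("MINI_FLOOD", ranked[:FLOOD_FANOUT])
```

A node with no route walks to its best-scoring neighbor; once the walk budget is spent it falls back to the mini-flood, and as soon as `learn` has recorded a route the same call returns `DIRECT`.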

Gradual Migration: No Flag Day Required

The critical design decision: WalkFlood doesn't replace Meshtastic — it grows inside it. A WalkFlood node joining an existing mesh behaves identically to a normal Meshtastic client at first:

Phase 1 — "Listener" (Day 1): WalkFlood node uses managed flooding like everyone else. But it listens to ALL traffic and builds its routing table passively. From the outside, it's indistinguishable from a regular node.

Phase 2 — "Hybrid" (after hours/days): The node has learned enough routes. When it knows a directed path, it uses it (1 TX). When it doesn't, it floods normally. Legacy nodes notice nothing — they just see slightly less traffic on the channel.

Phase 3 — "Sweep" (enough WalkFlood nodes): Once ~30% of nodes run WalkFlood, the network tips. Directed routing becomes dominant, flooding drops dramatically, and airtime opens up for more useful traffic.

Here's what this looks like in the simulator — Bay Area + Stress (235 nodes, 15% failure):

Left: Managed Flooding — 480 TX, yellow chaos everywhere. Right: WalkFlood — 9 TX, clean purple directed paths. Same network, same messages. WalkFlood saved 98% of transmissions by learning to route directly. Purple rings show nodes that have switched to directed routing.

You can watch this migration live: Click "Demo" in the simulator — it auto-plays 20 messages showing both panels starting identical (both flood), then WalkFlood gradually switches to purple directed paths as it learns.

Results: 1200-Node Bay Area (Your Scale)

You mentioned the Bay Mesh has about 1200 nodes online. So I tested at that scale:

Router	Delivery	TX Cost
Managed Flooding (hop=7)	4%	38,644
WalkFlood	88%	5,909

At smaller scales (20 to 235 nodes):

Scenario	Managed Flood	WalkFlood
Small (20 nodes)	87% / 1,640 TX	100% / 159 TX
Medium (100 nodes)	27% / 3,380 TX	100% / 368 TX
Node Kill 20%	16% / 2,696 TX	100% / 383 TX
Stress 30% degraded	19% / 3,380 TX	99% / 492 TX
Bay Area (235 nodes)	6% / 6,752 TX	84% / 8,894 TX

Broadcast: The 98% Problem

You're right that unicast routing alone doesn't solve the mesh — 98% of Meshtastic traffic is broadcast (telemetry, position, nodeinfo). WalkFlood addresses this with a 3-tier approach:

Traffic Type	Current (Flood)	WalkFlood Approach	Savings
Position/Telemetry	118 TX/event	Pull-based (request via unicast, like MeshCore)	85%
Group/Channel messages	118 TX/event	Scoped flood (hop-limited, 3-hop radius)	78%
NodeInfo / SOS	118 TX/event	MPR relay (only selected relays rebroadcast)	30%
Weighted average	118 TX		~70%

Pull-based telemetry is exactly what you asked for. Scoped flooding is trivially implemented (just a hop counter). MPR selection reuses WalkFlood's existing neighbor knowledge.
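
For the MPR tier, the classic greedy selection from OLSR is one way to pick relays from existing neighbor knowledge. The sketch below assumes each node knows which 2-hop nodes each neighbor can reach; how WalkFlood would obtain that set passively is not specified in the post, and `select_mprs` is a hypothetical name.

```python
def select_mprs(two_hop_via):
    """Greedy multipoint-relay selection.

    two_hop_via maps neighbor_id -> set of 2-hop node ids reachable through it.
    Returns a list of neighbors that together cover every 2-hop node.
    """
    uncovered = set().union(*two_hop_via.values())
    mprs = []
    while uncovered:
        # Pick the neighbor covering the most still-uncovered nodes;
        # sorted() makes ties resolve to the lowest id.
        best = max(sorted(two_hop_via), key=lambda n: len(two_hop_via[n] & uncovered))
        mprs.append(best)
        uncovered -= two_hop_via[best]
    return mprs
```

Only the returned neighbors rebroadcast, so a node with many neighbors that all reach the same 2-hop set ends up with a single relay instead of one rebroadcast per neighbor.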

The Key Insight: Flooding Is the Poison

The breakthrough came from analyzing WHY managed flooding gets only 4-6% on Bay Area:

When a mountain node (234 neighbors) broadcasts, ALL 234 neighbors are half-duplex blocked for ~2.3 seconds (SF12). The flood dies in one hop.

The math confirms: on a 5% quality link (typical valley->mountain), retrying is 9.7x more expensive than routing around it via hills. A 5-hop hill path [0.7, 0.6, 0.5, 0.6, 0.7] has 99.5% delivery at 8.2 TX. A 2-hop mountain path [0.05, 0.05] has 11.3% delivery at 79.4 TX. WalkFlood's Dijkstra bootstrap (weight = -log(quality)) finds the hill path automatically.
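
The post does not state the model behind these numbers. One model that reproduces them, assuming up to 8 transmission attempts per hop and counting expected TX per successfully delivered message, is sketched below; the 8-attempt cap is inferred from the figures, not taken from the source.

```python
import math

def path_stats(qualities, max_attempts=8):
    """Return (delivery_probability, expected_tx_per_delivered_message).

    qualities: per-hop success probability of a single transmission.
    """
    reach = 1.0        # probability the packet gets as far as the current hop
    expected_tx = 0.0
    for q in qualities:
        hop_success = 1.0 - (1.0 - q) ** max_attempts
        # Mean number of attempts on this hop, capped at max_attempts,
        # weighted by the chance the packet arrived here at all.
        expected_tx += reach * hop_success / q
        reach *= hop_success
    return reach, expected_tx / reach

def dijkstra_weight(qualities):
    """Path cost when each link is weighted -log(quality)."""
    return sum(-math.log(q) for q in qualities)

hill = [0.7, 0.6, 0.5, 0.6, 0.7]
mountain = [0.05, 0.05]
for name, path in (("hill", hill), ("mountain", mountain)):
    delivery, tx = path_stats(path)
    print(f"{name}: delivery={delivery:.1%} tx={tx:.1f} weight={dijkstra_weight(path):.2f}")
```

Because -log turns the product of link qualities into a sum, the hill path also gets the lower Dijkstra cost despite having more hops.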

Questions for You

1. Is Managed Flooding the right baseline?
I read the firmware — Meshtastic 2.6+ uses NextHopRouter for DMs, which learns relay nodes from ACKs. My simulation models the flooding part accurately but NOT the next-hop learning. Is the real-world Bay Mesh primarily using managed flooding for most traffic (since 98% is broadcast)?

2. Could WalkFlood be tested as a Meshtastic module?
WalkFlood is ~200 lines of C. It doesn't change the packet format. It's backward-compatible: WalkFlood nodes coexist with flood-only nodes. Would you be open to a firmware prototype as an optional routing mode alongside ReliableRouter?

3. Validation plan
I found the Meshtasticator (runs real firmware) and BayMesh MQTT broker (mqtt.bayme.sh). Plan: collect real traffic data, run same topology in Meshtasticator, compare against WalkFlood simulator. Does this make sense?

Try It

Live simulator with Demo mode: clemenssimon.github.io/MeshRoute/simulator.html
RFC: docs/rfc-walkflood.md
Source: github.com/ClemensSimon/MeshRoute

Thank you again for the sharp feedback. You were right about Occam's Razor.

"Listen to traffic. Remember what you hear. Walk toward the destination. If lost, ask the neighbors."

--Clemens

0 replies

ClemensSimon · 2026-03-26T17:23:42Z

ClemensSimon
Mar 26, 2026
Author

Greetings from Bavaria! -- Clemens

0 replies

ClemensSimon · 2026-03-28T07:04:54Z

ClemensSimon
Mar 28, 2026
Author

Hey @h3lix1,

Wanted to follow up on a few things from your last message that I think deserve a direct answer — not another wall of simulation results.

On "talking to Claude" — fair point, and I apologize for that. I use it as a translation crutch (my English isn't great), but the effect was that you spent time giving thoughtful feedback and got back what felt like auto-generated responses. That wasn't respectful of your time. Going forward: shorter, more honest, less polished.

On the actual problem — I think you nailed it when you said "I don't think routing is the biggest problem the mesh is facing today." After reading through BayMesh's real numbers (1200 nodes, 40% chutil, 5% actual utilization), I'm starting to agree. The collision cascade from half-duplex at mountaintop nodes is a radio physics problem, not a routing problem. Better routing helps at the margins, but it doesn't fix 234 nodes being blocked for 2.3 seconds every time a mountain router transmits.

On pull-based / pub-sub — this is the idea I keep coming back to from your feedback. If nodes only request position/telemetry when they actually need it (instead of every node broadcasting every 15 minutes), the airtime savings dwarf anything routing can achieve. Have you seen any concrete proposals for this in the Meshtastic ecosystem? I'd rather contribute to an existing effort than start another parallel thing.

On the RFC — you asked for an RFC-like document explaining why WalkFlood vs. existing protocols. That's written: docs/rfc-walkflood.md. The honest answer is: WalkFlood isn't revolutionary — it's basically passive AODV with a walking fallback. The only thing that might set it apart is zero-overhead route learning and backward compatibility with existing Meshtastic nodes. Whether that's enough to justify the complexity vs. simpler bloom filter deduplication — I'm genuinely not sure.

Would be curious to hear your take on priorities: if you had one firmware change to reduce BayMesh's chutil from 40% to 20%, what would it be?

-- Clemens

0 replies

NomDeTom · 2026-05-05T01:11:14Z

NomDeTom
May 5, 2026

@ClemensSimon It's 2am here, and I've only just been pointed to this thread. I've read all the discussion (well, 50-70% of it, but the important bits).

Having looked at the rapid evolution of the routing strategy, it now looks like you've approached something like the next-hop system (but for broadcast messages) from the opposite direction. I'd be very much interested to see a Meshtasticator simulation of the strategy evolving from a base start, and see how the mesh self-organises and determines how to direct the packets.

I've had a conversation with @h3lix1 about how the BayMesh is set up, and all I can say is that I'm in awe of their setup in terms of coverage and reach - I'm unsure that the WalkFlood routing approach will solve their issues, but I can't see easy ways to fix it myself either.

FWIW, I appreciate people at least trying to improve the routing and the overall system generally. If we don't think of these things, then nothing gets better. You're more than welcome to join the main Discord - there are other languages on there, but you're more than welcome to interact directly. It's not really synchronous comms, so don't be put off thinking you need to translate everything.

0 replies


Replies: 20 comments · 6 replies
