|
| 1 | ++++ |
| 2 | +title = "AUSNOG 2025 Notes" |
| 3 | +date = "2025-05-17" |
| 4 | +description = "Just some general notes from attending AUSNOG 2025" |
| 5 | +tags = [ |
| 6 | + "ausnog", |
| 7 | + "network automation", |
| 8 | + "python", |
| 9 | +] |
| 10 | +showComments = "true" |
| 11 | +robots = "all" |
| 12 | +draft = "true" |
| 13 | ++++ |
| 14 | + |
| 15 | +## Nexthop |
| 16 | + |
| 17 | +- Strategies to ping everything in/out of our network from generic monitoring and routers themselves |
| 18 | +- Arista allows you to run containers within their routers/switches as it's just an AlmaLinux base
| 19 | +- Can also just run scripts on the switch itself
| 20 | +- Running a Python daemon that connects to Kafka
| 21 | +- Arista comes with ProcMgr inbuilt to monitor your custom script/process |
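A minimal sketch of what such an on-box probe daemon might do with each ping result before shipping it to Kafka. This is not the presenter's code: the function names are hypothetical, the summary line format assumed is the one printed by Linux iputils `ping`, and `send` stands in for whatever Kafka producer call is actually used.

```python
import json
import re

# Matches the summary line printed by Linux iputils ping, e.g.
# "5 packets transmitted, 4 received, 20% packet loss, time 4005ms"
_SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received.*?([\d.]+)% packet loss"
)


def parse_ping_summary(target, output):
    """Turn raw ping output into a dict suitable for publishing."""
    match = _SUMMARY.search(output)
    if match is None:
        # No summary line found: treat the target as unreachable.
        return {"target": target, "reachable": False, "loss_pct": 100.0}
    sent, received, loss = match.groups()
    return {
        "target": target,
        "sent": int(sent),
        "received": int(received),
        "loss_pct": float(loss),
        "reachable": int(received) > 0,
    }


def publish(result, send):
    """Serialise a result and hand it to `send` (hypothetical; e.g. a Kafka producer wrapper)."""
    send(json.dumps(result, sort_keys=True).encode("utf-8"))
```

Keeping the transport behind a plain callable means the parsing can be tested off-box without a broker.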
| 22 | + |
| 23 | +## Reannz |
| 24 | + |
| 25 | +Network State Checking |
| 26 | + |
| 27 | +- perfSONAR |
| 28 | +- config creation, then validation during migration using an inbuilt tool |
| 29 | + |
| 30 | +## AWS
| 31 | + |
| 32 | +No Packet Left Behind |
| 33 | + |
| 34 | +- Network owned end to end |
| 35 | +- 96% automation due to scale (1 million network devices) |
| 36 | +- Single chip routers, fixed ports |
| 37 | +- No fabric, cellification, dual REs, state sync, etc. |
| 38 | +- 32 x 400G 12.8 Tbps switch (can breakout to 100G)
| 39 | +- Deployed 1 rack at a time in a clos fabric, pre-cabled |
| 40 | +- Auto remediation stages |
| 41 | + - Detect/Isolate |
| 42 | + - Identify Root Cause
| 43 | + - Mitigate impact |
| 44 | + - Remediate underlying problem |
| 45 | + - Return to Service |
| 46 | +- Discard counters vary per vendor, and even per platform within the same vendor
| 47 | + - No standards |
| 48 | + - draft-evans-discardclass |
| 49 | +- Shift / Drain / etc. |
| 50 | + - Take device out of service (AS path, IS-IS pref, etc.)
| 51 | + - ECMP allows taking a single link out of service
| 52 | + - Rollback no matter the change, can always roll forward again |
| 53 | +- Deployment |
| 54 | + - Change -> Validate -> Intended State -> Deployment System -> Applied state -> Observed State |
| 55 | + - Batfish for Correctness of Intent |
| 56 | + - Containerlab / NetLab / etc. for actual implementation testing |
| 57 | +- At scale |
| 58 | + - 2500 lines of instructions to an LLM agent |
| 59 | + - This is just tribal knowledge |
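The intended vs. observed state comparison at the end of the deployment pipeline can be sketched as a simple diff. This is only an illustration, assuming both states are flattened to key/value dicts; the function name and the key naming are hypothetical and not from the talk.

```python
def state_drift(intended, observed):
    """Return {key: (intended_value, observed_value)} for every mismatch.

    Keys missing on one side show up with None for that side.
    """
    drift = {}
    for key in sorted(set(intended) | set(observed)):
        want = intended.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift
```

An empty result means the observed state has converged on intent; anything else is a candidate for rollback or remediation.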
| 60 | + |
| 61 | +## Nokia |
| 62 | + |
| 63 | +- 8 x GPU collective operations (RCCL, NCCL, etc.) |
| 64 | +- Dataset with Weights / Gradients |
| 65 | +- All-reduce, all-gather for distributed parallel flows |
| 66 | +- All of these have a large impact on the network |
| 67 | + - 3 Phases (Sync, Compute, Communicate) |
| 68 | + - Periodic bursts of large traffic (elephant flow problem) |
| 69 | +- Rail network is used to build a GPU -> GPU topology (ultra ethernet) |
| 70 | + - Consider Intra and Inter GPU server comms |
| 71 | + - Inter GPU uses RoCEv2 |
| 72 | +- Physical (Modular AI clusters) |
| 73 | + - SU (Scalable unit) aka Stripe |
| 74 | + - Line rate across the whole cluster |
| 75 | + - 32 x GPU Servers (other configs available) |
| 76 | + - 3 x Mgmt TOR |
| 77 | + - 2 spine 3 leaf |
| 78 | +- Logical Networks |
| 79 | + - Interconnect |
| 80 | + - In-band |
| 81 | + - Out-of-band |
| 82 | + - Backend GPU |
| 83 | + - Storage |
| 84 | +- Considerations |
| 85 | + - RDMA primary traffic |
| 86 | + - Lossless RoCEv2 -> UEC |
| 87 | +- Getting to lossless ethernet |
| 88 | + - ECN + ECN Bits |
| 89 | + - PFC (microbursts) 802.1Qbb |
| 90 | + - Pause frames |
| 91 | +- Storage |
| 92 | + - Checkpointing to compensate for GPU failures |
| 93 | + - Each SU has a dedicated storage fabric (clos) |
| 94 | + - NVMe-oF |
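To get a feel for why all-reduce produces large periodic bursts, here is a rough sketch of per-GPU traffic. It assumes a ring all-reduce (reduce-scatter followed by all-gather), which the talk did not specify; the function names are hypothetical and protocol overhead is ignored.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes, gpus):
    """Bytes each GPU sends for one ring all-reduce of `payload_bytes`.

    Reduce-scatter and all-gather each take (gpus - 1) steps of
    payload / gpus bytes, giving 2 * (gpus - 1) / gpus * payload in total.
    """
    if gpus < 2:
        return 0.0
    return 2 * (gpus - 1) / gpus * payload_bytes


def transfer_seconds(num_bytes, link_bps):
    """Serialisation time for `num_bytes` on a link of `link_bps` bits per second."""
    return num_bytes * 8 / link_bps
```

For example, 8 GPUs reducing 8 GB of gradients each send about 14 GB, roughly 0.28 s of line-rate traffic on a 400G port.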
| 95 | + |
| 96 | +## Nodal/ys |
| 97 | + |
| 98 | +How geopolitics is rerouting submarine cables |
| 99 | + |
| 100 | +- Hyperscalers beginning to own cables outright
| 101 | +- US/China |
| 102 | + - PLCN, BtoBE, SEA-ME-WE 6, SJC2
| 103 | + |
| 104 | +## Cloudflare
| 105 | + |
| 106 | +Beyond the Firewall |
| 107 | + |
| 108 | +- Sarah Armstrong-Smith books
| 109 | + - Human hacking |
| 110 | + - Social Engineering |
| 111 | +- https://radar.cloudflare.com |
| 112 | +- No blame culture |
| 113 | +- Move beyond generic annual training, focus on the why and make it relevant |
| 114 | +- Know your business, communicate with employees |
| 115 | + - What are they doing and need to do (no workarounds/shadow IT) |
| 116 | +- Project Zero Trust - George Finney |
| 117 | +- Flipper |
| 118 | + |
| 119 | +## Kentik |
| 120 | + |
| 121 | +The Scourge of Excessive AS-Sets |
| 122 | + |
| 123 | +- AS-SET in IRR |
| 124 | +- bgpq4 to build BGP filter lists on AS-SETs
| 125 | +- Check the authoritative source (APNIC/RIPE/etc.) of the AS-SET in PeeringDB |
| 126 | +- When making your AS-SET |
| 127 | + - Remove recursive AS-SETs where possible |
| 128 | + - Otherwise keep to minimum |
| 129 | +- Long term solution |
| 130 | + - In-band BGP signaling (RFC 9234) |
| 131 | + - RPKI-based signaling using ASPA verification |
| 132 | + - Future RPKI extensions |
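A small sketch of why recursive AS-SETs blow up filter lists: every nested set gets flattened into the final ASN list. This is a toy model over an in-memory dict, not how bgpq4 or any IRR client is implemented; `expand_as_set` and the data layout are hypothetical.

```python
import re

_ASN = re.compile(r"AS\d+")


def expand_as_set(name, as_sets, _seen=None):
    """Recursively flatten an AS-SET into the set of ASNs it contains.

    `as_sets` maps an AS-SET name to its list of members, each either a
    plain ASN ("AS64500") or another AS-SET name.
    """
    seen = set() if _seen is None else _seen
    if name in seen:
        # Already visited: stops loops between sets that reference each other.
        return set()
    seen.add(name)
    asns = set()
    for member in as_sets.get(name, ()):
        if _ASN.fullmatch(member):
            asns.add(member)
        else:
            asns |= expand_as_set(member, as_sets, seen)
    return asns
```

Each extra level of nesting is another set someone else controls, which is the argument for keeping recursion to a minimum.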
| 133 | + |
| 134 | +## BGP Tools |
| 135 | + |
| 136 | +How far can you go with IX route servers only? |
| 137 | + |
| 138 | +- IX Route servers |
| 139 | + - Solves the problem of reaching out to individuals for peering at IXPs |
| 140 | + - Generally far safer than bi-lat peering |
| 141 | +- Cumulative IX peers diminishing returns |
| 142 | +- map.bgp.tools |
| 143 | +- CGNAT /24s generally don't respond to pings but account for a lot of traffic
| 144 | + |
| 145 | +## 5G Networks |
| 146 | + |
| 147 | +BFD going down, from BGP timers expiring? |
| 148 | + |
| 149 | +- JTAC Bug and process of case |
| 150 | + |
| 151 | +## Telair |
| 152 | + |
| 153 | +A smaller NBN rollout |
| 154 | + |
| 155 | +- NBN NNI attached to EVPN PWE |
| 156 | +- Terminated on Juniper BNG |
| 157 | + |
| 158 | + |
| 159 | +# Day 2 |
| 160 | + |
| 161 | +## Nokia |
| 162 | + |
| 163 | +Quantum Technology |
| 164 | + |
| 165 | +- Qubits / Superposition / Entanglement |
| 166 | +- A 4-qubit word spans 2^4 = 16 states (all possible permutations)
| 167 | +- HNDL Attacks post Q-Day (RSA-2048 cracked in 24h) |
| 168 | + - Harvest data now, decrypt later |
| 169 | +- Symmetric Crypto in use today, either PSK or asymmetric Key Sharing |
| 170 | + - Move to AES256 & SHA512 for safety |
| 171 | +- Public Key Cryptography |
| 172 | + - Larger keys, one way Algos (RSA, DH, etc.) |
| 173 | + - Algos rely on one-way maths over large primes (e.g. factoring); reversing this is compute heavy
| 174 | + - Shor's Algorithm |
| 175 | + - QKD may be used in future to share these keys |
| 176 | + - Otherwise PQC is coming up with new algorithms for key exchange
| 177 | + - Mosca's theorem |
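Mosca's theorem is usually stated as: if X + Y > Z you already have a problem, where X is how long the data must stay secret, Y is how long migration to quantum-safe crypto takes, and Z is the time until a cryptographically relevant quantum computer exists. A minimal sketch, with hypothetical parameter names and made-up example numbers:

```python
def mosca_at_risk(shelf_life_years, migration_years, years_to_q_day):
    """Return True when X + Y > Z, i.e. harvested data outlives current crypto."""
    return shelf_life_years + migration_years > years_to_q_day
```

For instance, data that must stay secret for 10 years with a 5 year migration is exposed if Q-Day is 12 years out, which is why HNDL matters today.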
| 178 | + |
| 179 | +## Leaptel |
| 180 | + |
| 181 | +Mikrotiks doing CGNAT |
| 182 | + |
| 183 | +- ~40Gb |
| 184 | + |
| 185 | +## Juniper |
| 186 | + |
| 187 | +Evolving Broadband Design in Australia |
| 188 | + |
| 189 | +- Cloudified BGP network (Spine/Leaf for access + BNG for scaleout) |
| 190 | + - ISIS SRv4 |
| 191 | + - EVPN-VPWS + ESI |
| 192 | +- BNG CUPS |
| 193 | + - DBNG-MP (Mgmt) |
| 194 | + - DBNG-CP (Ctrl) |
| 195 | + - DBNG-UP (User) |
| 196 | + - Allows for Local resiliency and Geo Redundancy designs |
| 197 | +- IPv6, QoS, Merchant, Automation |
| 198 | + |
| 199 | +## Cisco |
| 200 | + |
| 201 | +Transport Protocols Evolution |
| 202 | + |
| 203 | +- MPLS TE Challenges
| 204 | + - RSVP state is hard to scale (large at headend / midpoint)
| 205 | + - Core device state k*n^2
| 206 | +- SR/SRv6 |
| 207 | + - State is in packet |
| 208 | + - Eliminates LDP/RSVP |
| 209 | + - No Tunnel interfaces |
| 210 | + - Multi-domain with PCE/BSID |
| 211 | + - SRv6 goes back to OSI model due to IPv6 headers |
| 212 | +- 80-90% of engineers used RSVP-TE for FRR
| 213 | +- Achieved in SR with TI-LFA
| 214 | +- Flex Algo (native steering of traffic onto FA path)
| 215 | + - Can use Metric / Constraints |
| 216 | +- On Demand Nexthop |
| 217 | + - HE automatically creates SR policy to BGP NH |
| 218 | + - Uses SR-PCE (when different HE and TE) |
| 219 | + - Can do per-flow |
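A quick way to see the k*n^2 RSVP-TE state problem is to count LSPs in a full mesh. This is a back-of-the-envelope sketch, assuming n edge routers with k unidirectional LSPs per ordered pair; the function name is hypothetical.

```python
def full_mesh_lsps(routers, lsps_per_pair=1):
    """Number of unidirectional LSPs in a full mesh: k * n * (n - 1)."""
    return lsps_per_pair * routers * (routers - 1)
```

With 100 routers and 2 LSPs per pair that is 19,800 LSPs, much of which transits the core as midpoint state; with SR that per-LSP state moves into the packet.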
| 220 | + |
| 221 | +## APNIC Labs |
| 222 | + |
| 223 | +Evolution of TCP Transport Protocols |
| 224 | + |
| 225 | +- Speed evolving, plateaus recently |
| 226 | +- Optical Transmission outpacing TCP speed |
| 227 | +- Reno increases conservatively, so it is not well suited to Gbit networks these days
| 228 | +- Ramp up based on MSS / RTT, drop 50% on loss |
| 229 | + - Would take 3 years on a Tbit link |
| 230 | +- Cubic used on modern Linux |
| 231 | + - Non-linear algorithm
| 232 | + - Reacts quickly to capacity in network |
| 233 | +- Buffer bloat causes delay loops (queue never drains) |
| 234 | +- Small buffers also bad (can't utilize link) |
| 235 | +- Buffer sizing: BW * RTT / sqrt(N), where N is the number of flows
| 236 | +- Turn on Pacing on servers |
| 237 | +- ECN (network should hit the point of just beginning to buffer)
| 238 | +- ECN not in use |
| 239 | +- TCP BBR tries to accommodate for this
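The buffer sizing rule of thumb above (bandwidth-delay product divided by the square root of the number of flows) works out as follows. A minimal sketch with a hypothetical function name and illustrative numbers, not figures from the talk:

```python
import math


def buffer_bytes(bandwidth_bps, rtt_s, flows=1):
    """Rule-of-thumb buffer size in bytes: BW * RTT / sqrt(N)."""
    bdp_bytes = bandwidth_bps * rtt_s / 8  # bandwidth-delay product
    return bdp_bytes / math.sqrt(flows)
```

A 10 Gbit/s link with 100 ms RTT needs about 125 MB for a single flow, but only about 1.25 MB when 10,000 flows share it.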
| 240 | + |
| 241 | +## Telstra |
| 242 | + |
| 243 | +Breaking the light barrier: Optical spectrum sharing |
| 244 | + |
| 245 | +- WDM expensive to deploy |
| 246 | +- Providers now offering spectrum with users owning their transponders |
| 247 | +- Services (Wavelength / Spectrum sharing / Dark Fibre) |
| 248 | + |
| 249 | +## AARNet |
| 250 | + |
| 251 | +AARNet Network Architecture |
| 252 | + |
| 253 | +- Automation |
| 254 | + - Model Driven approach (services, lifecycle, etc.) |
| 255 | + - Cisco NSO implementation |
| 256 | + - Northbound RESTCONF API |
| 257 | + - Multivendor, flexible (orchestrator or device can be master) |
| 258 | + - Service defined as YANG |
| 259 | + - YANG model then used to build device config template |
| 260 | + |
| 261 | +## Vocus |
| 262 | + |
| 263 | +Defending Telco Networks |
| 264 | + |
| 265 | +- Essentials |
| 266 | + - Separate networks |
| 267 | + - No Generic credentials |
| 268 | + - Harden |
| 269 | + - MFA, complex passwords that are rotated |
| 270 | + - CLI audits |
| 271 | + - Jumphosts (MFA) |
| 272 | + - Zero Trust |
| 273 | + |
| 274 | +## Arista |
| 275 | + |
| 276 | +Comfortable Complexity of Overlays |
| 277 | + |
| 278 | +- EVPN single service plane |
| 279 | + - Protocol reduction |
| 280 | + - Repeatable Model |
| 281 | + - Flexible multi-homing |
| 282 | + - Any Encapsulation |
| 283 | + - Converged teams |
| 284 | +- EVPN Gateway to stitch transport domains (i.e. dc to wan to campus) |
| 285 | + - RFC 9014 |
| 286 | + - Type-5 used and readvertised using GW next-hop |
| 287 | + |