
I was recently fortunate enough to be called on to diagnose and fix some apparent issues with the IP network at a site of national importance. This is a tech-y one so grab a cuppa. In a nutshell, an engineering oversight on the part of a well-known, quality camera manufacturer, coupled with a less-than-ideal network topology, led to a Denial of Service event that brought a critical network to its knees. Whoops.
The problem:
A hapless engineer emailed describing periodic downtime of the network video recorders (NVRs), coupled with access control playing up in some areas of the complex, and couldn't figure out what was going on. The cameras were a well-known European brand, as were the NVRs; the network infrastructure was from a popular Californian manufacturer. All quality equipment, correctly installed by certified engineers. On my arrival the issues were immediately apparent: NVRs were going bananas left, right and centre, the access control system in one of the buildings was periodically denying entry, and remote management of the switch in said building was flaky at best. These problems also appeared to come in surges, quieten down, then surge again. This all smelt, loosely, of link saturation, but a few things didn't make sense:
Saturation of parts of the network would not cause this model of NVR to reset - more likely one would see packet loss on the video streams inbound to the NVR, resulting in "greening" of the decoded H.264 video.
These "surges" were not strictly periodic, yet strangely regular.
The problems did not coincide with any particular network engineering change but had been slowly worsening over time.
The analysis:
Wondering initially if there were broadcast storms from a loop somewhere I fired up Wireshark; what I saw next was truly awesome...
Initially, after connecting to a port on that VLAN, everything looked normal. There was the usual trickle of ARPs, BPDUs, mDNS traffic etc., then suddenly WHOOMPH, Wireshark lit up - I was receiving hundreds of megabits of UDP with brief moments of full gigabit saturation. Crumbs. During this time all the NVRs started to reboot, then slowly the whole thing quietened down a bit and we were back to square one. Quiet as a mouse.
/* Allow me to digress to networking 101. When a switch receives a frame it matches the destination MAC address against its MAC address table and forwards the frame out of the corresponding port. Lacking an entry, it floods the frame out of every other port in that VLAN (unknown-unicast flooding). As soon as the destination device sends a frame of its own, the switch learns its MAC from the source address, the table is updated, and traffic is once again forwarded out of a single port. A quick sidestep to RTSP/RTCP/RTP: RTSP is used to request real-time data, which in the case of an IP camera usually amounts to RTP over UDP. RTCP reports then act as periodic "Hello, I still want that data" messages back to the camera; if those "wellness" messages stop, the device should stop sending RTP data (RFC 2326, p. 77). To tangent again, Cisco C6K series platforms, for example, have unicast flood rate-limiting but alas, very few installations use that grade of kit, so flood control is usually limited to broadcast and multicast traffic. The stage is set. Thus... */
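To make that flooding rule concrete before the plot resumes, here is a toy model in Python - purely illustrative, glossing over VLAN tags, ageing timers and broadcast handling, and certainly not any vendor's actual implementation:

```python
# Toy L2 switch: learn source MACs, forward known unicast out of a single
# port, flood unknown unicast out of every other port in the VLAN.
class ToySwitch:
    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}                        # MAC -> port

    def receive(self, in_port, src_mac, dst_mac):
        self.mac_table[src_mac] = in_port          # learn (or refresh) the sender
        out = self.mac_table.get(dst_mac)
        if out is not None and out != in_port:
            return {out}                           # known unicast: one egress port
        return self.ports - {in_port}              # unknown unicast: flood the VLAN

sw = ToySwitch(["Gi0/1", "Gi0/2", "Gi0/3"])
print(sw.receive("Gi0/1", "cam-01", "nvr-01"))     # NVR not yet learned -> flooded
print(sw.receive("Gi0/2", "nvr-01", "cam-01"))     # NVR speaks -> switch learns its port
print(sw.receive("Gi0/1", "cam-01", "nvr-01"))     # now forwarded to Gi0/2 only
```

The port names and MAC labels above are made up; the point is simply that any destination the switch cannot place gets sprayed across the whole VLAN.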
A rebooting device will have had its network interface down for a time, so all camera traffic destined for that device will be flooded across the L2 domain until the interface is back up. In the case of a single NVR there should have been a calculable, smallish amount of traffic flooded - so why was I seeing many times that? And why were those flows not stopping after a sensibly short period? The NVR takes about one minute to reboot, by which time these stale video streams should have long since timed out.
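By way of a back-of-envelope check - the camera count and per-stream bitrate below are illustrative assumptions, not the site's real figures:

```python
# Rough flooded-traffic estimate while ONE NVR reboots (assumed figures).
cameras_per_nvr = 16      # assumption
mbps_per_stream = 6       # assumption: a reasonable H.264 stream

expected = cameras_per_nvr * mbps_per_stream
print(f"One NVR's worth of streams: ~{expected} Mbit/s")    # ~96 Mbit/s

# If each camera is also quietly running several stale streams...
stale_streams_per_camera = 6   # the figure that later turned up in the capture
print(f"With 6 streams per camera:  ~{expected * stale_streams_per_camera} Mbit/s")  # ~576 Mbit/s
```

Even with invented numbers the shape of the problem is clear: one NVR's worth of video should amount to well under 100 Mbit/s of flooded traffic, not hundreds of megabits with moments of gigabit saturation.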
After drilling into the initial Wireshark capture it became apparent that, for reasons unknown, there were no fewer than six separate streams (based on RTP sequence numbers) from one camera destined for one NVR. Six streams, eh? At this point a camera investigation seemed prudent. After isolating a camera, rebooting it, connecting to it using good ol' VLC media player and then killing the application (preventing VLC from sending its usual RTSP TEARDOWN request, and any further RTCP reports), lo and behold the RTP stream continued. I connected again and got a second stream. Repeated, then a third. Again, a fourth. I then waited and waited. All four streams kept flying into my laptop unabated; the only way of stopping the deluge was to reboot the camera. Wow. After a subsequent reboot of every IP camera mapped to the busiest NVR, things settled down a little.
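For anyone wanting to repeat the stream-counting exercise, something along these lines does the grouping automatically - here by SSRC rather than by eyeballing sequence numbers. It is only a sketch using the pyshark library: the pcap filename is a placeholder, and depending on the capture you may need the RTP heuristic dissector (attempted below) or a manual "Decode As" before the bare UDP shows up as RTP:

```python
# Sketch: count distinct RTP streams (by SSRC) per src -> dst pair in a pcap.
from collections import defaultdict
import pyshark

streams = defaultdict(set)                            # (src_ip, dst_ip) -> SSRCs

cap = pyshark.FileCapture(
    "security_vlan.pcap",                             # placeholder filename
    display_filter="rtp",
    override_prefs={"rtp.heuristic_rtp": "TRUE"},     # try plain UDP as RTP
)
for pkt in cap:
    try:
        streams[(pkt.ip.src, pkt.ip.dst)].add(pkt.rtp.ssrc)
    except AttributeError:
        continue                                      # non-IP or oddly dissected frame
cap.close()

for (src, dst), ssrcs in sorted(streams.items()):
    print(f"{src} -> {dst}: {len(ssrcs)} RTP stream(s)")
```

A healthy camera-to-NVR pairing should show one stream per requested channel; here one pairing showed six.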
As was now apparent, the sequence of events from a freshly booted, quiescent system was as follows (a toy simulation of the resulting loop appears after the list):
1. NVRs boot up and all request streams from their IP cameras.
2. The IP cameras dutifully respond.
3. ***SOMETHING HAPPENS*** - e.g. a network outage between the NVR and a camera causes the NVR to flush that connection and request a new one once the camera reappears. Perfectly normal stuff. But the camera never stops sending its original stream, so:
4. The NVRs now each have two inbound data streams from the camera, one of which they are discarding.
5. Step 3 happens again and again, across multiple camera channels, until...
6. The NVRs are receiving so much extra traffic that they reboot in an attempt to flush their buffers and connections.
7. During the reboot sequence the switch correctly floods that NVR's unicast traffic to all other ports on the VLAN, causing other NVRs to reboot in turn. This cluster of reboots, sliding in and out of phase with one another, was generating periods of relative calm followed by periods of total calamity; these were the "surges" I had been observing.
8. Concurrently, due to the VTP/VLAN trunk design of the network, this traffic is flooded to all corners of the security VLAN, i.e. everywhere.
9. All hell breaks loose. NVRs start cycling, links slow to a crawl, slower links saturate completely, and access control devices are flooded with traffic, causing them to misbehave.
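If you like to see a failure mode before believing it, the loop above can be caricatured in a few lines of Python. Every figure here is invented purely to reproduce the shape of the behaviour - spells of calm punctuated by surges of flooded traffic - rather than to model the real site:

```python
# Toy simulation of the feedback loop: stale streams accumulate, NVRs reboot,
# and traffic for a rebooting NVR is flooded across the VLAN. Invented numbers.
import random

random.seed(7)

N_NVRS, CAMS, MBPS = 4, 16, 6   # assumed: 4 NVRs, 16 cameras each, 6 Mbit/s per stream
GLITCH_P = 0.05                 # chance per tick of a blip forcing an NVR to re-request
NVR_LIMIT = 200                 # Mbit/s of inbound video beyond which an NVR struggles
PATIENCE = 10                   # ticks an NVR tolerates the excess before giving up
REBOOT_TICKS = 5                # ticks a reboot keeps the NVR's port down

streams = [1] * N_NVRS          # RTP streams per camera feeding each NVR
strain = [0] * N_NVRS
down = [0] * N_NVRS

for tick in range(120):
    flooded = 0
    for n in range(N_NVRS):
        inbound = streams[n] * CAMS * MBPS
        if down[n] > 0:                     # rebooting: MAC unknown, traffic floods the VLAN
            flooded += inbound
            down[n] -= 1
            if down[n] == 0:
                streams[n] += 1             # NVR re-requests; cameras never drop old streams
                strain[n] = 0
        elif inbound > NVR_LIMIT:
            strain[n] += 1                  # drowning in video...
            if strain[n] >= PATIENCE:
                down[n] = REBOOT_TICKS      # ...until it reboots to "clear" its connections
        elif random.random() < GLITCH_P:
            streams[n] += 1                 # a blip: new stream requested, stale one lives on
    print(f"t={tick:3d}  flooded {flooded:4d} Mbit/s  " + "#" * (flooded // 100))
```

Run it and you get stretches of nothing at all, then bursts of several hundred megabits as NVRs drop into their reboot cycles and drift in and out of phase with one another - much the same picture the Wireshark session painted.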
The solution:
Rebooting every IP camera temporarily alleviated the issues, and upgrading the camera firmware stopped the multiple-stream problem for good; a simple solution to a horrendous set of events. However, there's another question that should really be asked:
How could this camera problem have been prevented from turning into a network-wide disaster?
Keep firmware up to date and carefully test those firmware releases.
Test all devices fully before installing them. A short session with VLC and Wireshark would have thrown up the issue with the firmware version in this camera.
Segregate services; there was no need for a single VLAN spanning the whole security estate - access control, cameras, NVRs and all. Split things up! There are 4094(ish) VLAN IDs available.
Protect slow links. There were a couple of 100 Mbit/s FDX links which were being completely saturated, leaving access control unavailable.
Implement out-of-band management. The nature of this problem meant that many network devices' management interfaces became completely unavailable during the surges.
Implement a fully functional monitoring system. sFlow / NetFlow would have shown this multiplication of bandwidth on the very first iteration, before it became an operational issue (a toy illustration of that kind of check follows this list).
Consider implementing QoS for the access control system and other critical low bandwidth systems.
Implement a routed L3 network. L2 networks built around VTP, VLANs and spanning tree are neither resilient to certain classes of problem nor particularly secure. L3 networks (ideally L3 to the client, or at least L3 to the closet) based around routing protocols and ACLs have inherent advantages - protection from broadcast storms and firewall-style security, for example. Such a topology would have mitigated the outages here by confining the unicast flooding to one segment rather than the entire broadcast domain.
Be aware of the capabilities of your equipment. All the kit installed on this site was capable of running dynamic routing, flow monitoring and ACLs. A bit of planning by a skilled engineer could have delivered a system that was secure and manageable, and that detected spikes like these, dynamically blocked them and generated alerts in the process - all with no extra capital cost.
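On the monitoring point above: once per-device byte counts are being exported, the check itself is trivial. A hypothetical sketch, assuming you already have per-camera bits-per-second figures out of an sFlow/NetFlow collector - the camera names, baselines and threshold are all invented:

```python
# Hypothetical check: flag any camera pushing well above its learned baseline,
# as each stale RTP stream would make it do. All figures below are invented.
BASELINE_MBPS = {"cam-01": 6.0, "cam-02": 6.0, "cam-03": 4.0}
RATIO_ALERT = 1.5                         # alert at 1.5x the expected rate

def check(current_mbps):
    """Return alert strings for cameras sending well above their baseline."""
    alerts = []
    for cam, mbps in current_mbps.items():
        base = BASELINE_MBPS.get(cam)
        if base and mbps > base * RATIO_ALERT:
            alerts.append(f"{cam}: {mbps:.1f} Mbit/s vs ~{base:.1f} expected "
                          f"({mbps / base:.1f}x) - possible duplicate stream")
    return alerts

# e.g. cam-02 has quietly acquired a second (stale) stream:
for line in check({"cam-01": 6.1, "cam-02": 12.2, "cam-03": 4.0}):
    print(line)
```

Something even this crude, wired into an alerting system, would have flagged the very first stale stream long before the surges began.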
Unfortunately equipment mishaps are more common than one might hope, as are malicious attacks, but these potential disasters can be mitigated by careful design decisions on the part of the network engineer. So there we have it - whoops. But which was the worst whoops?