Wednesday, February 17, 2016

Are you ready for disaster prevention and recovery?

Things that a good network administrator must do.


If you manage a medium or large network, you probably know what this post is talking about. Being a network manager is fun and gratifying… as long as there are no problems. Unfortunately, SHIT HAPPENS. You can have the best devices on the market and the best topology and network strategy, but anybody can have a bad day. Are you prepared for network troubleshooting?

Here are some tips that can make your job easier when trouble appears.

Clusterize all your services.

In other words: make your services independent of the devices that run them. Make sure that the shutdown of a single device will not affect the running services.
  • Some services can run on different devices at the same time: PPPoE servers, RADIUS servers...
  • Use dynamic routing instead of static routing whenever you can. It will make your network adapt automatically to a device or link failure.
  • If you cannot avoid static routing, use first-hop redundancy protocols (VRRP, HSRP, GLBP).
  • Replicate critical resources (databases, file systems).
  • Use a dual stack at layers 2 and 3 for critical devices: more than one switch with more than one IP network.
Combine these methods to avoid any single point of failure. For example, if you have a critical application that queries a database, that database must be replicated on more than one server. Each server must be connected to more than one switch. The communication between database engines must be done via loopback interfaces routed by a routing protocol that runs on each interface (in each IP network and through each switch). Then, the IP that serves the database connection to the application must have a failover method like VRRP on the servers.
The goal is that the shutdown of a single device cannot affect any network service.
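As an example of the VRRP piece of that setup, this is roughly what it looks like with keepalived on one of the Linux database servers. Keepalived itself, the interface name, the virtual IP and the password are all illustrative assumptions; the second server gets state BACKUP and a lower priority:

```
# /etc/keepalived/keepalived.conf -- primary database server (sketch)
vrrp_instance DB_VIP {
    state MASTER              # BACKUP on the other server
    interface eth0            # interface facing the application (placeholder)
    virtual_router_id 51
    priority 150              # lower on the other server
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cret      # placeholder
    }
    virtual_ipaddress {
        10.0.0.10/24          # the IP the application connects to (placeholder)
    }
}
```

The application always connects to 10.0.0.10; if the MASTER server dies, the BACKUP claims the address within a few seconds.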

Back up all configurations.

The most frequent trouble in a wide network is broken hardware. Replacing it can be easy, if you are prepared. Many devices have CLI interfaces that can be easily backed up with appropriate software. I use rancid. Rancid connects via SSH, telnet or any other protocol that you enable. It collects the configuration and other useful information (firmware version, hardware properties, etc.) and stores all this data in a file.
When it detects any change in a device, it informs you via email. All changes are registered in a CVS repository, so you can trace the changes made to any device.
If rancid can't connect to a device for more than 24 hours, it will warn you.
Running rancid once a day gives you the assurance that you can configure a replacement device in a short time.
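As a sketch, a minimal rancid setup needs only three pieces. Hostnames and credentials below are placeholders, and note that rancid 3.x separates router.db fields with ';' while older versions use ':':

```
# router.db -- one monitored device per line: name;type;state
core-sw1.example.net;cisco;up
edge-r1.example.net;juniper;up

# ~/.cloginrc -- login data used by clogin/jlogin (keep this file chmod 600)
add user     core-sw1.example.net  backup
add password core-sw1.example.net  {login-pw} {enable-pw}
add method   core-sw1.example.net  ssh

# crontab entry: collect once a day, as suggested above
0 4 * * * rancid-run
```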


Monitor everything.

Troubleshooting a problem without information is hard work. Very hard work. Troubleshooting a sporadic problem that only appears in short time windows, without information, is impossible. So you must be prepared. Any network device can give you a lot of information that can be collected and stored for real-time or later analysis.
Well, at this point I’d like to make a distinction between two types of data:
  • Data that can be graphed: interface traffic or errors, temperature, CPU usage, number of BGP routes, etc.
    This information can be stored by a graphing application like Cacti, MRTG or Munin… It's very easy to analyze graphs to find information about a problem.
  • Data that cannot be graphed: syslog events, interface states, or any abnormal state in general. This data can be collected in two ways:
    • Data that devices report, e.g. syslog events. It is important to organize this data at the moment of collection.
      Syslog itself is a good example: if you split the information into files named after the device each message comes from, searching for information about a single host will be easier.
    • Data that we collect from devices with an external system like Nagios or Icinga.
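Those per-device syslog files can be produced with a few lines of Python. A sketch, assuming the classic "MMM DD HH:MM:SS hostname message" syslog line layout and a writable log directory:

```python
import os
import re

# Classic BSD syslog layout: "Feb 17 10:00:01 core-sw1 %LINK-3-UPDOWN: ..."
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+(.*)$")

def split_by_host(lines, log_dir="/var/log/network"):
    """Append each syslog line to a file named after the device it came from."""
    os.makedirs(log_dir, exist_ok=True)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a syslog-shaped line; skip it
        _timestamp, host, _message = match.groups()
        with open(os.path.join(log_dir, host + ".log"), "a") as f:
            f.write(line.rstrip("\n") + "\n")
```

Fed from a pipe out of rsyslog or syslog-ng (or run over the day's combined log), this leaves you with one searchable file per device.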
I have a law: every piece of data that can be monitored must be monitored. Some of this data can be used to warn you about an abnormal state, other data may only serve informational purposes, but all of it can be useful at some future time. There is a lot of software that collects all this data automatically, processes it and alerts you if something goes wrong.

A little example:
It may seem that collecting temperature from devices is irrelevant, but I worked with an SHDSL modem that rebooted itself when its temperature reached 70 ºC. This problem was easily discovered because I had a graph of the device's temperature.
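Turning such a graph into an alert is trivial. A sketch, where the 5 ºC margin below the observed 70 ºC reboot point is an arbitrary choice:

```python
REBOOT_TEMP = 70.0   # temperature at which that SHDSL modem rebooted itself
WARN_MARGIN = 5.0    # warn a few degrees before the critical point (arbitrary)

def temperature_alerts(samples, warn_at=REBOOT_TEMP - WARN_MARGIN):
    """Return the (host, temperature) pairs that deserve a warning.

    `samples` is an iterable of (host, temperature_in_celsius) tuples,
    e.g. the latest reading per device from your graphing system.
    """
    return [(host, temp) for host, temp in samples if temp >= warn_at]
```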

Stay prepared for dumping network traffic.

Sometimes it is very useful to sniff a specific interface of a device. A lot of problems can be detected by sniffing traffic. The problem is that not all devices have this feature. Mikrotik or Linux hosts can sniff traffic with tools like “/tool sniffer” or “tcpdump”, but Cisco or Ubiquiti does not have a useful tool for this.
It's a good idea to have an ace up your sleeve in remote network nodes where you don't have devices that can sniff traffic. A simple and very useful method is to prepare a small sniffer device with more than one interface connected to the switch (or switches): one of them for managing it and the others for capturing traffic. Sniffing a specific device interface is then as easy as configuring the switch port connected to the sniffer as a “mirroring port” of the device interface you want to monitor.
In a future post I will explain better how to do this with a cheap Mikrotik router.
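As a concrete sketch of that mirroring setup on a Cisco IOS switch (interface numbers are placeholders): the monitored device hangs off Gi0/1 and the sniffer's capture interface off Gi0/24.

```
! SPAN session: copy everything seen on the monitored port...
monitor session 1 source interface GigabitEthernet0/1 both
! ...to the port where the sniffer's capture interface is plugged in
monitor session 1 destination interface GigabitEthernet0/24
```

On the sniffer box itself, something like `tcpdump -i eth1 -n -w capture.pcap` (the interface name is an assumption) stores the mirrored traffic for later analysis in Wireshark.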

Alert on network changes.

As I said before, there is a lot of software that collects, analyzes and reports abnormal states in your network. Use it. At first it's hard to configure, but it will be one of the best ways to keep your network safe.

Be careful and watch for any abnormal state. Some of them are very obvious: shut-down devices, down links… others can be less obvious, but they will alert you about an abnormal situation before it becomes a problem: excess traffic, a fan that doesn't work, too few OSPF neighbors, a large number of errors on an interface…
Correcting the little things before they become big problems is the best way to keep your network stable.
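The "too few OSPF neighbors" case maps naturally onto the Nagios/Icinga plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL). A sketch; how you obtain the neighbor count (SNMP, SSH, an API...) is up to you:

```python
# Standard Nagios/Icinga plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2

def check_ospf_neighbors(count, expected):
    """Map an OSPF neighbor count to a Nagios-style (status, message) pair."""
    if count >= expected:
        return OK, f"OK: {count}/{expected} OSPF neighbors up"
    if count >= expected - 1:
        return WARNING, f"WARNING: only {count}/{expected} OSPF neighbors up"
    return CRITICAL, f"CRITICAL: only {count}/{expected} OSPF neighbors up"
```

Wrapped in a small script that prints the message and exits with the status code, Nagios or Icinga can schedule it like any other check.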

New services and topology changes must also raise alerts. You must decide whether each new situation is under control and meets your quality standards.

Automate and centralize management.

How many problems are caused by an error while configuring a new service or making a change in the network topology?
Use tools to automate all the tasks you do frequently. Humans make errors, but a well-designed and well-configured tool for making changes never does. My preferred software for this is Ansible. With a single playbook you can make a wide range of changes remotely without syntax errors or forgotten parameters.
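As an illustration, a playbook sketch that pushes the same change to every access switch. The inventory group name and the ios_config module (for Cisco devices; other vendors have equivalents) are assumptions:

```
# ensure-mgmt-vlan.yml -- run with: ansible-playbook ensure-mgmt-vlan.yml
- hosts: access_switches          # inventory group (placeholder)
  gather_facts: no
  tasks:
    - name: make sure the management VLAN exists on every switch
      ios_config:                 # Cisco IOS module; swap for your vendor's
        lines:
          - vlan 100
          - name MGMT
```

The same playbook applies the change to one switch or to a hundred, with identical syntax on every one of them.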

Keep informed of news about your equipment and services.

Companies update their equipment and software because something can be improved: bug fixes, security fixes, new features... A piece of news you read six months ago can give you a clue about a current problem.