Saturday, April 30, 2016

Some results in Network Automation

Last Friday my colleague Jose Montero wrote an Ansible playbook using a module I wrote. He used this module to configure Cisco CPEs through the serial port, giving only a few input parameters. The result was spectacular: he configured dozens of CPE routers with a default setup in a few hours.

Recently I wrote another playbook that configures all the equipment of our network to provision a customer VRF: a couple of MPLS PEs, NAT servers, switches and several pairs of BRAS in remote nodes. Some of these devices are Cisco, and others are Netgear switches, OpenWrt routers, etc., so the playbook is not Cisco-dependent and configures devices from different manufacturers. Doing the VRF provisioning procedure by hand could take some hours, but with this playbook it takes only a few minutes.

The power of Network Automation!

Wednesday, April 6, 2016

Sniff traffic in a remote node

Sniffing traffic on an interface is an excellent tool for a network manager. Linux-based routers like OpenWrt, or Linux servers, can use tools like tcpdump or Wireshark to capture the traffic on their interfaces. Mikrotik has its own tool too, “/tool sniffer”. The problem starts when you need to capture traffic on a device that doesn't have a facility for this purpose, and it grows when the device is placed in a remote node.

Fortunately, there is a solution for each problem. In this post I will explain how to capture traffic on any interface of any device placed in any remote node of your network, and how to send this capture to your computer in real time to view it with a graphical application like Wireshark. All you need is a little device with RouterOS and two network interfaces connected to the switch of the remote node.

See the scheme below to illustrate the example.


In the picture we can see a server, whose interface we want to sniff, and a Mikrotik router. The router has two interfaces connected to the switch: one of them will be used for managing the device and the other one will be the sniffer interface.

Ok. Let's do magic:
The first thing you must do is configure a mirror port on the switch. A mirror port sends every packet received on a source interface to a destination interface. Obviously we want to configure the port connected to the server as the source port of the mirror, and the port connected to the sniffer router (the Mikrotik) as the destination port.

The way to configure a pair of ports as mirror ports differs between manufacturers. On a RouterOS switch the command is:


/interface ethernet switch
  set switch1 mirror-source=ether3-slave-local mirror-target=ether4-slave-local

On Cisco IOS the commands are:

monitor session 1 source interface gigabitEthernet 1/1 both
monitor session 1 destination interface gigabitEthernet 1/2

After this first step the RouterOS router sees all the server's traffic. Now we need to send this traffic to our computer. RouterOS has a useful tool (/tool sniffer) that can do it. This is the configuration:

/tool sniffer
  set filter-interface=ether4-slave-local streaming-enabled=yes streaming-server=192.168.2.5

Ok. Now all the traffic of the server is sent to our computer, but RouterOS sends it using the TZSP protocol, so you must configure a Wireshark filter to view only this type of packet.
Here is an example:


Now you can filter the packets of the server you want to view:


Note that the traffic sent to our computer comes from the IP 192.168.0.1 (the sniffer router), but the source shown in Wireshark is 192.168.150.226 (the server). You have to look at the packet encapsulated inside the TZSP header.
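If you would rather process the stream yourself instead of (or before) opening Wireshark, the TZSP header is easy to strip. Below is a minimal Python sketch; the function and constant names are my own, but the header layout (version, type, encapsulated protocol, then tagged fields ending with an END tag) and the default UDP port 37008 follow the TZSP format that Wireshark dissects:

```python
import struct

TZSP_PORT = 37008   # default UDP port RouterOS streams to
TAG_PADDING = 0x00  # padding tag: no length byte, just skip it
TAG_END = 0x01      # end tag: the captured frame starts right after it

def strip_tzsp(datagram):
    """Return the encapsulated frame carried inside a TZSP datagram."""
    version, pkt_type, proto = struct.unpack_from("!BBH", datagram, 0)
    if version != 1:
        raise ValueError("unexpected TZSP version: %d" % version)
    offset = 4
    while True:
        tag = datagram[offset]
        offset += 1
        if tag == TAG_END:
            break
        if tag == TAG_PADDING:
            continue
        # every other tag carries a one-byte length plus its value
        offset += 1 + datagram[offset]
    return datagram[offset:]
```

A receiver is then just a UDP socket bound to TZSP_PORT that calls strip_tzsp() on every datagram and hands the resulting frame to whatever tool you like.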

Sunday, March 13, 2016

Ansible and cisco

I pushed a collection of Ansible modules for Cisco IOS routers to my GitHub.
There are two modules, each with an example of how to use it: one gathers some useful facts from Cisco IOS routers, and the other executes a list of commands from a file on a Cisco IOS router.
The cisco_exec_commands module doesn't replace the Ansible core modules that do the same (http://docs.ansible.com/ansible/list_of_network_modules.html#ios); it's just another useful module that can run on version 1.7.0.

The facts gathered by cisco_gather_facts can be used as inputs to other modules, or in templates to create documentation for inventorying, assessments, etc.

The playbook example adds a VRF and some interfaces in the VRF. In order to do that, the playbook has the following tasks:

  Search the BGP route reflectors of the network to check whether the RD you want to configure already exists.
  Search the BRAS to check whether a VRF with the same name exists.
  Add the new VRF and some interfaces, and place the interfaces in the VRF.
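The RD check in the first task essentially boils down to scanning the router's BGP VPNv4 output for a "Route Distinguisher:" line. This is not the module's actual code, just a hedged sketch of that kind of check (the helper name and the sample output are illustrative):

```python
import re

def rd_exists(show_output, rd):
    """Return True if the route distinguisher already appears in the
    output of a command such as 'show bgp vpnv4 unicast all'."""
    pattern = re.compile(r"Route Distinguisher:\s*" + re.escape(rd) + r"\b")
    return bool(pattern.search(show_output))
```

If the RD is already present on a route reflector, the playbook can abort before configuring anything.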

And here is the link: https://github.com/AntonioArriaga/ansible-cisco


NOTE: I used netlib python module from https://github.com/jtdub/netlib

UPDATE (04/04/2016): I added a module that connects to Cisco via the serial console port, and an example of its use.

Wednesday, February 17, 2016

Are you ready for disaster prevention and recovery?

Things that a good network administrator must do.


If you manage a medium or large network you probably know what this post is talking about. Being a network manager is fun and gratifying… if there are no problems. Unfortunately, SHIT HAPPENS. You can have the best devices on the market and the best topology and network strategy, but anybody can have a bad day. Are you prepared for network troubleshooting?

Here are some tips that can make your job easier when trouble appears.

Clusterize all your services.

In other words: make your services independent of the devices that run them. Make sure that the shutdown of a single device will not affect the running services.
  • Some services can run on different devices at the same time: PPPoE servers, RADIUS servers...
  • Use dynamic routing instead of static routing whenever you can. It will make your network auto-adaptable in case of a device or link failure.
  • If you cannot avoid using static routing, use first hop redundancy protocols (VRRP, HSRP, GLBP).
  • Replicate critical resources (databases, file systems).
  • Use a dual stack at layers 2 and 3 for critical devices: more than one switch with more than one IP network.
Combine these methods to avoid any single point of failure. For example, if you have a critical application that hits a database, that database must be replicated on more than one server. Each server must be connected to more than one switch. The communication between database engines must be done via loopback interfaces routed by a routing protocol that runs on each interface (on each IP network and on each switch). Then, the IP that serves the database connection to the application must have a failover method like VRRP on the servers.
The goal is that the shutdown of a single device cannot affect any network service.

Backup all configurations.

The most frequent trouble in a large network is broken hardware. How easy it is to replace depends on how prepared you are. Many devices have CLI interfaces that can be easily backed up with appropriate software. I use rancid. Rancid connects via SSH, Telnet or any other protocol that you enable. It collects the configuration and other useful information (firmware version, hardware properties, etc.) and stores all this data in a file.
When it detects any change in a device, it informs you via email. All device changes are registered in a CVS repository, so you can trace the changes made to any device.
If rancid can't connect to a device for more than 24 hours, it will warn you.
Running rancid once a day gives you the security of being able to configure a replacement device in a short time.


Monitor everything.

Troubleshooting a problem without information is hard work. Very hard work. Troubleshooting a sporadic problem that appears in short time ranges, without information, is impossible. So you must be prepared. Any network device can give you a lot of information that can be collected and stored for real-time or later analysis.
Well, at this point I’d like to make a distinction between two types of data:
  • Data that can be graphed: interface traffic or errors, temperature, CPU usage, number of BGP routes, etc.
    This information can be stored with a graphing application like Cacti, MRTG, Munin… It's very easy to analyze graphs to find information about a problem.
  • Data that cannot be graphed: syslog events, interface states, or any abnormal state in general. This data can be collected in two ways:
    • Data that the devices report, e.g. syslog events. It is important to organize this data at the moment of collecting it.
      Syslog is a good example: if you split the information into files named after the device it comes from, searching for information about a single host will be easier.
    • Data that we collect from the devices with an external system like Nagios or Icinga.
I have a law: every piece of data that can be monitored must be monitored. Some of it can be used to warn you about an abnormal state, some only for informational purposes, but all of it can be useful in the future. There is a lot of software that collects all this data automatically, processes it and alerts you if something goes wrong.

A little example:
It may seem that collecting temperature from devices is irrelevant, but I worked with an SHDSL modem that rebooted itself when its temperature reached 70ºC. This trouble was easily discovered because I had a graph of the device's temperature.

Stay prepared for dumping network traffic.

Sometimes it is very useful to sniff a specific interface of a device. A lot of troubles can be detected by sniffing traffic. The problem is that not all devices have this feature. Mikrotik or Linux hosts can sniff traffic with tools like “/tool sniffer” or tcpdump, but Cisco or Ubiquiti have no useful tool for this.
It's a good idea to have an ace up your sleeve in remote network nodes where you don't have devices that can sniff traffic. A simple and very useful method is to prepare a small sniffer device with more than one interface connected to the switch (or switches): one of them for managing the router and the other for sniffing traffic. Sniffing a specific device interface is then as easy as configuring the switch port connected to the sniffer as the “mirror port” of the device interface you want to monitor.
In a future post I will explain in more detail how to do this with a cheap Mikrotik router.

Alert on network changes

Like I said before, there is a lot of software that collects, analyzes and reports abnormal states of your network. Use it. At first it's hard to configure, but it will be one of the best ways to keep your network safe.

Be careful and look for any abnormal state. Some are very obvious: shut-down devices, down links… others can be less obvious, but they will alert you about an abnormal situation before it becomes a problem: excess traffic, a fan that doesn't work, too few OSPF neighbors, a large number of errors on an interface…
Correcting the little things before they become big problems is the best way to keep your network stable.

New services and topology changes must also raise alerts. You must decide whether each new situation is under control and meets your quality standards.

Automate and centralize management.

How many problems are caused by an error made while configuring a new service or changing the network topology?
Use tools to automate all the tasks you do frequently. Humans make errors, but a well-designed and well-configured tool for making changes never makes a mistake. My preferred software for this is Ansible. With a single playbook you can make a wide range of changes remotely, without syntax errors or forgotten parameters.

Keep informed of news about your equipment and services.

Companies update their equipment or software because something can be improved: bug fixes, security fixes, new features... A piece of news you read six months ago can give you a lead on a current problem.

Wednesday, January 13, 2016

Ansible and Mikrotik

Overview.

If you are a network administrator you probably have dozens of devices to manage. Usually each device is built by a different manufacturer, and although their administration may look similar, it tends to differ.
For some tasks you will need to apply several configurations to a group of devices that have different administration interfaces. This is a lot of work with a lot of opportunities for error.
Some manufacturers provide a centralized platform to configure their own equipment, but there is no single platform that can manage different devices and configure them all in one task.
So you have two alternatives:
  • Do it yourself. Develop your own platform that connects to your devices, updates their configurations and reports the result.
  • Adapt an existing platform. There is some free software available, but obviously you must configure and adapt it. We will take this route in this post.

What is Ansible.

Ansible's web site describes it as “a radically simple IT automation platform that makes your applications and systems easier to deploy”.
Ansible is software that keeps a collection of well-described hosts, scripts, templates and variables, uses them to manage groups of hosts in a simple and automatic way, and reports the result of the changes made on those hosts.

One of the most common examples is updating a config file on a web server farm and reloading the web server daemon on all hosts of the farm. Ansible connects to each host, changes the config file, reloads the daemon and reports the result to the system administrator.

It's powerful software, but all that glitters is not gold.

The problem.

Unfortunately, Ansible is oriented towards managing Linux hosts. More exactly, Ansible expects the remote hosts to run Python. Fortunately, you are a good network administrator who reads good blogs, and you can adapt Ansible to work with almost any device that can be administered via SSH, Telnet or an API.

Scenario.

The goal is to show how Ansible can be configured to manage almost any network device. To do this we will build a module and use it. The module must connect to the network device in whatever way you choose, and it must report whether the configuration changed something on the remote device, whether a problem occurred, or whether everything went fine.

For the example we will create a complete set of queues on a Mikrotik router. We will build a module that reads a YAML file with the queue descriptions, connects to the Mikrotik via its API and adds all the queues to the router. This module uses the API on purpose, to show that any management protocol supported by network devices can be used, but I also have modules that manage routers and switches via SSH and Telnet.

Basic configurations.

As I commented before, this is not a manual on installing Ansible. I assume you can read the Ansible documentation yourself and install it without my help.
The first thing we need to do is declare a host in Ansible's hosts file. We need to provide a user/password for API access (only for this example, which uses the API; the best way is to do this with an SSH key pair and no user/password).

[mikrotik]
192.168.150.1 username=ansible password=s0mEStr0ngP4ssw0rd

And here is a basic YAML playbook to test the connection. We will use roles. They aren't needed for this basic example, but it's a good habit to organize the information from the first steps.

# cat mktQueue.yml
---
- name: Test connection
  hosts: 192.168.150.1
  gather_facts: no

  roles:
  - mikrotik

By default Ansible tries to gather a lot of information from remote hosts, but it uses Python to do so. With “gather_facts: no” we make sure that Ansible does not try to collect this information.

# cat roles/mikrotik/tasks/main.yml
- name: Test connection
  addqueues:
    hostname: "{{ inventory_hostname }}"
    username: "{{ username }}"
    password: "{{ password }}"
  delegate_to: 127.0.0.1

The line “delegate_to: 127.0.0.1” tells Ansible that the module “addqueues” must run locally (on the same host where Ansible is running) and not on the remote device.

# cat roles/mikrotik/library/addqueues.py
#! /usr/bin/python

import rosapi
import socket

from ansible.module_utils.basic import *

def main():

  module = AnsibleModule(
    argument_spec=dict(
      hostname=dict(required=True),
      username=dict(required=True),
      password=dict(required=True),
      )
    )

  hostname = module.params['hostname']
  username = module.params['username']
  password = module.params['password']
  changed = False
  msg = ""

  # Open the API socket (TCP 8728) and log in; if this fails, the
  # exception will make the task fail.
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect((hostname, 8728))
  apiros = rosapi.RosAPI(s)
  apiros.login(username, password)
  s.close()

  module.exit_json(changed=changed, msg=msg, username=username, password=password)

if __name__ == '__main__':
  main()

I used this Python module to connect via the API: https://pypi.python.org/pypi/rosapi
With this basic configuration we can test the connection.

Build your own module.

Now for the most interesting part of the post: building an Ansible module. This Python module will read a YAML file placed in the “files” directory of the role, build a Mikrotik queue tree configuration, connect to the Mikrotik router via the API and apply the configuration.

# cat roles/mikrotik/library/addqueues.py      
#! /usr/bin/python

import sys
import string
import rosapi
import socket

from yaml import load, dump
try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper

from ansible.module_utils.basic import *

#
# Function applyQueue
# Connects to the Mikrotik via API and applies all the queues previously
# processed by processQueues.
#

def applyQueue (hostname, username, password, queues):

  returnValue ={'changed': False, 'error': ""}
  error=None
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect((hostname, 8728))
  apiros = rosapi.RosAPI(s)
  apiros.login(username, password)

  for singleQueue in queues:
    newQueue=[]
    newQueue.append("/queue/tree/add")
    for (param, value) in singleQueue.iteritems():
        newQueue.append("=" + str(param) + "=" + str(value))

    apiros.write_sentence(newQueue)
    output=apiros.read_sentence()

    if output[0] != "!done":
        # Stop at the first error: later queues may depend on this one.
        returnValue['error']=str(output[0]) + ": " + str(output[1]) + " in \"" + singleQueue['name'] + "\""
        break
    else:
        returnValue['changed']=True

  s.close()
  return returnValue

#
# Function processQueues
# For a well-formatted dictionary of properties/values, returns an ordered
# array of dictionaries describing a queue and its children.
#


def processQueues( queues ):
  newQueue={}
  orderedQueues=[]

#
# In a first round search all properties/values of the queue.
#
  for param, value in queues.iteritems():
    if not isinstance(value, list):
#
# The "comment" property value must be wrapped in quotes.
#
      if param=="comment":
        newQueue[param]="\""+value+"\""
      else:
        newQueue[param]=value

  orderedQueues.append(newQueue)

#
# In a second round, search for its children. Each child is treated as a new queue (recursive call).
#
  for param, value in queues.iteritems():
    if isinstance(value, list):
      for subQueue in value:
        subQueue['parent']=newQueue['name']
        orderedQueues = orderedQueues + processQueues(subQueue)

  return orderedQueues



def main():

  module = AnsibleModule(
    argument_spec=dict(
      hostname=dict(required=True),
      username=dict(required=True),
      password=dict(required=True),
      queuesFile=dict(required=True)
      )
    )

  hostname = module.params['hostname']
  username = module.params['username']
  password = module.params['password']
  queuesFile = module.params['queuesFile']
  changed = False
  queuesToApply = []

#
#  Open the YAML file with queues configuration
#
  yamlFile=open(queuesFile, 'r')
  queues = load(yamlFile, Loader=Loader)
  yamlFile.close()

#
# For each queue in the YAML file, process it (build the queue and its
# children, grandchildren, etc.). After all queues are processed, apply them.
#

  for queue in queues:
    queue['parent']="global"
    queuesToApply = queuesToApply + processQueues(queue)

  result=applyQueue (hostname, username, password, queuesToApply)

#
# return the result of the operation.
#

  changed=result['changed']

  if result['error']:
    module.fail_json(changed=changed, msg=result['error'])
  else:
    module.exit_json(changed=changed, result=result['error'], username=username, password=password)


if __name__ == '__main__':
    main()
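
To make the flattening easier to follow, here is the same parent-before-children idea as a standalone sketch (simplified: it looks for an explicit "children" key, like the YAML queue definition, and skips the comment quoting):

```python
def flatten(queue):
    """Flatten a nested queue definition into an ordered list where each
    queue precedes its children and children get parent=<queue name>."""
    props = {k: v for k, v in queue.items() if k != "children"}
    ordered = [props]
    for child in queue.get("children", []):
        child["parent"] = props["name"]
        ordered += flatten(child)
    return ordered
```

The ordering matters: /queue/tree/add refers to the parent by name, so a parent queue must be created before any of its children.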

For the tree example I will use the tree configuration from Greg Sowell's blog. But I want to show a three-level tree structure, so I have added two extra queues called “high-priority-in” and “high-priority-out” and placed the VoIP and admin queues as children of the high-priority queues:

# cat roles/mikrotik/files/queuesDefinition.yml    
- max-limit: 10M
  name: in
  parent: global
  queue: default
  children:
  - limit-at: 3M
    max-limit: 10M
    name: http-in
    packet-mark: http-in
    priority: 4
    queue: default
  - limit-at: 4M
    max-limit: 10M
    name: streaming-video-in
    packet-mark: streaming-video-in
    priority: 3
    queue: default
  - limit-at: 500k
    max-limit: 10M
    name: gaming-in
    packet-mark: games-in
    priority: 2
    queue: default
  - max-limit: 10M
    name: download-in
    packet-mark: in
    queue: default
  - limit-at: 1M
    max-limit: 10M
    name: customer-servers-in
    packet-mark: customer-servers-in
    priority: 1
    queue: default
  - limit-at: 500k
    max-limit: 10M
    name: vpn-in
    packet-mark: vpn-in
    priority: 2
    queue: default
  - name: high-priority-in
    priority: 1
    queue: default
    children:
    - limit-at: 500k
      max-limit: 10M
      name: voip-in
      packet-mark: voip-in
      priority: 1
      queue: default
    - limit-at: 500k
      max-limit: 10M
      name: admin-in
      packet-mark: admin-in
      priority: 5
      queue: default
- max-limit: 10M
  name: out
  parent: global
  queue: default
  children:
  - max-limit: 10M
    name: upload-out
    packet-mark: out
    queue: default
  - name: high-priority-out
    priority: 1
    queue: default
    children:
    - limit-at: 1M
      max-limit: 10M
      name: customer-servers-out
      packet-mark: customer-servers-out
      priority: 6
      queue: default
    - limit-at: 500k
      max-limit: 10M
      name: voip-out
      packet-mark: voip-out
      priority: 1
      queue: default
    - limit-at: 500k
      max-limit: 10M
      name: admin-out
      packet-mark: admin-out
      priority: 3
      queue: default
  - limit-at: 500k
    max-limit: 10M
    name: gaming-out
    packet-mark: games-out
    priority: 2
    queue: default
  - limit-at: 3M
    max-limit: 10M
    name: http-out
    packet-mark: http-out
    priority: 4
    queue: default
  - limit-at: 4M
    max-limit: 10M
    name: streaming-video-out
    packet-mark: streaming-video-out
    priority: 3
    queue: default
  - limit-at: 500k
    max-limit: 10M
    name: vpn-out
    packet-mark: vpn-out
    priority: 2
    queue: default

With these extra configurations we must update the “tasks” file:

# cat roles/mikrotik/tasks/main.yml
- name: Add queues
  addqueues:
    hostname: "{{ inventory_hostname }}"
    username: "{{ username }}"
    password: "{{ password }}"
    queuesFile: "{{ playbook_dir }}/roles/mikrotik/files/queuesDefinition.yml"
  delegate_to: 127.0.0.1

And finally, we can run it:


An error will show something like this:
Final notes and conclusions

Obviously, nobody needs an Ansible configuration to apply a dozen queues. It makes no sense to do so much work for a task that you will probably never repeat. But this is only an example of how Ansible can manage network devices in a centralized way.
Some more interesting cases can be:

  • A module that connects to a Mikrotik, creates a Mikrotik script from a template placed on the host that runs Ansible, runs it on the Mikrotik devices and returns the result. This module can be a good way to update any general configuration on any number of Mikrotik devices in your network (for example, updating your syslog server). If you build this module, the next time you need to do a task on all your Mikrotik devices the only thing you must write is the Mikrotik script, and applying it across the whole network will be easy.
  • A group of roles, each with its own modules, that connect to groups of Mikrotik devices, Cisco devices and switches, and configure a specific service that needs changes on all of them.