I’m a big fan of RANCID for managing configurations for embedded devices, such as most routers and switches. While you can go buy CiscoWorks, OpenView, or any number of proprietary products, RANCID is good enough for the overwhelming majority of us. (Those products do have other advantages, but simple configuration revision control isn’t one of them.)
For those who haven’t used RANCID: it logs into your devices every hour, gets the device configuration, and compares it to the stored configuration. If the configuration has changed, RANCID checks the new version into CVS. Combined with CVSWeb, RANCID really simplifies embedded device management.
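The compare-and-check-in idea is easy to picture with a toy example. This is only a sketch of the concept in Python, not RANCID's actual code. The function name config_diff and the two sample configurations are made up for illustration.

```python
import difflib

def config_diff(stored, fetched, name="device"):
    """Return a unified diff between two config texts, or '' if they match."""
    diff = difflib.unified_diff(
        stored.splitlines(),
        fetched.splitlines(),
        fromfile=f"{name} (stored)",
        tofile=f"{name} (fetched)",
        lineterm="",
    )
    return "\n".join(diff)

# Hypothetical sample data: the stored copy and a freshly fetched copy.
stored = "/system identity\nset name=edge-1\n"
fetched = "/system identity\nset name=edge-one\n"

changes = config_diff(stored, fetched, "edge-1")
if changes:
    # A real tool would check in the new version and mail the diff at this point.
    print(changes)
```

An empty result means nothing changed, so there is nothing to check in and no email to send.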
Every now and then it breaks, however. Last week, I started getting an email every hour, whining that RANCID couldn’t get the configuration of one of my Mikrotik border routers. I hadn’t changed the router configuration in several days. My cow-orkers claimed they hadn’t touched the router.
So, let’s see what RANCID is having trouble with.
Log into the RANCID server, and su - to your RANCID account. Use clogin(1) to log into the device.
%clogin edge-1
edge-1
spawn ssh -c 3des -x -l admin+ct edge-1
admin+ct@edge-1.lodden.com's password:
...
[admin@edge-1] >
So, I can log in.
The main command to get a Mikrotik configuration is export. I run the command. It completes, but takes a few minutes. Not really a shock: this device has several full BGP feeds on IPv4 and IPv6, packet filtering, traffic shaping, and folds my socks in its spare time.
So, it’s not the obvious problem; the router can export its config, and RANCID can log into the router.
So, run RANCID for the group that includes the trouble router.
%rancid-run mikrotik
%
No error messages, but let’s check the log. It’s full of messages like this:
...
Trying to get all of the configs.
edge-1: End of run not found
Error: TIMEOUT reached
=====================================
...
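If several devices share a group, it helps to know which ones are failing and how often. Here is a minimal Python sketch that counts the "End of run not found" lines per device. It assumes the log lines look exactly like the excerpt above. The function name failed_devices is my own invention, and you would feed it whatever log file your install writes.

```python
import re
from collections import Counter

# Matches lines such as "edge-1: End of run not found" from the log excerpt.
FAILURE = re.compile(r"^(?P<device>\S+): End of run not found$")

def failed_devices(log_text):
    """Count 'End of run not found' lines per device name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE.match(line.strip())
        if match:
            counts[match.group("device")] += 1
    return counts

# Sample text copied from the log excerpt in this post.
sample = """Trying to get all of the configs.
edge-1: End of run not found
Error: TIMEOUT reached
=====================================
"""

print(failed_devices(sample))
```

A device that shows up on every hourly run is a much better suspect than one that fails once a week.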
Well, that’s not good. Let’s try running a single command on the router, setting the timeout to the usual 90 seconds.
%clogin -t 90 -c "export;quit" edge-1
edge-1
spawn ssh -c 3des -x -l admin+ct edge-1
admin+ct@edge-1.lodden.com's password:
...
/ipv6 nd prefix default
set autonomous=yes preferred-lifetime=1w valid-lifetime=4w2d
Error: TIMEOUT reached
%
So, the export takes longer to run than RANCID’s default timeout. How long does it need? Run the same clogin command under time(1) to find out. Add -t 1000 to set the timeout to 1000 seconds.
% time clogin -t 1000 -c "export;quit" edge-1
Walk away. Eventually, come back to look at it.
...
set accounting=yes default-group=read exclude-groups="" interim-update=0s \
use-radius=no
[admin@edge-1] > quit
interrupted
Connection to edge-1.lodden.com closed.
Error: EOF received
0.102u 0.094s 2:57.84 0.1% 87+948k 0+0io 0pf+0w
This export took almost three minutes, or 180 seconds. Twice the default timeout. Ick.
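For the arithmetic-averse, here is a small Python sketch that pulls the elapsed time out of a csh-style time(1) line and suggests a timeout with some headroom. It assumes the elapsed time is the third field, as in the output above. The function names and the 2x headroom factor are my own assumptions, not anything RANCID prescribes.

```python
import math

def elapsed_seconds(time_line):
    """Return elapsed wall-clock seconds from a csh-style time(1) line."""
    elapsed = time_line.split()[2]      # third field, e.g. "2:57.84"
    seconds = 0.0
    for part in elapsed.split(":"):     # handles ss, m:ss and h:mm:ss
        seconds = seconds * 60 + float(part)
    return seconds

def suggested_timeout(seconds, headroom=2.0):
    """Round up to a whole-second timeout with a safety margin (assumed 2x)."""
    return math.ceil(seconds * headroom)

# The time(1) line from the run above.
line = "0.102u 0.094s 2:57.84 0.1% 87+948k 0+0io 0pf+0w"
took = elapsed_seconds(line)
print(f"export took about {took:.0f} seconds")
print(f"suggested timeout: {suggested_timeout(took)} seconds")
```

The margin matters because an export that takes three minutes today will take longer once the configuration grows or the router is busy.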
Now we have to tell RANCID to use a different timeout. I didn’t find anything in the manual pages, so I asked on the rancid-discuss mailing list. John Heasley quickly answered. It seems that the timeout option in .cloginrc should cover this, but that the feature is missing from the Mikrotik login script. He included a patch. I applied the patch and added a timeout line to this router’s entries in .cloginrc:
add password edge-1 blahblah
add user edge-1 admin+ct
add method edge-1 ssh
add noenable edge-1 {1}
add timeout edge-1 500
I then re-ran RANCID. It completed silently. I couldn’t be sure that the change actually worked until I saw RANCID check in a change. I logged into the router, corrected a typo in the login message, and tried again. This time, changes appeared in CVS and I received my email. So I can conclude the patch works.
The most important thing to do in all of this, though? Close the loop with John Heasley. Verify the patch works, so others can benefit from my annoyance.
On a related note, RANCID is one of those tools that gets less attention than it deserves. I’m pondering writing a short book about it, rather like SSH Mastery. Would anyone actually be interested, however?