Debugging RANCID

I’m a big fan of RANCID for managing configurations for embedded devices, such as most routers and switches. While you can go buy CiscoWorks, OpenView, or any number of proprietary products, RANCID is good enough for the overwhelming majority of us. (Those products do have other advantages, but simple configuration revision control isn’t one of them.)

For those who haven’t used RANCID: it logs into your devices every hour, gets the device configuration, and compares it to the stored configuration. If the configuration has changed, RANCID checks the new version into CVS. Combined with CVSWeb, RANCID really simplifies embedded device management.
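The compare-then-check-in step is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not RANCID's actual code; the function names config_diff and needs_checkin are made up for the example, and the real tool hands the result to CVS rather than returning a list.

```python
import difflib

def config_diff(stored, fetched, name="router"):
    """Return unified-diff lines between the stored and freshly fetched config.

    An empty list means the two configurations are identical.
    """
    return list(difflib.unified_diff(
        stored.splitlines(),
        fetched.splitlines(),
        fromfile=name + " (stored)",
        tofile=name + " (fetched)",
        lineterm="",
    ))

def needs_checkin(stored, fetched):
    """True when the fetched config differs and a new revision should be saved."""
    return bool(config_diff(stored, fetched))
```

If needs_checkin() comes back false, nothing is committed and nobody gets mail, which is exactly what you want on a quiet hour.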

Every now and then it breaks, however. Last week, I started getting an email every hour, whining that RANCID couldn’t get the configuration of one of my Mikrotik border routers. I hadn’t changed the router configuration in several days. My cow-orkers claimed they hadn’t touched the router.

So, let’s see what RANCID is having trouble with.

Log into the RANCID server and su - to your RANCID account. Use clogin(1) to log into the device.

%clogin edge-1
edge-1
spawn ssh -c 3des -x -l admin+ct edge-1
admin+ct@edge-1.lodden.com's password:
...

[admin@edge-1] >

So, I can log in.

The main command to get a Mikrotik configuration is export. I run the command. It completes, but takes a few minutes. Not really a shock — this device has several full BGP feeds on IPv4 and IPv6, packet filtering, traffic shaping, and folds my socks in its spare time.

So, it’s not the obvious problem; the router can export its config, and RANCID can log into the router.

So, run RANCID for the group that includes the trouble router.

%rancid-run mikrotik

%

No error messages, but let’s check the log. It’s full of messages like this:

...
Trying to get all of the configs.
edge-1: End of run not found
Error: TIMEOUT reached
=====================================
...
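If the log is long, a small script can pull out which devices are affected. The sketch below assumes the log lines look exactly like the excerpt above ("name: End of run not found" and "Error: TIMEOUT reached"); other RANCID versions or device types may word things differently, and both function names are hypothetical.

```python
import re

# Shaped after the excerpt above: "<device>: End of run not found"
END_OF_RUN = re.compile(r"^(?P<device>\S+): End of run not found\s*$")

def devices_missing_end_of_run(log_text):
    """Return device names that never finished their run, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = END_OF_RUN.match(line)
        if match and match.group("device") not in seen:
            seen.append(match.group("device"))
    return seen

def saw_timeout(log_text):
    """True if any line of the log is the timeout error from the excerpt."""
    return any(line.strip() == "Error: TIMEOUT reached"
               for line in log_text.splitlines())
```

Fed the excerpt above, devices_missing_end_of_run() returns just edge-1 and saw_timeout() is true.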

Well, that’s not good. Let’s try running a single command on the router, setting the timeout to the usual 90 seconds.

%clogin -t 90 -c "export;quit" edge-1
edge-1
spawn ssh -c 3des -x -l admin+ct edge-1
admin+ct@edge-1.lodden.com's password:
...
/ipv6 nd prefix default
set autonomous=yes preferred-lifetime=1w valid-lifetime=4w2d

Error: TIMEOUT reached
%

So, the export takes longer to run than RANCID’s default timeout. How long does it need? Run clogin under time(1) to find out. Add -t 1000 to set the timeout to 1000 seconds.

% time clogin -t 1000 -c "export;quit" edge-1

Walk away. Eventually, come back to look at it.

...
set accounting=yes default-group=read exclude-groups="" interim-update=0s \
use-radius=no
[admin@edge-1] > quit
interrupted
Connection to edge-1.lodden.com closed.

Error: EOF received
0.102u 0.094s 2:57.84 0.1% 87+948k 0+0io 0pf+0w

This export took almost three minutes, or 180 seconds. Twice the default timeout. Ick.
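For the record, the arithmetic: the third field of the time(1) output is the elapsed wall-clock time in minutes:seconds form. A tiny Python sketch of the conversion follows. The helper names are hypothetical, and the parsing assumes the [hours:]minutes:seconds layout seen in the output above.

```python
def elapsed_seconds(field):
    """Convert an elapsed-time field such as '2:57.84' or '1:02:03' to seconds."""
    seconds = 0.0
    for part in field.split(":"):
        # Each colon-separated part is worth 60x the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds

def exceeds_timeout(field, timeout_seconds):
    """True when the measured run would not fit inside the given timeout."""
    return elapsed_seconds(field) > timeout_seconds
```

Here elapsed_seconds("2:57.84") works out to 2 * 60 + 57.84, a little under 178 seconds, so the export blows through a 90-second timeout but fits easily in 1000.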

Now we have to tell RANCID to use a different timeout. I didn’t find anything in the manual pages, so I asked on the rancid-discuss mailing list. John Heasley quickly answered. It seems that the timeout option in .cloginrc should cover this, but that the feature is missing from the Mikrotik login script. He included a patch. I applied the patch and added a timeout line to the router’s entries in .cloginrc:

add password edge-1 blahblah
add user edge-1 admin+ct
add method edge-1 ssh
add noenable edge-1 {1}
add timeout edge-1 500
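What the patched login script does with that last line is a lookup with a fallback: use the device's timeout entry if there is one, otherwise the script default. Here is a rough Python sketch of that logic. Treat it as a simplification under stated assumptions: the real .cloginrc is read by Tcl, not split on whitespace; the glob-style, first-match-wins behaviour is my understanding of how clogin matches hostnames; the 45-second default is the value shown in the patch; and parse_cloginrc and timeout_for are names invented for this example.

```python
from fnmatch import fnmatchcase

DEFAULT_TIMEOUT = 45  # script-wide default taken from the patch (timeoutdflt)

def parse_cloginrc(text):
    """Collect (directive, hostname pattern, values) from 'add ...' lines."""
    entries = []
    for raw in text.splitlines():
        fields = raw.split()
        # Comments, blank lines, and anything that isn't a full 'add' line are skipped.
        if len(fields) >= 4 and fields[0] == "add":
            entries.append((fields[1], fields[2], fields[3:]))
    return entries

def timeout_for(entries, router, default=DEFAULT_TIMEOUT):
    """Return the first matching per-device timeout, or the default if none."""
    for directive, pattern, values in entries:
        if directive == "timeout" and fnmatchcase(router, pattern):
            return int(values[0])
    return default
```

With the five lines above, timeout_for() gives 500 for edge-1 and falls back to the default for any router that has no timeout entry.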

I then re-ran RANCID. It completed silently. I can’t be sure that the change actually works until I see RANCID check in a change. I logged into the router, corrected a typo in the login message, and tried again. This time, changes appeared in CVS and I received my email. So I can conclude the patch works.

The most important thing to do in all of this, though? Close the loop with John Heasley. Verify the patch works, so others can benefit from my annoyance.

On a related note, RANCID is one of those tools that gets less attention than it deserves. I’m pondering writing a short book about it, rather like SSH Mastery. Would anyone actually be interested, however?

7 Replies to “Debugging RANCID”

  1. Yes, please write a RANCID book. I would happily buy a copy and if it is priced the same as your SSH book, I could push it on a couple of my junior admins when they’re ready for it.

  2. Yes, I would definitely be interested in a RANCID book! Good documentation with useful examples (like above) is certainly lacking!

  3. Hi,

    I have the same problem backing up a Mikrotik router with RANCID. Could you please explain in detail where I should apply the patch to fix Error: TIMEOUT reached?

  4. Hi Venky,

    Try this patch. The changes go in bin/mtlogin.in:

    > --- bin/mtlogin.in (revision 2457)
    > +++ bin/mtlogin.in (working copy)
    > @@ -71,7 +71,7 @@
    > # tracks if we receive them on the command line.
    > set do_passwd 1
    > # Sometimes routers take awhile to answer (the default is 10 sec)
    > -set timeout 45
    > +set timeoutdflt 45
    >
    > # Find the user in the ENV, or use the unix userid.
    > if {[ info exists env(CISCO_USER) ]} {
    > @@ -177,7 +177,7 @@
    > -T* {
    > if {! [ regexp .\[tT\](.+) $arg ignore timeout]} {
    > incr i
    > - set timeout [ lindex $argv $i ]
    > + set timeoutdflt [ lindex $argv $i ]
    > }
    > # Command file
    > } -x* -
    > @@ -466,6 +466,12 @@
    > set autoenable 1
    > set enable 0
    >
    > + # device timeout
    > + set timeout [find timeout $router]
    > + if { [llength $timeout] == 0 } {
    > + set timeout $timeoutdflt
    > + }
    > +
    > # Figure out passwords
    > if { $do_passwd } {
    > set pswd [find password $router]

    Hopefully it works for you!
    Thanks.
