notes from my FreeBSD and Nagios upgrade

My Nagios system ran FreeBSD-current/i386 from October 2010 and Nagios 3.0.6. Business factors drove me to make some changes, and I decided to upgrade the server before making those changes. Here are some things I observed. I don’t know if these are useful to you, but I’ll need them for other upgrades, so what the heck.

Back up before you start. (Yes, obvious, but everyone needs a reminder.)

Building 9-stable on a -current box that old is tricky. You have to do a variety of ugly things. So don’t. I NFS-mounted another machine running 9-BETA2/i386 and installed from that.

Remove the old libraries and obsolete programs from the core system. While you have a full backup, I find it useful to have a separate, convenient backup of removed libraries on the existing system.

# cd /usr/src
# make check-old-libs | grep '^/' | tar zcv -T - -f $HOME/2010Oct-old-libs.tgz
...
# yes | make delete-old-libs

In the event that I cannot recompile some program for FreeBSD 9, I can install the necessary libraries under /usr/lib/compat and get on with my life.
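
Pulling one library back out of that tarball might look something like the sketch below. It assumes tar stripped the leading slash from each path when it built the archive (the default behavior), so archive members look like usr/lib/libexample.so.5. The function name restore_old_lib and the library name libexample.so.5 are made up for illustration.

```shell
# Sketch: copy a single library from the old-libs backup into a compat directory.
# restore_old_lib is a hypothetical helper, not part of FreeBSD.
#   $1  backup archive, e.g. $HOME/2010Oct-old-libs.tgz
#   $2  member path inside the archive, without the leading slash
#   $3  destination directory, e.g. /usr/lib/compat
restore_old_lib() {
    archive=$1
    member=$2
    dest=$3

    # Extract into a scratch directory so nothing lands in the live tree by accident.
    tmp=$(mktemp -d) || return 1

    tar -xzf "$archive" -C "$tmp" "$member" &&
        mkdir -p "$dest" &&
        cp -p "$tmp/$member" "$dest/"
    status=$?

    rm -rf "$tmp"
    return $status
}
```

Usage would be along the lines of restore_old_lib $HOME/2010Oct-old-libs.tgz usr/lib/libexample.so.5 /usr/lib/compat, run as root.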

I ran portmaster -L > ports.txt to get a list of all installed software in hierarchical order, deleted what I no longer needed, then used portmaster -d --no-confirm portname on my leaf ports.
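
With more than a handful of leaf ports, a loop saves typing. This is only a sketch: it assumes you have trimmed the leaf ports down to a plain list, one name per line, in a file of your choosing. The function name rebuild_ports and the PORTMASTER_CMD override are hypothetical; only the portmaster -d --no-confirm invocation comes from the step above.

```shell
# Sketch: run "portmaster -d --no-confirm" for each port named in a list file.
# Blank lines and lines starting with '#' are skipped.
# PORTMASTER_CMD is a hypothetical override, handy for a dry run with echo.
rebuild_ports() {
    list=$1
    failed=0

    # Read the list on file descriptor 3 so portmaster keeps the normal stdin.
    while IFS= read -r port <&3; do
        case $port in
            ''|'#'*) continue ;;
        esac
        if ! ${PORTMASTER_CMD:-portmaster} -d --no-confirm "$port"; then
            echo "FAILED: $port" >&2
            failed=1
        fi
    done 3< "$list"

    return $failed
}
```

Setting PORTMASTER_CMD=echo before calling rebuild_ports prints the arguments that would be passed for each port without building anything.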

I had trouble building a couple of ports. I elected to use packages for these ports. FreeBSD-9’s packages are built against Perl 5.12. In 2010, they were built against Perl 5.8. It was simpler to remove all Perl ports and reinstall them from scratch. The ports that were giving me trouble worked fine with the newer Perl.

Then there’s Nagios. Ah, there’s nothing like upgrading Nagios. Actually, the Nagios upgrade itself ran perfectly with portmaster. The problem with the upgrade is all of the additional NagiosExchange scripts I installed. Lots of them ran fine under Perl 5.8, but choked when run by Nagios under Perl 5.12. The problem scripts started with #!/usr/bin/perl -w. Once I removed the -w (warnings) flag, they ran under Nagios again.
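
If you have a pile of these scripts, the edit can be scripted. The sketch below only touches files whose first line is a Perl interpreter line ending in -w, and it leaves a .bak copy next to each file it changes. The function name strip_perl_w is made up, and the plugin directory in the usage note is an assumption about where your scripts live.

```shell
# Sketch: drop a trailing -w from the interpreter line of Perl scripts.
# strip_perl_w is a hypothetical helper. Files that do not match are left alone.
strip_perl_w() {
    for script in "$@"; do
        # Match only a first line like "#!/usr/bin/perl -w".
        if head -n 1 "$script" | grep -q '^#!.*perl[[:space:]]\{1,\}-w[[:space:]]*$'; then
            # Edit line 1 in place, keeping the original as script.bak.
            sed -i.bak '1s/[[:space:]]\{1,\}-w[[:space:]]*$//' "$script"
            echo "fixed: $script"
        fi
    done
}
```

Something like strip_perl_w /usr/local/libexec/nagios/check_* would cover a whole plugin directory, assuming that is where the add-on scripts were installed. An interpreter line that combines flags, such as -wT, is deliberately not matched and needs a hand edit.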

When you reactivate Nagios after this upgrade, either turn off email or redirect all email to /dev/null. Do not leave email on. Nagios might well generate spurious errors, spam your coworkers, and cause either alarm or annoyance, depending on their temperament.
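
One way to turn off email is to disable notifications globally in the main Nagios configuration file via the enable_notifications directive. The sketch below flips that directive to 0, or appends it if it is missing, and keeps a .bak copy. The function name disable_notifications is made up, and the config path in the usage note is an assumption about where the FreeBSD port puts the file.

```shell
# Sketch: set enable_notifications=0 in a Nagios main config file.
# disable_notifications is a hypothetical helper; $1 is the path to nagios.cfg.
disable_notifications() {
    cfg=$1
    if grep -q '^enable_notifications=' "$cfg"; then
        # Directive already present: rewrite it, keeping a backup copy.
        sed -i.bak 's/^enable_notifications=.*/enable_notifications=0/' "$cfg"
    else
        # Directive absent: back up the file, then append the setting.
        cp -p "$cfg" "$cfg.bak" && echo 'enable_notifications=0' >> "$cfg"
    fi
}
```

Usage would be something like disable_notifications /usr/local/etc/nagios/nagios.cfg before starting Nagios. As I read the Nagios documentation, a retained program state can override this setting at startup, so check that notifications really are off before trusting it.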

Once I fixed all the scripts that were failing, Nagios generated intermittent errors. All of the scripts that failed were SNMP-based. I ran snmpwalks from the Nagios box, and they all died partway through. I ran tcpdump -vv -i em0 udp port 161 on the target machines, and saw that they all reported “bad UDP checksum.” The server was still running 9-BETA2. Rather than tracking down an error on an older version, I upgraded the system to 9-RC2. The problem disappeared. I dislike not understanding the problem’s cause, but obviously someone else fixed it between BETA2 and RC2.

The only plugins that still failed were check_snmp_proc and check_snmp_disk, from the nagios-snmp-plugins port. Both failed consistently.

Running the plugins by hand showed that they were generating correct answers, but they were also picking up MIB file errors from my $MIBS and $MIBDIRS. I have a whole bunch of MIB files that I use for developing and testing Nagios plugins. I normally restart Nagios with sudo. On a hunch, I used su - to become root with a clean environment and restarted Nagios. The errors stopped, and Nagios ran perfectly.
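
If you would rather not switch to a full root login every time, a small wrapper that drops just those two variables might do the same job. This is a sketch, not what I ran: the function name run_without_mibs is made up, and the rc script path in the usage note is an assumption about how the FreeBSD port installs Nagios.

```shell
# Sketch: run a command with MIBS and MIBDIRS removed from its environment.
# run_without_mibs is a hypothetical helper. The calling shell keeps its own
# copies of the variables because the unset happens inside a subshell.
run_without_mibs() {
    (
        unset MIBS MIBDIRS
        exec "$@"
    )
}
```

Usage would look like run_without_mibs sudo /usr/local/etc/rc.d/nagios restart. The exit status of the wrapped command is passed through unchanged.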

I suspect that this is the same problem that broke the perl -w plugins. The newer Nagios apparently chokes on extra debugging output. I’ve gone through the release notes for the versions I skipped, but didn’t find any mention of it. In all fairness, I probably just missed it.