Absolute OpenBSD, 2nd Edition

I promised I’d announce the title of my next No Starch Press book in my BSDCan talk. That happened. The rest of you had to wait until now to hear that I’m rewriting Absolute OpenBSD. The technical reviewer is Peter Hansteen, author of The Book of PF.

Most of the book does not exist yet. Best guess for a release date is some time in 2012.

Why did a second edition take so long?

I will only write books about tools I use in production, out in the real world. (Desktop use does not count.) In my previous job, senior network engineer at a global automotive supplier, I had no opportunity to use OpenBSD. That meant I couldn’t offer advice about using OpenBSD, or discuss how it fit into my infrastructure. I could have written the book, but it would have sucked.

I’m also working on a second nonfiction project, but I’ll announce that separately.

new package system coming for FreeBSD

From the BSDCan FreeBSD developer summit:

The ports team has developed new package management tools and methods to simplify FreeBSD package management. The hope is to have these as the default in FreeBSD 10. Erwin Lansing has posted slides from his brief presentation, and a Web search for “pkgng FreeBSD” will get you all sorts of details.

OpenSSH: requiring keys, but allow passwords from some locations

Most of my OpenSSH servers now require public key authentication for users. On a few systems, however, I must allow remote access with password auth. I need SSH to allow password auth from those IP addresses and only those addresses, but still require public keys from other locations.

Do this with OpenSSH’s match keyword.

Start by configuring sshd for the most common case — in this case, requiring public key authentication. This requires only two changes to the default configuration:

ChallengeResponseAuthentication no
PasswordAuthentication no

sshd will now only allow authentication with public keys.

Now use the match keyword to set a different configuration for certain circumstances. Match will let you compare based on user, group, host (as in DNS hostname), or address. I don’t trust DNS for security, so I chose to match a configuration based on IP addresses. Here, I specifically enable password authentication for connections from selected IP addresses.

Match Address 192.0.2.128/25,10.10.10.32/27
PasswordAuthentication yes

If the connection comes from either of the specified address ranges, the user can try to authenticate with a password. Otherwise, the user must use a public key.

I could have chosen to allow password authentication based on the incoming user, but that wouldn’t block the ongoing “Hail Mary” SSH-guessing attacks. Matching based on user or group would be useful for, say, allowing X11 forwarding. I can’t imagine why I would ever use a Match based on a hostname in DNS, but I concede it might be sensible in some very special circumstance.

One thing to note is that not all sshd_config options work in a Match block. ChallengeResponseAuthentication, for example, can only be set at the global level, so I didn’t activate it in this example. See the sshd_config man page for the list of usable configuration options.

NFSv4 and UIDs on OpenSolaris and Ubuntu

NFS clients and servers negotiate to use the highest NFS version they both support. NFSv4 usually performs much better than NFSv3, but requires a little more setup. Here I get NFSv4 working between an OpenSolaris file server and a diskless Ubuntu client. In theory, a plain mount(8) gives us a NFSv4 mount.

# mount server:/data1/opennebula/on22 /mnt/
#

Use nfsstat -m to see what kind of mount they negotiated

# nfsstat -m
...
/mnt from server:/data1/opennebula/on22
Flags: rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.0.2.2,minorversion=0,addr=192.0.2.1

We have NFSv4, huzzah! Go look at the files.

# ls -lai /mnt/
total 12K
5 drwxr-xr-x 8 4294967294 4294967294 8 2011-04-19 11:50 .
3 drwxr-xr-x 21 root root 26 2011-03-17 10:22 ..
6 drwxr-xr-x 2 4294967294 4294967294 24 2011-04-19 11:50 bin
29 drwxr-xr-x 16 4294967294 4294967294 21 2011-04-19 11:50 etc
74 drwxr-xr-x 2 4294967294 4294967294 2 2011-04-19 11:50 include
75 drwxr-xr-x 7 4294967294 4294967294 7 2011-04-19 11:50 lib
296 drwxr-xr-x 5 4294967294 4294967294 5 2011-04-19 11:50 share
332 drwxr-xr-x 5 4294967294 4294967294 10 2011-04-19 11:50 var

A UID of 4294967294? That’s awesome. Wrong, but awesome. 4294967294 is -1 on a 64-bit system. Many modern Linuxish systems assign nobody and nogroup (the standard unprivileged NFS accounts) a UID and GID of -1. While my files are owned by uid 1003 on the server, and the client’s mount point is owned by uid 1003, NFSv4 defaults to mapping all UIDs to nobody. Use rpc.idmapd to map UIDs between systems. Go to /etc/default/nfs-common and enable idmapd.

NEED_IDMAPD=yes

Lower case seems to be required: I originally set this to YES and the process didn’t start.

Reboot the client, and the files are now owned by nobody. Well, at least that’s a legitimate system user, one originally created for NFS. The files are owned by UID 1003 on the server, however.

Here’s where NFSv4 gets interesting. In NFSv3 and earlier, file ownership over NFS is controlled by UID. Systems administrators worked hard to keep UIDs synchronized across their systems so that NFS permissions would be consistent. You can remap UIDs over NFS, of course, but maintaining those maps is vastly annoying.

NFSv4 maps file permissions by UIDs, but uses usernames for ACLs and ownership. Both must be correct, or common operations won’t work as expected. I have an OpenSolaris NFS server that contains lots of files for lots of diskless systems with lots of different usernames. Some of those usernames do not exist on the fileserver. While I keep user accounts in LDAP, I (mostly) don’t bother with system or program accounts. To share files over NFSv4, though, the accounts must exist on both client and server.

NFSv4 uses helper programs to map usernames and UIDs: nfsmapid on OpenSolaris, rpc.idmapd on Ubuntu, and nfsuserd on FreeBSD. (Please insert a screaming rant here: these are all basically the same program. Why, why, why change the name? We don’t give ping(8) different names even though it has completely different under-the-hood implementations on each program, do we? Sheesh.)

NFSv4 maps usernames within a domain, generally (but not necessarily) the machine’s domain name. If the NFSv4 client and server domain names doesn’t match, all the usernames will show up as “nobody.” OpenSolaris’ nfsmapid pulls the domain name from the machine’s domain name. I had to set the domain name on Ubuntu 10.10 in /etc/idmapd.conf.

NFSv4 now works in my environment.

Note that NFSv4 also has a variety of other changes. All exports are part of a single unified namespace. OpenSolaris handles that for you. If you use a different NFSv4 server, you might need to manage that namespace yourself. But that’ll be a topic for another post, when I get my FreeBSD/ZFS/iSCSI/NFSv4 server working.

mod_security2 and referer spam returns

I just upgraded my Web server’s operating system and applications. As part of this, I upgraded to the latest mod_security rule set. I started with mod_security to block referer spam, but I’m also using it to block connections from blacklisted addresses. mod_security has a steep learning curve, and I’ve had to tweak my rules more than once.

With these upgrades, my anti-referer-spam rules stopped working. Incoming requests with referer spam headers were logging to my access log and showing up in my reports. I asked on the mod-security-users mailing list, and was told that mod_security wasn’t supposed to work the way I was using it.

I’m not sure why it worked. I don’t want to roll back my other upgrades. It’s time to reassess the problem.

I can block referer spam with mod_rewrite, but that requires maintaining rewrite rules across a variety of other sites on my server, and integrating referer spam rules with the rewrite rules those servers need. That would quickly become ugly.

I want the mod_security RBL functionality, and I intend to use some of its other functions later, so mod_security will stay on my server. Mod_security is big and complex, mainly because the real world is big and complex, and trying to debug my simple referer spam problem lead me towards some very complicated debugging. So I stepped back and thought, and decided to try an alternative approach.

Apache has the ability to separate logs based on environment variables. Mod_security can set environment variables.

I changed my mod_security rules like so:

SecRule REQUEST_HEADERS:REFERER "(?i:(porn))" deny,status:412,setenv:spam

If a request contains a refering site that contains the case-insensitive string porn, the client will receive a 412 response. (I considered a 406 response, but my research says that 412 is most appropriate. mod_security will also set the environment variable spam to 1.

On the Apache side, I changed my access log configuration:

CustomLog "|/usr/local/sbin/rotatelogs /var/log/mwl/mwl_spam_log.%Y-%m-%d-%H_%M_%S 86400 -300" combined env=spam
CustomLog "|/usr/local/sbin/rotatelogs /var/log/mwl/mwl_access_log.%Y-%m-%d-%H_%M_%S 86400 -300" combined env=!spam

Apache checks the environment for every request before logging. If a log message has the environment variable spam, it is logged to the spam log. If that variable is not set, the request is logged to the regular access log.

I think that in the long run, this setup will be easier to troubleshoot.

WordPress LDAP auth on Ubuntu

I support too many servers and applications to manage separate user databases for each. LDAP is a must. If an application can’t hook up to LDAP, I don’t want it. WordPress can be configured to use LDAP, and has several different LDAP plugins. I’ve had mixed results with PHP LDAP plugins. I usually find that having the application trust Apache’s authentication, and attaching Apache to LDAP, gives better results in my environment.

Note that my WordPress installations usually have only one or two registered users. They are administrators. Most people cannot register. If you want to hook hundreds of LDAP users into WordPress, and manage them completely through LDAP, you’ll need to find an LDAP-specific plugin that meets your needs. In this environment, where I’m just looking for administrator password synchronization, it’s good enough.

This particular Web server runs Ubuntu 10.04 with Apache and WordPress 3.1. To enable LDAP auth in Apache, run:

# a2enmod authnz_ldap
# /etc/init.d/apache2 restart

On the WordPress side, install for the HTTP Authentication plugin. This tells WordPress to trust the Web server’s authentication.

WordPress won’t read a list of usernames from basic auth. You’ll need to create your users. (Again, this is for a couple of admin accounts, not for massive user databases.)

WordPress protects its administrative directory, /wp-admin/, automatically redirecting requests to the page wp-login.php. For this plugin to work, we must require LDAP auth to the one file wp-login.php. Here’s the Apache configuration for the WordPress directory.


Options Indexes FollowSymLinks MultiViews
AllowOverride None
Order allow,deny
allow from all

AuthType Basic
AuthName "Web Admins Only"
AuthBasicProvider ldap
AuthLDAPURL "ldap://ldapserver1.domain.com/dc=domain,dc=com" STARTTLS
AuthLDAPGroupAttribute memberUid
AuthLDAPGroupAttributeIsDN off
require ldap-group cn=wordpressadmins,ou=groups,dc=domain,dc=com

Note that my LDAP servers do not require a LDAP login to validate a user. If yours do, you’ll need to add the username and password to this configuration.

Restart Apache, open a new browser, go to the site, and hit the Login button. You should get an Apache login window. Enter your username and password, and you’ll reach the WordPress control panel.

You’re now handing your LDAP username and password to WordPress. You do have WordPress available over SSL, don’t you? Configure Apache so that http://wordpress.domain.com is also available as https://wordpress.domain.com, and add the following near the top of wp-config.php.

//we like SSL
define('FORCE_SSL_LOGIN', true);
define('FORCE_SSL_ADMIN', true);

WordPress will now pass user credentials and cookies over SSL.

New FreeBSD Installer test and walkthrough

For those who missed the announcement, FreeBSD 9 has replaced sysinstall with a new installer. It’s based on the PC-BSD installer’s back end and a text-based front end by Nathan Whitehorn. So, what is this new installer we’ve waited sixteen years for? I decided to find out.

As an official snapshot with the BSD installer isn’t yet available, I grabbed one of nwhitehorn’s recent snapshots. The back end is not integrated, but let’s see what the front end looks like. I assume there will be changes between this February snapshot and the next official snapshot. For my testing, I used a VM running Microsoft Virtual PC (for ease of screenshots) with ACPI disabled (so the install would finish).

Booting the image, I saw:

First Boot Screen
Select Install

A recovery shell and a live CD are great, but I want to install.

The next screen tells me to select a keyboard. By virtue of alphabetical order, the default is Armenian. When I arrow down to find the US keyboard, there’s several different options. I know exactly what kind of keyboard I have, it’s an Endurapro with the mouse pointer between the G and the H. But that’s not an option. Fortunately, I use a Dvorak layout, and that is an option.

I’m now asked for a hostname, and then allowed to choose optional components.

what optional parts to install
Distribution Choices

Of course I want source code!

I will miss the ability to choose “the smallest distribution possible,” but in reality, upgrading added the entire base system anyway.

Now we partition the disk.

Partition Methods
Partitioning Tools

Ooooh, a “guided” method. That sounds newbie-friendly. Let’s try that.

Use Entire Disk?
Conquer, or Divide?

We want to use the whole disk.

Partitions
Partitions

Wait a minute… where are /var, /tmp, /usr, and all the other partitions? Don’t tell me that one of the last survivors of proper partitioning is stumbling into the one large root filesystem trap? Ick. (I assume that the other partitioning methods will let you create proper partitions.)

Note that these are GPT partitions, not MBR partitions (slices).

Highlighting a partition and pressing ENTER shows details.

partition details
partition details

We can set a label in the installer, which is a nice feature. And it appears that two ESC takes you out of this submenu into the main menu. That seems odd, but perhaps it’s related to this being in Virtual PC. Arrow over to EXIT and hit ENTER to get the final confirmation screen.

are you sure?
you're committed now

“Are you sure?” Save and Abort aren’t exactly answers to that question – actually “save” is kind of ambiguous in this context. Save your existing data? Yes, we’re sure, choose Save. The filesystem is initialized:

newfs
newfs running

The installer verifies the files on disk and begins installing.

unpacking the archive
Oooh, a lock order reversal!

We must watch for a panic in the background, versus a simple lock order reversal. This is -current, either one can happen and both are displayed by default.

While watching this, I wanted to hit ALT-F2 and see just how well the ports were extracting. That’s apparently no longer an option.

After extracting everything, I was prompted for the root password, then for a network interface.

Choose Interfaces
Choose Interfaces

I took the default.

Yes, please
Use DHCP?

Why yes, I would like DHCP with my host, thank you.

Now I’m asked for the services I want enabled at boot.

just a few choices
services to start at boot

Compared to sysinstall there aren’t many choices, but I’m all right with being offered only the lowest common denominators. Far more vexing is how the network configuration gracelessly handled the unplugged network interface. While CTRL-L redrew the screen, a new link down message reappeared every second or so. I’d never tried Virtual PC before, and apparently I need to decipher the network configuration.

Once I choose the essential services, I’m asked if I want to add users. When I say yes, it drops me to text mode for the usual adduser(8) dialog.

adding our first user
adding a user, plus broken networking

All right, the network message is getting pretty annoying now.

After adding my user, I get the final screen.

Final Install Options
last chance to change things

You can change many of your optional settings, plus set the time zone. You get the usual warnings about removing the CD-ROM from the drive, and then reboot into the new install.

Overall, I like many of the changes. It doesn’t have many of the libdialog annoyances that plague sysinstall. I dislike the guided partitioning system, or more exactly the lack of actual partitioning therein. I do not want /tmp files filling my hard disk!

The good news is, with the separation between the front end and the back end, changing the installer will be much simpler than it ever was with sysinstall. For all the petty annoyances in this installer, it’s a big step in the right direction. I want to commend the creators for actually pushing Sysinstall Replacement Attempt #82,319 to completion.

[emerg] (13)Permission denied: couldn’t grab the accept mutex

I installed Apache on a diskless FreeBSD-9/amd64 server. Once I added SSL, the web server wouldn’t start. It died with:

[Mon Mar 21 15:37:16 2011] [emerg] (13)Permission denied: couldn't grab the accept mutex
[Mon Mar 21 15:37:16 2011] [emerg] (13)Permission denied: couldn't grab the accept mutex
[Mon Mar 21 15:37:16 2011] [emerg] (13)Permission denied: couldn't grab the accept mutex
[Mon Mar 21 15:37:16 2011] [emerg] (13)Permission denied: couldn't grab the accept mutex
[Mon Mar 21 15:37:16 2011] [emerg] (13)Permission denied: couldn't grab the accept mutex
[Mon Mar 21 15:37:17 2011] [alert] Child 1733 returned a Fatal error... Apache is exiting!

Google says that the fix for this is to define AcceptMutex flock somewhere in httpd.conf. Google shows dozens of mailing list discussions that give this advice, but without explanation. Every so often, someone whinges about it not working, and doesn’t get a response.

This is an example of what I call “Occult IT.” There are certain recipes that people just follow, without understanding why. We stumble around, grasping for invocations and incantations that will fix our problem. I don’t just want the ritual that will solve the problem; I want to know why the problem happened and deepen my understanding of the issue.

Besides, adding AcceptMutex flock didn’t work.

To identify the problem, I set LogLevel to debug in httpd.conf and tried starting httpd again.

[Mon Mar 21 16:03:18 2011] [debug] prefork.c(1018): AcceptMutex: flock (default: flock)

I didn’t have that configuration setting in my httpd.conf, but it’s the Apache default anyway. So, what are my other choices?

Apache documentation says that the AcceptMutex setting determines how Apache serializes incoming requests. The flock setting dictates how Apache locks the LockFile.

My configuration doesn’t have a defined LockFile, so choosing how we lock it isn’t going to help.

I don’t like lock files. Bad stuff happens to them. Let’s try a locking method without a lockfile. The documentation lists two different classes of locking mechanisms. flock and fcntl work on lock files. posixsem, pthread, and sysvsem use the in-memory mechanisms of semaphores and/or mutexes to provide locking.

As I don’t like lock files, I’ll try one of the in-memory mechanisms.

AcceptMutex posixsem

And Apache starts and runs perfectly.

I can’t find any details on the differences between these in-memory mechanisms, from a system administrator’s point of view. I imagine that the System V mechanism wouldn’t work if you’d removed that support from your kernel. But the point is:

Do not rely on occult IT. Read The Fine Manual.

Roundcube/pgsql on FreeBSD

My employer’s current webmail solution is pretty tightly tied to the mail server. We need a webmail solution, but I want to be able to move, change, and upgrade webmail independently of the mail server. I’m testing Roundcube, running on FreeBSD-current/amd64 diskless on KVM.

First install Roundcube. Go to /usr/ports/mail/roundcube and run make config-recursive. You can use MySQL, PostgreSQL, or SQLite. Personally, I’m a Postgres bigot, mainly because Postgres has inflicted less pain than MySQL over the years. (Remember, punishing someone less feels like a reward!) I selected postgres, disabled mysql, then chose other environment-specific options such as LDAP. After configuring all of roundcube’s dependencies, run make all install clean.

No matter the database you prefer, in FreeBSD the client is packaged and built separately from the database server. I built /usr/ports/databases/postgresql84-server. Normally I would enable PAM, to enable database authentication via the operating system authentication, but that’s apparently broken as of this writing.

I want to modify httpd.conf as little as possible, so I do most of my configuration via files in /usr/local/etc/apache/Includes. I create the following files:

vhosts.conf
NameVirtualHost *:80
NameVirtualHost *:443

And of course, we need to acces webmail over SSL. Here’s ssl.conf:

Listen 443
AddType application/x-x509-ca-cert .crt
AddType application/x-pkcs7-crl .crl
SSLPassPhraseDialog builtin
SSLSessionCache "shmcb:/var/run/ssl_scache(512000)"
SSLSessionCacheTimeout 300
AcceptMutex posixsem
SSLMutex "file:/var/run/www/ssl_mutex"

I use RCS for minor stuff like this, so I need to block web site visitors from viewing my RCS files. Here’s blockRCS.conf


Order allow,deny
Deny from all
Satisfy All

Finally, here’s the configuration for the webmail virtual server, webmail.conf. The most interesting thing here is that I automatically redirect HTTP connections to HTTPS. (Perhaps interesting is the wrong word. “Least uninteresting?” Yeah, that’s better.)

AddType application/x-httpd-php .php

ServerName webmail.domain.com
Redirect permanent / https://webmail.domain.com/
ServerAdmin webmail@domain.com
ErrorLog "|/usr/local/sbin/rotatelogs /var/log/httpd/webmail_error_log.%Y-%m-%d-%H_%M_%S 86400 -300"
CustomLog "|/usr/local/sbin/rotatelogs /var/log/httpd/webmail_access_log.%Y-%m-%d-%H_%M_%S 86400 -300" combined
DocumentRoot /usr/local/www/webmail/

Options None
AllowOverride All
Allow from all



ServerAdmin webmail@domain.com
ServerName webmail.domain.com
SSLEngine on
SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL

SSLOptions +StdEnvVars

BrowserMatch ".*MSIE.*" \
nokeepalive ssl-unclean-shutdown \
downgrade-1.0 force-response-1.0
SSLCertificateFile etc/apache22/certs/webmail.domain.com.crt
SSLCertificateKeyFile etc/apache22/certs/webmail.domain.com.key
ErrorLog "|/usr/local/sbin/rotatelogs /var/log/httpd/webmail_ssl_error_log.%Y-%m-%d-%H_%M_%S 86400 -300"
CustomLog "|/usr/local/sbin/rotatelogs /var/log/httpd/webmail_ssl_access_log.%Y-%m-%d-%H_%M_%S 86400 -300" combined
DocumentRoot /usr/local/www/roundcube/

Options Indexes FollowSymLinks
AllowOverride All
Allow from all

Initialize your postgresql database. I got Postgres help from here.

# /usr/local/etc/rc.d/postgresql initdb

Edit your Postgres config file, /usr/local/pgsql/data/postgresql.conf. Be sure that your database is only listening on the loopback address.

listen_addresses = 'localhost'

If you’re a Postgres guru, make any other changes you like. Then configure loggin in /etc/syslog.conf:

!postgres
*.* /var/log/pg.log

The log files must exist before syslogd will write to them.

# touch /var/log/pg.log
# chown pgsql:pgsql /var/log/pg.log

Now run /usr/local/etc/rc.d/postgresql start and check for errors. It should start without trouble, unless you mucked with your configuration too much.

Normally I would edit pg_hba.conf to tie my account password to PAM, and through PAM to LDAP, but as that’s broken right now, I’ll create a local user.

# su pgsql
$ createuser -sdrP mwlucas
Enter password for new role:
Enter it again:

Now create the Roundcube database, as per the instructions.

$ createuser roundcube
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) n
Shall the new role be allowed to create more new roles? (y/n) n
$ createdb -O roundcube -E UNICODE roundcubemail
$ psql roundcubemail
psql (8.4.7)
Type "help" for help.

roundcubemail=# alter user roundcube with password 'WeHatesWebmail';
ALTER ROLE
roundcubemail=# \c - roundcube
psql (8.4.7)
You are now connected to database "roundcubemail" as user "roundcube".
roundcubemail=> \i SQL/postgres.initial.sql

After a bunch of SQL spammage, we have a database. Log out with:

roundcubemail=> \q

Roundcube needs two configuration files, db.inc.php and main.inc.php. Go to /usr/local/www/roundcube/config and copy the .dist versions of these files.

Tell Roundcube where to find its database by setting a DSN in db.inc.php.

$rcmail_config['db_dsnw'] = 'pgsql://roundcube:WeHatesWebmail@localhost/roundcubemail';

In main.inc.php, set your mail server. (If you don’t set a mail server, Roundcube will let you connect to any mail server you like. This would confuse my users. Confusion leads to phone calls, something I avidly avoid.)

$rcmail_config['default_host'] = 'mail.domain.com';

Verify that pgsql and Apache are running, and browse to your webmail site. You have a webmail server!

Overall, the setup was pretty straightforward. I’m not saying I’ll keep Roundcube, but my test users are basically content, so it’s a strong candidate.

Wherein I learn about initrd

Post summary: will someone PLEASE port a recent KVM to any BSD? There’s beer in it for you.

I’ve been attempting to upgrade my diskless virtualization cluster to Ubuntu 10.10. Diskless boot worked fine in the ESXi test area, but real hardware would not boot. This same hardware booted fine with Ubuntu 10.04 and 9.whatever. When I looked at the console, I saw:

ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
/init: .: line 3: can't open '/tmp/net-*.conf'
[ 2.300079] Kernel panic - not syncing: Attempted to kill init!
[ 2.306052] Pid: 1, comm: init Not tainted 2.6.35-27-server #48-Ubuntu
[ 2.312653] Call Trace:
[ 2.315161] [] panic+0x90/0x113
[ 2.320025] [] forget_original_parent+0x33d/0x350
[ 2.326433] [] ? put_files_struct+0xc4/0xf0
[ 2.332339] [] exit_notify+0x1b/0x190
[ 2.337699] [] do_exit+0x1d5/0x400
[ 2.342817] [] ? do_page_fault+0x159/0x350
[ 2.348609] [] do_group_exit+0x55/0xd0
[ 2.354076] [] sys_exit_group+0x17/0x20
[ 2.359617] [] system_call_fastpath+0x16/0x1b

The useful messages are obviously further up, but the scrollback buffer is fubar. (Apparently when an Ubuntu box dies, it dies really really hard.) A serial console let me scroll back through the boot messages.

...
[ 2.004954] Uniform CD-ROM driver Revision: 3.20
[ 2.009944] sr 0:0:1:0: Attached scsi generic sg0 type 5
[ 2.015651] Freeing unused kernel memory: 836k freed
[ 2.021283] Write protecting the kernel read-only data: 10240k
[ 2.027551] Freeing unused kernel memory: 320k freed
[ 2.033118] Freeing unused kernel memory: 1620k freed
Loading, please wait...
[ 2.063067] udev[81]: starting version 163
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/nfs-top ... done.
FATAL: Could not load /lib/modules/2.6.35-27-server/modules.dep: No such file or directory
FATAL: Could not load /lib/modules/2.6.35-27-server/modules.dep: No such file or directory
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure

The machine cannot find its modules directory? Odd. A packet sniffer found that the diskless client didn’t send an NFS request. It was just giving up after running initrd. I carefully reviewed the serial console output and compared it to the test Ubuntu systems, and found that the initial ramdisk wasn’t attaching a device driver to the Ethernet interface.

Initrd is an “initial ramdisk.” It loads a kernel and device drivers, for the purpose of finding the root file system and loading the real kernel and actual device drivers. If you install a machine in one environment, the initial ramdisk includes only the device drivers for that environment.

Checking /etc/udev/rules.d/70-persistent-net.rules of the older system revealed:

# PCI device 0x14e4:0x1659 (tg3)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:17:31:d8:42:52", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

The Ethernet cards on my physical servers use Linux’s tg3 driver.

To add this driver to initrd, I went to the Ubuntu 10.10 server on my ESXi test box and created the file /usr/share/initramfs-tools/modules.d/tg3. That file contained a single word, tg3. I then created the new initrd with:

# update-initramfs -u -k all
# mkinitramfs -o /home/mwlucas/initrd.img-2.6.35-27-server-pxe

Copy that image to my TFTP server, reboot the hardware, and everything boots.