FreeBSD Disk Partitioning

A couple weeks ago, I monopolized the freebsd-hackers mailing list by asking a couple simple, innocent questions about managing disks using gpart(8) instead of the classic fdisk(8) and disklabel(8). This is my attempt to rationalize and summarize a small cup of the flood of information I received.

The FreeBSD kernel understands several different disk partitioning schemes, including the traditional x86 MBR (slices), the current GPT, BSD disklabels, as well as schemes from Apple, Microsoft, NEC (PC98), and Sun. The gpart(8) tool is intended as a generic interface that lets you manage disk partitioning in all of these schemes, and abstract away all of the innards in favor of saying “Use X partitioning system on this disk, and put these partitions on it.” It’s a great goal.

FreeBSD storage, and computing storage in general, is in a transitional state today. The older tools, fdisk and bsdlabel, aren’t exactly deprecated but they are not encouraged. x86 hardware is moving towards GPT, but there’s an awful lot of MBR-only gear deployed. Disks themselves are moving from the long-standing 512B sector size to 4KB, eight times larger.

And as gravy on top of all this, disks lie about their sector size. Because why wouldn’t they?

Traditional disks have a geometry defined by cylinders, heads, and sectors. MBR-style partitions (aka slices) are expected to end on cylinder boundaries; that is, slices are measured in full cylinders. Cylinders and heads aren’t really relevant to modern disks, which use LBA. With flash drives, SSDs, and whatever other sort of storage people come up with, disk geometry is increasingly obsolete. Traditional MBR partitions still expect to sit on cylinder boundaries, however, and any BSD throws a wobbler if you use an MBR partition that doesn’t respect this.

gpart must handle those traditional cylinder boundaries as well as partitioning schemes without such boundaries. If you create a 1GB MBR partition, it will round the size to the nearest cylinder.
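To make the rounding concrete, here is a rough sketch of the arithmetic in Python. The geometry of 255 heads and 63 sectors per track is an assumption (a common fake geometry reported by LBA disks), the helper name is made up for illustration, and gpart's real rounding rules may differ in detail.

```python
SECTOR_BYTES = 512          # classic sector size
HEADS = 255                 # assumed geometry, not read from a real disk
SECTORS_PER_TRACK = 63      # assumed geometry, not read from a real disk


def round_to_cylinders(size_bytes, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Round a requested size to the nearest whole number of cylinders.

    Returns (cylinders, size_in_sectors). Hypothetical helper for illustration.
    """
    sectors_per_cylinder = heads * spt
    requested_sectors = size_bytes // SECTOR_BYTES
    # Integer round-to-nearest: add half a cylinder before dividing.
    cylinders = (requested_sectors + sectors_per_cylinder // 2) // sectors_per_cylinder
    cylinders = max(cylinders, 1)   # never round down to an empty slice
    return cylinders, cylinders * sectors_per_cylinder


cylinders, sectors = round_to_cylinders(1024 ** 3)   # ask for 1GB
print(cylinders, sectors, sectors * SECTOR_BYTES)
```

With this assumed geometry, the slice you get is a whole number of cylinders and therefore slightly different from the exact 1GB you asked for.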

The sector size changes create orthogonal problems. If you write to the disk in 512B sectors, but the underlying disk has 4K sectors, the disk will perform many more writes than necessary. If you write to the disk in 4K sectors, but the underlying disk uses 512B sectors, there’s no real harm done.

But if your logical 4K sectors don’t line up with the disk’s physical 4K sectors, every logical write straddles two physical sectors and performance can drop by half.

For best results, create all partitions aligned to a 4K sector. If the underlying disk has 512B sectors, it won’t matter; you must do more writes to fill those sectors anyway. Use the -a 4k argument with gpart to align the partitions it creates to 4K sectors.
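The alignment argument is just modular arithmetic. The sketch below assumes a disk that advertises 512B sectors but really uses 4K ones, and compares a partition starting at sector 63 (the traditional MBR starting point) with one starting at sector 2048. The function names are invented for illustration.

```python
LOGICAL = 512       # sector size the disk advertises
PHYSICAL = 4096     # sector size the disk really uses (assumed)


def is_4k_aligned(start_lba):
    """True if a partition starting at this 512B LBA begins on a 4K boundary."""
    return (start_lba * LOGICAL) % PHYSICAL == 0


def physical_sectors_touched(start_lba, block_index, block_size=4096):
    """Count the physical sectors covered by one filesystem block in a partition."""
    first_byte = start_lba * LOGICAL + block_index * block_size
    last_byte = first_byte + block_size - 1
    return last_byte // PHYSICAL - first_byte // PHYSICAL + 1


for start in (63, 2048):
    print(start, is_4k_aligned(start), physical_sectors_touched(start, 0))
```

In the misaligned case every 4K filesystem block covers two physical sectors, which is where the doubled work comes from.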

How do you do this? It depends on whether you’re using GPT or MBR partitions.

For GPT partitions, you must start partitioning the disk at a multiple of 4K. The front of your disk might have all kinds of boot code or boot managers in it, however. Start your first partition at the 1MB mark, and only create partitions that are whole multiples of a megabyte. Today you’d have to go out of your way to create a partition that was 1.5MB, so this isn’t a huge constraint.
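A quick way to convince yourself that the "start at 1MB, whole megabytes only" rule keeps everything aligned is to lay out a disk on paper. This is only a sketch: the partition sizes are a hypothetical example, and the helper does nothing but arithmetic.

```python
MIB = 1024 * 1024


def plan_layout(sizes_mib, first_start=MIB):
    """Lay partitions end to end from first_start; return (start, size) byte pairs."""
    layout = []
    offset = first_start
    for size_mib in sizes_mib:
        size = size_mib * MIB
        layout.append((offset, size))
        offset += size
    return layout


# Hypothetical example: a small boot partition, swap, and a root filesystem.
layout = plan_layout([1, 4096, 20480])
for start, size in layout:
    # Every start is a multiple of 1MB, and so also a multiple of 4K.
    print(start // MIB, size // MIB, start % 4096 == 0)
```

Because a megabyte is itself a multiple of 4K, any partition that starts on a megabyte boundary and is a whole number of megabytes long leaves the next partition aligned too.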

For MBR partitions, it’s slightly more difficult. Use the -a 4k command-line argument to gpart when creating BSD partitions inside an MBR slice. This tells gpart that even if the slice isn’t 4k aligned, the BSD partitions must be.

I could put a bunch of instructions here, but Warren Block has a nice detailed walk-through of the actual commands used to partition disks with these standards.

next book(s): FreeBSD storage

I’m writing about FreeBSD disk and storage management. (The folks on my mailing list already knew this.) For the last few months, I’ve been trying to assimilate and internalize GEOM.

I’ve always used GEOM in a pretty straightforward way: decide what I want to achieve, read a couple man pages, find an archived discussion where someone achieved my goal, blindly copy their commands, and poof! I have deployed an advanced GEOM feature. GEOM was mostly for developers who invented cool new features.

Turns out that GEOM is for systems administrators. It lets us do all sorts of cool things.

GEOM is complicated because the world is complicated. It lets you configure your storage any way you like, which is grand. But in general, I’ve approached GEOM like I would any other harmless-looking but deadly thing. Now I’m using a big multi-drive desktop from iX Systems to fearlessly test GEOM to destruction.

I’m learning a lot. The GEOM book will be quite useful. But it’s taking longer than I thought. Everything else flows out of GEOM. I’ve written some non-GEOM parts, but I’m holding off writing anything built on top of GEOM. Writing without understanding means rewriting, and rewriting leads to fewer books.

My GEOM comprehension is expanding, and many developers are giving me very good insight into the system. GEOM is an underrated feature, and I think my work will help people understand just how powerful it is and what a good selling point it is for FreeBSD.

My research has gone as far as the man pages can take me. Now I need to start pestering the mailing lists for answers. Apparently my innocuous questions can blow up mailing lists. I would apologize, but an apology might imply that I won’t do it again.

FreeBSD storage is a big topic. I suspect it’s going to wind up as three books: one on GEOM and UFS, one on ZFS, and one on networked storage. I wouldn’t be shocked if I can get it into two. I would be very surprised if it takes four. (I’m assuming each book is roughly the size of SSH Mastery — people appear to like that length and price point.) I will adjust book lengths and prices as needed to make them a good value.

The good thing with releasing multiple books is that you only need buy the ones you need. You need to learn about iSCSI and NFS? Buy the third book. You want everything but ZFS? Skip that one. And so on.

As I don’t know the final number of books or how they will be designed, I’m not planning an advance purchase program.

I am planning to release all the books almost simultaneously, or at least very close together.

So, a mini-FAQ:

  • When will they be released?
    When I’m done writing them.

  • How much will they cost?
    Dunno.

  • How many will there be?
    “Five.” “Three, sir.” Or four. Or two. Definitely a positive integer.

  • Do you know anything?
    I like pie.

    I’m pondering how to give back to FreeBSD on this project.

    I auctioned off the first copy of Absolute FreeBSD to support the FreeBSD Foundation. That raised $600 and was rather fun. These books will be print-on-demand, though, so “first print” is a little more ambiguous. It also has a ceiling, where OpenBSD’s ongoing SSH Mastery sales keep giving.

    I’ve had tentative discussions with Ed Maste over at the FreeBSD Foundation about using those books as fundraisers. I’d let the FF have the books at my cost, and they could include them as rewards for larger donations. A million and ten things could go wrong with that, so it might not work out. If nothing else, shipping stuff is a lot of work, and the FF folks might decide that their time is better spent knocking on big corporate doors than playing PBS. I couldn’t blame them — that’s why I don’t ship paper books.

    If that fails for whatever reason, I’ll sponsor a FreeBSD devsummit or something.

virtio NIC on OpenBSD 5.5-current

    My Ansible host is OpenBSD. Because if I’m going to have a host that can manage my network, it needs to be ridiculously secure. The OpenBSD host runs on KVM (through the SolusVM virtualization management system).

    During heavy data transfers, the network card would occasionally stop passing traffic. I could run any Ansible command without issue, but downloading an ISO caused hangs. This was most obvious during upgrades. Downloads would stall. I could restart them with ^Z, then a “ifconfig vio0 down && ifconfig vio0 up && fg” but this still isn’t desirable.

    The vio(4) man page includes the following text:

         Setting flags to 0x02 disables the RingEventIndex feature.  This can be
         tried as a workaround for possible bugs in host implementations or vio at
         the cost of slightly reduced performance.

    (Thanks to Philip Guenther for pointing that out. I would kind of expect this to have been in the BUGS section, or maybe say “Try this if you have weird problems,” but at least the info is there.)

    So: download the new bsd.rd kernel, set the flag, and try to upgrade.

    # config -ef /bsd.rd
    OpenBSD 5.5-current (RAMDISK_CD) #147: Wed May 28 13:56:39 MDT 2014
    deraadt@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/RAMDISK_CD
    Enter 'help' for information
    ukc> find vio
    146 vio* at virtio* flags 0x0
    ukc> change 146
    146 vio* at virtio* flags 0x0
    change [n] y
    flags [0] ? 2
    146 vio* changed
    146 vio* at virtio* flags 0x2
    ukc> quit
    Saving modified kernel.

    The upgrade now runs flawlessly, and I can no longer reproduce the hangs.

    Be sure to repeat this on the new kernel.

LibreSSL at BSDCan

    Thanks to various airline problems, we had an open spot on the BSDCan schedule. Bob Beck filled in at the last moment with a talk on the first thirty days of LibreSSL. Here are some rough notes on Bob’s talk (slides now available).

    LibreSSL forked from OpenSSL 1.0.1g.

    Why did “we” let OpenSSL happen? Nobody looked. Or nobody admitted that they looked. We all did it. The code was too horrible to look at. This isn’t just an OpenSSL thing, or just an open source thing. It’s not unique in software development, it’s just the high profile one of the moment.

    Heartbleed was not the final straw that caused the LibreSSL fork. The OpenSSL malloc replacement layer was the final straw. Default OpenSSL never frees memory, so tools can’t spot bugs. It uses LIFO recycling, so use-after-free bugs appear to work instead of crashing. The debugging malloc sends all memory information to a log. Lots more in Bob’s slides, but this all combined into an exploit mitigation technique countermeasure. Valgrind, Coverity, and OpenBSD’s randomized memory tools don’t catch this.
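To see why a LIFO recycling layer hides this class of bug, here is a toy model in Python. It is not OpenSSL's code; the class and its methods are invented purely to illustrate the idea that a freed buffer keeps its contents and comes straight back on the next allocation.

```python
class LifoFreeList:
    """Toy allocator that recycles freed buffers last-in, first-out.

    Freed buffers are never wiped or handed back to the system, loosely
    mimicking the behavior described in the talk. Not OpenSSL's real code.
    """

    def __init__(self):
        self._free = []

    def alloc(self, size):
        # Reuse the most recently freed buffer of the right size, if any.
        for i in range(len(self._free) - 1, -1, -1):
            if len(self._free[i]) == size:
                return self._free.pop(i)
        return bytearray(size)

    def free(self, buf):
        # Contents are left intact; the buffer just goes on the stack.
        self._free.append(buf)


pool = LifoFreeList()
secret = pool.alloc(16)
secret[:] = b"private key bits"
pool.free(secret)

# Bug: reading after free. The stale data is still there, so nothing crashes.
stale = bytes(secret)

# The next same-sized allocation hands the old contents to a new owner.
reused = pool.alloc(16)
print(stale, bytes(reused), reused is secret)
```

Since the system allocator never sees the free, a tool that watches the system allocator has nothing to flag.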

    Someone discovered all this four years ago and opened an OpenSSL bug. It’s still sitting there.

    LibreSSL started by ripping out features. VMS support, 16-bit Windows support, all gone.

    LibreSSL goals:

  • Preserve API/ABI compatibility – become a drop-in replacement.
  • Bring more people into working on the codebase, by making the code less horrible
  • Fix bugs and adopt modern coding processes
  • Do portability right

    As an example, how does OpenSSH (not LibreSSL, but another OpenBSD product) do portable?

  • Assume a sane target OS, and code to that standard.
  • Build and maintain code on the above, using modern C
  • Provide portability shims to correctly do things that other OS’s don’t provide, only for those who need it.
    – No ifdef maze
    – No compromise on what the intrinsic functions actually do
    – Standard intrinsics
    – Don’t reimplement libc

    How does OpenSSL do portable?

  • Assume the OS provides nothing, because you mustn’t break support for Visual C 1.52.
  • Spaghetti mess of #ifdef #ifndef horror nested 17 deep
  • Written in OpenSSL C – essentially its own dialect – to program to the worst common denominator
  • Implement own layers and force all platforms to use it

    The result? “Cthulhu sits in his house in #define OPENSSL_VMS and dreams”

    Removed scads of debugging malloc and other nasties.

    What upstream packages call and use them? No way to tell. LibreSSL makes some of the very dangerous options no-ops. Turn on memory debugging? Replace malloc wrappers at runtime? These do nothing. The library internally does not use them.

    Some necessary changes that were implemented in massive sweeps:

  • malloc+memset -> calloc
  • malloc(X*Y) -> reallocarray(NULL, X, Y)
  • realloc and free handle NULL, so stop testing everywhere
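The point of the reallocarray sweep is the multiplication: X*Y can silently wrap around, so malloc hands back a buffer far smaller than the caller thinks it has. The Python sketch below simulates a 64-bit size_t to show the difference. The check is modeled on my understanding of the test OpenBSD's reallocarray performs, and the function names here are made up for illustration.

```python
SIZE_BITS = 64                       # assume a 64-bit size_t
SIZE_MAX = (1 << SIZE_BITS) - 1
MUL_NO_OVERFLOW = 1 << (SIZE_BITS // 2)


def naive_malloc_size(nmemb, size):
    """What malloc(nmemb * size) actually requests: the product wraps."""
    return (nmemb * size) & SIZE_MAX


def checked_array_size(nmemb, size):
    """Return nmemb * size, or raise MemoryError if it would overflow size_t."""
    # Only do the division when one operand is big enough to matter.
    if (nmemb >= MUL_NO_OVERFLOW or size >= MUL_NO_OVERFLOW) \
            and nmemb > 0 and SIZE_MAX // nmemb < size:
        raise MemoryError("nmemb * size overflows size_t")
    return nmemb * size


huge = (1 << 62) + 1
print(naive_malloc_size(huge, 4))     # wraps around to a tiny request
try:
    checked_array_size(huge, 4)
except MemoryError as err:
    print("refused:", err)
```

The unchecked version quietly asks for a handful of bytes, while the checked version fails the allocation, which is the behavior you want from a crypto library.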

    OpenSSL used EGD for entropy, and faked random data. OpenSSL gathered entropy from the following sources:

  • Your RSA private key is pretty random
  • “string to give random number generator entropy”
  • getpid()
  • gettimeofday()

    In LibreSSL, entropy is the responsibility of the OS. If your OS cannot provide you with entropy, LibreSSL will not fake it.
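Why do a process ID and a timestamp make poor entropy? Because an attacker can guess them. The sketch below is a deliberately simplified illustration, not how OpenSSL seeded anything: the seed formula, the PID range, and the timestamp are all made-up assumptions. It contrasts a guessable seed with asking the operating system for random bytes through Python's os.urandom.

```python
import os
import random


def weak_seed(pid, seconds):
    """Hypothetical 'entropy' built only from a process ID and a timestamp."""
    return (pid << 32) ^ seconds


def key_from_seed(seed, nbytes=16):
    """Derive 'random' bytes from a seed with a non-cryptographic PRNG."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(nbytes))


# An attacker who knows the rough time only has to walk the PID space.
victim_key = key_from_seed(weak_seed(pid=4242, seconds=1400000000))
recovered = None
for pid in range(4000, 4500):
    if key_from_seed(weak_seed(pid, 1400000000)) == victim_key:
        recovered = pid
        break
print("recovered pid:", recovered)

# Asking the operating system instead: os.urandom reads the kernel's CSPRNG.
os_key = os.urandom(16)
print(len(os_key))
```

If the OS cannot supply random bytes, the honest answer is to fail, which is the position LibreSSL takes.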

    LibreSSL is being reformatted into KNF – the OpenBSD code style. OpenSSL uses whatever style seemed right at the moment. The reformatting makes other problems visible, which is the point. More readable code hopefully means more developer involvement.

    The OpenSSL bug tracking RT has been and continues to be a valuable resource.

    OpenSSL exposes just about everything via public header files. Lots of the API should probably not be used outside of the library, but who knows who calls what? OpenBSD is finding out through constant integration testing with their ports tree.

    The LibreSSL team wants to put the API on a diet so that they can remove potentially dangerous stuff. They are being careful about this, testing each change against the OpenBSD ports tree. Yes, this conflicts with the “drop-in replacement” goal.

    Internally, LibreSSL uses only regular intrinsic functions provided by libc. OpenSSL’s custom APIs remain for now only to maintain compatibility with external apps.

    Things in OpenSSL that surprised the LibreSSL team:

  • big endian amd64 support
  • Compile options NO_OLD_ASN1 and NO_ASN1_OLD are not the same
  • You can turn off sockets, but you can’t turn off debugging malloc
  • socklen_t – if your OS doesn’t have socklen_t, it’s either int or size_t. But OpenSSL does horrible contortions to define its own. If the size of socklen_t changes while your program is running, openssl will cope.
  • OpenSSL also copes if /dev/null moves while openssl is running.

    So far:

  • OpenSSL 1.0.1g was a 388,000 line code base
  • As of yesterday, 90,000 lines of C source deleted, about 150,000 lines of files
  • Approximately 500,000 line unidiff from 1.0.1g at this point
  • Many bugs fixed
  • The cleaning continues, but they’ve started adding new features (ciphers)
  • Code has become more readable – portions remain scary

    LibreSSL has added the following cipher suites under acceptable licenses – Brainpool, ChaCha, poly1305, ANSSI FRP256v1, and several new ciphers based on the above.

    FIPS mode is gone. It is very intrusive. In other places governments mandate use of certain ciphers (Camellia, GOST, etc). As long as they’re not on by default, and are provided as clean implementations under an acceptable license, they will include them. They believe it’s better that people who must use these ciphers use them in a sane library with a sane API than roll their own.

    If you want to use the forthcoming portable LibreSSL, you need:

  • modern POSIX environment
  • OS must provide random data – readiness and quality are responsibility of OS
  • malloc/free/calloc/realloc (overflow checking)
  • modern C string capabilities (strlcat, strlcpy, asprintf, etc)
  • explicit_bzero, reallocarray, arc4random

    You can’t replace explicit_bzero with bzero, or arc4random with random. LibreSSL wants a portability team that understands how to make it work correctly.

    LibreSSL’s eventual goals:

  • provide better (replacement, reduced) api
  • reduce code base even more
  • split out non-crypto things from libcrypto
  • split libcrypto from libssl

    There’s lots of challenges to this. The biggest is stable funding.

    The OpenBSD Foundation wants to fund several developers to rewrite key pieces of code. They want to sponsor the efforts of the portability team and of the ports people who track the impact of proposed API changes.

    They will not do this at the expense of OpenSSH or OpenBSD.

    The OpenBSD Foundation has asked the Linux Foundation for support, but the Linux Foundation has not yet committed to supporting the effort. (I previously said that they hadn’t responded to the request, which is different. The LF has received Bob’s email and discussions are ongoing.)

    In Summary:

  • OpenSSL’s code is awful
  • LibreSSL can be done
  • They need support

    If you’re interested in supporting the effort, contact the OpenBSD Foundation. The Foundation is run by Bob Beck and Ken Westerback, and they manage all funding. (While Theo de Raadt leads the OpenBSD Project, he actually has nothing to do with allocating funding.)

BSDCan keynote

    Karl Lehenbauer, CTO of FlightAware, is giving an excellent BSDCan keynote: a retrospective of his BSD experience. As part of the mass of flight troubles plaguing North America this week, his flight to Ottawa was cancelled. He landed in Toronto at midnight last night.

    I wouldn’t have blamed him for canceling the keynote.

    Instead, he rented a car and drove to Ottawa. Overnight. After a bad day of travel. That’s about a four-hour drive.

    Lehenbauer is clearly a man who keeps his promises.

    Plus, he speaks very well despite not sleeping. Or maybe because he didn’t sleep. Whatever, he’s good.

FreeBSD devsummit notes: ports & packages

    The ports and packages summit was a lot more discussion of options as opposed to the state of items and future plans. A very dynamic session, where each of the dozen or so scheduled speakers was more “moderator of the moment.” Plus, I staggered in half an hour late, because breakfast was really really good.

    But, in general, what happened:

    I walked in on an overview of Debian packages. It’s always good to assess others’ work.

    Discussions on dependencies.

    Ed Maste on possibly using certificate transparency via X.509v3 extension, rather than creating our own signing infrastructure.

    Using qemu BSD user mode for cross-building packages. Qemu still needs some work, and you can pitch in.

    bapt: want to control what scripts can do, so arbitrary scripts can’t harm system. Have the system provide a utility that will let programs check config files or update a database, rather than run arbitrary scripts. Would also help with cross-building packages.

    Cross-building is improving. Now nightly ARM crossbuild packages in test. Hopefully ready by EuroBSDCon.

    PortCI: project for build cluster automation. Various port building processes are manual, such as testing and QAT. PortCI lets you manage these queues easily via a simple front end. The idea is to eventually let committers request and configure their own experimental runs.

    Jenkins – https://jenkins.freebsd.org. Uses bhyve VMs. Testing ports on all platforms.

    Do not use freebsd-version(1) in the ports tree. Designed only for use in the base system. Security fixes that don’t touch the kernel won’t affect uname -r, and freebsd-version doesn’t apply to releases built from source. Ports tree needs something to say exactly what version you have no matter how that version was produced.

    Discussion on handling port licenses.

    Packaging base! pkg doesn’t handle chflags yet, but they’re working on it. Split packages per build system option. But this changes how some programs are linked–what about NIS? Bapt is pondering that. We could offer multiple versions of packages, such as NIS-free. But FreeBSD’s “build system is not a paragon of configurability, but a bunch of hacks on what annoyed people the most” (Warner Losh).

    I’m teaching in less than an hour, so I left the discussion here.

FreeBSD devsummit virtualization session

    Some notes from the FreeBSD virtualization devsummit. Very rough, but my understanding is very rough, so all is as it should be.

    Bhyve moving to UEFI loader away from FreeBSD and grub2
    • Fork of intel EDKII (BSD License), OVMF build target
    • For bhyve instead of Qemu
    • Includes CSM BIOS emulation for non-EFI aware OS’s
    • Currently in-house, being moved to public git repo
    • Buildable on FreeBSD (GCC 4.6 or later), needs to be a port – bhyve folks need port creation help
    • Serial console only: working on VGA emulation with VNC client

    Networking:
    • Virtio doesn’t support modern networking features
    • One NIC, e1000 (multiq, jumbo frames, TSO) under way
    • e1000e (82580) dev emulation in progress
    • each has thousands of registers, still working on them

    Considering:
    • user mode using WANProxy/libuinet
    • simple kernel eth switch

    Storage
    • zvol GEOM-avoidance in place (mav@) – prevent geom from sniffing ZFS partition tables, so host will never see VM filesystem
    • virtio todo: asynch block writes, add virtio SCSI
    • Wanted: BSD-licensed sparse image tools for working with vmdk, qcow2, vhd, etc. Would be nice to point bhyve at a VMDK file and say “go!”

    Future
    • AMD-SVM
    • Windows guest support (requires UEFI)
    • Illumos doesn’t need UEFI, needs a real BIOS – use BIOS compat in UEFI
    • ARM(64) chips have virtualization support, get bhyve to work on it.
    • Save state/restore/migrate
    • configuration file, as the command line is unwieldy for hierarchical info – use UCL because the ports people also use UCL
    • Regression suite – bhyve supports lots of different hardware and operating systems, so we need to have automated testing

    Other virtualization
    • Virtualbox – FreeBSD is tracking very closely, 4.3.10 came out 25 March, port updated on 28 March.
    • HyperV – 10.0, amd64 and i386 guest support

    o Recent Azure image announcement
    o Nobody in the FreeBSD community tracks Hyper-V, it could use a nanny

    Luigi Rizzo on performance with device drivers
    • One option – e1000 emulation, performance will be poor, will be slow
    • Some emulation drivers fake TSO, etc
    • No good solutions outside paravirtualization
    • High performance = modify guest device driver to be virtualization-aware
    • Luigi got 17GB/s using netmap with bhyve

    Roger Pau Monné on Xen
    • Changes in FreeBSD 10

    o Vector callback for injecting event channel interrupts
    o PV timer
    o PV IPIs
    o Add Xen support into GENERIC – can now use freebsd-update
    o Sponsored by SpectraLogic and Citrix

    • PVH domU

    o Supported guest mode since 4.4
    o Builds atop of the PVHVM work introduced in FreeBSD 10
    o Half-merged into –current, some work remains
    o Same speed as PVHVM, main difference is way it boots
    o Not as intrusive as a traditional PV port

    • PVH Dom0

    o Xen side patches almost fully merged
    o Main difference between PVH DomU and Dom0 is that on Dom0 FreeBSD needs to manage the hardware
    o Add support for PIRQ (physical interrupts routed atop event channels)
    o ACPI tables parsed by Dom0, and Xen must be made aware of the underlying devices
    o Xen user-space control devices needed by the toolstack:

     – Privcmd: allows issuing hypercalls into Xen and mapping foreign domain memory from userspace
     – Evtchn: allows registering and receiving interrupts by user-space applications

    • Big items remaining

    o Add multiboot support to the FreeBSD bootloader – right now, you must use pxelinux or grub
    o Improve if.xn – doesn’t work correctly with a NetBSD dom0, doesn’t work properly between guests on the same host, paravirtualized interface does not perform well yet.

    • Hoping to have Xen work for FreeBSD 11

    VirtIO/VMWare guest drivers by Bryan Venteicher
    • Work done over last year
    • VirtIO: new

    o Unmapped IO – block and SCSI
    o Network multiqueue
    o Random (entropy) device
    o Initial console driver – can do multi-consoles, hotplug is so-so

    • VirtIO: remaining

    o Support missing devices – MMIO
    o Non-x86 architectures
    o SCSI multiqueue
    o VirtIO version 1 specification – very similar to existing virtio

    • VMWare

    o Vmxnet3

     – VMware-provided driver, messy
     – OpenBSD imported their own vmx driver May 2013
     – Ported to FreeBSD 10.0

    • TSO/LRO offload
    • Multiqueue

     – To do: PVSCSI & VMWare tools

    Device emulation in bhyve

    • Most emulated in userspace usr.sbin/bhyve
    • Kernel ones in vmm/io/ (PICs and timers)
    • ISA-LPC – uart, rtc
    • PCI

    o Virtio

     – Block: storage
     – Net: tap
     – Rng: random entropy from /dev/random

    o Ahci
    o Pass-through

    • Go through how virtio device drivers work. Interesting, lots of diagrams he should post, but way above my head, so I didn’t take too many notes
    • Virtio random number generator

    o Usr.sbin/bhyve/pci_virtio_rnd.c
    o Guest rng driver requests 32-bit number to replenish its random pool
    o FreeBSD /dev/random is non-blocking, using Yarrow and (soonish) Fortuna

FreeBSD 11 feature goals

    I’m at the BSDCan FreeBSD devsummit, and the current topic is FreeBSD 11 Goals.

    As the Great Committer John Baldwin has requested that people take notes and blog about the discussions, and this might be of wider interest, here’s the goals.

    These are my notes. I probably missed things. I would be shocked if I didn’t, actually. And I probably misunderstood some stuff.

    Test suite/QA (jmmv) – some stuff merged to 10
    Mips64 & more MIPS stuff
    Scatter/Gather mbufs (scottl) – collapse down mbufs from a long chain into one unit
    Lldb (emaste) – make it first-class citizen, fully functioning & working in 11 on all platforms, native cross-platform debugging
    Uefi boot and install support (emaste)
    Package the base system (gjb/bapt)
    Open to floor:
    AES GCM added to ipsec – jmg
    ASLR – Shawn “The Goats Are His Fault” Webb
    DNS improvements – Erwin
    Suspend/resume
    Libc++
    OpenMP
    FreeBSD devs – Want icc? Talk to gnn@
    Kload – hot swap kernel upgrade
    Dragonfly mailer as default
    Ncurses cleanup
    Capsicum and casper improvements – use casper to help apps use dnssec correctly
    TCP performance and enhancements (gnn) – project as a whole needs broader TCP patch reviewers
    L2 rework
    Libuinet
    Arm64 (andyt)
    Package building Mips32 packages via qemu
    External toolchain improvements (imp) – some people need GCC
    Remove gcc by 11 – gnn willing to remove it right after this devsummit session
    Remove ia64 (marcel)
    Userland dtrace (marki)
    Xen dom0 for x86 (rogier)
    Kqueue64 from OS X – available from Apple, we could pull this in (jmg)
    Async sendfile (glebius)
    Lightweight reference counts (maybe) (glebius)
    Kdbus (need for desktop)
    Vt + newcons default (emaste)
    KMS, DRM, AGI improvements (dumbbell & kip)
    SMT (need)
    Encrypted kernel dump (gnn)
    Nand flash (warner)
    Superpages for certain arm & mips
    Multi-endian ufs
    Libdispatch (sson)
    Move libraries to private (need, bapt, bdrewery help)
    /etc/src.conf improvements
    64-bit linuxulator
    newer linuxulator (xmj)
    new autofs (emaste)
    unionfs improvements (need)
    64-bit struct stat & dirent, mount_pathlen, max_pathlen, (benno)
    X32 – alternate abi for amd64 – jhb wants, but no commitment
    PF – improve its internal API so we can manage stable branches, merge newer version, IPv6 improvements (glebius)
    Bhyve – UEFI
    Vxlan
    Reproducible release builds – remove usernames, host names, timestamps from builds
    IPv6 security improvements
    Network stack backpressure
    Network multipath (stretch goal)
    Capsicum shell – will sandbox package building
    Non-root image building

    Will all this happen? Who knows. But plans are nice.