Sunday, 29 August 2010

Noflushd Patch for 2.6.32+ Kernels

After a recent kernel upgrade (from 2.6.26 to 2.6.34), Noflushd[1] stopped working. Worse, it broke quite badly: CPU usage spiked as soon as syncs started to happen, and in all probability some sort of continuous loop kicked in. It appears that Noflushd relies on writing a "0" into the /proc/sys/vm/dirty_writeback_centisecs file, which with the new (threaded) pdflush implementation results in essentially continuous writing (since the write daemon now sleeps for "0" seconds between writes, and spawns multiple threads, one per partition/mount-point, to do the writes). The pdflush implementation changes have a lot of side effects, and noflushd is one program that is quite severely affected.

Noflushd is not the preferred choice for spindowns when you factor in the availability of laptop_mode and the ability of hard disks to spin down on their own when there is no activity (as configured via 'hdparm -S ...'). Unfortunately, if one uses a Western Digital Scorpio Blue Notebook drive (on Linux), many (most? Or worse, all?) drives seem to have "broken" firmware in that they either spin down extremely aggressively (~8s) or not at all. They blatantly ignore hdparm values, and if you factor in the load cycle ("head parking") that also happens at the same rate of ~8s, there's a very strong likelihood that the drive will die fairly quickly. So, in the unfortunate event that you have similar drive behaviour, short of writing a periodic spindown command job (which would not be aware of writes, and so could not reset its timer when the disk is in use), it is best to rely on Noflushd.

After a fair bit of tweaking and debugging, here's a set of changes (see below) that makes Noflushd work as it is supposed to, as it did with the 2.6.26 (and older) kernels. These changes also hinge on having the old EXT3 behaviour (journal mode is ordered instead of the now-default writeback), so make sure that you are using EXT3 with the right journal mode (since this default too has changed in the recent kernels).

First, a bit of background on the new pdflush implementation. Essentially, the /proc/sys/vm/dirty_writeback_centisecs file has a changed meaning (or rather behaviour) when it is set to zero. In older kernels (including 2.6.26), a zero in dirty_writeback_centisecs meant that the background flush daemon was disabled (i.e. not woken up periodically to flush writes to disk). This was crucial to using noflushd correctly to prevent unnecessary spinups of the drives, since noflushd used that mechanism to disable the background writes before forcing the hard disk to sleep/standby. In the new kernels, instead of a "0", a "-1" seems to disable the background writes completely (and also results in "correct" behaviour when using noflushd, as the hard disk correctly spins down and in general works as before). A fair bit of background on the pdflush changes (not necessarily related to the "0" vs "-1") is here [2,3,4,5].
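As a rough illustration of the "0" vs "-1" difference described above, here is a minimal sketch in Python (not code from the patch) that picks the value to write based on the kernel release string. The 2.6.32 cut-off is an assumption taken from the title of this post, and the function names are hypothetical; the "-1" behaviour itself is only what was observed here, not something confirmed from kernel documentation.

```python
import re

# Assumption: the changed behaviour starts at 2.6.32, as in the title of this post.
NEW_WRITEBACK_SINCE = (2, 6, 32)


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a string such as '2.6.34-1-686'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def writeback_disable_value(release):
    """Return the string to write to dirty_writeback_centisecs
    to stop periodic background flushing (hypothetical helper)."""
    if parse_kernel_release(release) >= NEW_WRITEBACK_SINCE:
        return "-1"  # observed to disable background writes on newer kernels
    return "0"       # disables the flush daemon on 2.6.26 and older
```

On a live system the release string could come from platform.release(); the helper only decides the value and leaves the actual write to /proc/sys/vm/dirty_writeback_centisecs (which needs root) to the caller.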

Disabling the writeback daemon/threads using a "-1" in /proc/sys/vm/dirty_writeback_centisecs (instead of the "0" that worked before) requires a few key changes to noflushd (especially when you want it to also work correctly with the older kernels, where a "0" still has to be written). One more interesting issue is that if /proc/sys/vm/laptop_mode has a non-zero value in it, the kernel will force a full sync that many seconds after any other write (including a noflushd sync), which results in a forced disk write and wakes up the disk if it is sleeping. This is a big deviation from the past, where a (forced) sync would not force a fresh write later. As a result, to get the disk to spin down correctly (using noflushd), it is essential to disable laptop_mode. Of course, all this is irrelevant if you have a well-behaved disk to begin with.
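The laptop_mode check can be scripted. The following is a small sketch, assuming only that /proc/sys/vm/laptop_mode holds a single integer and that writing "0" to it disables laptop_mode; the function names are made up for illustration, and the path is a parameter so the logic can be tried against an ordinary file first.

```python
LAPTOP_MODE_PATH = "/proc/sys/vm/laptop_mode"


def laptop_mode_enabled(path=LAPTOP_MODE_PATH):
    """Return True if the file at `path` holds a non-zero integer."""
    with open(path) as handle:
        return int(handle.read().strip()) != 0


def disable_laptop_mode(path=LAPTOP_MODE_PATH):
    """Write "0" to `path`; on a real system this needs root privileges."""
    with open(path, "w") as handle:
        handle.write("0\n")
```

A startup script could call laptop_mode_enabled() before launching noflushd and either warn or call disable_laptop_mode(). Note that the setting does not survive a reboot unless it is also made persistent by whatever mechanism your distribution uses.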

So, to support both the new 2.6.32+ kernels and the older kernels, noflushd now respects a "-1" in the dirty_writeback_centisecs entry and uses it accordingly. The patch (below) also has some other useful changes. Noflushd will now track how long drives have been spun down, how many times they have been spun up/down, and the average duration of the spindowns. The interval (as specified on the command line) is now given in units of 10s (so 5 means 50s) instead of the original minute-based spec. The new spec allows much finer-grained control of the spindown times. The statistics are dumped (via syslog entries) on shutdown as well as when a SIGUSR1 signal is received. Additionally, when a SIGHUP is received, the next (default) timeout value is used, and that is also logged to syslog, which makes the workings of noflushd much easier to review and the time settings easier to tweak.
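To make the new interval unit and the statistics concrete, here is an illustrative sketch in Python. It is not the patch itself (noflushd is written in C), and the class, method and constant names are hypothetical; it only mirrors the bookkeeping described above, with timestamps passed in explicitly as seconds.

```python
TICK_SECONDS = 10  # per the description above: an interval of 5 means 50s


def interval_to_seconds(units):
    """Convert a command-line interval (in 10s units) to seconds."""
    if units < 0:
        raise ValueError("interval must not be negative")
    return units * TICK_SECONDS


class SpindownStats:
    """Track spin-up/spin-down counts and time spent spun down for one drive."""

    def __init__(self):
        self.spindowns = 0
        self.spinups = 0
        self.total_down_seconds = 0.0
        self._down_since = None  # timestamp of the current spindown, if any

    def spun_down(self, now):
        if self._down_since is None:  # ignore repeated spindown reports
            self.spindowns += 1
            self._down_since = now

    def spun_up(self, now):
        if self._down_since is not None:  # only count a spin-up after a spindown
            self.spinups += 1
            self.total_down_seconds += now - self._down_since
            self._down_since = None

    def average_down_seconds(self):
        """Average length of the completed spindowns, or 0.0 if there are none."""
        if self.spinups == 0:
            return 0.0
        return self.total_down_seconds / self.spinups

    def summary(self):
        """One line suitable for a log entry."""
        return "spindowns=%d spinups=%d total_down=%.0fs avg_down=%.0fs" % (
            self.spindowns, self.spinups,
            self.total_down_seconds, self.average_down_seconds())
```

In a daemon, summary() is the kind of line that would be sent to syslog from a SIGUSR1 handler and at shutdown. A spindown that is still in progress is not included in the average in this sketch.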

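In summary, to have noflushd work correctly on the 2.6.32+ kernels, the following changes (all covered above) have to be made to the system configuration:

- Use EXT3 with the ordered journal mode rather than the now-default writeback.
- Disable laptop_mode, i.e. make sure /proc/sys/vm/laptop_mode contains 0 [6].
- Run a noflushd built with the patch, so that it handles the "-1" in /proc/sys/vm/dirty_writeback_centisecs.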

Download the patch for noflushd here.

Keywords: linux kernel 2.6.32 2.6.26 writeback expiry dirty disk spindown hdparm pdflush noflushd western digital scorpio blue spindown firmware

URL[1]: http://noflushd.sourceforge.net/
URL[2]: http://lwn.net/Articles/326552/
URL[3]: http://axboe.livejournal.com/1819.html
URL[4]: http://axboe.livejournal.com/2258.html
URL[5]: http://lwn.net/Articles/9521/
URL[6]: http://www.kernel.org/doc/Documentation/laptops/laptop-mode.txt
