August 26, 2009
Benchmarking Is Hard, Let's Go Shopping
It's been a while since I started telling people that benchmarking systems is hard. I'm here today because of an article about an article about an article from the ACM Transactions on Storage. (If anyone refers to this post, they should cite it as "blog post about an article about ..." ;) )
While the statement "benchmarking systems is hard" is true for most of systems benchmarking (yes, that's an assertion without supporting data, but this is a blog and so I can state these opinions left and right!), the underlying article (henceforth the article) is about filesystem and storage benchmarks specifically.
For those of you who are getting the TL;DR feeling already, here's a quick summary:
- FS benchmarking is hard to get right.
- Many commonly accepted fs benchmarks are wrong.
- Many people misconfigure benchmarks, yielding useless data.
- Many people don't specify their experimental setup properly.
Hrm, I think I just summarized a 56-page journal article in 4 bullet points. I wonder what the authors will have to say about this :)
On a related note, it really bothers me when regular people attempt to "figure out" which filesystem is the best, and they share their findings. It's the sharing part. Why? Because they are uniformly bad.
Here's an example of a benchmark gone wrong...
Take Postmark. It's a rather simple benchmark that simulates the IO workload of an email server. Or does it? What do mail servers do? They read. They write. But above all, they try to ensure that the data actually hit the disk. POSIX specifies a wonderful way to ensure data hits the disk - fsync(2). (You may remember fsync from O_PONIES & Other Assorted Wishes.) So, a typical email server will append a new email to the mailbox, and then fsync it. Only then will it acknowledge receiving the email to the remote host. How often does Postmark run fsync? The answer is simple: never.
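To make the append-then-fsync pattern concrete, here is a minimal sketch in Python. It is not code from any real mail server: the `deliver` function and the mailbox path are made up for illustration, and a real server would do more (locking, error reporting, and probably an fsync of the containing directory when the file is newly created).

```python
import os


def deliver(mailbox_path: str, message: bytes) -> None:
    """Append one message to a mailbox file and force it to stable storage.

    Hypothetical helper for illustration only.
    """
    # O_APPEND makes every write land at the current end of the file.
    fd = os.open(mailbox_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        view = memoryview(message)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the data to the disk before returning.
        os.fsync(fd)
    finally:
        os.close(fd)
    # Only after this point would a server acknowledge the email
    # to the remote host.
```

A benchmark that skips the `os.fsync` step measures how fast the page cache absorbs writes, which is a very different question from how fast the disk can make them durable.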
Now you may be thinking...I've never heard of Postmark, so who uses it? Well, according to the article (the 56-page one), out of the 107 papers surveyed, 30 used Postmark. Postmark is so easy to run that even non-experts try to use it. (The people at Phoronix constantly try to pretend that they figured out benchmarking. For example, in EXT4, Btrfs, NILFS2 Performance Benchmarks they are shocked (see page 2) that some filesystems take 500 times longer for one of their silly tests, even though people have pointed out to them what barriers are, and that they will have an impact on performance.)
Granted, non-experts are expected to make mistakes, but you'd expect that people at Sun would know better. Right? Well, they don't. In their SOLARIS ZFS AND MICROSOFT SERVER 2003 NTFS FILE SYSTEM PERFORMANCE WHITE PAPER (emphasis added by me):
This white paper explores the performance characteristics and differences of Solaris ZFS and the Microsoft Windows Server 2003 NTFS file system through a series of publicly available benchmarks, including BenchW, Postmark, and others.
Sad. Perhaps ZFS isn't as good as people make it out to be! ;)
Alright, fine, Postmark doesn't fsync, but it should otherwise be ok, right? Wrong again! Take the default parameters (table taken from the article):
|Parameter|Default Value|Number Disclosed (out of 30)|
|---|---|---|
|File sizes|500-10,000 bytes|21|
|Number of files|500|28|
|Number of transactions|500|25|
|Number of subdirectories|0|11|
|Read/write block size|512 bytes|7|
First of all, note that some parameters weren't specified by a large number of papers. The other interesting thing is the default configuration. Suppose that all 500 files grow to the 10,000-byte maximum (in practice they'll have random sizes in the specified range). That means that the most space they can take up is 5,000,000 bytes, or just under 5 MB (about 4.77 MiB). Since there's no fsync, chances are that the data will never hit the disk! This easily explains why the default configuration executes in a fraction of a second. These defaults were reasonable many years ago, but not today. As the article points out:
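The worst-case arithmetic is easy to check. This small sketch only uses the two default values from the table above; the function name is mine, not Postmark's.

```python
# Postmark defaults, taken from the table above.
NUM_FILES = 500
MAX_FILE_SIZE = 10_000  # bytes, upper end of the 500-10,000 byte range


def worst_case_bytes(num_files: int, max_file_size: int) -> int:
    """Upper bound on the data set: every file at its maximum size."""
    return num_files * max_file_size


total = worst_case_bytes(NUM_FILES, MAX_FILE_SIZE)
print(f"{total} bytes")
print(f"{total / 10**6:.2f} MB (decimal)")
print(f"{total / 2**20:.2f} MiB (binary)")
```

Whichever unit you prefer, the whole data set fits comfortably in the page cache of any current machine.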
Having outdated default parameters creates two problems. First, there is no standard configuration, and since different workloads exercise the system differently, the results across research papers are not comparable. Second, not all research papers precisely describe the parameters used, and so results are not reproducible.
Later on, in the three pages dedicated to Postmark, the article states:
An essential feature for a benchmark is accurate timing. Postmark uses the time(2) system call internally, which has a granularity of one sec. There are better timing functions available (e.g., gettimeofday) that have much finer granularity and therefore provide more meaningful and accurate results.
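Here is a rough illustration of why one-second granularity is a problem. It is a Python sketch, not Postmark's code: `whole_second_clock` merely imitates what time(2) returns, and `time.perf_counter` stands in for a finer-grained timer such as the gettimeofday mentioned in the quote. The workload is an arbitrary placeholder.

```python
import time


def whole_second_clock() -> int:
    """Stand-in for time(2): the current time truncated to whole seconds."""
    return int(time.time())


def elapsed(clock, work):
    """Run work() and return the difference between two clock readings."""
    start = clock()
    work()
    return clock() - start


def tiny_workload() -> None:
    # Arbitrary short task, standing in for a sub-second benchmark run.
    sum(range(100_000))


coarse = elapsed(whole_second_clock, tiny_workload)
fine = elapsed(time.perf_counter, tiny_workload)
print(f"whole-second clock: {coarse} s")
print(f"perf_counter:       {fine:.6f} s")
```

With the coarse clock, a run that finishes in a fraction of a second is reported as either 0 or 1 second, depending on whether it happened to straddle a second boundary. Neither number tells you anything.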
Anyway, now that we have beaten up Postmark and it is cowering in the corner of the room, let's take a look at another favorite benchmark people like to use - a compile benchmark.
The great thing about compile benchmarks is that they are really easy to set up. Chances are that someone interested in running benchmarks already has some toolchain set up - so a compile benchmark consists of timing the compile! Easy? Definitely.
One problem with compile benchmarks is that they depend on a whole lot of state. They depend on the hardware configuration, software configuration (do you have libfoo 2.5 or libfoo 2.6 installed?), as well as the version of the toolchain (gcc 2.95? 2.96? 3.0? 3.4? 4.0? 4.2? or is it LLVM? or MSVC? or some other compiler? what about the linker?).
The other problem with them is...well, they are CPU bound. So why are they used for filesystem benchmarks? My argument is that they are useful for demonstrating that the change the researchers made does not incur a significant amount of CPU overhead.
Anyway, I think I'll stop ranting now. I hope you learned something! You should go read at least the Linux Magazine or Byte and Switch article; it's a good read. If you are brave enough, feel free to dive into the 56 pages of text. All of these will be less rant-y than this post. Class dismissed!
August 24, 2009
A blog is supposed to ...
... mention other blogs, right?
I tripped across this article: Storage Basics: Clustered File Systems, by Charlie Schluting (August 18, 2009).
He briefly describes the Red Hat and Oracle offerings in Linux, VMware, Lustre, and Hadoop. (Maybe he is not too precise in his taxonomy, eh?)
Anyway, it's a quick read, so check it out and tell me what I am supposed to think about it!
August 17, 2009
git engineering for pNFS
I'm reading a month-old thread on the pNFS Linux developers' mailing list that helps me understand the very hard problem of factoring code for pNFS operation into a set of patches in a way that is both generic and useful.
August 13, 2009
Roadmap for pNFS in the Linux kernel
In early 2008, we sketched out a road map for pNFS that tried to predict progress on NFSv4.1 implementation, standardization, and inclusion in the Linux kernel. Briefly, we predicted:
• Complete interoperable and functional implementations
• Convergence of IETF Internet drafts
• NFSv4.1 RFC issued
• NFSv4.1 merged into mainline Linux kernel
• Developers tune pNFS performance at scale
This note looks at progress in adding NFSv4.1 to the Linux kernel. We're not far off track, maybe a couple months.
Linux kernels are not released on a specific schedule, but there is a discernible pattern.
When a kernel is released, a development kernel "opens up." Kernel maintainers then have a window of about two weeks to merge in major changes ready to see the light of day. The development kernel is then worked over for a couple months by maintainers. When the development kernel has stabilized, it is released, and the process starts anew.
The last several kernels were released on the following dates:
2.6.26 on July 13, 2008
2.6.27 on October 9, 2008
2.6.28 on December 24, 2008
2.6.29 on March 23, 2009
2.6.30 on June 9, 2009
This is consistent with a two-and-a-half-month cycle, with an extra two weeks over the winter holidays. So our best guess for the schedule of future releases is:
2.6.31 in late August 2009
2.6.32 in early November 2009
2.6.33 in early February 2010
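The gaps between releases can be computed directly from the dates listed above. The sketch below does that in Python and projects one release forward by the rounded average gap. This is a naive extrapolation for illustration, not necessarily how the estimates above were made, and it ignores the holiday adjustment.

```python
from datetime import date, timedelta

# Release dates as listed above.
releases = {
    "2.6.26": date(2008, 7, 13),
    "2.6.27": date(2008, 10, 9),
    "2.6.28": date(2008, 12, 24),
    "2.6.29": date(2009, 3, 23),
    "2.6.30": date(2009, 6, 9),
}

dates = list(releases.values())

# Days between each pair of consecutive releases.
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
mean_gap = sum(gaps) / len(gaps)

# Naive projection: last release plus the rounded average gap.
projected_next = dates[-1] + timedelta(days=round(mean_gap))

print("gaps in days:", gaps)
print(f"mean gap: {mean_gap:.2f} days")
print("projected next release:", projected_next.isoformat())
```

The average comes out a little under three months, and the projection lands at the very end of August 2009, in line with the "late August" guess for 2.6.31.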
Pieces of NFSv4.1 are already present in the Linux kernel. In particular, the sessions communication layer, which is mandatory in NFSv4.1, has a toehold: 2.6.30 has some preliminary server-side sessions code, although it lacks a few things:
• back channel
• SSV (and some other security-related features)
• reboot recovery (no RECLAIM_COMPLETE)
• some miscellaneous — but mandatory — state operations, like DESTROY_CLIENTID and TEST_STATEID
None of the optional NFSv4.1 features, e.g., directory delegations, pNFS, and file delegation enhancements, are in 2.6.30.
Although 2.6.31 is not yet released (it is "in stabilization"), the important stuff has already been accepted, so we know that 2.6.31 will have preliminary client-side sessions code, with more or less the same caveats as the 2.6.30 server-side sessions code.
When 2.6.32 opens up, we're expecting the sessions back channel — necessary for delegation and layout recalls — to be merged in. Developers have been testing this code at interoperability events, and it passes artificial tests, but it will need some TLC as 2.6.32 stabilizes before it can be used in production.
I'll write about the prospects for pNFS (i.e., layout ops) in my next post.
August 01, 2009
This web log is a place for CITI staff to post their news and thoughts about ongoing activities at CITI. Discussion among the staff and CITI followers is encouraged.