September 24, 2009
PAPI - Getting at Hardware Performance Counters
Recently, I wanted to figure out whether an application I was analyzing was memory bound. While on this quest, I was introduced to the Performance Application Programming Interface (PAPI).
There is a rather good HOWTO with step-by-step instructions for getting it all running on Debian. The text below is more or less a short version of that HOWTO, with my thoughts interspersed.
PAPI is a library that hooks into the hardware performance counters, and presents them in a uniform way. Installation is rather simple if you pay attention to the installation instructions.
- Get the kernel source
- Get the perfctr tarball
- Extract the sources, and run the update-kernel script. I really mean this: if you try to be clever and apply the patch by hand, you'll end up with a broken source tree. (The script runs patch to fix up some existing kernel files, and then copies a whole bunch of other files into the kernel tree.)
- Configure, build, install, and reboot into the new kernel
- You can modprobe perfctr and see spew in dmesg
That's it for perfctr. Now PAPI itself...
- Get & extract the source
- ./configure, make, make fulltest, make install-all
That's it for PAPI. The make fulltest runs the test suite. Chances are the tests will either all pass or all fail. If they fail, then something is wrong (probably with perfctr). If they pass, then you are all set.
There are some examples in the src/examples directory. Those should get you started with using PAPI. It takes about 100 lines of C to get an arbitrary counter going.
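To give a flavor of what that looks like, here is a minimal sketch (not the examples shipped with PAPI — just my condensed illustration, with most error handling elided) that counts total instructions using the PAPI_TOT_INS preset event:

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long count;

    /* Initialize the library; the returned version must match the header. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }

    /* Create an event set and add one preset counter to it. */
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_INS);

    PAPI_start(eventset);
    /* ... the code you want to measure goes here ... */
    PAPI_stop(eventset, &count);

    printf("instructions: %lld\n", count);
    return 0;
}
```

Compile with -lpapi; it obviously only runs on a machine with perfctr set up as above.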
Some other time, I'll talk more about PAPI, and how I used it in my experiments.
September 16, 2009
Fluid Dynamics Computing using GPUs
This summer I had the opportunity to work on an emerging new platform for scientific computing - Graphics Processing Units (GPUs).
GPUs have long been very powerful processors for graphics, but until recently they were built on a platform that didn't allow them to be used for any other application. With the launch of the NVIDIA G80 and G200 series built on the Compute Unified Device Architecture (CUDA), GPUs now have an intuitive interface that allows researchers to harness their full potential. These cards now have over 100 processor cores (currently at 1.3 GHz), making them a powerful coprocessor able to accelerate scientific codes by one or more orders of magnitude by taking advantage of parallelism in the most computationally expensive mathematical operations.
The goal of my summer research was to investigate the use of these processors for fluid dynamics applications (historically one of the most computationally expensive fields). To test the GPU under non-ideal conditions, a first-order unstructured finite volume algorithm was chosen, as it is fairly simple, yet relatively difficult to parallelize. It was found that even for this algorithm, a speedup of over 25 times can be achieved. The tech report can be found at http://johanndahm.com/papers.php.
I am currently working to develop new solvers for general fluid dynamics (and convection) problems that are less memory-bound. One problem with many current solvers is that the methods they use store too much data (often many gigabytes). Our hope is that explicit or implicit-explicit iterative methods can be developed that accelerate convergence of these problems with far less memory usage. In addition, we hope that these methods can be structured so that they run on a number of parallel platforms, including CPU clusters (using MPI) or GPUs.
September 13, 2009
Haskell Kernel Modules
Insanity! Someone has made it possible to write kernel modules in Haskell. (FYI, Haskell is a functional language with very strong typing.) Currently, they support only x86, but I wouldn't be surprised if some other architectures got a port soonish.
September 02, 2009
Roadmap for pNFS in the Linux kernel, continued
In this note, we look ahead at adding pNFS to the Linux kernel.
We expect the 2.6.32 kernel to "open up" in a few days. That kernel will have a preliminary implementation of the client and server sides of the sessions communication layer, and the back channel is being merged in.
So when will pNFS RPC operations be merged in?
Although prototype implementations of client and server support for file, block, and object layouts have been around for some time, it's not looking good for layout ops in 2.6.32. Let's take a look at each of them.
The file layout client has been tested at numerous interoperability events. However, before the code can be merged into the kernel, the developers have to submit patches for review, and this step has not yet been taken. Moreover, we hear that the developers intend to rewrite the client layout code before refactoring and submitting patches. So, it's fair to say that the file layout client is iffy even for 2.6.33.
There are two candidate implementations of the file layout metadata server, one based on GFS2, the other on spNFS.
CITI developed a file metadata layout server by extending GFS2, one of two cluster file systems in the Linux kernel. Lock contention in GFS2 may limit scalability in large clusters, but CITI is working on a performance test bed — an eight-node cluster that uses Linux iSCSI targets as shared storage — where we can look at scaling properties.
Andy Adamson is enhancing and completing the effort begun at CITI, so a GFS2-based metadata server in 2.6.33 is possible.
The other file layout metadata server is NetApp's spNFS, a user space implementation that uses local disks on the data servers instead of a common shared disk. spNFS uses NFS as the server-to-server protocol. It is our understanding that the I/O path between clients and the metadata server has proven difficult to implement, and the project seems stalled for the moment. We expect NetApp to revive the effort, but not in time for 2.6.32.
One (mandatory) feature lacking from both file layout metadata server candidates is I/O stateid enforcement, which requires an (as yet unspecified) server-to-server protocol. We hear rumors that NetApp is working on a solution.
Panasas wrote an OSD-based local file system called exofs, which has been merged into the kernel. Their exofs-based pNFS implementation currently supports only a single OSD, limiting scalability and making it less interesting for pNFS, but work is underway for multiple OSD support. The pNFS code hasn't been reviewed by anyone outside of Panasas yet. It may be ready for 2.6.33.
LSI developed a block server based in part on infrastructure from spNFS, but stopped working on it and posted the code last month. Probably, no one other than the main developer has looked at or tested that code yet. It may need a lot of work.
For servers, there's a good chance that a simple version of the GFS2-based file layout server will be merged in 2.6.33. The exofs-based object layout server might be ready at about the same time. The LSI block layout server is a big question mark.
The client side has the advantage that there is no variety of backend storage architectures to choose from, so only a single project is needed for each layout type. There are still a number of architectural issues to work out to make the three layout type implementations fit together well, so we estimate client layout code will be merged in 2.6.33 or 2.6.34.
We expect that the initial submissions will pass artificial tests, but will have limitations that will prevent them from being useful in production, and that some additional months will be required to make them fast and reliable. Exactly when the various distributions will start picking them up will depend on the intended audience of the distributions, their tolerance for rough edges, and on what developers and maintainers communicate about the readiness of the code.
45 disks in a 4u box
That's 12 cents a gigabyte as opposed to 12 dimes a gigabyte for ten. Looks like it's probably also slower. (No idea what disk bandwidth they'd get, but it probably doesn't matter since they appear to have only one gigabit network interface.)
September 01, 2009
Delegations and leases
This is part of a recent report we prepared for Google, who sponsored some of CITI's Linux NFS work.
Management of delegation and leases in NFSv4 involves some tricky VFS surgery. There are basically two problems to solve:
- A mutating operation must break leases before it updates. Leases have traditionally been broken by a single call into the locking code, which introduces a potential race condition: a new lease can be requested after the old leases are broken but before the mutating operation completes.
- For NFSv4 (and also Samba), leases must be revoked on all mutating operations, but they are currently revoked only on conflicting opens.
We have a patch set that addresses both issues by
- replacing the single break_lease call by a break_lease_start ... break_lease_end pair, and
- adding calls into all the other mutating VFS operations: unlink, rename, chmod, chown, creat, mknod, mkdir, symlink, link, and rmdir.
In some cases, the modifications for completeness require delicate surgery on core parts of the VFS. For example, rename takes kernel mutex locks on the source and target directory before calling lookup, i.e., before we discover whether there are leases to break. But breaking a lease might take dozens of seconds if the client is unreachable, so we cannot afford to break a lease while holding kernel mutex locks. Therefore, if the lookup reveals that there are leases to break, we back out of the kernel mutex locks, break the leases, then start over. (This is not guaranteed to terminate ... hope that's OK!)
To implement this, we introduced a try_break_lease operation, a non-blocking operation that tries to break a lease and either succeeds immediately or returns an error. In the latter case, the caller can release mutex locks, issue a blocking break_lease operation, then retry the operation. This implementation also meets the needs of NFSD, which cannot afford to let server threads block while waiting for an established lease to be broken.
We have been tinkering with these patches on our own for too long—regression testing, finding and fixing some small bugs, adding comments, and reworking the interface to make the goals clearer—when we should have been sending them out for comments. That will be remedied soon. For now the patch set is available from the “leases” branch of:
which is browsable here.
We have also written some prototype code to support directory leases, which are needed to support NFSv4 directory delegations.