Sunday, March 23, 2008

rsync and sparse files

Unfortunately, Linux has no syscall to check whether a file is sparse. There are some nice ideas for extending the lseek() syscall to provide such a feature, as in ZFS (SEEK_HOLE and SEEK_DATA for sparse files), but there's nothing ready for production filesystems yet.

The common approach for user space applications is to implement a heuristic to check if a file can be treated as sparse (and save disk space) or not (and just write bytes to disk).

rsync, for example, checks every 1024-byte chunk before writing data to a generic destination file. If a chunk starts or ends with zeros, those zeros are not written: rsync simply skips over them with lseek().
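To make the heuristic concrete, here is a minimal Python sketch of the idea. It is not rsync's actual code (rsync is written in C); the function names sparse_write and _write_all and the block_size parameter are my own, chosen for illustration. It assumes a POSIX-like system where seeking past the end of a file and then truncating to that position leaves a hole.

```python
import os

def _write_all(fd, buf):
    # os.write() may write fewer bytes than requested, so loop until done.
    view = memoryview(buf)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]

def sparse_write(fd, data, block_size=1024):
    """Write data to fd, seeking over leading/trailing zeros of each block.

    Returns the number of lseek() calls used to skip zeros.
    Simplified illustration only, not rsync's implementation.
    """
    seeks = 0
    pending = 0  # zero bytes seen but not yet skipped
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        body = chunk.lstrip(b"\0")
        pending += len(chunk) - len(body)   # leading zeros
        if not body:
            continue                        # block is all zeros
        core = body.rstrip(b"\0")
        if pending:
            os.lseek(fd, pending, os.SEEK_CUR)
            seeks += 1
        _write_all(fd, core)
        pending = len(body) - len(core)     # trailing zeros
    if pending:
        # File ends with zeros: seek over them and fix the final length.
        end = os.lseek(fd, pending, os.SEEK_CUR)
        os.ftruncate(fd, end)
        seeks += 1
    return seeks
```

Zeros in the middle of a block are written as ordinary data; only the runs at the block edges are turned into seeks, which is why the block size matters.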

Unfortunately, on some filesystems, typically those optimized for large sequential I/O throughput (like IBM GPFS, IBM SAN FS, or distributed filesystems in general), a large number of lseek() operations can severely hurt performance.

In these cases it can be very helpful to enlarge the block size used to handle sparse files.

For example, using a sparse write size of 32KB, I've been able to increase the transfer rate by an order of magnitude when copying the output files of scientific applications from GPFS to GPFS or from GPFS to SAN FS.
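A rough way to see why the block size matters is to count how many seeks the edge-of-block heuristic would issue for the same data at different block sizes. The sketch below does only that; it performs no I/O and says nothing about real transfer rates. The function name count_seeks and the synthetic data (1000 non-zero bytes followed by 24 zeros, repeated) are assumptions made up for the illustration, not the layout of my actual files.

```python
def count_seeks(data, block_size):
    """Count the lseek() calls an edge-of-block sparse heuristic would need."""
    seeks = 0
    pending = 0  # zero bytes waiting to be skipped
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        body = chunk.lstrip(b"\0")
        pending += len(chunk) - len(body)
        if not body:
            continue                 # all-zero block, keep accumulating
        if pending:
            seeks += 1               # one seek skips all pending zeros
        pending = len(body) - len(body.rstrip(b"\0"))
    return seeks + (1 if pending else 0)  # final seek for trailing zeros

# Hypothetical data: many short zero runs, 1 MiB in total.
record = b"\x01" * 1000 + b"\0" * 24
data = record * 1024

for size in (1024, 32 * 1024):
    print(size, count_seeks(data, size))
```

With this synthetic input the count drops from 1024 seeks at a 1024-byte block to 32 seeks at 32KB, at the cost of writing the short zero runs that now fall inside a block.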

Read this thread on the rsync mailing list.

And here is the patch that adds a --sparse-block=SIZE option to rsync, which lets you tune this parameter at run time.
