Monday, April 16, 2007

ZERO_PAGE or not ZERO_PAGE...

Interesting discussion on the LKML about the opportunity to remove the ZERO_PAGE for anonymous mappings (http://lkml.org/lkml/2007/4/3/432).

ZERO_PAGE is a single physical page that is always filled with zeros, and it is used for zero-mapped memory areas.

It is used, for example, to initialize the anonymous pages of a task (memory that is not file-backed and exists only during the life of the task). When a program obtains fresh anonymous memory from the kernel (for example through a large malloc()), those pages must read as zero. If the program reads from that buffer before writing to it, the kernel, instead of allocating new physical pages for no good reason, maps all the accessed virtual pages to the ZERO_PAGE.

Anyway, in general, an application that reads from a freshly allocated empty buffer is quite a stupid application :-) (except when it has to work with sparse matrices!), and the ZERO_PAGE handling has a cost in every COW fault.

The following patch removes the handling of the ZERO_PAGE for anonymous memory mappings and simply allocates new physical pages when a program reads from empty buffers. Depending on your applications you should see a small performance improvement, but a higher memory consumption if you run the kind of applications mentioned above.

Side note: I'm using it on my notebook and it works fine! :-)


--- linux-2.6.20.4/mm/memory.c.orig 2007-04-06 00:23:52.000000000 +0200
+++ linux-2.6.20.4/mm/memory.c 2007-04-06 00:25:48.000000000 +0200
@@ -1569,16 +1569,11 @@

 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	if (old_page == ZERO_PAGE(address)) {
-		new_page = alloc_zeroed_user_highpage(vma, address);
-		if (!new_page)
-			goto oom;
-	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
-		if (!new_page)
-			goto oom;
-		cow_user_page(new_page, old_page, address, vma);
-	}
+
+	new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+	if (!new_page)
+		goto oom;
+	cow_user_page(new_page, old_page, address, vma);

 	/*
 	 * Re-check the pte - we dropped the lock
@@ -2088,38 +2083,24 @@
 	spinlock_t *ptl;
 	pte_t entry;

-	if (write_access) {
-		/* Allocate our own private page. */
-		pte_unmap(page_table);
+	/* Allocate our own private page. */
+	pte_unmap(page_table);

-		if (unlikely(anon_vma_prepare(vma)))
-			goto oom;
-		page = alloc_zeroed_user_highpage(vma, address);
-		if (!page)
-			goto oom;
+	if (unlikely(anon_vma_prepare(vma)))
+		goto oom;
+	page = alloc_zeroed_user_highpage(vma, address);
+	if (!page)
+		goto oom;

-		entry = mk_pte(page, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+	entry = mk_pte(page, vma->vm_page_prot);
+	entry = maybe_mkwrite(pte_mkdirty(entry), vma);

-		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-		if (!pte_none(*page_table))
-			goto release;
-		inc_mm_counter(mm, anon_rss);
-		lru_cache_add_active(page);
-		page_add_new_anon_rmap(page, vma, address);
-	} else {
-		/* Map the ZERO_PAGE - vm_page_prot is readonly */
-		page = ZERO_PAGE(address);
-		page_cache_get(page);
-		entry = mk_pte(page, vma->vm_page_prot);
-
-		ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		if (!pte_none(*page_table))
-			goto release;
-		inc_mm_counter(mm, file_rss);
-		page_add_file_rmap(page);
-	}
+	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (unlikely(!pte_none(*page_table)))
+		goto release;
+	inc_mm_counter(mm, anon_rss);
+	lru_cache_add_active(page);
+	page_add_new_anon_rmap(page, vma, address);

 	set_pte_at(mm, address, page_table, entry);

Saturday, April 14, 2007

A fast way to get users' disk usage on ext2/ext3 filesystems

The simplest way to get the per-user disk usage is to mount the filesystem with quota accounting enabled (see "man 8 mount"). This is the most reliable way to be sure that the quota limits will be respected, since the accounting and the checks are performed synchronously by the filesystem itself. Unfortunately this adds a little overhead to the whole filesystem.

Another approach is to periodically check the disk usage with a script (typically run as a cron job). The script should sum the size of each file and directory in the filesystem, grouping them by their respective owners.

This program (e2fsusage) uses the second approach, but it doesn't read the files and directories directly: it analyzes the filesystem metadata, performing a sequential scan of all the inodes.

In this way it bypasses the translation of file and directory names into their respective inodes, which strongly reduces the total time needed to scan the entire filesystem. Moreover, since it evaluates the blocks really allocated in the filesystem using i_blocks instead of i_size (see struct ext2_inode in /usr/include/ext2fs/ext2_fs.h), it is able to detect the true size occupied by each user (read it as: it is able to correctly handle sparse files).

Wednesday, April 4, 2007

How to bypass the buffer cache in Linux

Linux has 2 kinds of caches: the page cache and the buffer cache. The role of the page cache is to speed up access to files on disk; in a similar way, the buffer cache contains buffers of pages read from or being written to block devices. Both are memory areas managed in different ways (one more optimized for file objects and the other more block-device oriented).

From /proc/meminfo it is possible to monitor the memory allocated for both caches (Buffers is the buffer cache, Cached is the page cache), for example:
# cat /proc/meminfo
...
Buffers:         15116 kB
Cached:          67912 kB
...

To perform an I/O benchmark on block devices (like /dev/sda, /dev/sdb, etc.) we usually use a simple `dd`, which loads data from the device into memory (in read tests) or writes data from memory to the device (in write tests). But in these cases the data are accessed only once! There are no further reads or writes on their buffers. Here the buffer cache is only an overhead and it makes sense to bypass it.

One way is to open the files using the O_DIRECT flag. This flag bypasses the caching mechanisms and lets the block device transfer data via DMA directly from/to the userspace source/destination buffers.

Obviously there is no global flag in the kernel to say: "ok just disable buffer cache" and it's not even possible to disable the buffer cache for a single process.

If you can (and want to) patch and recompile your application, you could explicitly set the O_DIRECT flag in every open(), but it wouldn't be so handy... ;-)

Another solution is to write a simple glibc wrapper that intercepts all the open() calls and sets the O_DIRECT flag.

An example follows:

libdirectio.c
#define _GNU_SOURCE
#define __USE_GNU

#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <fcntl.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

#define DEBUG

#ifdef DEBUG
#define DPRINTF(format, args...) fprintf(stderr, "debug: " format, ##args)
#else
#define DPRINTF(format, args...)
#endif

int open(const char *, int, ...) __attribute__ ((weak, alias("wrap_open")));
int __open(const char *, int, ...) __attribute__ ((weak, alias("wrap_open")));
int open64(const char *, int, ...) __attribute__ ((weak, alias("wrap_open64")));
int __open64(const char *, int, ...) __attribute__ ((weak, alias("wrap_open64")));

static int (*orig_open)(const char *, int, ...) = NULL;
static int (*orig_open64)(const char *, int, ...) = NULL;

/* Add O_DIRECT to the flags (except for /dev/null) and call the real open. */
static int __do_wrap_open(const char *name, int flags, mode_t mode,
                          int (*func_open)(const char *, int, ...))
{
    /* strncmp() returns non-zero when name is NOT "/dev/null" */
    if (strncmp("/dev/null", name, sizeof("/dev/null"))) {
        DPRINTF("setting flags O_DIRECT on %s\n", name);
        flags |= O_DIRECT;
    }
    return func_open(name, flags, mode);
}

int wrap_open(const char *name, int flags, ...)
{
    va_list args;
    mode_t mode;

    va_start(args, flags);
    mode = va_arg(args, mode_t);
    va_end(args);

    DPRINTF("calling libc open(%s, 0x%x, 0x%x)\n", name, flags, mode);

    return __do_wrap_open(name, flags, mode, orig_open);
}

int wrap_open64(const char *name, int flags, ...)
{
    va_list args;
    mode_t mode;

    va_start(args, flags);
    mode = va_arg(args, mode_t);
    va_end(args);

    DPRINTF("calling libc open64(%s, 0x%x, 0x%x)\n", name, flags, mode);

    return __do_wrap_open(name, flags, mode, orig_open64);
}

void _init(void)
{
    orig_open = dlsym(RTLD_NEXT, "open");
    if (!orig_open) {
        fprintf(stderr, "error: missing symbol open!\n");
        exit(1);
    }
    orig_open64 = dlsym(RTLD_NEXT, "open64");
    if (!orig_open64) {
        fprintf(stderr, "error: missing symbol open64!\n");
        exit(1);
    }
}

Makefile
VERSION=0.1

TARGET=libdirectio.so.$(VERSION)
OBJS=libdirectio.o
CC=gcc
CFLAGS= -fPIC -Wall -O2 -g
SHAREDFLAGS= -nostartfiles -shared -Wl,-soname,libdirectio.so.0

all: $(TARGET)

%.o: %.c
	$(CC) -I. $(CFLAGS) -c $< -o $@

$(TARGET): $(OBJS)
	$(CC) $(SHAREDFLAGS) $(OBJS) -o $(TARGET) -lc -ldl

clean:
	rm -f $(OBJS) $(TARGET)

To compile the library simply run `make`. You can pre-load it using the LD_PRELOAD environment variable in this way:
# export LD_PRELOAD=$FULL_PATH_OF_YOUR_LIBRARY/libdirectio.so.0.1
Then you can run your brand-new direct I/O benchmark (typically `dd`) for block devices. To unload the library and restore the standard access simply run:
# unset LD_PRELOAD