You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
331 lines
15 KiB
Markdown
331 lines
15 KiB
Markdown
5 years ago
|
# Fast directory listing
|
||
|
|
||
|
In order to find untracked files in a git repository, [gitstatusd](../README.md) needs to list the
|
||
|
contents of every directory. gitstatusd does it 27% faster than a reasonable implementation that a
|
||
|
seasoned C/C++ practitioner might write. This document explains the optimizations that went into it.
|
||
|
As directory listing is a common operation, many other projects can benefit from applying these
|
||
|
optimizations.
|
||
|
|
||
|
## v1
|
||
|
|
||
|
Given a path to a directory, `ListDir()` must produce the list of files in that directory. Moreover,
|
||
|
the list must be sorted lexicographically to enable fast comparison with Git index.
|
||
|
|
||
|
The following C++ implementation gets the job done. For simplicity, it returns an empty list on
|
||
|
error.
|
||
|
|
||
|
```c++
|
||
|
vector<string> ListDir(const char* dirname) {
|
||
|
vector<string> entries;
|
||
|
if (DIR* dir = opendir(dirname)) {
|
||
|
while (struct dirent* ent = (errno = 0, readdir(dir))) {
|
||
|
if (!Dots(ent->d_name)) entries.push_back(ent->d_name);
|
||
|
}
|
||
|
if (errno) entries.clear();
|
||
|
sort(entries.begin(), entries.end());
|
||
|
closedir(dir);
|
||
|
}
|
||
|
return entries;
|
||
|
}
|
||
|
```
|
||
|
|
||
|
Every directory has entries `"."` and `".."`, which we aren't interested in. We filter them out with
|
||
|
a helper function `Dots()`.
|
||
|
|
||
|
```c++
|
||
|
bool Dots(const char* s) { return s[0] == '.' && (!s[1] || (s[1] == '.' && !s[2])); }
|
||
|
```
|
||
|
|
||
|
To check how fast `ListDir()` performs, we can run it many times on a typical directory. One million
|
||
|
runs on a directory with 32 files with 16-character names takes 12.7 seconds.
|
||
|
|
||
|
## v2
|
||
|
|
||
|
Experienced C++ practitioners will scoff at our implementation of `ListDir()`. If it's meant to be
|
||
|
efficient, returning `vector<string>` is an unaffordable convenience. To avoid heap allocations we
|
||
|
can use a simple arena that will allow us to reuse memory between different `ListDir()` calls.
|
||
|
|
||
|
(Changed and added lines are marked with comments.)
|
||
|
|
||
|
```c++
|
||
|
void ListDir(const char* dirname, string& arena, vector<char*>& entries) { // +
|
||
|
entries.clear(); // +
|
||
|
if (DIR* dir = opendir(dirname)) {
|
||
|
arena.clear(); // +
|
||
|
while (struct dirent* ent = (errno = 0, readdir(dir))) {
|
||
|
if (!Dots(ent->d_name)) {
|
||
|
entries.push_back(reinterpret_cast<char*>(arena.size())); // +
|
||
|
arena.append(ent->d_name, strlen(ent->d_name) + 1); // +
|
||
|
}
|
||
|
}
|
||
|
if (errno) entries.clear();
|
||
|
for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)]; // +
|
||
|
sort(entries.begin(), entries.end(), // +
|
||
|
[](const char* a, const char* b) { return strcmp(a, b) < 0; }); // +
|
||
|
closedir(dir);
|
||
|
}
|
||
|
}
|
||
|
```
|
||
|
|
||
|
To make performance comparison easier, we can normalize them relative to the baseline. v1 will get
|
||
|
performance score of 100. A twice-as-fast alternative will be 200.
|
||
|
|
||
|
| version | optimization | score |
|
||
|
|---------|----------------------------|----------:|
|
||
|
| v1 | baseline | 100.0 |
|
||
|
| **v2** | **avoid heap allocations** | **112.7** |
|
||
|
|
||
|
Avoiding heap allocations makes `ListDir()` 12.7% faster. Not bad. As an added bonus, those casts
|
||
|
will fend off the occasional frontend developer who accidentally wanders into the codebase.
|
||
|
|
||
|
## v3
|
||
|
|
||
|
`opendir()` is an expensive call whose performance is linear in the number of subdirectories in the
|
||
|
path because it needs to perform a lookup for every one of them. We can replace it with `openat()`,
|
||
|
which takes a file descriptor to the parent directory and a name of the subdirectory. Just a single
|
||
|
lookup, less CPU time. This optimization assumes that callers already have a descriptor to the
|
||
|
parent directory, which is indeed the case for gitstatusd, and is often the case in other
|
||
|
applications that traverse filesystem.
|
||
|
|
||
|
```c++
|
||
|
void ListDir(int parent_fd, const char* dirname, string& arena, vector<char*>& entries) { // +
|
||
|
entries.clear();
|
||
|
int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC); // +
|
||
|
if (dir_fd < 0) return; // +
|
||
|
if (DIR* dir = fdopendir(dir_fd)) {
|
||
|
arena.clear();
|
||
|
while (struct dirent* ent = (errno = 0, readdir(dir))) {
|
||
|
if (!Dots(ent->d_name)) {
|
||
|
entries.push_back(reinterpret_cast<char*>(arena.size()));
|
||
|
arena.append(ent->d_name, strlen(ent->d_name) + 1);
|
||
|
}
|
||
|
}
|
||
|
if (errno) entries.clear();
|
||
|
for (char*& p : entries) p = &arena[reinterpret_cast<size_t>(p)];
|
||
|
sort(entries.begin(), entries.end(),
|
||
|
[](const char* a, const char* b) { return strcmp(a, b) < 0; });
|
||
|
closedir(dir);
|
||
|
} else { // +
|
||
|
close(dir_fd); // +
|
||
|
} // +
|
||
|
}
|
||
|
```
|
||
|
|
||
|
This is worth about 3.5% in speed.
|
||
|
|
||
|
| version | optimization | score |
|
||
|
|---------|--------------------------------------|----------:|
|
||
|
| v1 | baseline | 100.0 |
|
||
|
| v2 | avoid heap allocations | 112.7 |
|
||
|
| **v3** | **open directories with `openat()`** | **116.2** |
|
||
|
|
||
|
## v4
|
||
|
|
||
|
Copying file names to the arena isn't free but it doesn't seem like we can avoid it. Poking around
|
||
|
we can see that the POSIX API we are using is implemented on Linux on top of `getdents64` system
|
||
|
call. Its documentation isn't very encouraging:
|
||
|
|
||
|
```text
|
||
|
These are not the interfaces you are interested in. Look at
|
||
|
readdir(3) for the POSIX-conforming C library interface. This page
|
||
|
documents the bare kernel system call interfaces.
|
||
|
|
||
|
Note: There are no glibc wrappers for these system calls.
|
||
|
```
|
||
|
|
||
|
Hmm... The API looks like something we can take advantage of, so let's try it anyway.
|
||
|
|
||
|
First, we'll need a simple `Arena` class that can allocate 8KB blocks of memory.
|
||
|
|
||
|
```c++
|
||
|
class Arena {
|
||
|
public:
|
||
|
enum { kBlockSize = 8 << 10 };
|
||
|
|
||
|
char* Alloc() {
|
||
|
if (cur_ == blocks_.size()) blocks_.emplace_back(kBlockSize, 0);
|
||
|
return blocks_[cur_++].data();
|
||
|
}
|
||
|
|
||
|
void Clear() { cur_ = 0; }
|
||
|
|
||
|
private:
|
||
|
size_t cur_ = 0;
|
||
|
vector<string> blocks_;
|
||
|
};
|
||
|
```
|
||
|
|
||
|
Next, we need to define `struct dirent64_t` ourselves because there is no wrapper for the system
|
||
|
call we are about to use.
|
||
|
|
||
|
```c++
|
||
|
struct dirent64_t {
|
||
|
ino64_t d_ino;
|
||
|
off64_t d_off;
|
||
|
unsigned short d_reclen;
|
||
|
unsigned char d_type;
|
||
|
char d_name[];
|
||
|
};
|
||
|
```
|
||
|
|
||
|
Finally we can get to the implementation of `ListDir()`.
|
||
|
|
||
|
```c++
|
||
|
void ListDir(int parent_fd, Arena& arena, vector<char*>& entries) { // +
|
||
|
entries.clear();
|
||
|
int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);
|
||
|
if (dir_fd < 0) return;
|
||
|
arena.Clear(); // +
|
||
|
while (true) { // +
|
||
|
char* buf = arena.Alloc(); // +
|
||
|
int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize); // +
|
||
|
if (n <= 0) { // +
|
||
|
if (n) entries.clear(); // +
|
||
|
break; // +
|
||
|
} // +
|
||
|
for (int pos = 0; pos < n;) { // +
|
||
|
auto* ent = reinterpret_cast<dirent64_t*>(buf + pos); // +
|
||
|
if (!Dots(ent->d_name)) entries.push_back(ent->d_name); // +
|
||
|
pos += ent->d_reclen; // +
|
||
|
} // +
|
||
|
} // +
|
||
|
sort(entries.begin(), entries.end(),
|
||
|
[](const char* a, const char* b) { return strcmp(a, b) < 0; });
|
||
|
close(dir_fd);
|
||
|
}
|
||
|
```
|
||
|
|
||
|
How are we doing with this one?
|
||
|
|
||
|
| version | optimization | score |
|
||
|
|---------|----------------------------------|----------:|
|
||
|
| v1 | baseline | 100.0 |
|
||
|
| v2 | avoid heap allocations | 112.7 |
|
||
|
| v3 | open directories with `openat()` | 116.2 |
|
||
|
| **v4** | **call `getdents64()` directly** | **137.8** |
|
||
|
|
||
|
Solid 20% speedup. Worth the trouble. Unfortunately, we now have just one `reinterpret_cast` instead
|
||
|
of two, and it's not nearly as scary-looking. Hopefully with the next iteration we can get back some
|
||
|
of that evil vibe of low-level code.
|
||
|
|
||
|
As a bonus, every element in `entries` has `d_type` at offset -1. This can be useful to the callers
|
||
|
that need to distinguish between regular files and directories (gitstatusd, in fact, needs this).
|
||
|
Note how `ListDir()` implements this feature at zero cost, as a lucky accident of `dirent64_t`
|
||
|
memory layout.
|
||
|
|
||
|
## v5
|
||
|
|
||
|
The CPU profile of `ListDir()` reveals that almost all userspace CPU time is spent in `strcmp()`.
|
||
|
Digging into the source code of `std::sort()` we can see that it uses Insertion Sort for short
|
||
|
collections. Our 32-element vector falls under the threshold. Insertion Sort makes `O(N^2)`
|
||
|
comparisons, hence a lot of CPU time in `strcmp()`. Switching to `qsort()` or
|
||
|
[Timsort](https://en.wikipedia.org/wiki/Timsort) is of no use as all good sorting algorithms fall
|
||
|
back to Insertion Sort.
|
||
|
|
||
|
If we cannot make fewer comparisons, perhaps we can make each of them faster? `strcmp()` compares
|
||
|
characters one at a time. It cannot read ahead as it can be illegal to touch memory past the first
|
||
|
null byte. But _we_ know that it's safe to read a few extra bytes past the end of `d_name` for every
|
||
|
entry except the last in the buffer. And since we own the buffer, we can overallocate it so that
|
||
|
reading past the end of the last entry is also safe.
|
||
|
|
||
|
Combining these ideas with the fact that file names on Linux are at most 255 bytes long, we can
|
||
|
invoke `getdents64()` like this:
|
||
|
|
||
|
```c++
|
||
|
int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256);
|
||
|
```
|
||
|
|
||
|
And then compare entries like this:
|
||
|
|
||
|
```c++
|
||
|
[](const char* a, const char* b) { return memcmp(a, b, 255) < 0; }
|
||
|
```
|
||
|
|
||
|
This version doesn't give any speedup compared to the previous but it opens an avenue for another
|
||
|
optimization. The pointers we pass to `memcmp()` aren't aligned. To be more specific, their
|
||
|
numerical values are `N * 8 + 3` for some `N`. When given such a pointer, `memcmp()` will check the
|
||
|
first 5 bytes one by one, and only then switch to comparing 8 bytes at a time. If we can handle the
|
||
|
first 5 bytes ourselves, we can pass aligned memory to `memcmp()` and take full advantage of its
|
||
|
vectorized loop.
|
||
|
|
||
|
Here's the implementation:
|
||
|
|
||
|
```c++
|
||
|
uint64_t Read64(const void* p) { // +
|
||
|
uint64_t x; // +
|
||
|
memcpy(&x, p, sizeof(x)); // +
|
||
|
return x; // +
|
||
|
} // +
|
||
|
|
||
|
void ByteSwap64(void* p) { // +
|
||
|
uint64_t x = __builtin_bswap64(Read64(p)); // +
|
||
|
memcpy(p, &x, sizeof(x)); // +
|
||
|
} // +
|
||
|
|
||
|
void ListDir(int parent_fd, Arena& arena, vector<char*>& entries) {
|
||
|
entries.clear();
|
||
|
int dir_fd = openat(parent_fd, dirname, O_NOATIME | O_RDONLY | O_DIRECTORY | O_CLOEXEC);
|
||
|
if (dir_fd < 0) return;
|
||
|
arena.Clear();
|
||
|
while (true) {
|
||
|
char* buf = arena.Alloc();
|
||
|
int n = syscall(SYS_getdents64, dir_fd, buf, Arena::kBlockSize - 256); // +
|
||
|
if (n <= 0) {
|
||
|
if (n) entries.clear();
|
||
|
break;
|
||
|
}
|
||
|
for (int pos = 0; pos < n;) {
|
||
|
auto* ent = reinterpret_cast<dirent64_t*>(buf + pos);
|
||
|
if (!Dots(ent->d_name)) {
|
||
|
ByteSwap64(ent->d_name); // +
|
||
|
entries.push_back(ent->d_name);
|
||
|
}
|
||
|
pos += ent->d_reclen;
|
||
|
}
|
||
|
}
|
||
|
sort(entries.begin(), entries.end(), [](const char* a, const char* b) {
|
||
|
uint64_t x = Read64(a); // +
|
||
|
uint64_t y = Read64(b); // +
|
||
|
return x < y || (x == y && a != b && memcmp(a + 5, b + 5, 256) < 0); // +
|
||
|
});
|
||
|
for (char* p : entries) ByteSwap64(p); // +
|
||
|
close(dir_fd);
|
||
|
}
|
||
|
```
|
||
|
|
||
|
This is for Little Endian architecture. Big Endian doesn't need `ByteSwap64()`, so it'll be a bit
|
||
|
faster.
|
||
|
|
||
|
| version | optimization | score |
|
||
|
|---------|----------------------------------|----------:|
|
||
|
| v1 | baseline | 100.0 |
|
||
|
| v2 | avoid heap allocations | 112.7 |
|
||
|
| v3 | open directories with `openat()` | 116.2 |
|
||
|
| v4 | call `getdents64()` directly | 137.8 |
|
||
|
| **v5** | **hand-optimize `strcmp()`** | **143.3** |
|
||
|
|
||
|
Fast and respectably arcane.
|
||
|
|
||
|
## Conclusion
|
||
|
|
||
|
Through a series of incremental improvements we've sped up directory listing by 43.3% compared to a
|
||
|
naive implementation (v1) and 27.2% compared to a reasonable implementation that a seasoned C/C++
|
||
|
practitioner might write (v2).
|
||
|
|
||
|
However, these numbers are based on an artificial benchmark while the real judge is always the real
|
||
|
code. Our goal was to speed up gitstatusd. Benchmark was just a tool. Thankfully, the different
|
||
|
versions of `ListDir()` have the same comparative performance within gitstatusd as in the benchmark.
|
||
|
In truth, the directory chosen for the benchmark wasn't arbitrary. It was picked by sampling
|
||
|
gitstatusd when it runs on [chromium](https://github.com/chromium/chromium) git repository.
|
||
|
|
||
|
The final version of `ListDir()` spends 97% of its CPU time in the kernel. If we assume that it
|
||
|
makes the minimum possible number of system calls and these calls are optimal (true to the best
|
||
|
of my knowledge), it puts the upper bound on possible future performance improvements at just 3%.
|
||
|
There is almost nothing left in `ListDir()` to optimize.
|
||
|
|
||
|
![ListDir() CPU profile](
|
||
|
https://raw.githubusercontent.com/romkatv/gitstatus/1ac366952366d89980b3f3484f270b4fa5ae4293/cpu-profile-listdir.png)
|
||
|
|
||
|
(The CPU profile was created with [gperftools](https://github.com/gperftools/gperftools) and
|
||
|
rendered with [pprof](https://github.com/google/pprof)).
|