[tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space

Wed Oct 17 11:20:58 MSK 2018

On Tue, Oct 16, 2018 at 10:05:22PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev at gmail.com> [18/10/08 13:52]:
> > If a replica permanently stops working for some reason, it will pin WAL
> > files it would need to resume until it is deleted from the _cluster
> > system space or the master is restarted. This happens in production when
> > an admin drops a replica and forgets to remove it from the master, and
> > this is quite annoying, because it may result in ENOSPC errors on the
> > master.
> 
> I started benching this patch to check whether fallocate() introduces a 
> performance regression and  discovered that there is a general 45% regression
> between 1.6 and 1.10.
> 
> I hope finally once I have pointed it out and A.Lyapunov has
> pointed it out, it will be addressed.
> 
> In any case we need to measure fallocate() impact very carefully
> before adding it. It seems we make things unnecessarily
> complicated all in order to spare the user from spurious ENOSPC,
> In my opinion it's a non-goal.
> 
> If we get rid of this requirement, we don't need fallocate(), and
> the patch could be made simpler in a couple more dimensions.

This requirement is a must IMO. What's the point of returning ENOSPC and
alerting the user if we can avoid that? I can foresee users complaining
about it and opening issues, like "spurious ENOSPC when there's enough
disk space" (because there will be enough disk space once gc has removed
stale replicas).

> 
> Please consider making a trivial patch which follows the steps of
> the patch by @belyak

It's not trivial and it's ugly. I wonder why you fail to see that.
Ping-ponging messages from WAL to TX in order to remove files?
Introducing yet another pipe for that. What for? WAL thread already has
all the necessary information about xlog files. It just needs to be told
what's the oldest WAL row it has to preserve in any case.

Moreover, shooting off consumers before deleting WAL files is
semantically incorrect, because the garbage collector knows nothing
about WAL files. For GC there's a continuous range of WAL rows it
tracks. Dividing those rows in files is a business of the WAL thread.
So how's it going to work if we put the TX thread responsible for
triggering WAL file deletion? WAL sends ENOSPC signal to TX. TX shoots
off a consumer. WAL retries, ENOSPC again, because no file was deleted!
Sends ENOSPC to TX again and so forth. Do you really want this?!

The design proposed in this patch is simple and clear. When invoked, GC
lets WAL know about rows that can be pruned right now and rows that can
be pruned in case of emergency. When hitting ENOSPC, WAL deletes old
WALs on its own basing on this information and notifies TX via the
existing notification subsystem (wal_watcher) so that the latter can
shoot off replicas that would need those files.

Regarding usage of falloate. I could implement this patch without it,
but it would be a bit more difficult, because there wouldn't be a clear
point of ENOSPC failure. Besides, what would happen if we wrote half of
a transaction to disk? How replication would work then? BTW, triggering
WAL deletion on behalf of TX suffers from the very same problem. That is
we are risking not only returning ENOSPC to the user, but also breaking
replication in a peculiar way.

Anyway, you seem to be unaware of the fact that one of fallocate use
cases is speeding writes by reducing the number of file size updates
(which require a write to the inode table). I wrote a simple test that
demonstrates that, see below.

vlad at esperanza test$ gcc -O2 fallocate_test.c -o fallocate_test
vlad at esperanza test$ ./fallocate_test
Usage: ./fallocate_test <filename> <write_count> <write_size> <alloc_size>
filename - test file
write_count - number of writes (append)
write_size - write(2) size
alloc_size - fallocate(2) size
returns time in seconds
vlad at esperanza test$ ./fallocate_test dummy 1000000 100 0
1.548161
vlad at esperanza test$ ./fallocate_test dummy 1000000 100 0
1.505698
vlad at esperanza test$ ./fallocate_test dummy 1000000 100 100000
1.195223
vlad at esperanza test$ ./fallocate_test dummy 1000000 100 100000
1.137996

I have stock ext4 over hdd on my laptop, configured by Debian, no
tuning. The code is right below. That is for an append-only workload
similar to WAL fallocate yields ~25% gain.

True, fallocate might need some tuning (how much to allocate for
different write sizes), but it's something we definitely want to have on
board.

vlad at esperanza test$ cat fallocate_test.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <time.h>

double gettime(void)
{
        struct timespec ts;
        if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
                perror("clock_monotonic");
                exit(EXIT_FAILURE);
        }
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
        if (argc != 5) {
                fprintf(stderr, "Usage: %s <filename> <write_count> "
                        "<write_size> <alloc_size>\n"
                        "filename - test file\n"
                        "write_count - number of writes (append)\n"
                        "write_size - write(2) size\n"
                        "alloc_size - fallocate(2) size\n"
                        "returns time in seconds\n",
                        argv[0]);
                return -1;
        }

        const char *filename = argv[1];
        int write_count = atoi(argv[2]);
        int write_size = atoi(argv[3]);
        int alloc_size = atoi(argv[4]);

        char *buf = malloc(write_size);
        if (buf == NULL) {
                perror("malloc");
                exit(EXIT_FAILURE);
        }
        memset(buf, 1, write_size);

        int fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
        if (fd < 0) {
                perror("open");
                exit(EXIT_FAILURE);
        }

        double t1 = gettime();

        off_t offset = 0;
        int prealloced = 0;
        for (int i = 0; i < write_count; i++) {
                if (alloc_size > 0 && prealloced < write_size) {
                        errno = posix_fallocate(fd, offset, alloc_size);
                        if (errno != 0) {
                                perror("posix_fallocate");
                                exit(EXIT_FAILURE);
                        }
                        prealloced += alloc_size;
                }
                ssize_t written = write(fd, buf, write_size);
                if (written < 0) {
                        perror("write");
                        exit(EXIT_FAILURE);
                }
                offset += written;
                prealloced -= written;
                if (prealloced < 0)
                        prealloced = 0;
        }

        double t2 = gettime();

        close(fd);
        unlink(filename);
        free(buf);

        printf("%f\n", __func__, t2 - t1);
        return 0;
}