Tarantool development patches archive
 help / color / mirror / Atom feed
From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: Konstantin Osipov <kostja@tarantool.org>
Cc: tarantool-patches@freelists.org
Subject: Re: [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space
Date: Wed, 17 Oct 2018 11:20:58 +0300	[thread overview]
Message-ID: <20181017082058.ebdyx3kwk42ln5tw@esperanza> (raw)
In-Reply-To: <20181016190522.GH5454@chai>

On Tue, Oct 16, 2018 at 10:05:22PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/08 13:52]:
> > If a replica permanently stops working for some reason, it will pin WAL
> > files it would need to resume until it is deleted from the _cluster
> > system space or the master is restarted. This happens in production when
> > an admin drops a replica and forgets to remove it from the master, and
> > this is quite annoying, because it may result in ENOSPC errors on the
> > master.
> 
> I started benching this patch to check whether fallocate() introduces a 
> performance regression and  discovered that there is a general 45% regression
> between 1.6 and 1.10.
> 
> I hope finally once I have pointed it out and A.Lyapunov has
> pointed it out, it will be addressed.
> 
> In any case we need to measure fallocate() impact very carefully
> before adding it. It seems we make things unnecessarily
> complicated all in order to spare the user from spurious ENOSPC,
> In my opinion it's a non-goal.
> 
> If we get rid of this requirement, we don't need fallocate(), and
> the patch could be made simpler in a couple more dimensions.

This requirement is a must IMO. What's the point of returning ENOSPC and
alerting the user if we can avoid that? I can foresee users complaining
about it and opening issues, like "spurious ENOSPC when there's enough
disk space" (because there will be enough disk space once gc has removed
stale replicas).

> 
> Please consider making a trivial patch which follows the steps of
> the patch by @belyak

It's not trivial and it's ugly. I wonder why you fail to see that.
Ping-ponging messages from WAL to TX in order to remove files?
Introducing yet another pipe for that. What for? WAL thread already has
all the necessary information about xlog files. It just needs to be told
what's the oldest WAL row it has to preserve in any case.

Moreover, shooting off consumers before deleting WAL files is
semantically incorrect, because the garbage collector knows nothing
about WAL files. For GC there's a continuous range of WAL rows it
tracks. Dividing those rows in files is a business of the WAL thread.
So how's it going to work if we put the TX thread responsible for
triggering WAL file deletion? WAL sends ENOSPC signal to TX. TX shoots
off a consumer. WAL retries, ENOSPC again, because no file was deleted!
Sends ENOSPC to TX again and so forth. Do you really want this?!

The design proposed in this patch is simple and clear. When invoked, GC
lets WAL know about rows that can be pruned right now and rows that can
be pruned in case of emergency. When hitting ENOSPC, WAL deletes old
WALs on its own basing on this information and notifies TX via the
existing notification subsystem (wal_watcher) so that the latter can
shoot off replicas that would need those files.

Regarding usage of falloate. I could implement this patch without it,
but it would be a bit more difficult, because there wouldn't be a clear
point of ENOSPC failure. Besides, what would happen if we wrote half of
a transaction to disk? How replication would work then? BTW, triggering
WAL deletion on behalf of TX suffers from the very same problem. That is
we are risking not only returning ENOSPC to the user, but also breaking
replication in a peculiar way.

Anyway, you seem to be unaware of the fact that one of fallocate use
cases is speeding writes by reducing the number of file size updates
(which require a write to the inode table). I wrote a simple test that
demonstrates that, see below.

vlad@esperanza test$ gcc -O2 fallocate_test.c -o fallocate_test
vlad@esperanza test$ ./fallocate_test
Usage: ./fallocate_test <filename> <write_count> <write_size> <alloc_size>
filename - test file
write_count - number of writes (append)
write_size - write(2) size
alloc_size - fallocate(2) size
returns time in seconds
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0
1.548161
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0
1.505698
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000
1.195223
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000
1.137996

I have stock ext4 over hdd on my laptop, configured by Debian, no
tuning. The code is right below. That is for an append-only workload
similar to WAL fallocate yields ~25% gain.

True, fallocate might need some tuning (how much to allocate for
different write sizes), but it's something we definitely want to have on
board.

vlad@esperanza test$ cat fallocate_test.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <time.h>

double gettime(void)
{
        struct timespec ts;
        if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
                perror("clock_monotonic");
                exit(EXIT_FAILURE);
        }
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
        if (argc != 5) {
                fprintf(stderr, "Usage: %s <filename> <write_count> "
                        "<write_size> <alloc_size>\n"
                        "filename - test file\n"
                        "write_count - number of writes (append)\n"
                        "write_size - write(2) size\n"
                        "alloc_size - fallocate(2) size\n"
                        "returns time in seconds\n",
                        argv[0]);
                return -1;
        }

        const char *filename = argv[1];
        int write_count = atoi(argv[2]);
        int write_size = atoi(argv[3]);
        int alloc_size = atoi(argv[4]);

        char *buf = malloc(write_size);
        if (buf == NULL) {
                perror("malloc");
                exit(EXIT_FAILURE);
        }
        memset(buf, 1, write_size);

        int fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
        if (fd < 0) {
                perror("open");
                exit(EXIT_FAILURE);
        }

        double t1 = gettime();

        off_t offset = 0;
        int prealloced = 0;
        for (int i = 0; i < write_count; i++) {
                if (alloc_size > 0 && prealloced < write_size) {
                        errno = posix_fallocate(fd, offset, alloc_size);
                        if (errno != 0) {
                                perror("posix_fallocate");
                                exit(EXIT_FAILURE);
                        }
                        prealloced += alloc_size;
                }
                ssize_t written = write(fd, buf, write_size);
                if (written < 0) {
                        perror("write");
                        exit(EXIT_FAILURE);
                }
                offset += written;
                prealloced -= written;
                if (prealloced < 0)
                        prealloced = 0;
        }

        double t2 = gettime();

        close(fd);
        unlink(filename);
        free(buf);

        printf("%f\n", __func__, t2 - t1);
        return 0;
}

  reply	other threads:[~2018-10-17  8:20 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-07 20:27 Vladimir Davydov
2018-10-07 20:27 ` [PATCH 1/5] xlog: fix filename in error messages Vladimir Davydov
2018-10-12  8:19   ` Vladimir Davydov
2018-10-16 19:07   ` [tarantool-patches] " Konstantin Osipov
2018-10-07 20:27 ` [PATCH 2/5] wal: preallocate disk space before writing rows Vladimir Davydov
2018-10-16 19:09   ` [tarantool-patches] " Konstantin Osipov
2018-10-07 20:27 ` [PATCH 3/5] xlog: allow to limit number of files deleted by xdir_collect_garbage Vladimir Davydov
2018-10-07 20:27 ` [PATCH 4/5] wal: notify watchers about wal file removal Vladimir Davydov
2018-10-07 20:27 ` [PATCH 5/5] wal: delete old wal files when running out of disk space Vladimir Davydov
2018-10-16 19:05 ` [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if " Konstantin Osipov
2018-10-17  8:20   ` Vladimir Davydov [this message]
2018-10-23  8:37     ` Vladimir Davydov
2018-10-23  8:46       ` Konstantin Osipov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181017082058.ebdyx3kwk42ln5tw@esperanza \
    --to=vdavydov.dev@gmail.com \
    --cc=kostja@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --subject='Re: [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox