From: Vladimir Davydov <vdavydov.dev@gmail.com> To: Konstantin Osipov <kostja@tarantool.org> Cc: tarantool-patches@freelists.org Subject: Re: [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space Date: Wed, 17 Oct 2018 11:20:58 +0300 [thread overview] Message-ID: <20181017082058.ebdyx3kwk42ln5tw@esperanza> (raw) In-Reply-To: <20181016190522.GH5454@chai> On Tue, Oct 16, 2018 at 10:05:22PM +0300, Konstantin Osipov wrote: > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/08 13:52]: > > If a replica permanently stops working for some reason, it will pin WAL > > files it would need to resume until it is deleted from the _cluster > > system space or the master is restarted. This happens in production when > > an admin drops a replica and forgets to remove it from the master, and > > this is quite annoying, because it may result in ENOSPC errors on the > > master. > > I started benching this patch to check whether fallocate() introduces a > performance regression and discovered that there is a general 45% regression > between 1.6 and 1.10. > > I hope finally once I have pointed it out and A.Lyapunov has > pointed it out, it will be addressed. > > In any case we need to measure fallocate() impact very carefully > before adding it. It seems we make things unnecessarily > complicated all in order to spare the user from spurious ENOSPC, > In my opinion it's a non-goal. > > If we get rid of this requirement, we don't need fallocate(), and > the patch could be made simpler in a couple more dimensions. This requirement is a must IMO. What's the point of returning ENOSPC and alerting the user if we can avoid that? I can foresee users complaining about it and opening issues, like "spurious ENOSPC when there's enough disk space" (because there will be enough disk space once gc has removed stale replicas). > > Please consider making a trivial patch which follows the steps of > the patch by @belyak It's not trivial and it's ugly. I wonder why you fail to see that. Ping-ponging messages from WAL to TX in order to remove files? Introducing yet another pipe for that. What for? WAL thread already has all the necessary information about xlog files. It just needs to be told what's the oldest WAL row it has to preserve in any case. Moreover, shooting off consumers before deleting WAL files is semantically incorrect, because the garbage collector knows nothing about WAL files. For GC there's a continuous range of WAL rows it tracks. Dividing those rows in files is a business of the WAL thread. So how's it going to work if we put the TX thread responsible for triggering WAL file deletion? WAL sends ENOSPC signal to TX. TX shoots off a consumer. WAL retries, ENOSPC again, because no file was deleted! Sends ENOSPC to TX again and so forth. Do you really want this?! The design proposed in this patch is simple and clear. When invoked, GC lets WAL know about rows that can be pruned right now and rows that can be pruned in case of emergency. When hitting ENOSPC, WAL deletes old WALs on its own basing on this information and notifies TX via the existing notification subsystem (wal_watcher) so that the latter can shoot off replicas that would need those files. Regarding usage of falloate. I could implement this patch without it, but it would be a bit more difficult, because there wouldn't be a clear point of ENOSPC failure. Besides, what would happen if we wrote half of a transaction to disk? How replication would work then? BTW, triggering WAL deletion on behalf of TX suffers from the very same problem. That is we are risking not only returning ENOSPC to the user, but also breaking replication in a peculiar way. Anyway, you seem to be unaware of the fact that one of fallocate use cases is speeding writes by reducing the number of file size updates (which require a write to the inode table). I wrote a simple test that demonstrates that, see below. vlad@esperanza test$ gcc -O2 fallocate_test.c -o fallocate_test vlad@esperanza test$ ./fallocate_test Usage: ./fallocate_test <filename> <write_count> <write_size> <alloc_size> filename - test file write_count - number of writes (append) write_size - write(2) size alloc_size - fallocate(2) size returns time in seconds vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0 1.548161 vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0 1.505698 vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000 1.195223 vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000 1.137996 I have stock ext4 over hdd on my laptop, configured by Debian, no tuning. The code is right below. That is for an append-only workload similar to WAL fallocate yields ~25% gain. True, fallocate might need some tuning (how much to allocate for different write sizes), but it's something we definitely want to have on board. vlad@esperanza test$ cat fallocate_test.c #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <stdlib.h> #include <stdio.h> #include <string.h> #include <errno.h> #include <time.h> double gettime(void) { struct timespec ts; if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) { perror("clock_monotonic"); exit(EXIT_FAILURE); } return ts.tv_sec + ts.tv_nsec / 1e9; } int main(int argc, char **argv) { if (argc != 5) { fprintf(stderr, "Usage: %s <filename> <write_count> " "<write_size> <alloc_size>\n" "filename - test file\n" "write_count - number of writes (append)\n" "write_size - write(2) size\n" "alloc_size - fallocate(2) size\n" "returns time in seconds\n", argv[0]); return -1; } const char *filename = argv[1]; int write_count = atoi(argv[2]); int write_size = atoi(argv[3]); int alloc_size = atoi(argv[4]); char *buf = malloc(write_size); if (buf == NULL) { perror("malloc"); exit(EXIT_FAILURE); } memset(buf, 1, write_size); int fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666); if (fd < 0) { perror("open"); exit(EXIT_FAILURE); } double t1 = gettime(); off_t offset = 0; int prealloced = 0; for (int i = 0; i < write_count; i++) { if (alloc_size > 0 && prealloced < write_size) { errno = posix_fallocate(fd, offset, alloc_size); if (errno != 0) { perror("posix_fallocate"); exit(EXIT_FAILURE); } prealloced += alloc_size; } ssize_t written = write(fd, buf, write_size); if (written < 0) { perror("write"); exit(EXIT_FAILURE); } offset += written; prealloced -= written; if (prealloced < 0) prealloced = 0; } double t2 = gettime(); close(fd); unlink(filename); free(buf); printf("%f\n", __func__, t2 - t1); return 0; }
next prev parent reply other threads:[~2018-10-17 8:20 UTC|newest] Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-10-07 20:27 Vladimir Davydov 2018-10-07 20:27 ` [PATCH 1/5] xlog: fix filename in error messages Vladimir Davydov 2018-10-12 8:19 ` Vladimir Davydov 2018-10-16 19:07 ` [tarantool-patches] " Konstantin Osipov 2018-10-07 20:27 ` [PATCH 2/5] wal: preallocate disk space before writing rows Vladimir Davydov 2018-10-16 19:09 ` [tarantool-patches] " Konstantin Osipov 2018-10-07 20:27 ` [PATCH 3/5] xlog: allow to limit number of files deleted by xdir_collect_garbage Vladimir Davydov 2018-10-07 20:27 ` [PATCH 4/5] wal: notify watchers about wal file removal Vladimir Davydov 2018-10-07 20:27 ` [PATCH 5/5] wal: delete old wal files when running out of disk space Vladimir Davydov 2018-10-16 19:05 ` [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if " Konstantin Osipov 2018-10-17 8:20 ` Vladimir Davydov [this message] 2018-10-23 8:37 ` Vladimir Davydov 2018-10-23 8:46 ` Konstantin Osipov
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20181017082058.ebdyx3kwk42ln5tw@esperanza \ --to=vdavydov.dev@gmail.com \ --cc=kostja@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='Re: [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox