From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Wed, 17 Oct 2018 11:20:58 +0300 From: Vladimir Davydov Subject: Re: [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space Message-ID: <20181017082058.ebdyx3kwk42ln5tw@esperanza> References: <20181016190522.GH5454@chai> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20181016190522.GH5454@chai> To: Konstantin Osipov Cc: tarantool-patches@freelists.org List-ID: On Tue, Oct 16, 2018 at 10:05:22PM +0300, Konstantin Osipov wrote: > * Vladimir Davydov [18/10/08 13:52]: > > If a replica permanently stops working for some reason, it will pin WAL > > files it would need to resume until it is deleted from the _cluster > > system space or the master is restarted. This happens in production when > > an admin drops a replica and forgets to remove it from the master, and > > this is quite annoying, because it may result in ENOSPC errors on the > > master. > > I started benching this patch to check whether fallocate() introduces a > performance regression and discovered that there is a general 45% regression > between 1.6 and 1.10. > > I hope finally once I have pointed it out and A.Lyapunov has > pointed it out, it will be addressed. > > In any case we need to measure fallocate() impact very carefully > before adding it. It seems we make things unnecessarily > complicated all in order to spare the user from spurious ENOSPC, > In my opinion it's a non-goal. > > If we get rid of this requirement, we don't need fallocate(), and > the patch could be made simpler in a couple more dimensions. This requirement is a must IMO. What's the point of returning ENOSPC and alerting the user if we can avoid that? I can foresee users complaining about it and opening issues, like "spurious ENOSPC when there's enough disk space" (because there will be enough disk space once gc has removed stale replicas). > > Please consider making a trivial patch which follows the steps of > the patch by @belyak It's not trivial and it's ugly. I wonder why you fail to see that. Ping-ponging messages from WAL to TX in order to remove files? Introducing yet another pipe for that. What for? WAL thread already has all the necessary information about xlog files. It just needs to be told what's the oldest WAL row it has to preserve in any case. Moreover, shooting off consumers before deleting WAL files is semantically incorrect, because the garbage collector knows nothing about WAL files. For GC there's a continuous range of WAL rows it tracks. Dividing those rows in files is a business of the WAL thread. So how's it going to work if we put the TX thread responsible for triggering WAL file deletion? WAL sends ENOSPC signal to TX. TX shoots off a consumer. WAL retries, ENOSPC again, because no file was deleted! Sends ENOSPC to TX again and so forth. Do you really want this?! The design proposed in this patch is simple and clear. When invoked, GC lets WAL know about rows that can be pruned right now and rows that can be pruned in case of emergency. When hitting ENOSPC, WAL deletes old WALs on its own basing on this information and notifies TX via the existing notification subsystem (wal_watcher) so that the latter can shoot off replicas that would need those files. Regarding usage of falloate. I could implement this patch without it, but it would be a bit more difficult, because there wouldn't be a clear point of ENOSPC failure. Besides, what would happen if we wrote half of a transaction to disk? How replication would work then? BTW, triggering WAL deletion on behalf of TX suffers from the very same problem. That is we are risking not only returning ENOSPC to the user, but also breaking replication in a peculiar way. Anyway, you seem to be unaware of the fact that one of fallocate use cases is speeding writes by reducing the number of file size updates (which require a write to the inode table). I wrote a simple test that demonstrates that, see below. vlad@esperanza test$ gcc -O2 fallocate_test.c -o fallocate_test vlad@esperanza test$ ./fallocate_test Usage: ./fallocate_test filename - test file write_count - number of writes (append) write_size - write(2) size alloc_size - fallocate(2) size returns time in seconds vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0 1.548161 vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0 1.505698 vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000 1.195223 vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000 1.137996 I have stock ext4 over hdd on my laptop, configured by Debian, no tuning. The code is right below. That is for an append-only workload similar to WAL fallocate yields ~25% gain. True, fallocate might need some tuning (how much to allocate for different write sizes), but it's something we definitely want to have on board. vlad@esperanza test$ cat fallocate_test.c #include #include #include #include #include #include #include #include #include double gettime(void) { struct timespec ts; if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) { perror("clock_monotonic"); exit(EXIT_FAILURE); } return ts.tv_sec + ts.tv_nsec / 1e9; } int main(int argc, char **argv) { if (argc != 5) { fprintf(stderr, "Usage: %s " " \n" "filename - test file\n" "write_count - number of writes (append)\n" "write_size - write(2) size\n" "alloc_size - fallocate(2) size\n" "returns time in seconds\n", argv[0]); return -1; } const char *filename = argv[1]; int write_count = atoi(argv[2]); int write_size = atoi(argv[3]); int alloc_size = atoi(argv[4]); char *buf = malloc(write_size); if (buf == NULL) { perror("malloc"); exit(EXIT_FAILURE); } memset(buf, 1, write_size); int fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666); if (fd < 0) { perror("open"); exit(EXIT_FAILURE); } double t1 = gettime(); off_t offset = 0; int prealloced = 0; for (int i = 0; i < write_count; i++) { if (alloc_size > 0 && prealloced < write_size) { errno = posix_fallocate(fd, offset, alloc_size); if (errno != 0) { perror("posix_fallocate"); exit(EXIT_FAILURE); } prealloced += alloc_size; } ssize_t written = write(fd, buf, write_size); if (written < 0) { perror("write"); exit(EXIT_FAILURE); } offset += written; prealloced -= written; if (prealloced < 0) prealloced = 0; } double t2 = gettime(); close(fd); unlink(filename); free(buf); printf("%f\n", __func__, t2 - t1); return 0; }