From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Wed, 9 Jan 2019 16:48:09 +0300 From: Konstantin Osipov Subject: Re: [tarantool-patches] [PATCH v3 2/6] [RAW] swim: introduce failure detection component Message-ID: <20190109134809.GE20509@chai> References: <9b4bdddc30f9554b85c9890c0d6c70645c0c7930.1546077015.git.v.shpilevoy@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <9b4bdddc30f9554b85c9890c0d6c70645c0c7930.1546077015.git.v.shpilevoy@tarantool.org> To: tarantool-patches@freelists.org Cc: vdavydov.dev@gmail.com List-ID: * Vladislav Shpilevoy [18/12/29 15:07]: > enum { > /** How often to send membership messages and pings. */ > HEARTBEAT_RATE_DEFAULT = 1, > + /** > + * If a ping was sent, it is considered to be lost after > + * this time without an ack. > + */ > + ACK_TIMEOUT = 1, The timeout should be configurable. A reasonable default looks more close 30 seconds at least - we have many cases in (malfunctioning) production when long requests stall the event loop for 10-15 seconds, such requests should not lead to membership storms. > + /** > + * If a member has not been responding to pings this > + * number of times, it is considered to be dead. > + */ > + NO_ACKS_TO_DEAD = 3, > + /** > + * If a not pinned member confirmed to be dead, it is > + * removed from the membership after at least this number > + * of failed pings. > + */ > + NO_ACKS_TO_GC = NO_ACKS_TO_DEAD + 2, > }; > > + bool is_pinned; > + /** Growing number to refute old messages. */ reject, or perhaps ignore? refute is usually used for arguments in a heated conversation. > + /** > + * How many pings did not receive an ack in a row. After > + * a threshold the instance is marked as dead. After more > + * it is removed from the table (if not pinned). > + */ > + int failed_pings; These are more like unacknowledged pings. Have they failed? Maybe. > + /** When the latest ping was sent to this member. */ > + double ping_ts; last_ping_time? Why use double and not ev_time_t? > + /** Type of the failure detection message: ping or ack. */ > + SWIM_FD_MSG_TYPE, > + /** > + * Incarnation of the sender. To make the member alive if > + * it was considered to be dead, but ping/ack with greater > + * incarnation was received from it. > + */ > + SWIM_FD_INCARNATION, Uhm, FD is too commonly used for a file descriptor. Please use a different name. Why not simply SWIM_PING and SWIM_ACK? > struct swim_transport swim_udp_transport = { > /* .send_round_msg = */ swim_udp_send_msg, > + /* .send_ping = */ swim_udp_send_msg, > + /* .send_ack = */ swim_udp_send_msg, Why do you need a separate transport api for ack/ping sends? Shouldn't send/recv be enough? This is a transport layer, it should be unaware of protocol details. -- Konstantin Osipov, Moscow, Russia, +7 903 626 22 32 http://tarantool.io - www.twitter.com/kostja_osipov