From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounces@dev.tarantool.org>
Received: from [87.239.111.99] (localhost [127.0.0.1])
	by dev.tarantool.org (Postfix) with ESMTP id 1E9467030F;
	Thu, 25 Feb 2021 15:44:18 +0300 (MSK)
DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 1E9467030F
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev;
	t=1614257058; bh=P9icKg2cP5mJYJZpqI5zsCUs7AXtqP/ibvxAzlA2voo=;
	h=To:References:Date:In-Reply-To:Subject:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:
	 From;
	b=EFPtfG6ZcvO4VrFSIgFds21qNt4PYbjx32d2HlLAnsQGWK1pRi0Q2ImFzaSpTh8RY
	 b/qlxS7RxKQUM5rf/UaQ09OYsqVbDeJnUs0adHzwF31NI5LAt11dHhpeUEovTsWS0t
	 v1JWSuzpjFulsutgV3DcPmbo026mH1zf5jcQZwsU=
Received: from smtp55.i.mail.ru (smtp55.i.mail.ru [217.69.128.35])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 7BD677030F
 for <tarantool-patches@dev.tarantool.org>;
 Thu, 25 Feb 2021 15:43:14 +0300 (MSK)
DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 7BD677030F
Received: by smtp55.i.mail.ru with esmtpa (envelope-from
 <olegrok@tarantool.org>)
 id 1lFFz6-0002Xi-Dq; Thu, 25 Feb 2021 15:43:14 +0300
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>,
 tarantool-patches@dev.tarantool.org, yaroslav.dynnikov@tarantool.org
References: <cover.1614039039.git.v.shpilevoy@tarantool.org>
 <f799cf4c753c0e271cfa9309aac4cd33a6a65aaf.1614039039.git.v.shpilevoy@tarantool.org>
 <837d2697-5165-23d6-72bf-f7533af10864@tarantool.org>
 <cfee9fa8-b370-78be-c5ec-2c6e3b459bc0@tarantool.org>
Message-ID: <f3760ddc-55d7-b6db-718b-d6969bb66085@tarantool.org>
Date: Thu, 25 Feb 2021 15:43:10 +0300
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:78.0)
 Gecko/20100101 Thunderbird/78.7.1
MIME-Version: 1.0
In-Reply-To: <cfee9fa8-b370-78be-c5ec-2c6e3b459bc0@tarantool.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Content-Language: en-GB
X-7564579A: 646B95376F6C166E
X-77F55803: 4F1203BC0FB41BD975C3EC174F5669229511437AA01F46811CFCF616A939B362182A05F53808504091FCFA0642490E9C242BDEE6FF47F2BE1733572B21620B1991675B2CB381C567
X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE7544B1CCE26E01C74EA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F7900637174E2957C4CE0F938638F802B75D45FF914D58D5BE9E6BC131B5C99E7648C95CDDE882590F889B1CD5457EDD335955C81D4456E09E356D85A471835C12D1D9774AD6D5ED66289B5278DA827A17800CE71AE4D56B06699BBC9FA2833FD35BB23D2EF20D2F80756B5F868A13BD56FB6657A471835C12D1D977725E5C173C3A84C327ED053E960B195E117882F4460429728AD0CFFFB425014E868A13BD56FB6657D81D268191BDAD3DC09775C1D3CA48CF7F8A167F7B42CAD3BA3038C0950A5D36C8A9BA7A39EFB766EC990983EF5C0329BA3038C0950A5D36D5E8D9A59859A8B6CD063A04CBECCE1B76E601842F6C81A1F004C90652538430CDED94BCBF13EF3B93EC92FD9297F6718AA50765F7900637C69D4C8FA58DAC3AA7F4EDE966BC389F395957E7521B51C24C7702A67D5C33162DBA43225CD8A89FDD9D78FC367030856D8C47C27EEC5E9FB5C8C57E37DE458B4C7702A67D5C3316FA3894348FB808DB48C21F01D89DB561574AF45C6390F7469DAA53EE0834AAEE
X-C1DE0DAB: 0D63561A33F958A56B42ABC96495FD15AC4FE941BA784C1038DECB0E97FEFE31D59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA7557E988E9157162368E8E86DC7131B365E7726E8460B7C23C
X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D344B4608D01B1CB59AFCF0482AC94B02C84365A9893F8B6EACED90C31A5BE9224103E61E306EB9C3161D7E09C32AA3244C7717C4B78EB0E38886BE7DDF52AD21A8435BF7150578642FFACE5A9C96DEB163
X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojvz1c9SWJtj/3ZQXa26LBiw==
X-Mailru-Sender: 583F1D7ACE8F49BD9317CE1922F30C7E177B989DAC0067AFAB04F7AB7CFE84D82B0CDE12F27F436423E75C7104EB1B885DEE61814008E47C7013064206BFB89F93956FB04BA385BE9437F6177E88F7363CDA0F3B3F5B9367
X-Mras: Ok
Subject: Re: [Tarantool-patches] [PATCH vshard 11/11] router: introduce
 map_callrw()
X-BeenThere: tarantool-patches@dev.tarantool.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
From: Oleg Babin via Tarantool-patches <tarantool-patches@dev.tarantool.org>
Reply-To: Oleg Babin <olegrok@tarantool.org>
Errors-To: tarantool-patches-bounces@dev.tarantool.org
Sender: "Tarantool-patches" <tarantool-patches-bounces@dev.tarantool.org>

Thanks for your detailed explanation. See my comments/answers below.

On 25.02.2021 01:04, Vladislav Shpilevoy wrote:
> Thanks for the review!
>
> On 24.02.2021 11:28, Oleg Babin via Tarantool-patches wrote:
>> Thanks a lot for your patch! See 5 comments below.
>>
>>
>> On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
>>> Closes #147
>> Will read-only map-reduce functions be done in the scope of separate issue/patch?
>>
>> I know about #173 but seems we need to keep information about map_callro function.
> Perhaps there will be a new ticket, yes. Until 173 is fixed, ro refs seem
> pointless. The implementation depends on how exactly to fix 173.
>
> At this moment I didn't even design it yet. I am thinking if I need to
> account ro storage refs separated from rw refs in the scheduler or not.
> Implementation heavily depends on this.
>
> If I go for considering ro refs separated, it adds third group of operations
> to the scheduler complicating it significantly, and a second type of refs to
> the refs module obviously.
>
> It would allow to send buckets and mark them as GARBAGE/SENT while keep ro
> refs. But would complicate the garbage collector a bit as it would need to
> consider storage ro refs.
>
> On the other hand, it would heavily complicate the code in the scheduler.
> Like really heavy, I suppose, but I can be wrong.
>
> Also the win will be zeroed if we ever implement rebalancing of writable
> buckets. Besides, the main reason I am more into unified type of refs is
> that they are used solely for map-reduce. Which means they are taken on all
> storages. This means if you have SENDING readable bucket on one storage,
> you have RECEIVING non-readable bucket on another storage. Which makes
> such ro refs pointless for full cluster scan. Simply won't be able to take
> ro ref on the RECEIVING storage.
>
> If we go for unified refs, it would allow to keep the scheduler relatively
> simple, but requires urgent fix of 173. Otherwise you can't be sure your
> data is consistent. Even now you can't be sure with normal requests, which
> terrifies me and raises a huge question - why the fuck nobody cares? I
> think I already had a couple of nightmares about 173.

Yes, it's a problem but rebalansing is not quite often operation. So, in 
some

cases routeall() was enough. Anyway we didn't have any alternatives. But 
map-reduce it's

really often operation - any request over secondary index and you should 
scan whole cluster.

> Another issue - ro refs won't stop rebalancer from working and don't
> even participate in the scheduler, if we fix 173 in the simplest way -
> just invalidate all refs on the replica if a bucket starts moving. If
> the rebalancer works actively, nearly all your ro map-reduces will return
> errors during data migration because buckets will move constantly. No
> throttling via the scheduler.
>
> One way to go - switch all replica weights to defaults for the time of
> data migration manually. So map_callro() will go to master nodes and
> will participate in the scheduling.
>
> Another way to go - don't care and fix 173 in the simplest way. Weights
> anyway are used super rare AFAIK.
>
> Third way to go - implement some smarter fix of 173. I have many ideas here.
> But neither of them are quick.
>
>>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>>> index 97bcb0a..8abd77f 100644
>>> --- a/vshard/router/init.lua
>>> +++ b/vshard/router/init.lua
>>> @@ -44,6 +44,11 @@ if not M then
>>>            module_version = 0,
>>>            -- Number of router which require collecting lua garbage.
>>>            collect_lua_garbage_cnt = 0,
>>> +
>>> +        ----------------------- Map-Reduce -----------------------
>>> +        -- Storage Ref ID. It must be unique for each ref request
>>> +        -- and therefore is global and monotonically growing.
>>> +        ref_id = 0,
>> Maybe 0ULL?
> Wouldn't break anyway - doubles are precise until 2^53. But an integer
> should be faster I hope.
>
> Changed to 0ULL.
>
> But still not sure. I asked Igor about this. If ULL/LL are more performant,
> I will also use them in storage.ref and storage.sched where possible. They
> have many counters.
>
>>>    @@ -674,6 +679,177 @@ local function router_call(router, bucket_id, opts, ...)
>>>                                ...)
>>>    end
>>>    +local router_map_callrw
>>> +
>>> +if util.version_is_at_least(1, 10, 0) then
>>> +--
>>> +-- Consistent Map-Reduce. The given function is called on all masters in the
>>> +-- cluster with a guarantee that in case of success it was executed with all
>>> +-- buckets being accessible for reads and writes.
>>> +--
>>> +-- Consistency in scope of map-reduce means all the data was accessible, and
>>> +-- didn't move during map requests execution. To preserve the consistency there
>>> +-- is a third stage - Ref. So the algorithm is actually Ref-Map-Reduce.
>>> +--
>>> +-- Refs are broadcast before Map stage to pin the buckets to their storages, and
>>> +-- ensure they won't move until maps are done.
>>> +--
>>> +-- Map requests are broadcast in case all refs are done successfully. They
>>> +-- execute the user function + delete the refs to enable rebalancing again.
>>> +--
>>> +-- On the storages there are additional means to ensure map-reduces don't block
>>> +-- rebalancing forever and vice versa.
>>> +--
>>> +-- The function is not as slow as it may seem - it uses netbox's feature
>>> +-- is_async to send refs and maps in parallel. So cost of the function is about
>>> +-- 2 network exchanges to the most far storage in terms of time.
>>> +--
>>> +-- @param router Router instance to use.
>>> +-- @param func Name of the function to call.
>>> +-- @param args Function arguments passed in netbox style (as an array).
>>> +-- @param opts Can only contain 'timeout' as a number of seconds. Note that the
>>> +--     refs may end up being kept on the storages during this entire timeout if
>>> +--     something goes wrong. For instance, network issues appear. This means
>>> +--     better not use a value bigger than necessary. A stuck infinite ref can
>>> +--     only be dropped by this router restart/reconnect or the storage restart.
>>> +--
>>> +-- @return In case of success - a map with replicaset UUID keys and values being
>>> +--     what the function returned from the replicaset.
>>> +--
>>> +-- @return In case of an error - nil, error object, optional UUID of the
>>> +--     replicaset where the error happened. UUID may be not present if it wasn't
>>> +--     about concrete replicaset. For example, not all buckets were found even
>>> +--     though all replicasets were scanned.
>>> +--
>>> +router_map_callrw = function(router, func, args, opts)
>>> +    local replicasets = router.replicasets
>> It would be great to filter here replicasets with bucket_count = 0 and weight = 0.
> Information on the router may be outdated. Even if bucket_count is 0,
> it may still mean the discovery didn't get to there yet. Or the
> discovery is simply disabled or dead due to a bug.
>
> Weight 0 also does not say anything certain about buckets on the
> storage. It can be that config on the router is outdated, or it is
> outdated on the storage. Or someone simply does not set the weights
> for router configs because they are not used here. Or the weights are
> really 0 and set everywhere, but rebalancing is in progress - buckets
> move from the storage slowly, and the scheduler allows to squeeze some
> map-reduces in the meantime.
>
>> In case if such "dummy" replicasets are disabled we get an error "connection refused".
> I need more info here. What is 'disabled' replicaset? And why would
> 0 discovered buckets or 0 weight lead to the refused connection?
>
> In case this is some cartridge specific shit, this is bad. As you
> can see above, I can't ignore such replicasets. I need to send requests
> to them anyway.
>
> There is a workaround though - even if an error has occurred, continue
> execution if even without the failed storages I still cover all the buckets.
> Then having 'disabled' replicasets in the config would result in some
> unnecessary faulty requests for each map call, but they would work.
>
> Although I don't know how to 'return' such errors. I don't want to log them
> on each request, and don't have a concept of a 'warning' object or something
> like this. Weird option - in case of this uncertain success return the
> result, error, uuid. So the function could return
>
> - nil, err[, uuid] - fail;
> - res              - success;
> - res, err[, uuid] - success but with a suspicious issue;
>
Seems I didn't fully state the problem. Maybe it's not even relevant 
issue and users

that do it just create their own problems. But there was a case when our 
customers added

new replicaset (in fact single instance) in cluster. This instance 
didn't have any data and had a weight=0.

Then they just turned off this instance after some time. And all 
requests that perform map-reduce started to fail with "Connection

refused" error.

It causes a question: "Why do our requests fail if we disable instance 
that doesn't have any data".

Yes, requirement of weight=0 is obviously not enough - because if 
rebalansing is in progress replicaset with weight=0

still could contain some data.


Considering an opportunity to finish requests if someone failed - I'm 
not sure that it's really needed.

Usually we don't need some partial result (moreover if it adds some 
workload).


Regarding the question of error. Cartridge has an example of error 
object for map_call function.

For errors cartridge creates some "object" that contains info about all 
errors that were returned.

It's not excellent solution but I think Yaroslav could give some 
thoughts here.


https://github.com/tarantool/cartridge/blob/cf195bc9576eb460d66d609c357bec5014a90d21/cartridge/pool.lua#L235


>>> +    local timeout = opts and opts.timeout or consts.CALL_TIMEOUT_MIN
>>> +    local deadline = fiber_clock() + timeout
>>> +    local err, err_uuid, res, ok, map
>>> +    local futures = {}
>>> +    local bucket_count = 0
>>> +    local opts_async = {is_async = true}
>>> +    local rs_count = 0
>>> +    local rid = M.ref_id
>>> +    M.ref_id = rid + 1
>>> +    -- Nil checks are done explicitly here (== nil instead of 'not'), because
>>> +    -- netbox requests return box.NULL instead of nils.
>>> +
>>> +    --
>>> +    -- Ref stage: send.
>>> +    --
>>> +    for uuid, rs in pairs(replicasets) do
>>> +        -- Netbox async requests work only with active connections. Need to wait
>>> +        -- for the connection explicitly.
>>> +        timeout, err = rs:wait_connected(timeout)
>>> +        if timeout == nil then
>>> +            err_uuid = uuid
>>> +            goto fail
>>> +        end
>>> +        res, err = rs:callrw('vshard.storage._call',
>>> +                              {'storage_ref', rid, timeout}, opts_async)
>>> +        if res == nil then
>>> +            err_uuid = uuid
>>> +            goto fail
>>> +        end
>>> +        futures[uuid] = res
>>> +        rs_count = rs_count + 1
>>> +    end
>>> +    map = table_new(0, rs_count)
>>> +    --
>>> +    -- Ref stage: collect.
>>> +    --
>>> +    for uuid, future in pairs(futures) do
>>> +        res, err = future:wait_result(timeout)
>>> +        -- Handle netbox error first.
>>> +        if res == nil then
>>> +            err_uuid = uuid
>>> +            goto fail
>>> +        end
>>> +        -- Ref returns nil,err or bucket count.
>>> +        res, err = unpack(res)
>> Seems `res, err = res[1], res[2]` could be a bit faster.
> Indeed. Applied:
>
> ====================
> @@ -767,7 +767,7 @@ router_map_callrw = function(router, func, args, opts)
>               goto fail
>           end
>           -- Ref returns nil,err or bucket count.
> -        res, err = unpack(res)
> +        res, err = res[1], res[2]
>           if res == nil then
>               err_uuid = uuid
> ====================
>
> On a not related note, I gave more thought to your idea with
> doing a 'map-reduce' but not on the whole cluster. And for this we
> could introduce 'sca_callrw/sca_callro'. Which mean 'scatter'.
>
> They could send your requests to the given replicasets (or all by
> default) and return whatever is back. Just a wrapper on top of a
> couple of loops with is_async netbox calls. Not sure if need to
> take storage refs for the calls. Their usage might be not related
> to buckets really. Just to access the storages and their own local
> data, not sharded data.
>
> In case you want scattering for accessing a set of buckets, I
> could add `bat_callrw/bat_callro`. Which means 'batch'. They would
> take a set of bucket ids, match them to replicasets, go to each
> replicaset just one time for all its buckets, take storage ref
> there, and execute your function. Or ref all the buckets individually
> but I don't know what to do if their number is big. Would increase
> request size, and referencing overhead too much.
>
> To attach some context to the buckets bat_call could accept pairs
> {bucket_id = {func, args}}. Or accept one function name and for
> each bucket have {bucket_id = {args}}. And bat_call will call your
> function with the given args on the storage. Maybe with such approach
> individual refs make more sense. Don't know.
>
> If you like these ideas, you can file tickets for them so as they
> wouldn't be lost and maybe eventually would be implemented.

Batch requests are quite often case and seems issue is already filed - 
https://github.com/tarantool/vshard/issues/176.

I'll think about "sca_callrw/sca_callro" if I will find some cases I 
file an issue. But currently it seems my

minds around scatter-operations are closely related with batch requests.