[Tarantool-patches] [PATCH vshard 11/11] router: introduce map_callrw()

Mon Mar 1 13:58:48 MSK 2021

Hi!

On 27.02.2021 02:58, Vladislav Shpilevoy wrote:
>>>> On 23.02.2021 03:15, Vladislav Shpilevoy wrote:
>>>>> Closes #147
>>>> Will read-only map-reduce functions be done in the scope of separate issue/patch?
>>>>
>>>> I know about #173 but seems we need to keep information about map_callro function.
>>> Perhaps there will be a new ticket, yes. Until 173 is fixed, ro refs seem
>>> pointless. The implementation depends on how exactly to fix 173.
>>>
>>> At this moment I didn't even design it yet. I am thinking if I need to
>>> account ro storage refs separated from rw refs in the scheduler or not.
>>> Implementation heavily depends on this.
>>>
>>> If I go for considering ro refs separated, it adds third group of operations
>>> to the scheduler complicating it significantly, and a second type of refs to
>>> the refs module obviously.
>>>
>>> It would allow to send buckets and mark them as GARBAGE/SENT while keep ro
>>> refs. But would complicate the garbage collector a bit as it would need to
>>> consider storage ro refs.
>>>
>>> On the other hand, it would heavily complicate the code in the scheduler.
>>> Like really heavy, I suppose, but I can be wrong.
>>>
>>> Also the win will be zeroed if we ever implement rebalancing of writable
>>> buckets. Besides, the main reason I am more into unified type of refs is
>>> that they are used solely for map-reduce. Which means they are taken on all
>>> storages. This means if you have SENDING readable bucket on one storage,
>>> you have RECEIVING non-readable bucket on another storage. Which makes
>>> such ro refs pointless for full cluster scan. Simply won't be able to take
>>> ro ref on the RECEIVING storage.
>>>
>>> If we go for unified refs, it would allow to keep the scheduler relatively
>>> simple, but requires urgent fix of 173. Otherwise you can't be sure your
>>> data is consistent. Even now you can't be sure with normal requests, which
>>> terrifies me and raises a huge question - why the fuck nobody cares? I
>>> think I already had a couple of nightmares about 173.
>> Yes, it's a problem but rebalansing is not quite often operation. So, in some
>>
>> cases routeall() was enough. Anyway we didn't have any alternatives. But map-reduce it's
>>
>> really often operation - any request over secondary index and you should scan whole cluster.
> Have you tried building secondary indexes over buckets? There is an
> algorithm, in case it is something perf sensitive.
>
> You can store secondary index in another space and shard it independently.
> And there are ways how to deal with inability to atomically update it and
> the main space together. So almost always you will have at most 2 network
> hops to at most 2 nodes to find the primary key. Regardless of cluster
> size.

K. Nazarov had such ideas and he implemented such thing. He called it 
"reverse index" but

finally this patch wasn't applied to project upstream.

It's interesting idea but I've not seen request for this feature from 
our customers.

>>> Another issue - ro refs won't stop rebalancer from working and don't
>>> even participate in the scheduler, if we fix 173 in the simplest way -
>>> just invalidate all refs on the replica if a bucket starts moving. If
>>> the rebalancer works actively, nearly all your ro map-reduces will return
>>> errors during data migration because buckets will move constantly. No
>>> throttling via the scheduler.
>>>
>>> One way to go - switch all replica weights to defaults for the time of
>>> data migration manually. So map_callro() will go to master nodes and
>>> will participate in the scheduling.
>>>
>>> Another way to go - don't care and fix 173 in the simplest way. Weights
>>> anyway are used super rare AFAIK.
>>>
>>> Third way to go - implement some smarter fix of 173. I have many ideas here.
>>> But neither of them are quick.
>>>
>>>>> diff --git a/vshard/router/init.lua b/vshard/router/init.lua
>>>>> index 97bcb0a..8abd77f 100644
>>>>> --- a/vshard/router/init.lua
>>>>> +++ b/vshard/router/init.lua
>>>>> @@ -44,6 +44,11 @@ if not M then
>>>>>             module_version = 0,
>>>>>             -- Number of router which require collecting lua garbage.
>>>>>             collect_lua_garbage_cnt = 0,
>>>>> +
>>>>> +        ----------------------- Map-Reduce -----------------------
>>>>> +        -- Storage Ref ID. It must be unique for each ref request
>>>>> +        -- and therefore is global and monotonically growing.
>>>>> +        ref_id = 0,
>>>> Maybe 0ULL?
>>> Wouldn't break anyway - doubles are precise until 2^53. But an integer
>>> should be faster I hope.
>>>
>>> Changed to 0ULL.
> I reverted it back. Asked Igor and he reminded me it is cdata. So it
> involves heavy stuff with metatables and shit. It is cheaper to simply
> increment the plain number. I didn't measure though.

Well, I hope it won't cause any problems. However I'm not sure that 
summing the numbers can lead to any noticeable load.

>>>>> +router_map_callrw = function(router, func, args, opts)
>>>>> +    local replicasets = router.replicasets
>>>> It would be great to filter here replicasets with bucket_count = 0 and weight = 0.
>>> Information on the router may be outdated. Even if bucket_count is 0,
>>> it may still mean the discovery didn't get to there yet. Or the
>>> discovery is simply disabled or dead due to a bug.
>>>
>>> Weight 0 also does not say anything certain about buckets on the
>>> storage. It can be that config on the router is outdated, or it is
>>> outdated on the storage. Or someone simply does not set the weights
>>> for router configs because they are not used here. Or the weights are
>>> really 0 and set everywhere, but rebalancing is in progress - buckets
>>> move from the storage slowly, and the scheduler allows to squeeze some
>>> map-reduces in the meantime.
>>>
>>>> In case if such "dummy" replicasets are disabled we get an error "connection refused".
>>> I need more info here. What is 'disabled' replicaset? And why would
>>> 0 discovered buckets or 0 weight lead to the refused connection?
>>>
>>> In case this is some cartridge specific shit, this is bad. As you
>>> can see above, I can't ignore such replicasets. I need to send requests
>>> to them anyway.
>>>
>>> There is a workaround though - even if an error has occurred, continue
>>> execution if even without the failed storages I still cover all the buckets.
>>> Then having 'disabled' replicasets in the config would result in some
>>> unnecessary faulty requests for each map call, but they would work.
>>>
>>> Although I don't know how to 'return' such errors. I don't want to log them
>>> on each request, and don't have a concept of a 'warning' object or something
>>> like this. Weird option - in case of this uncertain success return the
>>> result, error, uuid. So the function could return
>>>
>>> - nil, err[, uuid] - fail;
>>> - res              - success;
>>> - res, err[, uuid] - success but with a suspicious issue;
>>>
>> Seems I didn't fully state the problem. Maybe it's not even relevant issue and users
>>
>> that do it just create their own problems. But there was a case when our customers added
>>
>> new replicaset (in fact single instance) in cluster. This instance didn't have any data and had a weight=0.
>>
>> Then they just turned off this instance after some time. And all requests that perform map-reduce started to fail with "Connection
>>
>> refused" error.
>>
>> It causes a question: "Why do our requests fail if we disable instance that doesn't have any data".
>>
>> Yes, requirement of weight=0 is obviously not enough - because if rebalansing is in progress replicaset with weight=0
>>
>> still could contain some data.
>>
>>
>> Considering an opportunity to finish requests if someone failed - I'm not sure that it's really needed.
>>
>> Usually we don't need some partial result (moreover if it adds some workload).
> I want to emphasize - it won't be partial. It will be full result. In case the
> 'disabled' node does not have any data, the requests to the other nodes will
> see that they covered all the buckets. So this 'disabled' node loss is not
> critical.

But can I make sure that this node actually doesn't contain any data or 
it's really some problems

with discovery and etc. And in general we can have more than one such 
replicaset.

Maybe yes, sometimes you want to have some results even if some of 
requests fails. Maybe it's possible to be done under some special

flag? e.g. "ignore_errors" but this approach requires a bit different 
way how to process errors - you should return a map of results and map 
of errors.

Finally, feel free to ignore this suggestion - I'll filed an issue if 
I'm sure that we really need it.

> Map-Reduce, at least how I understood it, is about providing access to all
> buckets. You can succeed at this even if didn't manage to look into each node.
> If the not responded nodes didn't have any data.
>
> But sounds like a crutch for an outdated config. If this is not something
> regular I need to support, I better leave it as is now then. A normal fail.