[tarantool-patches] Re: [PATCH v1 1/1] lib: implement JSON tree class for json library

Mon Oct 8 13:07:17 MSK 2018

On Sat, Oct 06, 2018 at 03:37:14PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev at gmail.com> [18/10/05 12:52]:
> > There will be such a hash table in each root tuple_field object (it's
> > not a part of the generic json tree infrastructure). It will be used to
> > speed up data indexing. Why would we need to bother implementing lookup
> > in a tree then? Because I assume we want to speed up lookups over all
> > sub-paths, e.g. suppose we have a tuple
> 
> Why do we maintain a separate hash for tuple field names and json
> paths within a tuple?
> 
> json paths are just a continuation of existing ways to access a
> tuple field.
> 
> Looks like we need to rebuild existing implementation of field
> names support, not add something on a side.

Field names are not quite the same as json paths within a field.
Basically, they are just aliases to field indexes, i.e. the same tuple
field can be referenced both by name and by index. With json paths, this
is impossible. Handling this properly would complicate the code IMO.
Also, I believe that incorporating json paths in tuple_dictionary can
slow down users that don't use json paths.

That said, I don't think it'd be good idea to mix the two notions, but
we can discuss it you wish.

> 
> > 
> > t = {
> >      {
> >       customer = {
> >        name = {
> >         first = ...,
> >         last = ...,
> >        },
> >        ...
> >       }
> >      },
> >      ...
> >     }
> > 
> > Then we can lookup 'customer.name.first' in Lua using any of the
> > following ways:
> > 
> >   t['[1].customer.name.first']
> >   t[1]['customer.name.first']
> >   t[1].customer['name.first']
> >   t[1].customer.name.first
> > 
> >   etc
> > 
> > I assume we want to use tuple field map so as to avoid msgpack decode
> > whenever possible, not just for cases 1 or 2. Moreover, I guess
> > t[1].customer.name.first will be more commonly used.
> 
> 
> > 
> > We could of course store a hash table over all full paths in each
> > intermediate node, but that would result in maintaining way too many
> > hash entries, e.g. for a tuple field corresponding to 'customer', we
> > would have to store all the following strings in the hash: 'name',
> > 'name.first' and so on for other descendants. So I think that we only
> > want to use the hash table for data indexing, i.e. when a field is
> > looked up by full path like t[1]['customer.name.first']. For all other
> > references we will split the path into nodes and descend down the tree
> > to find the appropriate tuple field map slot.
> 
> I would try to create a single hash table for all possible paths,
> including intermediate paths, and store it in the tuple format.

This means that the number of hashed paths would grow quadratically with
the number of nodes, e.g. for

  a.b.c.d
  a.b.e.f

we'd have to store

  a, a.b, a.b.c, a.b.c.d, a.b.e, a.b.e.f

in root fields and

  b, b.c, b.c.d, b.e, b.e.f, c, c.d, e, e.f, d, f

in intermediate fields.

I don't think it's worthwhile, because to compute the hash of a json
path, we have to parse it (at least, I don't see any other way), so we
could descend the json path tree as we go looking up path nodes one by
one, instead of looking up a full path in a hash table.

> 
> The hash table would store pointers to json tree nodes.
> 
> And I also think we need json tree nodes for top level fields -
> they are just the same as intermediate fields.
> 
> I would also keep in mind that in the future we will automatically
> flatten tuples according to a schema before storing in a space.
> 
> Flattening, from path access perspective, is simply a replacement
> of one path with another  (foo.bar.baz -> [1][1][3]).

I think tuple flattening is about turning a tree into an array, i.e.

foo.bar.baz -> [N]