[Tarantool-patches] [PATCH luajit v1 02/11] utils: introduce leb128 reader and writer

Mon Dec 21 01:44:33 MSK 2020

Sergey,

Thanks for the patch! Please, consider the comments below.

On 16.12.20, Sergey Kaplun wrote:
> This patch introduces module for reading and writing leb128 compression.
> It will be used for streaming profiling events writing, that will be
> added at the next patches.
> 
> Part of tarantool/tarantool#5442
> ---
>  src/Makefile       |   5 +-
>  src/Makefile.dep   |   1 +
>  src/utils/leb128.c | 124 +++++++++++++++++++++++++++++++++++++++++++++
>  src/utils/leb128.h |  55 ++++++++++++++++++++
>  4 files changed, 183 insertions(+), 2 deletions(-)
>  create mode 100644 src/utils/leb128.c
>  create mode 100644 src/utils/leb128.h
> 
> diff --git a/src/Makefile b/src/Makefile
> index caa49f9..be7ed95 100644
> --- a/src/Makefile
> +++ b/src/Makefile

Please, adjust these changes considering the comments to the first
patch. I propose to use either <lj_utils.*> or <lj_utils_leb128.*> for
the name.

> @@ -468,6 +468,7 @@ endif

<snipped>

> diff --git a/src/utils/leb128.c b/src/utils/leb128.c
> new file mode 100644
> index 0000000..921e5bc
> --- /dev/null
> +++ b/src/utils/leb128.c
> @@ -0,0 +1,124 @@
> +/*
> +** Working with LEB128/ULEB128 encoding.
> +**
> +** Major portions taken verbatim or adapted from the LuaVela.
> +** Copyright (C) 2015-2019 IPONWEB Ltd.
> +*/
> +
> +#include <stdint.h>
> +#include <stddef.h>

Why do you include this again instead of using leb128.h?

> +
> +#define LINK_BIT          (0x80)
> +#define MIN_TWOBYTE_VALUE (0x80)
> +#define PAYLOAD_MASK      (0x7f)
> +#define SHIFT_STEP        (7)
> +#define LEB_SIGN_BIT      (0x40)
> +
> +/* ------------------------- Writing ULEB128/LEB128 ------------------------- */
> +
> +size_t write_uleb128(uint8_t *buffer, uint64_t value)
> +{
> +  size_t i = 0;
> +
> +  for (; value >= MIN_TWOBYTE_VALUE; value >>= SHIFT_STEP) {
> +    buffer[i++] = (uint8_t)((value & PAYLOAD_MASK) | LINK_BIT);
> +  }

The braces are excess.

> +  buffer[i++] = (uint8_t)value;
> +
> +  return i;
> +}
> +
> +size_t write_leb128(uint8_t *buffer, int64_t value)
> +{
> +  size_t i = 0;
> +
> +  for (; (uint64_t)(value + 0x40) >= MIN_TWOBYTE_VALUE; value >>= SHIFT_STEP) {

What is 0x40? If this is <LEB_SIGN_BIT>, then just use the constant
here. Otherwise create a new one with the comment. Please, do not use
magic numbers.

> +    buffer[i++] = (uint8_t)((value & PAYLOAD_MASK) | LINK_BIT);
> +  }

The braces are excess.

> +  buffer[i++] = (uint8_t)(value & PAYLOAD_MASK);
> +
> +  return i;
> +}
> +
> +/* ------------------------- Reading ULEB128/LEB128 ------------------------- */
> +
> +/*
> +** NB! For each LEB128 type (signed/unsigned) we have two versions of read

Minor: It's better to use XXX for these cases in comments. We already
discussed this with Vlad here[1] (search for "What is 'XXX'?").

> +** functions: The one consuming unlimited number of input octets and the one
> +** consuming not more than given number of input octets. Currently reading
> +** is not used in performance critical places, so these two functions are
> +** implemented via single low-level function + run-time mode check. Feel free
> +** to change if this becomes a bottleneck.

Well, you can also add LJ_AINLINE for a low-level function, or simply
add a similar hint by yourself (I personally prefer the first one).

> +*/
> +
> +static size_t _read_uleb128(uint64_t *out, const uint8_t *buffer, int guarded,
> +			    size_t n)

AFAICS, <n> argument is used only in case <guarded> is set to 1.
Moreover, <n> can't be 0 when <guarded> is set, otherwise this is a
nilpotent function. So it seems you can drop the <guarded> parameter in
favour of the following contract for <n>:
* n == 0 is for guarded == 0 && n == 0
* n > 0  is for guarded == 1 && n > 0

This also relates to <_read_leb128>.

> +{
> +  size_t i = 0;
> +  uint64_t value = 0;
> +  uint64_t shift = 0;
> +  uint8_t octet;
> +
> +  for(;;) {
> +    if (guarded && i + 1 > n) {
> +      return 0;
> +    }

The braces are excess.

> +    octet = buffer[i++];
> +    value |= ((uint64_t)(octet & PAYLOAD_MASK)) << shift;
> +    shift += SHIFT_STEP;
> +    if (!(octet & LINK_BIT)) {
> +      break;
> +    }

The braces are excess.

> +  }
> +
> +  *out = value;
> +  return i;
> +}
> +

<snipped>

> +static size_t _read_leb128(int64_t *out, const uint8_t *buffer, int guarded,
> +			   size_t n)
> +{
> +  size_t i = 0;
> +  int64_t  value = 0;
> +  uint64_t shift = 0;
> +  uint8_t  octet;

A mess with whitespace above.

> +
> +  for(;;) {
> +    if (guarded && i + 1 > n) {
> +      return 0;
> +    }

The braces are excess.

> +    octet  = buffer[i++];
> +    value |= ((int64_t)(octet & PAYLOAD_MASK)) << shift;
> +    shift += SHIFT_STEP;
> +    if (!(octet & LINK_BIT)) {
> +      break;
> +    }

The braces are excess.

> +  }
> +
> +  if (octet & LEB_SIGN_BIT && shift < sizeof(int64_t) * 8) {
> +    value |= -(1 << shift);
> +  }

The braces are excess.

> +
> +  *out = value;
> +  return i;
> +}
> +

<snipped>

> diff --git a/src/utils/leb128.h b/src/utils/leb128.h
> new file mode 100644
> index 0000000..46d90bc
> --- /dev/null
> +++ b/src/utils/leb128.h
> @@ -0,0 +1,55 @@
> +/*
> +** Interfaces for working with LEB128/ULEB128 encoding.
> +**
> +** Major portions taken verbatim or adapted from the LuaVela.
> +** Copyright (C) 2015-2019 IPONWEB Ltd.
> +*/
> +
> +#ifndef _LJ_UTILS_LEB128_H
> +#define _LJ_UTILS_LEB128_H
> +
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +/* Maximum number of bytes needed for LEB128 encoding of any 64-bit value. */
> +#define LEB128_U64_MAXSIZE 10

The naming looks odd to me. Considering my comment for the first patch,
I propose to use something matching "lj_u?leb128_(read|write)(_n)?".

By the way, the order of the interfaces is also odd.

> +
> +/*
> +** Writes a value from an unsigned 64-bit input to a buffer of bytes.
> +** Buffer overflow is not checked. Returns number of bytes written.
> +*/
> +size_t write_uleb128(uint8_t *buffer, uint64_t value);
> +
> +/*
> +** Writes a value from an signed 64-bit input to a buffer of bytes.
> +** Buffer overflow is not checked. Returns number of bytes written.
> +*/
> +size_t write_leb128(uint8_t *buffer, int64_t value);
> +
> +/*
> +** Reads a value from a buffer of bytes to a uint64_t output.
> +** Buffer overflow is not checked. Returns number of bytes read.

If "buffer overflow" stands for "reading out of bounds", please reword
this. Otherwise, I don't get it.

> +*/
> +size_t read_uleb128(uint64_t *out, const uint8_t *buffer);
> +
> +/*
> +** Reads a value from a buffer of bytes to a int64_t output.
> +** Buffer overflow is not checked. Returns number of bytes read.

Ditto.

> +*/
> +size_t read_leb128(int64_t *out, const uint8_t *buffer);
> +
> +/*
> +** Reads a value from a buffer of bytes to a uint64_t output. Consumes no more
> +** than n bytes. Buffer overflow is not checked. Returns number of bytes read.

Ditto.

> +** If more than n bytes is about to be consumed, returns 0 without touching out.
> +*/
> +size_t read_uleb128_n(uint64_t *out, const uint8_t *buffer, size_t n);
> +
> +/*
> +** Reads a value from a buffer of bytes to a int64_t output. Consumes no more
> +** than n bytes. Buffer overflow is not checked. Returns number of bytes read.

Ditto.

> +** If more than n bytes is about to be consumed, returns 0 without touching out.
> +*/
> +size_t read_leb128_n(int64_t *out, const uint8_t *buffer, size_t n);
> +
> +#endif
> -- 
> 2.28.0
> 

[1]: https://lists.tarantool.org/pipermail/tarantool-patches/2020-July/018314.html

-- 
Best regards,
IM