Presto Core Data Structures: Slice, Block & Page

简介: ![slice.jpg](https://ata2-img.cn-hangzhou.oss-pub.aliyun-inc.com/8efe25242cf8a9e6c0067e1c71cbb7d3.jpg) ## Overview In Presto there are some very essential data structure we need to understand, S

slice.jpg

Overview

In Presto there are some very essential data structure we need to understand, Slice, Block & Page are three of them, I will introduce these data structures in this article.

Slice

From user point of view, Slice is a more developer friendly virtual memory, it defines a set of getters and setters, so you can use the memory like a piece of structured data:

Slice

The typical usage for Slice is to represent a String:

// use it as utf8 encoded string
Slice slice = Slices.utf8Slice("hello");
Slice subSlice = SliceUtf8.substring(slice, 1, 2);

You can see we are using Slice just like a String, the reason that Presto prefers Slice over Strings is:

  • Strings are expensive to build(String concatenation, StringBuilder etc).
  • Slice is mutable while String is not, so it is more efficient when we need to do string calculation.
  • Strings are encoded as UTF16 in memory, while Slice use UTF8, which is more memory efficient.

    • UTF16 use minimum two bytes to represent a character, while UTF8 use minimum one bytes, so if the String content is mainly ascii characters, UTF8 can save a lot of memory.

Thanks @dain for sharing these insigts.

Another usage for Slice(in Presto) is to represent raw bytes(VARBINARY type in SQL):

// use it as raw bytes
block.getSlice().getBytes()

You can get a Slice from a Block just like primitive types Int, Long etc.

Block

Since Page is consisted of Blocks, so we will introduce Block first. Block can be think of an array of same class(int, long, Slice etc.) of data. Each data item occupies a position, the total position count represents the total number of rows of data the Block is holding (Block only holds one column of these rows).

Page and Block

Block has defined several sets of APIs, one of them is the getXXX methods, lets take at getInt as an sample:

/**
    * Gets a little endian int at {@code offset} in the value at {@code position}.
    */
default int getInt(int position, int offset)
{
    throw new UnsupportedOperationException(getClass().getName());
}

Typically one Block only supports one getXxx method, because all the data in one Block belongs to one column which has one certain type.

Another method Block defined is copyPositions, instead of get one value from Block, it gets a list of values specified by a list of positions as a new Block, I suspect this is used in projection.

/**
 * Returns a block containing the specified positions.
 * All specified positions must be valid for this block.
 * <p>
 * The returned block must be a compact representation of the original block.
 */
Block copyPositions(List<Integer> positions);

Along with Block, Presto also defined BlockEncoding which determines how Block is serialized & deserialized:

public interface BlockEncoding
{
    ...

    /**
     * Read a block from the specified input.  The returned
     * block should begin at the specified position.
     */
    Block readBlock(SliceInput input);

    /**
     * Write the specified block to the specified output
     */
    void writeBlock(SliceOutput sliceOutput, Block block);
    ...
}

Lets take the simplest BlockEncoding: IntArrayBlockEncoding as an example, its readBlock looks like this:

int positionCount = block.getPositionCount();
sliceOutput.appendInt(positionCount);

encodeNullsAsBits(sliceOutput, block);

for (int position = 0; position < positionCount; position++) {
    if (!block.isNull(position)) {
        sliceOutput.writeInt(block.getInt(position, 0));
    }
}

As we can see from the code, it first encode the position, then encode data isNull as a bitmap, and finally write all the ints into the output. Looks very similar as network protocol code.

Page

Page is consisted of blocks:

public class Page
{
    private final Block[] blocks;
    private final int positionCount;
    ...
}

Besides blocks, Page has another concept called channel: every Block is a channel for the Page, the total count of blocks is the channel count. So lets summarize how data is structured here, When there's some rows to send, Presto will

  • Put every column into a separate Block.
  • Put these Blocks into a Page.
  • Send the Page

Page is the data structure which holds data and being transferred between Presto physical execution operators: upstream operator produce output through getOutput():

/**
 * Gets an output page from the operator.  If no output data is currently
 * available, return null.
 */
Page getOutput();

Downstream operators get input through addInput() method:

/**
 * Adds an input page to the operator.  This method will only be called if
 * {@code needsInput()} returns true.
 */
void addInput(Page page);

Page flowing in Presto

Just like Block, Page also needs to serialized & deserialized, serialization happens when data needs to transferred between workers. When Page is serialized, it will first encode the Blocks using corresponding BlockEncoding, and then if a compressor is available, it will try to compress the encoded block data, if compression works well(have encoding rate lower than: 0.8), it will use the compressed the data, otherwise the uncompress data is used. The encoded block data will be put into a class named SerializedPage along with some statistic information: the byte size of the page before & after compression.

Summary

Today we introduced the three core data structures in Presto: Slice, Block and Page. In short, Slice is more developer friendly virtual memory, Block represents a column, Page represents a row group. Hope you enjoyed this article.

目录
相关文章
|
8月前
Transparent Data Encryption Data Dynamic and Data Dictionary Views You can query a set of dynamic and data dictionary views to find more information about Transparent Data Encryption (TDE) data.
Transparent Data Encryption Data Dynamic and Data Dictionary Views You can query a set of dynamic and data dictionary views to find more information about Transparent Data Encryption (TDE) data.
59 2
|
8月前
Elasticsearch【问题记录 02】【不能以root运行es + max virtual memory areas vm.max_map_count [65530] is too low处理】
【4月更文挑战第12天】Elasticsearch【问题记录 02】【不能以root运行es + max virtual memory areas vm.max_map_count [65530] is too low处理】
85 3
|
7月前
|
缓存 安全
max file descriptors [65535] for elasticsearch process is too low,【已解决】
max file descriptors [65535] for elasticsearch process is too low,【已解决】
227 1
|
8月前
|
Docker 容器
devmapper: Thin Pool has 162394 free data blocks which is less than minimum required 163840 free dat
devmapper: Thin Pool has 162394 free data blocks which is less than minimum required 163840 free dat
77 0
|
编解码 算法 测试技术
[译]Page Multiplexing and Ordering in a Physical Ogg Stream
Ogg容器格式的设计和排列受几个高级设计决策支配,这些决策构成了特定的低级设计决策的依据。
93 0
|
缓存 Java 关系型数据库
关于page Cache和memory mappped Files 和zero copy
关于page Cache和memory mappped Files 和zero copy
175 0
关于page Cache和memory mappped Files 和zero copy
SAP WM 为Storage Type 004激活SUM报错 - Storage types without pick-point stor.type require partial pallet
SAP WM 为Storage Type 004激活SUM报错 - Storage types without pick-point stor.type require partial pallet
SAP WM 为Storage Type 004激活SUM报错 - Storage types without pick-point stor.type require partial pallet
torch.distributed.init_process_group(‘gloo’, init_method=‘file://tmp/somefile’, rank=0, world_size=1
torch.distributed.init_process_group(‘gloo’, init_method=‘file://tmp/somefile’, rank=0, world_size=1
615 0
torch.distributed.init_process_group(‘gloo’, init_method=‘file://tmp/somefile’, rank=0, world_size=1
|
关系型数据库 数据库管理 Oracle