Pixels Cache

Pixels cache is a distributed columnar cache that is co-located with the query (compute) engine. It consists of a cache coordinator on the master node and a cache manager on each worker node of the query engine cluster. Both are implemented in pixels-daemon, and the underlying functionalities are implemented in pixels-cache.

The cache coordinator maintains the cache plan, which decides which column chunk of which row group is cached on which worker node, while the cache manager on each worker node listens for updates to the cache plan and replaces the cache content on its node accordingly.

How It Works

The cache plan is stored in etcd with the following data model:

  1. layout_version -> {schema_name}.{table_name}:{layout_version}: the data layout version, updated by the user or program that wants to trigger cache loading or replacement. Only increasing layout versions are accepted.
  2. cache_version -> {schema_name}.{table_name}:{layout_version}: the cache version, set by the cache coordinator to notify the cache workers to load or replace the cache once the cache tasks are ready in etcd.
  3. cache_location_{layout_version}_{worker_hostname} -> files: records the array of files to be cached on the worker (cache manager) under the specified cache version (these keys can be inspected with etcdctl, as sketched after this list).
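
These keys can be inspected directly with etcdctl. The following is a minimal sketch, not part of any Pixels script, assuming etcd listens on $host:$port (as in the cache scaling example below):

# inspect the current data layout version and cache version
etcdctl --endpoints=$host:$port get layout_version
etcdctl --endpoints=$host:$port get cache_version
# list the per-worker cache tasks for all layout versions
etcdctl --endpoints=$host:$port get --prefix cache_location_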

The cache is read and updated as follows:

  1. When the layout optimizer generates a new layout, it writes the new layout (with a new layout_version) into etcd.
  2. The cache coordinator monitors layout_version. Once it finds a newer layout_version, it creates the corresponding cache tasks for each cache worker in etcd and then updates cache_version to the latest layout_version.
  3. The cache workers monitor cache_version. When a cache worker finds a new cache_version, it reads its cache task from cache_location_{layout_version}_{worker_hostname} and sets itself to busy to avoid concurrent cache updates. After that, the cache worker begins to load or replace its local cache content (a command-line sketch of this protocol is shown after this list).
  4. When a query comes, the Presto/Trino Coordinator checks etcd for the cache plan and thus finds the available caches for its query splits.
  5. Each Presto/Trino Worker executes query splits with the caching information (whether the column chunks in the query split are cached or not) and calls PixelsCacheReader to read the locally cached column chunks (if any).
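
This protocol can be observed from the command line. The following is only an illustrative sketch (the coordinator and workers use etcd client APIs internally), assuming etcd listens on $host:$port and the cached table is tpch.lineitem:

# watch cache_version, as a cache worker does internally
etcdctl --endpoints=$host:$port watch cache_version
# in another terminal, bump the layout version to trigger the coordinator
# (this is what load-cache.sh does; see the Installation section below)
etcdctl --endpoints=$host:$port put layout_version tpch.lineitem:1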

Installation

Install vmtouch

Find vmtouch-1.3.1.tar.xz in scripts/tars under the Pixels source code folder and decompress it anywhere. Enter the decompressed folder and run:

make install

to build and install vmtouch on the operating system.
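
For reference, the full sequence looks roughly like this, assuming the tarball extracts into a folder named vmtouch-1.3.1 and that installing to the system requires root:

# decompress the tarball (the target folder is just an example)
mkdir -p ~/opt
tar -xJf scripts/tars/vmtouch-1.3.1.tar.xz -C ~/opt
cd ~/opt/vmtouch-1.3.1
# build and install vmtouch
sudo make install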

Install Pixels

Install Pixels following the instructions HERE, but do not start Pixels before finishing the following configurations.

Check the following settings related to pixels-cache in PIXELS_HOME/etc/pixels.properties on each node:

# the base location of the cache zone files, the file name of each cache zone adds a postfix to the base location
cache.base.location=/mnt/ramfs/pixels.cache
# the user available size of the whole cache, including all zones, in bytes
cache.size=68719476736
# the location of the index files, the name of each index file adds a postfix to the base location
index.base.location=/mnt/ramfs/pixels.index
# the user available size of the whole index, including the global index and the zone indexes, in bytes
index.size=1073741824
# the number of zones in the cache
cache.zone.num=3
# the number of swap zones in the cache
cache.zone.swap.num=1
# the scheme of the storage system to be cached
cache.storage.scheme=hdfs
# set to true if cache.storage.scheme is a locality sensitive storage such as hdfs
cache.absolute.balancer.enabled=true
# set to true to enable pixels-cache
cache.enabled=true
# set to true to read cache without memory copy
cache.read.direct=true

# heartbeat lease ttl must be larger than heartbeat period
heartbeat.lease.ttl.seconds=20
# heartbeat period must be larger than 0
heartbeat.period.seconds=10

The above values are a good default setting for each node to cache up to 64GB of data of the table pixels.test_105 stored on HDFS. Change cache.storage.scheme to cache data stored in a different storage system.

Mount In-memory File System

On each worker node, create and mount an in-memory file system with 65GB capacity:

sudo mkdir -p /mnt/ramfs
sudo mount -t tmpfs -o size=65g tmpfs /mnt/ramfs

The size parameter of the mount command should be larger than or equal to the sum of cache.size and index.size in PIXELS_HOME/etc/pixels.properties, but must be smaller than the available physical memory size.
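
With the default settings above, cache.size (68719476736 bytes = 64GB) plus index.size (1073741824 bytes = 1GB) equals 65GB, so a 65g tmpfs is just large enough. A quick sanity check on a worker node (the numbers only illustrate the default values):

# cache.size + index.size, in GB; should not exceed the tmpfs size
echo $(( (68719476736 + 1073741824) / 1024 / 1024 / 1024 ))
# confirm the tmpfs mount and its size
df -h /mnt/ramfs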

Start Pixels (with cache)

Enter PIXELS_HOME and start all Pixels daemons using:

./sbin/start-pixels.sh

If you are starting the daemons in a cluster with multiple nodes, set the hostnames of the worker nodes in PIXELS_HOME/sbin/workers and run start-pixels.sh on the coordinator node. Each line in PIXELS_HOME/sbin/workers is the hostname of a worker node. If a worker node has a different PIXELS_HOME environment variable than the coordinator node, append the PIXELS_HOME variable to the hostname, separated by a space, like this:

worker1 /home/pixels/worker1_pixels_home
worker2 /home/pixels/worker2_pixels_home
...

On each worker node, pin the cache in memory using:

sudo -E ./sbin/pin-cache.sh

Modify CACHE_PATH in pin-cache.sh if it is not consistent with the mount point of the in-memory file system that stores the cache and index files.
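
pin-cache.sh keeps the cache and index files resident in memory (this is what vmtouch, installed above, is used for). A minimal sketch of pinning the default mount point manually with vmtouch; the script remains the supported way to do this:

# lock all files under the in-memory file system into memory
# and keep a background daemon holding the lock
sudo vmtouch -d -l /mnt/ramfs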

Then create a new data layout for the cached table, and update layout_version of the cached table in etcd to trigger cache loading or replacement:

./sbin/load-cache.sh {schema_name}.{table_name}:{layout_version}
# e.g., ./sbin/load-cache.sh tpch.lineitem:1

schema_name and table_name specify which table to cache, and layout_version specifies which layout version of the table to cache. Note that pixels-cache only caches data in the compact path of the layout, so ensure the table has been compacted on that layout. See examples of compacting tables HERE. Currently, we only cache the full compact files that have the number of row groups defined by numRowGroupInFile in the LAYOUT_COMPACT field of the layout in metadata. The tail compact file (if it exists), which has fewer row groups than numRowGroupInFile, is ignored during cache loading or replacement.

If you have modified the etcd hostname and port in PIXELS_HOME/etc/pixels.properties, change the ENDPOINTS property in load-cache.sh as well.
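
After triggering cache loading, you can check whether the coordinator has published the cache plan by inspecting the keys described in How It Works. A minimal sketch, assuming the default etcd endpoint; the layout version and worker hostname below are only examples:

# cache_version should eventually be updated to the loaded layout version
etcdctl --endpoints=$host:$port get cache_version
# the cache task for a specific worker under layout version 1
etcdctl --endpoints=$host:$port get cache_location_1_worker1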

Stop Pixels and clear cache

To stop Pixels, run:

./sbin/stop-pixels.sh

on the coordinator node to stop all Pixels daemons in the cluster.

The cache is not lost when Pixels is stopped, and it can be reused the next time Pixels is started.

To clear the cache and free the memory, run:

sudo -E ./sbin/unpin-cache.sh

on each worker node to release the memory pinned for the cache. After that, you can delete the shared-memory files under cache.base.location and index.base.location on each worker node to finally release the memory occupied by the cache. You can also unmount the in-memory file system, but this is optional: it is automatically unmounted when the operating system restarts.
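
For example, with the default locations from the configuration above, the cleanup could look like this (the cache zone and index files add postfixes to their base locations, so globs are used here for illustration):

# remove the cache and index files
sudo rm -f /mnt/ramfs/pixels.cache* /mnt/ramfs/pixels.index*
# optionally unmount the in-memory file system
sudo umount /mnt/ramfs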

Then, run:

./sbin/reset-cache.sh

on any node in the cluster to reset the states related to pixels-cache in etcd. If you have modified the etcd hostname and port in PIXELS_HOME/etc/pixels.properties, change the ENDPOINTS property in reset-cache.sh as well.

Cache Scaling

To scale the cache, set the cache_expand_or_shrink key in etcd to trigger cache expansion or shrinking. The key's values are interpreted as follows:

  • 0: No scaling operation (default state)
  • 1: Trigger cache expansion (scale out)
  • 2: Trigger cache shrink (scale in)

# Set cache_expand_or_shrink to 1 to trigger cache expansion
etcdctl --endpoints=$host:$port put cache_expand_or_shrink 1

Notes for Cache Scaling:

  • The scaling operation does not modify the configuration file, so the scaling effect is no longer active once the PixelsWorker restarts.
  • Currently, shrinking cannot be combined with expansion within the same PixelsWorker lifecycle.
  • The cache automatically creates new physical zone files during expansion, but the zone file removed by a shrink must be deleted manually afterwards. The deleted physical zone file is the one located before the swap zones (the swap zones are positioned before the last location).
  • Added or deleted physical zone files must be pinned or unpinned manually (see the sketch after this list).
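
A sketch of triggering a shrink and cleaning up afterwards, assuming the default etcd endpoint and mount point; the zone file postfix is a placeholder that depends on your deployment:

# Set cache_expand_or_shrink to 2 to trigger cache shrinking
etcdctl --endpoints=$host:$port put cache_expand_or_shrink 2
# after the shrink, manually delete the obsolete zone file
# (replace the placeholder postfix with the actual zone file name)
sudo rm /mnt/ramfs/pixels.cache.<obsolete_zone_postfix>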