Caching in Snowflake Cloud Data Warehouse

Caching is the result of Snowflake's Unique architecture which includes various levels of caching to help speed your queries. This tutorial provides an overview of the techniques used, and some best practice tips on how to maximize system performance using caching

What is Snowflake Caching ?

Imagine executing a query that takes 10 minutes to complete. Now if you re-run the same query later in the day while the underlying data hasn’t changed, you are essentially doing again the same work and wasting resources

Instead Snowflake caches the results of every query you ran and when a new query is submitted, it checks previously executed queries and if a matching query exists and the results are still cached, it uses the cached result set instead of executing the query. This can greatly reduce query times because Snowflake retrieves the result directly from the cache.

Snowflake Cache results are global and can be used across users. 
eg. If User A executes a query and the results are cached, 
When User B executes the same query the results will be served from cache

Type of Caching Layers in Snowflake ?

1. Query Results Caching:

The Results cache holds the results of every query executed in the past 24 hours. These are available across virtual warehouses, so query results returned to one user is available to any other user on the system who executes the same query, provided the underlying data has not changed.

2. Virtual Warehouse Local Disk Caching

Whenever data is needed for a given query it's retrieved from the Remote Disk storage, and cached in SSD and memory of the Virtual Warehouse. This data will remain until the virtual warehouse is active. When there is a subsequent query fired an if it requires the same data files as previous query, the virtual warhouse might choose to reuse the datafile instead of pulling it again from the Remote disk

3. Metadata Cache

This is not really a Cache. Instead, It is a service offered by Snowflake. Snowflake automatically collects and manages metadata about tables and micro-partitions

  • Row Count
  • Table Size in Bytes
  • File references and table versions

For Micro-Partitions, Snowflake stores:

  • The range of values (MIN/MAX values)
  • Number of distinct values
  • NULL Count

For Clustering, Snowflake Stores:

  • The total number of Micro-Partitions
  • Number of Micro-Partitions containing values overlapping with each together
  • The depth of overlapping Micro-Partitions
    • This is an indication of how well-clustered a table is since as this value decreases, the number of pruned columns can increase.

All DML operations take advantage of micro-partition metadata for table maintenance. Some operations are metadata alone and require no compute resources to complete, like the query below

SELECT MIN(L_SHIP_DATE), MAX(L_SHIP_DATE) FROM LINE_ITEM;

Micro-partition metadata also allows for the precise pruning of columns in micro-partitions. When pruning, Snowflake does the following:

  1. Snowflake's pruning algorithm first identifies the micro-partitions required to answer a query.
  2. Snowflake will only scan the portion of those micro-partitions that contain the required columns.
  3. Snowflake then uses columnar scanning of partitions so an entire micro-partition is not scanned if the submitted query filters by a single column.

Benefits of Snowflake Query Caching ?

  • Results Cache is Automatic and enabled by default. You do not have to do anything special to avail this functionality
  • All the Results are Cached for 24 hours
  • There is no space restictions. Snowflake Cache has infinite space (aws/gcp/azure)
  • Cache is global and available across all WH and across users
  • Faster Results in your BI dashboards as a result of caching
  • Reduced compute cost as a result of caching

What happens to Cache results when the underlying data changes ?

Snowflake Cache results are invalidated when the data in the underlying micro-partition changes. Although more information is available in the Snowflake Documentation, a series of tests demonstrated the result cache will be reused unless the underlying data (or SQL query) has changed. As a series of additional tests demonstrated inserts, updates and deletes which don't affect the underlying data are ignored, and the result cache is used, provided data in the micro-partitions remains unchanged

Finally, results are normally retained for 24 hours, although the clock is reset every time the query is re-executed, up to a limit of 30 days, after which results query the remote disk

How to disable Snowflake Query Results Caching?

To disable the Snowflake Results cache, run the below query. It should disable the query for the entire session duration

alter session set use_cached_result =false;

Different States of Snowflake Virtual Warehouse ?

  1. Run from Cold Virtual Warehouse:  When you all your Virtual warehouses are suspended (nothing active), and if you run a query, it will start a new instance of Virtual Warehouse (Cold). Which meant starting a NEW virtual warehouse (with no local disk caching), and executing the query.
  2. Run from Warm Virtual Warehouse:  Your virtual warehouse has been active and running for a while and has processed few queries, Then its called WARM Virtual warehouse. Now, If you disable the result caching, and repeat the query.It will make use of the local disk caching which it pulled in the past, which is termed as Warm Caching
  3. Run from Hot Virtual Warehouse:  Which means you repeated the query execute, and the result caching is switched on. The results are fully served from the cache and this is the most efficient operation among all the three types

Lets go through a small example to notice the performace between the three states of the virtual warehouse. In this example we have a 60GB table and we are running the same SQL query but in different Warehouse states

Cold Warehouse Warm Warehouse Hot Warehouse
Run Time 20 seconds 1.2 seconds 2 milliseconds The lower the Query Run time, the better
Remote Disk (Source) 12.5 GB 0 0 This is the data that is being pulled from Snowflake Micro partition files (Disk)
Local Disk Cache 0 12.5 GB 0 This is the files that are stored in the Virtual Warehouse disk and SSD Memory. The more the local disk is used the better
Results Cache 0 0 100% The results cache is the fastest way to fullfill a query


Next Section: Time Travel



MetricCamp.com