Snowflake Cost Optimization Steps

Hardik Patel
5 min read · Apr 7, 2024


Whenever a new technology arrives and we start using it, its features look awesome and we end up implementing a lot of things on top of it. Initially, you build POCs to see whether your ideas are feasible with the new technology.

In the same way, I had to work on POCs using the Snowflake Data Cloud platform. We had an initial budget to build them, and it was sufficient because the data volume was low at that time, so we did not pay much attention to the cost side of the product.

After some time, we decided to move to a production workload, and suddenly Snowflake credit consumption exploded. As a newbie, you can read a lot of material and get the general idea, but that does not always work for your specific scenarios. Over time we found a lot of improvements, and we are still finding more.

Before looking at how to reduce credit consumption, let's understand how you are charged in Snowflake.

Compute Resources (Virtual Warehouses)

Snowflake uses a credit-based system.

1 credit ~= one hour of usage of an X-Small virtual warehouse.

Warehouses come in various sizes (X-Small, Small, Medium, …, 6X-Large), with bigger warehouses consuming more credits per hour.

Charges are determined by the following (a sample query for tracking credit usage per warehouse follows this list):

  • Warehouse size.
  • The length of time the warehouse is running (even if it is idle).
  • Whether you use multi-cluster warehouses, which can scale out and back in automatically (you are charged only for the clusters that are actually running).
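
To see where those credits actually go, you can query the WAREHOUSE_METERING_HISTORY view in the ACCOUNT_USAGE schema. A minimal sketch (the 30-day window is an arbitrary choice on my side):

    -- Credits consumed per warehouse over the last 30 days
    SELECT
        warehouse_name,
        SUM(credits_used)                AS total_credits,
        SUM(credits_used_compute)        AS compute_credits,
        SUM(credits_used_cloud_services) AS cloud_services_credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY total_credits DESC;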

Storage

Snowflake charges a monthly fee for the average amount of compressed data you store within Snowflake.

This fee is per terabyte (TB) and can vary slightly depending on your Snowflake Edition and the cloud region where your account is located.
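
To see which tables are driving the storage bill, you can query the TABLE_STORAGE_METRICS view in the ACCOUNT_USAGE schema. A rough sketch:

    -- Largest tables by total stored bytes (active + Time Travel + Fail-safe)
    SELECT
        table_catalog,
        table_schema,
        table_name,
        ROUND((active_bytes + time_travel_bytes + failsafe_bytes) / POWER(1024, 4), 3) AS total_tb
    FROM snowflake.account_usage.table_storage_metrics
    WHERE deleted = FALSE
    ORDER BY total_tb DESC
    LIMIT 20;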

Data Transfer

There are no charges for transferring data into a Snowflake account (data ingress).

Fees apply for transferring data out (data egress):

  • Across different regions within the same cloud provider.
  • To a completely different cloud provider.

The cost per GB varies depending on your cloud region.
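
Egress charges can be tracked through the DATA_TRANSFER_HISTORY view in the ACCOUNT_USAGE schema; a small sketch (again, the 30-day window is just an example):

    -- Outbound data transfer grouped by source/target cloud and region
    SELECT
        source_cloud,
        source_region,
        target_cloud,
        target_region,
        transfer_type,
        ROUND(SUM(bytes_transferred) / POWER(1024, 3), 2) AS gb_transferred
    FROM snowflake.account_usage.data_transfer_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2, 3, 4, 5
    ORDER BY gb_transferred DESC;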

How do you reduce credit consumption in Snowflake?

There are multiple ways of doing it, but in my experience it works best if you follow them in one sequence. If your experience is different, please add it in the comments section of this article.

Step 1: Find the costliest queries and try to optimize them!

Most of the time we run non-optimized queries, end up spending a lot of credits, and then start looking at increasing the warehouse size or creating new dedicated warehouses. But if you look at the query profiles of those costly queries, you will usually find that there is still room for improvement in the queries themselves.
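
A simple way to surface those costly queries is to sort the query history by elapsed time and look at queries that also scan many partitions or spill data. A sketch (the 7-day window and the 50-row limit are arbitrary):

    -- Longest-running successful queries in the last 7 days
    SELECT
        query_id,
        warehouse_name,
        warehouse_size,
        total_elapsed_time / 1000 AS elapsed_seconds,
        partitions_scanned,
        partitions_total,
        bytes_spilled_to_local_storage,
        bytes_spilled_to_remote_storage,
        query_text
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND execution_status = 'SUCCESS'
    ORDER BY total_elapsed_time DESC
    LIMIT 50;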

Only once you have optimized your queries and there is genuinely no room left for improvement should you look at other ways of improving performance without spending extra money.

To optimize a query, use Query Profiling to check its execution plan and identify potential performance bottlenecks like the ones below (a SQL sketch for pulling these operator-level statistics follows the list):

  • Spills to Disk — Indicates that the warehouse couldn’t hold all the required data in memory, leading to slower disk operations. A larger warehouse might be needed.
  • Long Execution Time for Specific Operators — Points to computationally expensive parts of your query that could benefit from more parallel processing power.
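
Besides the Query Profile page in Snowsight, similar operator-level statistics can be pulled with the GET_QUERY_OPERATOR_STATS table function. The exact field names inside the variant columns may differ slightly from this sketch, so treat it as a starting point:

    -- Operator-level profile of one expensive query (replace <query_id>)
    SELECT
        operator_id,
        operator_type,
        operator_statistics:spilling:bytes_spilled_local_storage  AS local_spill_bytes,
        operator_statistics:spilling:bytes_spilled_remote_storage AS remote_spill_bytes,
        execution_time_breakdown:overall_percentage               AS pct_of_query_time
    FROM TABLE(GET_QUERY_OPERATOR_STATS('<query_id>'))
    ORDER BY pct_of_query_time DESC;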

Step 2: Cluster tables that are heavily used and slow for many queries

Find the tables that are large in size, and look at query patterns using the snowflake.account_usage.query_history view. This view gives you a lot of information at the query level, such as execution_time, elapsed_time, and partitions_scanned. Using the query_parameterized_hash column, you can group queries; within each group, the queries are identical except for different filter values.
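
For example, grouping by query_parameterized_hash and ranking the groups by total elapsed time quickly shows which query shapes are worth investigating. A sketch (7-day window chosen arbitrarily):

    -- Repeated query shapes ranked by their total execution time
    SELECT
        query_parameterized_hash,
        COUNT(*)                       AS run_count,
        SUM(total_elapsed_time) / 1000 AS total_elapsed_seconds,
        AVG(partitions_scanned)        AS avg_partitions_scanned,
        ANY_VALUE(query_text)          AS sample_query_text
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    GROUP BY query_parameterized_hash
    ORDER BY total_elapsed_seconds DESC
    LIMIT 20;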

After looking at the query patterns, you will find out which tables are accessed the most. You can also get these details from snowflake.account_usage.access_history. From those details, identify the table names and the columns that are used most often.
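
One way to get table and column usage out of access_history is to flatten the base_objects_accessed column. Note that this view requires Enterprise Edition or higher, and the sketch below assumes the standard JSON structure of that column:

    -- Most frequently accessed tables and columns over the last 30 days
    SELECT
        obj.value:objectName::STRING AS table_name,
        col.value:columnName::STRING AS column_name,
        COUNT(DISTINCT ah.query_id)  AS query_count
    FROM snowflake.account_usage.access_history ah,
         LATERAL FLATTEN(input => ah.base_objects_accessed) obj,
         LATERAL FLATTEN(input => obj.value:columns)        col
    WHERE ah.query_start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY query_count DESC
    LIMIT 50;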

Use the system functions SYSTEM$CLUSTERING_INFORMATION or SYSTEM$CLUSTERING_DEPTH to check whether the selected columns are good candidates for clustering, or whether Snowflake has already naturally clustered the data on those columns.
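
The table and column names below are placeholders; the point is just how the two functions are called:

    -- How well is the table already clustered on the candidate columns?
    SELECT SYSTEM$CLUSTERING_INFORMATION('my_db.my_schema.big_table', '(customer_id, event_date)');

    -- Or just the average clustering depth (lower is better)
    SELECT SYSTEM$CLUSTERING_DEPTH('my_db.my_schema.big_table', '(customer_id, event_date)');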

Finally, if you have identified the right table and column names, define a clustering key on the table.
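
Defining (or later removing) the clustering key is a single ALTER TABLE statement; again, the table and column names are hypothetical:

    -- Define a clustering key on the identified columns
    ALTER TABLE my_db.my_schema.big_table CLUSTER BY (customer_id, event_date);

    -- Drop it later if the reclustering cost outweighs the benefit
    ALTER TABLE my_db.my_schema.big_table DROP CLUSTERING KEY;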

This will reduce the execution time of all the queries that use the same table and the same columns as filters, and eventually it will save a lot of credits.

Note: you have to check a few points before applying clustering.

  • The cardinality of the clustering columns should not be too high.
  • If the table is updated very frequently, the cost of automatic reclustering will rise. You can see the clustering cost in the cost management view in the Snowsight UI, or query it as shown after this list.
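
The same clustering cost is available in the AUTOMATIC_CLUSTERING_HISTORY view, which makes it easy to watch per table; a sketch:

    -- Credits spent on automatic reclustering, per table, over the last 30 days
    SELECT
        database_name,
        schema_name,
        table_name,
        SUM(credits_used)         AS clustering_credits,
        SUM(num_rows_reclustered) AS rows_reclustered
    FROM snowflake.account_usage.automatic_clustering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2, 3
    ORDER BY clustering_credits DESC;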

Step 3: Warehouse Resizing and Load Distribution

In this step, analyze the query history to see whether queries are utilizing the warehouse fully. If they cannot use the warehouse at full capacity, the warehouse is oversized for those queries, and downsizing it will help optimize warehouse cost.

If the warehouse is being used at full capacity and queries are still slow, or data is spilled to local or remote storage, it indicates that the warehouse size is not enough for those queries. In that case, increasing the warehouse size improves performance, and because queries finish faster it can still reduce overall cost.

While analyzing the query history and warehouse load history, if you see that the warehouse load is very high during a specific time period, you can write a task that increases the warehouse size for that period and downsizes it once the period is over. That also helps improve performance during that window and eventually optimizes cost.
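
As a sketch of that idea, a pair of scheduled tasks can resize the warehouse around a known busy window. All object names and the CRON times here are hypothetical, and the tasks themselves need a (small) warehouse or can be made serverless:

    -- Scale up before the nightly peak...
    CREATE OR REPLACE TASK scale_up_reporting_wh
      WAREHOUSE = admin_wh
      SCHEDULE  = 'USING CRON 0 18 * * * UTC'
    AS
      ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';

    -- ...and back down once the peak is over
    CREATE OR REPLACE TASK scale_down_reporting_wh
      WAREHOUSE = admin_wh
      SCHEDULE  = 'USING CRON 0 22 * * * UTC'
    AS
      ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';

    -- Tasks are created in a suspended state and must be resumed
    ALTER TASK scale_up_reporting_wh RESUME;
    ALTER TASK scale_down_reporting_wh RESUME;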

When you analyze the query history and find a few queries that run regularly but for which the warehouse is not sufficient, or a warehouse that is sized for those few queries but oversized for everything else, use two separate warehouses and split the load. That way you will find the optimal balance between good performance and lower cost.
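
A minimal sketch of such a split, with hypothetical names and sizes, is two warehouses with aggressive auto-suspend; the heavy jobs then switch to the larger one explicitly with USE WAREHOUSE:

    -- A small warehouse for the regular light queries
    CREATE WAREHOUSE IF NOT EXISTS light_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND   = 60
      AUTO_RESUME    = TRUE;

    -- A separate, larger warehouse only for the few heavy queries
    CREATE WAREHOUSE IF NOT EXISTS heavy_wh
      WAREHOUSE_SIZE = 'LARGE'
      AUTO_SUSPEND   = 60
      AUTO_RESUME    = TRUE;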

To do this analysis, you normally need the following data points (a combined analysis query follows the list):

  • SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY — Gives you the list of all queries that ran in the past.
  • SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_LOAD_HISTORY — Lets you see a warehouse's load at different times, which helps you decide on time-bound warehouse sizing.
  • BYTES_SPILLED_TO_LOCAL_STORAGE — For each query, how much data was written to local disk during execution. This indicates that the warehouse could not hold all the required data in memory, leading to slower disk operations; a larger warehouse might be needed. You will find this data point in the query history.
  • BYTES_SPILLED_TO_REMOTE_STORAGE — For each query, how much data was written to remote storage (e.g., S3) during execution; this is the costliest operation in a query. It indicates that the warehouse could not hold the required data in memory or even on local disk, leading to the slowest I/O; a larger warehouse is needed. You will find this data point in the query history.
  • BYTES_SCANNED — Shows how much data a query scanned from storage and cache combined.
  • QUERY_LOAD_PERCENT — Gives an idea of how much of the warehouse's compute a query used. For example, if this is 100% for a query that ran on a Medium warehouse, all 4 nodes were used to execute the query. That does not necessarily mean the warehouse was fully saturated, since the warehouse assigns resources from each of the 4 nodes.
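
Putting those data points together, a per-warehouse summary like the following (7-day window chosen arbitrarily) is a reasonable starting point for right-sizing decisions:

    -- Spilling and load summary per warehouse and size
    SELECT
        warehouse_name,
        warehouse_size,
        COUNT(*)                             AS query_count,
        AVG(query_load_percent)              AS avg_load_percent,
        SUM(bytes_spilled_to_local_storage)  AS local_spill_bytes,
        SUM(bytes_spilled_to_remote_storage) AS remote_spill_bytes,
        SUM(bytes_scanned)                   AS total_bytes_scanned
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND warehouse_name IS NOT NULL
    GROUP BY warehouse_name, warehouse_size
    ORDER BY remote_spill_bytes DESC;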

Conclusion:

  • Always write optimized queries; this alone will reduce your cost to a great extent.
  • Use appropriate warehouse sizes.
  • Split the load across warehouses based on query characteristics to optimize both cost and performance.

Please leave a comment if you have built any script or view that can recommend a warehouse size after analyzing query history. I am also trying to build something similar to automate this.

Thank you for reading and keep learning.
