Building a high throughput snowflake D/W

91Social
Aug 16, 2023
2 min read

Updated: Feb 29, 2024

An IB product company with marquee clients for its middle and back office software + services was introducing a new data platform into the product. The data platform uses Snowflake as the warehouse for client data sharing and to scale throughput and availability. The initial product version ran into cost and throughput issues with Snowflake vis-a-vis existing SQL server database, and was proving to be infeasible.

As Snowflake is a good fit into the product architecture and started addressing some new use cases in the sales discussions, the solution is to make it operationally viable to increase its throughput by 5X and decrease its cost by 4X. This will bring it on par with a traditional RDBMS for most frequently used data and yield significant operational and cost efficiencies for long term data storage, data sharing and big data usecases.

91social focused on the core data ingestion and data querying out of snowflake while avoiding major refactorings to the design and existing code. To that end, Snowflake architecture is studied in depth and multiple POCs conducted to establish throughput and cost benefits before making changes to the product.

Observation 1: Data in Batches

Snowflake by design is good at ingesting data in a small number of large batches rather than real-time data in a large number of small batches. These are the results of throughput testing on a reasonably sized table with 25 columns

The data pipelines were adjusted and records were buffered through a temporary holding area so larger batches were formed to ensure maximum throughput with optimal data availability.

Observation 2: Table Partitions

Snowflake organizes data in tables by partitions, and these control the amount of disk I/O spent in inserting and retrieving data. As data for a specific business entity spans multiple partitions, the data used to read goes down significantly thus lowering the throughput and increasing the cost. The data used to read ratio is how much data is used from the total data read from the datastore. The higher the ratio, the better optimized it is.

On a table of 15M records with 25 columns, the partition columns were identified and created

The impact of partitions on throughput is

Final Outcome

The design changes to data pipelines and Snowflake resulted in a 4x reduction in monthly cost from an average of 4,000$ to 1,000$ for the smallest client payload and is on target to deliver larger gains for their bigger clients. The overall throughput for the data ingestion increased by 8x - 10x and read query latencies came down significantly.

Back

Building a high throughput snowflake D/W

What we offer

Resources

Company