top of page

Building a high throughput snowflake D/W

Updated: Feb 29



An IB product company with marquee clients for its middle and back office software + services was introducing a new data platform into the product. The data platform uses Snowflake as the warehouse for client data sharing and to scale throughput and availability. The initial product version ran into cost and throughput issues with Snowflake vis-a-vis existing SQL server database, and was proving to be infeasible.


As Snowflake is a good fit into the product architecture and started addressing some new use cases in the sales discussions, the solution is to make it operationally viable to increase its throughput by 5X and decrease its cost by 4X. This will bring it on par with a traditional RDBMS for most frequently used data and yield significant operational and cost efficiencies for long term data storage, data sharing and big data usecases.


91social focused on the core data ingestion and data querying out of snowflake while avoiding major refactorings to the design and existing code. To that end, Snowflake architecture is studied in depth and multiple POCs conducted to establish throughput and cost benefits before making changes to the product.


Observation 1: Data in Batches


Snowflake by design is good at ingesting data in a small number of large batches rather than real-time data in a large number of small batches. These are the results of throughput testing on a reasonably sized table with 25 columns



The data pipelines were adjusted and records were buffered through a temporary holding area so larger batches were formed to ensure maximum throughput with optimal data availability.


Observation 2: Table Partitions


Snowflake organizes data in tables by partitions, and these control the amount of disk I/O spent in inserting and retrieving data. As data for a specific business entity spans multiple partitions, the data used to read goes down significantly thus lowering the throughput and increasing the cost. The data used to read ratio is how much data is used from the total data read from the datastore. The higher the ratio, the better optimized it is.


On a table of 15M records with 25 columns, the partition columns were identified and created


The impact of partitions on throughput is


Final Outcome

The design changes to data pipelines and Snowflake resulted in a 4x reduction in monthly cost from an average of 4,000$ to 1,000$ for the smallest client payload and is on target to deliver larger gains for their bigger clients. The overall throughput for the data ingestion increased by 8x - 10x and read query latencies came down significantly.







38 views

Comments


bottom of page