Why migrate to Snowflake?

“Data in the 21st century is like oil in the 18th century”, and Snowflake makes it possible to get the most out of this resource. Snowflake is Big Data without the complexity. A data warehouse migration aims to satisfy an organization’s need for data. Snowflake is a great way to achieve precisely that thanks to its architecture, its features and its partners, regardless of the size of the organization. Startups benefit from Snowflake’s flexible scaling, which lets the platform grow with the startup, as well as from its low-cost pay-as-you-go model. An enterprise may use Snowflake to improve its existing solutions and to reduce cost and complexity by migrating all of its data to a single platform on which it can store and analyze that data without having to move it physically. This makes data-driven business decisions easier and more precise.

Snowflake is changing the game

For some years now, every company has had to decide whether to store its data on premises, in the cloud, or in a hybrid of both. Choosing the on-premises solution required a lot of money and time, since the company had to build and manage the necessary infrastructure before it could make any use of it. Cloud data warehouse solutions, on the other hand, were limited to the offerings of the cloud providers themselves: Amazon (AWS Redshift), Microsoft (Azure) and Google (Google BigQuery). Snowflake is now changing this game. Because it is delivered as “Software as a Service” (SaaS), there is no need to spend money on infrastructure or on its management. Snowflake is designed for the cloud. It is fully managed, and with its pay-as-you-go model you only pay for the resources you actually use. Furthermore, it runs on AWS, Azure or Google Cloud (from 2020 onwards) and can make use of their services, such as Amazon S3 or Azure Blob Storage. The illustration below explains the cloud service models with the analogy “Hamburger as a Service”. (Image taken from D. Anoshin, D. Shirokov, D. Strok, “Jumpstart Snowflake”.)

Snowflake’s advantage through architecture

The core idea of the cloud is to let companies pay only for the resources they actually use. Snowflake honours this intention, and its architecture shows it clearly.

Snowflake’s unique Multi-Cluster Shared Data Architecture combines the shared-disk and shared-nothing architectures, taking the best properties of both, and is split into three parts: the Cloud services, the Compute resources and the Storage resources. The Cloud services can be compared to a brain: a collection of services that coordinates the activities in the Data Warehouse, for instance by optimizing queries and taking care of security. The Compute resources, also known as Query Processing, include the Virtual Warehouses and can be described as the muscle of the system. The size of each Warehouse depends entirely on the needs of the organization, and separate Warehouses make it possible to isolate different workloads from one another. The third and last component is the Storage resources. They store any kind of data and are synchronously replicated across multiple disk devices as well as three availability zones, which guarantees high availability of the data. Furthermore, data is automatically split into Micro-Partitions, the physical data files that make up Snowflake’s logical tables. Each of them has a compressed size of up to about 16 MB (Snowflake’s documentation puts this at 50 MB to 500 MB of uncompressed data). These Micro-Partitions enable some of Snowflake’s most useful features, such as Time Travel and zero-copy cloning, which are discussed later.

But why is this architecture so special?

As mentioned earlier, an on-premises solution requires substantial spending on infrastructure right at the beginning of a project, without any immediate benefit. Snowflake’s architecture lets you scale the Storage and the Compute resources independently of each other, so you only pay for what you use at any given moment.

Storage can be bought either as On-Demand Storage or as fixed capacity purchased at the beginning of each month. The first option costs a little more per TB but is more flexible, since the costs correspond to actual usage. The second option has a lower price per TB, but you may misjudge how much storage you need, so it is not necessarily more economical.
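To make the trade-off between the two storage options concrete, here is a minimal sketch in Python. The prices per TB, the reserved amount, and the assumption that usage above the reserved capacity is billed at the on-demand rate are all hypothetical placeholders chosen for illustration, not Snowflake’s actual price list or contract terms.

```python
# Hypothetical prices per TB per month (placeholders, not an official price list).
ON_DEMAND_PRICE_PER_TB = 40.0
CAPACITY_PRICE_PER_TB = 23.0


def on_demand_cost(used_tb: float) -> float:
    """On-demand storage: pay only for the terabytes actually used."""
    return used_tb * ON_DEMAND_PRICE_PER_TB


def capacity_cost(used_tb: float, reserved_tb: float) -> float:
    """Pre-purchased capacity: the reserved amount is paid in full.

    Assumption for this sketch: usage above the reservation is billed
    at the on-demand rate.
    """
    overage_tb = max(0.0, used_tb - reserved_tb)
    return reserved_tb * CAPACITY_PRICE_PER_TB + overage_tb * ON_DEMAND_PRICE_PER_TB


# A company reserves 10 TB but ends up using a different amount.
for used in (4.0, 10.0):
    print(f"used {used} TB: on-demand {on_demand_cost(used):.2f}, "
          f"capacity {capacity_cost(used, reserved_tb=10.0):.2f}")
```

With these placeholder numbers the reserved capacity only pays off once actual usage gets close to the reserved amount; if the estimate is far too high, the on-demand option is cheaper despite its higher price per TB.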

With Snowflake you pay for Compute resources only while a Warehouse is running: credits are priced per hour and billed per second, with a 60-second minimum each time a Warehouse starts. Warehouses can be provisioned for tasks of arbitrary load. Imagine a company that runs an expensive ETL process every night, while during the day there are no or only a few ETL processes. In Snowflake you may choose a Warehouse for these ETL processes that is small (and therefore inexpensive) during the day to handle the few processes needed at that time. At night you can increase the size of the Warehouse (this is called “scale up” or “vertical scaling”) to improve performance. After the work is done, the Warehouse shuts down automatically and only restarts once there are more queries to handle. In the morning the Warehouse can be reduced in size again to handle the daily work. Since you only pay for a running Warehouse, you can also minimize your costs during the day, because the Warehouse auto-suspends whenever there is no work to do.
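The day and night scenario can be sketched as a small cost model in Python. It assumes standard Warehouses, where each size doubles the credits per hour of the previous one, and per-second billing with a 60-second minimum per start. The warehouse name `etl_wh` and the schedule (twenty short bursts by day, one two-hour run at night) are hypothetical, and the SQL strings are only built and printed here, not sent to Snowflake.

```python
# Credits per hour for standard warehouse sizes (each size doubles the previous one).
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "XLARGE": 16}
MINIMUM_BILLED_SECONDS = 60  # each start of a warehouse is billed for at least one minute


def credits_for_run(size: str, running_seconds: int) -> float:
    """Credits consumed by one uninterrupted run of a warehouse of the given size."""
    billed_seconds = max(running_seconds, MINIMUM_BILLED_SECONDS)
    return CREDITS_PER_HOUR[size] * billed_seconds / 3600


def resize_statement(warehouse: str, size: str) -> str:
    """Build the SQL statement that resizes a warehouse (scale up or down)."""
    if size not in CREDITS_PER_HOUR:
        raise ValueError(f"unknown warehouse size: {size}")
    return f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = '{size}'"


# Hypothetical schedule: twenty 5-minute bursts by day on XSMALL,
# one 2-hour ETL run at night on LARGE, auto-suspended in between.
day_credits = sum(credits_for_run("XSMALL", 5 * 60) for _ in range(20))
night_credits = credits_for_run("LARGE", 2 * 3600)
always_on_large = credits_for_run("LARGE", 24 * 3600)

print(resize_statement("etl_wh", "LARGE"))   # issued before the nightly ETL
print(resize_statement("etl_wh", "XSMALL"))  # issued in the morning
print(f"scheduled: {day_credits + night_credits:.2f} credits, "
      f"always-on LARGE: {always_on_large:.2f} credits")
```

The comparison with a LARGE Warehouse that never suspends shows where the savings come from: most of the day the Warehouse is either small or not running at all.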

 

 

Furthermore, Snowflake keeps information in several caches. Query results are cached for 24 hours in the Cloud services layer, which allows you to run the same query over and over without spinning up a Warehouse, as long as the underlying data has not changed. The Metadata Cache also lives in the services layer and enables Snowflake to answer general questions about the data, such as row counts, without a Warehouse. These concepts reduce the accumulated costs, since repeated queries do not need a Warehouse and the results of long-running queries can be retrieved without running them again.
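The effect of the 24-hour result cache on Warehouse usage can be illustrated with a toy model in Python. This is not how Snowflake implements its cache: the class `ToyResultCache` is hypothetical, it keys results on the exact query text only, and it ignores that the real cache is invalidated when the underlying data changes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple

RESULT_TTL_SECONDS = 24 * 60 * 60  # results are kept for 24 hours


@dataclass
class ToyResultCache:
    """Toy model: identical queries within the TTL are answered without a warehouse."""
    entries: Dict[str, Tuple[Any, float]] = field(default_factory=dict)
    warehouse_runs: int = 0  # how often a (billable) warehouse had to do the work

    def run(self, query: str, now: float, execute: Callable[[str], Any]) -> Any:
        cached = self.entries.get(query)
        if cached is not None and now - cached[1] < RESULT_TTL_SECONDS:
            return cached[0]  # cache hit: no warehouse needed
        self.warehouse_runs += 1  # cache miss: the warehouse executes the query
        result = execute(query)
        self.entries[query] = (result, now)
        return result


cache = ToyResultCache()
query = "SELECT COUNT(*) FROM orders"  # hypothetical query


def expensive(_query: str) -> int:
    return 1000  # stands in for a long-running query


cache.run(query, now=0, execute=expensive)          # miss, uses the warehouse
cache.run(query, now=3600, execute=expensive)       # hit, one hour later
cache.run(query, now=25 * 3600, execute=expensive)  # miss, entry is older than 24 hours
print(cache.warehouse_runs)
```

Only the first and the third call reach the Warehouse; the second one is served from the cached result.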

It’s not just the architecture

Besides its unique architecture, Snowflake offers great features such as zero-copy cloning and Time Travel.

Zero-copy cloning lets you duplicate even very large data sets in seconds, without using additional storage. Snowflake achieves this by taking a snapshot of the source at the instant the clone is created. Since the clone is writable and independent of its source, its data can be modified without affecting the source. Only once the clone is changed does it start to consume storage of its own, because the underlying data now differs. This makes cloning a great way to run tests without changing the original data or even (accidentally) deleting it. But even deleting data is not a big deal in Snowflake, thanks to its Time Travel feature. Time Travel allows you to restore data to the way it was at a specific point in time or before a particular query ran. You can even combine Time Travel and cloning to recreate an older state of the data and compare the present data with it.
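Why a clone costs nothing at first, and why Time Travel combines so naturally with cloning, can be sketched with a toy copy-on-write model in Python. The class `ToyTable` and the function `stored_partitions` are hypothetical; the sketch only assumes that partitions are immutable and that a table version is a list of references to them. It is an illustration of the idea, not Snowflake’s implementation.

```python
class ToyTable:
    """Toy model: a table version is a tuple of references to immutable partitions."""

    def __init__(self, partitions):
        self.versions = [tuple(partitions)]  # history of versions, oldest first

    @property
    def current(self):
        return self.versions[-1]

    def rewrite_partition(self, index, new_rows):
        """Changing data never edits a partition; it writes a new one and a new version."""
        parts = list(self.current)
        parts[index] = tuple(new_rows)
        self.versions.append(tuple(parts))

    def clone(self, version=-1):
        """Zero-copy clone: copies references only. An older version gives 'Time Travel'."""
        return ToyTable(self.versions[version])


def stored_partitions(*tables):
    """Number of distinct partitions that actually occupy storage."""
    return len({id(p) for t in tables for v in t.versions for p in v})


source = ToyTable([(1, 2, 3), (4, 5, 6)])
clone = source.clone()
print(stored_partitions(source, clone))  # the clone shares both partitions

clone.rewrite_partition(1, [4, 5, 60])   # only the changed partition is new
print(stored_partitions(source, clone))
print(source.current)                    # the source is unaffected

source.rewrite_partition(0, [0, 0, 0])   # the source changes later on
restored = source.clone(version=0)       # clone the state before that change
print(restored.current == ((1, 2, 3), (4, 5, 6)))
```

In this model the storage grows by exactly one partition per change, and an older state can be cloned as long as its partitions are still referenced by the history.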

Beyond its unique architecture and great features, there are more reasons to migrate to Snowflake. Snowflake has high security standards, for instance Multi-Factor Authentication and encryption. It supports structured data as well as semi-structured data such as XML, JSON or Parquet, which can easily be loaded into your database and transformed with partner tools like Fivetran, Matillion or Informatica. With more effort you may also implement your own Python solution for loading and transforming your data. Aside from these loading and transformation partners, you can use Tableau as a BI tool to visualize your data or Spark for data science projects. There are even more options than the aforementioned partners, since tools like dbt can connect to Snowflake via community solutions.
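As a rough idea of what the do-it-yourself Python route involves, the following sketch covers only one transformation step: flattening a nested JSON record into a flat row. The field names and the underscore naming scheme are hypothetical, and the actual loading into Snowflake (for example through a database connector) is left out.

```python
import json


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested JSON objects into a single-level dict of column names to values."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and prefix the child columns with the parent name.
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# Hypothetical semi-structured input record.
raw = '{"id": 7, "customer": {"name": "Ada", "address": {"city": "Berlin"}}}'
row = flatten(json.loads(raw))
print(row)
```

A real pipeline would additionally need to handle arrays, type conversion, error handling and the load itself, which is a large part of the extra effort compared with a partner tool.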

All in all, Snowflake solves many issues of other Data Warehouses. As Software as a Service, Snowflake eliminates the complexity of managing infrastructure and data. Its nearly unlimited scalability lets it handle data of any volume (fun fact: the biggest table in Snowflake has 64 trillion rows and belongs to Adobe), and with its elasticity, scaling vertically for performance and horizontally for concurrency, it can handle a wide range of workloads. Furthermore, Snowflake’s Multi-Cluster Shared Data Architecture makes it possible for everyone to have access to all the data.