Cloud Data Warehousing: Which Technology should you use?

The fourth industrial revolution – that is, the incorporation of smart technologies into industrial practices – has driven many companies to reinvent their IT solutions over the past decade. Data obtained from many different sources, such as mobile devices, Internet of Things (IoT) applications, smart sensors, or customer profiling solutions, requires flexible data warehouses to store, enrich, and expose it to end users. Furthermore, the amount of data grows rapidly. Traditional on-premises data warehouses would need to be extended regularly, resulting in considerable administrative overhead.

Cloud data warehouses remedy this shortcoming by their native flexibility. Further, the modern design of typical cloud data warehouses allows for flexible integration of data from a variety of sources – structured, semi-structured or unstructured – usually by leveraging other services of the respective cloud provider.

Several cloud data warehouse solutions have emerged in the past decade, each building upon a different strategy. Prominent players in the field are AWS Redshift, Azure Synapse Analytics, Google BigQuery, and Snowflake. But which one should you choose, given your use case and the IT ecosystem already in place in your company? To provide a basis for this decision, we will compare these four technologies at a high level, considering aspects such as scalability, maintainability, and pricing models. But before addressing these details, first some general remarks on the four cloud data warehouses:

AWS Redshift

Redshift was first released by AWS in 2012 and is based on PostgreSQL. It uses a shared-nothing architecture, meaning that data is distributed across multiple nodes within a cluster. Each node operates only on the data stored on its local disk, and the results are merged by a leader node. This data locality leads to high performance but requires some maintenance.
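The shared-nothing idea can be sketched in a few lines of Python. The hash function, node count, and distribution key below are purely illustrative (Redshift's actual row-distribution internals differ), but the principle is the same: each node scans only its local slice, and a leader merges the partial results.

```python
from collections import defaultdict
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Hash a distribution key to a node index (illustrative only)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

rows = [{"customer_id": f"c{i}", "amount": i * 10} for i in range(8)]

# Distribute rows across 4 nodes by hashing the distribution key.
slices = defaultdict(list)
for row in rows:
    slices[node_for(row["customer_id"], num_nodes=4)].append(row)

# Each node aggregates its local slice; the leader merges the partials.
partials = {node: sum(r["amount"] for r in local) for node, local in slices.items()}
total = sum(partials.values())
print(total)  # 280 — same result as a single-node scan
```

A skewed distribution key would leave some nodes with far more rows than others, which is exactly why data distribution matters for performance in this architecture.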

Azure Synapse Analytics

The first version of Azure's data warehouse solution was released in 2016; a second generation of Synapse Analytics was then made available in 2018. The data warehouse uses a multi-cluster shared-data approach, in which all nodes within a cluster share the same data pool. This reduces maintenance effort but introduces some overhead for network transfer.

Google BigQuery

BigQuery is a fully serverless data warehouse designed by Google and based on Google's Dremel technology. Transient compute resources operate on a shared pool of data only for as long as queries need to execute. This solution thus consistently follows the pay-as-you-go approach of serverless cloud offerings.


Snowflake

Snowflake, released in 2014, is a data warehouse that was designed from the start to run in the cloud. In contrast to the other three solutions, which each operate within their respective cloud platform, Snowflake can be hosted in AWS, Azure, or Google Cloud environments. It follows a hybrid shared-data / shared-nothing approach, in which serverless compute nodes load data from a shared data pool and operate on it locally.

Scalability

Being cloud solutions, the data warehouse technologies discussed here all embrace scalability: each solution can be scaled in and out flexibly. In BigQuery, resources are provisioned dynamically based on demand, without any need for reconfiguration. In Snowflake, the user chooses from t-shirt sizes (XS, S, M, L, XL, …); a change in warehouse size takes effect at the start of the next query. In addition, an autoscale feature can spawn additional compute clusters of the same size to handle high concurrency.
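A quick sketch of how Snowflake's t-shirt sizes translate into consumption: each size step doubles the credits consumed per hour (per Snowflake's published warehouse sizes; treat the rates as illustrative, since they may change), and the autoscale feature multiplies that by the number of same-size clusters running.

```python
# Credits per hour double with each t-shirt size step (illustrative rates).
SIZES = ["XS", "S", "M", "L", "XL"]
credits_per_hour = {size: 2 ** i for i, size in enumerate(SIZES)}  # XS=1 … XL=16

def estimated_credits(size: str, runtime_hours: float, clusters: int = 1) -> float:
    """Estimate credit consumption; `clusters` models the autoscale feature
    spawning additional same-size clusters under high concurrency."""
    return credits_per_hour[size] * runtime_hours * clusters

print(estimated_credits("M", runtime_hours=2.0))              # 8.0
print(estimated_credits("M", runtime_hours=2.0, clusters=3))  # 24.0
```

This doubling is why picking a size one step too large has a very tangible effect on the bill.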

In the cases of Synapse Analytics and Redshift, the user needs to increase (or decrease) the number of resources associated with a permanently running cluster. While the association or dissociation of nodes is managed by the cloud provider, the process usually takes a few minutes, and very short downtimes may occur. Additionally, the shared-nothing architecture of AWS Redshift requires compute and storage to always be scaled together. The other solutions allow these to be decoupled, with storage scaled automatically to the data volume used by the data warehouse. Redshift introduced the RA3 node type to offer this feature as well; however, this node type is probably oversized for most use cases.

Maintainability

Snowflake and BigQuery follow a serverless strategy, which means there is no architecture to configure or maintain. Apart from the choice of t-shirt size in Snowflake, both solutions can be used out of the box. Synapse Analytics and Redshift, on the other hand, require the configuration and management of cluster sizes. Furthermore, due to its shared-nothing architecture, Redshift's performance is sensitive to how data is distributed between the nodes, so the user is well advised to keep an eye on this distribution to maintain query performance.

Pricing Models

The pricing models are perhaps where the four solutions differ the most. First, let us consider compute resources. Google BigQuery strictly follows the pay-as-you-go model and charges by the data volume processed by the engine. Snowflake's pricing model is based on the runtime of the transient compute resources provisioned to run the queries. On top of that, cloud services costs, associated with background processes like automations or caches, are charged only to the extent that they exceed 10% of the daily compute charges.
Synapse Analytics and Redshift charge per cluster size and uptime. Synapse Analytics uses the concept of "Data Warehouse Units" to bundle CPU and RAM resources, while Redshift uses the more transparent but more coarse-grained concept of virtual machines.
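The practical difference between these compute pricing schemes is easiest to see with a back-of-the-envelope calculation. All rates below (dollars per TB scanned, per credit, per node-hour) are assumptions for illustration, not current list prices:

```python
# Back-of-the-envelope compute cost models for the three pricing schemes.
# All rates are illustrative assumptions, not actual list prices.

def bigquery_cost(tb_scanned: float, usd_per_tb: float = 5.0) -> float:
    # Charged per data volume processed, regardless of runtime.
    return tb_scanned * usd_per_tb

def snowflake_cost(credits_per_hour: float, runtime_hours: float,
                   usd_per_credit: float = 3.0) -> float:
    # Charged for the runtime of the provisioned virtual warehouse.
    return credits_per_hour * runtime_hours * usd_per_credit

def cluster_cost(usd_per_node_hour: float, nodes: int, uptime_hours: float) -> float:
    # Synapse/Redshift style: the cluster bills for its whole uptime,
    # whether or not queries are currently running.
    return usd_per_node_hour * nodes * uptime_hours

# A daily reporting job that scans 1 TB and runs for 2 hours:
print(bigquery_cost(1.0))          # 5.0  — only the scanned volume counts
print(snowflake_cost(4, 2))        # 24.0 — only the 2 h runtime counts
print(cluster_cost(0.25, 4, 24))   # 24.0 — billed for the full 24 h uptime
```

The third model only pays off when the cluster is kept busy for most of its uptime, which is exactly the trade-off the conclusion below turns on.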

In terms of storage costs, BigQuery, Synapse Analytics, Snowflake, and Redshift's RA3 nodes charge only for the data volume actually used by the customer. In BigQuery and Redshift RA3, the price follows that of the respective cloud object store (Google Cloud Storage and AWS S3) and is usually a little short of $25 per TB per month. Snowflake charges almost twice that price for on-demand storage; however, the price drops to that of the underlying cloud object storage if a long-term commitment is made. Synapse Analytics uses a shared-disk concept to build its data pool, which increases the price for storage dramatically (by a factor of about 5 compared to the other technologies). Traditional Redshift nodes follow a similar concept, using HDDs or SSDs, which are 2 to 4 times more expensive than the S3 storage solution.
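Putting the relative storage factors from the paragraph above into numbers (the $25/TB/month object-store baseline and the multipliers are the rough figures from the text, not exact current prices):

```python
# Rough monthly storage cost comparison, using the relative factors
# described in the text; all figures are approximations.
OBJECT_STORE_USD_PER_TB = 25.0

monthly_usd_per_tb = {
    "BigQuery / Redshift RA3":  OBJECT_STORE_USD_PER_TB,       # object-store price
    "Snowflake (on-demand)":    OBJECT_STORE_USD_PER_TB * 2,   # roughly twice
    "Snowflake (committed)":    OBJECT_STORE_USD_PER_TB,       # drops to baseline
    "Synapse Analytics":        OBJECT_STORE_USD_PER_TB * 5,   # factor of ~5
}

data_tb = 10
for service, per_tb in monthly_usd_per_tb.items():
    print(f"{service}: ${per_tb * data_tb:,.0f}/month for {data_tb} TB")
```

At 10 TB the spread is already between $250 and $1,250 per month, so the storage model deserves attention even though compute usually dominates the bill.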

Conclusion

In conclusion, the choice of a new cloud data warehouse should be guided by a few general design principles. Do you want a more traditional architecture with permanently running clusters because ETL and reporting queries will run constantly? Then AWS Redshift or Azure Synapse Analytics, with their constantly running but cheaper infrastructure, will be a good option. Do you use your data warehouse only a few hours a day? Then serverless data warehouses like BigQuery or Snowflake might be the better choice. Do you expect rapid data growth in your data warehouse? Then consider data warehouses that scale flexibly with your usage, like Synapse Analytics, Snowflake, or BigQuery.

If this high-level comparison has not yet fully laid the foundation for your decision, watch out for our in-depth comparison of Redshift and Snowflake coming soon, or take a look at our Cloud Data Warehouse Seminar.