data loading Snowflake MatillionAm veröffentlicht

Data Loading into Snowflake powered by Metadata

As a modern, cloud-centred data warehouse, Snowflake delivers an integrated, extensible platform with parallel data processing, effortless scaling and full SQL support. More and more businesses are deploying or migrating their databases to the cloud, enabling a long-term strategy to maximize their data assets. The actual process of migrating to the cloud can be challenging.

Typically, the process of loading data into Snowflake starts by exporting data from homogenous or heterogeneous sources with specialized software or ELT/ETL software. The data is placed in an external (S3, Azure Blob, GCS) or internal (Snowflake) stage, from where it is ingested into a Snowflake table via a SQL command. While compression and encryption is automatically handled by Snowflake during ingestion, this needs to be handled by the user in the previous steps.

Many businesses plan to migrate from a SQL Server on premise solution to Snowflake, potentially affecting hundreds to thousands of tables and numerous data sources (e.g. salesforce, SAP). In-house solutions for this process might introduce automation scripts that leverage the metadata and complex transformations. New structures in the underlying data need to be hand coded for this approach. Modern tools for ETL and ELT processes offer encompassing solutions for transformations and orchestration but are often costly and automation is not readily available.

A metadata driven approach, for example by integrating with Matillion, can drastically improve the time to value. When automated data orchestration pipelines are required, the priorities can be summarized with the following points:

  • Fast track to automation for newly introduced data sources and tables by configuring meta data in the existing framework
  • Usage of generic, readily available connectors
  • Data is exported from the source, staged in a cloud object store…
  • …and then ingested as a table into Snowflake (landing)
  • From the landing inside Snowflake, the data is transformed to be available in a Data Lake or Operational Data Store
  • As an option, this process can support the validation of data quality rules, change data capture and slowly changing dimensions

When a single Snowflake account is used to handle multiple environments such as Development, Production, Quality Assurance etc., care needs to be applied when setting up users, roles, warehouses and databases to avoid collisions and the escalation of problems. Global roles (Administrators) should only be used sparingly, while the specific environments are worked on, managed and maintained by specialized roles. New users should only be assigned specific roles and permissions, while data loading jobs are restricted to a particular user.

In a data lake, which is seen as a repository for raw format data, the structure of the source system usually remains unchanged. Here, a metadata-driven approach to a data lake offers a robust, modular solution with fast development and implementation of changes. The ease of maintenance can prevent the deterioration into a data swamp. This is achieved with a specialized metadata table that  keeps track of the following information:

  • Source system designator
  • The name of the source object
  • A source prefix for easy identification
  • The name of the landing database
  • Identification of the landing schema
  • The names of the landing database, schema and table
  • A delta identifier column if the source delivers incremental data
  • 2 columns for the minimum and maximum date for the delta identifier, respectively
  • A priority tag if necessary
  • Additional WHERE clauses applied to the source
  • Information relevant for audit processes (e.g. creation and update date)

The main job for this process queries the metadata either concurrently or sequentially and then loads the data from source to landing, e.g. from Salesforce, SQL Server or another DBMS to S3 or an Azure Blob storage, in the form of AVRO, Parquet etc. A second job loads the data from the landing location to the data lake. The parameters for the loading are derived from the metadata table, while at every step, an audit/error notification framework is called. The transfer of around a billion records with a slowly changing dimension of type 2 from landing to data lake takes only slightly above one hour in tests.

To summarize, the clever usage of an ELT/ETL tool like Matillion or Informatica Cloud can simplify and speed up the process of moving to Snowflake. This can offer additional advantages if there is a requirement for slowly changing dimensions