As a modern, cloud-based data warehouse, Snowflake delivers an integrated, extensible platform with parallel data processing, near-effortless scaling and full SQL support. More and more businesses are deploying or migrating their databases to the cloud as a long-term strategy to maximize the value of their data assets. The actual migration process, however, can be challenging.
Typically, the process of loading data into Snowflake starts by exporting data from homogeneous or heterogeneous sources with specialized tooling or ELT/ETL software. The data is placed in an external (S3, Azure Blob, GCS) or internal (Snowflake) stage, from where it is ingested into a Snowflake table via a SQL command. While compression and encryption are handled automatically by Snowflake during ingestion, the user must handle them in the preceding steps.
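As a sketch of the ingestion step, the following Python snippet assembles a Snowflake `COPY INTO` statement from a table, stage and file-format name. The object names in the example are purely illustrative assumptions:

```python
def build_copy_into(table: str, stage: str, file_format: str = "CSV") -> str:
    """Assemble a Snowflake COPY INTO statement for loading staged files.

    Decompression and decryption of staged files are handled by Snowflake
    during ingestion; the caller only names the objects involved.
    """
    return (
        f"COPY INTO {table} "
        f"FROM @{stage} "
        f"FILE_FORMAT = (TYPE = {file_format})"
    )

# Hypothetical example: load Parquet files from an external S3 stage
# into a landing table.
sql = build_copy_into("landing.sales_orders", "my_s3_stage", "PARQUET")
print(sql)
```

In practice such a statement would be executed through the Snowflake connector or the ELT tool rather than printed.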
Many businesses plan to migrate from an on-premises SQL Server solution to Snowflake, a move that can affect hundreds to thousands of tables and numerous data sources (e.g. Salesforce, SAP). In-house solutions for this process typically introduce automation scripts that leverage metadata and implement complex transformations, but every new structure in the underlying data must then be hand-coded. Modern ETL and ELT tools offer encompassing solutions for transformation and orchestration, but they are often costly, and automation is not readily available out of the box.
A metadata driven approach, for example by integrating with Matillion, can drastically improve the time to value. When automated data orchestration pipelines are required, the priorities can be summarized with the following points:
- Fast track to automation for newly introduced data sources and tables by configuring metadata in the existing framework
- Usage of generic, readily available connectors
- Data is exported from the source, staged in a cloud object store, and then ingested as a table into Snowflake (landing)
- From the landing inside Snowflake, the data is transformed to be available in a Data Lake or Operational Data Store
- As an option, this process can support the validation of data quality rules, change data capture and slowly changing dimensions
When a single Snowflake account is used to handle multiple environments such as Development, Quality Assurance and Production, care must be taken when setting up users, roles, warehouses and databases to avoid collisions and the escalation of problems. Global roles (administrators) should be used sparingly, while the individual environments are worked on, managed and maintained by specialized roles. New users should only be assigned specific roles and permissions, and data loading jobs should be restricted to a dedicated user.
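One way to keep environments from colliding within a single account is a strict naming convention enforced in code. The helper below sketches this idea; the DEV/QA/PROD prefix scheme and the specific grants are assumptions for illustration, not a Snowflake requirement:

```python
def scoped_name(env: str, obj: str) -> str:
    """Prefix a database/role/warehouse name with its environment.

    Rejecting unknown environments up front prevents objects from
    being created outside the agreed DEV/QA/PROD scheme (assumed here).
    """
    allowed = {"DEV", "QA", "PROD"}
    env = env.upper()
    if env not in allowed:
        raise ValueError(f"unknown environment: {env}")
    return f"{env}_{obj.upper()}"

def grant_statements(env: str, database: str, role: str) -> list[str]:
    """Generate GRANT statements restricting a role to one environment's database."""
    db = scoped_name(env, database)
    r = scoped_name(env, role)
    return [
        f"GRANT USAGE ON DATABASE {db} TO ROLE {r}",
        f"GRANT USAGE ON ALL SCHEMAS IN DATABASE {db} TO ROLE {r}",
    ]
```

Because the role name carries the same environment prefix as the database, a QA role can never silently receive grants on a production object.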
In a data lake, which serves as a repository for data in its raw format, the structure of the source system usually remains unchanged. Here, a metadata-driven approach offers a robust, modular solution with fast development and implementation of changes, and the resulting ease of maintenance can prevent deterioration into a data swamp. This is achieved with a specialized metadata table that keeps track of the following information:
- Source system designator
- The name of the source object
- A source prefix for easy identification
- The names of the landing database, schema and table
- A delta identifier column if the source delivers incremental data
- Two columns holding the minimum and maximum date for the delta identifier, respectively
- A priority tag if necessary
- Additional WHERE clauses applied to the source
- Information relevant for audit processes (e.g. creation and update date)
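The metadata table above can be modelled as a simple record type. The following Python sketch uses hypothetical field names and shows how a source extraction query could be derived from one row; none of the names are prescribed by Snowflake or Matillion:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetadataRow:
    """One row of the (hypothetical) metadata table driving the load."""
    source_system: str                   # source system designator
    source_object: str                   # name of the source object
    source_prefix: str                   # prefix for easy identification
    landing_database: str
    landing_schema: str
    landing_table: str
    delta_column: Optional[str] = None   # delta identifier for incremental loads
    delta_min: Optional[str] = None      # lower bound of the delta window
    delta_max: Optional[str] = None      # upper bound of the delta window
    priority: int = 0                    # priority tag if necessary
    extra_where: Optional[str] = None    # additional WHERE clause on the source

def build_extract_query(row: MetadataRow) -> str:
    """Derive the source extraction query from a metadata row."""
    query = f"SELECT * FROM {row.source_object}"
    conditions = []
    if row.delta_column and row.delta_min and row.delta_max:
        conditions.append(
            f"{row.delta_column} BETWEEN '{row.delta_min}' AND '{row.delta_max}'"
        )
    if row.extra_where:
        conditions.append(row.extra_where)
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query
```

A new table or source is then onboarded by inserting one such row into the metadata table, with no new code required.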
The main job for this process queries the metadata either concurrently or sequentially and loads the data from source to landing, e.g. from Salesforce, SQL Server or another DBMS to S3 or Azure Blob storage, in a format such as Avro or Parquet. A second job loads the data from the landing location into the data lake. The parameters for both loads are derived from the metadata table, and at every step an audit/error-notification framework is called. In tests, transferring roughly one billion records with a type 2 slowly changing dimension from landing to data lake took only slightly more than one hour.
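The sequential variant of the main job can be sketched as a loop over metadata rows that reports every outcome to an audit hook. All callables and row fields below are illustrative stand-ins, not the actual framework:

```python
def run_loads(metadata_rows, load_fn, audit_fn):
    """Process metadata rows sequentially in priority order, reporting each
    outcome to an audit/error-notification hook (names are illustrative)."""
    for row in sorted(metadata_rows, key=lambda r: r["priority"]):
        try:
            load_fn(row)
            audit_fn(row["source_object"], "SUCCESS", None)
        except Exception as exc:
            # A failed load is recorded but does not abort the remaining loads.
            audit_fn(row["source_object"], "ERROR", str(exc))

# Usage with stand-in callables: lower priority numbers load first.
log = []
rows = [
    {"source_object": "Account", "priority": 2},
    {"source_object": "Orders", "priority": 1},
]
run_loads(rows, load_fn=lambda r: None,
          audit_fn=lambda obj, status, msg: log.append((obj, status)))
```

A concurrent variant would submit the same per-row work to a thread pool instead of a plain loop.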
To summarize, the judicious use of an ELT/ETL tool like Matillion or Informatica Cloud can simplify and speed up the process of moving to Snowflake, with additional advantages where slowly changing dimensions are required.