Over the last few years, there have been various approaches to incorporate Azure Data Factory which is termed as a fully managed serverless data integration service with Snowflake. However, they have all had their share of limitations and challenges. In this blog, we’ll walk through how we can ingest data from ADLS Gen2 to Snowflake using the Azure Data factory. Also, we will outline the native support using Azure Data Factory that can help integrate data between ADLS Gen2 and Snowflake.
Azure Data Factory:
- It is a cloud service for implementing ETL, ELT, and integration operations
- We can create pipelines, and data flows and ingest data from different data sources
- We can perform transformations on the data using data flows
- We have 6 main concepts in Azure Data Factory
- Pipelines: It is a stage where a group of activities performs one unit of work
We can have more than one pipeline
Ex: Ingesting data from azure blob(A1)
Making partitions using hive query(A2)
Loading the data into a data store like Synapse (A3)
- Activities: They are the single processing step in a pipeline
Example: copy activity-copy data from source to destination
- Datasets: Structure or view of the data performing.
- Linked services: Linked services are the connection strings and compute resources to transform
- Dataflows: Mapping Data Flows allows us to perform data row transformations as our data is being processed, from source to target.
- Integration runtimes: In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a designated computer resource or environment where you can run training scripts or host a service deployment. An integration runtime provides the bridge between the activity and linked Services
Ingestion of data from Azure Data Lake Gen2 to Snowflake via Azure Data Factory
- Azure Account Subscription
- Azure Data lake Storage Gen2
- Azure Data Factory
- Azure Blob Storage Account
- Snowflake Account
NOTE: Make sure that the cloud provider is Microsoft Azure while creating a Snowflake account and has same region as that of Azure resources
Source: Azure Datalakes Storage Gen2
ETL Tool: AZURE Datafactory
Steps to be followed:
- Create a storage account (ADLS GEN2)
- Create a container in the storage account
- Create Azure Data Factory
- Create a warehouse in Snowflake
- Create a pipeline
Steps to create a storage account (ADLS Gen2):
- Log in to the Azure portal: https://portal.azure.com/
- Under Azure services, click on Create a resource
- Search the storage account and click on the Create option
- We’ll be redirected to the page where we are supposed to enter the project and instance details
- Select the subscription in which to create the storage account
- Click on Create New under the Resource group field
- Enter a name for the new resource group
- Once the details are entered as per the project requirement, click on the option Review, and create
- Make sure that the hierarchical namespace field is enabled in the advanced tab while creating Azure Data Lake Gen 2
- The option to enable a hierarchical namespace will be greyed out after the creation of the storage account as it cannot be modified post-provisioning.
Steps to create a container in the storage account:
- Log in to the Azure portal
- Open the Storage account that you have created.
- Click on the storage browser and select blob container.
- Then click Add a Container.
- Provide the name of the container and access level as required and then click Create.* The creation of a blob storage account is way more similar to that of Azure Data Lake Gen2.* The only difference is that the blob storage has a flat namespace whereas the Azure Data Lake gen2 has a hierarchal namespace.
Steps to create Azure Data Factory:
- Log in to Azure Portal.
- Select Create a Resource under Azure Services.
- Search the data factory in the marketplace.
- Click on Create and enter the project and instance details.
- Click on the option create once the validation is successful.
- After the creation is completed, select Go to Resource and navigate to the Data Factory page.
- Select Launch Studio to open Azure Data Factory Studio and start the Azure Data Factory user interface (UI) application.
- The studio shall be opened on a separate browser tab.
Steps to create a warehouse in Snowflake account:
- Login to your Snowflake account
- Create a schema, database, and virtual warehouse.
- Execute the below query to create an x-large warehouse in Snowflake where compute_wh is the warehouse name.
- create or replace warehouse compute_wh with warehouse_size=’X-SMALL
Create a pipeline:
- Once we landed on the home page of the Data factory, we can see multiple options either to ingest, orchestrate or transform the data based on the user’s requirement.
- To perform the ingestion of data, we need to create
* A linked service
* A pipeline
* A data set.
- Linked service for Azure Data Lake Gen 2:
* Under the Manage tab, click on the linked services, and click New to create one.
* Since our source is Azure Data Lake Gen 2 and the destination is Snowflake, we shall create a linked service for both Azure Data Lake Gen2 and Snowflake.
* Firstly, let’s look at the creation of Azure Data Lake Gen 2 Linked Service.
* Search the required resource in the data store and click on continue.
* Enter the Name of the linked service and select the authentication type and storage account name either enter manually or from an Azure subscription.
* Test the connection and once it is successful, click Create.
- Linked service for Snowflake:
* Now search for Snowflake in Data Store.
* Provide the account name, and username and select the kind of authentication.
* Give the database and warehouse name which is created in the snowflake initially.* Test the connection and once it is validated, click on the Create option.
* We have successfully created two linked services i.e., Azure Data Lake Gen2 and Snowflake.
- Let us now upload a CSV file in the container that we created in the Azure Data Lake Gen2 storage account.
* We have uploaded the username file to the storage account of the file in CSV format.
* When we look at the snowflake instance, we created a new database and a new schema and it consists of a table definition with a username details table.
* We can see that there is no data in the table as shown in the data preview field.
- PIPELINE TO LOAD DATA FROM ADLS GEN2 TO SNOWFLAKE:
* To ingest data from ADLS GEN2 to Snowflake, we should create a pipeline.* Navigate to the author tab and click on add a symbol, we can then view the pipeline option, and click on it.* As soon as we click on the pipeline option, we’ll be redirected format where we can see the list of activities.* As our requirement is to move data, click on the move and transform activity.* It displays two options.
– Copy data
– Data flow
PIPELINE- MOVE & TRANSFORM USING COPY DATA ACTIVITY:
Set the name of the pipeline and move to the field Source. We need to create a source dataset i.e., ADLS GEN2.
CREATION OF DATASET FOR ADLS GEN2:
Click on the New button and select the dataset as ADLS GEN2. Then select the file format as delimited text (in this case) and click continue.
Set the properties of the dataset as follows, choose the linked service which we have created in the above procedure, and provide the file path of the uploaded file. Based on the data format, one can check the first row as the header box and click OK
The source is successfully defined.
CREATION OF DATASET FOR SNOWFLAKE:
Now move to the field sink and create a dataset for Snowflake. Set the properties of the dataset, and provide the table name and schema name in which the data should be ingested. Then click ok.
The pipeline to copy data from ADLS GEN2 to SNOWFLAKE is created. Once we debug the pipeline, we face an error stating that we need to enable staging to move data from ADLS GEN2 to SNOWFLAKE
The pipeline validation output is “Direct copying data to Snowflake is only supported when source dataset is Delimited Text, Parquet, JSON with Azure Blob Storage or Amazon S3 linked service as a source, for other dataset or linked service, please enable staging.”
Hence, we need to enable staging in the settings field. Once the staging checkbox is enabled, we need to create a staging account-linked service. The staging account-linked service can be either ADLS GEN2 or Blob Storage. If the staging account linked service is ADLS GEN2 then there is no constraint for the authentication type
- If we have created a linked service of type blob storage and it must be only authenticated with SAS URI, otherwise it shall fail. We can get the SAS URI from the Shared Access Signature field in the blob storage account.
- Allow all the resource types as per the requirement and then we’ll get to see an option to generate a SAS connection string when we scroll down.
- Copy the Blob Service SAS URL and set it in the linked service creation and test the connection.
- Click Create once the test connection is successful and publish all the changes. Now trigger the pipeline and wait for the pipeline to execute.
- We can see that the data is loaded in the snowflake table.
PIPELINE-MOVE & TRANSFORM USING DATA FLOW:
We can do this using Dataflow also. In dataflow, we can do Upsert or Insert. In copy data activity it will always insert data, even if the data is present in the table it will insert when the destination is Snowflake. Whereas in Dataflow we can handle it by doing an Upsert by specifying the key column. Hence, using Dataflow would be more efficient for loading data from ADLS GEN 2 to Snowflake.
- While creating the pipeline, drag the data flow to the workspace.
- Create a new data flow by hovering over the settings tab. Click on add source and add the data set as ADLS GEN2. Select the dataset which we have already created (ADLS GEN2). Add sink as snowflake dataset. We’ll get the flow as shown below.
We have processed format, row modifiers, and schema modifiers to perform the required transformations from source to destination.
In this case, alter row modifier: Upsert if is used which will update an existing row if a specified value already exists in a table, and insert a new row if the specified value doesn’t already exist.
Debug the data flow and execute the pipeline.
Once I have updated the data in the existing file in ADLS GEN2, the data will be reflected in the snowflake table as an update when we debug the data flow. In this case, I have altered the Last name of ‘Grey07’ from ‘Grey’ to ‘Zamdi’
With new script activity using Azure Data Factory, Snowflake’s new schema detection capabilities can create an Azure Data Factory pipeline that can effectively create the target table in Snowflake while loading data. Overall seamless migration to Snowflake from ADLS Gen 2 can create end-to-end pipelines, execute data manipulation language, provide data-defined language, and perform data stored procedures. We successfully completed loading data to SNOWFLAKE from ADLS GEN 2 using copy data activity and Data flows. These two are the 2 ways we can manually load data. Therefore, these pipelines can be triggered automatically using Triggers concepts in Azure Data Factory.