---
title: Build a data pipeline by using Azure Pipelines
description: Learn how to use an Azure CI/CD data pipeline to ingest, process, and share data.
ms.author: jukullam
author: JuliaKM
ms.subservice: azure-devops-pipelines-apps
ms.custom: devx-track-azurecli, devx-track-arm-template, arm2024
ms.date: 10/30/2024
ms.topic: how-to
monikerRange: '=azure-devops'
---
[!INCLUDE version-eq-azure-devops]
Get started building a data pipeline with data ingestion, data transformation, and model training.
Learn how to grab data from a CSV (comma-separated values) file and save the data to Azure Blob Storage. Transform the data and save it to a staging area. Then train a machine learning model by using the transformed data. Write the model to blob storage as a Python pickle file.
Before you begin, you need:
- An Azure account that has an active subscription. Create an account for free.
- An active Azure DevOps organization. Sign up for Azure Pipelines.
- The Administrator role for service connections in your Azure DevOps project. Learn how to add the Administrator role.
- Data from sample.csv.
- Access to the data pipeline solution in GitHub.
- DevOps for Azure Databricks.
- Sign in to the Azure portal.
- From the menu, select the Cloud Shell button. When you're prompted, select the Bash experience.
:::image type="content" source="media/azure-portal-menu-cloud-shell.png" alt-text="Screenshot showing where to select Cloud Shell from the menu.":::
> [!NOTE]
> You'll need an Azure Storage resource to persist any files that you create in Azure Cloud Shell. When you first open Cloud Shell, you're prompted to create a resource group, storage account, and Azure Files share. This setup is automatically used for all future Cloud Shell sessions.
A region is one or more Azure datacenters within a geographic location. East US, West US, and North Europe are examples of regions. Every Azure resource, including an App Service instance, is assigned a region.
To make commands easier to run, start by selecting a default region. After you specify the default region, later commands use that region unless you specify a different region.
- In Cloud Shell, run the following `az account list-locations` command to list the regions that are available from your Azure subscription.

  ```azurecli
  az account list-locations \
    --query "[].{Name: name, DisplayName: displayName}" \
    --output table
  ```
- From the Name column in the output, choose a region that's close to you. For example, choose `asiapacific` or `westus2`.
- Run `az config` to set your default region. In the following example, replace `<REGION>` with the name of the region you chose.

  ```azurecli
  az config set defaults.location=<REGION>
  ```

  The following example sets `westus2` as the default region.

  ```azurecli
  az config set defaults.location=westus2
  ```
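Optionally, you can confirm the default that later commands will use. This check isn't part of the required steps; `az config get` simply reads back the value you just set.

```azurecli
# Read back the default location that later commands will use.
az config get defaults.location
```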
- In Cloud Shell, generate a random number. You'll use this number to create globally unique names for certain services in the next step.

  ```bash
  resourceSuffix=$RANDOM
  ```
- Create globally unique names for your storage account and key vault. The following commands use double quotation marks, which instruct Bash to interpolate the variables by using the inline syntax.

  ```bash
  storageName="datacicd${resourceSuffix}"
  keyVault="keyvault${resourceSuffix}"
  ```
- Create one more Bash variable to store the name and the region of your resource group. In the following example, replace `<REGION>` with the region that you chose for the default region.

  ```bash
  rgName='data-pipeline-cicd-rg'
  region='<REGION>'
  ```
- Create variable names for your Azure Data Factory and Azure Databricks instances.

  ```bash
  datafactorydev='data-factory-cicd-dev'
  datafactorytest='data-factory-cicd-test'
  databricksname='databricks-cicd-ws'
  ```
- Run the following `az group create` command to create a resource group by using `rgName`.

  ```azurecli
  az group create --name $rgName
  ```
- Run the following `az storage account create` command to create a new storage account.

  ```azurecli
  az storage account create \
    --name $storageName \
    --resource-group $rgName \
    --sku Standard_RAGRS \
    --kind StorageV2
  ```
- Run the following `az storage container create` command to create two containers, `rawdata` and `prepareddata`.

  ```azurecli
  az storage container create -n rawdata --account-name $storageName
  az storage container create -n prepareddata --account-name $storageName
  ```
- Run the following `az keyvault create` command to create a new key vault.

  ```azurecli
  az keyvault create \
    --name $keyVault \
    --resource-group $rgName
  ```
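Before you move on, you can optionally confirm that the storage account, containers, and key vault were created. This check isn't required; `az resource list` just gives a quick summary of what now exists in the resource group.

```azurecli
# List the resources created so far in the resource group.
az resource list --resource-group $rgName --output table
```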
- Create a new data factory by using the portal UI or Azure CLI:

  - Name: `data-factory-cicd-dev`
  - Version: `V2`
  - Resource group: `data-pipeline-cicd-rg`
  - Location: Your closest location
  - Clear the selection for Enable Git.
- Add the Azure Data Factory extension.

  ```azurecli
  az extension add --name datafactory
  ```
- Run the following `az datafactory create` command to create a new data factory.

  ```azurecli
  az datafactory create \
    --name data-factory-cicd-dev \
    --resource-group $rgName
  ```
- Copy the subscription ID. Your data factory uses this ID later.
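If you're working in Cloud Shell, one way to look up the subscription ID is with `az account show`. This lookup is optional; you can also copy the ID from the Azure portal.

```azurecli
# Print the ID of the subscription you're currently signed in to.
az account show --query id --output tsv
```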
- Create a second data factory by using the portal UI or the Azure CLI. You use this data factory for testing.

  - Name: `data-factory-cicd-test`
  - Version: `V2`
  - Resource group: `data-pipeline-cicd-rg`
  - Location: Your closest location
  - Clear the selection for Enable Git.
- Run the following `az datafactory create` command to create a new data factory for testing.

  ```azurecli
  az datafactory create \
    --name data-factory-cicd-test \
    --resource-group $rgName
  ```
- Copy the subscription ID. Your data factory uses this ID later.
- Add a new Azure Databricks service:

  - Resource group: `data-pipeline-cicd-rg`
  - Workspace name: `databricks-cicd-ws`
  - Location: Your closest location
- Add the Azure Databricks extension if it's not already installed.

  ```azurecli
  az extension add --name databricks
  ```
- Run the following `az databricks workspace create` command to create a new workspace. (An optional variation that uses the default region you set earlier appears after this list.)

  ```azurecli
  az databricks workspace create \
    --resource-group $rgName \
    --name databricks-cicd-ws \
    --location eastus2 \
    --sku trial
  ```
- Copy the subscription ID. Your Databricks service uses this ID later.
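The workspace command above pins the location to `eastus2` with the trial SKU. If you'd rather create the workspace in the default region you chose earlier, a variation like the following should also work; substituting `$region` and `$databricksname` is an optional adjustment, not part of the original steps.

```azurecli
# Optional variation: reuse the region and workspace-name variables set earlier.
az databricks workspace create \
  --resource-group $rgName \
  --name $databricksname \
  --location $region \
  --sku trial
```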
- In the Azure portal, open your storage account in the `data-pipeline-cicd-rg` resource group.
- Go to Blob Service > Containers.
- Open the `prepareddata` container.
- Upload the sample.csv file.
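If you prefer to stay in Cloud Shell, a roughly equivalent upload with the Azure CLI looks like the sketch below. It assumes you downloaded sample.csv to your current directory; the portal steps above are the documented route.

```azurecli
# Upload sample.csv to the prepareddata container.
# Assumes sample.csv is in the current directory and $storageName is still set.
az storage blob upload \
  --account-name $storageName \
  --container-name prepareddata \
  --name sample.csv \
  --file sample.csv \
  --auth-mode key
```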
You use Azure Key Vault to store all connection information for your Azure services.
- In the Azure portal, go to Databricks and then open your workspace.
- In the Azure Databricks UI, create and copy a personal access token.
- Go to your storage account.
- Open Access keys.
- Copy the first key and connection string. (If you prefer the CLI, the sketch after this list shows how to retrieve these values.)
- Create three secrets:

  - databricks-token: `your-databricks-pat`
  - StorageKey: `your-storage-key`
  - StorageConnectString: `your-storage-connection`
- Run the following `az keyvault secret set` command to add secrets to your key vault.

  ```azurecli
  az keyvault secret set --vault-name "$keyVault" --name "databricks-token" --value "your-databricks-pat"
  az keyvault secret set --vault-name "$keyVault" --name "StorageKey" --value "your-storage-key"
  az keyvault secret set --vault-name "$keyVault" --name "StorageConnectString" --value "your-storage-connection"
  ```
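If you'd rather not copy the storage key and connection string from the portal, the commands below retrieve them in Cloud Shell and pass them straight to `az keyvault secret set`. Treat this as an optional shortcut; the Databricks personal access token still has to come from the Databricks UI.

```azurecli
# Retrieve the first storage key and the connection string, then store them as secrets.
storageKey=$(az storage account keys list \
  --account-name $storageName \
  --resource-group $rgName \
  --query "[0].value" --output tsv)

storageConnectString=$(az storage account show-connection-string \
  --name $storageName \
  --resource-group $rgName \
  --query connectionString --output tsv)

az keyvault secret set --vault-name "$keyVault" --name "StorageKey" --value "$storageKey"
az keyvault secret set --vault-name "$keyVault" --name "StorageConnectString" --value "$storageConnectString"
```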
- Sign in to your Azure DevOps organization and then go to your project.
- Go to Repos and then import your forked version of the GitHub repository. For more information, see Import a Git repo into your project.
- Create an Azure Resource Manager service connection.
- Select App registration (automatic) and Workload identity federation.
- Select your subscription.
- Choose the data-pipeline-cicd-rg resource group.
- Name the service connection `azure_rm_connection`.
- Select Grant access permission to all pipelines. You need to have the Service Connections Administrator role to select this option.
- Create a new variable group named `datapipeline-vg`.
- Add the Azure DevOps extension if it isn't already installed.

  ```azurecli
  az extension add --name azure-devops
  ```

- Sign in to your Azure DevOps organization.

  ```azurecli
  az devops login --org https://dev.azure.com/<yourorganizationname>
  ```

- Run the following `az pipelines variable-group create` command to create the variable group.

  ```azurecli
  az pipelines variable-group create --name datapipeline-vg -p <yourazuredevopsprojectname> --variables \
      "LOCATION=$region" \
      "RESOURCE_GROUP=$rgName" \
      "DATA_FACTORY_NAME=$datafactorydev" \
      "DATA_FACTORY_DEV_NAME=$datafactorydev" \
      "DATA_FACTORY_TEST_NAME=$datafactorytest" \
      "ADF_PIPELINE_NAME=DataPipeline" \
      "DATABRICKS_NAME=$databricksname" \
      "AZURE_RM_CONNECTION=azure_rm_connection" \
      "DATABRICKS_URL=<URL copied from Databricks in Azure portal>" \
      "STORAGE_ACCOUNT_NAME=$storageName" \
      "STORAGE_CONTAINER_NAME=rawdata"
  ```
- Create a second variable group named `keys-vg`. This group pulls data variables from Key Vault.
- Select Link secrets from an Azure key vault as variables. For more information, see Link a variable group to secrets in Azure Key Vault.
- Authorize the Azure subscription.
- Choose all of the available secrets to add as variables (`databricks-token`, `StorageConnectString`, `StorageKey`).
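As an optional check, you can confirm that both variable groups exist from the CLI. The organization and project values are the same placeholders used earlier.

```azurecli
# List the variable groups in the project; datapipeline-vg and keys-vg should appear.
az pipelines variable-group list \
  --org https://dev.azure.com/<yourorganizationname> \
  -p <yourazuredevopsprojectname> \
  --output table
```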
Follow the steps in the next sections to set up Azure Databricks and Azure Data Factory.
- In the Azure portal, go to Key vault > Properties.
- Copy the DNS Name and Resource ID.
- In your Azure Databricks workspace, create a secret scope named `testscope`.
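The walkthrough assumes you create the Key Vault-backed scope in the Azure Databricks UI. If you have the legacy Databricks CLI configured for your workspace, a sketch of the equivalent command looks roughly like this; the exact flags depend on your CLI version, so treat it as an assumption rather than a required step.

```bash
# Create a Key Vault-backed secret scope named testscope.
# Replace the placeholders with the DNS Name and Resource ID copied from the key vault.
databricks secrets create-scope --scope testscope \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id <key-vault-resource-id> \
  --dns-name <key-vault-dns-name>
```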
- In the Azure Databricks workspace, go to Clusters.
- Select Create Cluster.
- Name and save your new cluster.
- Select your new cluster name.
- In the URL string, copy the content between `/clusters/` and `/configuration`. For example, in the string `clusters/0306-152107-daft561/configuration`, you would copy `0306-152107-daft561`.
- Save this string to use later.
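If the legacy Databricks CLI is configured, listing clusters is another way to find the cluster ID without parsing the URL. This is an optional alternative to the steps above.

```bash
# Print cluster IDs, names, and states for the workspace.
databricks clusters list
```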
- In Azure Data Factory, go to Author & Monitor. For more information, see Create a data factory.
- Select Set up code repository and then connect your repo.

  - Repository type: Azure DevOps Git
  - Azure DevOps organization: Your active account
  - Project name: Your Azure DevOps data pipeline project
  - Git repository name: Use existing.
    - Select the main branch for collaboration.
    - Set /azure-data-pipeline/factorydata as the root folder.
  - Branch to import resource into: Select Use existing and main.
- In the Azure portal UI, open the key vault.
- Select Access policies.
- Select Add Access Policy.
- For Configure from template, select Key & Secret Management.
- In Select principal, search for the name of your development data factory and add it.
- Select Add to add your access policies.
- Repeat these steps to add an access policy for the test data factory.
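If you prefer to script the access policies, `az keyvault set-policy` is the CLI counterpart to the portal steps above. The object IDs are placeholders for each data factory's managed identity, and this sketch grants only secret read permissions rather than the full Key & Secret Management template.

```azurecli
# Grant each data factory's managed identity permission to read secrets.
# Replace the placeholders with the managed identity object IDs of your dev and test factories.
az keyvault set-policy --name $keyVault \
  --object-id <dev-data-factory-principal-id> \
  --secret-permissions get list

az keyvault set-policy --name $keyVault \
  --object-id <test-data-factory-principal-id> \
  --secret-permissions get list
```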
- Go to Manage > Linked services.
- Update the Azure key vault to connect to your subscription.
- Go to Manage > Linked services.
- Update the Azure Blob Storage value to connect to your subscription.
- Go to Manage > Linked services.
- Update the Azure Databricks value to connect to your subscription.
- For the Existing Cluster ID, enter the cluster value you saved earlier.
- In Azure Data Factory, go to Edit.
- Open `DataPipeline`.
- Select Variables.
- Verify that the `storage_account_name` refers to your storage account in the Azure portal. Update the default value if necessary. Save your changes.
- Select Validate to verify `DataPipeline`.
- Select Publish to publish data-factory assets to the `adf_publish` branch of your repository.
Follow these steps to run the continuous integration and continuous delivery (CI/CD) pipeline:
- Go to the Pipelines page. Then choose the action to create a new pipeline.
- Select Azure Repos Git as the location of your source code.
- When the list of repositories appears, select your repository.
- As you set up your pipeline, select Existing Azure Pipelines YAML file. Choose the YAML file: /azure-data-pipeline/data_pipeline_ci_cd.yml.
- Run the pipeline. When running your pipeline for the first time, you might need to give permission to access a resource during the run.
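You can also queue the pipeline from the CLI instead of the portal. The pipeline name below is a placeholder; use whatever name you gave the pipeline when you created it.

```azurecli
# Queue a run of the CI/CD pipeline (replace the placeholders with your own values).
az pipelines run \
  --name <your-pipeline-name> \
  --org https://dev.azure.com/<yourorganizationname> \
  -p <yourazuredevopsprojectname>
```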
If you're not going to continue to use this application, delete your data pipeline by following these steps:
- Delete the `data-pipeline-cicd-rg` resource group.
- Delete your Azure DevOps project.
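If you scripted the setup in Cloud Shell, you can also remove the Azure resources from the CLI. Deleting the Azure DevOps project still happens in the Azure DevOps portal (or with `az devops project delete`, which needs the project ID).

```azurecli
# Delete the resource group and everything in it.
az group delete --name data-pipeline-cicd-rg --yes --no-wait
```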
> [!div class="nextstepaction"]
> Learn more about data in Azure Data Factory