The team was working on a data governance requirement that called for grouping assets of the same type into logical collections. They were using Azure Purview as the data governance tool, and moving assets manually was tedious, so the process needed to be automated. This blog describes the solution.
The team used PyApacheAtlas, a Python library that performs the most common Azure Purview operations programmatically, to automate moving assets into collections. The steps below show how to do the same.
Step-1: Establish a Connection
Establish a connection between PyApacheAtlas and Azure Purview using Azure CLI or service principal authentication.
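from pathlib import Path
from azure.identity import DefaultAzureCredential
from pyapacheatlas.core import PurviewClient

REFERENCE_NAME_PURVIEW = "<Purview Name>"  # replace with your Purview account name
PROJ_PATH = Path(__file__).resolve().parent
CREDS = DefaultAzureCredential()
CLIENT = PurviewClient(account_name=REFERENCE_NAME_PURVIEW, authentication=CREDS)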
DefaultAzureCredential: a default credential capable of handling most Azure SDK authentication scenarios. It provides a default TokenCredential authentication flow for applications deployed to Azure. The identity it uses depends on the environment. When an access token is needed, it tries the following identities in turn and stops at the first one that provides a token:
- A service principal configured by environment variables.
- WorkloadIdentityCredential when the Azure workload identity webhook sets environment variable configuration.
- An Azure-managed identity.
- On Windows only: a user who has signed in with a Microsoft application, such as Visual Studio. See SharedTokenCacheCredential for more details.
- The identity currently logged in to the Azure CLI, Azure PowerShell, or the Azure Developer CLI.
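The lookup order above can be pictured as a simple fallback chain. The sketch below is a simplified model of that idea in plain Python, not code from azure-identity: the function name first_available_token, the source names, and the use of None to signal an unavailable identity are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-in for a credential source: it returns a token string,
# or None when that identity is not available in the current environment.
TokenSource = Callable[[], Optional[str]]


def first_available_token(
    sources: Sequence[Tuple[str, TokenSource]]
) -> Tuple[str, str]:
    """Try each source in order and return (source_name, token) for the first hit."""
    for name, get_token in sources:
        token = get_token()
        if token is not None:
            return name, token
    raise RuntimeError("No credential source could provide a token")


# Illustrative chain: the first two sources are unavailable, the third succeeds.
chain = [
    ("environment", lambda: None),
    ("managed_identity", lambda: None),
    ("azure_cli", lambda: "fake-token"),
]
print(first_available_token(chain))
```

In practice this means the same script can run unchanged on a developer machine (signed in through the Azure CLI) and in a deployed environment (using a service principal or managed identity).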
PurviewClient handles communication between your application and the Azure Purview service. It removes the need to know the endpoint URL and requires only the Purview account name.
Step-2: Move a specific Asset to Collection using GUID.
Below is the function to move an Asset to a specified collection based on Asset GUID using PyApacheAtlas.
def move_asset_to_collection(asset_guid, collection_name):
    # asset_guid is a list of GUIDs; collection_name identifies the target collection.
    result = CLIENT.collections.move_entities(guids=asset_guid, collection=collection_name)
    return result
The CLIENT.collections.move_entities method moves one or more entities, identified by GUID (the globally unique identifier of an entity), to the specified collection.
Its parameters are a list of GUIDs and the collection's friendly name, typically a six-character pseudo-random string such as “kd2cbh”, which can be found in the Purview portal.
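Because the move call expects a list of GUIDs, it can help to tidy the input before calling move_asset_to_collection. The helper below is a minimal sketch with a hypothetical name, normalize_guids, and it assumes that asset GUIDs are standard UUID strings. It accepts a single GUID or several, rejects malformed values, and removes duplicates.

```python
import uuid
from typing import Iterable, List, Union


def normalize_guids(guids: Union[str, Iterable[str]]) -> List[str]:
    """Return a de-duplicated list of well-formed GUID strings.

    Accepts a single GUID string or any iterable of them, so callers can
    always hand a list to the move call.
    """
    if isinstance(guids, str):
        guids = [guids]
    seen = set()
    cleaned = []
    for guid in guids:
        # uuid.UUID raises ValueError if the string is not a valid UUID.
        canonical = str(uuid.UUID(guid))
        if canonical not in seen:
            seen.add(canonical)
            cleaned.append(canonical)
    return cleaned
```

A call such as move_asset_to_collection(normalize_guids(some_guid), collection_name) would then work whether some_guid is one string or a list, and a typo in a GUID fails locally before any request is sent.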
Step-3: Move Assets to Collection based on their type.
Below is the function to move all assets of a given type to a provided collection using PyApacheAtlas.
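def get_all_entities_same_type(type_name, collection_name):
    try:
        # Browse the catalog for every entity of the given type.
        search_results = CLIENT.discovery.browse(entityType=type_name)
        # Collect the GUID of each entity returned by the browse call.
        entity_guids = [result['id'] for result in search_results['value']]
        # Move all matching entities into the target collection in one call.
        return move_asset_to_collection(entity_guids, collection_name)
    except Exception as e:
        # The exception is returned rather than raised, so callers must check the result type.
        return e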
The CLIENT.discovery.browse method executes a browse search in Purview for the given entity type against the /catalog/api/browse endpoint.
Its parameter is entityType (string): the entity type to browse as the root-level entry point. This must be a valid Purview built-in or custom type.
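The function above reads the GUIDs from the 'value' list of the browse response, taking the 'id' of each item. The sketch below isolates that step so it can be tried without a Purview account. The helper name extract_guids and the sample response are made up for illustration; only the 'value' and 'id' keys come from the code above, and real responses carry more fields.

```python
from typing import Any, Dict, List


def extract_guids(browse_response: Dict[str, Any]) -> List[str]:
    """Collect the 'id' of every item under the response's 'value' key."""
    # A missing or empty 'value' yields an empty list instead of a KeyError.
    items = browse_response.get("value") or []
    return [item["id"] for item in items if "id" in item]


# Made-up response for illustration only.
sample = {
    "value": [
        {"id": "11111111-1111-1111-1111-111111111111", "name": "table_a"},
        {"id": "22222222-2222-2222-2222-222222222222", "name": "table_b"},
        {"name": "entry_without_id"},
    ]
}
print(extract_guids(sample))
```

Returning an empty list when nothing matches lets the caller skip the move call entirely rather than sending an empty request.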
Conclusion
Automating this with PyApacheAtlas helped us group assets into collections efficiently and removed the need for manual intervention.