Wednesday 21 October 2020

Reuse logic/code in Databricks Notebook

Databricks is a great tool, and it makes processing large amounts of data very easy. While designing a Databricks workflow, I came across the need to reuse logic and business rules across many notebooks. I did not want to create a JAR or a Python wheel, as that would create a dependency on another tool. My team primarily consists of data engineers from SQL and ETL backgrounds; I did not want them to have to learn new tooling, and they were also not interested in going outside the Databricks notebook.

After some research I came up with the following solution to include reusable logic from one notebook in another:

%run ./pyclass

Let me explain in detail. I created one notebook containing a Python class with all the reusable logic, then included that class in another notebook using the %run magic command.
Once the class is included, I can simply create an instance of it and reuse the logic. In the example I created a notebook named pyclass with a class that has two methods:

    1. msg: displays a message
    2. fab_num: calculates Fibonacci numbers

[Image: Databricks-Include-Notebook-Class]
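
For reference, here is a minimal sketch of what the pyclass notebook could contain. The class name Helper and the method bodies are my illustrative assumptions; only the method names msg and fab_num come from the post.

# Notebook: pyclass
# %run executes this notebook in the caller's context, so the class
# defined here becomes available in the including notebook.
class Helper:
    def msg(self, text):
        # Display a message
        print(text)

    def fab_num(self, n):
        # Return the first n Fibonacci numbers
        nums, a, b = [], 0, 1
        for _ in range(n):
            nums.append(a)
            a, b = b, a + b
        return nums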

In another notebook I created an instance of the class and reused the logic.

[Image: Databricks-Include-Notebook]
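
And a sketch of the consuming notebook; note that %run must sit in a cell of its own:

# Cell 1: pull in everything defined in the pyclass notebook
%run ./pyclass

# Cell 2: create an instance and reuse the logic
h = Helper()
h.msg("Hello from shared code!")
print(h.fab_num(10))   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]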


Happy Coding!

Sunday 4 October 2020

ADF Copy Activity to Oracle Error - "Error tolerance exceeded. Bulk load operation terminated"

Recently I have been copying data to an on-premises Oracle database using an Azure Data Factory Copy activity with a Self-Hosted IR. After a few days, it started throwing the error below.

{
    "errorCode": "BadRequest",
    "message": "Operation on target CD_SQL_ORA failed: Failure happened on 'Sink' side. ErrorCode=UserErrorOdbcOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ERROR [HY000] [Microsoft][ODBC Oracle Wire Protocol driver]Error tolerance exceeded. Bulk load operation terminated. Error in parameter 4.\r\nERROR [HY000] [Microsoft][ODBC Oracle Wire Protocol driver][Oracle]Invalid Position. Error in row 10001. Error in parameter 4.,Source=Microsoft.DataTransfer.ClientLibrary.Odbc.OdbcConnector,''Type=Microsoft.DataTransfer.ClientLibrary.Odbc.Exceptions.OdbcException,Message=ERROR [HY000] [Microsoft][ODBC Oracle Wire Protocol driver]Error tolerance exceeded. Bulk load operation terminated. Error in parameter 4.\r\nERROR [HY000] [Microsoft][ODBC Oracle Wire Protocol driver][Oracle]Invalid Position. Error in row 10001. Error in parameter 4.,Source=msora28.dll,'",
    "failureType": "UserError",
    "target": "PL_COPY_CUST",
    "details": ""
}

After googling the error and even contacting our Oracle DBA, no solution was found, so I contacted Microsoft support as a last option. After some trial and error, the copy worked after changing 'writeBatchSize' from 10,000 (the default value) to 50,000.
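
For reference, a sketch of where this setting lives in the copy activity's sink definition; the surrounding pipeline and dataset JSON is omitted, and 'OracleSink' and 'writeBatchSize' follow the ADF copy activity schema:

"sink": {
    "type": "OracleSink",
    "writeBatchSize": 50000
}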

Again, I am not sure of the reason, and there is not much information around it in the documentation. Please let me know if you know the exact cause of the issue.


Happy Coding!

Friday 12 June 2020

Query to get resource class of user in Azure Synapse

I struggled a bit to find which user is assigned to which resource class in Azure Synapse. After a short effort I prepared this query to get the details and thought it was worth sharing. Replace 'user' in the WHERE clause with the database user you want to check.

SELECT u.[name] AS [User Name], p.[name] AS [Resource Class]
FROM sys.database_principals AS p
INNER JOIN sys.database_role_members AS r ON p.principal_id = r.role_principal_id
INNER JOIN sys.sysusers AS u ON u.uid = r.member_principal_id
WHERE p.[name] LIKE '%rc%' AND u.[name] = 'user'

Happy Coding!!

Friday 5 June 2020

AssertionError: col should be Column (Azure Databricks)

You get the error "AssertionError: col should be Column" when you try to add a column to a DataFrame like this:

df_col = df_col.withColumn("UniqueID", hash(str(col)))

To solve this, just add the following import and you should get rid of the error. I still have not found the exact cause, but it worked; I will update this post as soon as the cause is found.

from pyspark.sql.functions import *
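
My guess at the cause (an assumption on my part, not something I have confirmed): without the import, hash is Python's built-in hash(), which returns an int, while withColumn expects a Column; the wildcard import shadows it with pyspark.sql.functions.hash, which does return a Column. A minimal runnable sketch, with illustrative DataFrame and column names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, hash  # pyspark's hash returns a Column

spark = SparkSession.builder.getOrCreate()

# Illustrative data; the original post's DataFrame is not shown
df_col = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# hash() now resolves to pyspark.sql.functions.hash, so withColumn receives a Column
df_col = df_col.withColumn("UniqueID", hash(col("key")))
df_col.show()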

Happy Coding!

Tuesday 24 March 2020

Issue while connecting Azure Data Factory (v2) to Azure Data Lake Gen2 using Managed Service Identity (MSI)


Microsoft's documentation is usually very good, but for errors it does not offer much help. One such case was when I was trying to connect ADF v2 to ADLS Gen2 using Managed Identity. I was getting this error:

ADLS Gen2 operation failed for: Operation returned an invalid status code 'Forbidden'. Account: 'adlsgen2rk'. FileSystem: 'filesystem'. ErrorCode: 'AuthorizationPermissionMismatch'. Message: 'This request is not authorized to perform this operation using this permission.'. RequestId: 'd26ce7e5-601f-004f-4bda-017884000000'.. Operation returned an invalid status code 'Forbidden' Activity ID: f49c8379-d827-40d9-8c13-2206c8fd0c01.


There was no help available online, so I connected with Microsoft to solve this.



Steps to solve are:
  1. Go to the ADLS Gen2 account you want to connect to
  2. In the left navigation pane, select "Access control (IAM)"
  3. Click the "+ Add" button and select "Add role assignment"
  4. In the form, in the Role drop-down, select "Storage Blob Data Contributor"
  5. Under Select, choose your Azure Data Factory name and click "Save" (the same assignment can be scripted with the Azure CLI, as sketched after this list)
IMPORTANT: Wait 5 to 10 minutes for the assignment to reflect in Azure Data Factory
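
If you prefer scripting over clicking through the portal, here is a sketch of the equivalent role assignment with the Azure CLI; every angle-bracket value is a placeholder to substitute with your own IDs and names:

# Grant the data factory's managed identity access to the storage account.
# The object ID is found on the data factory's Identity blade in the portal.
az role assignment create \
  --assignee "<adf-managed-identity-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account-name>"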

Go to Azure Data Factory again and try creating the connection; it should be successful now 😊!

Happy Coding!!