-
I observed unexpectedly high Azure resource costs whenever I ran SparkNLP pipelines within Azure Databricks jobs. Specifically, every job that used SparkNLP produced a new cost item called "Geo-replication v2 Data Transfer", which is incurred when you perform a write operation against an out-of-region Azure storage account. All of our storage accounts are in the West Europe region, and the data we are writing is really insignificant.

Replies: 1 comment

-
When a Databricks job is launched, it creates storage accounts with geo-replication enabled (meaning that every file you save to them is replicated to other regions), and these storage accounts back DBFS. When we changed the temporary storage used by our SparkNLP jobs from DBFS to our data lake, the extra cost disappeared completely.
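For reference, a minimal sketch of what that change can look like. The `spark.jsl.settings.*` keys are taken from Spark NLP's configuration documentation, and the `abfss://` path, storage account, and container names are placeholders, so verify both against your own setup and Spark NLP version:

```python
# Sketch only: point Spark NLP's cache and temporary storage at an ADLS Gen2
# container instead of DBFS. Config keys and paths are assumptions to verify.
from pyspark.sql import SparkSession

# ADLS Gen2 container in the same region as the cluster (placeholder names)
DATALAKE_TMP = "abfss://sparknlp-tmp@mydatalake.dfs.core.windows.net/sparknlp"

spark = (
    SparkSession.builder
    .appName("sparknlp-job")
    # Folder where pretrained models/pipelines are downloaded and unpacked
    .config("spark.jsl.settings.pretrained.cache_folder",
            f"{DATALAKE_TMP}/cache_pretrained")
    # Cluster-side temporary directory (e.g. for unpacked embeddings indexes)
    .config("spark.jsl.settings.storage.cluster_tmp_dir",
            f"{DATALAKE_TMP}/cluster_tmp")
    .getOrCreate()
)
```

On Databricks jobs the SparkSession already exists when your code runs, so in practice these keys usually belong in the cluster's Spark config rather than in the notebook; the point is simply that neither directory lives on DBFS, so SparkNLP's temporary writes no longer hit the geo-replicated storage account.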