-
I observed unexpectedly high Azure resource costs whenever I ran SparkNLP pipelines within Azure Databricks jobs. Specifically, every job that used SparkNLP produced a new cost item called "Geo-replication v2 Data Transfer", which is incurred when you perform a write operation against an out-of-region Azure storage account. All of our storage accounts are in the West Europe region, and the data we are writing is really insignificant.

Replies: 1 comment

-
When a Databricks job is launched, it creates storage accounts with geo-replication enabled (meaning that every file you save to them is replicated to other regions), and these storage accounts back DBFS. When we changed the temporary storage used by our SparkNLP jobs from DBFS to our data lake, the extra cost disappeared completely.
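For reference, a minimal sketch of what that change can look like. The `spark.jsl.settings.*` keys are taken from Spark NLP's configuration documentation, and the `abfss://` path, storage account, and container names are placeholders, so verify both against your own setup and Spark NLP version:

```python
# Sketch only: point Spark NLP's cache and temporary storage at an ADLS Gen2
# container instead of DBFS. Config keys and paths are assumptions to verify.
from pyspark.sql import SparkSession

# ADLS Gen2 container in the same region as the cluster (placeholder names)
DATALAKE_TMP = "abfss://sparknlp-tmp@mydatalake.dfs.core.windows.net/sparknlp"

spark = (
    SparkSession.builder
    .appName("sparknlp-job")
    # Folder where pretrained models/pipelines are downloaded and unpacked
    .config("spark.jsl.settings.pretrained.cache_folder",
            f"{DATALAKE_TMP}/cache_pretrained")
    # Cluster-side temporary directory (e.g. for unpacked embeddings indexes)
    .config("spark.jsl.settings.storage.cluster_tmp_dir",
            f"{DATALAKE_TMP}/cluster_tmp")
    .getOrCreate()
)
```

On Databricks jobs the SparkSession already exists when your code runs, so in practice these keys usually belong in the cluster's Spark config rather than in the notebook; the point is simply that neither directory lives on DBFS, so SparkNLP's temporary writes no longer hit the geo-replicated storage account.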