This repository contains a collection of notebooks demonstrating various features in Azure Databricks.
Working With Pandas: a notebook demonstrating the pandas_udf feature in Spark 2.3, which allows you to distribute processing of pandas dataframes across a cluster
Plotting Distributions: a notebook demonstrating how to plot the distribution of all numeric columns in a Spark dataframe using matplotlib
Write to a Single CSV File: if you have a small dataset in Spark, you can write the data into a single CSV file (instead of Spark's default behavior of writing to multiple files)
NYC Taxi Data: Do you need a big dataset for experimenting with Spark? The NYC Taxi dataset is free and publicly available. This pair of notebooks downloads all of the raw data and then converts it into a Delta table.
- Notebook 1: Download Raw Parquet Files
- Notebook 2: Convert Raw Parquet to Delta
Custom Delimiter: a brief example showing how you can use Spark to read data from a flat file that uses a non-standard delimiter
Stream to Kafka: an example of how you can use Spark Streaming to send data to Azure Event Hubs using the Kafka API
DBUtils in Parallel: a demonstration of the performance gains of using dbutils in parallel on a cluster instead of running it only on the driver
UDF Speed Testing: a comparison of the performance of Scala UDFs vs. Python UDFs
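
For a quick sense of what the Custom Delimiter and Write to a Single CSV File notebooks cover, here is a minimal PySpark sketch. It is not taken from the notebooks themselves: the function names `read_delimited` and `write_single_csv`, the `|` delimiter, and the paths are all hypothetical, and the notebooks may take a different approach. Both functions expect an existing SparkSession or DataFrame, such as the `spark` object that Databricks provides in a notebook.

```python
def read_delimited(spark, path, delimiter="|", header=True):
    """Read a delimited text file into a Spark DataFrame.

    `spark` is an existing SparkSession; `path` and `delimiter`
    are illustrative values, not ones used by the notebooks.
    """
    return (
        spark.read
        .option("sep", delimiter)  # field separator used by the CSV reader
        .option("header", "true" if header else "false")
        .csv(path)
    )


def write_single_csv(df, output_dir):
    """Write a small DataFrame as a single CSV part file.

    coalesce(1) collapses the data to one partition, so Spark emits
    one part file. Spark still creates `output_dir` as a directory
    containing that file, so this only suits data that fits on one worker.
    """
    (
        df.coalesce(1)
        .write
        .mode("overwrite")
        .option("header", "true")
        .csv(output_dir)
    )
```

In a notebook this might be called as `df = read_delimited(spark, "/tmp/example.txt")` followed by `write_single_csv(df, "/tmp/example_out")`, where both paths are placeholders.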