Move Spark docs + add Azure Synapse documentation #3675

Merged · 12 commits · Apr 16, 2025
@@ -222,6 +222,15 @@ That way, you would be able to access clickhouse1 table `<ck_db>.<ck_table>` fro

:::

## ClickHouse Cloud Settings {#clickhouse-cloud-settings}

When connecting to [ClickHouse Cloud](https://clickhouse.com), make sure to enable SSL and set the appropriate SSL mode. For example:

```text
spark.sql.catalog.clickhouse.option.ssl true
spark.sql.catalog.clickhouse.option.ssl_mode NONE
```
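
For context, here is a minimal sketch of how these options could be applied when building a Spark session with PySpark. The hostname and credentials are placeholders, and port `8443` assumes the default ClickHouse Cloud HTTPS port:

```python
from pyspark.sql import SparkSession

# Minimal sketch: placeholder host/credentials; 8443 assumes the
# default ClickHouse Cloud HTTPS port.
spark = (
    SparkSession.builder.appName("clickhouse-cloud-example")
    .config("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")
    .config("spark.sql.catalog.clickhouse.host", "<your-service>.clickhouse.cloud")
    .config("spark.sql.catalog.clickhouse.protocol", "https")
    .config("spark.sql.catalog.clickhouse.http_port", "8443")
    .config("spark.sql.catalog.clickhouse.user", "<username>")
    .config("spark.sql.catalog.clickhouse.password", "<password>")
    .config("spark.sql.catalog.clickhouse.option.ssl", "true")
    .config("spark.sql.catalog.clickhouse.option.ssl_mode", "NONE")
    .getOrCreate()
)
```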

## Read Data {#read-data}

<Tabs groupId="spark_apis">
89 changes: 89 additions & 0 deletions docs/integrations/data-ingestion/azure-synapse/index.md
@@ -0,0 +1,89 @@
---
sidebar_label: 'Azure Synapse'
slug: /integrations/azure-synapse
description: 'Introduction to Azure Synapse with ClickHouse'
keywords: ['clickhouse', 'azure synapse', 'azure', 'synapse', 'microsoft', 'azure spark', 'data']
title: 'Integrating Azure Synapse with ClickHouse'
---

import TOCInline from '@theme/TOCInline';
import Image from '@theme/IdealImage';
import sparkConfigViaNotebook from '@site/static/images/integrations/data-ingestion/azure-synapse/spark_notebook_conf.png';
import sparkUICHSettings from '@site/static/images/integrations/data-ingestion/azure-synapse/spark_ui_ch_settings.png';

# Integrating Azure Synapse with ClickHouse

[Azure Synapse](https://azure.microsoft.com/en-us/products/synapse-analytics) is an integrated analytics service that combines big data, data science, and data warehousing to enable fast, large-scale data analysis.
Within Synapse, Spark pools provide on-demand, scalable [Apache Spark](https://spark.apache.org) clusters that let users run complex data transformations, machine learning, and integrations with external systems.

This article shows how to integrate the [ClickHouse Spark connector](/integrations/apache-spark/spark-native-connector) when working with Apache Spark in Azure Synapse.


<TOCInline toc={toc}></TOCInline>

## Add the connector's dependencies {#add-connector-dependencies}
Azure Synapse supports three levels of [package maintenance](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries):
1. Default packages
2. Spark pool level
3. Session level

<br/>

Follow the [Manage libraries for Apache Spark pools guide](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages) and add the following required dependencies to your Spark application:
- `clickhouse-spark-runtime-{spark_version}_{scala_version}-{connector_version}.jar` - [official maven](https://mvnrepository.com/artifact/com.clickhouse.spark)
- `clickhouse-jdbc-{java_client_version}-all.jar` - [official maven](https://mvnrepository.com/artifact/com.clickhouse/clickhouse-jdbc)

Please visit our [Spark Connector Compatibility Matrix](/integrations/apache-spark/spark-native-connector#compatibility-matrix) docs to understand which versions suit your needs.
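
As a concrete illustration, the resolved jar names follow the patterns above. The versions shown here are placeholders only; substitute the combination from the compatibility matrix that matches your Spark pool's Spark and Scala versions:

```text
clickhouse-spark-runtime-3.4_2.12-0.8.0.jar
clickhouse-jdbc-0.6.3-all.jar
```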
Reviewer note (Member): not related to the docs, just a note: JDBC version in Spark connector is quite outdated 😞

## Add ClickHouse as a catalog {#add-clickhouse-as-catalog}

There are a variety of ways to add Spark configs to your session:
* Custom configuration file to load with your session
* Add configurations via Azure Synapse UI
* Add configurations in your Synapse notebook

Follow the [Manage Apache Spark configuration](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-create-spark-configuration) guide and add the [connector's required Spark configurations](/integrations/apache-spark/spark-native-connector#register-the-catalog-required).

For instance, you can configure your Spark session in your notebook with these settings:

```python
%%configure -f
{
    "conf": {
        "spark.sql.catalog.clickhouse": "com.clickhouse.spark.ClickHouseCatalog",
        "spark.sql.catalog.clickhouse.host": "<clickhouse host>",
        "spark.sql.catalog.clickhouse.protocol": "https",
        "spark.sql.catalog.clickhouse.http_port": "<port>",
        "spark.sql.catalog.clickhouse.user": "<username>",
        "spark.sql.catalog.clickhouse.password": "<password>",
        "spark.sql.catalog.clickhouse.database": "default"
    }
}
```

Make sure this configuration is in the first cell of your notebook, as shown below:

<Image img={sparkConfigViaNotebook} size="xl" alt="Setting Spark configurations via notebook" border/>

Please visit the [ClickHouse Spark configurations page](/integrations/apache-spark/spark-native-connector#configurations) for additional settings.

:::info
When working with ClickHouse Cloud, please make sure to set the [required Spark settings](/integrations/apache-spark/spark-native-connector#clickhouse-cloud-settings).
:::
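
Once the session starts with the catalog registered, a quick smoke test from a later notebook cell confirms that the catalog resolves. This is a minimal sketch; `default.my_table` is a placeholder name:

```python
# List the ClickHouse databases visible through the registered catalog.
spark.sql("SHOW NAMESPACES IN clickhouse").show()

# Read a table through the catalog (placeholder database/table names).
df = spark.table("clickhouse.default.my_table")
df.printSchema()
df.show(5)
```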

## Setup Verification {#setup-verification}

To verify that the dependencies and configurations were set successfully, open your session's Spark UI and go to the `Environment` tab.
There, look for your ClickHouse-related settings:

<Image img={sparkUICHSettings} size="xl" alt="Verifying ClickHouse settings using Spark UI" border/>
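
If you prefer a programmatic check over the UI, here is a small sketch that reads the settings back from the running session (the keys are the ones set in the configuration above):

```python
# Print the ClickHouse catalog settings from the active Spark session.
for key in (
    "spark.sql.catalog.clickhouse",
    "spark.sql.catalog.clickhouse.host",
    "spark.sql.catalog.clickhouse.protocol",
    "spark.sql.catalog.clickhouse.http_port",
):
    print(key, "=", spark.conf.get(key, "<not set>"))
```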


## Additional Resources {#additional-resources}

- [ClickHouse Spark Connector Docs](/integrations/apache-spark)
- [Azure Synapse Spark Pools Overview](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview)
- [Optimize performance for Apache Spark workloads](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-performance)
- [Manage libraries for Apache Spark pools in Synapse](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages)
- [Manage Apache Spark configuration in Synapse](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-create-spark-configuration)
25 changes: 14 additions & 11 deletions docs/integrations/data-ingestion/data-ingestion-index.md
@@ -1,6 +1,6 @@
---
slug: /integrations/data-ingestion-overview
keywords: ['Airbyte', 'Amazon Glue', 'Apache Beam', 'dbt', 'Fivetran', 'NiFi', 'dlt', 'Vector']
keywords: [ 'Airbyte', 'Apache Spark', 'Spark', 'Azure Synapse', 'Amazon Glue', 'Apache Beam', 'dbt', 'Fivetran', 'NiFi', 'dlt', 'Vector' ]
title: 'Data Ingestion'
description: 'Landing page for the data ingestion section'
---
@@ -10,13 +10,16 @@ description: 'Landing page for the data ingestion section'
ClickHouse integrates with a number of solutions for data integration and transformation.
For more information check out the pages below:

| Data Ingestion Tool | Description |
|--------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Airbyte](/integrations/airbyte) | An open-source data integration platform. It allows the creation of ELT data pipelines and is shipped with more than 140 out-of-the-box connectors. |
| [Amazon Glue](/integrations/glue) | A fully managed, serverless data integration service provided by Amazon Web Services (AWS) simplifying the process of discovering, preparing, and transforming data for analytics, machine learning, and application development. |
| [Apache Beam](/integrations/apache-beam) | An open-source, unified programming model that enables developers to define and execute both batch and stream (continuous) data processing pipelines. |
| [dbt](/integrations/dbt) | Enables analytics engineers to transform data in their warehouses by simply writing select statements. |
| [dlt](/integrations/data-ingestion/etl-tools/dlt-and-clickhouse) | An open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets. |
| [Fivetran](/integrations/fivetran) | An automated data movement platform moving data out of, into and across your cloud data platforms. |
| [NiFi](/integrations/nifi) | An open-source workflow management software designed to automate data flow between software systems. |
| [Vector](/integrations/vector) | A high-performance observability data pipeline that puts organizations in control of their observability data. |
| Data Ingestion Tool | Description |
|------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Airbyte](/integrations/airbyte) | An open-source data integration platform. It allows the creation of ELT data pipelines and is shipped with more than 140 out-of-the-box connectors. |
| [Apache Spark](/integrations/apache-spark)                        | A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.                                                                                                        |
| [Amazon Glue](/integrations/glue) | A fully managed, serverless data integration service provided by Amazon Web Services (AWS) simplifying the process of discovering, preparing, and transforming data for analytics, machine learning, and application development. |
| [Azure Synapse](/integrations/azure-synapse) | A fully managed, cloud-based analytics service provided by Microsoft Azure, combining big data and data warehousing to simplify data integration, transformation, and analytics at scale using SQL, Apache Spark, and data pipelines. |
| [Apache Beam](/integrations/apache-beam) | An open-source, unified programming model that enables developers to define and execute both batch and stream (continuous) data processing pipelines. |
| [dbt](/integrations/dbt) | Enables analytics engineers to transform data in their warehouses by simply writing select statements. |
| [dlt](/integrations/data-ingestion/etl-tools/dlt-and-clickhouse) | An open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets. |
| [Fivetran](/integrations/fivetran) | An automated data movement platform moving data out of, into and across your cloud data platforms. |
| [NiFi](/integrations/nifi) | An open-source workflow management software designed to automate data flow between software systems. |
| [Vector](/integrations/vector) | A high-performance observability data pipeline that puts organizations in control of their observability data. |

3 changes: 1 addition & 2 deletions docs/integrations/data-ingestion/data-sources-index.md
@@ -1,6 +1,6 @@
---
slug: /integrations/index
keywords: ['AWS S3', 'PostgreSQL', 'Kafka', 'Apache Spark', 'MySQL', 'Cassandra', 'Redis', 'RabbitMQ', 'MongoDB', 'Google Cloud Storage', 'Hive', 'Hudi', 'Iceberg', 'MinIO', 'Delta Lake', 'RocksDB', 'Splunk', 'SQLite', 'NATS', 'EMQX', 'local files', 'JDBC', 'ODBC']
keywords: ['AWS S3', 'PostgreSQL', 'Kafka', 'MySQL', 'Cassandra', 'Redis', 'RabbitMQ', 'MongoDB', 'Google Cloud Storage', 'Hive', 'Hudi', 'Iceberg', 'MinIO', 'Delta Lake', 'RocksDB', 'Splunk', 'SQLite', 'NATS', 'EMQX', 'local files', 'JDBC', 'ODBC']
description: 'Datasources overview page'
title: 'Data Sources'
---
@@ -15,7 +15,6 @@ For further information see the pages listed below:
| [AWS S3](/integrations/s3) |
| [PostgreSQL](/integrations/postgresql) |
| [Kafka](/integrations/kafka) |
| [Apache Spark](/integrations/apache-spark) |
| [MySQL](/integrations/mysql) |
| [Cassandra](/integrations/cassandra) |
| [Redis](/integrations/redis) |
2 changes: 2 additions & 0 deletions docs/integrations/index.mdx
@@ -81,6 +81,7 @@ import Yepcodesvg from '@site/static/images/integrations/logos/yepcode.svg';
import Warpstreamsvg from '@site/static/images/integrations/logos/warpstream.svg';
import Bytewaxsvg from '@site/static/images/integrations/logos/bytewax.svg';
import glue_logo from '@site/static/images/integrations/logos/glue_logo.png';
import azure_synapse_logo from '@site/static/images/integrations/logos/azure-synapse.png';
import logo_cpp from '@site/static/images/integrations/logos/logo_cpp.png';
import cassandra from '@site/static/images/integrations/logos/cassandra.png';
import deltalake from '@site/static/images/integrations/logos/deltalake.png';
@@ -204,6 +205,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no
|Amazon Glue|<Image img={glue_logo} size="logo" alt="Amazon Glue logo"/>|Data ingestion|Query ClickHouse over JDBC|[Documentation](/integrations/glue)|
|Apache Spark|<Sparksvg alt="Apache Spark logo" style={{width: '3rem'}}/>|Data ingestion|Spark ClickHouse Connector is a high-performance connector built on top of Spark DataSource V2.|[GitHub](https://github.com/housepower/spark-clickhouse-connector),<br/>[Documentation](/integrations/data-ingestion/apache-spark/index.md)|
|Azure Event Hubs|<Azureeventhubssvg alt="Azure Events Hub logo" style={{width: '3rem'}}/>|Data ingestion|A data streaming platform that supports Apache Kafka's native protocol|[Website](https://azure.microsoft.com/en-gb/products/event-hubs)|
|Azure Synapse|<Image img={azure_synapse_logo} size="logo" alt="Azure Synapse logo"/>|Data ingestion|A cloud-based analytics service for big data and data warehousing.|[Documentation](/integrations/azure-synapse)|
|C++|<Image img={logo_cpp} alt="Cpp logo" size="logo"/>|Language client|C++ client for ClickHouse|[GitHub](https://github.com/ClickHouse/clickhouse-cpp)|
|Cassandra|<Image img={cassandra} alt="Cassandra logo" size="logo"/>|Data ingestion|Allows ClickHouse to use [Cassandra](https://cassandra.apache.org/) as a dictionary source.|[Documentation](/sql-reference/dictionaries/index.md#cassandra)|
|CHDB|<Chdbsvg alt="CHDB logo" style={{width: '3rem' }}/>|AI/ML|An embedded OLAP SQL Engine|[GitHub](https://github.com/chdb-io/chdb#/),<br/>[Documentation](https://doc.chdb.io/)|
4 changes: 4 additions & 0 deletions scripts/aspell-dict-file.txt
@@ -981,3 +981,7 @@ tunable
DAGs
--docs/migrations/postgres/appendix.md--
Citus
--docs/integrations/data-ingestion/azure-synapse/index.md--
microsoft
sparkConfigViaNotebook
sparkUICHSettings
25 changes: 13 additions & 12 deletions sidebars.js
@@ -793,18 +793,6 @@ const sidebars = {
"integrations/data-ingestion/kafka/kafka-table-engine-named-collections"
],
},
{
type: "category",
label: "Apache Spark",
className: "top-nav-item",
collapsed: true,
collapsible: true,
items: [
"integrations/data-ingestion/apache-spark/index",
"integrations/data-ingestion/apache-spark/spark-native-connector",
"integrations/data-ingestion/apache-spark/spark-jdbc",
],
},
"integrations/data-sources/mysql",
"integrations/data-sources/cassandra",
"integrations/data-sources/redis",
@@ -935,7 +923,20 @@ const sidebars = {
link: { type: "doc", id: "integrations/data-ingestion/data-ingestion-index" },
items: [
"integrations/data-ingestion/etl-tools/airbyte-and-clickhouse",
{
type: "category",
label: "Apache Spark",
className: "top-nav-item",
collapsed: true,
collapsible: true,
items: [
"integrations/data-ingestion/apache-spark/index",
"integrations/data-ingestion/apache-spark/spark-native-connector",
"integrations/data-ingestion/apache-spark/spark-jdbc",
],
},
"integrations/data-ingestion/aws-glue/index",
"integrations/data-ingestion/azure-synapse/index",
"integrations/data-ingestion/etl-tools/apache-beam",
{
type: "category",