example: Add getting-started example for Polaris with Minio, Spark & Trino #1595

Draft: wants to merge 9 commits into base: main
9 changes: 4 additions & 5 deletions getting-started/jdbc/docker-compose-bootstrap-db.yml
@@ -26,11 +26,10 @@ services:
- QUARKUS_DATASOURCE_JDBC_URL=${QUARKUS_DATASOURCE_JDBC_URL}
- QUARKUS_DATASOURCE_USERNAME=${QUARKUS_DATASOURCE_USERNAME}
- QUARKUS_DATASOURCE_PASSWORD=${QUARKUS_DATASOURCE_PASSWORD}
command:
- "bootstrap"
- "--realm=POLARIS"
- "--credential=POLARIS,root,s3cr3t"

command: >
bootstrap
--realm=POLARIS_MINIO_REALM
--credential=POLARIS_MINIO_REALM,root,s3cr3t
polaris:
depends_on:
polaris-bootstrap:
66 changes: 43 additions & 23 deletions getting-started/jdbc/docker-compose.yml
@@ -21,32 +21,52 @@ services:

polaris:
image: apache/polaris:postgres-latest
depends_on:
postgres-minio: # Polaris server depends on PostgreSQL being healthy
condition: service_healthy
# polaris-bootstrap-minio is a setup task; polaris server doesn't need to wait for it on every start
# after the initial successful bootstrap. Other services that *use* Polaris data
# (like polaris-setup-catalog-minio) should depend on polaris: service_healthy.
ports:
# API port
- "8181:8181"
# Management port (metrics and health checks)
- "8182:8182"
# Optional, allows attaching a debugger to the Polaris JVM
- "5005:5005"
# The host port is defined by POLARIS_MINIO_API_PORT from .env, container port is 8181
- "${POLARIS_MINIO_API_PORT:-8183}:${QUARKUS_HTTP_PORT:-8181}" # Or just - "${POLARIS_MINIO_API_PORT:-8183}:8181"
# The host port is defined by POLARIS_MINIO_MGMT_PORT from .env, container port is 8182
- "${POLARIS_MINIO_MGMT_PORT:-8184}:${QUARKUS_MANAGEMENT_PORT:-8182}" # Or just - "${POLARIS_MINIO_MGMT_PORT:-8184}:8182"
environment:
- JAVA_DEBUG=true
- JAVA_DEBUG_PORT=*:5005
- POLARIS_PERSISTENCE_TYPE=relational-jdbc
- POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_RETRIES=5
- POLARIS_PERSISTENCE_RELATIONAL_JDBC_INITIAL_DELAY_IN_MS=100
- POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_DELAY_IN_MS=5000
- QUARKUS_DATASOURCE_DB_KIND=pgsql
- QUARKUS_DATASOURCE_JDBC_URL=${QUARKUS_DATASOURCE_JDBC_URL}
- QUARKUS_DATASOURCE_USERNAME=${QUARKUS_DATASOURCE_USERNAME}
- QUARKUS_DATASOURCE_PASSWORD=${QUARKUS_DATASOURCE_PASSWORD}
- POLARIS_REALM_CONTEXT_REALMS=POLARIS
- QUARKUS_OTEL_SDK_DISABLED=true
# These variables will be sourced from the .env file (or shell environment).
# Docker Compose makes them available to the container if they are defined.
- QUARKUS_DATASOURCE_DB_KIND
- QUARKUS_DATASOURCE_JDBC_URL
- QUARKUS_DATASOURCE_USERNAME
- QUARKUS_DATASOURCE_PASSWORD

- POLARIS_PERSISTENCE_TYPE
- POLARIS_REALM_CONTEXT_REALMS

# Optional JDBC retry settings
- POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_RETRIES
- POLARIS_PERSISTENCE_RELATIONAL_JDBC_INITIAL_DELAY_IN_MS
- POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_DELAY_IN_MS

# Other Quarkus/App settings from .env
- QUARKUS_OTEL_SDK_DISABLED
- QUARKUS_HTTP_PORT # Tells Quarkus which port to bind to inside the container
- QUARKUS_MANAGEMENT_PORT # Tells Quarkus which management port to bind to inside the container

# Optional: Debug logging settings (will be sourced from .env if uncommented there)
- QUARKUS_LOG_CONSOLE_LEVEL
- QUARKUS_LOG_CATEGORY_IO_SMALLRYE_CONFIG_LEVEL
- QUARKUS_LOG_CATEGORY_ORG_APACHE_POLARIS_LEVEL
- QUARKUS_LOG_CATEGORY_IO_QUARKUS_DATASOURCE_LEVEL
- QUARKUS_LOG_CATEGORY_ORG_AGROAL_LEVEL
healthcheck:
test: ["CMD", "curl", "http://localhost:8182/q/health"]
interval: 2s
timeout: 10s
retries: 10
start_period: 10s
# Uses the management port defined by POLARIS_MINIO_MGMT_PORT (which sets QUARKUS_MANAGEMENT_PORT for inside the container)
# The healthcheck runs INSIDE the container network, so it checks localhost:QUARKUS_MANAGEMENT_PORT (e.g. localhost:8182)
test: ["CMD-SHELL", "curl -f http://localhost:${QUARKUS_MANAGEMENT_PORT:-8182}/q/health/live || curl -f http://localhost:${QUARKUS_MANAGEMENT_PORT:-8182}/q/health/ready || curl -f http://localhost:${QUARKUS_MANAGEMENT_PORT:-8182}/q/health"]
interval: 10s
timeout: 5s
retries: 15
start_period: 30s

polaris-setup:
image: alpine/curl
63 changes: 63 additions & 0 deletions getting-started/minio/.env
@@ -0,0 +1,63 @@
# .env
# Default environment variables for Polaris Minio S3 example

# Minio Root Credentials (used by minio service and mc script)
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin

# Minio S3 User Credentials (created by mc script, used by services)
POLARIS_S3_USER=polaris_s3_user
POLARIS_S3_PASSWORD=polaris_s3_password_val

SPARK_MINIO_S3_USER=spark_minio_s3_user
SPARK_MINIO_S3_PASSWORD=spark_minio_s3_password_val

TRINO_MINIO_S3_USER=trino_minio_s3_user
TRINO_MINIO_S3_PASSWORD=trino_minio_s3_password_val

# Polaris Client Credentials (for Spark & Trino to auth to Polaris)
# These are used by:
# - polaris-bootstrap-minio (command to create them)
# - polaris-setup-governance (environment for script to know client IDs, and to create credentials if bootstrap doesn't)
# - spark-sql-minio (environment for Spark's Polaris catalog auth)
# - trino-minio (environment for Trino's Polaris catalog auth)
SPARK_POLARIS_CLIENT_ID=spark_app_client
SPARK_POLARIS_CLIENT_SECRET=spark_client_secret_val

TRINO_POLARIS_CLIENT_ID=trino_app_client
TRINO_POLARIS_CLIENT_SECRET=trino_client_secret_val

# These specific _ENV suffixed versions are referenced by the spark-sql-minio service environment block
# Setting them explicitly here to match the defaults and avoid Docker Compose warnings.
SPARK_POLARIS_CLIENT_ID_ENV=spark_app_client
SPARK_POLARIS_CLIENT_SECRET_ENV=spark_client_secret_val

# --- Polaris Service Specific Configuration ---
POLARIS_PERSISTENCE_TYPE=in-memory
POLARIS_REALM_CONTEXT_REALMS=POLARIS_MINIO_REALM
POLARIS_BOOTSTRAP_CREDENTIALS="POLARIS_MINIO_REALM,root,s3cr3t" # Custom root credentials for the realm
# --- Other Quarkus and Port Mappings for Services ---
QUARKUS_OTEL_SDK_DISABLED=true # For polaris service

# Port Mappings (defaults used in docker-compose.yml)
MINIO_API_PORT=9000
MINIO_CONSOLE_PORT=9001
POSTGRES_MINIO_PORT=5433
POLARIS_MINIO_API_PORT=8183
POLARIS_MINIO_MGMT_PORT=8184 # Important for health check

SPARK_UI_MINIO_START_PORT=4050
SPARK_UI_MINIO_END_PORT=4055 # Used in port range mapping

TRINO_MINIO_PORT=8083

# Quarkus HTTP/Management ports for Polaris Service (can reference variables above)
QUARKUS_HTTP_PORT=${POLARIS_MINIO_API_PORT}
QUARKUS_MANAGEMENT_PORT=${POLARIS_MINIO_MGMT_PORT}

# --- Optional: Debug Logging for Polaris Service (uncomment if needed) ---
# QUARKUS_LOG_CONSOLE_LEVEL=DEBUG
# QUARKUS_LOG_CATEGORY_IO_SMALLRYE_CONFIG_LEVEL=DEBUG
# QUARKUS_LOG_CATEGORY_ORG_APACHE_POLARIS_LEVEL=DEBUG
# QUARKUS_LOG_CATEGORY_IO_QUARKUS_DATASOURCE_LEVEL=DEBUG
# QUARKUS_LOG_CATEGORY_ORG_AGROAL_LEVEL=DEBUG
158 changes: 158 additions & 0 deletions getting-started/minio/README.md
@@ -0,0 +1,158 @@
# Getting Started with Apache Polaris: Minio S3, Governance with Spark & Trino (Read-Only)

This example demonstrates setting up Apache Polaris to manage an Iceberg data lake in Minio S3, focusing on governance.
Polaris uses Postgres for its metadata. Spark SQL is configured for read/write access to create and populate Iceberg tables. Trino is configured for **strict read-only access** to query these tables. Access control is enforced by Polaris, with underlying S3 permissions managed by Minio.

**Prerequisites:**
* Docker and Docker Compose.
* `jq` installed on your host machine.
* Apache Polaris images (`apache/polaris-admin-tool:postgres-latest`, `apache/polaris:postgres-latest`) built from source with JDBC support, tagged as `postgres-latest`.

To build these images, run:

```shell
./gradlew \
:polaris-quarkus-server:assemble \
:polaris-quarkus-server:quarkusAppPartsBuild --rerun \
:polaris-quarkus-admin:assemble \
:polaris-quarkus-admin:quarkusAppPartsBuild --rerun \
-Dquarkus.container-image.tag=postgres-latest \
-Dquarkus.container-image.build=true
```
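
Before starting the stack, it can help to confirm that both images from the prerequisites exist locally. The following is a minimal sketch, not part of this example's scripts: `missing_images` and the `DOCKER_BIN` override are hypothetical names introduced here, and the check relies only on `docker image inspect`.

```shell
# Print the name of every image argument that is not present locally.
# DOCKER_BIN is a hypothetical override (defaults to "docker") so the
# check can be dry-run without a Docker daemon.
missing_images() {
  docker_bin="${DOCKER_BIN:-docker}"
  for image in "$@"; do
    # "docker image inspect" exits non-zero when the image is absent.
    if ! "$docker_bin" image inspect "$image" >/dev/null 2>&1; then
      printf '%s\n' "$image"
    fi
  done
}
```

For example, `missing_images apache/polaris:postgres-latest apache/polaris-admin-tool:postgres-latest` prints nothing when both images are available.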

**Security Overview:**
* **Minio (S3 Storage):**
    * `polaris_s3_user` (R/W): Used by Polaris service for warehouse management.
    * `spark_minio_s3_user` (R/W): Used by Spark engine for data R/W operations.
    * `trino_minio_s3_user` (R/O): Used by Trino engine for data read operations.
* **Polaris (Catalog & Governance):**
    * `root` user: Admin access to Polaris.
    * `spark_app_client`: Polaris client ID for Spark, assigned `polaris_spark_role` (R/W permissions on `minio_catalog.ns_governed`).
    * `trino_app_client`: Polaris client ID for Trino, assigned `polaris_trino_role` (R/O permissions on `minio_catalog.ns_governed`).
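
The read-only side of the storage split can be pictured as an S3 policy that allows listing and reading but no writes. The sketch below is an assumption about what such a policy might look like, not the contents of `setup-minio.sh`; `readonly_policy` is a hypothetical helper, and the bucket name `polaris-bucket` is the one used in the setup steps.

```shell
# Print a read-only S3 policy document for the given bucket.
# Hypothetical helper: the policy actually applied by this example
# lives in its Minio setup script and may differ.
readonly_policy() {
  bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::${bucket}"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::${bucket}/*"]
    }
  ]
}
EOF
}
```

In recent `mc` releases, a document like the output of `readonly_policy polaris-bucket` could be registered with `mc admin policy create` and bound to `trino_minio_s3_user` with `mc admin policy attach`; the script in this example may do it differently.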

**Setup and Execution:**

1. **Environment Variables (Optional):**
Create a `.env` file in this directory (`getting-started/minio/.env`) to customize credentials and ports. Example:
```env
# Minio Settings
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin
MINIO_API_PORT=9000
MINIO_CONSOLE_PORT=9001

# Minio S3 User Credentials (used by services, created by mc)
POLARIS_S3_USER=polaris_s3_user
POLARIS_S3_PASSWORD=polaris_s3_password_val
SPARK_MINIO_S3_USER=spark_minio_s3_user
SPARK_MINIO_S3_PASSWORD=spark_minio_s3_password_val
TRINO_MINIO_S3_USER=trino_minio_s3_user
TRINO_MINIO_S3_PASSWORD=trino_minio_s3_password_val

# Polaris Client Credentials (for Spark & Trino to auth to Polaris, created by bootstrap)
SPARK_POLARIS_CLIENT_ID=spark_app_client
SPARK_POLARIS_CLIENT_SECRET=spark_client_secret_val
TRINO_POLARIS_CLIENT_ID=trino_app_client
TRINO_POLARIS_CLIENT_SECRET=trino_client_secret_val

# Ports
POSTGRES_MINIO_PORT=5433
POLARIS_MINIO_API_PORT=8183
POLARIS_MINIO_MGMT_PORT=8184
SPARK_UI_MINIO_START_PORT=4050
# SPARK_UI_MINIO_END_PORT=4055 # Not strictly needed if using start port only for mapping range
TRINO_MINIO_PORT=8083
```

2. **Ensure Scripts are Executable:**
```bash
chmod +x getting-started/minio/minio-config/setup-minio.sh
chmod +x getting-started/minio/polaris-config/create-catalog-minio.sh
chmod +x getting-started/minio/polaris-config/setup-polaris-governance.sh
```

3. **Start Services:**
Navigate to `getting-started/minio` and run:
```shell
docker compose up -d --build
```
This will start all services, including Minio setup, Polaris bootstrap (creating `root`, `spark_app_client`, `trino_app_client` principals), Polaris catalog creation, and Polaris governance setup (creating roles and assigning grants). Check logs with `docker compose logs -f`.

4. **Access Minio Console:**
Open `http://localhost:${MINIO_CONSOLE_PORT:-9001}` and log in (default: `minioadmin`/`minioadmin`). Verify that `polaris-bucket` exists.

5. **Using Spark SQL (Read/Write Access):**
Attach to Spark: `docker attach spark-sql-minio-gov` (Press ENTER for prompt).
The default catalog is `polaris_minio_gov`.
```sql
-- Create a namespace governed by Polaris policies
CREATE NAMESPACE IF NOT EXISTS ns_governed
COMMENT 'Namespace for governed data access'
LOCATION 's3a://polaris-bucket/iceberg_warehouse/minio_catalog/ns_governed/'; -- Optional but good practice

USE ns_governed;

-- Create an Iceberg table
CREATE TABLE IF NOT EXISTS my_gov_table (id INT, name STRING, value DOUBLE)
USING iceberg
COMMENT 'Governed table for Spark R/W and Trino R/O demo'
TBLPROPERTIES ('format-version'='2');

-- Insert data
INSERT INTO my_gov_table VALUES (1, 'SparkRecordOne', 10.1), (2, 'SparkRecordTwo', 20.2);

-- Select data
SELECT * FROM my_gov_table ORDER BY id;
-- Expected: Shows inserted records.
```

6. **Using Trino CLI (Strict Read-Only Access):**
Access Trino CLI: `docker exec -it minio-trino-gov trino`
The Polaris catalog is mapped to `iceberg` in Trino.
```sql
SHOW CATALOGS;
-- Expected: iceberg, system, ...

SHOW SCHEMAS FROM iceberg;
-- Expected: information_schema, ns_governed

SHOW TABLES FROM iceberg.ns_governed;
-- Expected: my_gov_table

DESCRIBE iceberg.ns_governed.my_gov_table;
-- Expected: Schema of my_gov_table

SELECT * FROM iceberg.ns_governed.my_gov_table ORDER BY id;
-- Expected: Shows records inserted by Spark.

-- Test Read-Only: Attempt to create a table (SHOULD FAIL)
-- CREATE TABLE iceberg.ns_governed.trino_test_table (id INT) WITH (location = 's3a://polaris-bucket/iceberg_warehouse/minio_catalog/ns_governed/trino_test_table/');
-- Expected: Error from Polaris indicating permission denied for CREATE_TABLE.

-- Test Read-Only: Attempt to insert data (SHOULD FAIL)
-- INSERT INTO iceberg.ns_governed.my_gov_table VALUES (3, 'TrinoRecord', 30.3);
-- Expected: Error, as Trino's Polaris role and Minio S3 user are read-only.
```

7. **Accessing Polaris API (Optional):**
Request a token for `trino_app_client`; it should carry only that client's read-only privileges:
```shell
export POLARIS_API_ENDPOINT="http://localhost:${POLARIS_MINIO_API_PORT:-8183}"
export TRINO_APP_TOKEN=$(curl -s "${POLARIS_API_ENDPOINT}/api/catalog/v1/oauth/tokens" \
--user "${TRINO_POLARIS_CLIENT_ID:-trino_app_client}:${TRINO_POLARIS_CLIENT_SECRET:-trino_client_secret_val}" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d 'grant_type=client_credentials' \
-d 'realmName=POLARIS_MINIO_REALM' | jq -r .access_token)
echo "Trino App Token: $TRINO_APP_TOKEN"

# Try to list tables using Trino's token
curl -v "${POLARIS_API_ENDPOINT}/api/catalog/v1/minio_catalog/namespaces/ns_governed/tables" -H "Authorization: Bearer $TRINO_APP_TOKEN"
# This should succeed.
```

8. **Cleanup:**
```shell
docker compose down -v
```
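
Between starting the services (step 3) and running queries (steps 5 to 7), Polaris has to report healthy. A generic retry loop is one way to wait for that from the host. This is a sketch only: `wait_until` is a hypothetical helper, and the commented example assumes the default `POLARIS_MINIO_MGMT_PORT` of 8184 is published to the Polaris management port.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (requires the running stack):
# wait_until 15 10 curl -f "http://localhost:${POLARIS_MINIO_MGMT_PORT:-8184}/q/health"
```

The attempt count and delay in the example mirror the `retries` and `interval` values of the compose healthcheck, which is a convenience rather than a requirement.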

Together, these scripts and configurations should enforce the intended access controls: Spark can read and write, while Trino is strictly read-only.
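
One way to check the read-only behaviour over the REST API is to attempt a write with Trino's token and inspect the HTTP status. The sketch below assumes that a denied write surfaces as 403 (or 401 if the token itself is rejected), as described by the Iceberg REST specification. `http_status`, `is_access_denied`, and the `CURL_BIN` override are hypothetical names, and the table name in the commented example is illustrative.

```shell
# Issue a request and print only the HTTP status code.
# CURL_BIN is a hypothetical override (defaults to "curl") for dry runs.
http_status() {
  "${CURL_BIN:-curl}" -s -o /dev/null -w '%{http_code}' "$@"
}

# Succeed when a status code means the request was rejected for
# authentication (401) or authorization (403) reasons.
is_access_denied() {
  case "$1" in
    401|403) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (requires the running stack and TRINO_APP_TOKEN from step 7):
# status=$(http_status -X POST \
#   "${POLARIS_API_ENDPOINT}/api/catalog/v1/minio_catalog/namespaces/ns_governed/tables" \
#   -H "Authorization: Bearer $TRINO_APP_TOKEN" \
#   -H "Content-Type: application/json" \
#   -d '{"name":"trino_test_table","schema":{"type":"struct","fields":[{"id":1,"name":"id","type":"int","required":false}]}}')
# is_access_denied "$status" && echo "write was denied, as expected"
```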