example: Add getting-started example for Polaris with Minio, Spark & Trino #1595


Draft · wants to merge 9 commits into base: main

Conversation

hackintoshrao

This PR introduces a new "getting started" example in getting-started/minio/. It demonstrates using Apache Polaris to manage an Iceberg data lake on Minio S3, with governed access for Spark (R/W) and Trino (R/O).

Key Features:

  • Docker Compose setup for Minio, Polaris (in-memory metastore), Spark, and Trino.
  • Scripts for Minio initialization, Polaris catalog creation, and Polaris governance setup (client principals, roles, grants).
  • README with full setup and execution instructions.
  • Tested with Polaris images built from commit f048bcd.

Help needed:

The Polaris server, when configured for in-memory persistence, currently auto-generates random root credentials for its bootstrapped realm even when POLARIS_BOOTSTRAP_CREDENTIALS="POLARIS_MINIO_REALM,root,s3cr3t" is set; the server log reports "credentials were not present...". As a result, the setup scripts cannot rely on predefined root credentials for subsequent API calls. I'd appreciate help fixing this.
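For reference, the relevant persistence settings from the example's `.env` file (values as described in this PR) are:

```
# Use the in-memory metastore instead of PostgreSQL
POLARIS_PERSISTENCE_TYPE=in-memory

# Intended to seed the realm with known root credentials
# (format: <realm>,<client-id>,<client-secret>)
POLARIS_BOOTSTRAP_CREDENTIALS="POLARIS_MINIO_REALM,root,s3cr3t"
```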

This commit introduces the initial README.md file for the 'getting-started/minio' example.

The README outlines:
- Purpose of the example: Demonstrating Apache Polaris managing an Iceberg data lake in Minio S3, with a focus on governance for Spark (R/W) and Trino (R/O).
- Prerequisites for running the example.
- An overview of the security model, including Minio S3 users and Polaris client roles.
- Detailed setup and execution steps:
    - Optional environment variable configuration.
    - Making scripts executable.
    - Starting services with docker-compose.
    - Accessing the Minio console.
    - Using Spark SQL for creating namespaces, tables, and inserting data.
    - Using Trino CLI for querying data and verifying read-only access (including expected failures for write operations).
    - Optional steps for accessing the Polaris API with a scoped token.
    - Cleanup instructions.
- A brief overview of the file structure within the `getting-started/minio` directory.
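As a sketch of the R/W vs R/O flow the README walks through: the `ns_governed` namespace and the Trino catalog name `iceberg` come from this PR, while the Spark catalog alias `polaris` and the table name `events` are hypothetical placeholders.

```sql
-- In spark-sql (read/write via polaris_spark_role):
CREATE NAMESPACE IF NOT EXISTS polaris.ns_governed;
CREATE TABLE polaris.ns_governed.events (id BIGINT, msg STRING) USING iceberg;
INSERT INTO polaris.ns_governed.events VALUES (1, 'hello');

-- In the Trino CLI (read-only via polaris_trino_role):
SELECT * FROM iceberg.ns_governed.events;                -- succeeds
INSERT INTO iceberg.ns_governed.events VALUES (2, 'x');  -- expected to fail
```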

This commit introduces the JSON policy files required for configuring Minio access control in the Polaris with Minio S3 example. These policies define permissions for different users/services interacting with the Minio bucket.

The following policies have been added to `getting-started/minio/minio-config/`:

1.  `polaris-s3-rw-policy.json`: Grants Read-Write (R/W) permissions to the Minio bucket. This policy is intended for the `polaris_s3_user`, which the Polaris service itself uses for managing the Iceberg warehouse (e.g., creating namespace directories, managing catalog-level S3 interactions).

2.  `spark-minio-rw-policy.json`: Grants Read-Write (R/W) permissions to the Minio bucket. This policy is for the `spark_minio_s3_user`, which the Spark engine uses for data plane operations like reading and writing Iceberg table data and metadata files to S3.

3.  `trino-minio-ro-policy.json`: Grants Read-Only (R/O) permissions to the Minio bucket. This policy is for the `trino_minio_s3_user`, which the Trino engine uses for data plane operations, specifically reading Iceberg table data and metadata files from S3.

These policies will be applied to their respective Minio users by the `setup-minio.sh` script, ensuring a layered security model where Polaris governs metadata access and Minio controls direct object storage access according to the principle of least privilege for each component.
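As an illustration, a read-only policy along the lines of `trino-minio-ro-policy.json` might look like the sketch below; the exact actions and resource ARNs in the PR's actual file may differ.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::polaris-bucket",
        "arn:aws:s3:::polaris-bucket/*"
      ]
    }
  ]
}
```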

This commit introduces the `setup-minio.sh` script located in `getting-started/minio/minio-config/`.

This script is responsible for bootstrapping the Minio S3 service within the Docker Compose environment for the Polaris example. Its key functions include:

- Waiting for the Minio service to become healthy and responsive.
- Configuring the Minio client (`mc`) with an alias for the local Minio instance.
- Creating the designated S3 bucket (`polaris-bucket`) if it doesn't already exist.
- Creating Minio access policies by applying the previously defined JSON policy files:
    - `polaris-s3-rw-policy.json`
    - `spark-minio-rw-policy.json`
    - `trino-minio-ro-policy.json`
- Creating three distinct Minio users with their respective credentials (passed as environment variables):
    - `polaris_s3_user` (for the Polaris service)
    - `spark_minio_s3_user` (for the Spark engine's data plane access)
    - `trino_minio_s3_user` (for the Trino engine's data plane access)
- Attaching the appropriate access policies to each of these newly created Minio users.

This script automates the necessary Minio setup steps, ensuring that the object storage is correctly configured with the required users and permissions before other services like Polaris, Spark, and Trino attempt to interact with it.
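The steps above can be sketched with the `mc` client roughly as follows (shown for the Trino user only; alias names and password variables are placeholders, and the script's exact commands may differ):

```shell
# Point mc at the local Minio instance
mc alias set local http://minio:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"

# Create the bucket if it does not already exist
mc mb --ignore-existing local/polaris-bucket

# Register the JSON policy and create the per-service user
mc admin policy create local trino-ro trino-minio-ro-policy.json
mc admin user add local trino_minio_s3_user "$TRINO_MINIO_S3_PASSWORD"

# Attach the read-only policy to the Trino user
mc admin policy attach local trino-ro --user trino_minio_s3_user
```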

This commit introduces the `create-catalog-minio.sh` script, located in `getting-started/minio/polaris-config/`.

The primary purpose of this script is to configure a new catalog within Apache Polaris that uses Minio S3 as its underlying storage for Iceberg table metadata and data.

Key actions performed by the script:
- Waits for the Polaris service to become healthy and responsive.
- Acquires an administrative access token for the Polaris API using root credentials for the configured realm.
- Defines the configuration for the new catalog, named `minio_catalog`. This configuration includes:
    - The S3 warehouse path (e.g., `s3a://polaris-bucket/iceberg_warehouse/minio_catalog`).
    - S3 connection details, such as the Minio endpoint.
    - S3 credentials (`POLARIS_S3_USER` and its password) that Polaris will use to interact with the Minio bucket for managing the warehouse structure.
- Checks if the `minio_catalog` already exists in Polaris.
- If the catalog does not exist, it sends a request to the Polaris Management API to create it with the specified configuration.
- Verifies that the catalog creation was successful or that the catalog already existed.

This script ensures that Polaris is aware of and configured to manage the Minio S3 storage location as an Iceberg catalog, which is essential before Spark or Trino can interact with tables within this catalog via Polaris.
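A minimal Python sketch of how the script's catalog-creation payload is assembled, mirroring the fields that appear in the script's own log output; note that the 400 response reported in the review comment suggests this shape may not match what the Polaris Management API actually expects, so treat the field names as coming from this PR rather than from the API spec.

```python
import json

def build_catalog_payload(name, warehouse, endpoint, access_key, secret_key):
    """Assemble the catalog-creation body used by create-catalog-minio.sh.

    Field names mirror the payload printed in the script's log; they are
    not guaranteed to match the official Polaris Management API schema.
    """
    return {
        "name": name,
        "type": "INTERNAL",
        "properties": {
            "warehouse": warehouse,
            "storage.type": "s3",
            "s3.endpoint": endpoint,
            "s3.access-key-id": access_key,
            "s3.secret-access-key": secret_key,
            "s3.path-style-access": "true",
            "client.region": "us-east-1",
        },
        "readOnly": False,
    }

payload = build_catalog_payload(
    "minio_catalog",
    "s3a://polaris-bucket/iceberg_warehouse/minio_catalog",
    "http://minio:9000",
    "polaris_s3_user",
    "change_me",  # placeholder secret
)
print(json.dumps(payload, indent=2))
```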

This commit adds the `setup-polaris-governance.sh` script to `getting-started/minio/polaris-config/`. This script is responsible for configuring the access control policies within Apache Polaris for the Minio S3 example.

Key functionalities of this script include:
- Ensuring the Polaris service is operational before proceeding.
- Obtaining an administrative token to interact with the Polaris Management API.
- Defining and creating two distinct principal roles within Polaris:
    - `polaris_spark_role`: Intended for Spark, with read/write capabilities.
    - `polaris_trino_role`: Intended for Trino, with strict read-only capabilities.
- Assigning the pre-bootstrapped client principals (`spark_app_client` and `trino_app_client`) to their respective roles (`polaris_spark_role` and `polaris_trino_role`).
- Granting fine-grained privileges to these roles on Polaris resources (the `minio_catalog` and the `ns_governed` namespace):
    - The `polaris_spark_role` receives permissions to use the catalog, create namespaces, use namespaces, and perform full CRUD (Create, Read, Update, Delete) operations on tables and their data within `ns_governed`.
    - The `polaris_trino_role` receives permissions to use the catalog, use the `ns_governed` namespace, and read table metadata and data within that namespace. It is explicitly NOT granted any write, create, alter, or delete permissions.

This script is crucial for demonstrating Polaris's governance capabilities by centrally defining and enforcing different access levels for Spark and Trino when interacting with the Iceberg tables managed by Polaris and stored in Minio.

This commit adds the `iceberg.properties` file to `getting-started/minio/trino-catalog/`. This file configures the Trino Iceberg connector to integrate with Apache Polaris for metadata management and Minio for data storage, specifically enforcing read-only access for Trino.

Key configurations in this file include:
- Setting the connector name to `iceberg`.
- Defining the Iceberg catalog type as `rest`, with the URI pointing to the internal Polaris service endpoint (`http://polaris:8181/api/catalog`).
- Mapping Trino's `iceberg` catalog to the `minio_catalog` defined within Polaris.
- Configuring OAuth2 authentication for Trino to securely connect to Polaris, using `TRINO_POLARIS_CLIENT_ID` and `TRINO_POLARIS_CLIENT_SECRET`. These credentials correspond to a Polaris principal associated with a read-only role.
- Specifying S3 connection details for Trino's data plane operations, including the Minio endpoint and credentials (`TRINO_MINIO_S3_USER` and `TRINO_MINIO_S3_PASSWORD`) that have read-only permissions at the Minio S3 level.
- Enabling Hadoop filesystem support (`fs.hadoop.enabled=true`) as required for S3 interaction.

This configuration ensures that Trino can discover and query Iceberg tables managed by Polaris, with data residing in Minio, while adhering to the strict read-only access policies defined both in Polaris and Minio.
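A sketch of such an `iceberg.properties` file; the property names follow the Trino Iceberg connector's REST catalog and legacy S3 options, but the PR's actual file may differ in details, and the `${ENV:...}` substitutions are assumptions.

```properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://polaris:8181/api/catalog
iceberg.rest-catalog.warehouse=minio_catalog
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.credential=${ENV:TRINO_POLARIS_CLIENT_ID}:${ENV:TRINO_POLARIS_CLIENT_SECRET}
fs.hadoop.enabled=true
hive.s3.endpoint=http://minio:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=${ENV:TRINO_MINIO_S3_USER}
hive.s3.aws-secret-key=${ENV:TRINO_MINIO_S3_PASSWORD}
```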

This commit introduces the `docker-compose.yml` file for the `getting-started/minio` example. This file orchestrates the deployment of a multi-container environment to demonstrate Apache Polaris managing an Iceberg data lake on Minio S3, with governed access for Spark (read/write) and Trino (read-only).

The Docker Compose setup includes the following services:
- `minio`: Provides S3-compatible object storage.
- `mc`: Minio client used to initialize buckets, users, and policies in Minio.
- `postgres-minio`: PostgreSQL database instance for Polaris metadata.
- `polaris-bootstrap-minio`: Bootstraps the Polaris database and creates initial principals for root admin, Spark client, and Trino client.
- `polaris`: The Apache Polaris catalog server.
- `polaris-setup-catalog-minio`: A utility service to create the `minio_catalog` within Polaris, configured to use the Minio backend.
- `polaris-setup-governance`: A utility service to apply fine-grained access control policies (roles and grants) within Polaris for Spark and Trino.
- `spark-sql-minio`: Apache Spark SQL shell, configured to interact with Polaris for R/W operations on Iceberg tables.
- `trino-minio`: Trino server, configured to interact with Polaris for R/O query operations on Iceberg tables.

Key aspects of this Docker Compose configuration:
- Defines service dependencies (`depends_on`) to ensure a correct startup order.
- Manages network configuration for inter-service communication.
- Utilizes volume mounts for persistent data (Postgres, Minio) and for injecting configuration files.
- Employs environment variables for passing credentials, S3 user details, Polaris client IDs/secrets, and other settings, with sensible defaults provided.
- Includes health checks for critical services like Minio and Polaris.
- Uses specific, recent image versions for Minio, mc, Postgres, Spark, and Trino to ensure stability and reproducibility.

This setup provides a complete, self-contained environment to test and demonstrate the end-to-end functionality of Polaris, including its governance features, with Minio as the storage backend and Spark/Trino as data processing engines.

This commit refactors the `getting-started/minio` example to configure the main Apache Polaris server to use an in-memory metastore. This simplifies the setup by removing the dependency on PostgreSQL for Polaris's own metadata, making it lighter for a getting-started experience and isolating the earlier database connection issues.

Key changes include:

1.  **Docker Compose (`docker-compose.yml`):**
    * Removed the `postgres-minio` and `polaris-bootstrap-minio` services.
    * Updated the `polaris` service:
        * Removed `depends_on: postgres-minio`.
        * Environment variables are now configured to set `POLARIS_PERSISTENCE_TYPE` to `in-memory`.
        * Added `POLARIS_BOOTSTRAP_CREDENTIALS` to allow the in-memory Polaris instance to initialize with known `root` credentials.
        * Removed PostgreSQL-specific `QUARKUS_DATASOURCE_*` variables from its environment block, relying on values from the `.env` file for other settings.
        * Updated health check timings and port references.
    * Adjusted `depends_on` for `polaris-setup-catalog-minio` and `polaris-setup-governance` to depend directly on the `polaris` service.
    * Updated image tags for `minio/mc` and `minio/minio` to `latest`.
    * Removed `version: '3.8'` as it's obsolete.

2.  **Environment File (`.env`):**
    * Set `POLARIS_PERSISTENCE_TYPE=in-memory`.
    * Added `POLARIS_BOOTSTRAP_CREDENTIALS="POLARIS_MINIO_REALM,root,s3cr3t"`.
    * Commented out/removed PostgreSQL specific `QUARKUS_DATASOURCE_*` variables (as they are not needed for the in-memory `polaris` service).
    * Ensured other necessary variables (ports, client IDs/secrets for setup scripts) are present.

3.  **Minio Setup Script (`minio-config/setup-minio.sh`):**
    * Removed the `curl`-based health check loop, relying on Docker Compose's `depends_on: minio: condition: service_healthy`.

4.  **Polaris Governance Script (`polaris-config/setup-polaris-governance.sh`):**
    * Added conceptual API calls to create `spark_app_client` and `trino_app_client` principals and their credentials using the `root` token, as these are no longer created by a dedicated bootstrap service. (Note: These API calls are illustrative and depend on actual Polaris API structure).

These changes aim to provide a working "getting started" example using an in-memory Polaris server, which simplifies deployment and focuses on Polaris's interaction with Minio and its governance features for Spark and Trino. The removal of the PostgreSQL dependency for the Polaris server itself should resolve previous H2 fallback issues.
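Putting the docker-compose changes together, the refactored `polaris` service might look roughly like this; the image tag, port mappings, and health-check endpoint are assumptions for illustration, not taken verbatim from the PR.

```yaml
services:
  polaris:
    image: apache/polaris:latest   # placeholder tag
    environment:
      POLARIS_PERSISTENCE_TYPE: in-memory
      POLARIS_BOOTSTRAP_CREDENTIALS: "POLARIS_MINIO_REALM,root,s3cr3t"
    ports:
      - "8181:8181"   # catalog / management API
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8182/q/health"]
      interval: 5s
      retries: 20
```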


richban commented May 19, 2025

I can't run your scripts when creating the catalog.

```
polaris-setup-catalog-1  | Polaris service is live.
polaris-setup-catalog-1  | Attempting to get Polaris admin token...
polaris-setup-catalog-1  | Polaris admin token obtained.
polaris-setup-catalog-1  | Attempting to create/verify catalog 'pyiceberg_catalog'...
polaris-setup-catalog-1  | Payload being sent: {
polaris-setup-catalog-1  |   "name": "pyiceberg_catalog",
polaris-setup-catalog-1  |   "type": "INTERNAL",
polaris-setup-catalog-1  |   "properties": {
polaris-setup-catalog-1  |     "warehouse": "s3a://datalake/pyiceberg_catalog",
polaris-setup-catalog-1  |     "storage.type": "s3",
polaris-setup-catalog-1  |     "s3.endpoint": "http://minio:9000",
polaris-setup-catalog-1  |     "s3.access-key-id": "polaris_s3_user",
polaris-setup-catalog-1  |     "s3.secret-access-key": "polaris_s3_password_val",
polaris-setup-catalog-1  |     "s3.path-style-access": "true",
polaris-setup-catalog-1  |     "client.region": "us-east-1"
polaris-setup-catalog-1  |   },
polaris-setup-catalog-1  |   "readOnly": false
polaris-setup-catalog-1  | }
polaris-setup-catalog-1  | Failed to create catalog 'pyiceberg_catalog'. HTTP Status: 400. Response:
polaris-setup-catalog-1 exited with code 1
```

Is minio actually supported? Ref: #389
