Skip to content

Conversation

@martyngigg
Copy link
Contributor

@martyngigg martyngigg commented Feb 9, 2026

Summary

Introduce the OpenFGA as an authorization backend for Lakekeeper. This enables fine-grained permissions with the Lakekeeper warehouses. On bootstrapping Lakekeeper the following happens:

  • a Lakekeeper admin user, supplied as an argument to the ansible playbook, is defined
  • the trino service account is given read access to the warehouses.

The Admin user must use the Lakekeeper UI or REST API to define other permissions. Further work will define more granular permissions in #135 and related issues.

Refs #211

Summary by CodeRabbit

Release Notes

  • New Features

    • Added authorization service integration with warehouse-level permission management for service accounts.
    • Enhanced warehouse bootstrap workflow with improved credential handling.
  • Documentation

    • Updated deployment guide to include initial administrator configuration step.
  • Chores

    • Updated service versions (Lakekeeper to v0.11.1, OpenFGA to v1.11.2).
    • Restructured deployment configuration for improved infrastructure management.

@martyngigg martyngigg changed the title feat: Lakekeeper openfga feat: Enable fine-grained permissions in Lakekeeper catalog Feb 9, 2026
@martyngigg
Copy link
Contributor Author

@coderabbitai full review

@coderabbitai
Copy link

coderabbitai bot commented Feb 10, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Walkthrough

The pull request integrates OpenFGA as an authorization service, refactors Lakekeeper's bootstrap process to use OAuth2 sessions, updates Keycloak configuration paths, reorganises Ansible playbook tasks into modular files, and adds comprehensive database configurations across development and QA environments. The Lakekeeper bootstrap script is significantly refactored to replace token-based authentication with session-based OAuth2 workflows.

Changes

Cohort / File(s) Summary
Documentation & Deployment Instructions
docs-devel/deployment/index.md
Added CLI variable for admin email in Ansible deployment; clarified first-time deployment requirement for the new variable.
Keycloak Configuration
infra/ansible/group_vars/keycloak.yml, infra/ansible/roles/keycloak/tasks/setup-realm.yml
Swapped public Keycloak client from lakekeeper-ui to lakekeeper-api; added task to grant "view-users" role to machine service account.
OpenFGA Integration
infra/ansible/roles/openfga/defaults/main.yml, infra/ansible/roles/openfga/tasks/main.yml, infra/ansible/inventories/dev/group_vars/datastore.yml
Introduced new OpenFGA role with Docker container management, migrations, and caching configuration; added openfga database to PostgreSQL provisioning list.
Keycloak Base Path & Catalog URI Updates
infra/ansible/group_vars/all/all.yml, infra/ansible/roles/elt/templates/secrets/envvars.j2, infra/ansible/roles/trino/tasks/main.yml
Changed Keycloak base path from /authn to /auth; renamed lakekeeper_catalog.uri to lakekeeper_catalog.catalog_uri with new management_uri field.
Lakekeeper Configuration Variables
infra/ansible/inventories/dev/group_vars/lakekeeper.yml, infra/ansible/inventories/qa/group_vars/lakekeeper.yml
Added OpenFGA database connection variables and Lakekeeper metadatabase encryption/bootstrap credentials sourced from Vault; expanded configuration for both environments.
Lakekeeper Bootstrap Script Refactoring
infra/ansible/roles/lakekeeper/files/bootstrap-warehouse.py
Replaced direct token-based Keycloak integration with OAuth2Session for authentication; refactored CLI arguments and added helper functions for logging, OAuth2 configuration, and OIDC user ID derivation.
Lakekeeper Playbook Refactoring
infra/ansible/roles/lakekeeper/tasks/main.yml, infra/ansible/roles/lakekeeper/tasks/migrate.yml, infra/ansible/roles/lakekeeper/tasks/start.yml, infra/ansible/roles/lakekeeper/tasks/bootstrap.yml
Reorganised Lakekeeper deployment into modular task files; introduced shared environment variable structure; automated warehouse bootstrap via Docker container with template-driven JSON configuration.
Lakekeeper Configuration & Defaults
infra/ansible/roles/lakekeeper/defaults/main.yml, infra/ansible/roles/lakekeeper/templates/bootstrap-warehouse.json.j2
Updated Lakekeeper image to v0.11.1; added log level configuration; added permissions field to bootstrap warehouse template.
Database & Infrastructure
infra/ansible/roles/postgres/tasks/main.yml
Tightened PostgreSQL data directory permissions; added ownership specifications for secrets directory.
Ansible Playbook Structure
infra/ansible/site.yml
Converted role declarations from nested mappings to scalar entries; added explicit tags to multiple roles for selective execution.
Local Development
infra/local/docker-compose.yml, infra/local/keycloak/bootstrap.sh, infra/local/warehouses/lakekeeper/bootstrap-dev_isis_*.json
Updated OpenFGA image to v1.11.2; refactored bootstrap script CLI arguments; simplified Keycloak user creation; added permissions for service-account-trino in warehouse bootstrap configurations.

Poem

🐰 Hopping through configs with OAuth's grace,
OpenFGA now guards the Lakekeeper space,
Modular tasks make deployments neat,
From tokens to sessions—the refactor's complete!
Permissions and realms in harmony dance,
Rabbit bids new architecture a chance.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 35.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarises the main objective of the pull request: enabling fine-grained permissions in the Lakekeeper catalog through OpenFGA integration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
infra/ansible/roles/lakekeeper/files/bootstrap-warehouse.py (1)

71-89: ⚠️ Potential issue | 🟠 Major

Missing error handling on bootstrap POST.

The bootstrap method (line 80) calls self._auth_session.request("POST", ...) but never checks the response status. If bootstrapping fails server-side, the script will log "Server bootstrapped successfully." regardless.

Proposed fix
         LOGGER.info("Bootstrapping server.")
-        self._auth_session.request(
+        response = self._auth_session.request(
             "POST",
             self.management_url + "/bootstrap",
             json={
                 "accept-terms-of-use": True,
                 "is-operator": True,
             },
         )
+        response.raise_for_status()
 
         LOGGER.info("Server bootstrapped successfully.")
🤖 Fix all issues with AI agents
In `@infra/ansible/group_vars/all/all.yml`:
- Around line 36-37: The new management_uri variable is defined but unused;
either remove it or wire it into the bootstrap/playbook and template that expect
the management endpoint. Fix option 1: remove management_uri from
group_vars/all/all.yml if not needed. Fix option 2: replace hardcoded local
management address in the bootstrap playbook and any tasks/templates that
reference the management endpoint with the management_uri variable (match how
catalog_uri is used), e.g. update infra/ansible/roles/trino/tasks/main.yml,
infra/ansible/roles/elt/templates/secrets/envvars.j2 and the bootstrap playbook
to consume management_uri instead of a hardcoded URL so the variable is actually
referenced. Ensure references use the exact variable name management_uri and
follow the same templating pattern as catalog_uri.
- Line 23: Documentation references to legacy Keycloak paths are stale: update
any mentions of "/authn" and "/iceberg" in the deployment docs to match the
deployed configuration using keycloak_base_path (/auth). Search the docs for the
strings "/authn" and "/iceberg" and replace them with "/auth" (or otherwise
align the text to reference keycloak_base_path) and ensure any examples or
generated keycloak_url values reflect the current keycloak_base_path setting.

In `@infra/ansible/roles/lakekeeper/files/bootstrap-warehouse.py`:
- Around line 91-93: The is_bootstrapped method currently assumes a successful
/info response; update it to handle non-2xx responses and missing keys by first
validating the HTTP response (e.g., check response.ok or call
response.raise_for_status()) and then safely reading the JSON with
response.json().get("bootstrapped", False); if parsing fails or the key is
absent, return False or raise a clear exception. Use the existing _auth_session,
management_url and the "/info" endpoint in this logic so callers of
is_bootstrapped receive a deterministic boolean or a clear error.
- Around line 57-69: In assign_grants, the HTTP GET and POST responses are not
checked for errors—call raise_for_status() on the GET response returned by
self._auth_session.get(...) before accessing response.json()["assignments"], and
call raise_for_status() on the POST response returned by
self._auth_session.post(...) (or otherwise check its status) so failures surface
consistently with other methods like rename_default_project/create_warehouse;
update the code around the response variable and the
self._auth_session.post(...) call to perform these checks and log or re-raise
errors as appropriate.
- Around line 123-131: The get_warehouse_id method should validate the GET
response before using response.json()["warehouses"]; update get_warehouse_id to
check response.ok (or status_code) and handle non-2xx by logging or raising,
wrap response.json() parsing in a try/except to catch JSONDecodeError, and
verify the parsed payload contains a "warehouses" list before iterating; use
self._auth_session, self.management_url and the "/warehouse" path to locate the
call, and return None (or propagate a clear error) when the response is invalid
or missing the expected structure.
- Around line 95-109: In rename_default_project, avoid calling response.json()
before verifying the response status: first call response.raise_for_status() on
the Response returned by self._auth_session.get(self.management_url +
"/project"), then (and only then) call response.json() for logging/inspection;
update uses of response.json() in the LOGGER.debug and the subsequent
project-name check to occur after the raise_for_status() call, and ensure the
POST to self.management_url + "/project/rename" still calls
response.raise_for_status() afterwards as currently implemented.
- Around line 310-322: Guard against a missing filename and missing
"permissions" key: first validate warehouse_json_file (ensure it's not
None/empty and exists) before calling open(warehouse_json_file) and fail with a
clear error via click or LOGGER if it's not provided; then replace
warehouse_json.pop("permissions") with warehouse_json.pop("permissions", {}) so
absence yields an empty dict, and keep the subsequent
server.create_warehouse(warehouse_json) and the grant loop (using oidc_user_id
and server.assign_grants) unchanged so no KeyError occurs when permissions are
absent.
- Around line 110-121: The provision_user method is defined but never invoked;
either remove it or call it from the bootstrap flow—locate the provision_user
function and the main() entrypoint and, if provisioning is intended, invoke
self.provision_user() (or provision_user() as appropriate) after the
authentication/session is established and before exiting, ensuring you handle
exceptions similarly to the existing try/except around requests; if provisioning
is not needed, delete the provision_user method to remove dead code.

In `@infra/ansible/roles/lakekeeper/tasks/bootstrap.yml`:
- Line 51: The restart_policy value is written as unquoted no which YAML parses
as boolean false; update the restart_policy assignment (restart_policy:) to pass
a string by quoting the value (e.g., restart_policy: "no" or 'no') so the
Ansible module receives the literal string instead of a boolean.
- Around line 53-56: The env block under the community.docker.docker_container
task is using shell-style "KEY=val" strings (and a bare "#" line) which is
invalid for the module; change the env parameter to a proper YAML mapping of
key: value pairs (e.g., UV_LINK_MODE: "copy", UV_PROJECT_ENVIRONMENT:
"/opt/uv-venv"), remove the stray "#" line, and ensure the mapping is indented
under the env key in the same task so the module receives a dict not a list of
strings.

In `@infra/ansible/roles/openfga/defaults/main.yml`:
- Line 10: The default variable openfga_database_options currently contains the
insecure value "?sslmode=disable"; remove that insecure default from the role
defaults (set it to empty/null) and instead add explicit overrides in the QA/dev
inventory group_vars to set a secure option (e.g., "?sslmode=require"), and
ensure any production inventory also provides an explicit secure override;
update usages of openfga_database_options (where it's referenced) to tolerate an
empty/default value if not set.

In `@infra/ansible/roles/openfga/tasks/main.yml`:
- Around line 39-40: The OPENFGA_CHECK_QUERY_CACHE_ENABLED env var is
incorrectly wired to openfga_check_iterator_cache_enabled; update the assignment
so OPENFGA_CHECK_QUERY_CACHE_ENABLED uses the openfga_check_query_cache_enabled
variable instead (keeping the existing templating/string conversion style),
leaving OPENFGA_CHECK_ITERATOR_CACHE_ENABLED unchanged; verify you reference
openfga_check_query_cache_enabled exactly and preserve surrounding formatting.
🧹 Nitpick comments (11)
infra/ansible/roles/openfga/tasks/main.yml (2)

45-46: Healthcheck port is hardcoded but a variable exists.

The healthcheck uses -addr=localhost:8081 while openfga_grpc_port is defined and used for OPENFGA_GRPC_ADDR on line 35. If the port default changes, the healthcheck will break silently.

Proposed fix
-      test: ["CMD", "/usr/local/bin/grpc_health_probe", "-addr=localhost:8081"]
+      test: ["CMD", "/usr/local/bin/grpc_health_probe", "-addr=localhost:{{ openfga_grpc_port }}"]

15-15: Unquoted no is parsed as boolean false in YAML.

restart_policy: no should be restart_policy: "no" to ensure it's treated as the string "no" rather than boolean false. Ansible may tolerate this, but it's a known YAML gotcha.

Proposed fix
-    restart_policy: no
+    restart_policy: "no"
infra/ansible/inventories/dev/group_vars/lakekeeper.yml (1)

2-6: Minor inconsistency: openfga_database_port is hardcoded in dev but vault-sourced in QA.

In the dev inventory this is 5432 (line 3), while the QA inventory uses {{ vault_db_port }}. This is likely intentional for local dev simplicity but worth noting — if the dev Postgres port ever changes, this will need a manual update.

infra/ansible/roles/lakekeeper/defaults/main.yml (1)

3-5: Default log level of ERROR may be too restrictive.

Setting the default to ERROR suppresses all WARN-level messages (e.g. deprecation notices, slow operations, configuration drift). Consider WARN as a safer default that still keeps logs quiet but surfaces actionable information. The bootstrap log level at INFO is fine.

infra/ansible/roles/lakekeeper/files/bootstrap-warehouse.py (4)

133-157: id shadows the Python built-in.

Line 139 uses id as a variable name, which shadows the built-in id() function. While unlikely to cause a bug here, it's a code-smell that linters will flag.

Proposed fix — rename to `warehouse_id`
     def create_warehouse(self, warehouse_config: Dict[str, Any]) -> str:
         """Create a warehouse in the server with the config.
 
         If the bucket does not exist it is created.
         """
         name = warehouse_config["warehouse-name"]
-        id = self.get_warehouse_id(name)
-        if id is not None:
+        warehouse_id = self.get_warehouse_id(name)
+        if warehouse_id is not None:
             LOGGER.info(
                 f"Warehouse '{name}' already exists. Skipping warehouse creation."
             )
-            return id
+            return warehouse_id

1-9: Dependency version range for authlib is very broad.

authlib>=1.6,<=2 includes a major version boundary (1.x → 2.x). If authlib 2.0 introduces breaking changes, this script could fail. Consider tightening the upper bound to <2.

Proposed fix
-#     "authlib>=1.6,<=2",
+#     "authlib>=1.6,<2",

220-235: Add explicit grant_type="client_credentials" for clarity.

Whilst Authlib's OAuth2Session.fetch_access_token() automatically infers client_credentials as the grant type when neither an authorization code nor user credentials are provided, the Authlib documentation recommends explicitly setting this parameter to avoid ambiguity.

Recommended explicit call
     # Cache an access token
-    client.fetch_access_token()
+    client.fetch_access_token(grant_type="client_credentials")
     return client

293-298: Consider using auto_refresh_token parameter instead of manually calling refresh_token().

Whilst connection is a documented public property of KeycloakAdmin, manually calling kcadm.connection.refresh_token() is not the recommended pattern. The python-keycloak library provides auto_refresh_token parameter for this purpose—pass it when creating the KeycloakAdmin instance:

Suggested approach
kcadm = KeycloakAdmin(
    server_url=keycloak_url.rstrip("/") + "/",
    username=keycloak_admin_credentials.left,
    password=keycloak_admin_credentials.right,
    auto_refresh_token=["get", "post", "put", "delete"],
)
kcadm.change_current_realm(keycloak_user_realm)

This enables automatic token refresh on HTTP calls, eliminating the need for manual refresh before switching realms.

infra/ansible/roles/lakekeeper/tasks/bootstrap.yml (2)

39-42: Credentials are exposed on the container command line.

--keycloak-admin-credentials and --bootstrap-credentials are passed as CLI arguments, making them visible in docker inspect and process listings. Since the container is ephemeral (cleanup: true, detach: false), the window is small, but consider passing these via environment variables instead for defence in depth.


16-29: Hardcoded permissions for Trino service account.

warehouse_permissions is defined inline with only "service-account-trino": ["select"]. This is consistent with the PR objectives, but consider extracting this to a default variable (e.g. in defaults/main.yml) so it can be overridden per-environment without modifying the task file.

infra/ansible/site.yml (1)

5-8: base and traefik roles in the traefik play are untagged, unlike other plays.

Other plays tag their primary roles (e.g. tags: [postgres], tags: [lakekeeper]). The traefik play omits tags for both base and traefik. This is a minor inconsistency — if the intent is to allow --tags traefik for selective execution, you'd need a tag here too.

@martyngigg martyngigg merged commit 396b2ea into main Feb 10, 2026
2 checks passed
@martyngigg martyngigg deleted the lakekeeper-openfga branch February 10, 2026 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant