Skip to content

Fix quickstart doc with docker compose #1610

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

MonkeyCanCode
Copy link
Contributor

@MonkeyCanCode MonkeyCanCode commented May 19, 2025

This PR fixes couple issues found while trying to go through quickstart guide:

  1. Doc site/content/in-dev/unreleased/getting-started/quickstart.md has an invalid ref to getting-started/eclipselink/docker-compose-postgres.yml. This is updated to getting-started/assets/postgres/docker-compose-postgres.yml (this is correct in getting-started/eclipselink/README.md but not in the public page)
  2. Due to now we have 3 docker compose and 1 in a diff path, running this command can cause failure as they 2 of them have binds from local to container but from diff paths (it will create empty dirs on the wrong path). To fix this, I introduced ASSETS_PATH to point to the absolute path of assets directory which is being used for binding.
  3. Doc getting-started/eclipselink/docker-compose-bootstrap-db.yml has a hard-coded value of default username/password which is no longer valid after recent change in Use env var in spark container #1522
  4. quickstart page (https://polaris.apache.org/in-dev/unreleased/getting-started/quickstart/#docker-image) ask user to build the image then run docker compose and asks user to do expose CLIENT_ID and CLIENT_SECRET. Due to change in item 3, that is no longer valid as default value are not longer available. This will cause failure as well. To fix this, I add .env file to contain the original default value. This will continue to allow user to set them to diff by remove env loading for docker compose.

pingtimeout
pingtimeout previously approved these changes May 19, 2025
Copy link
Contributor

@pingtimeout pingtimeout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

This PR raises a question about how we structure the docker-compose commands. I understand the desire to reduce the amount of duplicated yaml across docker-compose files. However, for the sake of being beginner-friendly, I would rather have everything contained in a single docker-compose file, even if it slightly increases the maintenance effort (think upgrading the Postgres version). Wdyt?

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 19, 2025
@dimas-b
Copy link
Contributor

dimas-b commented May 19, 2025

I would rather have everything contained in a single docker-compose file

I support that. I even suggested that in #1470 (comment) :)

@adnanhemani
Copy link
Collaborator

While there are some good call-outs in this PR of things we should change, there are also some regressions. Please do not merge this PR - give me until EOD to get comments in.

Copy link
Collaborator

@adnanhemani adnanhemani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MonkeyCanCode - lots of great catches here. #1522 introduced a major new change in Quickstart where we are no longer relying on default credential values and must define them up front - and we did not change the website to fully reflect that. For allowing this change and not recognizing the regression to the documentation, I apologize.

We had discussed an ".env" file in this thread but ultimately decided against it: #1522 (comment). I still agree with the comments posted there. I think that if we move this section into the top of the Quickstart page and then remove the later instructions to export the same environment variables, it would be best. We should do the same for the Cloud Provider Deployment instructions and the READMEs in the Getting Started folder.

100% agreed on the ASSETS_PATH change and the correcting of the YAML file locations.

@@ -82,7 +82,7 @@ services:
--conf, "spark.sql.catalog.quickstart_catalog.type=rest",
--conf, "spark.sql.catalog.quickstart_catalog.warehouse=quickstart_catalog",
--conf, "spark.sql.catalog.quickstart_catalog.uri=http://polaris:8181/api/catalog",
--conf, "spark.sql.catalog.quickstart_catalog.credential=${USER_CLIENT_ID}:${USER_CLIENT_SECRET}",
--conf, "spark.sql.catalog.quickstart_catalog.credential=${CLIENT_ID}:${CLIENT_SECRET}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a regression, it should remain USER_CLIENT_ID/SECRET - these are user credentials rather than the admin credentials in CLIENT_ID/SECRET

Copy link
Contributor Author

@MonkeyCanCode MonkeyCanCode May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did noticed this one last night and assuming there is intensional. However, going through the quickstart guide, this is supposed to be the credential created at the end of the polaris CLI. But USER_CLIENT_ID and USER_CLIENT_SECRET is never defined in this context I believed. As an end-user tries to bring up setup via docker compose, there is no relation to the polaris CLI.

Now assuming we want to keep it this way, an end-user can't really define this one until polaris CLI returns the value back (which needed to bring up docker compose first, run polaris CLI for entities creation, then restart spark-sql to have this be effective). This restart workflow is really odd imo and we should replace it with other more automated workflow.

If this is causing regression, I can change it back. But I don't think we ask user to follow the restart workflow as it is really tedious imo.

Let me know what you think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at this: https://polaris.apache.org/in-dev/unreleased/getting-started/using-polaris/#connecting-with-spark

We do ask users to refresh their Docker containers once they export the required variables.

You've correctly pointed out the chicken-and-the-egg problem we have here, but unfortunately this is how we have solved it so far - asking the user for one line ran is not completely unreasonable and we have no better known solution without breaking the Docker Compose file down even further :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there are couple problem with both existed route as well as the above, let me share my observations:

On terminal 1, we will run the following:

export ASSETS_PATH=$(pwd)/getting-started/assets/
docker compose --env-file getting-started/assets/getting-started.env \
  -f getting-started/assets/postgres/docker-compose-postgres.yml \
  -f getting-started/eclipselink/docker-compose-bootstrap-db.yml \
  -f getting-started/eclipselink/docker-compose.yml up

Due to USER_CLIENT_ID and USER_CLIENT_SECRET are not yet defined at this stage, the spark-sql will fail due to fail auth (as both will resolve to empty string). So the container is in stop state.

Now on terminal 2, we do the needed work to bootstrap polaris entities and have a different principal created and run the following to set values for environment variables:

export USER_CLIENT_ID=xxxxx
export USER_CLIENT_SECRET=xxxx

and try to start the failed containers here with current command in the doc:

docker compose -f getting-started/eclipselink/docker-compose.yml up -d

This will actually crashed the first setup as two docker compose command on diff terminal/session will create two diff projects and they are trying to bind to the same ports.

Now to workaround the problem, I added -p xxx to specify project so two commands from diff sessions will be dealing with same resources. However, this doesn't solve the problem of created container doesn't reload (which back to my original point where this is not really a regression as it never works before). To have containers reload the env, we will need to stop/remove the container then recreate it. By doing so, it will then try to load envs from the client's terminal.

@@ -36,7 +36,8 @@ cd ~/polaris
:polaris-quarkus-admin:assemble --rerun \
-Dquarkus.container-image.tag=postgres-latest \
-Dquarkus.container-image.build=true
docker compose -f getting-started/eclipselink/docker-compose-postgres.yml -f getting-started/eclipselink/docker-compose-bootstrap-db.yml -f getting-started/eclipselink/docker-compose.yml up
export ASSETS_PATH=$(pwd)/getting-started/assets/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this line to the top of the file with the exporting of the CLIENT_ID/SECRET, so that it only needs to be run once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I followed, this is the beginning of the quickstart page (https://polaris.apache.org/in-dev/unreleased/getting-started/quickstart/#docker-image). This is only needed for the context of docker compose.

The export of CLIENT_ID/CLIENT_SECRET is invalid I think as the current state (without the env) file, this won't even be able to start.

If I understand correctly, we should consider move export of CLIENT_ID/CLIENT_SECRET to this section (as the current docker compose file has no credential, so it will try to set empty string for root credential (as well as username, which is in-valid).

The export is only needed if user doesn't want to use env file (as env file will load the credential in the updated command). Let me know what you think.

Copy link
Collaborator

@adnanhemani adnanhemani May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this comment is in conjunction with the suggestion from the overall review's comment. We should do the following:

  1. Move the export of CLIENT_ID/CLIENT_SECRET to the top of the file - the Docker Compose files will be able to intake the environment variables set in bash. (i.e. keep one reference at the top of the Quickstart page and one at the top of each of the cloud deployment pages)
  2. Remove all references to setting the CLIENT_ID/CLIENT_SECRET elsewhere.
  3. Add the export ASSETS_PATH to these references to setting CLIENT_ID/CLIENT_SECRET

Does this make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got what you mean. I had made some changes to this doc for refactor. Please review.

@github-project-automation github-project-automation bot moved this from Ready to merge to PRs In Progress in Basic Kanban Board May 20, 2025
@adnanhemani
Copy link
Collaborator

This PR raises a question about how we structure the docker-compose commands. I understand the desire to reduce the amount of duplicated yaml across docker-compose files. However, for the sake of being beginner-friendly, I would rather have everything contained in a single docker-compose file, even if it slightly increases the maintenance effort (think upgrading the Postgres version). Wdyt?

I disagree on this - if you read my reply to the comment that @dimas-b linked (#1470 (comment)), there's the reason why the Dockerfiles need to stay piecewise (tldr - it doesn't have to do with duplicated YAML but with the ability to reload services when required). If you have a better way to solve that issue, I'm open to it!

@MonkeyCanCode
Copy link
Contributor Author

I would rather have everything contained in a single docker-compose file

I support that. I even suggested that in #1470 (comment) :)

If we all want to do it in a single docker compose file, I can take care this in the next PR. Both route works and there are ways around to make them work. More toward to a preference things (e.g. a single giant compose file may be hard to read and maybe we will have other service such as minio which we don't want to start out of box...but an end-user can decide if minio should be spin up by adding additional compose file to the command).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants