Skip to content

Conversation

@choplin
Copy link
Contributor

@choplin choplin commented Jul 27, 2025

Description

This PR updates the scalardb-analytics-spark-sample to support ScalarDB Analytics 3.16. The sample now includes a complete setup with ScalarDB Analytics Server and CLI, enabling users to run federated queries across ScalarDB and PostgreSQL data sources.

Related issues and/or PRs

N/A

Changes made

  • Added ScalarDB Analytics Server (3.16.2) and CLI services to docker-compose.yml
  • Created a dedicated PostgreSQL database for Analytics Server catalog metadata
  • Added configuration files for Analytics Server and CLI
  • Created data source definitions for ScalarDB and PostgreSQL
  • Reorganized all configuration files into a config directory for better organization
  • Updated Dockerfiles:
    • Spark Dockerfile: Added SHA512 verification for security, removed unnecessary Coursier
    • Sample Data Loader: Migrated from shadowJar to Gradle application plugin
  • Added platform specification for ARM64 compatibility (required for Analytics Server on M1/M2 Macs)
  • Added volume caching for Spark JAR dependencies to improve startup time
  • Created README with step-by-step setup instructions

Checklist

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes.
  • Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
  • Tests (unit, integration, etc.) have been added for the changes.
  • My changes generate no new warnings.
  • Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

None

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @choplin, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the scalardb-analytics-spark-sample to fully support ScalarDB Analytics 3.16. The core intent is to provide a comprehensive, self-contained environment for users to explore federated queries, combining data from ScalarDB-managed stores (Cassandra, MySQL) and direct PostgreSQL tables through a unified Spark SQL interface. The changes streamline the setup, improve security, and enhance the overall user experience.

Highlights

  • Full ScalarDB Analytics 3.16 Integration: The sample now includes the ScalarDB Analytics Server and CLI as Docker services, enabling a complete federated query setup with version 3.16.2.
  • Enhanced Federated Query Experience: Users can now seamlessly query data from both ScalarDB-managed databases (Cassandra, MySQL) and direct PostgreSQL tables through a unified Spark SQL interface, demonstrated with new data source configurations.
  • Improved Sample Structure and Setup: Configuration files have been reorganized into a dedicated config directory, and a comprehensive README.md has been added with step-by-step instructions for quick setup and query execution.
  • Docker Build and Dependency Management Improvements: The Spark Dockerfile now includes SHA512 verification for security, and Spark JAR dependencies are cached using Docker volumes to speed up startup times. The sample data loader's build process was also modernized using the Gradle application plugin.
  • Platform Compatibility: Explicit platform specification (linux/amd64) has been added for the Analytics Server to ensure compatibility, particularly for users on ARM64 architectures like M1/M2 Macs.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the scalardb-analytics-spark-sample to support ScalarDB Analytics 3.16, introducing a more complete setup with federated queries. The changes are well-structured, including the reorganization of configuration files and improvements to the Docker setup like SHA512 verification for Spark.

My review focuses on improving security and stability. Key suggestions include:

  • Using stable versions for database images (postgres:16, mysql:8) instead of pre-release versions.
  • Addressing hardcoded credentials in configuration files.
  • Tuning healthcheck parameters to be less aggressive.

Comment on lines 8 to 9
"username": "postgres",
"password": "postgres",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Hardcoding credentials in configuration files is a security risk. It's better to use a mechanism to inject these at runtime, for example, through environment variables. This is a good practice to follow even in sample applications to avoid accidentally leaking credentials.

@choplin choplin force-pushed the update-analytics-sample-for-3.16 branch from 48659ed to b5ffbf1 Compare July 27, 2025 14:22
…rDB Analytics 3.16

- Add ScalarDB Analytics Server and CLI services to docker-compose
- Configure Analytics Server with catalog database
- Create data source definitions for ScalarDB and PostgreSQL
- Reorganize configuration files into config directory
- Add platform specification for ARM64 compatibility
- Update Dockerfiles to use Gradle application plugin instead of shadowJar
- Add volume caching for Spark JAR dependencies
- Include setup instructions in README

This update enables the sample to work with ScalarDB Analytics 3.16,
providing a complete example of federated queries across ScalarDB and
PostgreSQL data sources.
@choplin choplin force-pushed the update-analytics-sample-for-3.16 branch from b5ffbf1 to 1c6f597 Compare July 27, 2025 14:24
Copy link
Contributor

@komamitsu komamitsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

choplin added 3 commits July 28, 2025 23:35
…oading

- Remove profile requirement for sample-data-loader service
- Add volume-based flag to track data loading completion
- Implement idempotent data loading with skip on subsequent runs
- Add healthcheck dependencies for database readiness
…file naming

- Rename analytics-server.properties to scalardb-analytics-server.properties
- Rename analytics-cli-config.properties to scalardb-analytics-cli.properties
- Replace hyphens with underscores in property names for consistency
- Update docker-compose.yml volume mappings
…ta loading

- Remove manual sample-data-loader step
- Add --wait flag to docker compose up command
- Update step numbering
- Add note about automatic data loading on first run
Copy link
Collaborator

@brfrn169 brfrn169 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you!

Copy link
Contributor

@feeblefakie feeblefakie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contents look good.
I left suggestions on the naming. PTAL!

…nalytics-spark-sample

- Rename directory from 'scalardb-analytics-spark-sample' to 'scalardb-analytics-sample'
- Update README.md title to match new directory name
- This change reflects that the sample is not limited to Spark but covers ScalarDB Analytics in general
@choplin choplin requested a review from feeblefakie July 30, 2025 15:36
Copy link
Contributor

@feeblefakie feeblefakie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you!

@feeblefakie feeblefakie merged commit 7c29f41 into main Jul 31, 2025
32 checks passed
@feeblefakie feeblefakie deleted the update-analytics-sample-for-3.16 branch July 31, 2025 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants