[KYUUBI #6832] Initial impl Spark DSv2 YARN Connector that supports reading YARN aggregated logs by pan3793 · Pull Request #7455 · apache/kyuubi

pan3793 · 2026-05-17T15:01:05Z

Why are the changes needed?

Closes #6832. This connector lets Hadoop administrators analyze YARN aggregated logs at the cluster level, for example by aggregating container logs across applications by host to diagnose potential host hardware issues.

The current initial implementation has several limitations:

feat: only TFile is supported, not IFile.
perf: only the app_id, user, and host filters are pushed down; container_id and log_type are not, although they are expected to be selective.
perf: listing aggregated log files runs in a single thread during the planning phase; for a large cluster, it should run in parallel on the driver side, or a job should be launched to do the listing on the executor side.
etc.
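
The driver-side parallel listing mentioned in the last perf item could look roughly like the sketch below. It is illustrative only and is not the connector's code: the class `ParallelListing`, the method `listAll`, and the `lister` function (a stand-in for whatever per-directory listing call is used against the log directory) are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch: fan out one listing task per directory on a fixed-size pool.
final class ParallelListing {

    // `lister` is an assumption: it stands in for a per-directory listing call.
    static <D, F> List<F> listAll(List<D> dirs, Function<D, List<F>> lister, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit all listing tasks first so they can run concurrently.
            List<CompletableFuture<List<F>>> futures = new ArrayList<>();
            for (D dir : dirs) {
                futures.add(CompletableFuture.supplyAsync(() -> lister.apply(dir), pool));
            }
            // Collect results in submission order, so the output order is deterministic.
            List<F> result = new ArrayList<>();
            for (CompletableFuture<List<F>> future : futures) {
                result.addAll(future.join());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would pass the bucket directories as `dirs` and a function that lists one bucket as `lister`. Whether a thread pool on the driver or a distributed job on the executors is the better fit depends on cluster size, which the description leaves open.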

How was this patch tested?

Unit tests are not added yet, as they require a real YARN cluster with historical aggregated logs.

$ kyuubi-beeline -u 'jdbc:kyuubi://spark-dev1.foo.bar:10009/default' \
  --conf spark.jars=/tmp/kyuubi-spark-connector-yarn_2.12-1.12.0-SNAPSHOT.jar \
  --conf spark.sql.catalog.yarn=org.apache.kyuubi.spark.connector.yarn.YarnCatalog
0: > select
. .>   mtime, app_id, container_id, host, log_type, message
. .> from yarn.app_logs
. .> where
. .>   user = 'hadoop'
. .>   and host = 'spark-dev2.foo.bar'
. .>   and message like '%ERROR%'
. .>   and message not like '%RECEIVED SIGNAL TERM%'
. .>   and message not like '%Aborting task%'
. .> limit 2;
...
+--------------------------+---------------------------------+-----------------------------------------+---------------------+-----------+----------------------------------------------------+
|          mtime           |             app_id              |              container_id               |        host         | log_type  |                      message                       |
+--------------------------+---------------------------------+-----------------------------------------+---------------------+-----------+----------------------------------------------------+
| 2025-04-03 18:07:18.893  | application_1743671377509_0001  | container_1743671377509_0001_01_000001  | spark-dev2.foo.bar  | stdout    | 25/04/03 18:07:15 ERROR ApplicationMaster$AMEndpoint: Driver terminated with exit code 1! Shutting down. spark-dev1.foo.bar:16601 |
| 2025-04-03 18:07:18.893  | application_1743671377509_0001  | container_1743671377509_0001_01_000001  | spark-dev2.foo.bar  | stdout    | 25/04/03 18:07:15 ERROR ApplicationMaster$AMEndpoint: Driver terminated with exit code 1! Shutting down. spark-dev1.foo.bar:16601 |
+--------------------------+---------------------------------+-----------------------------------------+---------------------+-----------+----------------------------------------------------+
2 rows selected (0.648 seconds)

Was this patch authored or co-authored using generative AI tooling?

Assisted-by: Claude Opus 4.7.

pan3793 · 2026-05-17T15:07:04Z

+
+    <artifactId>kyuubi-spark-connector-yarn_${scala.binary.version}</artifactId>
+    <packaging>jar</packaging>
+    <name>Kyuubi Spark Hadoop YARN Connector</name>


Use a generic name: we may extend the module to cover other interactions with YARN, for example using YarnClient to retrieve the application list from the RM, or implementing stored procedures that are equivalent to the yarn commands.

pan3793 · 2026-05-17T15:48:22Z

This is a long-standing missing feature for Hadoop. YARN-1440, "Yarn aggregated logs are difficult for external tools to understand", was raised in 2013. Large-scale log processing is a typical use case for Hadoop, so I'm surprised that in 2026 there is still no out-of-the-box solution for batch analysis of the logs of applications run on Hadoop YARN.

@aajisaka @wForget @cxzl25, could you please take a look? I would also like to know whether you have better ideas for processing YARN aggregated logs across applications.

aajisaka · 2026-05-18T02:36:54Z

I agree it's a long-standing missing feature. I'll take a closer look.

cxzl25 · 2026-05-18T03:58:25Z

+  private def listFilesWithFilters(): Array[FileStatus] = {
+    val baseDir = remoteAppLogDir
+    val bucketDir = s"bucket-$remoteAppLogDirSuffix-tfile"
+    var path = s"$baseDir/{{USER}}/$bucketDir/{{BUCKET}}/{{APP_ID}}/{{HOST}}_*"


In the future, we should also support the path layout that predates YARN-6929, which has no buckets.

Thank you for pointing this out! YARN-6929 landed in Hadoop 3.3.0, and I was reading the branch-3.3 code, so I missed this part. I will add a TODO here for now.

The bucket folder layout seems to be an exact fit for TABLESAMPLE SYSTEM, introduced by SPARK-55978.

cxzl25 · 2026-05-18T04:00:12Z

+        logWarning(s"Unsupported filter: $f")
+    }
+    val globPath = path
+      .replace("{{BUCKET}}", "*") // TODO parallize bucket listing


If an app_id filter is provided, the bucket can also be computed in advance to speed up listing.

org.apache.hadoop.yarn.logaggregation.LogAggregationUtils#getRemoteBucketDir

int bucket = appId.getId() % 10000;
String bucketDir = String.format("%04d", bucket);

Good point, thanks for pointing this out.

Added in lines 85-86.
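
The bucket pruning discussed in this thread can be sketched as below. This is a minimal illustration and not the code merged in the PR: the class `BucketPruning` and its methods are hypothetical, the application id is parsed from its string form instead of using Hadoop's `ApplicationId`, and the `/tmp/logs` base directory in the usage note is only an assumed example value. The placeholder names come from the path template quoted in the earlier review comment.

```java
import java.util.Locale;

// Hypothetical sketch of bucket pruning for an app_id filter.
final class BucketPruning {

    // Assumption: the id is the last '_'-separated field,
    // e.g. "application_1743671377509_0001" yields 1.
    static int sequenceId(String appId) {
        return Integer.parseInt(appId.substring(appId.lastIndexOf('_') + 1));
    }

    // Same arithmetic as the snippet quoted from LogAggregationUtils#getRemoteBucketDir.
    static String bucketDir(String appId) {
        return String.format(Locale.ROOT, "%04d", sequenceId(appId) % 10000);
    }

    // A null argument means "no filter pushed down" and falls back to a wildcard.
    // When app_id is known, the bucket is computed instead of globbing every bucket.
    static String globPath(String template, String user, String appId, String host) {
        return template
            .replace("{{USER}}", user == null ? "*" : user)
            .replace("{{BUCKET}}", appId == null ? "*" : bucketDir(appId))
            .replace("{{APP_ID}}", appId == null ? "*" : appId)
            .replace("{{HOST}}", host == null ? "*" : host);
    }
}
```

With the template `/tmp/logs/{{USER}}/bucket-logs-tfile/{{BUCKET}}/{{APP_ID}}/{{HOST}}_*`, user `hadoop`, and app_id `application_1743671377509_0001`, `globPath` yields `/tmp/logs/hadoop/bucket-logs-tfile/0001/application_1743671377509_0001/*_*`, so only one bucket directory needs to be listed.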

aajisaka

Can we create a sample YARN aggregated log and create a unit test to verify?

aajisaka · 2026-05-18T04:29:16Z

+  override def listTables(namespace: Array[String]): Array[Identifier] = {
+    Array(Identifier.of(namespace, "app_logs"))
+  }


This catalog is not namespace-aware, so catalog.any_namespace.app_logs resolves to the same table. We need to at least document this limitation.

That's not intended; I will limit it to catalog.app_logs.
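
One way to restrict the table to the root namespace is sketched below, using plain string arrays in place of Spark's Identifier type so that the example stays self-contained. The class `RootOnlyCatalog` and its methods are hypothetical, and the actual fix in the PR may differ, for example by throwing an exception for an unknown namespace instead of returning an empty result.

```java
// Hypothetical sketch: expose app_logs only under the root (empty) namespace.
final class RootOnlyCatalog {
    static final String TABLE = "app_logs";

    // Only the root namespace lists the table; any other namespace lists nothing.
    static String[] listTables(String[] namespace) {
        return namespace.length == 0 ? new String[] {TABLE} : new String[0];
    }

    // catalog.app_logs resolves, catalog.any_namespace.app_logs does not.
    static boolean tableExists(String[] namespace, String name) {
        return namespace.length == 0 && TABLE.equals(name);
    }
}
```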

pan3793 · 2026-05-18T04:58:45Z

Can we create a sample YARN aggregated log and create a unit test to verify?

Yeah, I plan to do that.

wForget

Thanks @pan3793, LGTM

pan3793 · 2026-05-23T20:51:13Z

Can we create a sample YARN aggregated log and create a unit test to verify?

@aajisaka I collected a set of app logs by setting up a Hadoop cluster and running 3 Spark applications, and then added some basic unit tests to verify app_logs reading, with and without filter pushdown.

pan3793 · 2026-05-23T20:54:45Z

-  /** Runs `f` by passing in `sc` and ensures that `sc` is stopped. */
-  def withSparkSession[T](sc: SparkSession)(f: SparkSession => T): T = {
+  /** Runs `f` by passing in `spark` and ensures that `spark` is stopped. */
+  def withSparkSession[T](spark: SparkSession)(f: SparkSession => T): T = {


This is a naming fix. We should follow the naming convention used in spark-shell: sc refers to the SparkContext, and spark refers to the SparkSession.

pan3793 · 2026-05-26T14:51:53Z

Thanks all, merging to master.

github-actions Bot added module:spark kind:build module:extensions labels May 17, 2026

cxzl25 approved these changes May 18, 2026

aajisaka reviewed May 18, 2026

wForget approved these changes May 20, 2026

pan3793 added 8 commits May 24, 2026 04:18

Init YARN Aggregated Log connector (75bbdae)
unnecessary change (4dfcc21)
fix scala 2.13 compile (a8ac849)
add test yarn agg logs (2d349bd)
support ignoreMissingFiles (cac9677)
add basic tests (0cae4aa)
comments (16abc8c)
bucket pruning when app_id is provided (e916678)

pan3793 force-pushed the yarn-agg-log branch from 27dc538 to e916678 on May 23, 2026 20:35

NoSuchTableException ctor compatibility (89a9a15)

aajisaka approved these changes May 26, 2026

pan3793 self-assigned this May 26, 2026

pan3793 added this to the v1.12.0 milestone May 26, 2026

pan3793 closed this in 36fd762 May 26, 2026

4 participants
