Skip to content

ES|QL query log #124094

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Mar 18, 2025
Merged

ES|QL query log #124094

merged 31 commits into from
Mar 18, 2025

Conversation

luigidellaquila
Copy link
Contributor

@luigidellaquila luigidellaquila commented Mar 5, 2025

Adding Query Log to ES|QL.

It can be configured at cluster level, using the following cluster settings (TimeValue):

esql.querylog.threshold.trace    // time period, theshold to log the query at trace level, -1 by default (disabled)
esql.querylog.threshold.debug   // time period, theshold to log the query at debug level, -1 by default (disabled)
esql.querylog.threshold.info    // time period, theshold to log the query at info level, -1 by default (disabled)
esql.querylog.threshold.warn     // time period, theshold to log the query at warn level, -1 by default (disabled)
esql.querylog.include.user    // include user data, false by default

By default, queries will be logged to

logs/runTask_esql_querylog.json

Example log in case of success:

{
    "@timestamp": "2025-03-11T08:39:50.076Z",
    "log.level": "TRACE",
    "auth.type": "REALM",
    "elasticsearch.querylog.planning.took": 3108666,
    "elasticsearch.querylog.planning.took_millis": 3,
    "elasticsearch.querylog.query": "from index | limit 100",
    "elasticsearch.querylog.search_type": "ESQL",
    "elasticsearch.querylog.success": true,
    "elasticsearch.querylog.took": 8050416,
    "elasticsearch.querylog.took_millis": 8,
    "user.name": "elastic-admin",
    "user.realm": "default_file",
    "ecs.version": "1.2.0",
    "service.name": "ES_ECS",
    "event.dataset": "elasticsearch.esql_querylog",
    "process.thread.name": "elasticsearch[runTask-0][esql_worker][T#12]",
    "log.logger": "esql.querylog.query",
    "elasticsearch.cluster.uuid": "KZo1V7TcQM-O6fnqMm1t_g",
    "elasticsearch.node.id": "uPgRE2TrSfa9IvnUpNT1Uw",
    "elasticsearch.node.name": "runTask-0",
    "elasticsearch.cluster.name": "runTask"
}

Example log in case of failure:

{
    "@timestamp": "2025-03-11T08:41:54.172Z",
    "log.level": "TRACE",
    "auth.type": "REALM",
    "elasticsearch.querylog.error.message": "line 1:15: mismatched input 'limitxyz' expecting {DEV_CHANGE_POINT, 'enrich', 'dissect', 'eval', 'grok', 'limit', 'sort', 'stats', 'where', DEV_INLINESTATS, DEV_FORK, 'lookup', DEV_JOIN_LEFT, DEV_JOIN_RIGHT, DEV_LOOKUP, 'mv_expand', 'drop', 'keep', DEV_INSIST, 'rename'}",
    "elasticsearch.querylog.error.type": "org.elasticsearch.xpack.esql.parser.ParsingException",
    "elasticsearch.querylog.query": "from person | limitxyz 100",
    "elasticsearch.querylog.search_type": "ESQL",
    "elasticsearch.querylog.success": false,
    "elasticsearch.querylog.took": 963750,
    "elasticsearch.querylog.took_millis": 0,
    "user.name": "elastic-admin",
    "user.realm": "default_file",
    "ecs.version": "1.2.0",
    "service.name": "ES_ECS",
    "event.dataset": "elasticsearch.esql_querylog",
    "process.thread.name": "elasticsearch[runTask-0][search][T#16]",
    "log.logger": "esql.querylog.query",
    "elasticsearch.cluster.uuid": "KZo1V7TcQM-O6fnqMm1t_g",
    "elasticsearch.node.id": "uPgRE2TrSfa9IvnUpNT1Uw",
    "elasticsearch.node.name": "runTask-0",
    "elasticsearch.cluster.name": "runTask"
}

TODO

  • docs


private static Map<String, Object> prepareMap(Result esqlResult, String query) {
Map<String, Object> messageFields = new HashMap<>();
messageFields.put("elasticsearch.slowlog.message", esqlResult.executionInfo().clusterAliases());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit surprising, message contains cluster aliases, is it intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need that at all.
SearchSlowLog has searchContext.indexShard().shardId() in there, maybe we want something different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm removing it for now, in case we'll reintroduce it

messageFields.put("elasticsearch.slowlog.planning.took_millis", esqlResult.executionInfo().planningTookTime().millis());
messageFields.put("elasticsearch.slowlog.search_type", "ESQL");
String source = escapeJson(query);
messageFields.put("elasticsearch.slowlog.source", source);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think source is a bit ambiguous. Should we call it query here?

Copy link
Contributor Author

@luigidellaquila luigidellaquila Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't like it very much, query would be much better.
I'm using source only because it's what Search does, but I'm not sure we want to be consistent with it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, source for _search is the actual request body, while in our case it's only the query string.
I'll rename it as "query" for now.

return; // TODO review, it happens in some tests, not sure if it's a thing also in prod
}
long tookInNanos = esqlResult.executionInfo().overallTook().nanos();
var queryWarnThreshold = clusterSettings.get(ESQL_SLOWLOG_THRESHOLD_QUERY_WARN_SETTING).nanos();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: org.elasticsearch.common.settings.Setting#get logic is a bit tricky.
We should copy the value to the field and watch for updates using org.elasticsearch.common.settings.AbstractScopedSettings#initializeAndWatch

Comment on lines 59 to 67
final String node1, node2;
if (randomBoolean()) {
internalCluster().ensureAtLeastNumDataNodes(2);
node1 = randomDataNode().getName();
node2 = randomValueOtherThan(node1, () -> randomDataNode().getName());
} else {
node1 = randomDataNode().getName();
node2 = randomDataNode().getName();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unused

Copy link
Contributor

@ivancea ivancea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Added some comments/questions

Comment on lines 109 to 110
TimeValue.timeValueNanos(-1),
TimeValue.timeValueMillis(-1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some reason for the default to be -1ns instead of -1ms? I guess both would work the same, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the same. let me change it

if (execInfo.isCrossClusterSearch()) {
execInfo.markEndPlanning();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a bug? Was it working when not CCS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really.
This method was introduced with CCS implementation, and planningTookTime was only used in that contex.
Now we need it also for other purposes, so I'll set it also for non-CCS queries.

latch.countDown();
}
}));
latch.await(30, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we failing the test if the countdown isn't 0? I think await() does nothing about it, right?
Not sure what will happen here exactly if the query fails.

I see other test doing an assert like:

assertEquals("All requests must respond, requests: " + requests, 0, latch.getCount());

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use safeAwait(latch); instead. It should handle correctly the cases when operation times out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see other test doing an assert like:

One more assert won't hurt, I'm adding it.


private static final Logger queryLogger = LogManager.getLogger(SLOWLOG_PREFIX + ".query");

private static final ToXContent.Params FORMAT_PARAMS = new ToXContent.MapParams(Collections.singletonMap("pretty", "false"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@luigidellaquila luigidellaquila marked this pull request as ready for review March 7, 2025 15:45
@luigidellaquila luigidellaquila requested a review from a team as a code owner March 7, 2025 15:45
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 7, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

Copy link
Member

@costin costin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.
It would be really useful to see an example of what the esql slow logs looks like (I assume it will be part of the doc).

Comment on lines 91 to 92
planTelemetryManager.publish(planTelemetry, true);
slowLog.onQueryPhase(x, request.query());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To help with encapsulation, please create two internal methods called when the query is successfully executed or is failing and move the metrics, telemetry and now the slow log there (e.g. onQuerySuccesful(), onQueryFailure())

this(settings, null);
}

public void onQueryPhase(Result esqlResult, String query) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method and onQueryFailure are very similar - same code should be reused across both of them.

Comment on lines 135 to 157
jsonFields.put("elasticsearch.slowlog.error.message", exception.getMessage() == null ? "" : exception.getMessage());
jsonFields.put("elasticsearch.slowlog.error.type", exception.getClass().getName());
jsonFields.put("elasticsearch.slowlog.took", took);
jsonFields.put("elasticsearch.slowlog.took_millis", took / 1_000_000);
return new ESLogMessage().withFields(jsonFields);
}

private static Map<String, Object> prepareMap(Result esqlResult, String query, boolean success, User user) {
Map<String, Object> messageFields = new HashMap<>();
if (user != null) {
messageFields.put("user.name", user.principal());
}
if (esqlResult != null) {
messageFields.put("elasticsearch.slowlog.took", esqlResult.executionInfo().overallTook().nanos());
messageFields.put("elasticsearch.slowlog.took_millis", esqlResult.executionInfo().overallTook().millis());
messageFields.put("elasticsearch.slowlog.planning.took", esqlResult.executionInfo().planningTookTime().nanos());
messageFields.put("elasticsearch.slowlog.planning.took_millis", esqlResult.executionInfo().planningTookTime().millis());
}
String source = escapeJson(query);
messageFields.put("elasticsearch.slowlog.success", success);
messageFields.put("elasticsearch.slowlog.search_type", "ESQL");
messageFields.put("elasticsearch.slowlog.query", source);
return messageFields;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract constants out of these repetitive strings including during their construction, such as the common prefix ("elasticsearch.slowlog")

Also there's a repetition on took/took_millis, which seems to be overridden always in of method.

@luigidellaquila
Copy link
Contributor Author

Thanks for the feedback folks!

@costin I added two examples of slow log output to the PR description.
Docs will need a separate PR on https://github.com/elastic/docs-content

Copy link
Contributor

@jfreden jfreden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a comment on combining the SlowLogFieldProviders.

@@ -48,6 +50,11 @@ public Map<String, String> searchFields() {
}
return Map.of();
}

@Override
public Map<String, String> queryFields() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

return fields.stream()
.flatMap(f -> f.indexFields().entrySet().stream())
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
SlowLogFieldProvider slowLogFieldProvider = new SlowLogFieldProvider() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this is that it will merge all field providers provided by the plugins into one (in practice I think that's only the one in the security plugin). So this should ideally have an implementation for merging queryFields too and then this instance should be passed to the PluginService (instead of the list of providers).

If you don't want to implement the merge for indexFields and searchFields in the provider without index settings you could add an empty default implementation in the interface in the same way you did for queryFields.

Copy link
Contributor

@jfreden jfreden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for implementing the suggested changes. LGTM pending CI and the comment I left above.

@luigidellaquila
Copy link
Contributor Author

Thanks @jfreden!

@luigidellaquila luigidellaquila changed the title ES|QL slow log ES|QL query log Mar 17, 2025
@luigidellaquila
Copy link
Contributor Author

SlowLog -> QueryLog

The intention is to make it more generic in the future, so that we can log queries not based on the execution time, but also based on other criteria.

@luigidellaquila
Copy link
Contributor Author

Docs PR here elastic/docs-content#816

@luigidellaquila luigidellaquila added v8.19.0 auto-backport Automatically create backport pull requests when merged labels Mar 18, 2025
@luigidellaquila luigidellaquila merged commit f3ed9b3 into elastic:main Mar 18, 2025
17 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.x Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 124094

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged backport pending >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants