Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for _health_report #1002

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

richardklose
Copy link

In elasticsearch 8.7 a new endpoint for cluster health has been added. See https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-health-report

@ederator
Copy link

cool, this will greatly improve our elasticsearch monitoring! Would be great if someone could review it.

@@ -0,0 +1,457 @@
// Copyright 2021 The Prometheus Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Copyright 2021 The Prometheus Authors
// Copyright 2025 The Prometheus Authors

// See the License for the specific language governing permissions and
// limitations under the License.

package collector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The response struct should be in health_report.go. I have been removing all of the _response files.

@@ -0,0 +1,169 @@
// Copyright 2021 The Prometheus Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Copyright 2021 The Prometheus Authors
// Copyright 2025 The Prometheus Authors

defaultHealthReportLabels = []string{"cluster"}
)

type healthReportMetric struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have been moving away from these custom metric types

client: client,
url: url,

metrics: []*healthReportMetric{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are moving all metric definitions to package vars instead of inside collector structs.

}

// Describe set Prometheus metrics descriptions.
func (c *HealthReport) Describe(ch chan<- *prometheus.Desc) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The collector interface doesn't need Describe so this function can be removed.

}

func (c *HealthReport) fetchAndDecodeHealthReport() (HealthReportResponse, error) {
var hrr HealthReportResponse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this can be replaced by https://github.com/prometheus-community/elasticsearch_exporter/blob/master/collector/util.go#L24.
In that case, the rest of this can just be part of Update

}

for _, metric := range c.statusMetrics {
for _, color := range statusColors {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that looping through the colors is an antipattern. What is the purpose of having all the metrics in all the colors? I'm not sure that the color needs to be a label on very many metrics at all.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I just took this from the cluster_health information and copied it over here. The API returns the status as a string already containing the color, so I thought that was the way to go.
Iirc the SLM collector does this similarly for the operation mode:

elasticsearch_slm_stats_operation_mode{operation_mode="RUNNING"} 1
elasticsearch_slm_stats_operation_mode{operation_mode="STOPPED"} 0
elasticsearch_slm_stats_operation_mode{operation_mode="STOPPING"} 0

The challenge here is, that the health report API has many "sub"-statuses for different components, so we have a lot of metrics here. Any suggestion on how report those statuses better as a metric?

@sysadmind
Copy link
Contributor

See the SLM collector for a good example of how we currently implement collectors

In elasticsearch 8.7 a new endpoint for cluster health has been added. See https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-health-report

Signed-off-by: Richard Klose <[email protected]>
@richardklose
Copy link
Author

@sysadmind I refactored the collector and tried to stick with how the SLM collector is implemented. Not sure how to deal with the many status colors there, so I'll appreciate any suggestions for improvements on that.

@richardklose richardklose requested a review from sysadmind March 4, 2025 18:52
var (
healthReportTotalRepositories = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "health_report", "total_repositories"),
"The number snapshot repositories",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The number snapshot repositories",
"The number of snapshot repositories",

MasterIsStable HealthReportMasterIsStable `json:"master_is_stable"`
RepositoryIntegrity HealthReportRepositoryIntegrity `json:"repository_integrity"`
Disk HealthReportDisk `json:"disk"`
ShardsCapacity HealthReportShardsCapacity `json:"shards_capacity"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shards_capacity seems to be missing from the test fixture

Disk HealthReportDisk `json:"disk"`
ShardsCapacity HealthReportShardsCapacity `json:"shards_capacity"`
ShardsAvailability HealthReportShardsAvailability `json:"shards_availability"`
DataStreamLifecycle HealthReportDataStreamLifecycle `json:"data_stream_lifecycle"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data_stream_lifecycle seems to be missing from the test fixture

ch <- prometheus.MustNewConstMetric(
healthReportTotalRepositories,
prometheus.GaugeValue,
float64(healthReportResponse.Indicators.RepositoryIntegrity.Details.TotalRepositories),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data for this metric is missing in the test fixture.

)
healthReportDiskStatus = prometheus.NewDesc(
prometheus.BuildFQName(namespace, "health_report", "disk_status"),
"disk status",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"disk status",
"Disk status",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants