feat: add support for _health_report #1002

richardklose · 2025-02-21T15:39:18Z

Elasticsearch 8.7 added a new endpoint for cluster health. See https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-health-report

ederator · 2025-02-22T08:32:51Z

Cool, this will greatly improve our Elasticsearch monitoring! It would be great if someone could review it.

sysadmind · 2025-02-23T14:45:04Z

collector/health_report.go

@@ -0,0 +1,457 @@
+// Copyright 2021 The Prometheus Authors


Suggested change

-// Copyright 2021 The Prometheus Authors
+// Copyright 2025 The Prometheus Authors

sysadmind · 2025-02-23T14:45:54Z

collector/health_report_response.go

+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package collector


The response struct should be in health_report.go. I have been removing all of the _response files.

sysadmind · 2025-02-23T14:46:48Z

collector/health_report_test.go

@@ -0,0 +1,169 @@
+// Copyright 2021 The Prometheus Authors


Suggested change

-// Copyright 2021 The Prometheus Authors
+// Copyright 2025 The Prometheus Authors

sysadmind · 2025-02-23T14:49:49Z

collector/health_report.go

+	defaultHealthReportLabels = []string{"cluster"}
+)
+
+type healthReportMetric struct {


We have been moving away from these custom metric types.

sysadmind · 2025-02-23T14:50:49Z

collector/health_report.go

+		client: client,
+		url:    url,
+
+		metrics: []*healthReportMetric{


We are moving all metric definitions to package vars instead of inside collector structs.

sysadmind · 2025-02-23T14:52:46Z

collector/health_report.go

+}
+
+// Describe set Prometheus metrics descriptions.
+func (c *HealthReport) Describe(ch chan<- *prometheus.Desc) {


The collector interface doesn't need Describe so this function can be removed.

sysadmind · 2025-02-23T14:55:56Z

collector/health_report.go

+}
+
+func (c *HealthReport) fetchAndDecodeHealthReport() (HealthReportResponse, error) {
+	var hrr HealthReportResponse


Most of this can be replaced by https://github.com/prometheus-community/elasticsearch_exporter/blob/master/collector/util.go#L24.
In that case, the rest of this can just be part of Update.
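A generic fetch-and-decode helper of the kind referenced above might look roughly like the sketch below. The name `getJSON`, its signature, the stub transport, and the trimmed `healthReportResponse` struct are assumptions made for illustration; they are not the actual helper in util.go or the struct from this PR.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// healthReportResponse is a trimmed, hypothetical subset of the response body.
type healthReportResponse struct {
	ClusterName string `json:"cluster_name"`
	Status      string `json:"status"`
}

// getJSON is a hypothetical helper: it GETs url and decodes the JSON body into out.
func getJSON(ctx context.Context, hc *http.Client, url string, out any) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("HTTP request failed with code %d", resp.StatusCode)
	}
	return json.NewDecoder(resp.Body).Decode(out)
}

// stubTransport returns a canned response so the sketch needs no real cluster.
type stubTransport struct {
	status int
	body   string
}

func (s stubTransport) RoundTrip(*http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: s.status,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader(s.body)),
	}, nil
}

// fetchStub runs getJSON against the stub transport (illustrative URL).
func fetchStub(status int, body string) (healthReportResponse, error) {
	var hrr healthReportResponse
	hc := &http.Client{Transport: stubTransport{status: status, body: body}}
	err := getJSON(context.Background(), hc, "http://localhost:9200/_health_report", &hrr)
	return hrr, err
}

func main() {
	hrr, err := fetchStub(http.StatusOK, `{"cluster_name":"docker-cluster","status":"green"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(hrr.ClusterName, hrr.Status)
}
```

With a helper of this shape, the collector's Update method would only need to call it and emit metrics from the decoded struct.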

sysadmind · 2025-02-23T14:56:54Z

collector/health_report.go

+	}
+
+	for _, metric := range c.statusMetrics {
+		for _, color := range statusColors {


I think that looping through the colors is an antipattern. What is the purpose of having all the metrics in all the colors? I'm not sure that the color needs to be a label on very many metrics at all.

richardklose · 2025-02-24

To be honest, I just took this from the cluster_health information and copied it over here. The API returns the status as a string already containing the color, so I thought that was the way to go.
IIRC the SLM collector does this similarly for the operation mode:

elasticsearch_slm_stats_operation_mode{operation_mode="RUNNING"} 1
elasticsearch_slm_stats_operation_mode{operation_mode="STOPPED"} 0
elasticsearch_slm_stats_operation_mode{operation_mode="STOPPING"} 0

The challenge here is that the health report API has many "sub"-statuses for different components, so we have a lot of metrics here. Any suggestions on how to report those statuses better as a metric?
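For reference, the two encodings under discussion can be sketched as follows: one sample per color (the pattern questioned above) versus a single numeric gauge. This is only an illustration. The color list, the numeric mapping, and the metric names `example_status` and `example_status_value` are assumptions, and neither variant is presented as the approach this PR should take.

```go
package main

import "fmt"

// statusColors is an assumed, illustrative set of status values.
var statusColors = []string{"green", "yellow", "red"}

// oneHot returns one sample per color: 1 for the current status, 0 otherwise.
func oneHot(status string) map[string]float64 {
	out := make(map[string]float64, len(statusColors))
	for _, c := range statusColors {
		if c == status {
			out[c] = 1
		} else {
			out[c] = 0
		}
	}
	return out
}

// statusValue collapses the status into a single gauge value.
// The mapping is hypothetical; unrecognized statuses map to -1.
func statusValue(status string) float64 {
	switch status {
	case "green":
		return 0
	case "yellow":
		return 1
	case "red":
		return 2
	default:
		return -1
	}
}

func main() {
	samples := oneHot("yellow")
	for _, c := range statusColors {
		fmt.Printf("example_status{color=%q} %v\n", c, samples[c])
	}
	fmt.Printf("example_status_value %v\n", statusValue("yellow"))
}
```

The one-hot form multiplies the series count by the number of colors for every indicator, while the numeric form keeps one series per indicator at the cost of a mapping that has to be documented.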

sysadmind · 2025-02-23T14:59:53Z

See the SLM collector for a good example of how we currently implement collectors.

In elasticsearch 8.7 a new endpoint for cluster health has been added. See https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-health-report

Signed-off-by: Richard Klose <[email protected]>

richardklose · 2025-02-24T11:50:11Z

@sysadmind I refactored the collector and tried to stick with how the SLM collector is implemented. I'm not sure how to deal with the many status colors there, so I'd appreciate any suggestions for improvements on that.

sysadmind · 2025-03-05T02:10:08Z

collector/health_report.go

+var (
+	healthReportTotalRepositories = prometheus.NewDesc(
+		prometheus.BuildFQName(namespace, "health_report", "total_repositories"),
+		"The number snapshot repositories",


Suggested change

-"The number snapshot repositories",
+"The number of snapshot repositories",

sysadmind · 2025-03-05T02:17:20Z

collector/health_report.go

+	MasterIsStable      HealthReportMasterIsStable      `json:"master_is_stable"`
+	RepositoryIntegrity HealthReportRepositoryIntegrity `json:"repository_integrity"`
+	Disk                HealthReportDisk                `json:"disk"`
+	ShardsCapacity      HealthReportShardsCapacity      `json:"shards_capacity"`


The shards_capacity indicator seems to be missing from the test fixture.

sysadmind · 2025-03-05T02:17:49Z

collector/health_report.go

+	Disk                HealthReportDisk                `json:"disk"`
+	ShardsCapacity      HealthReportShardsCapacity      `json:"shards_capacity"`
+	ShardsAvailability  HealthReportShardsAvailability  `json:"shards_availability"`
+	DataStreamLifecycle HealthReportDataStreamLifecycle `json:"data_stream_lifecycle"`


The data_stream_lifecycle indicator seems to be missing from the test fixture.

sysadmind · 2025-03-05T02:21:46Z

collector/health_report.go

+	ch <- prometheus.MustNewConstMetric(
+		healthReportTotalRepositories,
+		prometheus.GaugeValue,
+		float64(healthReportResponse.Indicators.RepositoryIntegrity.Details.TotalRepositories),


The data for this metric is missing in the test fixture.
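Gaps like the ones flagged in these comments can be caught mechanically. The sketch below checks which expected indicator keys are absent from a fixture; the helper name `missingIndicators`, the inline fixture, and the assumption that indicators sit under a top-level `indicators` object are illustrative and not taken from the PR's test code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// missingIndicators reports which of the wanted keys are absent from the
// "indicators" object of a health report fixture (hypothetical helper).
func missingIndicators(fixture []byte, want []string) ([]string, error) {
	var doc struct {
		Indicators map[string]json.RawMessage `json:"indicators"`
	}
	if err := json.Unmarshal(fixture, &doc); err != nil {
		return nil, err
	}
	var missing []string
	for _, key := range want {
		if _, ok := doc.Indicators[key]; !ok {
			missing = append(missing, key)
		}
	}
	return missing, nil
}

func main() {
	// Illustrative fixture, not the one shipped with the PR.
	fixture := []byte(`{"indicators":{"master_is_stable":{"status":"green"},"disk":{"status":"green"}}}`)
	want := []string{"master_is_stable", "disk", "shards_capacity", "data_stream_lifecycle"}

	missing, err := missingIndicators(fixture, want)
	if err != nil {
		panic(err)
	}
	fmt.Println("missing from fixture:", missing)
}
```

A check like this only confirms that a key exists; the nested details (such as the repository counts) would still need their own fixture data and assertions.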

sysadmind · 2025-03-05T02:25:35Z

collector/health_report.go

+	)
+	healthReportDiskStatus = prometheus.NewDesc(
+		prometheus.BuildFQName(namespace, "health_report", "disk_status"),
+		"disk status",


Suggested change

-"disk status",
+"Disk status",

richardklose force-pushed the feat/health_report branch from 910c31e to 17cc0a2 Compare February 21, 2025 15:40

sysadmind reviewed Feb 23, 2025


sysadmind requested changes Feb 23, 2025


richardklose force-pushed the feat/health_report branch from 17cc0a2 to 55c1f6f Compare February 24, 2025 08:03

feat: add support for _health_report

6a0bb0e


richardklose force-pushed the feat/health_report branch from 55c1f6f to 6a0bb0e Compare February 24, 2025 11:47

richardklose requested a review from sysadmind March 4, 2025 18:52

sysadmind requested changes Mar 5, 2025


sysadmind reviewed Mar 5, 2025

