feat: add support for _health_report #1002
base: master
910c31e to 17cc0a2
Cool, this will greatly improve our Elasticsearch monitoring! It would be great if someone could review it.
collector/health_report.go
Outdated
@@ -0,0 +1,457 @@
// Copyright 2021 The Prometheus Authors
Suggested change:
- // Copyright 2021 The Prometheus Authors
+ // Copyright 2025 The Prometheus Authors
collector/health_report_response.go
Outdated
// See the License for the specific language governing permissions and
// limitations under the License.

package collector
The response struct should be in health_report.go. I have been removing all of the _response files.
collector/health_report_test.go
Outdated
@@ -0,0 +1,169 @@
// Copyright 2021 The Prometheus Authors
Suggested change:
- // Copyright 2021 The Prometheus Authors
+ // Copyright 2025 The Prometheus Authors
collector/health_report.go
Outdated
	defaultHealthReportLabels = []string{"cluster"}
)

type healthReportMetric struct {
We have been moving away from these custom metric types
collector/health_report.go
Outdated
		client: client,
		url:    url,

		metrics: []*healthReportMetric{
We are moving all metric definitions to package vars instead of inside collector structs.
collector/health_report.go
Outdated
}

// Describe set Prometheus metrics descriptions.
func (c *HealthReport) Describe(ch chan<- *prometheus.Desc) {
The collector interface doesn't need Describe so this function can be removed.
collector/health_report.go
Outdated
}

func (c *HealthReport) fetchAndDecodeHealthReport() (HealthReportResponse, error) {
	var hrr HealthReportResponse
Most of this can be replaced by https://github.com/prometheus-community/elasticsearch_exporter/blob/master/collector/util.go#L24.
In that case, the rest of this can just be part of Update
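As a rough illustration of this suggestion, the sketch below separates a generic fetch helper from the decoding that would remain in Update. It uses only the standard library. The names `fetchBody`, `update`, `demo`, and the trimmed `healthReport` struct are hypothetical and are not the exporter's actual helper in util.go; the `cluster_name` and `status` fields are assumed from the health report API.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// healthReport is a trimmed, hypothetical stand-in for the PR's response struct.
type healthReport struct {
	ClusterName string `json:"cluster_name"`
	Status      string `json:"status"`
}

// fetchBody is a hypothetical shared helper: it GETs the URL and returns the
// raw body, treating any non-2xx status code as an error.
func fetchBody(ctx context.Context, hc *http.Client, u string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	resp, err := hc.Do(req)
	if err != nil {
		return nil, fmt.Errorf("failed to get %s: %w", u, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return nil, fmt.Errorf("HTTP request failed with code %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// update shows what is left for the collector once fetching is delegated:
// call the helper, then decode the JSON body.
func update(ctx context.Context, hc *http.Client, u string) (healthReport, error) {
	var hr healthReport
	body, err := fetchBody(ctx, hc, u)
	if err != nil {
		return hr, err
	}
	if err := json.Unmarshal(body, &hr); err != nil {
		return hr, err
	}
	return hr, nil
}

// demo runs update against an in-process fake server with made-up data.
func demo() (healthReport, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"cluster_name":"demo","status":"green"}`)
	}))
	defer srv.Close()
	return update(context.Background(), srv.Client(), srv.URL+"/_health_report")
}

func main() {
	hr, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("cluster=%s status=%s\n", hr.ClusterName, hr.Status)
}
```

With the fetching in one shared helper, a separate fetch-and-decode method per collector is no longer needed, which is the simplification the comment asks for.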
collector/health_report.go
Outdated
	}

	for _, metric := range c.statusMetrics {
		for _, color := range statusColors {
I think that looping through the colors is an antipattern. What is the purpose of having all the metrics in all the colors? I'm not sure that the color needs to be a label on very many metrics at all.
To be honest, I just took this from the cluster_health information and copied it over here. The API returns the status as a string already containing the color, so I thought that was the way to go.
IIRC, the SLM collector does something similar for the operation mode:
elasticsearch_slm_stats_operation_mode{operation_mode="RUNNING"} 1
elasticsearch_slm_stats_operation_mode{operation_mode="STOPPED"} 0
elasticsearch_slm_stats_operation_mode{operation_mode="STOPPING"} 0
The challenge here is that the health report API has many "sub"-statuses for different components, so we have a lot of metrics here. Any suggestions on how to report those statuses better as a metric?
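To make the pattern under discussion concrete, here is a minimal standard-library sketch of the one-series-per-status encoding, the same shape as the SLM operation mode output quoted above. The status list, the `color` label name, the full metric name, and the `sample` type are assumptions for illustration; a real collector would emit Prometheus const metrics instead of printing.

```go
package main

import "fmt"

// knownStatuses is the assumed set of status values for this sketch.
var knownStatuses = []string{"green", "yellow", "red", "unknown"}

// sample is one labelled value, standing in for a Prometheus const metric.
type sample struct {
	Status string
	Value  float64
}

// oneHot returns one sample per known status: 1 for the current status and
// 0 for every other one.
func oneHot(current string) []sample {
	out := make([]sample, 0, len(knownStatuses))
	for _, s := range knownStatuses {
		v := 0.0
		if s == current {
			v = 1
		}
		out = append(out, sample{Status: s, Value: v})
	}
	return out
}

func main() {
	// The metric and label names are illustrative only.
	for _, s := range oneHot("yellow") {
		fmt.Printf("elasticsearch_health_report_disk_status{color=%q} %g\n", s.Status, s.Value)
	}
}
```

Each status metric encoded this way produces as many series as there are statuses, so applying it to every indicator multiplies the series count, which is the concern raised about looping over colors.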
See the SLM collector for a good example of how we currently implement collectors.
17cc0a2 to 55c1f6f
In elasticsearch 8.7 a new endpoint for cluster health has been added. See https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-health-report Signed-off-by: Richard Klose <[email protected]>
55c1f6f to 6a0bb0e
@sysadmind I refactored the collector and tried to stick with how the SLM collector is implemented. I'm not sure how to deal with the many status colors there, so I'd appreciate any suggestions for improving that.
var (
	healthReportTotalRepositories = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "health_report", "total_repositories"),
		"The number snapshot repositories",
Suggested change:
- "The number snapshot repositories",
+ "The number of snapshot repositories",
	MasterIsStable      HealthReportMasterIsStable      `json:"master_is_stable"`
	RepositoryIntegrity HealthReportRepositoryIntegrity `json:"repository_integrity"`
	Disk                HealthReportDisk                `json:"disk"`
	ShardsCapacity      HealthReportShardsCapacity      `json:"shards_capacity"`
The shards_capacity seems to be missing from the test fixture
	Disk                HealthReportDisk                `json:"disk"`
	ShardsCapacity      HealthReportShardsCapacity      `json:"shards_capacity"`
	ShardsAvailability  HealthReportShardsAvailability  `json:"shards_availability"`
	DataStreamLifecycle HealthReportDataStreamLifecycle `json:"data_stream_lifecycle"`
data_stream_lifecycle seems to be missing from the test fixture
	ch <- prometheus.MustNewConstMetric(
		healthReportTotalRepositories,
		prometheus.GaugeValue,
		float64(healthReportResponse.Indicators.RepositoryIntegrity.Details.TotalRepositories),
The data for this metric is missing in the test fixture.
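Gaps like the ones flagged in these fixture comments can be caught mechanically. The sketch below is one possible check, not part of the PR: it lists which indicator keys are absent from a fixture. The indicator names come from the struct tags quoted in this review; the top-level `indicators` key, the helper name `missingIndicators`, and the sample fixture are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// expectedIndicators lists the indicator keys from the struct tags quoted in
// this review.
var expectedIndicators = []string{
	"master_is_stable",
	"repository_integrity",
	"disk",
	"shards_capacity",
	"shards_availability",
	"data_stream_lifecycle",
}

// missingIndicators (hypothetical helper) returns the expected indicator keys
// that do not appear under the fixture's "indicators" object.
func missingIndicators(fixture []byte) ([]string, error) {
	var doc struct {
		Indicators map[string]json.RawMessage `json:"indicators"`
	}
	if err := json.Unmarshal(fixture, &doc); err != nil {
		return nil, err
	}
	var missing []string
	for _, key := range expectedIndicators {
		if _, ok := doc.Indicators[key]; !ok {
			missing = append(missing, key)
		}
	}
	return missing, nil
}

// sampleFixture is made-up data that deliberately omits two indicators.
const sampleFixture = `{
  "indicators": {
    "master_is_stable": {"status": "green"},
    "repository_integrity": {"status": "green"},
    "disk": {"status": "green"},
    "shards_availability": {"status": "green"}
  }
}`

func main() {
	missing, err := missingIndicators([]byte(sampleFixture))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("missing indicators:", missing)
}
```

A check like this only confirms that a key is present. Whether nested details such as the repository count are populated would still need a value-level assertion in the collector test.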
	)
	healthReportDiskStatus = prometheus.NewDesc(
		prometheus.BuildFQName(namespace, "health_report", "disk_status"),
		"disk status",
Suggested change:
- "disk status",
+ "Disk status",
In elasticsearch 8.7 a new endpoint for cluster health has been added. See https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-health-report