You can use {product-title} to gain insight into and optimize your managed clusters. Enable the observability service operator, multicluster-observability-operator, on your hub cluster to monitor the health of your managed clusters. Learn about the architecture for the multicluster observability service in the following sections.
Note: The on-demand log gives engineers access to logs for a given pod in real time. Logs from the hub cluster are not aggregated. These logs can be accessed with the search service and other parts of the console.
Observability is included with the product installation, but it is not enabled by default because the service requires persistent storage. See the Support section for observability.
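You enable the service by creating a MultiClusterObservability custom resource on the hub cluster after your object storage secret is in place. The following is a minimal sketch only; the secret name and key shown here are assumptions for illustration, so see Enabling observability service for the exact resource definition:

```yaml
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  observabilityAddonSpec: {}          # keep the default add-on settings
  storageConfig:
    metricObjectStorage:
      name: thanos-object-storage     # assumed name of the object storage secret
      key: thanos.yaml                # assumed key that holds the store configuration
```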
When the service is enabled, the observability-endpoint-operator is automatically deployed to each imported or created cluster. This controller collects the data from {ocp} Prometheus, then sends it to the {product-title-short} hub cluster. If the hub cluster imports itself as the local-cluster, observability is also enabled on it and metrics are collected from the hub cluster.
The observability service deploys an instance of Prometheus Alertmanager, which enables alerts to be forwarded to third-party applications. It also includes an instance of Grafana to enable data visualization with dashboards (static) or data exploration. {product-title-short} supports version 8.1.3 of Grafana. You can also design your own Grafana dashboard. For more information, see Designing your Grafana dashboard. You can customize the observability service by creating custom recording rules or alerting rules.
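Custom rules are supplied to the Thanos ruler through a ConfigMap on the hub cluster. The following is a hedged sketch; the ConfigMap name, namespace, and the sample rule itself are assumptions used for illustration, so refer to Customizing observability for the exact format:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-custom-rules                    # assumed name that the Thanos ruler watches
  namespace: open-cluster-management-observability   # assumed observability namespace
data:
  custom_rules.yaml: |
    groups:
      - name: cluster-health
        rules:
          # Example alerting rule; the metric and threshold are placeholders
          - alert: ClusterCPUHealth
            annotations:
              summary: Notify when CPU utilization on a managed cluster is high
            expr: max(cluster:cpu_usage_cores:sum) by (cluster) > 32
            labels:
              severity: warning
```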
- {product-title-short} is tested with and fully supported by Red Hat OpenShift Data Foundation (formerly Red Hat OpenShift Container Storage).
- {product-title-short} supports the function of the multicluster observability operator on user-provided third-party object storage that is S3 API compatible. The observability service uses Thanos supported, stable object stores. See the example configuration after this list.
- {product-title-short} uses commercially reasonable efforts to assist in identifying the root cause. If a support ticket is raised and the root cause is determined to be the customer-provided S3 compatible object storage, the issue must be resolved using the customer support channels.
- {product-title-short} does not commit to fixing support tickets raised by customers where the identified root cause is the S3 compatible object storage provider.
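As an illustration of an S3 API compatible configuration, the following secret sketch follows the format commonly used by Thanos object store clients. The secret name, namespace, and key shown are assumptions for this example; replace the bucket, endpoint, and credentials with your own values and see Enabling observability service for the supported definition:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-object-storage                        # assumed secret name referenced by the MultiClusterObservability CR
  namespace: open-cluster-management-observability   # assumed observability namespace
type: Opaque
stringData:
  thanos.yaml: |
    type: s3
    config:
      bucket: YOUR_BUCKET_NAME
      endpoint: s3.example.com
      insecure: false
      access_key: YOUR_ACCESS_KEY
      secret_key: YOUR_SECRET_KEY
```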
By default, {ocp-short} sends metrics to Red Hat using the Telemetry service. The acm_managed_cluster_info metric is available with {product-title-short} and is included with telemetry, but it is not displayed on the {product-title-short} Observe environments overview dashboard.
View the following table of metric types that are supported by the framework:
Metric name | Metric type | Labels/tags | Status |
---|---|---|---|
| Gauge | | Stable |
| Gauge | | Stable. See Governance metric for more details. |
| Gauge | | Stable. See Managing insight PolicyReports for more details. |
| Histogram | None | Stable. See Governance metric for more details. |
| Histogram | None | Stable. See Governance metric for more details. |
| Histogram | None | Stable. See Governance metric for more details. |
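Metrics collected by the observability service, including those listed in the previous table, can be explored from the Grafana instance that is deployed with it. The following PromQL query is an illustration only; the cloud label used for grouping is an assumption about the labels that acm_managed_cluster_info exposes:

```
count(acm_managed_cluster_info) by (cloud)
```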
Observability components require 2701 mCPU and 11972 Mi of memory to install the observability service. The following table lists the pod capacity requests for five managed clusters with observability-addons enabled:
Deployment or StatefulSet | Container name | CPU (mCPU) | Memory (Mi) | Replicas | Pod total CPU | Pod total memory |
---|---|---|---|---|---|---|
observability-alertmanager | alertmanager | 4 | 200 | 3 | 12 | 600 |
| config-reloader | 4 | 25 | 3 | 12 | 75 |
| alertmanager-proxy | 1 | 20 | 3 | 3 | 60 |
observability-grafana | grafana | 4 | 100 | 2 | 8 | 200 |
| grafana-dashboard-loader | 4 | 50 | 2 | 8 | 100 |
observability-observatorium-api | observatorium-api | 20 | 128 | 2 | 40 | 256 |
observability-observatorium-operator | observatorium-operator | 100 | 100 | 1 | 10 | 50 |
observability-rbac-query-proxy | rbac-query-proxy | 20 | 100 | 2 | 40 | 200 |
| oauth-proxy | 1 | 20 | 2 | 2 | 40 |
observability-thanos-compact | thanos-compact | 100 | 512 | 1 | 100 | 512 |
observability-thanos-query | thanos-query | 300 | 1024 | 2 | 600 | 2048 |
observability-thanos-query-frontend | thanos-query-frontend | 100 | 256 | 2 | 200 | 512 |
observability-thanos-query-frontend-memcached | memcached | 45 | 128 | 3 | 135 | 384 |
| exporter | 5 | 50 | 3 | 15 | 150 |
observability-thanos-receive-controller | thanos-receive-controller | 4 | 32 | 1 | 4 | 32 |
observability-thanos-receive-default | thanos-receive | 300 | 512 | 3 | 900 | 1536 |
observability-thanos-rule | thanos-rule | 50 | 512 | 3 | 150 | 1536 |
| configmap-reloader | 4 | 25 | 3 | 12 | 75 |
observability-thanos-store-memcached | memcached | 45 | 128 | 3 | 135 | 384 |
| exporter | 5 | 50 | 3 | 15 | 150 |
observability-thanos-store-shard | thanos-store | 100 | 1024 | 3 | 300 | 3072 |
When you install {product-title-short}, the following persistent volumes (PV) must be created so that Persistent Volume Claims (PVC) can attach to them automatically. As a reminder, you must define a storage class in the MultiClusterObservability CR when there is no default storage class specified, or when you want to use a non-default storage class to host the PVs. It is recommended to use Block Storage, similar to what Prometheus uses. Also, each replica of alertmanager, thanos-compactor, thanos-ruler, thanos-receive-default, and thanos-store-shard must have its own PV. View the following table; an example storage configuration follows it:
Persistent volume name | Purpose |
---|---|
alertmanager | Alertmanager stores the nflog data and silenced alerts to its storage. |
thanos-compact | The compactor needs local disk space to store intermediate data for its processing, as well as bucket state cache. The required space depends on the size of the underlying blocks. The compactor must have enough space to download all of the source blocks, then build the compacted blocks on the disk. On-disk data is safe to delete between restarts and should be the first attempt to get crash-looping compactors unstuck. However, it is recommended to give the compactor persistent disks in order to effectively use bucket state cache in between restarts. |
thanos-rule | The Thanos ruler evaluates Prometheus recording and alerting rules against a chosen query API by issuing queries at a fixed interval. Rule results are written back to the disk in the Prometheus 2.0 storage format. The amount of hours or days of data retained in this stateful set was fixed in the API version. |
thanos-receive-default | The Thanos receiver accepts incoming data (Prometheus remote-write requests) and writes it into a local instance of the Prometheus TSDB. Periodically (every 2 hours), TSDB blocks are uploaded to the object storage for long-term storage and compaction. The amount of hours or days of data retained in this stateful set, which acts as a local cache, was fixed in the API version. |
thanos-store-shard | It acts primarily as an API gateway and therefore does not need a significant amount of local disk space. It joins a Thanos cluster on startup and advertises the data it can access. It keeps a small amount of information about all remote blocks on local disk and keeps it in sync with the bucket. This data is generally safe to delete across restarts at the cost of increased startup times. |
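The storage class and the size of each of these volumes can be set in the MultiClusterObservability CR, as mentioned in the paragraph before the table. The following sketch is illustrative only; the field values and sizes shown are assumptions for this example, so verify them against Enabling observability service before use:

```yaml
apiVersion: observability.open-cluster-management.io/v1beta2
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  storageConfig:
    storageClass: gp3-csi              # assumed non-default block storage class
    alertmanagerStorageSize: 1Gi       # assumed sizes; tune to your retention needs
    compactStorageSize: 100Gi
    receiveStorageSize: 100Gi
    ruleStorageSize: 1Gi
    storeStorageSize: 10Gi
    metricObjectStorage:
      name: thanos-object-storage      # assumed object storage secret from the earlier example
      key: thanos.yaml
```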
Note: The time series historical data is stored in object stores. Thanos uses object storage as the primary storage for metrics and metadata related to them. For more details about the object storage and downsampling, see Enabling observability service.
- For more information about enabling observability, see Enabling observability service.
- See Customizing observability to learn how to configure the observability service, and how to view metrics and other data.
- To learn what types of metrics are collected and sent using telemetry, see Information collected by Telemetry in the {ocp-short} documentation.
- Refer to Prometheus recording rules.
- Also refer to Prometheus alerting rules.