Skip to content

Fix must-gather output for omc compatibility#4965

Open
ozzywalsh wants to merge 11 commits into
open-telemetry:mainfrom
ozzywalsh:fix-must-gather
Open

Fix must-gather output for omc compatibility#4965
ozzywalsh wants to merge 11 commits into
open-telemetry:mainfrom
ozzywalsh:fix-must-gather

Conversation

@ozzywalsh
Copy link
Copy Markdown
Contributor

@ozzywalsh ozzywalsh commented Apr 14, 2026

Summary

omc is a CLI for browsing OpenShift must-gather output offline. Our must-gather used a non-standard directory layout that omc couldn't parse.

This PR rewrites the output to match the layout omc expects. After this change, all standard omc commands work against our must-gather output.

What changed

  • Directory structure: resources grouped by API group and kind (apps/deployments/, core/services/, opentelemetry.io/opentelemetrycollectors/) instead of flat files under the collector name
  • YAML metadata: apiVersion and kind fields populated on all resources (controller-runtime strips these from List results)
  • CRDs: all OpenTelemetry CRD definitions collected so omc can understand custom resources
  • OLM resources: written to correct omc paths (operators.coreos.com/clusterserviceversions/, etc.)
  • Pod logs: collector, target allocator, and bridge pod logs now collected (previously only operator manager logs were gathered)
  • Log path format: matches omc's pods/<pod>/<container>/<container>/logs/current.log convention so omc logs works
  • Binary renamed: /usr/bin/must-gather/usr/bin/gather (OpenShift must-gather convention)
  • Default collection dir: ~/must-gather/must-gather (matches the mount point oc adm must-gather provides)
  • e2e tests: removed stale must-gather check from otlp-metrics-traces (now covered by the dedicated must-gather test); added pod log assertions

Before / after

Before (main)
must-gather
├── deployment-opentelemetry-operator-controller-manager.yaml
├── namespaces
│   └── sandbox
│       └── otel
│           ├── configmap-otel-collector-601db117.yaml
│           ├── deployment-otel-collector.yaml
│           ├── opentelemetrycollector-otel.yaml
│           ├── poddisruptionbudget-otel-collector.yaml
│           ├── serviceaccount-otel-collector.yaml
│           ├── service-otel-collector-headless.yaml
│           ├── service-otel-collector-monitoring.yaml
│           └── service-otel-collector.yaml
├── olm
│   ├── clusterserviceversion-opentelemetry-operator-v0-144-0-1.yaml
│   ├── installplan-install-pjxxr.yaml
│   ├── operatorgroup-openshift-opentelemetry-operator.yaml
│   ├── operator-opentelemetry-product-openshift-opentelemetry-operator.yaml
│   └── subscription-opentelemetry-product.yaml
└── opentelemetry-operator-controller-manager-566597c897-vzp6b
After (this PR)
must-gather-output/
├── event-filter.html
├── image-registry-openshift-image-registry-svc-5000-openshift-must-gather-sha256-857ed10e79736210e52fe61d9c2691cd917839ca0a99a72c4092e81dc22001a4
│   ├── cluster-scoped-resources
│   │   ├── apiextensions.k8s.io
│   │   │   └── customresourcedefinitions
│   │   │       ├── instrumentations.opentelemetry.io.yaml
│   │   │       ├── opampbridges.opentelemetry.io.yaml
│   │   │       ├── opentelemetrycollectors.opentelemetry.io.yaml
│   │   │       └── targetallocators.opentelemetry.io.yaml
│   │   └── operators.coreos.com
│   │       └── operators
│   │           └── opentelemetry-product.openshift-opentelemetry-operator.yaml
│   ├── gather.logs
│   └── namespaces
│       ├── openshift-opentelemetry-operator
│       │   ├── apps
│       │   │   └── deployments
│       │   │       └── opentelemetry-operator-controller-manager.yaml
│       │   ├── core
│       │   │   └── pods
│       │   │       └── opentelemetry-operator-controller-manager-566597c897-vzp6b.yaml
│       │   ├── operators.coreos.com
│       │   │   ├── clusterserviceversions
│       │   │   │   └── opentelemetry-operator.v0.144.0-1.yaml
│       │   │   ├── installplans
│       │   │   │   └── install-pjxxr.yaml
│       │   │   ├── operatorgroups
│       │   │   │   └── openshift-opentelemetry-operator.yaml
│       │   │   └── subscriptions
│       │   │       └── opentelemetry-product.yaml
│       │   └── pods
│       │       └── opentelemetry-operator-controller-manager-566597c897-vzp6b
│       │           ├── manager
│       │           │   └── manager
│       │           │       └── logs
│       │           │           └── current.log
│       │           └── opentelemetry-operator-controller-manager-566597c897-vzp6b.yaml
│       └── sandbox
│           ├── apps
│           │   └── deployments
│           │       └── otel-collector.yaml
│           ├── core
│           │   ├── configmaps
│           │   │   └── otel-collector-601db117.yaml
│           │   ├── pods
│           │   │   └── otel-collector-7b946f87bc-zmz88.yaml
│           │   ├── serviceaccounts
│           │   │   └── otel-collector.yaml
│           │   └── services
│           │       ├── otel-collector-headless.yaml
│           │       ├── otel-collector-monitoring.yaml
│           │       └── otel-collector.yaml
│           ├── opentelemetry.io
│           │   └── opentelemetrycollectors
│           │       └── otel.yaml
│           ├── pods
│           │   └── otel-collector-7b946f87bc-zmz88
│           │       ├── otc-container
│           │       │   └── otc-container
│           │       │       └── logs
│           │       │           └── current.log
│           │       └── otel-collector-7b946f87bc-zmz88.yaml
│           └── policy
│               └── poddisruptionbudgets
│                   └── otel-collector.yaml
├── must-gather.logs
└── timestamp

Test plan

  • Unit tests for path construction, GVK population, pod YAML in log dir, and omc directory layout
  • Manual validation on CRC: omc get opentelemetrycollectors -A, omc get deployments, omc logs <pod> all return data
  • e2e test (tests/e2e-openshift/must-gather/) updated with collector pod log assertions
  • Removed stale must-gather check from otlp-metrics-traces e2e (covered by dedicated test)

Known issues

  • Error handling is still poor (all or nothing); the gather should ideally gracefully handle failures and "gather" whatever it can. This PR is already large so I'd keep that for a future change.

Try it

Build the image from ./cmd/gather/Dockerfile
Push it to openshift internal registry or an alternative (docker, quay)
Deploy the otel operator, a collector etc.
Run must-gather
oc adm must-gather --dest-dir ./your-output-dir --image=IMAGE_URL_AND_TAG -- /usr/bin/gather --operator-namespace OPERATOR_NAMESPACE

@ozzywalsh ozzywalsh force-pushed the fix-must-gather branch 2 times, most recently from 59c4537 to 70ee1ee Compare April 14, 2026 20:22
@ozzywalsh ozzywalsh changed the title WIP: Fix must-gather output for omc compatibility Fix must-gather output for omc compatibility Apr 15, 2026
@ozzywalsh ozzywalsh marked this pull request as ready for review April 15, 2026 12:24
@ozzywalsh ozzywalsh requested a review from a team as a code owner April 15, 2026 12:25
@ozzywalsh
Copy link
Copy Markdown
Contributor Author

Continuous Integration / Govulncheck is also failing on main; so unrelated to changes in this PR.

# Use pipe (|) for multiline entries.
subtext: |
Previously collected files used a per-collector directory with kind-prefixed filenames (e.g. `namespaces/<ns>/<collector-name>/deployment-<name>.yaml`),
which omc cannot parse. Output now follows the standard omc layout (`namespaces/<ns>/<api-group>/<resource-plural>/<name>.yaml`).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we find a way to implement a test to check the output with omc? Perhaps even as e2e test that runs on OCP where the tooling is available?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea. I'll look into and see if I can add something.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added an omc step to the e2e test script.

4037cd5

ozzywalsh added 11 commits May 21, 2026 12:13
We need to loop over the pods (in case replicas > 1)
processPodsByInstance was writing pod YAML but not collecting container
logs, so collector/TA/bridge pod logs were missing from must-gather
output. Now iterates all containers and calls getPodLogs for each.

Remove the stale must-gather check from the otlp-metrics-traces e2e
test — it used the old directory layout and is fully covered by the
dedicated must-gather test.
omc expects container logs at pods/<pod>/<container>/<container>/logs/
(container name repeated twice), not pods/<pod>/<container>/logs/.
Also, omc logs discovers pods from pods/<pod>/<pod>.yaml — write pod
YAML there so omc logs can find and display them.
The previous implementation appended "es"/"s" based on whether the Kind
string ended in "s". This produces wrong paths for kinds like NetworkPolicy
(→ "networkpolicys" instead of "networkpolicies"), which omc cannot find.

Replace it with an explicit map covering all Kinds the tool collects.
log.Fatalf on unknown kinds ensures omissions are caught immediately
rather than silently writing to a path omc cannot parse.
Download omc binary and verify it can parse the gathered output:
get opentelemetrycollectors, get deployments, get pods, and logs.
omc logs defaults to the 'default' project, so it couldn't find the
gather-collector pod in the chainsaw-must-gather namespace.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants