NO-JIRA: Switch layered build to treefile-apply, drain get-ocp-repo.sh #1780

Merged
merged 11 commits into from
Apr 4, 2025

Conversation

jlebon
Member

@jlebon jlebon commented Mar 28, 2025

See individual commit messages.

@jlebon jlebon changed the title Nuke okd-c9s variant NO-JIRA: Nuke okd-c9s variant Mar 28, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 28, 2025
@openshift-ci-robot

@jlebon: This pull request explicitly references no jira issue.

In response to this:

This is not built anywhere by anyone. OKD has moved to the new layered node image model and uses the output from the c9s variant we currently build internally.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from c4rt0 and marmijo March 28, 2025 16:03
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 28, 2025
@jlebon
Member Author

jlebon commented Mar 28, 2025

[1/3] STEP 9/9: RUN --mount=type=secret,id=yumrepos,target=/os/secret.repo if [[ -n "${VARIANT}" ]]; then MANIFEST="manifest-${VARIANT}.yaml"; EXTENSIONS="extensions-${VARIANT}.yaml"; else MANIFEST="manifest.yaml"; EXTENSIONS="extensions.yaml"; fi && rpm-ostree compose extensions --rootfs=/ --output-dir=/usr/share/rpm-ostree/extensions/ ./"${MANIFEST}" ./"${EXTENSIONS}"
error: Can't open file "./manifest-okd-c9s.yaml" for reading: No such file or directory (os error 2) 

Man, this get-ocp-repo.sh script has become quite the beast. I'm thinking of reworking how that works entirely to make it saner.

@jlebon jlebon changed the title NO-JIRA: Nuke okd-c9s variant NO-JIRA: Switch layered build to treefile-apply, drain get-ocp-repo.sh Mar 31, 2025
@jlebon
Member Author

jlebon commented Mar 31, 2025

This requires coreos/rpm-ostree#5351 and coreos/coreos-assembler#4054.

@jlebon
Member Author

jlebon commented Mar 31, 2025

cc @Prashanth684 since this also touches OKD

Contributor

@jbtrystram jbtrystram left a comment

Awesome !
Just a couple of questions :)

@@ -16,9 +16,6 @@ supported:
- `rhel-9.6`: RHEL 9.6-based CoreOS; without OpenShift components.
- `ocp-rhel-9.6`: RHEL 9.6-based CoreOS; including OpenShift components.
- `c9s`: CentOS Stream-based CoreOS, without OKD components.
- `okd-c9s`: CentOS Stream-based CoreOS, including OpenShift components. This
Contributor

I still see the okd-c9s variant used in the okd/scos build pipeline [1] run in MOC and more specifically in the latest commit okd-project/okd-coreos-pipeline@d4be53e for 4.19.
But according to openshift/release#62296 the scos imagestream (to be used as node image) is now populated by the OpenShift CI itself instead of the MOC pipeline.

So maybe we should decommission the MOC pipeline [2] before merging this patch? What do you think @Prashanth684? It's not a blocker though; the MOC builds would just fail, and that can be dealt with as a follow-up.

[1] https://github.com/search?q=repo%3Aokd-project%2Fokd-coreos-pipeline%20okd-c9s&type=code
[2] https://github.com/okd-project/okd-coreos-pipeline

Member Author

The okd-c9s "variant" used in that pipeline is for the extensions build, not the base OS AFAIK.

That said, there is indeed a small cleanup possible there which is that it no longer needs to provide a VARIANT argument to the extensions build since that's auto-detected now.
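For readers wondering what "auto-detected" could look like in practice, here is a minimal sketch. It assumes the variant can be derived from the `ID` and `VERSION_ID` fields of the base image's os-release file; the helper name `detect_variant` and the mapping to variant names are hypothetical and are not taken from this PR.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: derive a variant name from an os-release file.
# The ID -> variant mapping below is an assumption made for illustration.
# The function body is a subshell so the sourced variables do not leak.
detect_variant() (
    ID="" VERSION_ID=""
    # os-release is a shell-compatible KEY=value file, so sourcing it works.
    # shellcheck disable=SC1090
    source "${1:-/etc/os-release}"
    case "$ID" in
        rhel)   echo "rhel-${VERSION_ID}" ;;
        centos) echo "c${VERSION_ID}s" ;;
        *)      echo "unsupported OS: ${ID}" >&2; exit 1 ;;
    esac
)

# Possible usage inside a build step (commented out on purpose):
#   variant="$(detect_variant)"
#   MANIFEST="manifest-${variant}.yaml"
```

With something along these lines, callers would not need to pass a `VARIANT` argument at all, which is the cleanup suggested above.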

Contributor

So maybe we should decommission the MOC pipeline [2] before merging this patch? What do you think @Prashanth684? It's not a blocker though; the MOC builds would just fail, and that can be dealt with as a follow-up.

Correct. MOC is only used for 4.18. Once we release 4.19 as stable, we will stop those also. We are working to migrate off MOC (we still do OKD release promotions from there) to an internal cluster.

@jcapiitao
Contributor

It looks like the CI base image does not contain the /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial file containing the CentOS Stream GPG keys.

@jlebon
Member Author

jlebon commented Apr 1, 2025

It looks like the CI base image does not contain the /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial file containing the CentOS Stream GPG keys.

Yeah, it needed coreos/coreos-assembler#4054. That's merged now but there's no point retriggering the tests until coreos/rpm-ostree#5351 percolates down too. We're working on that.

@jlebon
Member Author

jlebon commented Apr 2, 2025

/retest

@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch from b8fbcd9 to ba35f21 Compare April 2, 2025 19:10
@jlebon
Member Author

jlebon commented Apr 2, 2025

OK, we have new enough rpm-ostree and cosa now. Let's try this out!

@jlebon
Member Author

jlebon commented Apr 2, 2025

@Prashanth684 Can you take care of this bit once this lands:

That said, there is indeed a small cleanup possible there which is that it no longer needs to provide a VARIANT argument to the extensions build since that's auto-detected now.

Member

@dustymabe dustymabe left a comment

mostly LGTM - some comments

Comment on lines 44 to 45
# buildah doesn't seem to support heredoc output
# redirection like buildkit so do it manually here
Member

is there a bug for this we can link to?

Member Author

I didn't find one, but I could file one I guess.

Ahhh OK, just saw your other comment below related to this. The feature I'm talking about here isn't just bash redirection, but a RUN feature so you can write e.g.

RUN <<EOF > /out
echo foobar
EOF

and it'll go to /out. Which would've been perfect for our use case here.
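As a point of comparison, the same effect can be had in plain shell, which is roughly what "do it manually" amounts to. This is only a sketch: the temporary file stands in for `/out`, and nothing here is taken from the actual Containerfile in this PR.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical output path standing in for /out.
out="$(mktemp)"

# Run the heredoc body as a script and capture its stdout in "$out".
# Quoting 'EOF' keeps the body from being expanded by the outer shell.
bash > "$out" <<'EOF'
echo foobar
EOF

cat "$out"
```

The difference is that here the redirection belongs to the shell command, whereas the `RUN` form described above would have the builder handle it.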

Comment on lines 16 to 17
MANIFEST="manifest-rhel-9.6.yaml"
EXTENSIONS="extensions-ocp-rhel-9.6.yaml"
Member

Does this imply these paths will need to be updated each time we bump to the next version of RHEL (i.e. when RHEL 9.7 or 9.8 arrives)?

I would kind of prefer that we didn't have to remember to update these variables when that happens.

Member Author

Not 9.7, but 9.8 yeah. It would be part of all the tree updates we do after branching for 4.22 I guess? When bumping the base variant from rhel-9.6 to rhel-9.8. Though by then we should be on top of rhel-bootc which likely means we can rework this too since it's then driven by the base RHEL version of the rhel-bootc image.

@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch from ba35f21 to 7bbe649 Compare April 3, 2025 03:19
@jlebon
Member Author

jlebon commented Apr 3, 2025

Hmm, the builder is choking on the heredocs. I think possibly the Dockerfile parser there (which happens before it's handed off to buildah) is getting tripped up on something. :( I'll dig into this a bit, but if it can't be easily worked around, I think I'll just add a commit for now that moves the heredocs to shell scripts.

c9s builds are failing on

 error: Installing packages: Updating rpm-md repo 'c9s-baseos-mirror': Failed to download gpg key for repo 'c9s-baseos-mirror': Curl error (37): Could not read a file:// file for file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial [Couldn't open file /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial] 

which I have no idea why. It's using new enough cosa with coreos/coreos-assembler#4054. And using that new cosa locally I can't reproduce this.

Contributor

@jcapiitao jcapiitao left a comment

See my inline comment, sorry I missed it in my previous review.
Otherwise LGTM 👍

@jcapiitao
Contributor

Hmm, the builder is choking on the heredocs. I think possibly the Dockerfile parser there (which happens before it's handed off to buildah) is getting tripped up on something. :( I'll dig into this a bit, but if it can't be easily worked around, I think I'll just add a commit for now that moves the heredocs to shell scripts.

c9s builds are failing on

 error: Installing packages: Updating rpm-md repo 'c9s-baseos-mirror': Failed to download gpg key for repo 'c9s-baseos-mirror': Curl error (37): Could not read a file:// file for file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial [Couldn't open file /etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial] 

which I have no idea why. It's using new enough cosa with coreos/coreos-assembler#4054. And using that new cosa locally I can't reproduce this.

It's because the latest CI job run pulled cosa commit b45a4066b16a2332517f659111d4f474372f77d5 [1]

+ [[ -d /cosa ]]
+ jq .
{
  "date": "2025-04-02T17:18:40Z",
  "git": {
    "commit": "b45a4066b16a2332517f659111d4f474372f77d5",
    "origin": "https://github.com/coreos/coreos-assembler.git",
    "branch": "HEAD",
    "dirty": "false"
  },

and not the latest one with the changes we want coreos/coreos-assembler@ae0e86a

Maybe some cache issue or race condition issue ? I'm not yet familiar with the CI workflow to be sure, maybe a simple /retest should work

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_os/1780/pull-ci-openshift-os-master-scos-9-build-test-qemu/1907634163477909504/build-log.txt

@jcapiitao
Contributor

Hmm, forget about it, that was a wrong assumption: the codebase up to b45a4 contains the changes we want from https://github.com/coreos/coreos-assembler/pull/4054/commits
It's odd that it does not work in CI.

@jlebon
Member Author

jlebon commented Apr 3, 2025

OK, coreos/coreos-assembler#4059 should fix the GPG issue!

@jbtrystram
Contributor

/retest

jlebon added 3 commits April 4, 2025 09:18
Instead of having to explicitly pass in the `VARIANT`, we can autodetect
it based on the node image we're building `FROM`.

`apply-manifest` was essentially folded back into rpm-ostree in:
coreos/rpm-ostree#5274

The only thing we need to keep is the workaround for cri-o's `/var/opt`,
which... we should just try to get fixed.

Previously, when building the layered node image, we were relying on the
default repo enablement settings. This though is at the source of a lot
of complexity because then we need to make sure that we only inject just
the repos that we need with the right enablement. See e.g. the complex
logic in `get-ocp-repo.sh`.

Let's instead match the semantics already in use by the base compose and
extensions builds, both of which explicitly list the repos to enable.
This means that we can be a lot less careful in what repo definitions we
inject into the build environment, knowing only the necessary ones will
be enabled.

This is pretty easy to do now that (1) rpm-ostree supports inlined
treefiles, and (2) `treefile-apply` supports a `--var` option to define
variables at invocation time.
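To make "explicitly list the repos to enable" concrete, here is a rough sketch of a script that writes a small treefile whose `repos` and `packages` keys name exactly what the build should use. The helper name `write_treefile` and all repo and package names are placeholders. The `treefile-apply` invocation itself is deliberately left out, since its exact command line does not appear in this thread.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: write a minimal treefile that explicitly lists repos.
# `repos` and `packages` are standard treefile keys; the names passed in
# below are placeholders, not the repos used by this PR.
write_treefile() {
    local out=$1 repos=$2 packages=$3
    local r p
    {
        echo "repos:"
        for r in $repos; do echo "  - $r"; done
        echo "packages:"
        for p in $packages; do echo "  - $p"; done
    } > "$out"
}

treefile="$(mktemp)"
write_treefile "$treefile" "example-baseos example-appstream" "example-package"
cat "$treefile"
```

Because only the listed repos are enabled, any extra `.repo` files present in the build environment would simply be ignored, which is the property the commit message relies on.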
@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch 3 times, most recently from 100d9f8 to ebd3ff7 Compare April 4, 2025 14:18
jlebon added 2 commits April 4, 2025 10:50
Now that (1) we've reworked the layered node image build to only enable
the repos it needs, and (2) we've simplified the CentOS Stream GPG keys,
we can delete all of the complex logic in this repo. It basically just
boils down to curl'ing down all the repo files we may need to build the
various artifacts that use this script.
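A script that "basically just boils down to curl'ing down all the repo files" might look roughly like the sketch below. The function name `fetch_repos` and the URLs in the usage comment are hypothetical; the real `get-ocp-repo.sh` is not reproduced here.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: download each repo file URL into a destination
# directory, keeping the file name from the URL.
fetch_repos() {
    local dest=$1; shift
    local url
    mkdir -p "$dest"
    for url in "$@"; do
        # -f: fail on HTTP errors, -sS: quiet but still report errors,
        # -L: follow redirects.
        curl -fsSL "$url" -o "$dest/${url##*/}"
    done
}

# Possible usage (placeholder URLs, commented out on purpose):
#   fetch_repos /etc/yum.repos.d \
#       https://example.invalid/repos/example-a.repo \
#       https://example.invalid/repos/example-b.repo
```

No enablement logic is needed in such a script if the consumers list the repos they want explicitly.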
We only want certain packages to come from the 4.19 plashet. And we
can't just rely on NVRs because the plashet may sometimes win. Long-term
we should sever that dependence on ART packages, but for now, let's add
a hack to essentially generate a repo on the fly from the 4.19 repo with
the filters we need.

The advantage of doing it this way instead of e.g. in the
`get-ocp-repo.sh` script is that this applies both in CI and locally.
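The commit message does not say how the filtered repo is generated. One conventional way to restrict a repo to certain packages is the `includepkgs` option in a repo definition, sketched below. The helper name, repo ID, URL and package names are all placeholders, and this is not necessarily what the commit does.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: emit a repo stanza restricted to selected packages.
# Usage: filtered_repo ID BASEURL PKG...
filtered_repo() {
    local id=$1 baseurl=$2; shift 2
    local pkgs
    # Join the remaining arguments with commas.
    pkgs="$(IFS=,; echo "$*")"
    cat <<EOF
[$id]
name=$id
baseurl=$baseurl
enabled=1
includepkgs=$pkgs
EOF
}

filtered_repo example-filtered https://example.invalid/repo/ example-pkg-a example-pkg-b
```

A stanza like this leaves the original repo definition untouched, so the filter travels with the build definition rather than with the script that fetches repo files.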
@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch from ebd3ff7 to ac98e93 Compare April 4, 2025 14:51
jlebon added 2 commits April 4, 2025 11:26
The OCP builder API path isn't parsing the heredoc correctly for some
reason:

     error: build error: EOF: unterminated heredoc

This will be fixed by openshift/builder#469.

Anyway, just work around this for now by moving all the logic to
scripts. It does make the Containerfiles cleaner at least now that it
has gotten so large and we get syntax highlighting, ShellCheck, etc...
so probably for the best.

Before we inherited this from the ocp-rhel-9.6 manifest. But now that
we're inheriting from the rhel-9.6 manifest, that repo isn't enabled
by default there since it's not strictly needed (because we don't ship
openvswitch in the base).

So we need to enable it here ourselves.
@jlebon jlebon force-pushed the pr/nuke-okd-c9s branch from ac98e93 to 7997d1a Compare April 4, 2025 15:35
@jlebon
Member Author

jlebon commented Apr 4, 2025

OK, the OKD node image build is failing on

Error: Unknown repo: 'c9s-baseos'

and I know why. I've got a fix for that, but let's see if CI passes for the other tests. If it does, then let's get this in, and I'll add my fix to #1498 instead when I rebase it.

@dustymabe
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 4, 2025
Contributor

openshift-ci bot commented Apr 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustymabe, jbtrystram, jlebon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [dustymabe,jbtrystram,jlebon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 48a1891 and 2 for PR HEAD 7997d1a in total

@jlebon
Member Author

jlebon commented Apr 4, 2025

This is a known failure. It'll be fixed in #1498.

/override ci/prow/okd-scos-images

OK, as for the RHCOS failure, it seems to have just... timed out?

 --- FAIL: ext.config.shared.rpm-ostree.kernel-replace (1206.73s)
        harness.go:106: TIMEOUT[20m0s]: ssh: sudo /usr/local/bin/kolet run-test-unit kola-runext.service
        harness.go:106: TIMEOUT[20m0s]: ssh: journalctl -t kola-runext-kernel-replace 

Nothing in particular in the journal logs. Last operation was

Apr  4 18:20:36.204321 rpm-ostreed.service[1391]: Fetching ostree-unverified-image:oci-archive:/var/tmp/coreos-derived.ociarchive

So possibly a genuine timeout because of e.g. slow I/O.

Anyway, don't think it needs to block this. It'll rerun in #1498.

Contributor

openshift-ci bot commented Apr 4, 2025

@jlebon: Overrode contexts on behalf of jlebon: ci/prow/okd-scos-images

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Apr 4, 2025

@jlebon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | 7997d1a | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/rhcos-9-build-test-qemu | 7997d1a | link | true | /test rhcos-9-build-test-qemu |
| ci/prow/e2e-aws | 7997d1a | link | false | /test e2e-aws |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jlebon jlebon merged commit 902882d into openshift:master Apr 4, 2025
10 of 14 checks passed
@jlebon jlebon deleted the pr/nuke-okd-c9s branch April 4, 2025 18:51
jbtrystram added a commit to jbtrystram/openshift-os that referenced this pull request Apr 7, 2025
We now need to support both EL9 and EL10.
Using the conditional includes for treefiles
added in [1], update `osversion` to contain the
variant (centos/rhel) and the major version.

This allows the layered build to source
`/etc/release` and include the correct repos.

Update denylist entries to match that.

[1] openshift#1780
dustymabe pushed a commit to dustymabe/os that referenced this pull request Apr 8, 2025
We now need to support both EL9 and EL10.
Using the conditional includes for treefiles
added in [1], update `osversion` to contain the
variant (centos/rhel) and the major version.

This allows the layered build to source
`/etc/release` and include the correct repos.

Update denylist entries to match that.

[1] openshift#1780
dustymabe pushed a commit to jbtrystram/openshift-os that referenced this pull request Apr 8, 2025
We now need to support both EL9 and EL10.
Using the conditional includes for treefiles
added in [1], update `osversion` to contain the
variant (centos/rhel) and the major version.

This allows the layered build to source
`/etc/release` and include the correct repos.

Update denylist entries to match that.

[1] openshift#1780
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
6 participants