container: make cpu_manager_policy optional in kubelet_config #11572

Merged
merged 4 commits into from
Sep 12, 2024

wyardley
Contributor

@wyardley wyardley commented Aug 29, 2024

Part of hashicorp/terraform-provider-google#19225

This should resolve some confusing behavior with cpu_manager_policy

  • It frequently will show a permadrift when it can't be set
  • It also doesn't seem to accept the documented value of "none" as an empty value, though the previously undocumented empty string ("") seems to work.

efb71a9#r458238583
efb71a9#r473173480
☝️ context on when it was originally marked Required

This doesn't resolve all of the issues, but it does resolve cases where the field must be set even though it isn't actually needed (for example, if insecure_kubelet_readonly_port_enabled is set).

It appears that it was marked as Required somewhat arbitrarily (see above), and it's also possible that some of what's in place is tied to an API level bug that may have since been resolved. Maybe we could require at least one instead -- happy to do that if there's a good example to follow.

I did come across hashicorp/terraform-provider-google#15767 while testing this, but I think this is neutral as far as that goes.

Release Note Template for Downstream PRs (will be copied)

container: make `cpu_manager_policy` optional in `kubelet_config`

@modular-magician modular-magician added the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Aug 29, 2024
@wyardley wyardley marked this pull request as ready for review August 29, 2024 04:31
@github-actions github-actions bot requested a review from melinath August 29, 2024 04:31

Hello! I am a robot. Tests will require approval from a repository maintainer to run.

@melinath, a repository maintainer, has been assigned to review your changes. If you have not received review feedback within 2 business days, please leave a comment on this PR asking them to take a look.

You can help make sure that review is quick by doing a self-review and by running impacted tests locally.

@melinath
Member

Here's the full PR where it was originally marked required: #3760

@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 3 files changed, 5 insertions(+), 11 deletions(-))
google-beta provider: Diff ( 3 files changed, 5 insertions(+), 11 deletions(-))

Missing test report

Your PR includes resource fields which are not covered by any test.

Resource: google_container_cluster (382 total tests)
Please add an acceptance test which includes these fields. The test should include the following:

resource "google_container_cluster" "primary" {
  node_config {
    kubelet_config {
      cpu_manager_policy = # value needed
    }
  }
}

Member

@melinath melinath left a comment

If I'm understanding this correctly, the API has an API-side default of "none", but if you send "" or "none" it returns an empty value? If so, this might fall under https://googlecloudplatform.github.io/magic-modules/develop/permadiff/#default_if_empty

@melinath
Member

melinath commented Aug 29, 2024

@hoskeri could you confirm the expected API behavior?

@modular-magician
Collaborator

Tests analytics

Total tests: 203
Passed tests: 188
Skipped tests: 13
Affected tests: 2

Affected service packages:
  • container

Action taken

Found 2 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Affected tests:
  • TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates
  • TestAccContainerNodePool_withKubeletConfig

@wyardley
Contributor Author

wyardley commented Aug 29, 2024

I’m not sure if there’s a permadiff in this case; I imagine a real fix to the linked issue may involve some more code changes.

This is only intended to be a partial fix, in that I don’t believe this parameter is actually required. But I think the behavior may be a net neutral or slight improvement as it is, and it also lets people set unrelated parameters when needed. If the value is unset, I believe the behavior should already be the default, the same as when setting it to an empty string or "none".

If I’m understanding the original intent correctly, the main point was to make sure that at least one parameter was set in the block (and I think there’s an AtLeastOneOf function mentioned in that original commit that we could switch to if that’s the use case we’re concerned about).
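
For illustration only, the "at least one parameter set in the block" rule could look roughly like this in plain Go. This is not the Terraform SDK's AtLeastOneOf and not what the provider does; the helper name atLeastOneSet and the way zero values are treated are assumptions made for the sketch.

```go
package main

import "fmt"

// atLeastOneSet is a hypothetical helper (not part of any SDK). It reports
// whether any of the named keys is present in cfg with a non-zero value.
// Simplification: a bool set to false is treated the same as "not set".
func atLeastOneSet(cfg map[string]interface{}, keys ...string) bool {
	for _, k := range keys {
		switch v := cfg[k].(type) {
		case string:
			if v != "" {
				return true
			}
		case bool:
			if v {
				return true
			}
		}
	}
	return false
}

func main() {
	keys := []string{
		"cpu_manager_policy",
		"cpu_cfs_quota",
		"insecure_kubelet_readonly_port_enabled",
	}

	// Only an unrelated field is set: the block is not empty, so it passes
	// without cpu_manager_policy.
	onlyPort := map[string]interface{}{"insecure_kubelet_readonly_port_enabled": "FALSE"}
	fmt.Println(atLeastOneSet(onlyPort, keys...))

	// Nothing is set: this is the empty-block case a Required field guards against.
	fmt.Println(atLeastOneSet(map[string]interface{}{}, keys...))
}
```

The point of the sketch is just that an "at least one" rule guards against an empty block without forcing cpu_manager_policy in particular to be present.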

Maybe there are some other cases where certain things are required, but AFAICT, even the other cpu_* policies don't directly require this setting to be explicitly set.

So this PR lets people set other parameters in the same block without having to set cpu_manager_policy when it’s not needed, and adds / updates the docs very slightly, while punting on the other known issues with these resources (and the confusion of "none" vs "").

Put differently: other than allowing some people to remove the attribute, I don't believe this should cause a change in behavior for anyone who has it currently set to one of the 3 allowed values ("static", "", or "none"), and should simplify things and possibly avoid a permadiff for people who don't have it set at all, but also want to set, e.g., insecure_kubelet_readonly_port_enabled.

I don’t think this should change the behavior for existing users as best I could ascertain.

@github-actions github-actions bot requested a review from melinath August 29, 2024 21:16
@modular-magician
Collaborator

Tests passed during RECORDING mode:
TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates [Debug log]
TestAccContainerNodePool_withKubeletConfig [Debug log]

No issues found for passed tests after REPLAYING rerun.

All tests passed!

@wyardley
Contributor Author

TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates

Note: with this change, we could technically remove cpu_manager_policy from this test if we wanted to (and leave it in the broader TestAccContainerNodePool_withKubeletConfig one).

@wyardley
Contributor Author

wyardley commented Aug 29, 2024

I can't comment on how it's supposed to behave, obviously, so hopefully we'll get a response on that, but:

https://pkg.go.dev/google.golang.org/api/container/v1#NodeKubeletConfig

Control the CPU management policy on the node. See https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/
The following values are allowed.
  • "none": the default, which represents the existing scheduling behavior.
  • "static": allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.
The default value is 'none' if unspecified.
CpuManagerPolicy string `json:"cpuManagerPolicy,omitempty"`
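
As a rough illustration of what the omitempty tag on that field implies with Go's standard encoding/json: an empty string is dropped from the JSON entirely, while "none" and "static" are sent as-is. The struct below is a hypothetical, trimmed-down stand-in (only the JSON tag is copied from the docs quoted above), and the real client library may serialize requests differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeKubeletConfig is a hypothetical stand-in for container.NodeKubeletConfig.
// Only the JSON tag is taken from the documentation quoted above.
type nodeKubeletConfig struct {
	CpuManagerPolicy string `json:"cpuManagerPolicy,omitempty"`
}

// encodePolicy returns the JSON that the standard library produces for a
// given policy value.
func encodePolicy(policy string) string {
	b, err := json.Marshal(nodeKubeletConfig{CpuManagerPolicy: policy})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encodePolicy(""))       // {}
	fmt.Println(encodePolicy("none"))   // {"cpuManagerPolicy":"none"}
	fmt.Println(encodePolicy("static")) // {"cpuManagerPolicy":"static"}
}
```

So at the JSON level, "" and "none" are not the same request, even if the resulting node behavior is identical.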

Even as-is, though, "static" and "" (which is allowed) seem to work as expected, other than the update variant.

What I see is if I create a new cluster with the current 6.0.1 (unpatched) provider, it sends "none", and the API does seem to return it vs. returning empty for default. So maybe even though they have identical results, seemingly, setting none does not behave the same way as not setting it, if you look at the API response further down.

  node_config {
    kubelet_config {
      cpu_manager_policy = "none"
    }
  }

Call to API during initial cluster creation:

2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:   "nodeConfig": {
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:    "kubeletConfig": {
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:     "cpuCfsQuota": false,
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:     "cpuManagerPolicy": "none"
2024-08-29T15:30:52.742-0700 [DEBUG] provider.terraform-provider-google_v6.0.1_x5:    },

Interestingly, the API does seem to return a value vs. no value in that case: maybe @hoskeri can clarify. I'm not sure what that old comment about "default" as a value means or if that's still the case, but I haven't tried sending that, and I don't think the provider does either.

% gcloud container clusters describe --location us-central1-f xxxx | yq .nodeConfig.kubeletConfig
cpuCfsQuota: false
cpuManagerPolicy: none

(cpuCfsQuota being set to false is due to the existing known issues we're looking at separately)

This is probably good from the standpoint of the tf provider - there is no permadrift when setting the value to "none" explicitly, so I am guessing most of the issues we have with this parameter are around updates not working.

github-actions bot commented Sep 3, 2024

@melinath This PR has been waiting for review for 3 weekdays. Please take a look! Use the label disable-review-reminders to disable these notifications.

@wyardley
Contributor Author

wyardley commented Sep 3, 2024

Side note: I confirmed with someone that, at the time this was added, it was common practice to require one attribute to avoid a situation where an empty block was defined and caused issues; apparently there are much better checks for this in place now.

github-actions bot commented Sep 5, 2024

@GoogleCloudPlatform/terraform-team @melinath This PR has been waiting for review for 1 week. Please take a look! Use the label disable-review-reminders to disable these notifications.

@wyardley
Contributor Author

wyardley commented Sep 9, 2024

@melinath not sure if you heard back from @hoskeri, but my question is: if we can establish that this PR is relatively safe / neutral as-is in terms of behavior, would you and the team feel comfortable allowing it as-is before a proper resolution to hashicorp/terraform-provider-google#19225?

I could do some more testing if there are specific scenarios you're thinking of, but IMO, this will improve the existing behavior in a relatively safe way.

I do absolutely think the confusing behavior and docs should eventually be resolved and made more consistent, but I think that this change as-is should help avoid confusing diffs / behavior for users who need to set other values within kubelet_config.

Once we get into coercing "" and "none" to the same value, I think we have a greater risk that planning may show confusing diffs to the user, even if it can be done in a way that behaves the same.

@melinath
Member

melinath commented Sep 9, 2024

apologies for the delayed review. I've been working through a backlog. I should be able to take a look tomorrow. In general this does look like a good change; I just want to do some manual testing to make sure I understand how it behaves.

Member

@melinath melinath left a comment

I'm having some trouble reproducing the described permadiff (unless I try to update a cluster, but that's due to hashicorp/terraform-provider-google#19225, which I agree is orthogonal to this issue and doesn't need to be fixed in the same PR).

My main concern is that treating "" (unset) and "none" as distinct values even though they both behave the same may cause confusion for users, since the server could silently change the behavior for that case (similar to what will happen with insecure_kubelet_readonly_port_enabled eventually).

However, this block may also have problems with sending values to the API when they're unset - ref: hashicorp/terraform-provider-google#15767 and hashicorp/terraform-provider-google#19428 (for another block on the same resource.)

I'm currently inclined to go with your proposed implementation and just make the field optional with no additional changes.

@@ -1172,6 +1172,7 @@ func expandKubeletConfig(v interface{}) *container.NodeKubeletConfig {
     kConfig := &container.NodeKubeletConfig{}
     if cpuManagerPolicy, ok := cfg["cpu_manager_policy"]; ok {
         kConfig.CpuManagerPolicy = cpuManagerPolicy.(string)
+        kConfig.ForceSendFields = append(kConfig.ForceSendFields, "CpuManagerPolicy")
Member

This will cause the field to be sent even if unset - per hashicorp/terraform-provider-google#15767 that might not be necessary / desired?
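
To make the concern concrete, here is a minimal sketch of the difference between force-sending the field and leaving it to omitempty-style behavior. Everything here is hypothetical: kubeletConfig, its forceSend flag, and requestBody only imitate the effect of listing "CpuManagerPolicy" in ForceSendFields, and the real client library implements this differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kubeletConfig is a hypothetical local stand-in for container.NodeKubeletConfig.
// forceSend imitates the effect of adding "CpuManagerPolicy" to ForceSendFields.
type kubeletConfig struct {
	CpuManagerPolicy string
	forceSend        bool
}

// MarshalJSON emits cpuManagerPolicy when it is non-empty, or when forceSend
// is true even though the value is the empty string.
func (k kubeletConfig) MarshalJSON() ([]byte, error) {
	out := map[string]string{}
	if k.CpuManagerPolicy != "" || k.forceSend {
		out["cpuManagerPolicy"] = k.CpuManagerPolicy
	}
	return json.Marshal(out)
}

// requestBody mirrors the shape of the expand step in the diff above: the
// field is copied whenever the key exists in cfg, set or not.
func requestBody(cfg map[string]interface{}, forceSend bool) string {
	k := kubeletConfig{}
	if s, ok := cfg["cpu_manager_policy"].(string); ok {
		k.CpuManagerPolicy = s
		k.forceSend = forceSend
	}
	b, err := json.Marshal(k)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// The key exists but holds the zero value, i.e. the user did not set it.
	unset := map[string]interface{}{"cpu_manager_policy": ""}

	fmt.Println(requestBody(unset, true))  // {"cpuManagerPolicy":""}
	fmt.Println(requestBody(unset, false)) // {}
}
```

In this sketch, force-sending puts an explicit empty string on the wire for an unset attribute, which is the behavior being questioned here.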

Contributor Author

It's been a few, so I may have to run through some different scenarios to remember exactly why this ended up seemingly being needed.

IIRC, there was some kind of issue when I didn't have it set -- at the least, I did try it first without that line added.

If you're already set up to test it, you could try building the provider with this line removed and step through some of the tests that I did. If I've got time to dig into it, I will try to see as well.

Contributor Author

I did at least a basic test with this line removed again, and a config with other kubelet_config stuff set seemed to do what I'd expect.

Another scenario to look at is when the nodepool is managed separately (where nodepool updates do work)... I think there's a case where removing the attribute when it was set might cause an issue when the attribute doesn't get set in the API call. That said, I haven't been able to induce that issue thus far with this example, and TestAccContainerNodePool_withKubeletConfig passes, so I'm pushing up an update that removes this again. I didn't test "" but since that's not a documented value, probably doesn't matter too much even if that did break somehow.

resource "google_container_cluster" "test_cfs_quota" {
  name               = "test-cfs-quota"
  location           = "us-central1-f"
  initial_node_count = 1

  node_config {
    kubelet_config {
      cpu_cfs_quota                          = true
      insecure_kubelet_readonly_port_enabled = "FALSE"
    }
  }

  deletion_protection = false
  network             = "default"
  subnetwork          = "default"
}

resource "google_container_node_pool" "test" {
  name       = "secondary-test"
  cluster    = google_container_cluster.test_cfs_quota.id
  node_count = 1

  node_config {
    kubelet_config {
      cpu_cfs_quota                          = true
      cpu_manager_policy                     = "none"
      insecure_kubelet_readonly_port_enabled = "FALSE"
    }
  }
}

@github-actions github-actions bot requested a review from melinath September 11, 2024 04:55
wyardley and others added 2 commits September 10, 2024 21:56
Partial fix for hashicorp/terraform-provider-google#19225

This should resolve some confusing behavior with `cpu_manager_policy`

* It frequently will show a permadrift when it can't be set
* It also doesn't seem to accept the documented value of "none" as an
  empty value, though the previously undocumented empty string (`""`)
  seems to work.

This doesn't resolve all of the issues, but resolves other issues where
it must be set where it's not actually needed (for example, if
`insecure_kubelet_readonly_port_enabled` is set).

It appears that it was marked as `Required` somewhat arbitrarily, and
it's also possible that some of what's in place is tied to an API level
bug that may have since been resolved.
@wyardley wyardley force-pushed the wyardley/19225_partial branch from 38cfe3c to d7bd3d9 Compare September 11, 2024 04:56
@modular-magician modular-magician added the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Sep 11, 2024
@melinath
Member

yeah force pushes that leave the history intact are fine.

@modular-magician modular-magician removed the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Sep 11, 2024
@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 3 files changed, 3 insertions(+), 11 deletions(-))
google-beta provider: Diff ( 3 files changed, 3 insertions(+), 11 deletions(-))

Missing test report

Your PR includes resource fields which are not covered by any test.

Resource: google_container_cluster (388 total tests)
Please add an acceptance test which includes these fields. The test should include the following:

resource "google_container_cluster" "primary" {
  node_config {
    kubelet_config {
      cpu_manager_policy = # value needed
    }
  }
}

Member

@melinath melinath left a comment

The more I think about this the more I agree with you that just making the field optional is safest for now. We may want to make other changes in the future but this fixes the immediate issue.

It would be best practice to make sure the google_container_cluster.node_config.kubelet_config.cpu_manager_policy field is used in at least one test, though.

@modular-magician
Collaborator

Tests analytics

Total tests: 208
Passed tests: 193
Skipped tests: 13
Affected tests: 2

Affected service packages:
  • container

Action taken

Found 2 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Affected tests:
  • TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates
  • TestAccContainerNodePool_withKubeletConfig

@modular-magician
Collaborator

Tests passed during RECORDING mode:
TestAccContainerCluster_withInsecureKubeletReadonlyPortEnabledInNodeConfigUpdates [Debug log]
TestAccContainerNodePool_withKubeletConfig [Debug log]

No issues found for passed tests after REPLAYING rerun.

All tests passed!

@wyardley
Contributor Author

Sure - I can add it. Do you want it in an existing test or a standalone new test case?

only caveat is that, since we already know that update is broken when it’s used that way, it probably can’t be a test where the resource is updated.

@github-actions github-actions bot requested a review from melinath September 12, 2024 01:22
@wyardley
Contributor Author

It would be best practice to make sure the google_container_cluster.node_config.kubelet_config.cpu_manager_policy field is used in at least one test, though.

It is still present here, right?

@melinath
Member

melinath commented Sep 12, 2024

Sure - I can add it. Do you want it in an existing test or a standalone new test case?

only caveat is that, since we already know that update is broken when it’s used that way, it probably can’t be a test where the resource is updated.

An existing test would be ideal (but it doesn't have to be one of the ones already modified in this PR).

It would be best practice to make sure the google_container_cluster.node_config.kubelet_config.cpu_manager_policy field is used in at least one test, though.

It is still present here, right?

That's node_pool.node_config.kubelet_config (as opposed to just node_config.kubelet_config).

@wyardley
Contributor Author

An existing test would be ideal (but it doesn't have to be one of the ones already modified in this PR).

@melinath I didn't see any really obvious choice, and the insecure_... one won't work because of updates, so I made a new one that includes all the other node_config.kubelet_config settings, with a note. It could be folded into the insecure kubelet one later if someone were so inclined. Checking that it passes now, but let me know if this approach looks OK to you.

@modular-magician modular-magician added the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Sep 12, 2024
@melinath
Member

sure, seems fine

@modular-magician modular-magician removed the awaiting-approval Pull requests that need reviewer's approval to run presubmit tests label Sep 12, 2024
@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 3 files changed, 59 insertions(+), 11 deletions(-))
google-beta provider: Diff ( 3 files changed, 59 insertions(+), 11 deletions(-))

@modular-magician
Collaborator

Tests analytics

Total tests: 209
Passed tests: 195
Skipped tests: 13
Affected tests: 1

Affected service packages:
  • container

Action taken

Found 1 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Affected tests:
  • TestAccContainerCluster_withNodeConfigKubeletConfigSettings

@modular-magician
Collaborator

Tests passed during RECORDING mode:
TestAccContainerCluster_withNodeConfigKubeletConfigSettings [Debug log]

No issues found for passed tests after REPLAYING rerun.

All tests passed!

@melinath melinath merged commit c920b8f into GoogleCloudPlatform:main Sep 12, 2024
11 checks passed
iyabchen pushed a commit to iyabchen/magic-modules that referenced this pull request Sep 14, 2024
abd-goog pushed a commit to abd-goog/magic-modules that referenced this pull request Sep 23, 2024
@wyardley wyardley deleted the wyardley/19225_partial branch September 25, 2024 19:23
niharika-98 pushed a commit to niharika-98/magic-modules that referenced this pull request Oct 7, 2024
Philip-Jonany pushed a commit to Philip-Jonany/magic-modules that referenced this pull request Nov 4, 2024