[nexus] webhooks #7277

Open
wants to merge 221 commits into
base: main

Member

@hawkw hawkw commented Dec 18, 2024

This branch adds an MVP implementation of the internal machinery for delivering webhooks from Nexus. This includes:

  • webhook-related external API endpoints (as described in RFD 538)
  • database tables for storing webhook receiver configurations and webhook events, and for tracking
    their delivery status
  • background tasks for actually delivering webhook events to receivers

The user-facing interface for webhooks is described in greater detail in RFD 538. The code change in this branch includes a "Big Theory Statement" comment that describes most of the implementation details, so reviewers are encouraged to refer to that for more information on the implementation.

Future Work

Immediate follow-up work (i.e. stuff I'd like to do shortly but would prefer to land in separate PRs):

  • Garbage collection for old records in the webhook_delivery, webhook_delivery_attempt, and webhook_event CRDB tables (need to figure out a good retention policy for events)
  • omdb db webhooks commands for actually looking at the webhook database tables
  • Oximeter metrics tracking webhook delivery attempt outcomes and latencies

Not currently planned, but possible future work:

  • Actually record webhook events when stuff happens :)
  • Some mechanism for communicating JSON schemas for webhook event payloads (either via OpenAPI 3.1, by sticking JSON schemas in the /v1/webhooks/event-classes endpoints, or both)
  • Allow webhook receivers to have roles with more restrictive permissions than fleet.viewer (see RFD 538 Appendix B.3); probably requires service accounts
  • Track receiver liveness and alert when a receiver has gone away (see RFD 538 Appendix B.4)

@hawkw hawkw force-pushed the eliza/webhook-models branch from 51f7f8e to 139cfe6 Compare December 18, 2024 21:10
@hawkw hawkw changed the base branch from eliza/webhook-api to main December 18, 2024 21:11
@hawkw hawkw requested a review from augustuswm December 18, 2024 21:11
@hawkw hawkw force-pushed the eliza/webhook-models branch 2 times, most recently from 140aea4 to 0b80c8f Compare January 8, 2025 17:28
@hawkw hawkw changed the title [nexus] Webhook DB models [nexus] webhooks Jan 11, 2025
@hawkw hawkw force-pushed the eliza/webhook-models branch from 41cf0b0 to 2bc5925 Compare January 17, 2025 19:20
Member Author

hawkw commented Jan 24, 2025

I think I've come around a bit to @andrewjstone's proposal that the event classes be a DB enum, so I'm planning to change that. I'd like to have a way to include a couple "test" variants in there that aren't exposed in the public API, so I'll be giving some thought to how to deal with that.

Member Author

hawkw commented Jan 24, 2025

> I think I've come around a bit to @andrewjstone's proposal that the event classes be a DB enum, so I'm planning to change that.

Glob subscription entries in webhook_rx_event_glob should capture the schema version when they're created, so that we can trigger reprocessing (generating the exact event class subscriptions for those globs) if the schema has changed. It's probably fine for nexus to do glob reprocessing on startup rather than in a bg task, although online update might invalidate that assumption.
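
A minimal sketch of the schema-version check described above. The `GlobSubscription` struct, its field names, and the integer version are assumptions made for illustration; the real `webhook_rx_event_glob` columns and version type may well differ.

```rust
/// Hypothetical in-memory stand-in for a row in `webhook_rx_event_glob`.
/// Field names and the integer version type are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct GlobSubscription {
    /// The glob pattern as the user wrote it, e.g. "my.event.*".
    pub glob: String,
    /// Schema version that was current when the exact subscriptions
    /// for this glob were last generated.
    pub schema_version: u32,
}

impl GlobSubscription {
    /// A glob is stale if it was last processed under a different schema
    /// version than the one currently in use.
    pub fn needs_reprocessing(&self, current_schema_version: u32) -> bool {
        self.schema_version != current_schema_version
    }
}

/// Returns the globs whose exact subscriptions should be regenerated,
/// for example once at startup.
pub fn stale_globs(
    globs: &[GlobSubscription],
    current_schema_version: u32,
) -> Vec<&GlobSubscription> {
    globs
        .iter()
        .filter(|g| g.needs_reprocessing(current_schema_version))
        .collect()
}
```

Comparing for inequality rather than "older than" means a rollback would also trigger reprocessing; whether that is desirable is a design choice this sketch does not settle.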

Member Author

hawkw commented Jan 24, 2025

As far as GCing old events from the event table, dispatching an event should probably add a count of the number of receivers it was dispatched to, and then when we successfully deliver the event, we increment a count of successes. That way, we would not consider an event entry eligible to be deleted unless the two counts are equal; we want to hang onto events that weren't successfully delivered so any failed deliveries can be re-triggered.

GCing an event would also clean up any child delivery attempt records.
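
A rough sketch of the counting scheme proposed in this comment. `EventCounts` and its fields are hypothetical names, and treating a not-yet-dispatched event as ineligible is an assumption rather than something stated above.

```rust
/// Hypothetical bookkeeping for one row in the event table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EventCounts {
    /// Number of receivers the event was dispatched to.
    /// `None` means the event has not been dispatched yet (assumption).
    pub num_dispatched: Option<u32>,
    /// Number of receivers the event was successfully delivered to.
    pub num_delivered: u32,
}

impl EventCounts {
    /// Record how many receivers the event was dispatched to.
    pub fn on_dispatch(&mut self, receivers: u32) {
        self.num_dispatched = Some(receivers);
    }

    /// Record one successful delivery.
    pub fn on_delivery_success(&mut self) {
        self.num_delivered += 1;
    }

    /// An event may be garbage collected only once every dispatched
    /// delivery has succeeded, so failed deliveries can still be re-triggered.
    pub fn eligible_for_gc(&self) -> bool {
        matches!(self.num_dispatched, Some(n) if n == self.num_delivered)
    }
}
```

Under this sketch an event dispatched to zero receivers is immediately eligible, since both counts are zero.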

@hawkw hawkw enabled auto-merge (squash) April 17, 2025 21:22
path_params: Path<params::WebhookReceiverSelector>,
) -> Result<HttpResponseOk<views::WebhookReceiver>, HttpError>;

/// Create webhook receiver.
Contributor

Remove periods at the end of these, please. They end up in the docs site sidebar. Wording otherwise looks good to me. Can't really get any shorter.

Contributor

Actually one wording change: remove "a" from "List delivery attempts to a webhook receiver"

@hawkw hawkw disabled auto-merge April 22, 2025 15:56
@benjaminleonard
Contributor

I think from a client perspective we are better off moving the receiver's event update into its own endpoint/s like secrets.

By doing it in webhook_receiver_update it means:

  1. The user must include the title and description when updating events
  2. The user must include the events when updating title and/or description

There's also the possibility here of accidentally updating some other value incorrectly in the process. We can pre-fill in the console, but I see this being an issue on the CLI. I also think separating the actions in the console will be helpful anyway.

Again, apologies for jumping in the last moment but this looks great. Real monumental effort!
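
To illustrate the difference being discussed, here is a simplified sketch contrasting a combined update with separate subscription operations. All types and function names here are hypothetical and are not the Nexus API; the only point is that the split design lets a client change subscriptions without resending the name and description.

```rust
/// Hypothetical, simplified receiver model; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct Receiver {
    pub name: String,
    pub description: String,
    pub events: Vec<String>,
}

/// Combined update body: changing any one field means resending all of them.
pub struct ReceiverUpdateAll {
    pub name: String,
    pub description: String,
    pub events: Vec<String>,
}

/// Replaces every field, so a stale or mistyped value overwrites good data.
pub fn update_all(rx: &mut Receiver, body: ReceiverUpdateAll) {
    rx.name = body.name;
    rx.description = body.description;
    rx.events = body.events;
}

/// Split design: identity fields are updated on their own.
pub struct ReceiverUpdateIdentity {
    pub name: String,
    pub description: String,
}

/// Leaves the subscription list untouched.
pub fn update_identity(rx: &mut Receiver, body: ReceiverUpdateIdentity) {
    rx.name = body.name;
    rx.description = body.description;
}

/// Adds one subscription; adding an existing one is a no-op.
pub fn subscription_add(rx: &mut Receiver, event: &str) {
    if !rx.events.iter().any(|e| e == event) {
        rx.events.push(event.to_string());
    }
}

/// Removes one subscription. Returns `true` if it existed.
pub fn subscription_remove(rx: &mut Receiver, event: &str) -> bool {
    let before = rx.events.len();
    rx.events.retain(|e| e != event);
    rx.events.len() != before
}
```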

Member Author

hawkw commented Apr 22, 2025

As a quick update, I've decided to go ahead and address @benjaminleonard's suggestion from #7277 (comment) on this branch, prior to merging this. It ends up being a fairly large change to the current code, but it also ends up making the implementation simpler. @smklein, sorry, but I'll probably end up wanting another review pass on those additional changes; hopefully it won't be too much more work from you.

glob reprocessing is now performed by the dispatcher exclusively
(eliminates a bunch of transactions).
Member Author

hawkw commented Apr 24, 2025

@smklein: okay, I've made the changes I discussed in #7277 (comment), and I would love to get another look when you have the chance. Beyond moving subscription add/remove to their own API endpoints, I've also changed how exact subscriptions are generated for globs: now, it's always done "lazily" when determining what events a receiver is subscribed to, rather than "eagerly" when adding the glob subscription. This way, we can create the glob subscription by just adding its record, so the subscription-add path doesn't have to do a transaction.

I do still need to update with the latest changes from main; #7985, #8003 and similar have introduced a huge pile of merge conflicts that are kind of a pain to track. But, I think the main differences from what you've reviewed previously should be pretty stable across that.
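
The "lazy" approach described in this comment can be sketched roughly as follows: globs are stored as plain records, and the set of subscribed event classes is computed on demand. The glob semantics used here are an assumption for illustration (segments separated by `.`, `*` matching exactly one segment, `**` matching one or more segments), and the function names are hypothetical.

```rust
/// Matches pattern segments against event class segments.
/// Assumed semantics: `*` matches exactly one segment, `**` matches one
/// or more segments, anything else must match literally.
fn segments_match(pattern: &[&str], class: &[&str]) -> bool {
    let Some((&head, rest)) = pattern.split_first() else {
        // Pattern exhausted: match only if the class is exhausted too.
        return class.is_empty();
    };
    if head == "**" {
        // Try consuming 1, 2, ... segments with `**`.
        return (1..=class.len()).any(|n| segments_match(rest, &class[n..]));
    }
    match class.split_first() {
        Some((&seg, class_rest)) => {
            (head == "*" || head == seg) && segments_match(rest, class_rest)
        }
        None => false,
    }
}

/// Returns `true` if `event_class` matches `glob`.
pub fn glob_matches(glob: &str, event_class: &str) -> bool {
    let pattern: Vec<&str> = glob.split('.').collect();
    let class: Vec<&str> = event_class.split('.').collect();
    segments_match(&pattern, &class)
}

/// Lazily computes which known event classes a receiver is subscribed to,
/// given its exact subscriptions and its glob subscriptions.
pub fn subscribed_classes<'a>(
    exact: &[&str],
    globs: &[&str],
    all_classes: &[&'a str],
) -> Vec<&'a str> {
    all_classes
        .iter()
        .copied()
        .filter(|class| {
            exact.iter().any(|e| e == class)
                || globs.iter().any(|g| glob_matches(g, class))
        })
        .collect()
}
```

Because nothing is materialized when a glob is added, adding a glob subscription in this sketch is a single record insert, which mirrors the motivation given above for avoiding a transaction.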

@hawkw hawkw requested a review from smklein April 25, 2025 17:04
Collaborator

@smklein smklein left a comment

Mostly LGTM, but why not cover the endpoints in nexus/tests/integration_tests/endpoints.rs? I'm not sure I understand why they're excluded.

}
}
}
},
Contributor

This is a ramble based on my gut reaction, but I don't have a solid recommendation for what I think would be better. It might be as simple as renaming WebhookSubscription to WebhookEvent or something.


There are a bunch of spots that use the same WebhookSubscription type:

  • WebhookReceiver events field
    • Notable that this field is not called "subscriptions." If anything I like "events" better -- maybe the endpoints should be called add/remove events.
  • Add subscription request body
  • Add subscription response body
  • Delete subscription: the subscription path parameter
  • filter query param on event classes list endpoint (didn't notice this one until the end when I was putting together this list, so it's less important to me, but still probably deserves to be unified with the others)

This is of course elegant in a way, but I also found it pretty surprising and had to think through it quite a bit to understand -- especially the fact that the request and response bodies have the same type as a path parameter. I don't think there are any other places we can do that, simply because there aren't many other resources that are just strings. (Maybe there is a list of IP addresses somewhere that works like this, where the IP string is its own identifier.)

Usually a path parameter's type makes clear (like NameOrId does) that it's some kind of identifier with a natural string representation. So while we call the field project, project: NameOrId makes clear we're talking about an identifier. subscription is therefore correct as a name for the field, but the name of the type messes me up. The thing that's really unusual in this case is that the resource itself also is fundamentally a string. Maybe the main thing throwing me off is the name WebhookSubscription, which puts me in mind of a bigger structured object -- really what it makes me think of is the receiver.

A related but distinct issue is that we generally want API request and response bodies to be JSON objects rather than strings. We've run into issues with client generation when that has not been the case. A string is valid JSON, but I think this would be the only endpoint that returns a string. One reason to prefer objects is that objects can be extended by adding another key without changing the basic shape of the thing. If these are unlikely to change their shape, consistency with other endpoints is probably a stronger argument. On the other hand, it would feel silly to have the list of subscriptions in the receiver response be a list of objects like { glob: 'my.event.*' } rather than a list of strings, so that's a conundrum, I guess.

path = "/v1/webhooks/receivers/{receiver}/subscriptions/{subscription}",
tags = ["system/webhooks"],
}]
async fn webhook_receiver_subscription_delete(
Contributor

@david-crespo david-crespo Apr 25, 2025

This is a nitpick but I prefer remove to delete for the opposite of add here. Compare to IP range add/remove. For secrets, you have add/delete also, but I think in that case my preference is to make it create/delete. The fact that the secret is a child of the webhook is not dispositive as an argument for "add" — see all project-scoped resources like disk and instance, which we are nonetheless creating rather than adding. For me it's more about whether you are creating and deleting an instance of an independent resource -- yes for secret, no for event subscription glob string. For me IP ranges are similar to event classes in that they are fully reducible to their string representations and they really are subservient in a way to their parent.

Member Author

Yeah, that makes sense, I'll change the naming to be more consistent with the other APIs. From the authz perspective, I've also tried to treat the secrets as their own resource but the subscriptions are a list that's logically associated with the receiver resource — this felt right, since the secrets have IDs while the subscriptions don't.

Another option would be to just call it subscribe and unsubscribe instead of add/remove, but I dunno what our position on using more specific verbs like that is...

Contributor

I like add/remove because it’s easier to guess that it will result in the events array on the receiver being changed (or it would if the names matched)

hawkw added 2 commits April 26, 2025 10:41
I've made the following changes to the APIs for adding and removing
subscriptions:

- renamed `webhook_receiver_subscription_delete` to
  `webhook_receiver_subscription_remove`.
- renamed the `events` field in receiver models to `subscriptions`.
- wrapped the JSON request and response bodies for
  `webhook_receiver_subscription_add` in JSON objects.

  These are separate models in `params` and `views` for the
  request/response, respectively, even though they currently both just
  contain one field, of the `shared::WebhookSubscription` type. I felt
  like that was worthwhile as we may in future want to add different
  fields to the request and response models.

  Elsewhere, we still use the `shared::WebhookSubscription` type,
  including in the receiver models, as those felt awkward when it had to
  be wrapped in an additional object. It's important to still be able to
  use this as a bare string when it's used in path or query params.

@david-crespo please let me know how you feel about these changes (but
don't feel like you need to this weekend...)
this is now vestigial, as the receiver update query no longer mutates
subscriptions.
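
The request/response wrapping described in the first commit message above could look roughly like the sketch below. Only `shared::WebhookSubscription` and the `params`/`views` module names come from the commit message; the struct names `WebhookSubscriptionCreate` and `WebhookSubscriptionCreated`, the `subscription` field name, and the non-empty check are assumptions. Serialization derives are omitted to keep the sketch dependency-free.

```rust
pub mod shared {
    use std::str::FromStr;

    /// Stand-in for `shared::WebhookSubscription`: fundamentally a string,
    /// so it can be used bare in path and query parameters.
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct WebhookSubscription(pub String);

    impl FromStr for WebhookSubscription {
        type Err = String;

        /// Placeholder validation (assumption): reject empty strings only.
        fn from_str(s: &str) -> Result<Self, Self::Err> {
            if s.is_empty() {
                Err("subscription must not be empty".to_string())
            } else {
                Ok(WebhookSubscription(s.to_string()))
            }
        }
    }
}

pub mod params {
    use super::shared::WebhookSubscription;

    /// Hypothetical request body: a JSON object with one field today,
    /// leaving room to add request-only fields later.
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct WebhookSubscriptionCreate {
        pub subscription: WebhookSubscription,
    }
}

pub mod views {
    use super::shared::WebhookSubscription;

    /// Hypothetical response body: a separate type so response-only
    /// fields can be added without touching the request model.
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct WebhookSubscriptionCreated {
        pub subscription: WebhookSubscription,
    }
}

/// Toy handler body: echoes the created subscription back in the view type.
pub fn subscription_add(
    body: params::WebhookSubscriptionCreate,
) -> views::WebhookSubscriptionCreated {
    views::WebhookSubscriptionCreated { subscription: body.subscription }
}
```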