Skip to content

Latest commit

 

History

History
315 lines (229 loc) · 14.8 KB

monitoring-and-telemetry.adoc

File metadata and controls

315 lines (229 loc) · 14.8 KB

Monitoring and telemetry

High-level structure

The monitoring and telemetry system has the following high-level structure:

flowchart TD
    A[Threshold Dashboard] -->|Deposit data| B(Second track deposit verifier)
    B --> |Status| A
    A --> |Telemetry| C(Sentry)
    D[TBTCv2 minters and guardians] --> |Telemetry| C
    C -->|Alerts| E(PagerDuty)
    C --> |Alerts| F(Discord)
    G[TBTCv2 monitoring] --> |Alerts| C
    G --> |Notifications| F
Loading

There are several sources of monitoring and telemetry data:

Monitoring and telemetry data are gathered by Sentry hub. Collected data are used to produce and send alerts to:

Sources of data

Each source produces a different kind of monitoring and telemetry data. Here is a detailed description of each data source.

Threshold dashboard

The dashboard instances collect the following telemetry data that are sent directly to Sentry hub:

  • generated deposit data,

  • unhandled errors.

Generated deposit data

Each time a new deposit address is generated, the dashboard collects the following data:

  • BTC address of the deposit

  • JSON recovery file content

  • Second track deposit verification result

Worth noting that those data are send to the telemetry regardless of whether the user actually funded and revealed the deposit. Those data are collected in order to help users with potential deposit reveals that have not occurred automatically and deposit recovery in case of wallet misbehavior. Last but not least, all deposit addresses are checked against the second track deposit verifier to make sure the address generation logic used by the dashboard is proper and will not lead to fund loss.

Note

The second track deposit verifier is actually a Google Cloud Function named verify-deposit-address living in the keep-prd GCP project. That function accepts some deposit data and return a deposit BTC address computed using their own algorithm, independent from the one used by the dashboard. See an example call.

Unhandled errors

Apart from deposit data, all unhandled errors thrown by the dashboard are also captured by the telemetry system. This allows to identify bugs encountered by users in an automatic way.

TBTCv2 minters and guardians

The minters and guardians instances collect the following telemetry data that are sent directly to Sentry hub:

  • handled processing and validation errors,

  • unhandled errors.

Handled processing and validation errors

During their work, the minters and guardians may encounter several errors related to the processing logic and deposit validation. Most of them are explicitly sent to the telemetry as errors or warnings. Examples of such errors are:

  • Revert of an Ethereum transaction created by the bot

  • Failed validation of a revealed deposit

Unhandled errors

Apart from handled errors collected explicitly, all unhandled errors thrown by the minters and guardians are also captured by the telemetry system.

TBTCv2 monitoring

The monitoring component inspects the tBTC v2 system on-chain contracts and produce different kind of system events that are sent to Sentry and Discord based on their type. A general rule of thumb is that notifications (i.e. informational sytem events) are sent directly to Discord *-notifications channels while alerts requiring an action (i.e. warning/critical system events) are propagated to the Sentry hub that decides about next steps. Specific system events produced by the monitoring component are:

  • deposit revealed,

  • redemption requested,

  • wallet registered,

  • DKG result submitted,

  • DKG result approved,

  • DKG result challenged,

  • large deposit revealed,

  • large redemption requested,

  • stale redemption,

  • optimistic minting canceled,

  • optimistic minting requested too early,

  • optimistic minting requested for undetermined Bitcoin transaction,

  • optimistic minting not requested by designated minter,

  • optimistic minting not finalized by designated minter,

  • optimistic minting not requested by any minter,

  • optimistic minting not finalized by any minter,

  • high TBTC token total supply change.

Deposit revealed

An informational system event indicating that a new deposit was revealed to the on-chain Bridge contract. This event is directly sent to Discord as a notification that does not require any action.

Redemption requested

An informational system event indicating that a new redemption was requested from the on-chain Bridge contract. This event is directly sent to Discord as a notification that does not require any action.

Wallet registered

An informational system event indicating that a new wallet was registered on the on-chain Bridge contract. This event is directly sent to Discord as a notification that does not require any action.

DKG result submitted

An informational system event indicating that a new DKG result was submitted to the on-chain WalletRegistry contract. This event is directly sent to Discord as a notification that does not require any action.

DKG result approved

An informational system event indicating that the submitted DKG result was approved on the on-chain WalletRegistry contract. This event is directly sent to Discord as a notification that does not require any action.

DKG result challenged

A critical system event indicating that the submitted DKG result was challenged on the on-chain WalletRegistry contract. This event is sent to Sentry hub and requires an immediate team’s action. The default action is checking the reason of the challenge as that event may indicate a malicious wallet operator or a serious bug in the off-chain client code.

Large deposit revealed

A warning system event indicating that a large deposit was revealed to the on-chain Bridge contract. This event is sent to Sentry hub and should get team’s attention. The default action is making sure that the deposit is handled correctly by the system.

Large redemption requested

A warning system event indicating that a large redemption was requested from the on-chain Bridge contract. This event is sent to Sentry hub and should get team’s attention. The default action is making sure that the redemption is not a result of a malicious action, and if not, that the redemption is handled correctly by the system.

Stale redemption

A warning system event indicating that a redemption request became stale, i.e. was not handled within the expected time. This event is sent to Sentry hub and should get team’s attention. The default action is investigating the cause of the extended processing time as this alert may be an early sign of a malfunctioning wallet or may indicate a problem with the maintainer bot.

Optimistic minting cancelled

A warning system event indicating that an optimistic minting request was cancelled by a guardian. This event is sent to Sentry hub and should get team’s attention. The default action is checking the reason of cancellation as that event may indicate a malicious minter or guardian that should be evicted from the system.

Optimistic minting requested too early

A critical system event indicating that an optimistic minting request was issued too early regarding their BTC funding transaction confirmation state. This event is sent to Sentry hub and requires an immediate team’s action. The default action is checking the reason of the early request as that event may indicate a malicious minter that should be evicted from the system.

Optimistic minting requested for undetermined Bitcoin transaction

A critical system event indicating that an optimistic minting request was done for an undetermined Bitcoin transaction. This event is sent to Sentry hub and requires an immediate team’s action. The default action is checking why the Bitcoin transaction cannot be determined as that event may indicate problems with the underlying Bitcoin client used by the monitoring component or flag a malicious minter that should be evicted from the system.

Optimistic minting not requested by designated minter

A warning system event indicating that an optimistic minting request was not issued by the designated minter and another minter did that job. This event is sent to Sentry hub and should get team’s attention. The default action is investigating the cause of the designated minter idleness as the designated minter may be unhealthy/malicious or there may be a bug in the minters bot code.

Optimistic minting not finalized by designated minter

A warning system event indicating that an optimistic minting request was not finalized by the designated minter and another minter did that job. This event is sent to Sentry hub and should get team’s attention. The default action is investigating the cause of the designated minter idleness as the designated minter may be unhealthy/malicious or there may be a bug in the minters bot code.

Optimistic minting not requested by any minter

A warning system event indicating that an optimistic minting request was not issued by any minter. This event is sent to Sentry hub and should get team’s attention. The default action is investigating the cause of the minters idleness as the underlying deposit may be invalid, minters may be unhealthy/malicious or there may be a bug in the minters bot code.

Optimistic minting not finalized by any minter

A warning system event indicating that an optimistic minting request was not finalized by any minter. This event is sent to Sentry hub and should get team’s attention. The default action is investigating the cause of the minters idleness as the underlying deposit may be invalid, minters may be unhealthy/malicious or there may be a bug in the minters bot code.

High TBTC token total supply change

A critical system event indicating that a high change (i.e. >=10%) of the total TBTC v2 token supply took place in the last 12 hours. This event is sent to Sentry hub and requires an immediate team’s action. The default action is checking the root cause of the supply change and making sure its source is actually a proper deposit/redemption and there are no signs of any malicious action.

Sentry hub

The monitoring and telemetry system uses Sentry as hub for relevant monitoring and telemetry data that requires an action from the team. Here is a detailed description of this component.

Configuration

The Sentry application has been configured in the following way:

  • There is a Keep organization that groups all invited members under the #Keep team

  • There are projects corresponding to specific monitoring and telemetry data sources:

Alerts

As mentioned earlier, Sentry uses the collected monitoring and telemetry data to raise alerts that are propagated to PagerDuty and Discord *-alerts channels. Here is the exact summary of configured alert rules:

Alert name Project Firing conditions Notified entities

Mainnet deposit second track verification failure

prod-threshold-dashboard

When deposit address returned by the second track deposit verifier is different from the address generated by the dashboard

PagerDuty and Discord mainnet-alerts channel

Testnet deposit second track verification failure

test-threshold-dashboard

When deposit address returned by the second track deposit verifier is different from the address generated by the dashboard

Discord testnet-alerts channel

Mainnet monitoring alerts Discord router

prod-tbtc-v2-monitoring

When a new alert (i.e. warning/critical system event) is received from the TBTCv2 monitoring component

Discord mainnet-alerts channel

Mainnet monitoring alerts PagerDuty router

prod-tbtc-v2-monitoring

When a new critical alert (i.e. critical system event) is received from the TBTCv2 monitoring component

PagerDuty

Testnet monitoring alerts Discord router

test-tbtc-v2-monitoring

When a new alert (i.e. warning/critical system event) is received from the TBTCv2 monitoring component

Discord testnet-alerts channel