Add new awsebsnvmereceiver #1603

Merged on Apr 2, 2025 (23 commits)

Conversation

@duhminick (Contributor) commented Mar 18, 2025

Note to Reviewers

This PR does not enable this new receiver. This is strictly just adding a new receiver that will be enabled later.

Changes 2:

  1. Use fmt.Sscanf instead of the custom parsing logic for the device file name
  2. Enable only one metric by default. This is for the upcoming translation changes -- this just simplifies some of the logic. We need to keep at least one metric enabled by default or else the generated tests from mdatagen fail
  3. Rename Resources to Devices in the receiver config
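
For readers unfamiliar with the approach in item 1, here is a minimal sketch of parsing an NVMe device file name with fmt.Sscanf. The names parseDeviceName and deviceAttrs are hypothetical and not taken from this PR. The sketch assumes names of the form nvme<controller>[n<namespace>[p<partition>]] and uses -1 for parts that are absent.

```go
package main

import (
	"errors"
	"fmt"
)

// deviceAttrs is a hypothetical holder for the parts of an NVMe device name.
// A value of -1 means the part was not present in the name.
type deviceAttrs struct {
	controller, namespace, partition int
}

// parseDeviceName parses names such as "nvme0", "nvme0n1", or "nvme0n1p1".
// fmt.Sscanf reports how many values it filled in even when it stops early,
// so the count tells us which optional parts were present.
func parseDeviceName(name string) (deviceAttrs, error) {
	attrs := deviceAttrs{controller: -1, namespace: -1, partition: -1}
	// The error is ignored on purpose: a short name like "nvme0" ends before
	// the format does, which Sscanf reports as an error alongside count == 1.
	count, _ := fmt.Sscanf(name, "nvme%dn%dp%d",
		&attrs.controller, &attrs.namespace, &attrs.partition)
	if count == 0 || attrs.controller < 0 {
		return deviceAttrs{}, errors.New("not an NVMe device name: " + name)
	}
	return attrs, nil
}

func main() {
	for _, name := range []string{"nvme0", "nvme0n1", "nvme0n1p1", "sda"} {
		attrs, err := parseDeviceName(name)
		fmt.Println(name, attrs, err)
	}
}
```

One caveat of this sketch: Sscanf does not reject trailing characters, so stricter validation would need an extra check.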

Changes 3:

  1. Refactor to use collections.Set instead of custom map implementation for device tracking
  2. Remove unnecessary work in the device processing pipeline by skipping EBS device validation and serial retrieval when they have already been done for that controller
  3. Remove cleanupString function to simplify string handling
  4. Rename Util to DeviceInfoProvider for a more descriptive name
  5. Don't unnecessarily export constants used only within the package
  6. Rename stub implementation to notunix for better platform distinction
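
As a rough illustration of items 1 and 2, the sketch below tracks already-seen controllers in a set so that repeated work can be skipped. The set type here is a hypothetical stand-in written for this example. It is not the agent's collections.Set, and its method names are assumptions.

```go
package main

import "fmt"

// set is a hypothetical stand-in for a set type; it is not the agent's
// collections.Set and its method names are assumptions.
type set[K comparable] map[K]struct{}

func newSet[K comparable](keys ...K) set[K] {
	s := make(set[K], len(keys))
	for _, k := range keys {
		s[k] = struct{}{}
	}
	return s
}

func (s set[K]) add(k K) { s[k] = struct{}{} }

func (s set[K]) contains(k K) bool {
	_, ok := s[k]
	return ok
}

// firstSeen returns the controller IDs in first-seen order and the number of
// entries skipped because their controller had already been handled.
func firstSeen(controllerIDs []int) (unique []int, skipped int) {
	seen := newSet[int]()
	for _, id := range controllerIDs {
		if seen.contains(id) {
			skipped++ // expensive work such as serial retrieval would be skipped here
			continue
		}
		seen.add(id)
		unique = append(unique, id)
	}
	return unique, skipped
}

func main() {
	// Controller IDs for nvme0, nvme0n1, nvme0n1p1, and nvme1.
	unique, skipped := firstSeen([]int{0, 0, 0, 1})
	fmt.Println(unique, skipped)
}
```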

Description of the issue

Amazon Elastic Block Store (EBS) exposes performance statistics for EBS volumes attached to EC2 instances as NVMe devices in a vendor-unique log page. The log page can be retrieved by making a system call to the NVMe device. CloudWatch Agent (CWA) will collect the retrieved metrics and emit them to CloudWatch.
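
To make the idea concrete, here is a minimal sketch of decoding a log page buffer and validating a magic number. The magic value, the struct layout, and all names (exampleMagic, examplePage, parseLogPage, buildExamplePage) are placeholders invented for illustration. They are not the real EBS log page format, and the ioctl call that would fill the buffer is omitted.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// exampleMagic and the struct layout below are placeholders for illustration;
// they are NOT the real EBS log page magic number or field layout.
const exampleMagic uint64 = 0xABCD1234

var (
	errInvalidMagic = errors.New("invalid magic number")
	errParseLogPage = errors.New("failed to parse log page")
)

// examplePage is a hypothetical fixed-size prefix of a vendor log page.
type examplePage struct {
	Magic    uint64
	ReadOps  uint64
	WriteOps uint64
}

// parseLogPage decodes the little-endian prefix of raw and validates the magic.
func parseLogPage(raw []byte) (examplePage, error) {
	var page examplePage
	if err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &page); err != nil {
		return examplePage{}, fmt.Errorf("%w: %v", errParseLogPage, err)
	}
	if page.Magic != exampleMagic {
		return examplePage{}, errInvalidMagic
	}
	return page, nil
}

// buildExamplePage fabricates a 4096-byte buffer in the placeholder layout,
// standing in for the data a device would return.
func buildExamplePage(readOps, writeOps uint64) []byte {
	raw := make([]byte, 4096)
	binary.LittleEndian.PutUint64(raw[0:8], exampleMagic)
	binary.LittleEndian.PutUint64(raw[8:16], readOps)
	binary.LittleEndian.PutUint64(raw[16:24], writeOps)
	return raw
}

func main() {
	fmt.Println(parseLogPage(buildExamplePage(42, 7)))
	fmt.Println(parseLogPage(make([]byte, 4096))) // zeroed buffer fails the magic check
}
```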

Description of changes

  1. Main Scraper Implementation (scraper.go):

    • Implements the core scraping logic through the nvmeScraper struct
    • Uses a MetricsBuilder to construct standardized metrics
      • MetricsBuilder is generated using mdatagen with the schema defined in metadata.yaml
    • Handles device discovery and metric collection for EBS NVMe devices
      • Includes the ability to scrape all devices, or specific devices.
    • Implements safety checks for integer overflow protection
  2. NVMe Metrics Collection (internal/nvme):

    • Implements low-level NVMe device interaction through Linux ioctl calls
    • Provides structured metric collection through EBSMetrics struct
    • Collects key metrics including:
      • Read/Write operations and bytes
      • Total read/write times
      • IOPS and throughput exceeded counters
      • Queue length
      • Read/Write latency histograms (not collected at the moment)
  3. Generated components (internal/metadata):

    • All generated by mdatagen using the schema defined in metadata.yaml
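
The integer overflow protection mentioned for the scraper could look roughly like the sketch below. It assumes the raw counters are unsigned 64-bit values that must be recorded as signed 64-bit values. The helper name safeUint64ToInt64 is hypothetical and not necessarily what the PR uses.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var errOverflow = errors.New("value exceeds math.MaxInt64")

// safeUint64ToInt64 converts v to int64, returning an error instead of
// silently wrapping to a negative number when v does not fit.
func safeUint64ToInt64(v uint64) (int64, error) {
	if v > math.MaxInt64 {
		return 0, errOverflow
	}
	return int64(v), nil
}

func main() {
	fmt.Println(safeUint64ToInt64(1024))
	fmt.Println(safeUint64ToInt64(math.MaxUint64))
}
```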

The receiver collects the following metrics from EBS NVMe devices:

  • diskio_ebs_total_read_ops
  • diskio_ebs_total_write_ops
  • diskio_ebs_total_read_bytes
  • diskio_ebs_total_write_bytes
  • diskio_ebs_total_read_time
  • diskio_ebs_total_write_time
  • diskio_ebs_volume_performance_exceeded_iops
  • diskio_ebs_volume_performance_exceeded_tp
  • diskio_ebs_ec2_instance_performance_exceeded_iops
  • diskio_ebs_ec2_instance_performance_exceeded_tp
  • diskio_ebs_volume_queue_length

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Manually updated the YAML config to add a new instance of the EBS receiver and updated the existing host delta metrics pipeline to include the new receiver.
  • Attached an EBS volume to the instance

The EC2 instance on which the manual tests were run has two EBS volumes attached (nvme0 and nvme1).

Resource is explicitly empty

2025-03-18T19:35:03Z D! {"caller":"awsebsnvmereceiver/scraper.go:54","msg":"Began scraping for NVMe metrics","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics"}
2025-03-18T19:35:03Z D! {"caller":"awsebsnvmereceiver/scraper.go:133","msg":"skipping un-allowed device","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","device":"nvme0"}
2025-03-18T19:35:03Z D! {"caller":"awsebsnvmereceiver/scraper.go:133","msg":"skipping un-allowed device","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","device":"nvme0n1"}

One device (nvme0) is in resources

2025-03-18T19:36:24Z D! {"caller":"awsebsnvmereceiver/scraper.go:54","msg":"Began scraping for NVMe metrics","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics"}
2025-03-18T19:36:24Z D! {"caller":"awsebsnvmereceiver/scraper.go:133","msg":"skipping un-allowed device","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","device":"nvme0n1"}
2025-03-18T19:36:24Z D! {"caller":"awsebsnvmereceiver/scraper.go:133","msg":"skipping un-allowed device","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","device":"nvme0n1p1"}
2025-03-18T19:36:24Z D! {"caller":"awsebsnvmereceiver/scraper.go:101","msg":"emitted metrics for nvme device with controller id","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","controllerID":0}

Wildcard (*) for resources

2025-03-18T19:38:30Z D! {"caller":"awsebsnvmereceiver/scraper.go:54","msg":"Began scraping for NVMe metrics","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics"}
2025-03-18T19:38:30Z D! {"caller":"awsebsnvmereceiver/scraper.go:101","msg":"emitted metrics for nvme device with controller id","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","controllerID":0}
2025-03-18T19:38:30Z D! {"caller":"awsebsnvmereceiver/scraper.go:101","msg":"emitted metrics for nvme device with controller id","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","controllerID":1}
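
The three log excerpts above are consistent with a simple allow-list check, sketched below. This is an inference from the logs, not the PR's actual code: it assumes that an empty list allows nothing, that "*" allows every device, and that any other entry must match the device name exactly. The name isAllowed is hypothetical.

```go
package main

import "fmt"

const wildcard = "*"

// isAllowed reports whether name passes the configured device list:
// an empty list allows nothing, "*" allows everything, and any other
// entry must match the device name exactly.
func isAllowed(configured []string, name string) bool {
	for _, entry := range configured {
		if entry == wildcard || entry == name {
			return true
		}
	}
	return false
}

func main() {
	for _, configured := range [][]string{{}, {"nvme0"}, {"*"}} {
		for _, name := range []string{"nvme0", "nvme0n1"} {
			fmt.Println(configured, name, isAllowed(configured, name))
		}
	}
}
```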

Sample Config

receivers:
    awsebsnvmereceiver:
        collection_interval: 1m0s
        initial_delay: 1s
        metrics:
            diskio_ebs_ec2_instance_performance_exceeded_iops:
                enabled: false
            diskio_ebs_ec2_instance_performance_exceeded_tp:
                enabled: false
            diskio_ebs_total_read_bytes:
                enabled: true
            diskio_ebs_total_read_ops:
                enabled: false
            diskio_ebs_total_read_time:
                enabled: false
            diskio_ebs_total_write_bytes:
                enabled: true
            diskio_ebs_total_write_ops:
                enabled: false
            diskio_ebs_total_write_time:
                enabled: false
            diskio_ebs_volume_performance_exceeded_iops:
                enabled: false
            diskio_ebs_volume_performance_exceeded_tp:
                enabled: false
            diskio_ebs_volume_queue_length:
                enabled: false
        resource_attributes:
            VolumeId:
                enabled: true
        devices:
            - nvme0n1
        timeout: 0s

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@duhminick force-pushed the dominic-nvme-receiver branch from 66fb288 to 546fb3e on March 18, 2025 15:04
@duhminick changed the base branch from main to ebs on March 18, 2025 15:19
@duhminick force-pushed the dominic-nvme-receiver branch from 546fb3e to be8e93b on March 18, 2025 15:23
@duhminick marked this pull request as ready for review on March 18, 2025 19:39
@duhminick requested a review from a team as a code owner on March 18, 2025 19:39
@duhminick changed the base branch from ebs to main on March 18, 2025 19:40
@lisguo (Contributor) commented Mar 25, 2025

"This PR should be merged into the ebs branch"

Is this still true? We are just adding the receiver not enabling it so putting it to main seems fine

@duhminick (Contributor Author)

"Is this still true? We are just adding the receiver not enabling it so putting it to main seems fine"

Oh, got it. Then I'll leave the destination branch as is

@duhminick (Contributor Author)

agent.json
output.txt

Attaching the agent config and translated configs

@lisguo (Contributor) left a comment

Can you also provide a sample yaml config of the receiver?

lisguo previously approved these changes Mar 27, 2025
IsEbsDevice(device *DeviceFileAttributes) (bool, error)
}

type Util struct {
Contributor

nit: I know we use Util as a name all over the place, but it doesn't help me understand what this is. Would rather it be called like DeviceInfoProvider or something more descriptive.

Contributor

nit: Why is this a struct if it doesn't have a state? Why not just have the functions exported as package level functions?

Contributor Author

"nit: I know we use Util as a name all over the place, but it doesn't help me understand what this is. Would rather it be called like DeviceInfoProvider or something more descriptive."

Will do.

"nit: Why is this a struct if it doesn't have a state? Why not just have the functions exported as package level functions?"

I made it a struct so that it would be a bit easier to mock for unit testing.

Contributor

The common pattern is to have a constructor function like

func NewDeviceInfoProvider() DeviceInfoProvider {
  return &util{}
}

so that only the interface is exposed.
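
A slightly fuller sketch of that pattern, including the mocking use case mentioned earlier in this thread, might look like the following. The interface is reduced to a single method that takes a plain string, which is a simplification of the signature in this PR. The names deviceInfoProvider, mockProvider, and volumeLabel are hypothetical.

```go
package main

import "fmt"

// DeviceInfoProvider is reduced to one simplified method for this sketch.
type DeviceInfoProvider interface {
	GetDeviceSerial(deviceName string) (string, error)
}

// deviceInfoProvider is unexported so callers can only hold the interface.
type deviceInfoProvider struct{}

// NewDeviceInfoProvider returns the concrete type behind the interface.
func NewDeviceInfoProvider() DeviceInfoProvider {
	return &deviceInfoProvider{}
}

func (*deviceInfoProvider) GetDeviceSerial(deviceName string) (string, error) {
	// A real implementation would read the serial from the system.
	return "", fmt.Errorf("not implemented in this sketch: %s", deviceName)
}

// mockProvider is a test double that returns canned serials.
type mockProvider struct {
	serials map[string]string
}

func (m *mockProvider) GetDeviceSerial(deviceName string) (string, error) {
	serial, ok := m.serials[deviceName]
	if !ok {
		return "", fmt.Errorf("unknown device %q", deviceName)
	}
	return serial, nil
}

// volumeLabel depends only on the interface, so tests can pass a mock.
func volumeLabel(p DeviceInfoProvider, deviceName string) string {
	serial, err := p.GetDeviceSerial(deviceName)
	if err != nil {
		return "unknown"
	}
	return serial
}

func main() {
	mock := &mockProvider{serials: map[string]string{"nvme0": "vol-example"}}
	fmt.Println(volumeLabel(mock, "nvme0"))
	fmt.Println(volumeLabel(NewDeviceInfoProvider(), "nvme0"))
}
```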

Comment on lines +57 to +60
var (
	ErrInvalidEBSMagic = errors.New("invalid EBS magic number")
	ErrParseLogPage    = errors.New("failed to parse log page")
)
Contributor

nit: If they aren't used outside of the package, don't export them.

Contributor Author

Similar to the comment below about keeping the code as similar as possible to what is found in the EBS CSI driver code

Contributor

nit: Could put CSI driver code into a separate package (e.g. internal/nvme/csidriver). Are we planning on trying to keep this in sync with the CSI driver code or are we willing to diverge?

Comment on lines +88 to +94
func nvmeReadLogPage(fd uintptr, logID uint8) ([]byte, error) {
	data := make([]byte, 4096) // 4096 bytes is the length of the log page.
	bufferLen := len(data)

	if bufferLen > math.MaxUint32 {
		return nil, errors.New("nvmeReadLogPage: bufferLen exceeds MaxUint32")
	}
Contributor

I don't get this check. If we define the length of the slice, how would this ever return an error?

Contributor Author

This comes directly from the EBS CSI driver code. I did see some things I could change, but I do have a preference to keep it the same as much as possible.

volumeID string
type ebsDevices struct {
	volumeID    string
	deviceNames []string
Contributor

one volume can have multiple devices?

Contributor Author (@duhminick, Apr 2, 2025)

So it's all based on the controller ID. For example: nvme0, nvme0n1, nvme0n2, nvme0n1p1 will all have the same serial/volume ID, and metrics. Whereas nvme1 would be different
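
A small sketch of that grouping, assuming the controller ID is the number that directly follows the "nvme" prefix. The helper names controllerID and groupByController are hypothetical and not part of this PR.

```go
package main

import "fmt"

// controllerID extracts the controller number from names like "nvme0n1p1".
// Sscanf stops after the first integer and leaves the rest of the name unread.
func controllerID(name string) (int, bool) {
	var id int
	n, _ := fmt.Sscanf(name, "nvme%d", &id)
	return id, n == 1 && id >= 0
}

// groupByController maps each controller ID to the device names under it.
// Names that do not look like NVMe devices are ignored.
func groupByController(names []string) map[int][]string {
	groups := make(map[int][]string)
	for _, name := range names {
		if id, ok := controllerID(name); ok {
			groups[id] = append(groups[id], name)
		}
	}
	return groups
}

func main() {
	names := []string{"nvme0", "nvme0n1", "nvme0n2", "nvme0n1p1", "nvme1", "sda"}
	for id, devices := range groupByController(names) {
		fmt.Println(id, devices)
	}
}
```

With this grouping, the serial and the metrics only need to be fetched once per controller rather than once per device file.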

return devices, nil
}

func (u *Util) GetDeviceSerial(device *DeviceFileAttributes) (string, error) {
Contributor

Validated in what way? device.BaseDeviceName() does the same check.

t.Run(tt.name, func(t *testing.T) {
	got, err := ParseNvmeDeviceFileName(tt.device)
	if (err != nil) != tt.wantErr {
		t.Errorf("ParseNvmeDeviceFileName() error = %v, wantErr %v", err, tt.wantErr)
Contributor

nit: For consistency with other tests, consider using github.com/stretchr/testify/assert or github.com/stretchr/testify/require

// retrieving the volume ID and validating if it's an EBS device
if entry, seenController := devices[device.Controller()]; seenController {
	entry.deviceNames = append(entry.deviceNames, deviceName)
	s.logger.Debug("skipping unnecessary device validation steps", zap.String("device", deviceName))
Contributor

nit: Some of these logs seem excessive especially if this is going to get logged multiple times during each scrape.

@duhminick merged commit ecf2afd into aws:main on Apr 2, 2025
7 checks passed
@duhminick deleted the dominic-nvme-receiver branch on April 2, 2025 20:08