S.M.A.R.T support #614

Draft
wants to merge 1 commit into
base: main
@geekifan geekifan commented Feb 22, 2025

I followed the pattern of GPUManager to add S.M.A.R.T. support to the agent. Since I am not a Go expert and do not have enough physical devices on hand for testing, I hope someone can do a basic review of my code and test it on their own devices. Once everything is ready, I will proceed with modifying the hub's code.

Owner

henrygd commented Feb 23, 2025

Hi Yifan, thank you very much for your work!

This looks like a great start. Let me get back to you later in the week as I have limited time right now and am trying to get the next release out as soon as possible.

On the hub side we should probably create a new table (PocketBase collection) for this data.

From my limited knowledge, I think parsing smartctl output is a fine approach and should also work on macOS. But I may be wrong.

There's also this Go library which provides SMART information: https://github.com/anatol/smart.go

And a standalone application, Scrutiny, which is written in Go and may be a helpful reference: https://github.com/AnalogJ/scrutiny

As far as hardware goes, I'm in the same boat as you. I don't even own an HDD, but we should be able to find some output samples online and use them as test data (or people using Beszel can provide them).

Again, I appreciate your time and will get back to you as soon as I can.

Edit: If anyone reads this and wants to provide sample output, please change the serial numbers before sharing.

Author

geekifan commented Feb 23, 2025

Thank you very much for your detailed response.

First, I did consider using smart.go. If we use it, we depend on it entirely, including any bugs and the possibility that its SMART database is not updated promptly. If such issues arise and the library is no longer maintained, all we can do is fork it and fix the bugs or update the SMART database ourselves. That would add a significant burden to the maintenance of Beszel. In contrast, smartctl is very widely used, its SMART database is updated regularly, and bugs are fixed more promptly. Its JSON output is also a great advantage for parsing in Go.

Regarding macOS, I also have a Mac and will run tests on it later.

The hardware I currently have available for testing includes NVMe, SATA, and SCSI drives (testable only under Linux) and USB storage, which should cover mainstream hardware. What I really worry about are the corner cases.

Additionally, I have a few issues that I am unsure how to handle:

  1. The SMART data for SATA/SCSI uses the ATA format, while NVMe uses a different format, so the SMART keys and values are inconsistent between them. Other hardware might use further SMART formats, so I believe we need everyone's help to find appropriate data structures to store and monitor them.
  2. Because hard drives are hot-swappable, if a drive is unplugged, the agent will drop the corresponding data entry when it reports to the hub. But how will the hub handle the missing data? Will it delete that drive's data from the display, or will it retain the state from the time of unplugging? (Sorry, I am not familiar with PocketBase and some database operations.)
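One possible shape for the first point, as a minimal sketch rather than the PR's actual code: type only the fields that smartctl reports at the top level of its JSON, and keep the protocol-specific part as raw JSON to decode later. The `DiskHealth` type and its helpers are hypothetical names. The `ata_smart_attributes` key and the presence of the same top-level fields for ATA drives are assumptions that should be checked against real SATA output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DiskHealth is a hypothetical normalized record. Common top-level fields are
// typed; protocol-specific payloads stay as raw JSON for later decoding.
type DiskHealth struct {
	Device struct {
		Name     string `json:"name"`
		Protocol string `json:"protocol"`
	} `json:"device"`
	ModelName   string `json:"model_name"`
	SmartStatus struct {
		Passed bool `json:"passed"`
	} `json:"smart_status"`
	Temperature struct {
		Current int `json:"current"`
	} `json:"temperature"`
	PowerOnTime struct {
		Hours int `json:"hours"`
	} `json:"power_on_time"`
	PowerCycleCount int `json:"power_cycle_count"`

	// At most one of these is expected to be present for a given drive.
	NVMeLog json.RawMessage `json:"nvme_smart_health_information_log"`
	ATAAttr json.RawMessage `json:"ata_smart_attributes"` // assumed key for ATA drives
}

// parseDiskHealth decodes one `smartctl -aj` document.
func parseDiskHealth(data []byte) (DiskHealth, error) {
	var h DiskHealth
	err := json.Unmarshal(data, &h)
	return h, err
}

// Kind reports which protocol-specific payload was found.
func (h DiskHealth) Kind() string {
	switch {
	case len(h.NVMeLog) > 0:
		return "nvme"
	case len(h.ATAAttr) > 0:
		return "ata"
	default:
		return "unknown"
	}
}

func main() {
	// Values taken from an NVMe drive's `smartctl -aj` output, trimmed.
	sample := []byte(`{
	  "device": {"name": "/dev/nvme0", "protocol": "NVMe"},
	  "model_name": "WD PC SN810 SDCPNRY-1T00-1006",
	  "smart_status": {"passed": true},
	  "temperature": {"current": 34},
	  "power_cycle_count": 3086,
	  "power_on_time": {"hours": 1392},
	  "nvme_smart_health_information_log": {"percentage_used": 0}
	}`)
	h, err := parseDiskHealth(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s (%s): passed=%t temp=%dC hours=%d\n",
		h.Device.Name, h.Kind(), h.SmartStatus.Passed, h.Temperature.Current, h.PowerOnTime.Hours)
}
```

Keeping the raw payload means an unknown format still yields the common fields, and a new decoder can be added per protocol without changing the stored record.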

EDIT: I checked the code of https://github.com/AnalogJ/scrutiny. Scrutiny parses the JSON output of smartctl to get the SMART info.

Owner

henrygd commented Feb 24, 2025

Sounds good, I agree with the direct smartctl approach.

I don't think there's any reason to worry about corner cases in the first iteration. We'll get sample output and include the most important or common values.

If there's an issue parsing then we'll just log an error. We can add support for more formats as people request them.
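A minimal sketch of that log-and-skip behaviour, assuming the agent already holds one smartctl JSON document per device. The `summary` type, the `parseAll` function, and the `/dev/sda` entry are hypothetical and only there for illustration.

```go
package main

import (
	"encoding/json"
	"log"
)

// summary holds the handful of values kept in a first iteration.
type summary struct {
	Model       string `json:"model_name"`
	Temperature struct {
		Current int `json:"current"`
	} `json:"temperature"`
}

// parseAll decodes one smartctl JSON document per device. A document that
// fails to decode is logged and skipped, so one bad drive does not block the rest.
func parseAll(outputs map[string][]byte) map[string]summary {
	results := make(map[string]summary, len(outputs))
	for dev, raw := range outputs {
		var s summary
		if err := json.Unmarshal(raw, &s); err != nil {
			log.Printf("smart: skipping %s: %v", dev, err)
			continue
		}
		results[dev] = s
	}
	return results
}

func main() {
	outputs := map[string][]byte{
		"/dev/nvme0": []byte(`{"model_name":"WD PC SN810 SDCPNRY-1T00-1006","temperature":{"current":34}}`),
		"/dev/sda":   []byte(`not json`), // hypothetical device with unparseable output
	}
	for dev, s := range parseAll(outputs) {
		log.Printf("%s: %s, %d C", dev, s.Model, s.Temperature.Current)
	}
}
```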

Hopefully the JSON structure is consistent and it's just the properties that differ, because dealing with inconsistent JSON is not fun.

The regular non-JSON output looks easy to parse, so we could just use bufio to scan the output line by line for the values we need.

Here's output from my laptop with one NVMe drive:

smartctl --scan
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
smartctl --scan -j
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.13.2-arch1-1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--scan",
      "-j"
    ],
    "exit_status": 0
  },
  "devices": [
    {
      "name": "/dev/nvme0",
      "info_name": "/dev/nvme0",
      "type": "nvme",
      "protocol": "NVMe"
    }
  ]
}
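The scan output above maps directly onto a small struct. This is a minimal sketch, not the PR's code: `Device`, `parseScan`, and `scanDevices` are hypothetical names, and `scanDevices` assumes smartctl is on the PATH and may need elevated privileges.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// Device mirrors one entry of the "devices" array in `smartctl --scan -j`.
type Device struct {
	Name     string `json:"name"`
	InfoName string `json:"info_name"`
	Type     string `json:"type"`
	Protocol string `json:"protocol"`
}

// scanOutput picks out only the part of the document that is needed here.
type scanOutput struct {
	Devices []Device `json:"devices"`
}

// parseScan decodes the JSON printed by `smartctl --scan -j`.
func parseScan(data []byte) ([]Device, error) {
	var out scanOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, fmt.Errorf("decode smartctl scan output: %w", err)
	}
	return out.Devices, nil
}

// scanDevices runs smartctl itself and parses its output.
func scanDevices() ([]Device, error) {
	data, err := exec.Command("smartctl", "--scan", "-j").Output()
	if err != nil {
		return nil, fmt.Errorf("run smartctl: %w", err)
	}
	return parseScan(data)
}

func main() {
	// Uses the sample from this thread instead of calling scanDevices,
	// so the example runs on machines without smartctl.
	sample := []byte(`{"devices":[{"name":"/dev/nvme0","info_name":"/dev/nvme0","type":"nvme","protocol":"NVMe"}]}`)
	devices, err := parseScan(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, d := range devices {
		fmt.Printf("%s type=%s protocol=%s\n", d.Name, d.Type, d.Protocol)
	}
}
```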
sudo smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.13.2-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WD PC SN810 SDCPNRY-1T00-1006
Serial Number:                      226223861317
Firmware Version:                   HPS2
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      8224
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001c44 8b25c6eb61
Local Time is:                      Sun Feb 23 19:34:35 2025 EST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     88 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.25W    8.25W       -    0  0  0  0        0       0
 1 +     3.50W    3.50W       -    0  0  0  0        0       0
 2 +     2.60W    2.60W       -    0  0  0  0        0       0
 3 -   0.0250W       -        -    3  3  3  3     5000   10000
 4 -   0.0035W       -        -    4  4  4  4     3900   45700

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        34 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    20,427,281 [10.4 TB]
Data Units Written:                 27,523,884 [14.0 TB]
Host Read Commands:                 308,278,905
Host Write Commands:                722,398,619
Controller Busy Time:               2,230
Power Cycles:                       3,086
Power On Hours:                     1,392
Unsafe Shutdowns:                   173
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
sudo smartctl -aj /dev/nvme0
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.13.2-arch1-1",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-aj",
      "/dev/nvme0"
    ],
    "exit_status": 0
  },
  "local_time": {
    "time_t": 1740357511,
    "asctime": "Sun Feb 23 19:38:31 2025 EST"
  },
  "device": {
    "name": "/dev/nvme0",
    "info_name": "/dev/nvme0",
    "type": "nvme",
    "protocol": "NVMe"
  },
  "model_name": "WD PC SN810 SDCPNRY-1T00-1006",
  "serial_number": "286223861317",
  "firmware_version": "HPS2",
  "nvme_pci_vendor": {
    "id": 5559,
    "subsystem_id": 5559
  },
  "nvme_ieee_oui_identifier": 5920,
  "nvme_total_capacity": 1024209543168,
  "nvme_unallocated_capacity": 0,
  "nvme_controller_id": 8224,
  "nvme_version": {
    "string": "1.4",
    "value": 66560
  },
  "nvme_number_of_namespaces": 1,
  "nvme_namespaces": [
    {
      "id": 1,
      "size": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "capacity": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "utilization": {
        "blocks": 2000409264,
        "bytes": 1024209543168
      },
      "formatted_lba_size": 512,
      "eui64": {
        "oui": 5930,
        "ext_id": 592171146913
      }
    }
  ],
  "user_capacity": {
    "blocks": 2000409264,
    "bytes": 1024209543168
  },
  "logical_block_size": 512,
  "smart_support": {
    "available": true,
    "enabled": true
  },
  "smart_status": {
    "passed": true,
    "nvme": {
      "value": 0
    }
  },
  "nvme_smart_health_information_log": {
    "critical_warning": 0,
    "temperature": 34,
    "available_spare": 100,
    "available_spare_threshold": 5,
    "percentage_used": 0,
    "data_units_read": 20427312,
    "data_units_written": 27524011,
    "host_reads": 308279032,
    "host_writes": 722405653,
    "controller_busy_time": 2230,
    "power_cycles": 3086,
    "power_on_hours": 1392,
    "unsafe_shutdowns": 173,
    "media_errors": 0,
    "num_err_log_entries": 0,
    "warning_temp_time": 0,
    "critical_comp_time": 0
  },
  "temperature": {
    "current": 34
  },
  "power_cycle_count": 3086,
  "power_on_time": {
    "hours": 1392
  },
  "nvme_error_information_log": {
    "size": 256,
    "read": 16,
    "unread": 0
  },
  "nvme_self_test_log": {
    "current_self_test_operation": {
      "value": 0,
      "string": "No self-test in progress"
    }
  }
}
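For the NVMe case, a sketch of decoding the health log from the JSON above and converting data units to bytes. NVMe counts data units in thousands of 512-byte blocks, which is why 20,427,281 units shows up as roughly 10.4 TB in the text output. The `nvmeReport` type and function names are hypothetical, and only a subset of the log's fields is included.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nvmeReport picks a few fields out of nvme_smart_health_information_log.
type nvmeReport struct {
	Log struct {
		Temperature      int    `json:"temperature"`
		PercentageUsed   int    `json:"percentage_used"`
		DataUnitsRead    uint64 `json:"data_units_read"`
		DataUnitsWritten uint64 `json:"data_units_written"`
		PowerOnHours     uint64 `json:"power_on_hours"`
		MediaErrors      uint64 `json:"media_errors"`
	} `json:"nvme_smart_health_information_log"`
}

// One NVMe data unit is 1000 blocks of 512 bytes.
const bytesPerDataUnit = 1000 * 512

// dataUnitsToBytes converts an NVMe data unit count to bytes.
func dataUnitsToBytes(units uint64) uint64 {
	return units * bytesPerDataUnit
}

// parseNVMe decodes the health log from one `smartctl -aj` document.
func parseNVMe(data []byte) (nvmeReport, error) {
	var r nvmeReport
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	// Trimmed from the JSON output above.
	sample := []byte(`{
	  "nvme_smart_health_information_log": {
	    "temperature": 34,
	    "percentage_used": 0,
	    "data_units_read": 20427312,
	    "data_units_written": 27524011,
	    "power_on_hours": 1392,
	    "media_errors": 0
	  }
	}`)
	r, err := parseNVMe(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("read:    %d bytes\n", dataUnitsToBytes(r.Log.DataUnitsRead))
	fmt.Printf("written: %d bytes\n", dataUnitsToBytes(r.Log.DataUnitsWritten))
	fmt.Printf("temp=%dC used=%d%% hours=%d media_errors=%d\n",
		r.Log.Temperature, r.Log.PercentageUsed, r.Log.PowerOnHours, r.Log.MediaErrors)
}
```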

If a drive is unplugged and not in current updates, we'll just keep the record for some predefined time, like a week.

So the data would remain the same as when the drive was unplugged. We could show a 'last updated' time or up/down indicator.

I'll use a scheduled job to delete records that haven't had an update in a week. We could also give users an option to delete the drive themselves.

You can keep the scope of this PR as narrow as you'd like. Just having something working on the agent side is a huge help! I can handle the rest of it no problem.

There's also no rush as I have two other big PRs in the queue as well.
