What is your release cycle? #91

Closed
alanzhai219 opened this issue Jan 27, 2025 · 9 comments

@alanzhai219

You haven't released any binary in a very long time.
Will this project be dropped? If not, when will you publish the next release?

@eero-t

eero-t commented Feb 4, 2025

I'm not a project developer, just another user, but XPUM officially supports (= is tested with) only Intel Data Center GPUs, i.e. Flex + PVC. There have been no new Data Center GPUs for a while, only client GPU ones.

@alanzhai219
Author

I'm not a project developer, just another user, but XPUM officially supports (= is tested with) only Intel Data Center GPUs, i.e. Flex + PVC. There have been no new Data Center GPUs for a while, only client GPU ones.

Yes. I have a B580. But the software stack doesn't support this dGPU very well.

@eero-t

eero-t commented Feb 7, 2025

Yes. I have a B580. But the software stack doesn't support this dGPU very well.

The kernel has had Battlemage support enabled from v6.12 onward, with 6.13 providing further improvements.

User-space 3D (Mesa 24.2.2), media, and compute (compute-runtime 24.48.31907.7) driver releases got production Battlemage support at the end of 2024, i.e. they haven't been in wider testing for that long yet. Successive releases fix issues that are found only through wider testing, but it takes some time for those to get reported to the driver projects and for distros to upgrade to the fix releases.

By the time Ubuntu 25.04, Debian Trixie etc are released later this year, things should be better.
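
A rough way to check which versions of those components you're currently on, using Debian/Ubuntu as an example (the package names below are the common ones and only an assumption; they may differ per distro or driver repo):

$ uname -r                                          # kernel (i915 / Xe KMD)
$ dpkg -l | grep -E 'libgl1-mesa-dri|mesa-vulkan'   # Mesa 3D drivers
$ dpkg -l | grep -E 'intel-media-va-driver|intel-opencl-icd|libze'   # media / compute stack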

Please make sure that your issues are reported to the upstream driver projects!
(after checking that the issue is still there with the latest release)

@alanzhai219
Author

alanzhai219 commented Feb 13, 2025

@eero-t Do you know of any tools for monitoring BMG status? Thanks.
I can't understand why they don't want to provide a tool like nvidia-smi.
I posted that suggestion previously, but it was forcibly closed. Please review customers' needs.

@eero-t

eero-t commented Feb 13, 2025

@eero-t Do you know of any tools for monitoring BMG status? Thanks.

Some basic metrics you can get from the gputop tool included in the latest 1.30 release of intel-gpu-tools: https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/NEWS

Rolling distros like Debian Testing, Arch etc. should include a new enough version of it.
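
For example, on a distro that already ships intel-gpu-tools 1.30 or newer, something like this should be enough (the package name is the usual one, and both tools typically need root to read the engine stats):

$ sudo apt install intel-gpu-tools
$ sudo intel_gpu_top    # older tool, i915 KMD only
$ sudo gputop           # newer tool, i915 & Xe KMD (igt 1.30+)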

Support for the PMU stats needed for engine-utilization metrics with the Xe KMD is not yet in the upstream kernel, though. It will take some time until it arrives there.

As for metrics other than the ones listed by gputop...

The zello_sysman test tool in the Intel driver repo can be used to check which oneAPI Level-Zero metrics are available with the FW, KMD, and user-space compute driver versions you're currently using; see: #26 (comment)

I can't understand why they don't want to provide a tool like nvidia-smi.

xpu-smi (from this project) is the Intel alternative to that. It does not yet support Xe KMD based devices (#88), but it should work with i915 KMD ones, although it's only validated for the current Data Center GPUs (Flex and PVC). It would need new releases, though (#89).
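
For reference, typical xpu-smi usage on a supported (i915) device looks roughly like this; the subcommands below are taken from the smi_user_guide.md in this repo, so double-check them against the version you have installed:

$ xpu-smi discovery              # list detected GPU devices
$ xpu-smi stats -d 0             # show statistics for device 0
$ xpu-smi dump -d 0 -m 0,1,2     # continuously dump the selected metric IDs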


PS. One more option for metrics tracking is the gpu_sysman plugin for collectd 6.0-rc: https://github.com/collectd/collectd/releases

On Debian/Ubuntu (with new enough drivers to support BMG) you can build it like this (untested):

$ sudo apt install gcc g++ flex bison make patch \
  autoconf automake libtool pkg-config libze-dev libmicrohttpd-dev
$ ./build.sh
$ ./configure --prefix=/usr/local/ \
  --disable-all-plugins --enable-write_prometheus  \
  --with-sysman=yes --enable-gpu_sysman \
  --with-perl-bindings=n
$ make "-j$(nproc)" install
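
As a quick sanity check (assuming collectd's usual plugin directory layout under the chosen prefix), you can verify that the plugins actually got built and installed:

$ ls /usr/local/lib/collectd/ | grep -E 'gpu_sysman|write_prometheus'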

Save a suitable collectd config to a file (this one enables the Sysman input and Prometheus output plugins):

Interval 10

LoadPlugin write_prometheus
<LoadPlugin gpu_sysman>
  Interval 4
</LoadPlugin>

<Plugin write_prometheus>
  Port 8888
</Plugin>

<Plugin gpu_sysman>
  LogGpuInfo true
  LogMetrics true
</Plugin>

And run collectd with it like this:

$ collectd -f -C <configfile>

(If it complains about there being no devices, you may be missing an L0 backend: sudo apt install libze-intel-gpu1.)

The LogMetrics option outputs metrics to the console in addition to them being available through the Prometheus endpoint. For more info, search for the plugin name here: https://github.com/collectd/collectd/blob/collectd-6.0/src/collectd.conf.pod
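
Once collectd is running, you can also verify that metrics come out over HTTP; write_prometheus serves them on the configured port (the /metrics path assumed here is the usual Prometheus convention):

$ curl -s http://localhost:8888/metrics | grep -i gpu | head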

@alanzhai219
Author

@eero-t Thank you very much. The above tools are not from the official Intel software stack, right?

@eero-t

eero-t commented Feb 14, 2025

AFAIK these should be the officially supported Intel tools, but where you get them differs [1]:

  • intel-gpu-tools includes tests and tools for Intel KMDs, and I guess nowadays also for some other DRM drivers. Distros include it in their standard package repos, as it includes tools like intel_gpu_top for i915 and gputop for that & xe.
  • XPUM / xpu-smi (this project) is the current official maintenance & telemetry tool for Intel Data Center GPUs (which are officially supported only with the out-of-tree i915 DKMS driver). It's understandable why it's not in distros, despite being Open Source.
  • zello_sysman is part of Intel's official compute / L0 / oneAPI driver sources. It's not included in binary releases, but I have a ticket about that: Provide "zello_sysman" tool with binary releases (compute-runtime#787).
  • Intel VTune is an extensive CPU & GPU profiler, but not Open Source: https://www.intel.com/content/www/us/en/docs/vtune-profiler/get-started-guide/2024-0/overview.html

There are quite a few unofficial projects, with different degrees of HW support and liveness:

[1] People should get tools through distros, unless there's a good reason not to. intel-gpu-tools & renderdoc are included in most main distros, perfetto in some of them, the rest are not.

@alanzhai219
Author

Very disappointed. There's no tool like nvidia-smi. Copying nvidia is not too hard if there's no better solution, I think.

@eero-t

eero-t commented Feb 24, 2025

Very disappointed. There's no tool like nvidia-smi. Copying nvidia is not too hard if there's no better solution, I think.

While I've occasionally used xpu-smi (on Arc & Flex), I've never used nvidia-smi. Out of curiosity, are there a lot of features missing from xpu-smi compared to nvidia-smi: https://github.com/intel/xpumanager/blob/master/doc/smi_user_guide.md ?
