Better Raspberry Pi server performance #2172

Draft
wants to merge 1 commit into base: master

Conversation

@chewi chewi commented Feb 25, 2024 (Contributor)

Description

Now I reveal what I really want to use Sunshine for. As a server on the Raspberry Pi! Why would I want such a thing? Surely it makes more sense as a client? Normally yes, but when combined with the PiStorm project, things get very interesting.

As you might imagine, PiStorm is very CPU-intensive, so for this to be feasible, Sunshine needs to use as little CPU as possible. The first step here was obviously to get hardware video encoding to work. The Pi does not support VAAPI or CUDA, but fortunately, this still turned out to be very easy.

These initial changes to add a V4L2M2M encoder did not work for me at first, as Sunshine claimed that an IDR frame was not produced. Digging around in the internals, it looked very much to me like requesting IDR frames should work on the Pi. As a shot in the dark, I applied John Cox's ffmpeg patchset for the Raspberry Pi. This patchset, which I recently applied to Gentoo's ffmpeg package, enables efficient zero-copy video playback on the Pi. With this, I have seen 1080p videos go from a stuttery mess to being buttery smooth. Being playback-focused, I really didn't expect it to help, but I was delighted when it suddenly sprang to life!

[2024:02:25:17:15:54]: Info: Found H.264 encoder: h264_v4l2m2m [V4L2M2M]
[2024:02:25:17:15:54]: Info: Executing [Desktop]
[2024:02:25:17:15:54]: Info: CLIENT CONNECTED
[2024:02:25:17:15:54]: Warning: No render device name for: /dev/dri/card1
[2024:02:25:17:15:55]: Error: Couldn't expose some/all drm planes for card: /dev/dri/card0
[2024:02:25:17:15:55]: Info: Screencasting with KMS
[2024:02:25:17:15:55]: Warning: No render device name for: /dev/dri/card1
[2024:02:25:17:15:55]: Info: Found monitor for DRM screencasting
[2024:02:25:17:15:55]: Info: Found connector ID [32]
[2024:02:25:17:15:55]: Info: Found cursor plane [309]
[2024:02:25:17:15:55]: Info: SDR color coding [Rec. 601]
[2024:02:25:17:15:55]: Info: Color depth: 8-bit
[2024:02:25:17:15:55]: Info: Color range: [MPEG]
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160]  <<< v4l2_encode_init: fmt=0/0
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] Using device /dev/video11
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] driver 'bcm2835-codec' on card 'bcm2835-codec-encode' in mplane mode
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] requesting formats: output=YU12/yuv420p capture=H264/none

The quality isn't fantastic though, and it's still using 275% CPU. I utilised gprof to find where it's spending all the effort.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 51.88     10.78    10.78                             ff_hscale16to15_X4_neon_asm
 18.48     14.62     3.84                             ff_yuv2planeX_8_neon
 13.47     17.42     2.80   156694     0.00     0.00  bgr32ToUV_half_c
 11.98     19.91     2.49   155935     0.00     0.00  bgr32ToY_c
  0.67     20.05     0.14                             ff_hscale16to15_4_neon_asm
  0.53     20.16     0.11      142     0.00     0.00  std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > > std::__copy_move_a1<false, char const*, std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > >
 >(char const*, char const*, std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > >)
  0.43     20.25     0.09      284     0.00     0.02  scale_internal
  0.38     20.33     0.08    94424     0.00     0.00  chr_planar_vscale
  0.29     20.39     0.06    38454     0.00     0.00  chr_convert
  0.29     20.45     0.06    38383     0.00     0.00  chr_h_scale
  0.24     20.50     0.05      577     0.00     0.00  yuv2planeX_8_c
  0.19     20.54     0.04    60422     0.00     0.00  lum_convert
  0.19     20.58     0.04        1     0.04     2.95  video::capture_async(std::shared_ptr<safe::mail_raw_t>, video::config_t&, void*)
  0.14     20.61     0.03     2133     0.00     0.00  lumRangeToJpeg_c
  0.14     20.64     0.03                             _init
  0.10     20.66     0.02   103963     0.00     0.00  lum_planar_vscale
  0.10     20.68     0.02       24     0.00     0.00  alloc_gamma_tbl
  0.05     20.69     0.01   955063     0.00     0.00  av_pix_fmt_desc_get
  0.05     20.70     0.01    59959     0.00     0.00  lum_h_scale
  0.05     20.71     0.01     6502     0.00     0.00  obl_axpy
  0.05     20.72     0.01     2148     0.00     0.00  chrRangeToJpeg_c
  0.05     20.73     0.01     2081     0.00     0.00  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
  0.05     20.74     0.01      483     0.00     0.00  av_frame_unref
  0.05     20.75     0.01      376     0.00     0.00  stream::control_server_t::call(unsigned short, stream::session_t*, std::basic_string_view<char, std::char_traits<char> > const&, bool)
  0.05     20.76     0.01        3     0.00     0.00  video::avcodec_encode_session_t::request_idr_frame()
  0.05     20.77     0.01                             av_bprint_escape
  0.05     20.78     0.01                             ff_hscale16to19_X4_neon_asm
  0.00     20.78     0.00   463475     0.00     0.00  ff_hscale16to15_X4_neon
  0.00     20.78     0.00   103794     0.00     0.00  ff_rotate_slice
  0.00     20.78     0.00    28314     0.00     0.00  av_opt_next
  0.00     20.78     0.00    11496     0.00     0.00  av_bprint_init
  0.00     20.78     0.00     9141     0.00     0.00  av_buffer_unref
  0.00     20.78     0.00     7378     0.00     0.00  glad_gl_get_proc_from_userptr
  0.00     20.78     0.00     7306     0.00     0.00  enet_list_clear
  0.00     20.78     0.00     7184     0.00     0.00  enet_protocol_send_outgoing_commands
  0.00     20.78     0.00     6975     0.00     0.00  enet_time_get
  0.00     20.78     0.00     6812     0.00     0.00  config::whitespace(char)
  0.00     20.78     0.00     6433     0.00     0.00  ff_hscale16to15_4_neon

This is not my area of expertise, but it looks like finding the right format might be the key here. I'd appreciate any help you can provide here. I know that John Cox's patchset adds support for Pi-specific SAND formats, but I don't know whether they are usable in this context.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Dependency update (updates to dependencies)
  • Documentation update (changes to documentation)
  • Repository update (changes to repository files, e.g. .github/...)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated the in code docstring/documentation-blocks for new or existing methods/components

Branch Updates

LizardByte requires that branches be up-to-date before merging. This means that after any PR is merged, this branch must be updated before it can be merged. You must also Allow edits from maintainers.

  • I want maintainers to keep my branch updated

{},
// Fallback options
{},
std::make_optional<encoder_t::option_t>("qp"s, &config::video.qp),
Collaborator

Does the encoder really not support CBR/VBR bitrate control? QP shouldn't be provided if CBR or VBR is available.
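
For reference, a rough sketch of what a bitrate-driven setup looks like at the avcodec level; these are standard AVCodecContext fields, but whether the bcm2835-codec driver honours them is an assumption to verify:

    extern "C" {
    #include <libavcodec/avcodec.h>
    }
    #include <cstdint>

    // Hypothetical sketch: drive the encoder with a target bitrate rather
    // than a fixed QP. Whether bcm2835-codec respects these fields needs
    // to be verified on the Pi.
    static void set_cbr(AVCodecContext *ctx, int bitrate_kbps, int fps) {
      ctx->bit_rate = static_cast<int64_t>(bitrate_kbps) * 1000;  // bits/s
      ctx->rc_max_rate = ctx->bit_rate;  // max == target -> CBR-like behaviour
      ctx->rc_buffer_size = static_cast<int>(ctx->bit_rate / fps);  // ~1 frame of VBV
    }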

Contributor Author

Probably, I was just copying what the others did as a first step. I'll give it a try.

@cgutman cgutman commented Feb 25, 2024 (Collaborator)

As a server on the Raspberry Pi!

This is a Pi 4, I assume? I don't think the Pi 5 has any hardware encoders anymore.

This is not my area of expertise, but it looks like finding the right format might be the key here. I'd appreciate any help you can provide here. I know that John Cox's patchset adds support for Pi-specific SAND formats, but I don't know whether they are usable in this context.

Yeah, it's all in the RGB->YUV color conversion code, which is expected since it's doing all the color conversion on the CPU. I guess it's nice that it's multi-threaded now. You can adjust the "Minimum CPU Thread Count" on the Advanced tab in the UI if you want to play with the amount of concurrency there.

What your encoding pipeline looks like now:
RGB framebuffer DMA-BUF from KMS capture -> import to EGL (eglCreateImage) -> readback from EGL to CPU (glGetTextureSubImage) -> RGB to YUV conversion and scaling (libswscale) -> upload to DMA-BUF again -> encode the DMA-BUF

What you want is more like what we do with VAAPI:
RGB framebuffer DMA-BUF from KMS capture -> import to EGL (eglCreateImage) -> render using color conversion shaders into another DMA-BUF -> pass that DMA-BUF (AV_PIX_FMT_DRM_PRIME) to h264_v4l2m2m.

Most of that pipeline is simple and already written in Sunshine. The tricky part will be getting that second DMA-BUF to write into and/or exporting the render target as a DMA-BUF. Since there's no standard way to create a DMA-BUF, that part tends to be highly API-specific. For VAAPI, we import the underlying DMA-BUF of the VA surface as the render target for our color conversion. For CUDA, we create a blank texture to use as the render target and use the CUDA-GL interop APIs to import that texture as a CUDA resource for NVENC to read.
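
The import step at the front of both pipelines is the standard EGL_EXT_image_dma_buf_import path; a minimal single-plane sketch, where fd, dimensions and stride are placeholders that the KMS capture side would supply:

    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <drm_fourcc.h>  // libdrm

    // Sketch: import a single-plane RGB DMA-BUF (e.g. from KMS capture)
    // as an EGLImage. fd, width, height and stride are placeholders.
    EGLImage import_dmabuf(EGLDisplay display, int fd,
                           int width, int height, int stride) {
      const EGLAttrib attribs[] = {
        EGL_WIDTH, width,
        EGL_HEIGHT, height,
        EGL_LINUX_DRM_FOURCC_EXT, DRM_FORMAT_ARGB8888,
        EGL_DMA_BUF_PLANE0_FD_EXT, fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
        EGL_DMA_BUF_PLANE0_PITCH_EXT, stride,
        EGL_NONE,
      };
      return eglCreateImage(display, EGL_NO_CONTEXT,
                            EGL_LINUX_DMA_BUF_EXT, nullptr, attribs);
    }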

Where to start is probably writing something like this for AV_HWDEVICE_TYPE_DRM and using that in your encoder.
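
A rough sketch of creating that device context, assuming the usual render node path (the device probing would need the same care as the existing VAAPI path):

    extern "C" {
    #include <libavutil/hwcontext.h>
    }

    // Sketch: create the AV_HWDEVICE_TYPE_DRM device context to hand to
    // the encoder. The render node path here is an assumption.
    AVBufferRef *hw_device_ctx = nullptr;
    if (av_hwdevice_ctx_create(&hw_device_ctx, AV_HWDEVICE_TYPE_DRM,
                               "/dev/dri/renderD128", nullptr, 0) < 0) {
      // fall back to the software encoding path
    }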

Then for your encoder definition you probably want something like this:

    std::make_unique<encoder_platform_formats_avcodec>(
      AV_HWDEVICE_TYPE_DRM, AV_HWDEVICE_TYPE_NONE,
      AV_PIX_FMT_DRM_PRIME,
      AV_PIX_FMT_NV12, AV_PIX_FMT_P010,
      drm_init_avcodec_hardware_input_buffer),

Since FFmpeg's hwcontext_drm.c doesn't support frame allocation, you'll need to figure out how to do that and provide a buffer pool for frame allocation.
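
Whichever allocator ends up backing that pool, each buffer ultimately has to be described in the layout that AV_PIX_FMT_DRM_PRIME frames carry. A hypothetical NV12 example, where the fd, size and pitch come from the still-open allocation question:

    extern "C" {
    #include <libavutil/hwcontext_drm.h>
    }
    #include <drm_fourcc.h>  // libdrm

    // Rough sketch: describe an already-allocated NV12 DMA-BUF in the
    // form AV_PIX_FMT_DRM_PRIME frames carry. Where the fd comes from
    // (GBM, a DRM dumb buffer, V4L2 buffer export) is the open question.
    AVDRMFrameDescriptor describe_nv12(int fd, size_t size,
                                       int height, int pitch) {
      AVDRMFrameDescriptor desc = {};
      desc.nb_objects = 1;
      desc.objects[0].fd = fd;
      desc.objects[0].size = size;
      desc.nb_layers = 1;
      desc.layers[0].format = DRM_FORMAT_NV12;
      desc.layers[0].nb_planes = 2;
      desc.layers[0].planes[0].object_index = 0;  // Y plane at the start
      desc.layers[0].planes[0].offset = 0;
      desc.layers[0].planes[0].pitch = pitch;
      desc.layers[0].planes[1].object_index = 0;  // interleaved UV after Y
      desc.layers[0].planes[1].offset = static_cast<ptrdiff_t>(pitch) * height;
      desc.layers[0].planes[1].pitch = pitch;
      return desc;
    }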

Finally, for the encoding side, you'll want to do something similar to what I did in 8182f59 for supporting KMS->GL->CUDA with gl_cuda_vram_t and make_avcodec_gl_encode_device.

@chewi chewi commented Feb 25, 2024 (Contributor Author)

Many thanks for the detailed reply. Sounds like this could be an interesting exercise. I may be wrong, but I think playback scenarios have managed to avoid GL altogether. What Kodi calls Direct to Plane and mpv calls HW-overlay? Is that not possible here?

@cgutman cgutman commented Feb 25, 2024 (Collaborator)

I think that color conversion hardware is only accessible on the scanout path (and it's YUV->RGB, not RGB->YUV). Some encoders do have the ability to accept RGB frames and perform the conversion to YUV internally (using dedicated hardware or a shader), but I don't think the Pi's encoder supports RGB input.

@ReenigneArcher ReenigneArcher added this to the adjust lint rules milestone Feb 28, 2024
@ReenigneArcher ReenigneArcher left a comment (Member)

I'm not familiar with the details of the encoder, but the following will also need to be updated.

  • `encoder <https://localhost:47990/config/#encoder>`__
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    **Description**
    Force a specific encoder.

    **Choices**

    .. table::
       :widths: auto

       ========= =============================
       Value     Description
       ========= =============================
       nvenc     For NVIDIA graphics cards
       quicksync For Intel graphics cards
       amdvce    For AMD graphics cards
       vaapi     Use Linux VA-API (AMD, Intel)
       software  Encoding occurs on the CPU
       ========= =============================
  • <template #linux>
    <option value="nvenc">NVIDIA NVENC</option>
    <option value="vaapi">VA-API</option>
    </template>
  • std::make_tuple(video::vaapi.name, &video::vaapi),

For the final 2 bullet points, we probably need a way to detect if running on raspberry pi. This may give some hints: https://stackoverflow.com/questions/70395696/predefined-macro-to-determine-if-running-on-a-raspberry
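
Going by that thread, a runtime check could read the device-tree model string; a sketch:

    #include <fstream>
    #include <iterator>
    #include <string>

    // Sketch of a runtime Raspberry Pi check, per the linked thread: the
    // device-tree model string identifies the board on Raspberry Pi OS.
    bool is_raspberry_pi() {
      std::ifstream model("/proc/device-tree/model");
      std::string contents{std::istreambuf_iterator<char>(model),
                           std::istreambuf_iterator<char>()};
      return contents.find("Raspberry Pi") != std::string::npos;
    }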

Lastly, do we need to make any changes to our ffmpeg build-deps repo? Edit: I think it's already enabled via: https://github.com/LizardByte/build-deps/blob/1e16ab273175976a4623e248fde89ff20b549a1f/.github/workflows/build-ffmpeg.yml#L39

@chewi chewi commented Jun 16, 2024 (Contributor Author)

Thanks for those pointers.

My initial PoC seemingly needed John Cox's ffmpeg patchset for the Raspberry Pi. It's a rather heavy patchset, but Gentoo isn't the only party invested in keeping it updated. John does a good job by himself anyway. Whether it will be needed in the end will depend on the architecture we go for.

I did spend quite a long time looking into this after cgutman gave me some pointers. I was really struggling with the DMA-BUF part of it, as v4l2m2m seems to work quite differently to VAAPI.

I also considered doing it a different way, using the Pi's ISP for the pixel format conversion. ffmpeg has some support for it already. This might be simpler and even more efficient, but it would also be Pi-specific. v4l2m2m seems preferable, as it is supported by many SoCs.

It's been a while since I had time to work on this. It's something I'd really like to do, but Gentoo maintenance usually takes priority.

@ReenigneArcher (Member)

Understood. I will convert this to a draft for now; whenever you are ready, feel free to mark it as ready for review again.

@ReenigneArcher ReenigneArcher marked this pull request as draft June 16, 2024 14:25
@ReenigneArcher ReenigneArcher removed this from the adjust lint rules milestone Jul 8, 2024