Better Raspberry Pi server performance #2172

Draft
wants to merge 1 commit into base: master

Conversation

@chewi chewi commented Feb 25, 2024 (Contributor)

Description

Now I reveal what I really want to use Sunshine for. As a server on the Raspberry Pi! Why would I want such a thing? Surely it makes more sense as a client? Normally yes, but when combined with the PiStorm project, things get very interesting.

As you might imagine, PiStorm is very CPU-intensive, so for this to be feasible, Sunshine needs to use as little CPU as possible. The first step here was obviously to get hardware video encoding to work. The Pi does not support VAAPI or CUDA, but fortunately, this still turned out to be very easy.

These initial changes to add a V4L2M2M encoder did not work for me at first, as Sunshine claimed that an IDR frame was not produced. Digging around in the internals, it looked very much to me like requesting IDR frames should work on the Pi. As a shot in the dark, I applied John Cox's ffmpeg patchset for the Raspberry Pi. This patchset, which I recently applied to Gentoo's ffmpeg package, enables efficient zero-copy video playback on the Pi. With this, I have seen 1080p videos go from a stuttery mess to being buttery smooth. Being playback-focused, I really didn't expect it to help, but I was delighted when it suddenly sprang to life!

[2024:02:25:17:15:54]: Info: Found H.264 encoder: h264_v4l2m2m [V4L2M2M]
[2024:02:25:17:15:54]: Info: Executing [Desktop]
[2024:02:25:17:15:54]: Info: CLIENT CONNECTED
[2024:02:25:17:15:54]: Warning: No render device name for: /dev/dri/card1
[2024:02:25:17:15:55]: Error: Couldn't expose some/all drm planes for card: /dev/dri/card0
[2024:02:25:17:15:55]: Info: Screencasting with KMS
[2024:02:25:17:15:55]: Warning: No render device name for: /dev/dri/card1
[2024:02:25:17:15:55]: Info: Found monitor for DRM screencasting
[2024:02:25:17:15:55]: Info: Found connector ID [32]
[2024:02:25:17:15:55]: Info: Found cursor plane [309]
[2024:02:25:17:15:55]: Info: SDR color coding [Rec. 601]
[2024:02:25:17:15:55]: Info: Color depth: 8-bit
[2024:02:25:17:15:55]: Info: Color range: [MPEG]
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160]  <<< v4l2_encode_init: fmt=0/0
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] Using device /dev/video11
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] driver 'bcm2835-codec' on card 'bcm2835-codec-encode' in mplane mode
[2024:02:25:17:15:55]: Info: [h264_v4l2m2m @ 0x7f58002160] requesting formats: output=YU12/yuv420p capture=H264/none

The quality isn't fantastic though, and it's still using 275% CPU. I utilised gprof to find where it's spending all the effort.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 51.88     10.78    10.78                             ff_hscale16to15_X4_neon_asm
 18.48     14.62     3.84                             ff_yuv2planeX_8_neon
 13.47     17.42     2.80   156694     0.00     0.00  bgr32ToUV_half_c
 11.98     19.91     2.49   155935     0.00     0.00  bgr32ToY_c
  0.67     20.05     0.14                             ff_hscale16to15_4_neon_asm
  0.53     20.16     0.11      142     0.00     0.00  std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > > std::__copy_move_a1<false, char const*, std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > >
 >(char const*, char const*, std::back_insert_iterator<std::vector<unsigned char, std::allocator<unsigned char> > >)
  0.43     20.25     0.09      284     0.00     0.02  scale_internal
  0.38     20.33     0.08    94424     0.00     0.00  chr_planar_vscale
  0.29     20.39     0.06    38454     0.00     0.00  chr_convert
  0.29     20.45     0.06    38383     0.00     0.00  chr_h_scale
  0.24     20.50     0.05      577     0.00     0.00  yuv2planeX_8_c
  0.19     20.54     0.04    60422     0.00     0.00  lum_convert
  0.19     20.58     0.04        1     0.04     2.95  video::capture_async(std::shared_ptr<safe::mail_raw_t>, video::config_t&, void*)
  0.14     20.61     0.03     2133     0.00     0.00  lumRangeToJpeg_c
  0.14     20.64     0.03                             _init
  0.10     20.66     0.02   103963     0.00     0.00  lum_planar_vscale
  0.10     20.68     0.02       24     0.00     0.00  alloc_gamma_tbl
  0.05     20.69     0.01   955063     0.00     0.00  av_pix_fmt_desc_get
  0.05     20.70     0.01    59959     0.00     0.00  lum_h_scale
  0.05     20.71     0.01     6502     0.00     0.00  obl_axpy
  0.05     20.72     0.01     2148     0.00     0.00  chrRangeToJpeg_c
  0.05     20.73     0.01     2081     0.00     0.00  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
  0.05     20.74     0.01      483     0.00     0.00  av_frame_unref
  0.05     20.75     0.01      376     0.00     0.00  stream::control_server_t::call(unsigned short, stream::session_t*, std::basic_string_view<char, std::char_traits<char> > const&, bool)
  0.05     20.76     0.01        3     0.00     0.00  video::avcodec_encode_session_t::request_idr_frame()
  0.05     20.77     0.01                             av_bprint_escape
  0.05     20.78     0.01                             ff_hscale16to19_X4_neon_asm
  0.00     20.78     0.00   463475     0.00     0.00  ff_hscale16to15_X4_neon
  0.00     20.78     0.00   103794     0.00     0.00  ff_rotate_slice
  0.00     20.78     0.00    28314     0.00     0.00  av_opt_next
  0.00     20.78     0.00    11496     0.00     0.00  av_bprint_init
  0.00     20.78     0.00     9141     0.00     0.00  av_buffer_unref
  0.00     20.78     0.00     7378     0.00     0.00  glad_gl_get_proc_from_userptr
  0.00     20.78     0.00     7306     0.00     0.00  enet_list_clear
  0.00     20.78     0.00     7184     0.00     0.00  enet_protocol_send_outgoing_commands
  0.00     20.78     0.00     6975     0.00     0.00  enet_time_get
  0.00     20.78     0.00     6812     0.00     0.00  config::whitespace(char)
  0.00     20.78     0.00     6433     0.00     0.00  ff_hscale16to15_4_neon

This is not my area of expertise, but it looks like finding the right format might be the key here. I'd appreciate any help you can provide here. I know that John Cox's patchset adds support for Pi-specific SAND formats, but I don't know whether they are usable in this context.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Dependency update (updates to dependencies)
  • Documentation update (changes to documentation)
  • Repository update (changes to repository files, e.g. .github/...)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated the in code docstring/documentation-blocks for new or existing methods/components

Branch Updates

LizardByte requires that branches be up-to-date before merging. This means that after any PR is merged, this branch must be updated before it can be merged. You must also Allow edits from maintainers.

  • I want maintainers to keep my branch updated

{},
// Fallback options
{},
std::make_optional<encoder_t::option_t>("qp"s, &config::video.qp),
Collaborator

Does the encoder really not support CBR/VBR bitrate control? QP shouldn't be provided if CBR or VBR is available.
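
For reference, a rough sketch of what a bitrate-driven setup looks like at the avcodec level; these are standard AVCodecContext fields, but whether the bcm2835-codec driver honours them is an assumption to verify:

    extern "C" {
    #include <libavcodec/avcodec.h>
    }
    #include <cstdint>

    // Hypothetical sketch: drive the encoder with a target bitrate rather
    // than a fixed QP. Whether bcm2835-codec respects these fields needs
    // to be verified on the Pi.
    static void set_cbr(AVCodecContext *ctx, int bitrate_kbps, int fps) {
      ctx->bit_rate = static_cast<int64_t>(bitrate_kbps) * 1000;  // bits/s
      ctx->rc_max_rate = ctx->bit_rate;  // max == target -> CBR-like behaviour
      ctx->rc_buffer_size = static_cast<int>(ctx->bit_rate / fps);  // ~1 frame of VBV
    }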

Contributor Author

Probably, I was just copying what the others did as a first step. I'll give it a try.

@cgutman cgutman commented Feb 25, 2024 (Collaborator)

As a server on the Raspberry Pi!

This is a Pi 4, I assume? I don't think the Pi 5 has any hardware encoders anymore.

This is not my area of expertise, but it looks like finding the right format might be the key here. I'd appreciate any help you can provide here. I know that John Cox's patchset adds support for Pi-specific SAND formats, but I don't know whether they are usable in this context.

Yeah, it's all in the RGB->YUV color conversion code, which is expected since it's doing all the color conversion on the CPU. I guess it's nice that it's multi-threaded now. You can adjust the "Minimum CPU Thread Count" on the Advanced tab in the UI if you want to play with the amount of concurrency there.

What your encoding pipeline looks like now:
RGB framebuffer DMA-BUF from KMS capture -> import to EGL (eglCreateImage) -> readback from EGL to CPU (glGetTextureSubImage) -> RGB to YUV conversion and scaling (libswscale) -> upload to DMA-BUF again -> encode the DMA-BUF

What you want is more like what we do with VAAPI:
RGB framebuffer DMA-BUF from KMS capture -> import to EGL (eglCreateImage) -> render using color conversion shaders into another DMA-BUF -> pass that DMA-BUF (AV_PIX_FMT_DRM_PRIME) to h264_v4l2m2m.

Most of that pipeline is simple and already written in Sunshine. The tricky part will be getting that second DMA-BUF to write into and/or exporting the render target as a DMA-BUF. Since there's no standard way to create a DMA-BUF, that part tends to be highly API-specific. For VAAPI, we import the underlying DMA-BUF of the VA surface as the render target for our color conversion. For CUDA, we create a blank texture to use as the render target and use the CUDA-GL interop APIs to import that texture as a CUDA resource for NVENC to read.
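
The import step at the front of both pipelines is the standard EGL_EXT_image_dma_buf_import path; a minimal single-plane sketch, where fd, dimensions and stride are placeholders that the KMS capture side would supply:

    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <drm_fourcc.h>  // libdrm

    // Sketch: import a single-plane RGB DMA-BUF (e.g. from KMS capture)
    // as an EGLImage. fd, width, height and stride are placeholders.
    EGLImage import_dmabuf(EGLDisplay display, int fd,
                           int width, int height, int stride) {
      const EGLAttrib attribs[] = {
        EGL_WIDTH, width,
        EGL_HEIGHT, height,
        EGL_LINUX_DRM_FOURCC_EXT, DRM_FORMAT_ARGB8888,
        EGL_DMA_BUF_PLANE0_FD_EXT, fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
        EGL_DMA_BUF_PLANE0_PITCH_EXT, stride,
        EGL_NONE,
      };
      return eglCreateImage(display, EGL_NO_CONTEXT,
                            EGL_LINUX_DMA_BUF_EXT, nullptr, attribs);
    }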

Where to start is probably writing something like this for AV_HWDEVICE_TYPE_DRM and using that in your encoder.
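
A rough sketch of creating that device context, assuming the usual render node path (the device probing would need the same care as the existing VAAPI path):

    extern "C" {
    #include <libavutil/hwcontext.h>
    }

    // Sketch: create the AV_HWDEVICE_TYPE_DRM device context to hand to
    // the encoder. The render node path here is an assumption.
    AVBufferRef *hw_device_ctx = nullptr;
    if (av_hwdevice_ctx_create(&hw_device_ctx, AV_HWDEVICE_TYPE_DRM,
                               "/dev/dri/renderD128", nullptr, 0) < 0) {
      // fall back to the software encoding path
    }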

Then for your encoder definition you probably want something like this:

    std::make_unique<encoder_platform_formats_avcodec>(
      AV_HWDEVICE_TYPE_DRM, AV_HWDEVICE_TYPE_NONE,
      AV_PIX_FMT_DRM_PRIME,
      AV_PIX_FMT_NV12, AV_PIX_FMT_P010,
      drm_init_avcodec_hardware_input_buffer),

Since FFmpeg's hwcontext_drm.c doesn't support frame allocation, you'll need to figure out how to do that and provide a buffer pool for frame allocation.
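
Whichever allocator ends up backing that pool, each buffer ultimately has to be described in the layout that AV_PIX_FMT_DRM_PRIME frames carry. A hypothetical NV12 example, where the fd, size and pitch come from the still-open allocation question:

    extern "C" {
    #include <libavutil/hwcontext_drm.h>
    }
    #include <drm_fourcc.h>  // libdrm

    // Rough sketch: describe an already-allocated NV12 DMA-BUF in the
    // form AV_PIX_FMT_DRM_PRIME frames carry. Where the fd comes from
    // (GBM, a DRM dumb buffer, V4L2 buffer export) is the open question.
    AVDRMFrameDescriptor describe_nv12(int fd, size_t size,
                                       int height, int pitch) {
      AVDRMFrameDescriptor desc = {};
      desc.nb_objects = 1;
      desc.objects[0].fd = fd;
      desc.objects[0].size = size;
      desc.nb_layers = 1;
      desc.layers[0].format = DRM_FORMAT_NV12;
      desc.layers[0].nb_planes = 2;
      desc.layers[0].planes[0].object_index = 0;  // Y plane at the start
      desc.layers[0].planes[0].offset = 0;
      desc.layers[0].planes[0].pitch = pitch;
      desc.layers[0].planes[1].object_index = 0;  // interleaved UV after Y
      desc.layers[0].planes[1].offset = static_cast<ptrdiff_t>(pitch) * height;
      desc.layers[0].planes[1].pitch = pitch;
      return desc;
    }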

Finally, for the encoding side, you'll want to do something similar to what I did in 8182f59 for supporting KMS->GL->CUDA with gl_cuda_vram_t and make_avcodec_gl_encode_device.

@chewi chewi commented Feb 25, 2024 (Contributor Author)

Many thanks for the detailed reply. Sounds like this could be an interesting exercise. I may be wrong, but I think playback scenarios have managed to avoid GL altogether. What Kodi calls Direct to Plane and mpv calls HW-overlay? Is that not possible here?

@cgutman cgutman commented Feb 25, 2024 (Collaborator)

I think that color conversion hardware is only accessible on the scanout path (and it's YUV->RGB, not RGB->YUV). Some encoders do have the ability to accept RGB frames and perform the conversion to YUV internally (using dedicated hardware or a shader), but I don't think the Pi's encoder supports RGB input.

@ReenigneArcher ReenigneArcher added this to the adjust lint rules milestone Feb 28, 2024
@ReenigneArcher ReenigneArcher left a comment (Member)

I'm not familiar with the details of the encoder, but the following will also need to be updated.

  • `encoder <https://localhost:47990/config/#encoder>`__
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    **Description**
    Force a specific encoder.

    **Choices**

    .. table::
       :widths: auto

       ========= =============================
       Value     Description
       ========= =============================
       nvenc     For NVIDIA graphics cards
       quicksync For Intel graphics cards
       amdvce    For AMD graphics cards
       vaapi     Use Linux VA-API (AMD, Intel)
       software  Encoding occurs on the CPU
       ========= =============================
  • <template #linux>
    <option value="nvenc">NVIDIA NVENC</option>
    <option value="vaapi">VA-API</option>
    </template>
  • std::make_tuple(video::vaapi.name, &video::vaapi),

For the final 2 bullet points, we probably need a way to detect if running on raspberry pi. This may give some hints: https://stackoverflow.com/questions/70395696/predefined-macro-to-determine-if-running-on-a-raspberry
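
Going by that thread, a runtime check could read the device-tree model string; a sketch:

    #include <fstream>
    #include <iterator>
    #include <string>

    // Sketch of a runtime Raspberry Pi check, per the linked thread: the
    // device-tree model string identifies the board on Raspberry Pi OS.
    bool is_raspberry_pi() {
      std::ifstream model("/proc/device-tree/model");
      std::string contents{std::istreambuf_iterator<char>(model),
                           std::istreambuf_iterator<char>()};
      return contents.find("Raspberry Pi") != std::string::npos;
    }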

Lastly, do we need to make any changes to our ffmpeg build-deps repo? Edit: I think it's already enabled via: https://github.com/LizardByte/build-deps/blob/1e16ab273175976a4623e248fde89ff20b549a1f/.github/workflows/build-ffmpeg.yml#L39

@chewi chewi commented Jun 16, 2024 (Contributor Author)

Thanks for those pointers.

My initial PoC seemingly needed John Cox's ffmpeg patchset for the Raspberry Pi. It's a rather heavy patchset, but Gentoo isn't the only party invested in keeping it updated. John does a good job by himself anyway. Whether it will be needed in the end will depend on the architecture we go for.

I did spend quite a long time looking into this after cgutman gave me some pointers. I was really struggling with the DMA-BUF part of it, as v4l2m2m seems to work quite differently to VAAPI.

I also considered doing it a different way, using the Pi's ISP for the pixel format conversion. ffmpeg has some support for it already. This might be simpler and even more efficient, but it would also be Pi-specific. v4l2m2m seems preferable, as it is supported by many SoCs.

It's been a while since I had time to work on this. It's something I'd really like to do, but Gentoo maintenance usually takes priority.

@ReenigneArcher (Member)

Understood. I will convert this to a draft for now; whenever you are ready, feel free to mark it as ready for review again.

@ReenigneArcher ReenigneArcher marked this pull request as draft June 16, 2024 14:25
@ReenigneArcher ReenigneArcher removed this from the adjust lint rules milestone Jul 8, 2024