Skip to content

Support encoding into a bytes tensor #635

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Apr 10, 2025

This PR adds support for encoding audio samples into a bytes (uint8) tensor. We allocate the output tensor to 10MB, and double the allocated size up to 320MB if needed (after which we error out).

We rely on a new AVIOToTensorContext class, which inherits from AVIOContextHolder.

The C++ Encoder class now takes two optional and mutually exclusive parameters: fileName and format. fileName is used when encoding to a file, while format is used when encoding to a tensor.

The core API has changed: we don't expose an Encoder object anymore. We only expose two stateless functions:

  • encode_audio_to_file(filename : str, ...)
  • encoded_tensor = encode_audio_to_tensor(format : str, ...)

I suspect that the public Python API will look similar to this, but final design is still WIP.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 10, 2025
@@ -46,11 +46,13 @@ class AVIOContextHolder {

// These signatures are defined by FFmpeg.
using AVIOReadFunction = int (*)(void*, uint8_t*, int);
using AVIOWriteFunction = int (*)(void*, const uint8_t*, int);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we define the buffer parameter as const uint8_t*, which is how it's defined starting from FFmpeg 7. Before that, it was defined as uint8_t*. That's why we now have to wrap the call to avio_alloc_context() into our own AVIOAllocContext() in FFMPEGCommon.cpp

@@ -59,8 +59,9 @@ function(make_torchcodec_libraries
set(decoder_library_name "libtorchcodec_decoder${ffmpeg_major_version}")
set(decoder_sources
AVIOContextHolder.cpp
AVIOBytesContext.cpp
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encoder.[cpp, h] rely on AVIOToTensorContext, and specifically on the AVIOToTensorContext::getOutputTensor() method. They don't rely on the AVIOContextHolder base class, like the decoder. For this reason we have to add AVIOBytesContext.cpp to the source dependency here.

Alternatively, I think we could make getOutputTensor a virtual method of the base class? But this method wouldn't make much sense for the existing child classes (like AVIOBytesContext), so this doesn't sounds like a great OOP design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on not making getOutputTensor() virtual on the base class, since it's only applicable to one of the derived classes. We could potentially pull AVIOToTensorContext out of AVIOBytesContext.[h|cpp] to limit what gets put into libtorchcodec_decoderN.so.

I think the issue here is that the Encoder class needs to call getOutputTensor(), which means that it must actually hold a reference to an actual AVIOToTensorContext rather than just the base AVIOContextHolder. (Which is what SingleStreamDecoder does.) A potential way around this is for AVIOToTensorContext to accept a tensor rather than creating its own. The caller, however, would still need to keep track of how many bytes it asked it to encode in order to do the final narrow on the tensor, so that's not necessarily any cleaner. Your call on what you think makes the most sense.

Comment on lines +82 to +83
std::optional<std::string_view> fileName,
std::optional<std::string_view> formatName,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We enforce that fileName and formatName parameters are mutually exclusive. Happy to consider alternative designs. Creating separate constructors seemed more complex and doesn't provide much value IMHO.

Note that this only exists at the C++ constructor level. The custom ops expose two distinct entry-points: encode_audio_to_file() and encode_audio_to_tensor(). I suspect that the public Python API will look similar to this, but that's still up for discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should follow the same pattern set out in SingleStreamDecoder: one constructor that takes the file name, and another constructor that takes the AVIOToTensorContext as an argument. That means it's the callers responsibility to create the AVIOToTensorContext object; that's how it currently works in SingleStreamDecoder. That would also require pulling all common initialization into some initializeEncoder() function that both constructors would call.

The value, to me, is that it's more clear when creating the object that you're doing it correctly, and then when reading the initialization code, it's more clear what's required for, and correct for, each path.

TORCH_CHECK(
(seek != nullptr) && ((write != nullptr) ^ (read != nullptr)),
"seek method must be defined, and either write or read must be defined. "
"But not both!")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may relax the mutual-exclusivity check above eventually, if we implement both write and read within the same class. For now, mutual-exclusivity is assumed and enforced, because we use the existence of write to set the write_flag below.

void* opaque,
int (*read_packet)(void* opaque, uint8_t* buf, int buf_size),
int (*write_packet)(void* opaque, const uint8_t* buf, int buf_size),
int64_t (*seek)(void* opaque, int64_t offset, int whence));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move the convenience function type aliases of AVIO*Function to here, and then use those in all of our interfaces. That should be easier to grep for, and also easier to tell at a glance that the signatures are the same in different places.

Also, since AVIOAllocContext is a function and not a type, I think we should name it avioAllocContext(). Otherwise I think readers will assume we're instantiating an object of type AVIOAllocContext until they look at the declaration.

Copy link
Member Author

@NicolasHug NicolasHug Apr 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - I assume we can't use the type aliases in the child classes member declarations / definition? I.e. in the class AVIOBytesContext header, we still have to use verbose

static int read(void* opaque, uint8_t* buf, int buf_size);

declaration, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, correct. We can only use the type aliases when we declare some variable that has that type.

@scotts
Copy link
Contributor

scotts commented Apr 28, 2025

Approving to unblock, but I do think we should follow the same two-constructor pattern we have in SingleStreamDecoder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants