Add support for Hunyuan-Foley #10052
Conversation
Force-pushed 12824ea to fee1e57
Merged …t/ComfyUI into yousef-hunyuan-foley (two merge commits)
The trimming function needs an update because of the over-trimming.
Sorry for the delay on the review!
Comfy and I had trouble getting the model to run, so it would be great if you could provide links to the checkpoints/models plus the input video so we can replicate your results.
The main feedback here is:
- Some odd nodes were added: the Resample Video node returns an improper Video output, and the Video To Image node duplicates what the Get Video Components node already does (provides images, audio, and fps).
- The Encode Video node should probably be split in two; it is unclear whether only one of the optional inputs should be filled at a time before being plugged into HunyuanFoleyConditioning. It might be better for all of the encode-video logic to go into one node, but I did notice on the workflow image that two different resamples happen before being plugged into Encode Video. Is this some oddity from the source code?
- According to comfy, you only need tokenizer.json to get what you need; the repo the files were pulled from has both the original .json/.txt files and their merged versions. You can reference comfy/sd1_tokenizer to see what you'd actually need here (a minimal loading sketch follows after this list).
- See if there can be more code reuse between the ClipTextModel and ClapTextModel code.
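A minimal sketch of loading from a single tokenizer.json, assuming the HuggingFace tokenizers library and an illustrative file path (not this PR's code or ComfyUI's tokenizer plumbing):

```python
# Illustrative only: the full tokenizer can be loaded from one tokenizer.json file,
# instead of shipping vocab.json/merges.txt plus their merged copies.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("hunyuan_foley/tokenizer.json")   # assumed path
ids = tok.encode("a dog barking in the rain").ids           # token ids for the text encoder
```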
```python
frames.append(arr)
next_time += step

return io.NodeOutput(torch.stack(frames))
```
The output is declared as io.Video, but here the torch frames are not wrapped in any Video wrapper, so it breaks compatibility with the video nodes.
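A minimal sketch of what the node body could return instead, assuming the VideoFromComponents/VideoComponents helpers that ComfyUI's built-in video nodes use; the import paths and the fps variable are assumptions here, not this PR's code:

```python
from fractions import Fraction

import torch
from comfy_api.input_impl import VideoFromComponents  # assumed import path
from comfy_api.util import VideoComponents            # assumed import path

# Wrap the stacked frames so downstream nodes receive a real Video object
video = VideoFromComponents(
    VideoComponents(
        images=torch.stack(frames),   # [T, H, W, C] frames from the resample loop
        audio=None,                   # no audio track for a resampled-frames video
        frame_rate=Fraction(fps),     # target fps chosen by the node
    )
)
return io.NodeOutput(video)
```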
```python
std[std < 1.0] = 1.0
audio /= std
return ({"waveform": audio, "sample_rate": 44100}, )
sample_rate = vae.first_stage_model.decode_sample_rate or 44100
```
Comfy says there should be a different way of doing this; for maintainability he'd prefer that this be turned into a function on the VAE's first_stage_model object rather than handled here.
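A rough sketch of that refactor; the class name, attribute, and fallback value are assumptions for illustration, not the actual model code:

```python
import torch

class HunyuanFoleyAudioVAE(torch.nn.Module):  # hypothetical name for the first_stage_model
    def decode_sample_rate(self) -> int:
        # Let the model report its own output rate; fall back to 44.1 kHz
        # if the checkpoint does not carry one.
        return getattr(self, "_decode_sample_rate", 44100)

# The node side then becomes:
# sample_rate = vae.first_stage_model.decode_sample_rate()
```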
```python
embeds = torch.cat([siglip_encoding_1.cpu(), synchformer_encoding_2.cpu()], dim=0)

x = siglip_encoding_1
positive_tensor = CpuLockedTensor(torch.cat([torch.zeros_like(embeds), text_encoding_negative]))
```
The use of a custom tensor class should not be necessary; what was the reason for it? If there is some underlying issue, it should be handled in a different way.
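As a sketch, assuming the class only exists to keep the conditioning pinned on CPU, plain tensors with explicit device moves would do the same without subclassing Tensor (variable names are from the snippet above; the device handling is illustrative):

```python
import torch

# Keep the concatenated conditioning on CPU as an ordinary tensor
positive_tensor = torch.cat([torch.zeros_like(embeds), text_encoding_negative]).cpu()

# Move to the sampling device only at the point where it is actually consumed
positive_tensor = positive_tensor.to(device, non_blocking=True)
```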
Also includes support for SynchFormer and Clap encoders, as well as three new video nodes.
