
Conversation

Contributor

@yousef-rafat yousef-rafat commented Sep 27, 2025

Also includes support for SynchFormer and Clap encoders, as well as three new video nodes.
[Workflow screenshot: Screenshot 2025-10-09 112624]

@yousef-rafat yousef-rafat marked this pull request as draft September 28, 2025 16:37
@Kosinkadink Kosinkadink added the Core Core team dependency label Sep 30, 2025
@yousef-rafat yousef-rafat marked this pull request as ready for review October 13, 2025 18:11
Collaborator

@Kosinkadink Kosinkadink left a comment


Sorry for the delay on the review!

Comfy and I had trouble getting the model to run, so it would be great if you could provide links to the checkpoints/models plus the input video so we can replicate your results.

The main feedback here is:

  1. There are some odd nodes that were added: the Resample Video node returns an improper Video output, and the Video To Image node duplicates what the Get Video Components node already does (provides images, audio, and fps).
  2. The Encode Video node should probably be split in two: it is unclear whether only one of the optional inputs should be filled at a time to then be plugged into HunyuanFoleyConditioning. It might be better for all of the encode-video steps to go into one node, but I did notice on the workflow image that two different resamples happen before being plugged into Encode Video. Is this some oddity from the source code?
  3. According to Comfy, tokenizer.json alone gives you everything you need; the repo the files are pulled from has both the original .json/.txt files and their merged versions. You can reference comfy/sd1_tokenizer to see what's actually needed here (see the sketch after this list).
  4. See if there can be more code reuse between ClipTextModel and ClapTextModel stuff, if possible.
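As a quick, hedged illustration of the tokenizer point in item 3: a merged tokenizer.json can be loaded on its own with the Hugging Face PreTrainedTokenizerFast class, without the separate vocab/merges files. The path below is a placeholder, and this is only a sketch of the idea, not code from the PR.

from transformers import PreTrainedTokenizerFast

# Load the merged tokenizer file directly; no vocab.json/merges.txt required.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="path/to/tokenizer.json")
tokens = tokenizer("a dog barking in the rain", return_tensors="pt")
print(tokens["input_ids"].shape)  # (1, sequence_length)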

frames.append(arr)
next_time += step

return io.NodeOutput(torch.stack(frames))
Collaborator


The output is declared as io.Video, but here the raw torch frames are returned without any Video wrapper, so it breaks compatibility with video inputs downstream.
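A minimal sketch of one way to satisfy the io.Video contract, assuming ComfyUI's VideoFromComponents/VideoComponents helpers (exact import paths can differ between comfy_api versions) and reusing frames, fps, torch, and io from the surrounding node code:

from fractions import Fraction
from comfy_api.input_impl import VideoFromComponents
from comfy_api.util import VideoComponents

# Wrap the stacked frames in a Video object instead of returning a raw tensor,
# so the declared io.Video output matches what downstream video nodes expect.
images = torch.stack(frames)  # (T, H, W, C) image batch
video = VideoFromComponents(VideoComponents(images=images, frame_rate=Fraction(fps)))
return io.NodeOutput(video)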

std[std < 1.0] = 1.0
audio /= std
return ({"waveform": audio, "sample_rate": 44100}, )
sample_rate = vae.first_stage_model.decode_sample_rate or 44100
Collaborator


Comfy says there should be a different way of doing this; for maintainability, he'd prefer this be turned into a function on the VAE's first_stage_model object rather than handled at the call site like this.
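A hedged sketch of the refactor being suggested, with illustrative names (decode_sample_rate, DEFAULT_SAMPLE_RATE, FoleyAudioVAE) that are not taken from the PR:

import torch

DEFAULT_SAMPLE_RATE = 44100

class FoleyAudioVAE(torch.nn.Module):  # stand-in for the actual first_stage_model class
    def decode_sample_rate(self) -> int:
        # Report the model's native output sample rate, falling back to 44.1 kHz,
        # so callers never have to reach into model attributes themselves.
        return getattr(self, "sample_rate", None) or DEFAULT_SAMPLE_RATE

# The node then asks the model rather than hardcoding the value:
# sample_rate = vae.first_stage_model.decode_sample_rate()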

embeds = torch.cat([siglip_encoding_1.cpu(), synchformer_encoding_2.cpu()], dim = 0)

x = siglip_encoding_1
positive_tensor = CpuLockedTensor(torch.cat([torch.zeros_like(embeds), text_encoding_negative]))
Collaborator


The use of a custom tensor class should not be necessary. What was the reason this was done? If there is some issue, it should be handled in a different way.
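As a hedged illustration of the alternative, the same conditioning can be built from plain torch tensors, with device placement handled explicitly where the tensors are produced or consumed rather than through a tensor subclass (variable names mirror the snippet above; the surrounding node code is assumed):

import torch

# Plain tensors, explicitly moved to CPU while the conditioning is assembled.
embeds = torch.cat([siglip_encoding_1.cpu(), synchformer_encoding_2.cpu()], dim=0)
positive_tensor = torch.cat([torch.zeros_like(embeds), text_encoding_negative.cpu()], dim=0)
# Downstream code can call .to(device) at the point of use, so no CpuLockedTensor
# subclass is needed to keep the tensors off the GPU.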
