@zyd-ustc

What does this PR do?

Fixes # (issue)
1. Modify convert_to_rgb, do_resize, and do_normalize to handle video frames.
Adds # (feature)
1. Add a fast test for sam2_video.
2. Add an import error message for cv2.
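
For readers unfamiliar with this kind of check, here is a rough sketch of how an optional-backend guard for cv2 could look. It is illustrative only: `CV2_IMPORT_ERROR`, `BACKENDS_MAPPING`, and `require_backend` are hypothetical names chosen for this example, and the message text is an assumption rather than the wording used in import_utils.py.

```python
import importlib.util

# Hypothetical message template; the real text in import_utils.py may differ.
CV2_IMPORT_ERROR = (
    "{0} requires the OpenCV library, but it was not found in your environment. "
    "You can install it with `pip install opencv-python`."
)

# Maps a backend name to (availability check, error message template).
BACKENDS_MAPPING = {
    "cv2": (lambda: importlib.util.find_spec("cv2") is not None, CV2_IMPORT_ERROR),
}


def require_backend(caller: str, backend: str, mapping=BACKENDS_MAPPING) -> None:
    """Raise a descriptive ImportError if `backend` is not importable."""
    is_available, message = mapping[backend]
    if not is_available():
        raise ImportError(message.format(caller))
```

Calling `require_backend("SomeVideoProcessor", "cv2")` at the top of a function that needs OpenCV would then fail early with an actionable message instead of a bare `ModuleNotFoundError`.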

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the
    documentation guidelines
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type/MS version) and performance in the doc? (better record it for data loading, model inference, or training tasks)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@xxx

zyd-ustc requested a review from vigo999 as a code owner on November 24, 2025, 10:13
@gemini-code-assist
Contributor

Summary of Changes

Hello @zyd-ustc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enabling and testing video processing capabilities for the Sam2Video model within the MindSpore framework. It involves significant updates to image processing utilities to correctly handle video frame batches, refining color conversion logic, and establishing a robust testing framework to ensure functional parity and correctness for the Sam2Video model's video inference session.

Highlights

  • Video Frame Processing Enhancements: The resize and normalize functions in image_processing_utils_fast.py have been updated to correctly handle batch dimensions for video frames, ensuring proper processing of multi-frame inputs by iterating and applying operations per frame.
  • Refined Video Color Conversion: The convert_to_rgb function in video_processing_utils.py has been refactored to explicitly manage 1-channel (grayscale), 3-channel (RGB), and 4-channel (RGBA with alpha blending) video inputs, improving robustness and removing reliance on a specific transform.
  • OpenCV Dependency Handling: An import error message and backend mapping for OpenCV (cv2) have been added to import_utils.py, which will improve error handling and user guidance when the library is missing.
  • SAM2 Video Model Test Suite: A new, comprehensive test suite has been introduced for the Sam2Video model, including dedicated testers for the prompt encoder, mask decoder, and the full model. This suite features parameterized tests that compare MindSpore and PyTorch outputs for functional consistency and correctness.
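
The channel handling described for convert_to_rgb can be pictured with the NumPy sketch below. This is not the code from video_processing_utils.py: the function name `to_rgb`, the channels-first float layout with values in [0, 1], and blending RGBA frames onto a white background are all assumptions made for illustration.

```python
import numpy as np


def to_rgb(video: np.ndarray) -> np.ndarray:
    """Convert frames shaped (..., C, H, W) with C in {1, 3, 4} to 3 channels.

    Assumes float values in [0, 1]; RGBA is blended onto white (an assumption).
    """
    channels = video.shape[-3]
    if channels == 3:
        return video  # already RGB, nothing to do
    if channels == 1:
        # Grayscale: repeat the single channel three times.
        return np.repeat(video, 3, axis=-3)
    if channels == 4:
        rgb, alpha = video[..., :3, :, :], video[..., 3:4, :, :]
        # Alpha blending: out = alpha * foreground + (1 - alpha) * background.
        return alpha * rgb + (1.0 - alpha) * 1.0
    raise ValueError(f"Unsupported channel count: {channels}")


frames = np.zeros((2, 8, 4, 16, 16), dtype=np.float32)  # (batch, frames, RGBA, H, W)
print(to_rgb(frames).shape)
```

Indexing the channel axis as `-3` keeps the function indifferent to how many leading batch or frame dimensions the input carries.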

gemini-code-assist bot left a comment

Code Review

This pull request primarily focuses on adapting image processing utilities for video frames and adding a test for Sam2VideoModel. The modifications in convert_to_rgb and the import utilities are well-implemented. However, the changes to the resize and normalize functions in image_processing_utils_fast.py introduce performance issues by using loops for batch processing. I have provided suggestions to refactor these functions using batched tensor operations, which will significantly improve efficiency. Additionally, I've noted a minor maintainability improvement for the new test file.

Comment on lines 344 to 364
+ original_shape = image.shape
+ batch_dims = original_shape[:-3]
+ num_batch = 1
+ for dim in batch_dims:
+     num_batch *= dim
+ image_flat = image.view(num_batch, *original_shape[-3:])
+
  mean = [float(mean[0]), float(mean[1]), float(mean[2])]
  std = [float(std[0]), float(std[1]), float(std[2])]
  normalize = vision.Normalize(
      mean=mean,
      std=std,
  )
- images = []
- for img in image:
-     images.append(normalize(img.permute(1, 2, 0).asnumpy()))
- return ms.tensor(images).permute(0, 3, 1, 2)
+ normalized_images = []
+ for img in image_flat:
+     normalized_img = normalize(img.permute(1, 2, 0).asnumpy())
+     normalized_images.append(ms.tensor(normalized_img).permute(2, 0, 1))
+
+ normalized_flat = mint.stack(normalized_images, dim=0)
+ _, new_C, new_H, new_W = normalized_flat.shape
+ return normalized_flat.view(*batch_dims, new_C, new_H, new_W)

high

The normalize function iterates over each image in the batch, converting it to a NumPy array for processing. This is inefficient. Normalization is a simple arithmetic operation (image - mean) / std that can be performed directly on the entire batch as a tensor operation using broadcasting. This would be significantly faster and avoid unnecessary data conversions.

Suggested change
- original_shape = image.shape
- batch_dims = original_shape[:-3]
- num_batch = 1
- for dim in batch_dims:
-     num_batch *= dim
- image_flat = image.view(num_batch, *original_shape[-3:])
- mean = [float(mean[0]), float(mean[1]), float(mean[2])]
- std = [float(std[0]), float(std[1]), float(std[2])]
- normalize = vision.Normalize(
-     mean=mean,
-     std=std,
- )
- images = []
- for img in image:
-     images.append(normalize(img.permute(1, 2, 0).asnumpy()))
- return ms.tensor(images).permute(0, 3, 1, 2)
- normalized_images = []
- for img in image_flat:
-     normalized_img = normalize(img.permute(1, 2, 0).asnumpy())
-     normalized_images.append(ms.tensor(normalized_img).permute(2, 0, 1))
- normalized_flat = mint.stack(normalized_images, dim=0)
- _, new_C, new_H, new_W = normalized_flat.shape
- return normalized_flat.view(*batch_dims, new_C, new_H, new_W)
+ mean = ms.tensor(mean, dtype=image.dtype)
+ std = ms.tensor(std, dtype=image.dtype)
+ # Reshape mean and std to broadcast across batch and spatial dimensions
+ view_shape = (1,) * (image.ndim - 3) + (image.shape[-3], 1, 1)
+ mean = mean.view(view_shape)
+ std = std.view(view_shape)
+ return (image - mean) / std
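
As a sanity check on the broadcast form, the NumPy sketch below compares a per-frame loop against a single broadcast expression on a (batch, frames, C, H, W) array. NumPy stands in for MindSpore here, plain arithmetic replaces vision.Normalize in the loop version, and the function names, shapes, and mean/std values are illustrative assumptions rather than anything taken from the PR.

```python
import numpy as np


def normalize_loop(video, mean, std):
    """Per-frame normalization: flatten leading dims, loop, restore the shape."""
    flat = video.reshape(-1, *video.shape[-3:])
    m = np.asarray(mean, dtype=video.dtype).reshape(-1, 1, 1)
    s = np.asarray(std, dtype=video.dtype).reshape(-1, 1, 1)
    out = np.stack([(frame - m) / s for frame in flat], axis=0)
    return out.reshape(video.shape)


def normalize_broadcast(video, mean, std):
    """Single broadcast expression over all leading dimensions."""
    # (1, ..., 1, C, 1, 1): one size-1 axis per leading dim, then channel, H, W.
    view_shape = (1,) * (video.ndim - 3) + (video.shape[-3], 1, 1)
    m = np.asarray(mean, dtype=video.dtype).reshape(view_shape)
    s = np.asarray(std, dtype=video.dtype).reshape(view_shape)
    return (video - m) / s


rng = np.random.default_rng(0)
video = rng.random((2, 4, 3, 8, 8), dtype=np.float32)
mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]  # example values only
assert np.allclose(normalize_loop(video, mean, std),
                   normalize_broadcast(video, mean, std))
```

Both versions agree numerically; the broadcast form simply avoids the Python-level loop and the reshape round trip.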

zyd-ustc and others added 2 commits November 25, 2025 17:52
…video.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Updated the normalization function to simplify the process by removing unnecessary flattening and looping. Enhanced the documentation to clarify input shape and output characteristics.
