


MONAI Vision Language Models

This repository provides a collection of vision language models, benchmarks, and related applications, released as part of Project MONAI (Medical Open Network for Artificial Intelligence).


VILA-M3

VILA-M3 is a vision language model designed specifically for medical applications. It addresses the challenges that general-purpose vision-language models face in the medical domain, and it integrates with existing expert segmentation and classification models.

For details, see here.

Online Demo

Please visit the VILA-M3 Demo to try out a preview version of the model. NOTE: The URL https://vila-m3-demo.monai.ngc.nvidia.com/ is temporarily unavailable. Please use the new URL https://20.163.25.224/ instead.

Local Demo

Prerequisites

Recommended: Build Docker Container

  1. To run the demo, we recommend building a Docker container with all the requirements. We use a base image with CUDA preinstalled.
    docker build --network=host --progress=plain -t monai-m3:latest -f m3/demo/Dockerfile .
  2. Run the container:
    docker run -it --rm --ipc host --gpus all --net host monai-m3:latest bash

    Note: If you want to load your own VILA checkpoint in the demo, you need to mount a folder using -v <your_ckpts_dir>:/data/checkpoints in your docker run command.

  3. Next, follow the steps to start the Gradio Demo.
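
As a sketch of the note in step 2, the run command with a checkpoint folder mounted might be assembled as follows. The variable names `CKPTS_DIR` and `DOCKER_CMD` and the default host path `$HOME/vila_ckpts` are hypothetical placeholders, not part of the repository; only the flags and the `/data/checkpoints` target come from the steps above.

```shell
# Hypothetical host folder that holds your VILA checkpoints; replace with your own path.
CKPTS_DIR="${CKPTS_DIR:-$HOME/vila_ckpts}"

# Same command as in step 2, with the -v mount from the note added.
DOCKER_CMD="docker run -it --rm --ipc host --gpus all --net host -v ${CKPTS_DIR}:/data/checkpoints monai-m3:latest bash"

# Print the command for review before running it.
echo "$DOCKER_CMD"
```

Inside the container, the mounted checkpoints are then visible under /data/checkpoints, which is the path used by the local-checkpoint example in the Gradio section.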

Alternative: Manual installation

  1. Linux Operating System

  2. CUDA Toolkit 12.2 (with nvcc) for VILA.

    To verify CUDA installation, run:

    nvcc --version

    If CUDA is not installed, use one of the following methods:

    • Recommended: Use the Docker image nvidia/cuda:12.2.2-devel-ubuntu22.04:
      docker run -it --rm --ipc host --gpus all --net host nvidia/cuda:12.2.2-devel-ubuntu22.04 bash
    • Manual installation (not recommended): Download the appropriate package from the official NVIDIA page.
  3. Python 3.10, Git, Wget, and Unzip:

    To install these, run:

    sudo apt-get update
    sudo apt-get install -y wget python3.10 python3.10-venv python3.10-dev git unzip

    NOTE: The commands are tailored for the Docker image nvidia/cuda:12.2.2-devel-ubuntu22.04. If using a different setup, adjust the commands accordingly.

  4. GPU Memory: Ensure that the GPU has sufficient memory to run the models:

    • VILA-M3: 8B: ~18GB, 13B: ~30GB
    • CXR: This expert dynamically loads various TorchXRayVision models and performs ensemble predictions. The memory requirement is roughly 1.5GB in total.
    • VISTA3D: This expert model dynamically loads the VISTA3D model to segment a 3D-CT volume. The memory requirement is roughly 12GB, and peak memory usage can be higher, depending on the input size of the 3D volume.
    • BRATS: (TBD)
  5. Setup Environment: Clone the repository, set up the environment, and download the experts' checkpoints:

    git clone https://github.com/Project-MONAI/VLM --recursive
    cd VLM
    python3.10 -m venv .venv
    source .venv/bin/activate
    make demo_m3
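
The GPU memory figures in step 4 can be checked against the local hardware before launching the demo. The following is a minimal sketch, not part of the repository: the helper names `required_mib` and `has_enough_memory` are hypothetical, the "~18GB" and "~30GB" figures are treated as GiB (an assumption), and only the first GPU reported by `nvidia-smi` is inspected.

```shell
# Approximate VILA-M3 memory needs in MiB, from the figures in step 4
# (8B: ~18GB, 13B: ~30GB), treating GB as GiB. This is an assumption.
required_mib() {
  case "$1" in
    8B)  echo 18432 ;;   # 18 * 1024
    13B) echo 30720 ;;   # 30 * 1024
    *)   echo "unknown model size: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the given total memory (MiB) covers the requirement for the model size.
has_enough_memory() {
  need=$(required_mib "$1") || return 2
  [ "$2" -ge "$need" ]
}

# Query the total memory of the first GPU, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
  if has_enough_memory 8B "$total"; then
    echo "GPU 0 (${total} MiB) should fit VILA-M3 8B"
  else
    echo "GPU 0 (${total} MiB) is likely too small for VILA-M3 8B"
  fi
fi
```

The expert models (CXR, VISTA3D) need additional memory on top of these figures, so treat a passing check as a lower bound.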

Running the Gradio Demo

  1. Navigate to the demo directory:

    cd m3/demo
  2. Start the Gradio demo:

    This will automatically download the default VILA-M3 checkpoint from Hugging Face.

    python gradio_m3.py
  3. Alternative: Start the Gradio demo with a local checkpoint, e.g.:

    python gradio_m3.py \
    --source local \
    --modelpath /data/checkpoints/<8B-checkpoint-name> \
    --convmode llama_3

For details, see the available command-line arguments.

Adding your own expert model

  • This is still a work in progress. Please refer to the README for more details.

Contributing

To lint the code, please install these packages:

pip install -r requirements-ci.txt

Then run the following commands:

isort --check-only --diff .  # using the configuration in pyproject.toml
black . --check  # using the configuration in pyproject.toml
ruff check .  # using the configuration in ruff.toml
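
To run all three checks in one go and still see every report when an early check fails, the commands above can be wrapped in a small helper. This is only a sketch; `run_checks` is a hypothetical function, not something the repository provides.

```shell
# Run each command string in turn; keep going after a failure and
# return non-zero if any of them failed.
run_checks() {
  status=0
  for cmd in "$@"; do
    echo "==> $cmd"
    sh -c "$cmd" || status=1
  done
  return "$status"
}

# The three lint commands listed above, run from the repository root.
run_checks \
  "isort --check-only --diff ." \
  "black . --check" \
  "ruff check ." || echo "One or more lint checks failed"
```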

To auto-format the code, run the following command:

isort . && black . && ruff format .