Using Vision Language Models
Aphrodite provides experimental support for Vision Language Models (VLMs). See the list of supported VLMs here. This document shows you how to run and serve these models using Aphrodite.
We are actively working on improving the VLM support in Aphrodite. Expect breaking changes in the future without any deprecation warnings.
Currently, the support for VLMs has the following limitation:
- Only single image input is supported per text prompt.
We are continuously improving user & developer experience. If you have any feedback or feature requests, please open an issue.
Offline Batched Inference
To initialize a VLM, pass the model to the LLM class to instantiate the engine:
from aphrodite import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
To pass an image to the model, note the following fields in aphrodite.inputs.PromptInputs:
- prompt: The prompt should follow the format that is documented on Hugging Face.
- multi_modal_data: This is a dictionary that follows the schema defined in aphrodite.multimodal.MultiModalDataDict.
import PIL.Image

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Load the image using PIL.Image
image = PIL.Image.open(...)

# Single prompt inference
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)

outputs = llm.generate(
    [
        {
            "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_1},
        },
        {
            "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_2},
        },
    ]
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
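You can also control decoding when generating from image prompts. The snippet below is a minimal sketch that assumes Aphrodite exposes SamplingParams from the top-level package and that llm.generate accepts a sampling_params argument in the same way as for text-only prompts; adjust to match your installed version.

import PIL.Image

from aphrodite import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Assumption: sampling parameters are passed the same way as for text-only prompts
sampling_params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(
    {
        "prompt": "USER: <image>\nDescribe this image in one sentence.\nASSISTANT:",
        "multi_modal_data": {"image": PIL.Image.open(...)},
    },
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)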
Online OpenAI Vision API Inference
You can serve vision language models with Aphrodite’s OpenAI server.
Below is an example of how to launch the same llava-hf/llava-1.5-7b-hf model with the Aphrodite API server:
aphrodite run llava-hf/llava-1.5-7b-hf --chat-template llava.jinja
To send a request to the server, you can use the following code:
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:2242/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the image token `<image>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            },
        ],
    }],
)
print("Chat response:", chat_response)
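If you only need the generated text rather than the full response object, read it from the first choice of the chat completion:

print("Generated text:", chat_response.choices[0].message.content)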
Here’s a curl example:
curl -X POST "http://localhost:2242/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d '{
        "model": "llava-hf/llava-1.5-7b-hf",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                        }
                    }
                ]
            }
        ]
    }'
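If your image is a local file rather than a public URL, a common pattern with OpenAI-compatible Vision APIs is to embed it as a base64-encoded data URL. The sketch below assumes the Aphrodite server accepts data: URLs in the image_url field, which this document does not confirm; the file name example.jpg is hypothetical.

import base64

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:2242/v1")

# Read a local image and encode it as a base64 data URL (example.jpg is a placeholder).
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                # Assumption: the server accepts base64 data URLs like the OpenAI Vision API.
                "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
            },
        ],
    }],
)
print(chat_response.choices[0].message.content)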