Released on September 25th, 2024, Llama 3.2 11B Vision is torchchat's first multimodal model.
This page covers the commands you can run with Llama 3.2 11B Vision.
Note
Although the commands refer to the model as some variant of "Llama 3.2 11B Vision", the underlying checkpoint is based on the "Instruct" variant of the model.
Llama 3.2 11B Vision is available both via Hugging Face and directly from Meta.
We strongly encourage you to use the Hugging Face checkpoint, which is the default for torchchat when the commands are run with the llama3.2-11B argument. We also support supplying the checkpoint manually. To do so, replace the llama3.2-11B argument in the commands below with the following:
--checkpoint-path <file.pth> --tokenizer-path <tokenizer.model> --params-path torchchat/model_params/Llama-3.2-11B-Vision.json
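The substitution described above can be sketched as a small helper that returns either the model alias or the three manual flags. This is only an illustration: `model_args` and `generate_command` are hypothetical names, not part of torchchat, and the flag names are copied verbatim from this page.

```python
import shlex

# Path copied from the text above; everything else here is a hypothetical helper.
PARAMS_PATH = "torchchat/model_params/Llama-3.2-11B-Vision.json"


def model_args(checkpoint_path=None, tokenizer_path=None):
    """Return the model-selection arguments for a torchchat command."""
    if checkpoint_path is None and tokenizer_path is None:
        # Default: let torchchat resolve the Hugging Face checkpoint by alias.
        return ["llama3.2-11B"]
    if checkpoint_path is None or tokenizer_path is None:
        raise ValueError("checkpoint_path and tokenizer_path must be given together")
    return [
        "--checkpoint-path", checkpoint_path,
        "--tokenizer-path", tokenizer_path,
        "--params-path", PARAMS_PATH,
    ]


def generate_command(prompt, image_path=None, **model_kwargs):
    """Build the argument list for the generate command without running it."""
    cmd = ["python", "torchchat.py", "generate", *model_args(**model_kwargs)]
    cmd += ["--prompt", prompt]
    if image_path is not None:
        cmd += ["--image-prompt", image_path]
    return cmd


# Print a shell-quoted command line for a manually supplied checkpoint.
print(shlex.join(generate_command(
    "What's in this image?",
    image_path="assets/dog.jpg",
    checkpoint_path="file.pth",
    tokenizer_path="tokenizer.model",
)))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids quoting problems with prompts that contain apostrophes.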
This mode generates text output from a text prompt and an optional image prompt.
python torchchat.py generate llama3.2-11B --prompt "What's in this image?" --image-prompt assets/dog.jpg
This mode exposes a REST API for interacting with a model. The server follows the OpenAI API specification for chat completions.
To test the REST API, you'll need two terminals: one to host the server and one to send the request. In one terminal, start the server:
python3 torchchat.py server llama3.2-11B
In another terminal, query the server using curl. This query might take a few minutes to respond.
Example Query
Setting stream to "true" in the request emits the response in chunks. If stream is unset or not "true", the client waits for the full response from the server.
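To make the chunked behaviour concrete, here is a minimal sketch of how a client might reassemble a streamed reply. It assumes the server uses the OpenAI streaming convention (lines of the form `data: {json}`, text carried in `choices[0].delta.content`, and a final `data: [DONE]`). This page does not confirm that wire format, and the sample lines are invented for illustration.

```python
import json


def collect_stream(lines):
    """Concatenate the text carried by OpenAI-style streaming chunks."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Made-up chunks in the assumed format, not captured from a real server.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": "The image"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " shows a dog."}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))
```

In practice the `lines` argument would be the line iterator of an HTTP response, so the text can be displayed as each chunk arrives rather than after the full reply.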
Example Input + Output
curl http://127.0.0.1:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What'\''s in this image?"
},
{
"type": "image_url",
"image_url": ""
}
]
}
],
"max_tokens": 300
}'
{"id": "chatcmpl-cb7b39af-a22e-4f71-94a8-17753fa0d00c", "choices": [{"message": {"role": "assistant", "content": "The image depicts a simple black and white cartoon-style drawing of an animal face. It features a profile view, complete with two ears, expressive eyes, and a partial snout. The animal looks to the left, with its eye and mouth implied, suggesting that the drawn face might belong to a rabbit, dog, or pig. The graphic face has a bold black outline and a smaller, solid black nose. A small circle, forming part of the face, has a white background with two black quirkly short and long curved lines forming an outline of what was likely a mouth, complete with two teeth. The presence of the curve lines give the impression that the animal is smiling or speaking. Grey and black shadows behind the right ear and mouth suggest that this face is looking left and upwards. Given the prominent outline of the head and the outline of the nose, it appears that the depicted face is most likely from the side profile of a pig, although the ears make it seem like a dog and the shape of the nose makes it seem like a rabbit. Overall, it seems that this image, possibly part of a character illustration, is conveying a playful or expressive mood through its design and positioning."}, "finish_reason": "stop"}], "created": 1727487574, "model": "llama3.2", "system_fingerprint": "cpu_torch.float16", "object": "chat.completion"}
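The same request and response can be handled from Python. The sketch below builds a body with the fields used in the curl example and pulls the assistant text out of a response shaped like the one above. `to_data_url`, `build_request` and `extract_reply` are hypothetical helper names. Encoding the image as a base64 data URL is an assumption, since the example leaves `image_url` empty.

```python
import base64
import json


def to_data_url(image_bytes, mime_type="image/jpeg"):
    # ASSUMPTION: the server accepts a base64 data URL in "image_url".
    # The example on this page leaves that field empty, so verify before relying on it.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_request(prompt, image_url, max_tokens=300, model="llama3.2"):
    """Build a chat-completions body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": image_url},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def extract_reply(response_text):
    """Return the assistant message text from a non-streamed response."""
    body = json.loads(response_text)
    return body["choices"][0]["message"]["content"]


# Placeholder bytes stand in for a real image file read with open(path, "rb").
request_body = json.dumps(build_request("What's in this image?", to_data_url(b"\xff\xd8\xff")))
print(request_body)
```

The serialized `request_body` could then be sent with any HTTP client as a POST to `http://127.0.0.1:5000/v1/chat/completions` with a `Content-Type: application/json` header, mirroring the curl call.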
This command opens a basic browser interface for local chat by querying a local server.
First, follow the steps in the Server section above to start a local server. Then, in another terminal, launch the interface. Running the following will open a tab in your browser.
streamlit run torchchat/usages/browser.py
One of the goals of torchchat is to support various execution modes for every model. The following execution modes will be supported for Llama 3.2 11B Vision in the near future:
- torch.compile: Optimize inference via JIT Compilation
- AOTI: Enable pre-compiled and C++ inference
- ExecuTorch: On-device (Edge) inference
In addition, we are in the process of integrating with lm_evaluation_harness for multimodal model evaluation.