Fix typos and inference script
#3
by alvarobartt (HF Staff) - opened

README.md CHANGED
@@ -102,7 +102,7 @@ Magma is a multimodal agentic AI model that can generate text based on the input
 
 ### Highlights
 * **Digital and Physical Worlds:** Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
-* **Versatile Capabilities:** Magma as a single model not only
+* **Versatile Capabilities:** Magma as a single model not only possesses generic image and videos understanding ability, but also generate goal-driven visual plans and actions, making it versatile for different agentic tasks!
 * **State-of-the-art Performance:** Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, as well as generic image and video understanding, in particular the spatial understanding and reasoning!
 * **Scalable Pretraining Strategy:** Magma is designed to be **learned scalably from unlabeled videos** in the wild in addition to the existing agentic data, making it strong generalization ability and suitable for real-world applications!
 
@@ -125,15 +125,21 @@ The model is developed by Microsoft and is funded by Microsoft Research. The mod
 
 <!-- {{ get_started_code | default("[More Information Needed]", true)}} -->
 
-
+To get started with the model, you first need to make sure that `transformers` and `torch` are installed, as well as installing the following dependencies:
+
+```bash
+pip install torchvision Pillow open_clip_torch
+```
+
+Then you can run the following code:
 
 ```python
 import torch
 from PIL import Image
+from io import BytesIO
 import requests
 
-from transformers import AutoModelForCausalLM
-from transformers import AutoProcessor
+from transformers import AutoModelForCausalLM, AutoProcessor
 
 # Load the model and processor
 model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
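Reviewer note on the new install step: the snippet itself only uses Pillow directly (`from PIL import Image`); `torchvision` and `open_clip_torch` are presumably pulled in by the model's custom code loaded through `trust_remote_code=True`. A quick environment smoke test, as a sketch:

```python
# Sketch: confirm the extra dependencies resolve before running the
# full script. The import names differ from the pip package names for
# two of them: Pillow -> PIL, open_clip_torch -> open_clip.
import PIL
import open_clip
import torchvision

print("torchvision:", torchvision.__version__)
print("Pillow:", PIL.__version__)
print("open_clip_torch:", open_clip.__version__)
```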
@@ -141,11 +147,12 @@ processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_cod
 model.to("cuda")
 
 # Inference
-url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
-image = Image.open(requests.get(url, stream=True).
+url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
+image = Image.open(BytesIO(requests.get(url, stream=True).content))
+image = image.convert("RGB")
 
 convs = [
-    {"role": "system", "content": "You are agent that can see, talk and act."},
+    {"role": "system", "content": "You are agent that can see, talk and act."},
     {"role": "user", "content": "<image_start><image><image_end>\nWhat is in this image?"},
 ]
 prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
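The image-loading change is the core of the inference-script fix: the old line was cut off mid-expression, and a streaming response is not the seekable file-like object that `PIL.Image.open` expects. A minimal self-contained sketch of just this step, using the same URL as the snippet:

```python
from io import BytesIO

import requests
from PIL import Image

url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"

# .content buffers the full response body; BytesIO wraps those bytes in
# the seekable file-like object that Image.open expects.
image = Image.open(BytesIO(requests.get(url, stream=True).content))

# PIL mode strings are case-sensitive: "RGB", not "rgb".
image = image.convert("RGB")
print(image.size, image.mode)
```

Since `.content` reads the whole body regardless, `stream=True` is redundant in this form and is kept only to mirror the diff.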
@@ -159,7 +166,6 @@ with torch.inference_mode():
 
 generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
 response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
-
 print(response)
 ```
 
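For trying the fix end to end: the diff elides the unchanged middle of the script, where the model inputs are built and `model.generate` is called. Below is a sketch of the assembled script under that caveat; the `processor(images=..., texts=...)` call and the `max_new_tokens` value stand in for the elided lines and are assumptions, not part of this diff:

```python
import torch
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model and processor (Magma ships custom modeling code).
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")

# Fetch and normalize the input image (the fixed lines from this PR).
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(BytesIO(requests.get(url, stream=True).content)).convert("RGB")

convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is in this image?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)

# Elided by the diff: how inputs are built. The keyword arguments here
# are an assumption about the processor's signature.
inputs = processor(images=[image], texts=prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128)

# Drop the prompt tokens so only the newly generated text is decoded.
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1]:]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
```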