---
license: mit
datasets:
- allenai/tulu-v2-sft-mixture
language:
- en
base_model:
- google/gemma-2-2b-it
framework:
- llamafactory
---

# GENOME: LoRA Expert Models

This repository contains 10 expert models fine-tuned via low-rank adaptation (LoRA) on 10 distinct domains extracted from the [Tulu-v2-SFT-mixture](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) dataset. Our base model is **google/gemma-2-2b-it**, and all expert models were trained with the llama-factory framework on an 8×A100-80GB GPU setup. Our goal is to contribute to the open-source community by sharing these domain-specific experts.

## Experimental Setup

- **Base Model:** [google/gemma-2-2b-it](https://huggingface.co/google/gemma-2-2b-it)
- **Dataset:** 10 subsets from Tulu-v2-SFT-mixture
- **Fine-tuning Framework:** llama-factory
- **Adaptation Technique:** LoRA
- **Training Hardware:** 8×A100-80GB GPUs
- **Note:** Deploying a 2B model requires only about 12 GB of VRAM. For optimal performance, we recommend an RTX 3090/4090 (24 GB) or a comparable GPU.

A visualization of the performance ranks across the evaluation datasets shows that each expert model excels in its respective domain. vLLM supports dynamic LoRA switching, so different expert models can be swapped in with minimal computational overhead, making deployment cost-effective.

## Usage Instructions

Below is an example deployment script that uses vLLM to serve the base model together with the LoRA weights on a single GPU (adapted from the original multi-GPU script). Adjust the parameters (such as the model path and log directory) to suit your environment.

### Step 1. Deploying the Base Model on a Single GPU (or more)

Save the following script as `deploy_single_gpu.sh` and modify the placeholders accordingly:

```bash
#!/bin/bash
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

# Specify your model path here (this can be a local path or a Hugging Face Hub path)
MODEL="input your model path here"
# Set the maximum number of LoRAs
MAX_LORAS=20
# Log directory for vLLM logs
ROOT="input your log dir here"
# Maximum LoRA rank
MAX_LORA_RANK=16
# Specify the port for the API server (single-GPU deployment requires only one port)
PORT=9112

echo "Deploying model $MODEL with $MAX_LORAS LoRAs on a single GPU"
echo "Starting API server on port $PORT..."

# Create the log directory if it doesn't exist
mkdir -p vllm_logs/$ROOT

COMMON_ARGS="--model $MODEL \
  --trust-remote-code \
  --enable-lora \
  --seed 42 \
  --max-lora-rank $MAX_LORA_RANK \
  --gpu-memory-utilization 0.95 \
  --max-loras $MAX_LORAS \
  --max-cpu-loras $MAX_LORAS \
  --disable-sliding-window \
  --max-model-len 8192"

# Single-GPU deployment: use only GPU 0
CUDA_VISIBLE_DEVICES=0 nohup python -m vllm.entrypoints.openai.api_server \
  $COMMON_ARGS \
  --port $PORT > vllm_logs/$ROOT/port_1.log 2>&1 &
```

### Step 2. Loading and Unloading LoRA Adapters Dynamically

vLLM supports online LoRA switching, allowing different expert models to be loaded and unloaded with minimal computational overhead.

1. Download the LoRA weights and store them under `/lora/*`.
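   If the adapter weights are hosted on the Hugging Face Hub, one way to fetch them is with `huggingface_hub.snapshot_download`; the repository id and target directory below are placeholders rather than the actual adapter repo names:

   ```python
   from huggingface_hub import snapshot_download

   # Placeholder repo id and local directory -- substitute the actual expert adapter repo.
   snapshot_download(
       repo_id="your-org/genome-lora-expert",
       local_dir="/lora/genome-lora-expert",
   )
   ```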
2. Use the following Python code to load and unload LoRA adapters dynamically:

```python
import requests
import time

from loguru import logger


def online_load_lora(base_url: str, lora_name: str, lora_path: str):
    """Register a LoRA adapter with a running vLLM server, retrying until it succeeds."""
    counter = 1
    while True:
        try:
            response = requests.post(
                f"{base_url}/load_lora_adapter",
                json={
                    "lora_name": lora_name,
                    "lora_path": lora_path
                }
            )
            time.sleep(3)
            assert response.status_code == 200, f"Failed to load LoRA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Load LoRA error: {e}, retrying in {min(counter, 10)} seconds ...")
            time.sleep(min(counter, 10))
            counter += 1
            continue


def online_unload_lora(base_url: str, lora_name: str):
    """Remove a previously loaded LoRA adapter from the server, retrying until it succeeds."""
    while True:
        try:
            response = requests.post(
                f"{base_url}/unload_lora_adapter",
                json={
                    "lora_name": lora_name
                }
            )
            assert response.status_code == 200, f"Failed to unload LoRA: {response.text}"
            break
        except Exception as e:
            logger.warning(f"Unload LoRA error: {e}, retrying ...")
            time.sleep(1)
            continue
```

### Step 3. Using the OpenAI SDK to Access the Deployed LoRA Models

Once a LoRA adapter is loaded, you can interact with it through the OpenAI SDK. Below is a mock example; a combined load-query-unload sketch appears at the end of this README:

```python
import openai


def query_lora_model(base_url: str, lora_name: str, prompt: str):
    # vLLM's OpenAI-compatible server does not validate the API key; any placeholder works.
    client = openai.OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.completions.create(
        model=lora_name,
        prompt=prompt,
        max_tokens=100
    )
    return response


# Example usage
base_url = "http://localhost:9112/v1"
lora_name = "example_lora"
prompt = "Tell me about the impact of AI in healthcare."
response = query_lora_model(base_url, lora_name, prompt)
print(response.choices[0].text)
```

## Related Projects

This repository is associated with the [GENOME project](https://github.com/ZhangYiqun018/GENOME). We welcome community feedback and contributions to help further open-source AI development.
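Putting the three steps together, the sketch below loads one expert adapter, queries it, and unloads it again. It assumes the Step 1 server is running on port 9112, that the helper functions from Steps 2 and 3 are defined in the same session, and that the adapter name and local path (placeholders here) point to one of the downloaded experts:

```python
# Minimal end-to-end sketch: load an expert adapter, query it, then unload it.
base_url = "http://localhost:9112/v1"
lora_name = "example_expert"        # placeholder adapter name
lora_path = "/lora/example_expert"  # placeholder path to the downloaded adapter weights

online_load_lora(base_url, lora_name, lora_path)
response = query_lora_model(base_url, lora_name, "Summarize the role of LoRA experts in one sentence.")
print(response.choices[0].text)
online_unload_lora(base_url, lora_name)
```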