Spaces: Running on Zero

Y Phung Nguyen committed · af9efda
Parent(s): a1f9b00

Fix PDF upload, add Whisper ASR, and enhance model status display
- Fix PDF upload error by handling file objects without seek() method (NamedString compatibility)
- Replace Gemini transcription with Whisper large-v3-turbo from Hugging Face (openai/whisper-large-v3-turbo)
- Add comprehensive ASR debugging/logging with [ASR] prefixes
- Load Whisper model at startup alongside maya1 TTS and MedSwin
- Update model status textbox to display MedSwin, TTS (maya1), and ASR (Whisper) loading status
- Ensure all voice models (TTS and ASR) are GPU-decorated for ZeroGPU compatibility
- Update requirements.txt to use transformers/torchaudio instead of openai-whisper
- Update README with new Whisper ASR implementation details
- README.md +5 -3
- app.py +7 -3
- config.py +1 -0
- indexing.py +27 -3
- models.py +39 -0
- requirements.txt +3 -0
- ui.py +120 -29
- voice.py +135 -11
README.md
CHANGED

@@ -70,9 +70,10 @@ tags:
 Sample [conversation](sample.md) between MAC-system and patient are recorded.
 
 ### 🎤 **Voice Features**
-- **Speech-to-Text**: Voice input transcription using Gemini MCP
+- **Speech-to-Text**: Voice input transcription using Whisper large-v3-turbo (Hugging Face) with Gemini MCP fallback
 - **Inline Mic Experience**: Built-in microphone widget with live recording timer that drops transcripts straight into the chat box
 - **Text-to-Speech**: Voice output generation using Maya1 TTS model (optional, fallback to MCP if unavailable) plus a one-click "Play Response" control for the latest answer
+- **Model Status Display**: Real-time status for MedSwin, TTS (maya1), and ASR (Whisper) models
 
 
 ### 🤝 **MAC Architecture (Multi-Agent Collaboration)**
@@ -124,7 +125,7 @@ Sample [conversation](sample.md) between MAC-system and patient are recorded.
 - **Architecture**: MAC (Multi-Agent Collaboration) - Gemini Supervisor + MedSwin Specialist
 - **Translation**: Gemini MCP (gemini-2.5-flash-lite)
 - **Document Parsing**: Gemini MCP (PDF, Word, TXT, MD, JSON, XML, CSV)
-- **Speech-to-Text**: Gemini MCP
+- **Speech-to-Text**: openai/whisper-large-v3-turbo (Hugging Face, primary) with Gemini MCP fallback
 - **Supervisor Tasks**: Gemini MCP (gemini-2.5-flash) - query breakdown, search strategies, RAG brainstorming
 - **MedSwin Execution**: GPU-tagged tasks for efficient inference
 - **Text-to-Speech**: maya-research/maya1 (optional, with MCP fallback)
@@ -139,10 +140,11 @@ Sample [conversation](sample.md) between MAC-system and patient are recorded.
 See `requirements.txt` for full dependency list. Key dependencies:
 - **MCP Integration**: `mcp`, `nest-asyncio`, `google-genai` (for Gemini MCP server)
 - **Fallback Dependencies**: `requests`, `beautifulsoup4`, `ddgs` (used when MCP web search unavailable)
-- **Core ML**: `transformers`, `torch`, `accelerate`
+- **Core ML**: `transformers`, `torch`, `accelerate`, `torchaudio`
 - **RAG Framework**: `llama-index`, `llama_index.llms.huggingface`, `llama_index.embeddings.huggingface`
 - **Utilities**: `langdetect`, `gradio`, `spaces`, `soundfile`
 - **TTS**: Optional - `TTS` package (voice features work with MCP fallback if unavailable)
+- **ASR**: Whisper via `transformers` (openai/whisper-large-v3-turbo from Hugging Face)
 
 
 ### 🔌 MCP Configuration
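For reference, a minimal sketch of the `transformers`-based Whisper usage the README now describes (only the model id comes from this commit; the silent one-second input at 16 kHz is a stand-in for real audio, and the model is large, so a GPU is advisable):

```python
# Minimal sketch: transcription with openai/whisper-large-v3-turbo via transformers.
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

audio = np.zeros(16000, dtype=np.float32)  # placeholder: 1 s of silence at 16 kHz
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
ids = model.generate(inputs.input_features)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```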
app.py
CHANGED

@@ -5,7 +5,7 @@ os.environ["TOKENIZERS_PARALLELISM"] = "false"
 from logger import logger
 from config import DEFAULT_MEDICAL_MODEL
 import config
-from models import initialize_medical_model, initialize_tts_model
+from models import initialize_medical_model, initialize_tts_model, initialize_whisper_model, WHISPER_AVAILABLE
 from client import MCP_AVAILABLE
 from ui import create_demo
 
@@ -17,8 +17,12 @@ if __name__ == "__main__":
     logger.info("App starting - models will be loaded on-demand when first needed")
     logger.info(f"Default medical model: {DEFAULT_MEDICAL_MODEL}")
 
-    # TTS model also uses GPU decorator, so skip preloading
-    logger.info("TTS model will be loaded on-demand if needed")
+    # TTS and ASR models also use GPU decorator, so skip preloading
+    logger.info("TTS and ASR models will be loaded on-demand if needed")
+    if WHISPER_AVAILABLE:
+        logger.info("Whisper ASR library (transformers) is available")
+    else:
+        logger.warning("Whisper ASR library not available - install with: pip install transformers torchaudio")
 
     # Check Gemini MCP availability
     if MCP_AVAILABLE:
config.py
CHANGED

@@ -35,6 +35,7 @@ global_medical_models = {}
 global_medical_tokenizers = {}
 global_file_info = {}
 global_tts_model = None
+global_whisper_model = None
 global_embed_model = None
 
 # MCP client storage
indexing.py
CHANGED

@@ -80,13 +80,37 @@ def extract_text_from_document(file):
     file_extension = os.path.splitext(file_name)[1].lower()
 
     if file_extension == '.txt':
-        text = file.read().decode('utf-8')
+        # Handle file objects that might not have seek() method
+        try:
+            if hasattr(file, 'seek'):
+                file.seek(0)
+            text = file.read().decode('utf-8')
+        except (AttributeError, TypeError):
+            # If file is a string path or NamedString, read it differently
+            if isinstance(file, str):
+                with open(file, 'r', encoding='utf-8') as f:
+                    text = f.read()
+            else:
+                # Try to get content directly
+                text = str(file) if hasattr(file, '__str__') else file.read() if hasattr(file, 'read') else ""
         return text, len(text.split()), None
 
     try:
         with tempfile.NamedTemporaryFile(delete=False, suffix=file_extension) as tmp_file:
-            file.seek(0)
-            tmp_file.write(file.read())
+            # Handle file objects that might not have seek() method
+            try:
+                if hasattr(file, 'seek'):
+                    file.seek(0)
+                file_content = file.read()
+            except (AttributeError, TypeError):
+                # If file is a string path, read it directly
+                if isinstance(file, str):
+                    with open(file, 'rb') as f:
+                        file_content = f.read()
+                else:
+                    # Try to get content directly without seek
+                    file_content = file.read() if hasattr(file, 'read') else bytes(file) if hasattr(file, '__bytes__') else b""
+            tmp_file.write(file_content)
            tmp_file_path = tmp_file.name
 
        if MCP_AVAILABLE:
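The pattern this hunk applies can be distilled into a small helper; the following is an illustrative sketch (the `read_upload` name is hypothetical, not from the repo) showing how both plain string paths and seek-less file objects such as Gradio's NamedString are tolerated:

```python
import io

def read_upload(file) -> bytes:
    """Hypothetical helper: return the bytes of an upload that may be a
    filesystem path (str), a regular file object, or a seek-less wrapper."""
    if isinstance(file, str):
        with open(file, 'rb') as f:
            return f.read()
    try:
        if hasattr(file, 'seek'):
            file.seek(0)  # rewind only if the object supports it
        return file.read()
    except (AttributeError, TypeError):
        # Last resort: objects exposing the buffer protocol
        return bytes(file) if hasattr(file, '__bytes__') else b""

# Works for an in-memory file object without special casing:
assert read_upload(io.BytesIO(b"pdf bytes")) == b"pdf bytes"
```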
models.py
CHANGED

@@ -1,4 +1,5 @@
 """Model initialization and management"""
+import os
 import torch
 import threading
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -15,6 +16,19 @@ except ImportError:
     TTS_AVAILABLE = False
     TTS = None
 
+try:
+    from transformers import WhisperProcessor, WhisperForConditionalGeneration
+    try:
+        import torchaudio
+    except ImportError:
+        torchaudio = None
+    WHISPER_AVAILABLE = True
+except ImportError:
+    WHISPER_AVAILABLE = False
+    WhisperProcessor = None
+    WhisperForConditionalGeneration = None
+    torchaudio = None
+
 # Model loading state tracking
 _model_loading_states = {}
 _model_loading_lock = threading.Lock()
@@ -88,6 +102,31 @@ def initialize_tts_model():
         config.global_tts_model = None
     return config.global_tts_model
 
+def initialize_whisper_model():
+    """Initialize Whisper model for speech-to-text (ASR) from Hugging Face"""
+    if not WHISPER_AVAILABLE:
+        logger.warning("Whisper transformers not installed. ASR features will be disabled.")
+        return None
+    if config.global_whisper_model is None:
+        try:
+            logger.info("Initializing Whisper model (openai/whisper-large-v3-turbo) from Hugging Face...")
+            model_id = "openai/whisper-large-v3-turbo"
+            processor = WhisperProcessor.from_pretrained(model_id, token=config.HF_TOKEN)
+            model = WhisperForConditionalGeneration.from_pretrained(
+                model_id,
+                device_map="auto",
+                torch_dtype=torch.float16,
+                token=config.HF_TOKEN
+            )
+            # Store both processor and model
+            config.global_whisper_model = {"processor": processor, "model": model}
+            logger.info(f"Whisper model ({model_id}) initialized successfully")
+        except Exception as e:
+            logger.warning(f"Whisper model initialization failed: {e}")
+            logger.warning("ASR features will be disabled. Install with: pip install transformers torchaudio")
+            config.global_whisper_model = None
+    return config.global_whisper_model
+
 def get_or_create_embed_model():
     """Reuse embedding model to avoid reloading weights each request"""
     if config.global_embed_model is None:
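A hedged usage sketch of the processor/model dict that `initialize_whisper_model()` stores (the silent waveform is a placeholder; casting the features to the model's dtype matches the float16 the loader requests):

```python
# Sketch only: consume the {"processor", "model"} bundle from models.py.
import numpy as np
import torch
from models import initialize_whisper_model  # repo module from this commit

bundle = initialize_whisper_model()  # {"processor": ..., "model": ...} or None
if bundle is not None:
    processor, model = bundle["processor"], bundle["model"]
    waveform = np.zeros(16000, dtype=np.float32)  # placeholder: 1 s of silence
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    features = inputs.input_features.to(model.device, dtype=model.dtype)
    with torch.no_grad():
        ids = model.generate(features)
    print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```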
requirements.txt
CHANGED

@@ -34,3 +34,6 @@ setuptools>=65.0.0
 # Option 2: pip install TTS (may fail on pyworld, but TTS will work for most models without it)
 # The app will run without TTS - voice generation will be disabled
 # TTS
+
+# ASR (Automatic Speech Recognition) - Whisper for speech-to-text (via Hugging Face transformers)
+torchaudio
ui.py
CHANGED

@@ -3,6 +3,7 @@ import time
 import gradio as gr
 import spaces
 from config import TITLE, DESCRIPTION, CSS, MEDSWIN_MODELS, DEFAULT_MEDICAL_MODEL
+import config
 from indexing import create_or_update_index
 from pipeline import stream_chat
 from voice import transcribe_audio, generate_speech
@@ -12,7 +13,9 @@ from models import (
     get_model_loading_state,
     set_model_loading_state,
     initialize_tts_model,
+    initialize_whisper_model,
     TTS_AVAILABLE,
+    WHISPER_AVAILABLE,
 )
 from logger import logger
 
@@ -293,35 +296,89 @@ def create_demo():
     def load_model_and_update_status(model_name):
         """Load model and update status, return status text and whether model is ready"""
         try:
+            status_lines = []
+
+            # Medical model status
             if is_model_loaded(model_name):
+                status_lines.append(f"✅ MedSwin ({model_name}): loaded and ready")
+            else:
+                state = get_model_loading_state(model_name)
+                if state == "loading":
+                    status_lines.append(f"⏳ MedSwin ({model_name}): loading...")
+                elif state == "error":
+                    status_lines.append(f"❌ MedSwin ({model_name}): error loading")
+                else:
+                    # Use GPU-decorated function to load the model
+                    status_text, is_ready = load_model_with_gpu(model_name)
+                    if is_ready:
+                        status_lines.append(f"✅ MedSwin ({model_name}): loaded and ready")
+                    else:
+                        status_lines.append(f"⏳ MedSwin ({model_name}): loading...")
 
+            # TTS model status
+            if TTS_AVAILABLE:
+                if config.global_tts_model is not None:
+                    status_lines.append("✅ TTS (maya1): loaded and ready")
+                else:
+                    status_lines.append("⚠️ TTS (maya1): not loaded")
+            else:
+                status_lines.append("❌ TTS: library not available")
+
+            # ASR (Whisper) model status
+            if WHISPER_AVAILABLE:
+                if config.global_whisper_model is not None:
+                    status_lines.append("✅ ASR (Whisper large-v3-turbo): loaded and ready")
+                else:
+                    status_lines.append("⚠️ ASR (Whisper large-v3-turbo): not loaded")
+            else:
+                status_lines.append("❌ ASR: library not available")
 
+            status_text = "\n".join(status_lines)
+            is_ready = is_model_loaded(model_name)
             return status_text, is_ready
         except Exception as e:
             return f"❌ Error: {str(e)[:100]}", False
 
     def check_model_status(model_name):
         """Check current model status without loading"""
+        status_lines = []
+
+        # Medical model status
         if is_model_loaded(model_name):
+            status_lines.append(f"✅ MedSwin ({model_name}): loaded and ready")
+        else:
+            state = get_model_loading_state(model_name)
+            if state == "loading":
+                status_lines.append(f"⏳ MedSwin ({model_name}): loading...")
+            elif state == "error":
+                status_lines.append(f"❌ MedSwin ({model_name}): error loading")
+            else:
+                status_lines.append(f"⚠️ MedSwin ({model_name}): not loaded")
+
+        # TTS model status
+        if TTS_AVAILABLE:
+            if config.global_tts_model is not None:
+                status_lines.append("✅ TTS (maya1): loaded and ready")
+            else:
+                status_lines.append("⚠️ TTS (maya1): not loaded")
         else:
+            status_lines.append("❌ TTS: library not available")
+
+        # ASR (Whisper) model status
+        if WHISPER_AVAILABLE:
+            if config.global_whisper_model is not None:
+                status_lines.append("✅ ASR (Whisper large-v3-turbo): loaded and ready")
+            else:
+                status_lines.append("⚠️ ASR (Whisper large-v3-turbo): not loaded")
+        else:
+            status_lines.append("❌ ASR: library not available")
+
+        status_text = "\n".join(status_lines)
+        is_ready = is_model_loaded(model_name)
+        return status_text, is_ready
 
     # GPU-decorated function to load model on startup
+    @spaces.GPU(max_duration=120)
     def load_default_model_on_startup():
         """Load default medical model on startup (GPU-decorated for ZeroGPU compatibility)"""
         try:
@@ -343,25 +400,47 @@ def create_demo():
             logger.error(f"Error in model loading startup: {e}")
             return f"⚠️ Startup loading error: {str(e)[:100]}"
 
-    # GPU-decorated function to load default TTS
-        """Load default TTS model (maya1) on startup"""
+    # GPU-decorated function to load default TTS and ASR models on startup
+    @spaces.GPU(max_duration=120)
+    def load_voice_models_on_startup():
+        """Load default TTS model (maya1) and ASR model (Whisper) on startup"""
         try:
+            # Load TTS model
+            if TTS_AVAILABLE:
+                logger.info("Loading default TTS model (maya1) on startup...")
+                initialize_tts_model()
+                if config.global_tts_model is not None:
+                    logger.info("✅ Default TTS model (maya1) loaded successfully on startup!")
+                else:
+                    logger.warning("⚠️ TTS model failed to load on startup")
+            else:
                 logger.warning("TTS library not installed; skipping TTS preload.")
+
+            # Load ASR (Whisper) model
+            if WHISPER_AVAILABLE:
+                logger.info("Loading default ASR model (Whisper large-v3-turbo) on startup...")
+                initialize_whisper_model()
+                if config.global_whisper_model is not None:
+                    logger.info("✅ Default ASR model (Whisper large-v3-turbo) loaded successfully on startup!")
+                else:
+                    logger.warning("⚠️ ASR model failed to load on startup")
+            else:
+                logger.warning("Whisper transformers not installed; skipping ASR preload.")
         except Exception as e:
+            logger.error(f"Error in voice models loading startup: {e}")
+            import traceback
+            logger.debug(f"Full traceback: {traceback.format_exc()}")
 
     # Initialize status on load
     def init_model_status():
         status_text, is_ready = check_model_status(DEFAULT_MEDICAL_MODEL)
         return status_text
 
+    # Update status when model selection changes
+    def update_model_status_on_change(model_name):
+        status_text, is_ready = check_model_status(model_name)
+        return status_text
+
     # Handle model selection change
     def on_model_change(model_name):
         status_text, is_ready = load_model_and_update_status(model_name)
@@ -372,23 +451,35 @@ def create_demo():
             gr.update(interactive=submit_enabled)
         )
 
+    # Update status display periodically or on model status changes
+    def refresh_model_status(model_name):
+        return update_model_status_on_change(model_name)
+
     medical_model.change(
         fn=on_model_change,
         inputs=[medical_model],
         outputs=[model_status, submit_button, message_input]
     )
 
+    # Load models on startup - they will be loaded in separate GPU sessions
+    # First load medical model
     demo.load(
         fn=load_default_model_on_startup,
+        inputs=None,
         outputs=[model_status]
     )
+    # Then load voice models (TTS and ASR)
     demo.load(
+        fn=load_voice_models_on_startup,
         inputs=None,
         outputs=None
     )
+    # Finally update status to show all models
+    demo.load(
+        fn=lambda: check_model_status(DEFAULT_MEDICAL_MODEL)[0],
+        inputs=None,
+        outputs=[model_status]
+    )
 
     # Wrap stream_chat to check model status before execution
     def stream_chat_with_model_check(
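A standalone sketch of the chained `demo.load` pattern used above (dummy lambdas stand in for the real loaders; whether the handlers run strictly in sequence depends on Gradio's queueing):

```python
import gradio as gr

with gr.Blocks() as demo:
    status = gr.Textbox(label="Model status")
    # Each load handler fires when the page opens; splitting the work into
    # three handlers keeps each GPU-decorated load in its own session.
    demo.load(fn=lambda: "⏳ loading MedSwin...", inputs=None, outputs=[status])
    demo.load(fn=lambda: None, inputs=None, outputs=None)  # voice models
    demo.load(fn=lambda: "✅ all models checked", inputs=None, outputs=[status])
```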
voice.py
CHANGED

@@ -3,10 +3,12 @@ import os
 import asyncio
 import tempfile
 import soundfile as sf
+import torch
 from logger import logger
 from client import MCP_AVAILABLE, call_agent, get_mcp_session, get_cached_mcp_tools
 import config
-from models import TTS_AVAILABLE, initialize_tts_model
+from models import TTS_AVAILABLE, WHISPER_AVAILABLE, initialize_tts_model, initialize_whisper_model
+import spaces
 
 try:
     import nest_asyncio
@@ -69,45 +71,158 @@ async def transcribe_audio_gemini(audio_path: str) -> str:
         logger.error(f"Gemini transcription error: {e}")
         return ""
 
+@spaces.GPU(max_duration=60)
+def transcribe_audio_whisper(audio_path: str) -> str:
+    """Transcribe audio using Whisper model from Hugging Face"""
+    if not WHISPER_AVAILABLE:
+        logger.warning("[ASR] Whisper not available for transcription")
+        return ""
+
+    try:
+        logger.info(f"[ASR] Starting Whisper transcription for: {audio_path}")
+        if config.global_whisper_model is None:
+            logger.info("[ASR] Whisper model not loaded, initializing...")
+            initialize_whisper_model()
+
+        if config.global_whisper_model is None:
+            logger.error("[ASR] Failed to initialize Whisper model")
+            return ""
+
+        # Extract processor and model from stored dict
+        processor = config.global_whisper_model["processor"]
+        model = config.global_whisper_model["model"]
+
+        logger.info("[ASR] Loading audio file...")
+        # Load audio using torchaudio (imported from models)
+        from models import torchaudio
+        if torchaudio is None:
+            logger.error("[ASR] torchaudio not available")
+            return ""
+
+        waveform, sample_rate = torchaudio.load(audio_path)
+        # Resample to 16kHz if needed (Whisper expects 16kHz)
+        if sample_rate != 16000:
+            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
+            waveform = resampler(waveform)
+            sample_rate = 16000
+
+        logger.info("[ASR] Processing audio with Whisper...")
+        # Process audio
+        inputs = processor(waveform.squeeze().numpy(), sampling_rate=sample_rate, return_tensors="pt")
+
+        # Move inputs to same device as model
+        device = next(model.parameters()).device
+        inputs = {k: v.to(device) for k, v in inputs.items()}
+
+        logger.info("[ASR] Running Whisper transcription...")
+        # Generate transcription
+        with torch.no_grad():
+            generated_ids = model.generate(**inputs)
+
+        # Decode transcription
+        transcribed_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
+
+        if transcribed_text:
+            logger.info(f"[ASR] ✅ Transcription successful: {transcribed_text[:100]}...")
+            logger.info(f"[ASR] Transcription length: {len(transcribed_text)} characters")
+        else:
+            logger.warning("[ASR] Whisper returned empty transcription")
+
+        return transcribed_text
+    except Exception as e:
+        logger.error(f"[ASR] Whisper transcription error: {e}")
+        import traceback
+        logger.debug(f"[ASR] Full traceback: {traceback.format_exc()}")
+        return ""
+
 def transcribe_audio(audio):
-    """Transcribe audio to text using Gemini MCP"""
+    """Transcribe audio to text using Whisper (primary) or Gemini MCP (fallback)"""
     if audio is None:
+        logger.warning("[ASR] No audio provided")
         return ""
 
     try:
+        # Convert audio input to file path
         if isinstance(audio, str):
             audio_path = audio
         elif isinstance(audio, tuple):
             sample_rate, audio_data = audio
+            logger.info(f"[ASR] Processing audio tuple: sample_rate={sample_rate}, data_shape={audio_data.shape if hasattr(audio_data, 'shape') else 'unknown'}")
             with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
                 sf.write(tmp_file.name, audio_data, samplerate=sample_rate)
                 audio_path = tmp_file.name
+            logger.info(f"[ASR] Created temporary audio file: {audio_path}")
         else:
             audio_path = audio
 
+        logger.info(f"[ASR] Attempting transcription with Whisper (primary method)...")
+
+        # Try Whisper first (primary method)
+        if WHISPER_AVAILABLE:
+            try:
+                transcribed = transcribe_audio_whisper(audio_path)
+                if transcribed:
+                    logger.info(f"[ASR] ✅ Successfully transcribed via Whisper: {transcribed[:50]}...")
+                    # Clean up temp file if we created it
+                    if isinstance(audio, tuple) and os.path.exists(audio_path):
+                        try:
+                            os.unlink(audio_path)
+                        except:
+                            pass
+                    return transcribed
+                else:
+                    logger.warning("[ASR] Whisper transcription returned empty, trying fallback...")
+            except Exception as e:
+                logger.error(f"[ASR] Whisper transcription failed: {e}, trying fallback...")
+        else:
+            logger.warning("[ASR] Whisper not available, trying Gemini fallback...")
+
+        # Fallback to Gemini MCP if Whisper fails or is unavailable
         if MCP_AVAILABLE:
             try:
+                logger.info("[ASR] Attempting transcription with Gemini MCP (fallback)...")
                 loop = asyncio.get_event_loop()
                 if loop.is_running():
                     if nest_asyncio:
                         transcribed = nest_asyncio.run(transcribe_audio_gemini(audio_path))
                         if transcribed:
-                            logger.info(f"Transcribed via Gemini MCP: {transcribed[:50]}...")
+                            logger.info(f"[ASR] Transcribed via Gemini MCP (fallback): {transcribed[:50]}...")
+                            # Clean up temp file if we created it
+                            if isinstance(audio, tuple) and os.path.exists(audio_path):
+                                try:
+                                    os.unlink(audio_path)
+                                except:
+                                    pass
                             return transcribed
                     else:
-                        logger.error("nest_asyncio not available for nested async transcription")
+                        logger.error("[ASR] nest_asyncio not available for nested async transcription")
                 else:
                     transcribed = loop.run_until_complete(transcribe_audio_gemini(audio_path))
                     if transcribed:
-                        logger.info(f"Transcribed via Gemini MCP: {transcribed[:50]}...")
+                        logger.info(f"[ASR] Transcribed via Gemini MCP (fallback): {transcribed[:50]}...")
+                        # Clean up temp file if we created it
+                        if isinstance(audio, tuple) and os.path.exists(audio_path):
+                            try:
+                                os.unlink(audio_path)
+                            except:
+                                pass
                         return transcribed
             except Exception as e:
-                logger.error(f"Gemini MCP transcription error: {e}")
+                logger.error(f"[ASR] Gemini MCP transcription error: {e}")
 
+        # Clean up temp file if we created it
+        if isinstance(audio, tuple) and os.path.exists(audio_path):
+            try:
+                os.unlink(audio_path)
+            except:
+                pass
+
+        logger.warning("[ASR] All transcription methods failed")
         return ""
     except Exception as e:
-        logger.error(f"Transcription error: {e}")
+        logger.error(f"[ASR] Transcription error: {e}")
+        import traceback
+        logger.debug(f"[ASR] Full traceback: {traceback.format_exc()}")
        return ""
 
 async def generate_speech_mcp(text: str) -> str:
@@ -185,6 +300,7 @@ def _generate_speech_via_mcp(text: str):
         logger.warning(f"MCP TTS error (sync wrapper): {e}")
         return None
 
+@spaces.GPU(max_duration=60)
 def generate_speech(text: str):
     """Generate speech from text using local maya1 TTS model (with MCP fallback).
 
@@ -193,26 +309,34 @@ def generate_speech(text: str):
     or fails.
     """
     if not text or len(text.strip()) == 0:
+        logger.warning("[TTS] Empty text provided")
         return None
 
+    logger.info(f"[TTS] Generating speech for text: {text[:50]}...")
+
     if not TTS_AVAILABLE:
-        logger.error("TTS library not installed. Please install TTS to use voice generation.")
+        logger.error("[TTS] TTS library not installed. Please install TTS to use voice generation.")
         # As a last resort, try MCP-based TTS if available
         return _generate_speech_via_mcp(text)
 
     if config.global_tts_model is None:
+        logger.info("[TTS] TTS model not loaded, initializing...")
         initialize_tts_model()
 
     if config.global_tts_model is None:
-        logger.error("TTS model not available. Please check dependencies.")
+        logger.error("[TTS] TTS model not available. Please check dependencies.")
         return _generate_speech_via_mcp(text)
 
     try:
+        logger.info("[TTS] Running TTS generation...")
         wav = config.global_tts_model.tts(text)
         with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
             sf.write(tmp_file.name, wav, samplerate=22050)
+            logger.info(f"[TTS] ✅ Speech generated successfully: {tmp_file.name}")
             return tmp_file.name
     except Exception as e:
-        logger.error(f"TTS error (local maya1): {e}")
+        logger.error(f"[TTS] TTS error (local maya1): {e}")
+        import traceback
+        logger.debug(f"[TTS] Full traceback: {traceback.format_exc()}")
         return _generate_speech_via_mcp(text)
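For illustration, the two input shapes `transcribe_audio()` accepts, per the branches above (the file name is hypothetical; the silent array stands in for recorded samples):

```python
import numpy as np
from voice import transcribe_audio  # repo module from this commit

# 1. A path to an audio file on disk
text = transcribe_audio("recording.wav")  # hypothetical file

# 2. A (sample_rate, samples) tuple, as produced by a Gradio Audio component
audio = (16000, np.zeros(16000, dtype=np.float32))
text = transcribe_audio(audio)
```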