Y Phung Nguyen committed
Commit af9efda · 1 Parent(s): a1f9b00

Fix PDF upload, add Whisper ASR, and enhance model status display


- Fix PDF upload error by handling file objects without seek() method (NamedString compatibility)
- Replace Gemini transcription with Whisper large-v3-turbo from Hugging Face (openai/whisper-large-v3-turbo)
- Add comprehensive ASR debugging/logging with [ASR] prefixes
- Load Whisper model at startup alongside maya1 TTS and MedSwin
- Update model status textbox to display MedSwin, TTS (maya1), and ASR (Whisper) loading status
- Ensure all voice models (TTS and ASR) are GPU-decorated for ZeroGPU compatibility
- Update requirements.txt to use transformers/torchaudio instead of openai-whisper
- Update README with new Whisper ASR implementation details

Files changed (8):
  1. README.md +5 -3
  2. app.py +7 -3
  3. config.py +1 -0
  4. indexing.py +27 -3
  5. models.py +39 -0
  6. requirements.txt +3 -0
  7. ui.py +120 -29
  8. voice.py +135 -11
README.md CHANGED
@@ -70,9 +70,10 @@ tags:
 Sample [conversation](sample.md) between MAC-system and patient are recorded.
 
 ### 🎤 **Voice Features**
-- **Speech-to-Text**: Voice input transcription using Gemini MCP
+- **Speech-to-Text**: Voice input transcription using Whisper large-v3-turbo (Hugging Face) with Gemini MCP fallback
 - **Inline Mic Experience**: Built-in microphone widget with live recording timer that drops transcripts straight into the chat box
 - **Text-to-Speech**: Voice output generation using Maya1 TTS model (optional, fallback to MCP if unavailable) plus a one-click "Play Response" control for the latest answer
+- **Model Status Display**: Real-time status for MedSwin, TTS (maya1), and ASR (Whisper) models
 
 ### 🤝 **MAC Architecture (Multi-Agent Collaboration)**
@@ -124,7 +125,7 @@ Sample [conversation](sample.md) between MAC-system and patient are recorded.
 - **Architecture**: MAC (Multi-Agent Collaboration) - Gemini Supervisor + MedSwin Specialist
 - **Translation**: Gemini MCP (gemini-2.5-flash-lite)
 - **Document Parsing**: Gemini MCP (PDF, Word, TXT, MD, JSON, XML, CSV)
-- **Speech-to-Text**: Gemini MCP (gemini-2.5-flash-lite)
+- **Speech-to-Text**: openai/whisper-large-v3-turbo (Hugging Face, primary) with Gemini MCP fallback
 - **Supervisor Tasks**: Gemini MCP (gemini-2.5-flash) - query breakdown, search strategies, RAG brainstorming
 - **MedSwin Execution**: GPU-tagged tasks for efficient inference
 - **Text-to-Speech**: maya-research/maya1 (optional, with MCP fallback)
@@ -139,10 +140,11 @@ Sample [conversation](sample.md) between MAC-system and patient are recorded.
 See `requirements.txt` for full dependency list. Key dependencies:
 - **MCP Integration**: `mcp`, `nest-asyncio`, `google-genai` (for Gemini MCP server)
 - **Fallback Dependencies**: `requests`, `beautifulsoup4`, `ddgs` (used when MCP web search unavailable)
-- **Core ML**: `transformers`, `torch`, `accelerate`
+- **Core ML**: `transformers`, `torch`, `accelerate`, `torchaudio`
 - **RAG Framework**: `llama-index`, `llama_index.llms.huggingface`, `llama_index.embeddings.huggingface`
 - **Utilities**: `langdetect`, `gradio`, `spaces`, `soundfile`
 - **TTS**: Optional - `TTS` package (voice features work with MCP fallback if unavailable)
+- **ASR**: Whisper via `transformers` (openai/whisper-large-v3-turbo from Hugging Face)
 
 ### 🔌 MCP Configuration
app.py CHANGED
@@ -5,7 +5,7 @@ os.environ["TOKENIZERS_PARALLELISM"] = "false"
 from logger import logger
 from config import DEFAULT_MEDICAL_MODEL
 import config
-from models import initialize_medical_model, initialize_tts_model
+from models import initialize_medical_model, initialize_tts_model, initialize_whisper_model, WHISPER_AVAILABLE
 from client import MCP_AVAILABLE
 from ui import create_demo
 
@@ -17,8 +17,12 @@ if __name__ == "__main__":
     logger.info("App starting - models will be loaded on-demand when first needed")
     logger.info(f"Default medical model: {DEFAULT_MEDICAL_MODEL}")
 
-    # TTS model also uses GPU decorator, so skip preloading
-    logger.info("TTS model will be loaded on-demand if needed")
+    # TTS and ASR models also use GPU decorator, so skip preloading
+    logger.info("TTS and ASR models will be loaded on-demand if needed")
+    if WHISPER_AVAILABLE:
+        logger.info("Whisper ASR library (transformers) is available")
+    else:
+        logger.warning("Whisper ASR library not available - install with: pip install transformers torchaudio")
 
     # Check Gemini MCP availability
     if MCP_AVAILABLE:
config.py CHANGED
@@ -35,6 +35,7 @@ global_medical_models = {}
 global_medical_tokenizers = {}
 global_file_info = {}
 global_tts_model = None
+global_whisper_model = None
 global_embed_model = None
 
 # MCP client storage
indexing.py CHANGED
@@ -80,13 +80,37 @@ def extract_text_from_document(file):
     file_extension = os.path.splitext(file_name)[1].lower()
 
     if file_extension == '.txt':
-        text = file.read().decode('utf-8')
+        # Handle file objects that might not have seek() method
+        try:
+            if hasattr(file, 'seek'):
+                file.seek(0)
+            text = file.read().decode('utf-8')
+        except (AttributeError, TypeError):
+            # If file is a string path or NamedString, read it differently
+            if isinstance(file, str):
+                with open(file, 'r', encoding='utf-8') as f:
+                    text = f.read()
+            else:
+                # Try to get content directly
+                text = str(file) if hasattr(file, '__str__') else file.read() if hasattr(file, 'read') else ""
         return text, len(text.split()), None
 
     try:
         with tempfile.NamedTemporaryFile(delete=False, suffix=file_extension) as tmp_file:
-            file.seek(0)
-            tmp_file.write(file.read())
+            # Handle file objects that might not have seek() method
+            try:
+                if hasattr(file, 'seek'):
+                    file.seek(0)
+                file_content = file.read()
+            except (AttributeError, TypeError):
+                # If file is a string path, read it directly
+                if isinstance(file, str):
+                    with open(file, 'rb') as f:
+                        file_content = f.read()
+                else:
+                    # Try to get content directly without seek
+                    file_content = file.read() if hasattr(file, 'read') else bytes(file) if hasattr(file, '__bytes__') else b""
+            tmp_file.write(file_content)
             tmp_file_path = tmp_file.name
 
         if MCP_AVAILABLE:
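The seek()-compatibility branches above repeat the same normalization twice, once for text and once for binary reads. The pattern can be factored into a single helper; a minimal sketch under the same assumptions about Gradio upload inputs (the `read_upload_bytes` name is illustrative, not part of the repo):

```python
import io

def read_upload_bytes(file) -> bytes:
    """Return raw bytes from a path string, a seekable file object,
    or a NamedString-like object that only exposes its text."""
    if isinstance(file, str):
        # Gradio often passes a temp-file path rather than a file object
        with open(file, "rb") as f:
            return f.read()
    if hasattr(file, "read"):
        # Real file-like object; rewind if it supports seeking
        if hasattr(file, "seek"):
            file.seek(0)
        data = file.read()
        return data.encode("utf-8") if isinstance(data, str) else data
    # Last resort: stringify whatever we were handed
    return str(file).encode("utf-8")

# Usage: works for in-memory buffers as well as path strings
buf = io.BytesIO(b"hello pdf bytes")
assert read_upload_bytes(buf) == b"hello pdf bytes"
```

The `.txt` branch would then reduce to `read_upload_bytes(file).decode('utf-8')` and the temp-file branch to `tmp_file.write(read_upload_bytes(file))`.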
models.py CHANGED
@@ -1,4 +1,5 @@
 """Model initialization and management"""
+import os
 import torch
 import threading
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -15,6 +16,19 @@ except ImportError:
     TTS_AVAILABLE = False
     TTS = None
 
+try:
+    from transformers import WhisperProcessor, WhisperForConditionalGeneration
+    try:
+        import torchaudio
+    except ImportError:
+        torchaudio = None
+    WHISPER_AVAILABLE = True
+except ImportError:
+    WHISPER_AVAILABLE = False
+    WhisperProcessor = None
+    WhisperForConditionalGeneration = None
+    torchaudio = None
+
 # Model loading state tracking
 _model_loading_states = {}
 _model_loading_lock = threading.Lock()
@@ -88,6 +102,31 @@ def initialize_tts_model():
         config.global_tts_model = None
     return config.global_tts_model
 
+def initialize_whisper_model():
+    """Initialize Whisper model for speech-to-text (ASR) from Hugging Face"""
+    if not WHISPER_AVAILABLE:
+        logger.warning("Whisper transformers not installed. ASR features will be disabled.")
+        return None
+    if config.global_whisper_model is None:
+        try:
+            logger.info("Initializing Whisper model (openai/whisper-large-v3-turbo) from Hugging Face...")
+            model_id = "openai/whisper-large-v3-turbo"
+            processor = WhisperProcessor.from_pretrained(model_id, token=config.HF_TOKEN)
+            model = WhisperForConditionalGeneration.from_pretrained(
+                model_id,
+                device_map="auto",
+                torch_dtype=torch.float16,
+                token=config.HF_TOKEN
+            )
+            # Store both processor and model
+            config.global_whisper_model = {"processor": processor, "model": model}
+            logger.info(f"Whisper model ({model_id}) initialized successfully")
+        except Exception as e:
+            logger.warning(f"Whisper model initialization failed: {e}")
+            logger.warning("ASR features will be disabled. Install with: pip install transformers torchaudio")
+            config.global_whisper_model = None
+    return config.global_whisper_model
+
 def get_or_create_embed_model():
     """Reuse embedding model to avoid reloading weights each request"""
     if config.global_embed_model is None:
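`initialize_whisper_model` follows the same lazy-singleton shape as `initialize_tts_model`: check a module-level cache, load once, and return the cached instance on every later call. The pattern in isolation, with an illustrative factory in place of the expensive model download:

```python
_cache = {}

def get_model(name, factory):
    """Build the model on first request, then reuse the cached instance."""
    if name not in _cache:
        _cache[name] = factory()  # expensive load happens only once
    return _cache[name]

calls = []
def fake_loader():
    calls.append(1)  # track how many times we really "load"
    return {"model": "whisper", "processor": "proc"}

a = get_model("asr", fake_loader)
b = get_model("asr", fake_loader)
assert a is b and len(calls) == 1  # loaded exactly once, then reused
```

This is why the functions return `config.global_whisper_model` unconditionally at the end: a second call is a cheap dictionary lookup, not a reload.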
requirements.txt CHANGED
@@ -34,3 +34,6 @@ setuptools>=65.0.0
 # Option 2: pip install TTS (may fail on pyworld, but TTS will work for most models without it)
 # The app will run without TTS - voice generation will be disabled
 # TTS
+
+# ASR (Automatic Speech Recognition) - Whisper for speech-to-text (via Hugging Face transformers)
+torchaudio
ui.py CHANGED
@@ -3,6 +3,7 @@ import time
 import gradio as gr
 import spaces
 from config import TITLE, DESCRIPTION, CSS, MEDSWIN_MODELS, DEFAULT_MEDICAL_MODEL
+import config
 from indexing import create_or_update_index
 from pipeline import stream_chat
 from voice import transcribe_audio, generate_speech
@@ -12,7 +13,9 @@ from models import (
     get_model_loading_state,
     set_model_loading_state,
     initialize_tts_model,
+    initialize_whisper_model,
     TTS_AVAILABLE,
+    WHISPER_AVAILABLE,
 )
 from logger import logger
 
@@ -293,35 +296,89 @@ def create_demo():
     def load_model_and_update_status(model_name):
         """Load model and update status, return status text and whether model is ready"""
         try:
+            status_lines = []
+
+            # Medical model status
             if is_model_loaded(model_name):
-                return "✅ The model has been loaded successfully", True
+                status_lines.append(f"✅ MedSwin ({model_name}): loaded and ready")
+            else:
+                state = get_model_loading_state(model_name)
+                if state == "loading":
+                    status_lines.append(f"⏳ MedSwin ({model_name}): loading...")
+                elif state == "error":
+                    status_lines.append(f"❌ MedSwin ({model_name}): error loading")
+                else:
+                    # Use GPU-decorated function to load the model
+                    status_text, is_ready = load_model_with_gpu(model_name)
+                    if is_ready:
+                        status_lines.append(f"✅ MedSwin ({model_name}): loaded and ready")
+                    else:
+                        status_lines.append(f"⏳ MedSwin ({model_name}): loading...")
 
-            state = get_model_loading_state(model_name)
-            if state == "loading":
-                return "⏳ The model is being loaded, please wait...", False
-            elif state == "error":
-                return "❌ Error loading model. Please try again.", False
+            # TTS model status
+            if TTS_AVAILABLE:
+                if config.global_tts_model is not None:
+                    status_lines.append("✅ TTS (maya1): loaded and ready")
+                else:
+                    status_lines.append("⚠️ TTS (maya1): not loaded")
+            else:
+                status_lines.append("❌ TTS: library not available")
 
-            # Use GPU-decorated function to load the model
-            status_text, is_ready = load_model_with_gpu(model_name)
+            # ASR (Whisper) model status
+            if WHISPER_AVAILABLE:
+                if config.global_whisper_model is not None:
+                    status_lines.append("✅ ASR (Whisper large-v3-turbo): loaded and ready")
+                else:
+                    status_lines.append("⚠️ ASR (Whisper large-v3-turbo): not loaded")
+            else:
+                status_lines.append("❌ ASR: library not available")
+
+            status_text = "\n".join(status_lines)
+            is_ready = is_model_loaded(model_name)
             return status_text, is_ready
         except Exception as e:
             return f"❌ Error: {str(e)[:100]}", False
 
     def check_model_status(model_name):
         """Check current model status without loading"""
+        status_lines = []
+
+        # Medical model status
         if is_model_loaded(model_name):
-            return "✅ The model has been loaded successfully", True
-        state = get_model_loading_state(model_name)
-        if state == "loading":
-            return "⏳ The model is being loaded, please wait...", False
-        elif state == "error":
-            return "❌ Error loading model. Please try again.", False
+            status_lines.append(f"✅ MedSwin ({model_name}): loaded and ready")
+        else:
+            state = get_model_loading_state(model_name)
+            if state == "loading":
+                status_lines.append(f"⏳ MedSwin ({model_name}): loading...")
+            elif state == "error":
+                status_lines.append(f"❌ MedSwin ({model_name}): error loading")
+            else:
+                status_lines.append(f"⚠️ MedSwin ({model_name}): not loaded")
+
+        # TTS model status
+        if TTS_AVAILABLE:
+            if config.global_tts_model is not None:
+                status_lines.append("✅ TTS (maya1): loaded and ready")
+            else:
+                status_lines.append("⚠️ TTS (maya1): not loaded")
         else:
-            return "⚠️ Model not loaded. Click to load or it will load on first use.", False
+            status_lines.append("❌ TTS: library not available")
+
+        # ASR (Whisper) model status
+        if WHISPER_AVAILABLE:
+            if config.global_whisper_model is not None:
+                status_lines.append("✅ ASR (Whisper large-v3-turbo): loaded and ready")
+            else:
+                status_lines.append("⚠️ ASR (Whisper large-v3-turbo): not loaded")
+        else:
+            status_lines.append("❌ ASR: library not available")
+
+        status_text = "\n".join(status_lines)
+        is_ready = is_model_loaded(model_name)
+        return status_text, is_ready
 
     # GPU-decorated function to load model on startup
-    # @spaces.GPU(max_duration=120)
+    @spaces.GPU(max_duration=120)
     def load_default_model_on_startup():
         """Load default medical model on startup (GPU-decorated for ZeroGPU compatibility)"""
         try:
@@ -343,25 +400,47 @@ def create_demo():
             logger.error(f"Error in model loading startup: {e}")
             return f"⚠️ Startup loading error: {str(e)[:100]}"
 
-    # GPU-decorated function to load default TTS model on startup
-    # @spaces.GPU(max_duration=120)
-    def load_tts_model_on_startup():
-        """Load default TTS model (maya1) on startup"""
+    # GPU-decorated function to load default TTS and ASR models on startup
+    @spaces.GPU(max_duration=120)
+    def load_voice_models_on_startup():
+        """Load default TTS model (maya1) and ASR model (Whisper) on startup"""
         try:
-            if not TTS_AVAILABLE:
+            # Load TTS model
+            if TTS_AVAILABLE:
+                logger.info("Loading default TTS model (maya1) on startup...")
+                initialize_tts_model()
+                if config.global_tts_model is not None:
+                    logger.info("✅ Default TTS model (maya1) loaded successfully on startup!")
+                else:
+                    logger.warning("⚠️ TTS model failed to load on startup")
+            else:
                 logger.warning("TTS library not installed; skipping TTS preload.")
-                return
-            logger.info("Loading default TTS model (maya1) on startup...")
-            initialize_tts_model()
-            logger.info(" Default TTS model (maya1) loaded successfully on startup!")
+
+            # Load ASR (Whisper) model
+            if WHISPER_AVAILABLE:
+                logger.info("Loading default ASR model (Whisper large-v3-turbo) on startup...")
+                initialize_whisper_model()
+                if config.global_whisper_model is not None:
+                    logger.info("✅ Default ASR model (Whisper large-v3-turbo) loaded successfully on startup!")
+                else:
+                    logger.warning("⚠️ ASR model failed to load on startup")
+            else:
+                logger.warning("Whisper transformers not installed; skipping ASR preload.")
         except Exception as e:
-            logger.error(f"Error in TTS model loading startup: {e}")
+            logger.error(f"Error in voice models loading startup: {e}")
+            import traceback
+            logger.debug(f"Full traceback: {traceback.format_exc()}")
 
     # Initialize status on load
     def init_model_status():
        status_text, is_ready = check_model_status(DEFAULT_MEDICAL_MODEL)
        return status_text
 
+    # Update status when model selection changes
+    def update_model_status_on_change(model_name):
+        status_text, is_ready = check_model_status(model_name)
+        return status_text
+
     # Handle model selection change
     def on_model_change(model_name):
         status_text, is_ready = load_model_and_update_status(model_name)
@@ -372,23 +451,35 @@ def create_demo():
             gr.update(interactive=submit_enabled)
         )
 
+    # Update status display periodically or on model status changes
+    def refresh_model_status(model_name):
+        return update_model_status_on_change(model_name)
+
     medical_model.change(
         fn=on_model_change,
         inputs=[medical_model],
         outputs=[model_status, submit_button, message_input]
     )
 
-    # Load default medical model on startup (GPU-decorated function)
+    # Load models on startup - they will be loaded in separate GPU sessions
+    # First load medical model
     demo.load(
         fn=load_default_model_on_startup,
+        inputs=None,
         outputs=[model_status]
     )
-    # Load default TTS model (maya1) on startup (GPU-decorated function)
+    # Then load voice models (TTS and ASR)
     demo.load(
-        fn=load_tts_model_on_startup,
+        fn=load_voice_models_on_startup,
         inputs=None,
         outputs=None
    )
+    # Finally update status to show all models
+    demo.load(
+        fn=lambda: check_model_status(DEFAULT_MEDICAL_MODEL)[0],
+        inputs=None,
+        outputs=[model_status]
+    )
 
     # Wrap stream_chat to check model status before execution
     def stream_chat_with_model_check(
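The two status functions above build the same per-model lines twice. The mapping from model state to a status line is simple enough to sketch as a standalone helper (hypothetical names, not from the repo; the model label is illustrative):

```python
def build_status(entries):
    """entries: list of (label, state) where state is one of 'ready',
    'loading', 'not_loaded', 'unavailable'. Returns multi-line status text."""
    icons = {
        "ready": "✅",        # model loaded and usable
        "loading": "⏳",      # load in progress
        "not_loaded": "⚠️",   # library present, model not yet loaded
        "unavailable": "❌",  # library missing entirely
    }
    return "\n".join(f"{icons[state]} {label}" for label, state in entries)

text = build_status([
    ("MedSwin (example-model): loaded and ready", "ready"),
    ("TTS (maya1): not loaded", "not_loaded"),
    ("ASR (Whisper large-v3-turbo): loaded and ready", "ready"),
])
assert text.count("\n") == 2 and text.startswith("✅")
```

Both `load_model_and_update_status` and `check_model_status` could then share one table of (label, state) pairs instead of duplicating the branch logic.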
voice.py CHANGED
@@ -3,10 +3,12 @@ import os
 import asyncio
 import tempfile
 import soundfile as sf
+import torch
 from logger import logger
 from client import MCP_AVAILABLE, call_agent, get_mcp_session, get_cached_mcp_tools
 import config
-from models import TTS_AVAILABLE, initialize_tts_model
+from models import TTS_AVAILABLE, WHISPER_AVAILABLE, initialize_tts_model, initialize_whisper_model
+import spaces
 
 try:
     import nest_asyncio
@@ -69,45 +71,158 @@ async def transcribe_audio_gemini(audio_path: str) -> str:
         logger.error(f"Gemini transcription error: {e}")
         return ""
 
+@spaces.GPU(max_duration=60)
+def transcribe_audio_whisper(audio_path: str) -> str:
+    """Transcribe audio using Whisper model from Hugging Face"""
+    if not WHISPER_AVAILABLE:
+        logger.warning("[ASR] Whisper not available for transcription")
+        return ""
+
+    try:
+        logger.info(f"[ASR] Starting Whisper transcription for: {audio_path}")
+        if config.global_whisper_model is None:
+            logger.info("[ASR] Whisper model not loaded, initializing...")
+            initialize_whisper_model()
+
+        if config.global_whisper_model is None:
+            logger.error("[ASR] Failed to initialize Whisper model")
+            return ""
+
+        # Extract processor and model from stored dict
+        processor = config.global_whisper_model["processor"]
+        model = config.global_whisper_model["model"]
+
+        logger.info("[ASR] Loading audio file...")
+        # Load audio using torchaudio (imported from models)
+        from models import torchaudio
+        if torchaudio is None:
+            logger.error("[ASR] torchaudio not available")
+            return ""
+
+        waveform, sample_rate = torchaudio.load(audio_path)
+        # Resample to 16kHz if needed (Whisper expects 16kHz)
+        if sample_rate != 16000:
+            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
+            waveform = resampler(waveform)
+            sample_rate = 16000
+
+        logger.info("[ASR] Processing audio with Whisper...")
+        # Process audio
+        inputs = processor(waveform.squeeze().numpy(), sampling_rate=sample_rate, return_tensors="pt")
+
+        # Move inputs to same device as model
+        device = next(model.parameters()).device
+        inputs = {k: v.to(device) for k, v in inputs.items()}
+
+        logger.info("[ASR] Running Whisper transcription...")
+        # Generate transcription
+        with torch.no_grad():
+            generated_ids = model.generate(**inputs)
+
+        # Decode transcription
+        transcribed_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
+
+        if transcribed_text:
+            logger.info(f"[ASR] ✅ Transcription successful: {transcribed_text[:100]}...")
+            logger.info(f"[ASR] Transcription length: {len(transcribed_text)} characters")
+        else:
+            logger.warning("[ASR] Whisper returned empty transcription")
+
+        return transcribed_text
+    except Exception as e:
+        logger.error(f"[ASR] Whisper transcription error: {e}")
+        import traceback
+        logger.debug(f"[ASR] Full traceback: {traceback.format_exc()}")
+        return ""
+
 def transcribe_audio(audio):
-    """Transcribe audio to text using Gemini MCP"""
+    """Transcribe audio to text using Whisper (primary) or Gemini MCP (fallback)"""
     if audio is None:
+        logger.warning("[ASR] No audio provided")
         return ""
 
     try:
+        # Convert audio input to file path
         if isinstance(audio, str):
             audio_path = audio
         elif isinstance(audio, tuple):
             sample_rate, audio_data = audio
+            logger.info(f"[ASR] Processing audio tuple: sample_rate={sample_rate}, data_shape={audio_data.shape if hasattr(audio_data, 'shape') else 'unknown'}")
             with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
                 sf.write(tmp_file.name, audio_data, samplerate=sample_rate)
                 audio_path = tmp_file.name
+            logger.info(f"[ASR] Created temporary audio file: {audio_path}")
         else:
             audio_path = audio
 
+        logger.info("[ASR] Attempting transcription with Whisper (primary method)...")
+
+        # Try Whisper first (primary method)
+        if WHISPER_AVAILABLE:
+            try:
+                transcribed = transcribe_audio_whisper(audio_path)
+                if transcribed:
+                    logger.info(f"[ASR] ✅ Successfully transcribed via Whisper: {transcribed[:50]}...")
+                    # Clean up temp file if we created it
+                    if isinstance(audio, tuple) and os.path.exists(audio_path):
+                        try:
+                            os.unlink(audio_path)
+                        except:
+                            pass
+                    return transcribed
+                else:
+                    logger.warning("[ASR] Whisper transcription returned empty, trying fallback...")
+            except Exception as e:
+                logger.error(f"[ASR] Whisper transcription failed: {e}, trying fallback...")
+        else:
+            logger.warning("[ASR] Whisper not available, trying Gemini fallback...")
+
+        # Fallback to Gemini MCP if Whisper fails or is unavailable
         if MCP_AVAILABLE:
             try:
+                logger.info("[ASR] Attempting transcription with Gemini MCP (fallback)...")
                 loop = asyncio.get_event_loop()
                 if loop.is_running():
                     if nest_asyncio:
                         transcribed = nest_asyncio.run(transcribe_audio_gemini(audio_path))
                         if transcribed:
-                            logger.info(f"Transcribed via Gemini MCP: {transcribed[:50]}...")
+                            logger.info(f"[ASR] Transcribed via Gemini MCP (fallback): {transcribed[:50]}...")
+                            # Clean up temp file if we created it
+                            if isinstance(audio, tuple) and os.path.exists(audio_path):
+                                try:
+                                    os.unlink(audio_path)
+                                except:
+                                    pass
                             return transcribed
                     else:
-                        logger.error("nest_asyncio not available for nested async transcription")
+                        logger.error("[ASR] nest_asyncio not available for nested async transcription")
                 else:
                     transcribed = loop.run_until_complete(transcribe_audio_gemini(audio_path))
                     if transcribed:
-                        logger.info(f"Transcribed via Gemini MCP: {transcribed[:50]}...")
+                        logger.info(f"[ASR] Transcribed via Gemini MCP (fallback): {transcribed[:50]}...")
+                        # Clean up temp file if we created it
+                        if isinstance(audio, tuple) and os.path.exists(audio_path):
+                            try:
+                                os.unlink(audio_path)
+                            except:
+                                pass
                         return transcribed
             except Exception as e:
-                logger.error(f"Gemini MCP transcription error: {e}")
+                logger.error(f"[ASR] Gemini MCP transcription error: {e}")
 
-        logger.warning("Gemini MCP transcription not available")
+        # Clean up temp file if we created it
+        if isinstance(audio, tuple) and os.path.exists(audio_path):
+            try:
+                os.unlink(audio_path)
+            except:
+                pass
+
+        logger.warning("[ASR] All transcription methods failed")
         return ""
     except Exception as e:
-        logger.error(f"Transcription error: {e}")
+        logger.error(f"[ASR] Transcription error: {e}")
+        import traceback
+        logger.debug(f"[ASR] Full traceback: {traceback.format_exc()}")
        return ""
 
 async def generate_speech_mcp(text: str) -> str:
@@ -185,6 +300,7 @@ def _generate_speech_via_mcp(text: str):
         logger.warning(f"MCP TTS error (sync wrapper): {e}")
         return None
 
+@spaces.GPU(max_duration=60)
 def generate_speech(text: str):
     """Generate speech from text using local maya1 TTS model (with MCP fallback).
 
@@ -193,26 +309,34 @@ def generate_speech(text: str):
     or fails.
     """
     if not text or len(text.strip()) == 0:
+        logger.warning("[TTS] Empty text provided")
         return None
 
+    logger.info(f"[TTS] Generating speech for text: {text[:50]}...")
+
     if not TTS_AVAILABLE:
-        logger.error("TTS library not installed. Please install TTS to use voice generation.")
+        logger.error("[TTS] TTS library not installed. Please install TTS to use voice generation.")
         # As a last resort, try MCP-based TTS if available
         return _generate_speech_via_mcp(text)
 
     if config.global_tts_model is None:
+        logger.info("[TTS] TTS model not loaded, initializing...")
         initialize_tts_model()
 
     if config.global_tts_model is None:
-        logger.error("TTS model not available. Please check dependencies.")
+        logger.error("[TTS] TTS model not available. Please check dependencies.")
        return _generate_speech_via_mcp(text)
 
     try:
+        logger.info("[TTS] Running TTS generation...")
         wav = config.global_tts_model.tts(text)
         with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
             sf.write(tmp_file.name, wav, samplerate=22050)
+        logger.info(f"[TTS] ✅ Speech generated successfully: {tmp_file.name}")
         return tmp_file.name
     except Exception as e:
-        logger.error(f"TTS error (local maya1): {e}")
+        logger.error(f"[TTS] TTS error (local maya1): {e}")
+        import traceback
+        logger.debug(f"[TTS] Full traceback: {traceback.format_exc()}")
        return _generate_speech_via_mcp(text)
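The new `transcribe_audio` repeats its temp-file cleanup block four times. A context manager would centralize that bookkeeping; a minimal sketch under the assumption that the tuple branch owns the temp file it writes (the `temp_wav` name is illustrative, not from the repo):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_wav(data: bytes):
    """Write bytes to a temporary .wav file and delete it afterwards,
    even if the body (e.g. a transcription call) raises."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
    try:
        tmp.write(data)
        tmp.close()
        yield tmp.name
    finally:
        try:
            os.unlink(tmp.name)  # cleanup runs on success and on error
        except OSError:
            pass

# Usage: the file exists inside the block and is gone afterwards
with temp_wav(b"RIFF") as path:
    assert os.path.exists(path)
assert not os.path.exists(path)
```

With this, each early `return transcribed` inside the `with` block would release the file automatically, and the four duplicated `os.unlink` stanzas collapse into one `finally`.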
342