Y Phung Nguyen committed
Commit af9efda · 1 Parent(s): a1f9b00

Fix PDF upload, add Whisper ASR, and enhance model status display


- Fix PDF upload error by handling file objects without seek() method (NamedString compatibility)
- Replace Gemini transcription with Whisper large-v3-turbo from Hugging Face (openai/whisper-large-v3-turbo)
- Add comprehensive ASR debugging/logging with [ASR] prefixes
- Load Whisper model at startup alongside maya1 TTS and MedSwin
- Update model status textbox to display MedSwin, TTS (maya1), and ASR (Whisper) loading status
- Ensure all voice models (TTS and ASR) are GPU-decorated for ZeroGPU compatibility
- Update requirements.txt to use transformers/torchaudio instead of openai-whisper
- Update README with new Whisper ASR implementation details

Files changed (8):
  1. README.md +5 -3
  2. app.py +7 -3
  3. config.py +1 -0
  4. indexing.py +27 -3
  5. models.py +39 -0
  6. requirements.txt +3 -0
  7. ui.py +120 -29
  8. voice.py +135 -11
README.md CHANGED
@@ -70,9 +70,10 @@ tags:
 Sample [conversation](sample.md) between MAC-system and patient are recorded.
 
 ### 🎤 **Voice Features**
-- **Speech-to-Text**: Voice input transcription using Gemini MCP
+- **Speech-to-Text**: Voice input transcription using Whisper large-v3-turbo (Hugging Face) with Gemini MCP fallback
 - **Inline Mic Experience**: Built-in microphone widget with live recording timer that drops transcripts straight into the chat box
 - **Text-to-Speech**: Voice output generation using Maya1 TTS model (optional, fallback to MCP if unavailable) plus a one-click "Play Response" control for the latest answer
+- **Model Status Display**: Real-time status for MedSwin, TTS (maya1), and ASR (Whisper) models
 
 ### 🤝 **MAC Architecture (Multi-Agent Collaboration)**
@@ -124,7 +125,7 @@ Sample [conversation](sample.md) between MAC-system and patient are recorded.
 - **Architecture**: MAC (Multi-Agent Collaboration) - Gemini Supervisor + MedSwin Specialist
 - **Translation**: Gemini MCP (gemini-2.5-flash-lite)
 - **Document Parsing**: Gemini MCP (PDF, Word, TXT, MD, JSON, XML, CSV)
-- **Speech-to-Text**: Gemini MCP (gemini-2.5-flash-lite)
+- **Speech-to-Text**: openai/whisper-large-v3-turbo (Hugging Face, primary) with Gemini MCP fallback
 - **Supervisor Tasks**: Gemini MCP (gemini-2.5-flash) - query breakdown, search strategies, RAG brainstorming
 - **MedSwin Execution**: GPU-tagged tasks for efficient inference
 - **Text-to-Speech**: maya-research/maya1 (optional, with MCP fallback)
@@ -139,10 +140,11 @@ Sample [conversation](sample.md) between MAC-system and patient are recorded.
 See `requirements.txt` for full dependency list. Key dependencies:
 - **MCP Integration**: `mcp`, `nest-asyncio`, `google-genai` (for Gemini MCP server)
 - **Fallback Dependencies**: `requests`, `beautifulsoup4`, `ddgs` (used when MCP web search unavailable)
-- **Core ML**: `transformers`, `torch`, `accelerate`
+- **Core ML**: `transformers`, `torch`, `accelerate`, `torchaudio`
 - **RAG Framework**: `llama-index`, `llama_index.llms.huggingface`, `llama_index.embeddings.huggingface`
 - **Utilities**: `langdetect`, `gradio`, `spaces`, `soundfile`
 - **TTS**: Optional - `TTS` package (voice features work with MCP fallback if unavailable)
+- **ASR**: Whisper via `transformers` (openai/whisper-large-v3-turbo from Hugging Face)
 
 ### 🔌 MCP Configuration
app.py CHANGED
@@ -5,7 +5,7 @@ os.environ["TOKENIZERS_PARALLELISM"] = "false"
 from logger import logger
 from config import DEFAULT_MEDICAL_MODEL
 import config
-from models import initialize_medical_model, initialize_tts_model
+from models import initialize_medical_model, initialize_tts_model, initialize_whisper_model, WHISPER_AVAILABLE
 from client import MCP_AVAILABLE
 from ui import create_demo
 
@@ -17,8 +17,12 @@ if __name__ == "__main__":
     logger.info("App starting - models will be loaded on-demand when first needed")
     logger.info(f"Default medical model: {DEFAULT_MEDICAL_MODEL}")
 
-    # TTS model also uses GPU decorator, so skip preloading
-    logger.info("TTS model will be loaded on-demand if needed")
+    # TTS and ASR models also use GPU decorator, so skip preloading
+    logger.info("TTS and ASR models will be loaded on-demand if needed")
+    if WHISPER_AVAILABLE:
+        logger.info("Whisper ASR library (transformers) is available")
+    else:
+        logger.warning("Whisper ASR library not available - install with: pip install transformers torchaudio")
 
     # Check Gemini MCP availability
     if MCP_AVAILABLE:
config.py CHANGED
@@ -35,6 +35,7 @@ global_medical_models = {}
 global_medical_tokenizers = {}
 global_file_info = {}
 global_tts_model = None
+global_whisper_model = None
 global_embed_model = None
 
 # MCP client storage
indexing.py CHANGED
@@ -80,13 +80,37 @@ def extract_text_from_document(file):
     file_extension = os.path.splitext(file_name)[1].lower()
 
     if file_extension == '.txt':
-        text = file.read().decode('utf-8')
+        # Handle file objects that might not have seek() method
+        try:
+            if hasattr(file, 'seek'):
+                file.seek(0)
+            text = file.read().decode('utf-8')
+        except (AttributeError, TypeError):
+            # If file is a string path or NamedString, read it differently
+            if isinstance(file, str):
+                with open(file, 'r', encoding='utf-8') as f:
+                    text = f.read()
+            else:
+                # Try to get content directly
+                text = str(file) if hasattr(file, '__str__') else file.read() if hasattr(file, 'read') else ""
         return text, len(text.split()), None
 
     try:
         with tempfile.NamedTemporaryFile(delete=False, suffix=file_extension) as tmp_file:
-            file.seek(0)
-            tmp_file.write(file.read())
+            # Handle file objects that might not have seek() method
+            try:
+                if hasattr(file, 'seek'):
+                    file.seek(0)
+                file_content = file.read()
+            except (AttributeError, TypeError):
+                # If file is a string path, read it directly
+                if isinstance(file, str):
+                    with open(file, 'rb') as f:
+                        file_content = f.read()
+                else:
+                    # Try to get content directly without seek
+                    file_content = file.read() if hasattr(file, 'read') else bytes(file) if hasattr(file, '__bytes__') else b""
+            tmp_file.write(file_content)
             tmp_file_path = tmp_file.name
 
         if MCP_AVAILABLE:
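The seek()-compatibility branches above repeat the same normalization twice, once for text and once for binary reads. The pattern can be factored into a single helper; a minimal sketch under the same assumptions about Gradio upload inputs (the `read_upload_bytes` name is illustrative, not part of the repo):

```python
import io

def read_upload_bytes(file) -> bytes:
    """Return raw bytes from a path string, a seekable file object,
    or a NamedString-like object that only exposes its text."""
    if isinstance(file, str):
        # Gradio often passes a temp-file path rather than a file object
        with open(file, "rb") as f:
            return f.read()
    if hasattr(file, "read"):
        # Real file-like object; rewind if it supports seeking
        if hasattr(file, "seek"):
            file.seek(0)
        data = file.read()
        return data.encode("utf-8") if isinstance(data, str) else data
    # Last resort: stringify whatever we were handed
    return str(file).encode("utf-8")

# Usage: works for in-memory buffers as well as path strings
buf = io.BytesIO(b"hello pdf bytes")
assert read_upload_bytes(buf) == b"hello pdf bytes"
```

The `.txt` branch would then reduce to `read_upload_bytes(file).decode('utf-8')` and the temp-file branch to `tmp_file.write(read_upload_bytes(file))`.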
models.py CHANGED
@@ -1,4 +1,5 @@
 """Model initialization and management"""
+import os
 import torch
 import threading
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -15,6 +16,19 @@ except ImportError:
     TTS_AVAILABLE = False
     TTS = None
 
+try:
+    from transformers import WhisperProcessor, WhisperForConditionalGeneration
+    try:
+        import torchaudio
+    except ImportError:
+        torchaudio = None
+    WHISPER_AVAILABLE = True
+except ImportError:
+    WHISPER_AVAILABLE = False
+    WhisperProcessor = None
+    WhisperForConditionalGeneration = None
+    torchaudio = None
+
 # Model loading state tracking
 _model_loading_states = {}
 _model_loading_lock = threading.Lock()
@@ -88,6 +102,31 @@ def initialize_tts_model():
         config.global_tts_model = None
     return config.global_tts_model
 
+def initialize_whisper_model():
+    """Initialize Whisper model for speech-to-text (ASR) from Hugging Face"""
+    if not WHISPER_AVAILABLE:
+        logger.warning("Whisper transformers not installed. ASR features will be disabled.")
+        return None
+    if config.global_whisper_model is None:
+        try:
+            logger.info("Initializing Whisper model (openai/whisper-large-v3-turbo) from Hugging Face...")
+            model_id = "openai/whisper-large-v3-turbo"
+            processor = WhisperProcessor.from_pretrained(model_id, token=config.HF_TOKEN)
+            model = WhisperForConditionalGeneration.from_pretrained(
+                model_id,
+                device_map="auto",
+                torch_dtype=torch.float16,
+                token=config.HF_TOKEN
+            )
+            # Store both processor and model
+            config.global_whisper_model = {"processor": processor, "model": model}
+            logger.info(f"Whisper model ({model_id}) initialized successfully")
+        except Exception as e:
+            logger.warning(f"Whisper model initialization failed: {e}")
+            logger.warning("ASR features will be disabled. Install with: pip install transformers torchaudio")
+            config.global_whisper_model = None
+    return config.global_whisper_model
+
 def get_or_create_embed_model():
     """Reuse embedding model to avoid reloading weights each request"""
     if config.global_embed_model is None:
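`initialize_whisper_model` follows the same lazy-singleton shape as `initialize_tts_model`: check a module-level cache, load once, and return the cached instance on every later call. The pattern in isolation, with an illustrative factory in place of the expensive model download:

```python
_cache = {}

def get_model(name, factory):
    """Build the model on first request, then reuse the cached instance."""
    if name not in _cache:
        _cache[name] = factory()  # expensive load happens only once
    return _cache[name]

calls = []
def fake_loader():
    calls.append(1)  # track how many times we really "load"
    return {"model": "whisper", "processor": "proc"}

a = get_model("asr", fake_loader)
b = get_model("asr", fake_loader)
assert a is b and len(calls) == 1  # loaded exactly once, then reused
```

This is why the functions return `config.global_whisper_model` unconditionally at the end: a second call is a cheap dictionary lookup, not a reload.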
requirements.txt CHANGED
@@ -34,3 +34,6 @@ setuptools>=65.0.0
 # Option 2: pip install TTS (may fail on pyworld, but TTS will work for most models without it)
 # The app will run without TTS - voice generation will be disabled
 # TTS
+
+# ASR (Automatic Speech Recognition) - Whisper for speech-to-text (via Hugging Face transformers)
+torchaudio
ui.py CHANGED
@@ -3,6 +3,7 @@ import time
 import gradio as gr
 import spaces
 from config import TITLE, DESCRIPTION, CSS, MEDSWIN_MODELS, DEFAULT_MEDICAL_MODEL
+import config
 from indexing import create_or_update_index
 from pipeline import stream_chat
 from voice import transcribe_audio, generate_speech
@@ -12,7 +13,9 @@ from models import (
     get_model_loading_state,
     set_model_loading_state,
     initialize_tts_model,
+    initialize_whisper_model,
     TTS_AVAILABLE,
+    WHISPER_AVAILABLE,
 )
 from logger import logger
 
@@ -293,35 +296,89 @@ def create_demo():
     def load_model_and_update_status(model_name):
         """Load model and update status, return status text and whether model is ready"""
         try:
+            status_lines = []
+
+            # Medical model status
             if is_model_loaded(model_name):
-                return "✅ The model has been loaded successfully", True
+                status_lines.append(f"✅ MedSwin ({model_name}): loaded and ready")
+            else:
+                state = get_model_loading_state(model_name)
+                if state == "loading":
+                    status_lines.append(f"⏳ MedSwin ({model_name}): loading...")
+                elif state == "error":
+                    status_lines.append(f"❌ MedSwin ({model_name}): error loading")
+                else:
+                    # Use GPU-decorated function to load the model
+                    status_text, is_ready = load_model_with_gpu(model_name)
+                    if is_ready:
+                        status_lines.append(f"✅ MedSwin ({model_name}): loaded and ready")
+                    else:
+                        status_lines.append(f"⏳ MedSwin ({model_name}): loading...")
 
-            state = get_model_loading_state(model_name)
-            if state == "loading":
-                return "⏳ The model is being loaded, please wait...", False
-            elif state == "error":
-                return "❌ Error loading model. Please try again.", False
+            # TTS model status
+            if TTS_AVAILABLE:
+                if config.global_tts_model is not None:
+                    status_lines.append("✅ TTS (maya1): loaded and ready")
+                else:
+                    status_lines.append("⚠️ TTS (maya1): not loaded")
+            else:
+                status_lines.append("❌ TTS: library not available")
 
-            # Use GPU-decorated function to load the model
-            status_text, is_ready = load_model_with_gpu(model_name)
+            # ASR (Whisper) model status
+            if WHISPER_AVAILABLE:
+                if config.global_whisper_model is not None:
+                    status_lines.append("✅ ASR (Whisper large-v3-turbo): loaded and ready")
+                else:
+                    status_lines.append("⚠️ ASR (Whisper large-v3-turbo): not loaded")
+            else:
+                status_lines.append("❌ ASR: library not available")
+
+            status_text = "\n".join(status_lines)
+            is_ready = is_model_loaded(model_name)
             return status_text, is_ready
         except Exception as e:
             return f"❌ Error: {str(e)[:100]}", False
 
     def check_model_status(model_name):
         """Check current model status without loading"""
+        status_lines = []
+
+        # Medical model status
         if is_model_loaded(model_name):
-            return "✅ The model has been loaded successfully", True
-        state = get_model_loading_state(model_name)
-        if state == "loading":
-            return "⏳ The model is being loaded, please wait...", False
-        elif state == "error":
-            return "❌ Error loading model. Please try again.", False
+            status_lines.append(f"✅ MedSwin ({model_name}): loaded and ready")
+        else:
+            state = get_model_loading_state(model_name)
+            if state == "loading":
+                status_lines.append(f"⏳ MedSwin ({model_name}): loading...")
+            elif state == "error":
+                status_lines.append(f"❌ MedSwin ({model_name}): error loading")
+            else:
+                status_lines.append(f"⚠️ MedSwin ({model_name}): not loaded")
+
+        # TTS model status
+        if TTS_AVAILABLE:
+            if config.global_tts_model is not None:
+                status_lines.append("✅ TTS (maya1): loaded and ready")
+            else:
+                status_lines.append("⚠️ TTS (maya1): not loaded")
         else:
-            return "⚠️ Model not loaded. Click to load or it will load on first use.", False
+            status_lines.append("❌ TTS: library not available")
+
+        # ASR (Whisper) model status
+        if WHISPER_AVAILABLE:
+            if config.global_whisper_model is not None:
+                status_lines.append("✅ ASR (Whisper large-v3-turbo): loaded and ready")
+            else:
+                status_lines.append("⚠️ ASR (Whisper large-v3-turbo): not loaded")
+        else:
+            status_lines.append("❌ ASR: library not available")
+
+        status_text = "\n".join(status_lines)
+        is_ready = is_model_loaded(model_name)
+        return status_text, is_ready
 
     # GPU-decorated function to load model on startup
-    # @spaces.GPU(max_duration=120)
+    @spaces.GPU(max_duration=120)
     def load_default_model_on_startup():
         """Load default medical model on startup (GPU-decorated for ZeroGPU compatibility)"""
         try:
@@ -343,25 +400,47 @@ def create_demo():
             logger.error(f"Error in model loading startup: {e}")
             return f"⚠️ Startup loading error: {str(e)[:100]}"
 
-    # GPU-decorated function to load default TTS model on startup
-    # @spaces.GPU(max_duration=120)
-    def load_tts_model_on_startup():
-        """Load default TTS model (maya1) on startup"""
+    # GPU-decorated function to load default TTS and ASR models on startup
+    @spaces.GPU(max_duration=120)
+    def load_voice_models_on_startup():
+        """Load default TTS model (maya1) and ASR model (Whisper) on startup"""
         try:
-            if not TTS_AVAILABLE:
+            # Load TTS model
+            if TTS_AVAILABLE:
+                logger.info("Loading default TTS model (maya1) on startup...")
+                initialize_tts_model()
+                if config.global_tts_model is not None:
+                    logger.info("✅ Default TTS model (maya1) loaded successfully on startup!")
+                else:
+                    logger.warning("⚠️ TTS model failed to load on startup")
+            else:
                 logger.warning("TTS library not installed; skipping TTS preload.")
-                return
-            logger.info("Loading default TTS model (maya1) on startup...")
-            initialize_tts_model()
-            logger.info(" Default TTS model (maya1) loaded successfully on startup!")
+
+            # Load ASR (Whisper) model
+            if WHISPER_AVAILABLE:
+                logger.info("Loading default ASR model (Whisper large-v3-turbo) on startup...")
+                initialize_whisper_model()
+                if config.global_whisper_model is not None:
+                    logger.info("✅ Default ASR model (Whisper large-v3-turbo) loaded successfully on startup!")
+                else:
+                    logger.warning("⚠️ ASR model failed to load on startup")
+            else:
+                logger.warning("Whisper transformers not installed; skipping ASR preload.")
         except Exception as e:
-            logger.error(f"Error in TTS model loading startup: {e}")
+            logger.error(f"Error in voice models loading startup: {e}")
+            import traceback
+            logger.debug(f"Full traceback: {traceback.format_exc()}")
 
     # Initialize status on load
     def init_model_status():
        status_text, is_ready = check_model_status(DEFAULT_MEDICAL_MODEL)
        return status_text
 
+    # Update status when model selection changes
+    def update_model_status_on_change(model_name):
+        status_text, is_ready = check_model_status(model_name)
+        return status_text
+
     # Handle model selection change
     def on_model_change(model_name):
         status_text, is_ready = load_model_and_update_status(model_name)
@@ -372,23 +451,35 @@ def create_demo():
             gr.update(interactive=submit_enabled)
         )
 
+    # Update status display periodically or on model status changes
+    def refresh_model_status(model_name):
+        return update_model_status_on_change(model_name)
+
     medical_model.change(
         fn=on_model_change,
         inputs=[medical_model],
         outputs=[model_status, submit_button, message_input]
     )
 
-    # Load default medical model on startup (GPU-decorated function)
+    # Load models on startup - they will be loaded in separate GPU sessions
+    # First load medical model
     demo.load(
         fn=load_default_model_on_startup,
+        inputs=None,
         outputs=[model_status]
     )
-    # Load default TTS model (maya1) on startup (GPU-decorated function)
+    # Then load voice models (TTS and ASR)
     demo.load(
-        fn=load_tts_model_on_startup,
+        fn=load_voice_models_on_startup,
         inputs=None,
         outputs=None
    )
+    # Finally update status to show all models
+    demo.load(
+        fn=lambda: check_model_status(DEFAULT_MEDICAL_MODEL)[0],
+        inputs=None,
+        outputs=[model_status]
+    )
 
     # Wrap stream_chat to check model status before execution
     def stream_chat_with_model_check(
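The two status functions above build the same per-model lines twice. The mapping from model state to a status line is simple enough to sketch as a standalone helper (hypothetical names, not from the repo; the model label is illustrative):

```python
def build_status(entries):
    """entries: list of (label, state) where state is one of 'ready',
    'loading', 'not_loaded', 'unavailable'. Returns multi-line status text."""
    icons = {
        "ready": "✅",        # model loaded and usable
        "loading": "⏳",      # load in progress
        "not_loaded": "⚠️",   # library present, model not yet loaded
        "unavailable": "❌",  # library missing entirely
    }
    return "\n".join(f"{icons[state]} {label}" for label, state in entries)

text = build_status([
    ("MedSwin (example-model): loaded and ready", "ready"),
    ("TTS (maya1): not loaded", "not_loaded"),
    ("ASR (Whisper large-v3-turbo): loaded and ready", "ready"),
])
assert text.count("\n") == 2 and text.startswith("✅")
```

Both `load_model_and_update_status` and `check_model_status` could then share one table of (label, state) pairs instead of duplicating the branch logic.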
voice.py CHANGED
@@ -3,10 +3,12 @@ import os
 import asyncio
 import tempfile
 import soundfile as sf
+import torch
 from logger import logger
 from client import MCP_AVAILABLE, call_agent, get_mcp_session, get_cached_mcp_tools
 import config
-from models import TTS_AVAILABLE, initialize_tts_model
+from models import TTS_AVAILABLE, WHISPER_AVAILABLE, initialize_tts_model, initialize_whisper_model
+import spaces
 
 try:
     import nest_asyncio
@@ -69,45 +71,158 @@ async def transcribe_audio_gemini(audio_path: str) -> str:
         logger.error(f"Gemini transcription error: {e}")
         return ""
 
+@spaces.GPU(max_duration=60)
+def transcribe_audio_whisper(audio_path: str) -> str:
+    """Transcribe audio using Whisper model from Hugging Face"""
+    if not WHISPER_AVAILABLE:
+        logger.warning("[ASR] Whisper not available for transcription")
+        return ""
+
+    try:
+        logger.info(f"[ASR] Starting Whisper transcription for: {audio_path}")
+        if config.global_whisper_model is None:
+            logger.info("[ASR] Whisper model not loaded, initializing...")
+            initialize_whisper_model()
+
+        if config.global_whisper_model is None:
+            logger.error("[ASR] Failed to initialize Whisper model")
+            return ""
+
+        # Extract processor and model from stored dict
+        processor = config.global_whisper_model["processor"]
+        model = config.global_whisper_model["model"]
+
+        logger.info("[ASR] Loading audio file...")
+        # Load audio using torchaudio (imported from models)
+        from models import torchaudio
+        if torchaudio is None:
+            logger.error("[ASR] torchaudio not available")
+            return ""
+
+        waveform, sample_rate = torchaudio.load(audio_path)
+        # Resample to 16kHz if needed (Whisper expects 16kHz)
+        if sample_rate != 16000:
+            resampler = torchaudio.transforms.Resample(sample_rate, 16000)
+            waveform = resampler(waveform)
+            sample_rate = 16000
+
+        logger.info("[ASR] Processing audio with Whisper...")
+        # Process audio
+        inputs = processor(waveform.squeeze().numpy(), sampling_rate=sample_rate, return_tensors="pt")
+
+        # Move inputs to same device as model
+        device = next(model.parameters()).device
+        inputs = {k: v.to(device) for k, v in inputs.items()}
+
+        logger.info("[ASR] Running Whisper transcription...")
+        # Generate transcription
+        with torch.no_grad():
+            generated_ids = model.generate(**inputs)
+
+        # Decode transcription
+        transcribed_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
+
+        if transcribed_text:
+            logger.info(f"[ASR] ✅ Transcription successful: {transcribed_text[:100]}...")
+            logger.info(f"[ASR] Transcription length: {len(transcribed_text)} characters")
+        else:
+            logger.warning("[ASR] Whisper returned empty transcription")
+
+        return transcribed_text
+    except Exception as e:
+        logger.error(f"[ASR] Whisper transcription error: {e}")
+        import traceback
+        logger.debug(f"[ASR] Full traceback: {traceback.format_exc()}")
+        return ""
+
 def transcribe_audio(audio):
-    """Transcribe audio to text using Gemini MCP"""
+    """Transcribe audio to text using Whisper (primary) or Gemini MCP (fallback)"""
     if audio is None:
+        logger.warning("[ASR] No audio provided")
         return ""
 
     try:
+        # Convert audio input to file path
         if isinstance(audio, str):
             audio_path = audio
         elif isinstance(audio, tuple):
             sample_rate, audio_data = audio
+            logger.info(f"[ASR] Processing audio tuple: sample_rate={sample_rate}, data_shape={audio_data.shape if hasattr(audio_data, 'shape') else 'unknown'}")
             with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
                 sf.write(tmp_file.name, audio_data, samplerate=sample_rate)
                 audio_path = tmp_file.name
+            logger.info(f"[ASR] Created temporary audio file: {audio_path}")
         else:
             audio_path = audio
 
+        logger.info("[ASR] Attempting transcription with Whisper (primary method)...")
+
+        # Try Whisper first (primary method)
+        if WHISPER_AVAILABLE:
+            try:
+                transcribed = transcribe_audio_whisper(audio_path)
+                if transcribed:
+                    logger.info(f"[ASR] ✅ Successfully transcribed via Whisper: {transcribed[:50]}...")
+                    # Clean up temp file if we created it
+                    if isinstance(audio, tuple) and os.path.exists(audio_path):
+                        try:
+                            os.unlink(audio_path)
+                        except:
+                            pass
+                    return transcribed
+                else:
+                    logger.warning("[ASR] Whisper transcription returned empty, trying fallback...")
+            except Exception as e:
+                logger.error(f"[ASR] Whisper transcription failed: {e}, trying fallback...")
+        else:
+            logger.warning("[ASR] Whisper not available, trying Gemini fallback...")
+
+        # Fallback to Gemini MCP if Whisper fails or is unavailable
         if MCP_AVAILABLE:
             try:
+                logger.info("[ASR] Attempting transcription with Gemini MCP (fallback)...")
                 loop = asyncio.get_event_loop()
                 if loop.is_running():
                     if nest_asyncio:
                         transcribed = nest_asyncio.run(transcribe_audio_gemini(audio_path))
                         if transcribed:
-                            logger.info(f"Transcribed via Gemini MCP: {transcribed[:50]}...")
+                            logger.info(f"[ASR] Transcribed via Gemini MCP (fallback): {transcribed[:50]}...")
+                            # Clean up temp file if we created it
+                            if isinstance(audio, tuple) and os.path.exists(audio_path):
+                                try:
+                                    os.unlink(audio_path)
+                                except:
+                                    pass
                             return transcribed
                     else:
-                        logger.error("nest_asyncio not available for nested async transcription")
+                        logger.error("[ASR] nest_asyncio not available for nested async transcription")
                 else:
                     transcribed = loop.run_until_complete(transcribe_audio_gemini(audio_path))
                     if transcribed:
-                        logger.info(f"Transcribed via Gemini MCP: {transcribed[:50]}...")
+                        logger.info(f"[ASR] Transcribed via Gemini MCP (fallback): {transcribed[:50]}...")
+                        # Clean up temp file if we created it
+                        if isinstance(audio, tuple) and os.path.exists(audio_path):
+                            try:
+                                os.unlink(audio_path)
+                            except:
+                                pass
                         return transcribed
             except Exception as e:
-                logger.error(f"Gemini MCP transcription error: {e}")
+                logger.error(f"[ASR] Gemini MCP transcription error: {e}")
 
-        logger.warning("Gemini MCP transcription not available")
+        # Clean up temp file if we created it
+        if isinstance(audio, tuple) and os.path.exists(audio_path):
+            try:
+                os.unlink(audio_path)
+            except:
+                pass
+
+        logger.warning("[ASR] All transcription methods failed")
         return ""
     except Exception as e:
-        logger.error(f"Transcription error: {e}")
+        logger.error(f"[ASR] Transcription error: {e}")
+        import traceback
+        logger.debug(f"[ASR] Full traceback: {traceback.format_exc()}")
        return ""
 
 async def generate_speech_mcp(text: str) -> str:
@@ -185,6 +300,7 @@ def _generate_speech_via_mcp(text: str):
         logger.warning(f"MCP TTS error (sync wrapper): {e}")
         return None
 
+@spaces.GPU(max_duration=60)
 def generate_speech(text: str):
     """Generate speech from text using local maya1 TTS model (with MCP fallback).
 
@@ -193,26 +309,34 @@ def generate_speech(text: str):
     or fails.
     """
     if not text or len(text.strip()) == 0:
+        logger.warning("[TTS] Empty text provided")
         return None
 
+    logger.info(f"[TTS] Generating speech for text: {text[:50]}...")
+
     if not TTS_AVAILABLE:
-        logger.error("TTS library not installed. Please install TTS to use voice generation.")
+        logger.error("[TTS] TTS library not installed. Please install TTS to use voice generation.")
         # As a last resort, try MCP-based TTS if available
         return _generate_speech_via_mcp(text)
 
     if config.global_tts_model is None:
+        logger.info("[TTS] TTS model not loaded, initializing...")
         initialize_tts_model()
 
     if config.global_tts_model is None:
-        logger.error("TTS model not available. Please check dependencies.")
+        logger.error("[TTS] TTS model not available. Please check dependencies.")
        return _generate_speech_via_mcp(text)
 
     try:
+        logger.info("[TTS] Running TTS generation...")
         wav = config.global_tts_model.tts(text)
         with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
             sf.write(tmp_file.name, wav, samplerate=22050)
+        logger.info(f"[TTS] ✅ Speech generated successfully: {tmp_file.name}")
         return tmp_file.name
     except Exception as e:
-        logger.error(f"TTS error (local maya1): {e}")
+        logger.error(f"[TTS] TTS error (local maya1): {e}")
+        import traceback
+        logger.debug(f"[TTS] Full traceback: {traceback.format_exc()}")
        return _generate_speech_via_mcp(text)
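The new `transcribe_audio` repeats its temp-file cleanup block four times. A context manager would centralize that bookkeeping; a minimal sketch under the assumption that the tuple branch owns the temp file it writes (the `temp_wav` name is illustrative, not from the repo):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_wav(data: bytes):
    """Write bytes to a temporary .wav file and delete it afterwards,
    even if the body (e.g. a transcription call) raises."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
    try:
        tmp.write(data)
        tmp.close()
        yield tmp.name
    finally:
        try:
            os.unlink(tmp.name)  # cleanup runs on success and on error
        except OSError:
            pass

# Usage: the file exists inside the block and is gone afterwards
with temp_wav(b"RIFF") as path:
    assert os.path.exists(path)
assert not os.path.exists(path)
```

With this, each early `return transcribed` inside the `with` block would release the file automatically, and the four duplicated `os.unlink` stanzas collapse into one `finally`.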
342