Generate speech using reference audio and text
Real-time video captioning powered by FastVLM
Display a static web page