
Add library name, pipeline tag, link to paper

#1
by nielsr (HF Staff) - opened

Files changed (1)
README.md (+11 -5)
README.md CHANGED

@@ -1,13 +1,15 @@
 ---
-license: apache-2.0
+base_model:
+- Qwen/Qwen2-VL-7B
 datasets:
 - chenjoya/Live-CC-5M
 - chenjoya/Live-WhisperX-526K
 - lmms-lab/LLaVA-Video-178K
 language:
 - en
-base_model:
-- Qwen/Qwen2-VL-7B
+license: apache-2.0
+pipeline_tag: video-text-to-text
+library_name: transformers
 tags:
 - qwen_vl
 - video
@@ -15,6 +17,7 @@ tags:
 - multimodal
 - LLM
 ---
+
 # LiveCC-7B-Instruct
 
 ## Introduction
@@ -22,6 +25,7 @@ tags:
 We introduce LiveCC, the first video LLM capable of real-time commentary, trained with a novel video-ASR streaming method, SOTA on both streaming and offline benchmarks.
 
 - Project Page: https://showlab.github.io/livecc
+- Paper: https://arxiv.org/abs/2504.16030
 
 > [!Important]
 > This is the SFT model. The base model is at [LiveCC-7B-Base](https://huggingface.co/chenjoya/LiveCC-7B-Base).
@@ -154,7 +158,8 @@ class LiveCCDemoInfer:
         texts = self.processor.apply_chat_template([message], tokenize=False, add_generation_prompt=True, return_tensors='pt')
         past_ids = state.get('past_ids', None)
         if past_ids is not None:
-            texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
+            texts = '<|im_end|>
+' + texts[self.system_prompt_offset:]
         inputs = self.processor(
             text=texts,
             images=None,
@@ -276,7 +281,8 @@ class LiveCCDemoInfer:
         image_inputs, video_inputs = process_vision_info(conversation)
         texts = self.processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True, return_tensors='pt')
         if past_ids is not None:
-            texts = '<|im_end|>\n' + texts[self.system_prompt_offset:]
+            texts = '<|im_end|>
+' + texts[self.system_prompt_offset:]
         inputs = self.processor(
             text=texts,
             images=image_inputs,
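
For context on the metadata being added: `library_name: transformers` is what makes the Hub render a transformers "Use this model" snippet, and `pipeline_tag: video-text-to-text` files the model under that task. A minimal loading sketch consistent with those tags, assuming the standard Qwen2-VL classes implied by the repo's `qwen2_vl` tag (dtype and device settings are illustrative, not from the model card):

```python
# Minimal sketch, assuming the qwen2_vl architecture tagged on this repo;
# dtype/device choices are illustrative defaults, not from the model card.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "chenjoya/LiveCC-7B-Instruct",
    torch_dtype="auto",   # let transformers pick the checkpoint's dtype
    device_map="auto",    # needs accelerate; places weights automatically
)
processor = AutoProcessor.from_pretrained("chenjoya/LiveCC-7B-Instruct")
```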
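The two code hunks touch the same streaming trick in the demo class: once `past_ids` caches the running sequence, each newly templated prompt is sliced past the system prompt and spliced onto `'<|im_end|>\n'`, closing the previous assistant turn instead of re-sending the whole prefix. A toy sketch of that string surgery (the template strings and the offset computation are illustrative assumptions, not the model card's actual values):

```python
# Toy sketch of the incremental-prompt splice used above; the chat strings
# and the way system_prompt_offset is derived are illustrative assumptions.
first = ("<|im_start|>system\nYou are LiveCC.<|im_end|>\n"
         "<|im_start|>user\nframe 1<|im_end|>\n<|im_start|>assistant\n")
# Offset of the first non-system token in the templated text.
system_prompt_offset = first.index("<|im_start|>user")

nxt = ("<|im_start|>system\nYou are LiveCC.<|im_end|>\n"
       "<|im_start|>user\nframe 2<|im_end|>\n<|im_start|>assistant\n")
# Close the cached assistant turn, then continue without the system prompt.
continuation = "<|im_end|>\n" + nxt[system_prompt_offset:]
print(continuation)
```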