Upload 3 files

Files changed (4)

.gitattributes +1 -0
README.md +328 -1
README_CN.md +324 -0
main_results.png +3 -0

.gitattributes CHANGED

@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+main_results.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -243,4 +243,331 @@ model-index:
             value: 92.2
             verified: false
----
+---
+# MedGo: Medical Large Language Model Based on Qwen2.5-32B
+<div align="center">
+[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow)](https://huggingface.co/OpenMedZoo/MedGo)
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
+[![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/)
+English | [简体中文](./README_CN.md)
+</div>
+## 📋 Table of Contents
+- [Introduction](#introduction)
+- [Key Features](#key-features)
+- [Performance](#performance)
+- [Quick Start](#quick-start)
+- [Training Details](#training-details)
+- [Use Cases](#use-cases)
+- [Limitations & Risks](#limitations--risks)
+- [Citation](#citation)
+- [License](#license)
+- [Contributing](#contributing)
+- [Contact](#contact)
+## 🎯 Introduction
+**MedGo** is a general-purpose medical large language model fine-tuned from **Qwen2.5-32B** for clinical medicine and research scenarios. It is trained on large-scale, multi-source medical corpora augmented with complex case data, and supports medical Q&A, medical record summarization, clinical reasoning, multi-turn dialogue, and scientific text generation.
+### 🌟 Core Capabilities
+- **📚 Medical Knowledge Q&A**: Professional responses based on authoritative medical literature and clinical guidelines
+- **📝 Clinical Documentation**: Automated medical record summaries, diagnostic reports, and medical documentation
+- **🔍 Clinical Reasoning**: Differential diagnosis, examination recommendations, and treatment suggestions
+- **💬 Multi-turn Dialogue**: Patient-doctor interaction simulation and complex case discussions
+- **🔬 Research Support**: Literature summarization, research idea generation, and quality control review
+## ✨ Key Features
+| Feature | Details |
+|---------|---------|
+| **Base Architecture** | Qwen2.5-32B |
+| **Parameters** | 32B |
+| **Domain** | Clinical Medicine, Research Support, Healthcare System Integration |
+| **Fine-tuning Method** | SFT + Preference Alignment (DPO/KTO) |
+| **Data Sources** | Authoritative medical literature, clinical guidelines, real cases (anonymized) |
+| **Deployment** | Local deployment, HIS/EMR system integration |
+| **License** | Apache 2.0 |
+## 📊 Performance
+MedGo is evaluated on the medical and general benchmarks listed below, where it is competitive with other models in the 30B-parameter class:
+### Key Benchmark Results
+- **AIMedQA**: Medical question answering comprehension
+- **CME**: Clinical reasoning evaluation
+- **DiagnosisArena**: Diagnostic capability assessment
+- **MedQA / MedMCQA**: Medical multiple-choice questions
+- **PubMedQA**: Biomedical literature Q&A
+- **MMLU-Pro**: Comprehensive capability evaluation
+![Performance Comparison](./main_results.png)
+**Performance Highlights**:
+- ✅ **Average Score**: ~70 points (excellent performance in the 30B parameter class)
+- ✅ **Strong Tasks**: Clinical reasoning (DiagnosisArena, CME) and multi-turn medical Q&A
+- ✅ **Balanced Capability**: Good performance in medical semantic understanding and multi-task generalization
+## 🚀 Quick Start
+### Requirements
+- Python >= 3.8
+- PyTorch >= 2.0
+- Transformers >= 4.37.0
+- CUDA >= 11.8 (for GPU inference)
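Before loading a 32B model, it can save time to confirm that the installed packages meet the minimums listed above. The following is a minimal sketch that uses only the standard library; the helper names (`parse_version`, `meets_minimum`, `check`) are illustrative and not part of the MedGo repository, and the comparison only looks at the numeric release segment of a version string.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '2.3.1+cu121' into (2, 3, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, required):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(package, required):
    """Return a one-line status string for an installed distribution."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed (need >= {required})"
    status = "ok" if meets_minimum(installed, required) else "too old"
    return f"{package} {installed}: {status} (need >= {required})"


if __name__ == "__main__":
    # PyTorch >= 2.0 comes from the list above; call check() the same way
    # for the other packages with whichever minimum the list states.
    print(check("torch", "2.0"))
```

Running the script prints one status line per package, so it can be dropped into a setup step without extra dependencies.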
+### Installation
+```bash
+# Clone the repository
+git clone https://github.com/OpenMedZoo/MedGo.git
+cd MedGo
+# Install dependencies
+pip install -r requirements.txt
+```
+### Model Download
+Download model weights from HuggingFace:
+```bash
+# Using huggingface-cli
+huggingface-cli download OpenMedZoo/MedGo --local-dir ./models/MedGo
+# Or using git-lfs
+git lfs install
+git clone https://huggingface.co/OpenMedZoo/MedGo
+```
+### Basic Inference
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load model and tokenizer
+model_path = "OpenMedZoo/MedGo"
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    device_map="auto",
+    trust_remote_code=True,
+    torch_dtype="auto"
+)
+# Medical Q&A example
+messages = [
+    {"role": "system", "content": "You are a professional medical assistant. Please answer questions based on medical knowledge."},
+    {"role": "user", "content": "What is hypertension and what are the common treatment methods?"}
+]
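+# Build the prompt text with the model's chat template
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+# Tokenize separately so generate() also receives the attention mask
+inputs = tokenizer([text], return_tensors="pt").to(model.device)
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=512,
+    temperature=0.7,
+    top_p=0.9,
+    do_sample=True
+)
+# Decode only the newly generated tokens, skipping the prompt
+prompt_length = inputs["input_ids"].shape[1]
+response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)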
+print(response)
+```
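The example above sends a single question. For the multi-turn dialogue use case, the same `messages` list can carry the conversation history. Below is a minimal sketch of history bookkeeping, assuming the role/content message schema shown above; the helper names (`add_turn`, `trim_history`) and the "keep the last N messages" trimming policy are illustrative assumptions, not something the MedGo repository prescribes.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def add_turn(history, role, content):
    """Return a new message list with one more turn appended."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unexpected role: {role}")
    return history + [{"role": role, "content": content}]


def trim_history(history, max_messages):
    """Keep the first system message plus the last `max_messages` other messages."""
    system = [m for m in history if m["role"] == "system"][:1]
    rest = [m for m in history if m["role"] != "system"]
    kept = rest[-max_messages:] if max_messages > 0 else []
    return system + kept


if __name__ == "__main__":
    history = [{"role": "system", "content": "You are a professional medical assistant."}]
    history = add_turn(history, "user", "What is hypertension?")
    # In real use, the assistant content would be the decoded model response.
    history = add_turn(history, "assistant", "<model response goes here>")
    history = add_turn(history, "user", "Which lifestyle changes help?")
    for message in trim_history(history, max_messages=2):
        print(message["role"], "->", message["content"])
```

The trimmed list can be passed to `tokenizer.apply_chat_template` in place of `messages`. Trimming by message count is only a rough proxy for the context budget; counting tokens would be more precise.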
+### Batch Inference
+```bash
+# Use the provided inference script
+python scripts/inference.py \
+    --model_path OpenMedZoo/MedGo \
+    --input_file examples/medical_qa.jsonl \
+    --output_file results/predictions.jsonl \
+    --batch_size 4
+```
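The README does not document the record format that `scripts/inference.py` expects, so check the script before preparing data. As a sketch only, the snippet below writes and reads a JSON Lines file with one question per line; the `id` and `question` field names are hypothetical placeholders, and the helper names are illustrative.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Hypothetical schema: adjust the field names to what the script reads.
    records = [
        {"id": "q1", "question": "What is hypertension?"},
        {"id": "q2", "question": "What are the symptoms of diabetes?"},
    ]
    with tempfile.TemporaryDirectory() as workdir:
        path = write_jsonl(Path(workdir) / "examples" / "medical_qa.jsonl", records)
        print(read_jsonl(path) == records)
```

Writing with `ensure_ascii=False` keeps Chinese questions human-readable in the file, which matters for a model aimed primarily at Chinese medical scenarios.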
+### Accelerated Inference with vLLM
+```python
+from vllm import LLM, SamplingParams
+# Initialize vLLM
+llm = LLM(model="OpenMedZoo/MedGo", trust_remote_code=True)
+sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
+# Batch inference
+prompts = [
+    "What are the symptoms and treatment methods for diabetes?",
+    "What dietary precautions should hypertensive patients take?"
+]
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    print(output.outputs[0].text)
+```
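Note that `llm.generate` receives the prompt strings as-is and does not apply a chat template, whereas the Transformers example formats its input with `apply_chat_template`. The sketch below renders messages into the ChatML layout used by Qwen2.5, on the assumption that MedGo keeps its base model's template unchanged. The `render_chatml` helper is illustrative; the authoritative route is `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which also handles details this sketch omits, such as a default system message.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content messages into ChatML text (assumed Qwen2.5 layout)."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    questions = [
        "What are the symptoms and treatment methods for diabetes?",
        "What dietary precautions should hypertensive patients take?",
    ]
    system = {"role": "system", "content": "You are a professional medical assistant."}
    prompts = [
        render_chatml([system, {"role": "user", "content": question}])
        for question in questions
    ]
    print(prompts[0])
```

The resulting `prompts` list can be passed to `llm.generate` in place of the raw question strings.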
+## 🔧 Training Details
+MedGo employs a **two-stage fine-tuning strategy** to balance general medical knowledge with clinical task adaptation.
+### Stage I: General Medical Alignment
+**Objective**: Establish a solid foundation of medical knowledge and improve Q&A standardization
+- **Data Sources**:
+  - Authoritative medical literature (PubMed, medical textbooks)
+  - Clinical guidelines and diagnostic standards
+  - Medical encyclopedia entries and terminology databases
+- **Training Methods**:
+  - Supervised Fine-Tuning (SFT)
+  - Chain-of-Thought (CoT) guided samples
+  - Medical terminology alignment and safety constraints
+### Stage II: Clinical Task Enhancement
+**Objective**: Enhance complex case reasoning and multi-task processing capabilities
+- **Data Sources**:
+  - Real medical records (fully anonymized)
+  - Outpatient and emergency records with complex multi-diagnosis samples
+  - Research articles and quality control cases
+- **Data Augmentation Techniques**:
+  - Semantic paraphrasing and multi-perspective expansion
+  - Complex case synthesis
+  - Doctor-patient interaction simulation
+- **Training Methods**:
+  - Multi-Task Learning (medical record summary, differential diagnosis, examination suggestions, etc.)
+  - Preference Alignment (DPO/KTO)
+  - Expert feedback iterative optimization
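To make the preference alignment step more concrete, here is a minimal sketch of the standard DPO objective for a single preference pair, computed from sequence log-probabilities. It illustrates the published DPO loss only; it is not MedGo's training code, and the `beta` value and the log-probabilities in the usage example are made-up numbers.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow in exp().
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy favors the chosen answer more than the reference does: lower loss.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss falls as the policy raises the likelihood of the preferred answer relative to the reference and lowers that of the rejected one. KTO, the other method named above, works from unpaired desirable and undesirable examples and is not shown here.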
+### Training Optimization Focus
+- ✅ Strengthen information extraction and cross-evidence reasoning for complex cases
+- ✅ Improve medical consistency and interpretability of outputs
+- ✅ Improve the compliance and safety of generated wording
+- ✅ Continuous iteration through expert samples and automated evaluation
+## 💡 Use Cases
+### ✅ Suitable Scenarios
+| Scenario | Description |
+|----------|-------------|
+| **Clinical Assistance** | Preliminary diagnosis suggestions, medical record writing, formatted report generation |
+| **Research Support** | Literature summarization, research idea generation, data analysis assistance |
+| **Quality Control** | Medical document compliance checking, clinical process quality control |
+| **System Integration** | Embedded in HIS/EMR systems to provide intelligent decision support |
+| **Medical Education** | Case discussions, medical knowledge Q&A, clinical reasoning training |
+### 🚫 Unsuitable Scenarios
+- ❌ **Cannot Replace Doctors**: Only an auxiliary tool, not a standalone diagnostic basis
+- ❌ **High-Risk Operations**: Not recommended for surgical decisions or other high-risk medical operations
+- ❌ **Rare Disease Limitations**: May perform poorly on rare diseases outside training data
+- ❌ **Emergency Care**: Not suitable for scenarios requiring immediate decisions
+## ⚠️ Limitations & Risks
+### Model Limitations
+1. **Comprehension Errors**: Although the model covers extensive medical knowledge, it may still misinterpret inputs or give incorrect recommendations
+2. **Complex Cases**: Risk of error is higher for cases with complex conditions, severe complications, or missing information
+3. **Knowledge Currency**: Medical knowledge evolves continuously, so the training data may lag behind current practice
+4. **Language Limitation**: Primarily designed for Chinese medical scenarios; performance in other languages may vary
+### Usage Recommendations
+- ⚠️ Use in controlled environments with clinical expert review of generated results
+- ⚠️ Treat model outputs as auxiliary references, not final diagnostic conclusions
+- ⚠️ For sensitive cases or high-risk scenarios, expert consultation is mandatory
+- ⚠️ Deployment requires internal validation, security review, and clinical testing
+### Data Privacy & Compliance
+- 🔒 Training data is fully anonymized
+- 🔒 Protect patient privacy when using the model
+- 🔒 Production deployment must comply with healthcare data security regulations (e.g., HIPAA, GDPR)
+- 🔒 Local deployment recommended to avoid sensitive data transmission
+## 📚 Citation
+If MedGo is helpful for your research or project, please cite our work:
+```bibtex
+@misc{openmedzoo_2025,
+	author       = { OpenMedZoo },
+	title        = { MedGo (Revision 640a2e2) },
+	year         = 2025,
+	url          = { https://huggingface.co/OpenMedZoo/MedGo },
+	doi          = { 10.57967/hf/7024 },
+	publisher    = { Hugging Face }
+}
+```
+## 📄 License
+This project is licensed under the [Apache License 2.0](LICENSE).
+**Commercial Use Notice**:
+- ✅ Commercial use and modification allowed
+- ✅ Original license and copyright notice must be retained
+- ✅ Contact us for technical support when integrating into healthcare systems
+## 🤝 Contributing
+We welcome community contributions! Here's how to participate:
+### Contribution Types
+- 🐛 Submit bug reports
+- 💡 Propose new features
+- 📝 Improve documentation
+- 🔧 Submit code fixes or optimizations
+- 📊 Share evaluation results and use cases
+## 🙏 Acknowledgments
+Thanks to all contributors to the MedGo project:
+- Model development and fine-tuning algorithm team
+- Data annotation and quality control team
+- Clinical expert guidance and review team
+- Open-source community support and feedback
+Special thanks to:
+- [Qwen Team](https://github.com/QwenLM/Qwen) for providing excellent foundation models
+- All healthcare institutions that provided data and feedback
+## 📧 Contact
+- **HuggingFace**: [Model Homepage](https://huggingface.co/OpenMedZoo/MedGo)
+---
+<div align="center">
+[⬆ Back to Top](#medgo-medical-large-language-model-based-on-qwen25-32b)
+</div>

README_CN.md ADDED

	@@ -0,0 +1,324 @@

+# MedGo: 基于 Qwen2.5-32B 的医疗大模型
+<div align="center">
+[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-yellow)](https://huggingface.co/OpenMedZoo/MedGo)
+[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
+[![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/)
+[English](./README.md) | 简体中文
+</div>
+## 📋 目录
+- [简介](#简介)
+- [模型特点](#模型特点)
+- [性能评估](#性能评估)
+- [快速开始](#快速开始)
+- [训练细节](#训练细节)
+- [使用场景](#使用场景)
+- [限制与风险](#限制与风险)
+- [引用](#引用)
+- [许可证](#许可证)
+- [贡献](#贡献)
+- [联系方式](#联系方式)
+## 🎯 简介
+**MedGo** 是一个基于 **Qwen2.5-32B** 微调的通用医疗大语言模型，专为临床医学与科研场景设计。模型通过大规模多源医学语料和复杂病例数据增强进行训练，支持医学问答、病历摘要、临床推理、多轮对话和科研文本生成等多任务能力。
+### 🌟 核心能力
+- **📚 医学知识问答**: 基于权威医学文献和临床指南的专业问答
+- **📝 病历文书生成**: 自动化病历摘要、诊断报告和医疗文书
+- **🔍 临床推理**: 鉴别诊断、检查建议和治疗方案推荐
+- **💬 多轮对话**: 医患交互模拟和复杂病例讨论
+- **🔬 科研辅助**: 文献摘要、研究思路生成和质控审查
+## ✨ 模型特点
+| 特性 | 详情 |
+|------|------|
+| **基础架构** | Qwen2.5-32B |
+| **参数规模** | 32B |
+| **应用领域** | 临床医学、科研辅助、医疗系统集成 |
+| **微调方法** | SFT + Preference Alignment (DPO/KTO) |
+| **数据来源** | 权威医学文献、临床指南、真实病例（脱敏） |
+| **部署方式** | 本地部署、HIS/EMR 系统集成 |
+| **开源许可** | Apache 2.0 |
+## 📊 性能评估
+MedGo 在多项医学与综合评测基准上表现优异，在 30B 参数级别模型中具有竞争力：
+### 主要基准测试结果
+- **AIMedQA**: 医学问答理解
+- **CME**: 临床推理评估
+- **DiagnosisArena**: 诊断能力测试
+- **MedQA / MedMCQA**: 医学选择题
+- **PubMedQA**: 生物医学文献问答
+- **MMLU-Pro**: 综合能力评估
+![Performance Comparison](./main_results.png)
+**性能亮点**：
+- ✅ **平均得分**: 约 70 分（30B 级别模型中表现优异）
+- ✅ **优势任务**: 临床推理（DiagnosisArena、CME）和多轮医学问答
+- ✅ **平衡能力**: 在医疗语义理解和多任务泛化上表现良好
+## 🚀 快速开始
+### 环境要求
+- Python >= 3.8
+- PyTorch >= 2.0
+- Transformers >= 4.37.0
+- CUDA >= 11.8 (GPU 推理)
+### 安装
+```bash
+# 克隆仓库
+git clone https://github.com/OpenMedZoo/MedGo.git
+cd MedGo
+# 安装依赖
+pip install -r requirements.txt
+```
+### 模型下载
+从 HuggingFace 下载模型权重：
+```bash
+# 使用 huggingface-cli
+huggingface-cli download OpenMedZoo/MedGo --local-dir ./models/MedGo
+# 或使用 git-lfs
+git lfs install
+git clone https://huggingface.co/OpenMedZoo/MedGo
+```
+### 基础推理
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# 加载模型和分词器
+model_path = "OpenMedZoo/MedGo"
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    device_map="auto",
+    trust_remote_code=True,
+    torch_dtype="auto"
+)
+# 医学问答示例
+messages = [
+    {"role": "system", "content": "你是一个专业的医疗助手，请基于医学知识回答问题。"},
+    {"role": "user", "content": "请解释什么是高血压，以及常见的治疗方法。"}
+]
+# 使用模型的对话模板构建提示文本
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+# 单独分词，使 generate() 同时获得 attention mask
+inputs = tokenizer([text], return_tensors="pt").to(model.device)
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=512,
+    temperature=0.7,
+    top_p=0.9,
+    do_sample=True
+)
+# 仅解码新生成的 token，跳过提示部分
+prompt_length = inputs["input_ids"].shape[1]
+response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
+print(response)
+```
+### 批量推理
+```bash
+# 使用提供的推理脚本
+python scripts/inference.py \
+    --model_path OpenMedZoo/MedGo \
+    --input_file examples/medical_qa.jsonl \
+    --output_file results/predictions.jsonl \
+    --batch_size 4
+```
+### vLLM 加速推理
+```python
+from vllm import LLM, SamplingParams
+# 初始化 vLLM
+llm = LLM(model="OpenMedZoo/MedGo", trust_remote_code=True)
+sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
+# 批量推理
+prompts = [
+    "请解释糖尿病的症状和治疗方法。",
+    "高血压患者应该注意哪些饮食事项？"
+]
+outputs = llm.generate(prompts, sampling_params)
+for output in outputs:
+    print(output.outputs[0].text)
+```
+## 🔧 训练细节
+MedGo 采用**两阶段微调策略**，兼顾通用医学知识与临床任务适配。
+### 阶段 I：通识医学对齐
+**目标**: 建立扎实的医学知识基础，提高问答规范性
+- **数据来源**:
+  - 权威医学文献（PubMed、医学教科书）
+  - 临床指南和诊疗规范
+  - 医学百科条目和术语库
+- **训练方法**:
+  - Supervised Fine-Tuning (SFT)
+  - Chain-of-Thought (CoT) 引导样本
+  - 医学术语对齐和安全性约束
+### 阶段 II：临床任务增强
+**目标**: 增强复杂病例推理和多任务处理能力
+- **数据来源**:
+  - 真实病历（完全脱敏处理）
+  - 门急诊记录和复杂多诊断样本
+  - 科研文章和质控案例
+- **数据增强技术**:
+  - 语义改写和多视角扩写
+  - 复杂病例合成
+  - 医患交互模拟
+- **训练方法**:
+  - Multi-Task Learning（病历摘要、鉴别诊断、检查建议等）
+  - Preference Alignment (DPO/KTO)
+  - 专家反馈迭代优化
+### 训练优化重点
+- ✅ 强化复杂病例的信息抽取与跨证据推理
+- ✅ 提升输出的医学一致性和可解释性
+- ✅ 优化表达的合规性和安全性
+- ✅ 通过专家样本和自动评测持续迭代
+## 💡 使用场景
+### ✅ 适用场景
+| 场景 | 说明 |
+|------|------|
+| **临床辅助** | 初步诊断建议、病历书写、格式化报告生成 |
+| **科研支持** | 文献摘要、研究思路生成、数据分析辅助 |
+| **质控审查** | 医疗文书规范性检查、诊疗流程质控 |
+| **系统集成** | 嵌入 HIS/EMR 系统，提供智能辅助决策 |
+| **医学教育** | 病例讨论、医学知识问答、临床推理训练 |
+### 🚫 不适用场景
+- ❌ **不能替代医生**: 仅为辅助工具，不能单独作为诊断依据
+- ❌ **高风险操作**: 不建议用于手术决策等高风险医疗操作
+- ❌ **罕见病局限**: 对训练数据外的罕见病表现可能欠佳
+- ❌ **实时急救**: 不适用于需要即时决策的急救场景
+## ⚠️ 限制与风险
+### 模型局限性
+1. **理解偏差**: 虽已覆盖大量医学知识，仍可能出现理解偏差或错误推荐
+2. **复杂病例**: 对病情复杂、并发症严重、资料缺失的病例风险较高
+3. **知识时效**: 医学知识持续更新，模型训练数据可能滞后
+4. **语言限制**: 主要针对中文医学场景，其他语言表现可能不佳
+### 使用建议
+- ⚠️ 请在受控环境中使用，并由临床专家审核生成结果
+- ⚠️ 将模型输出作为辅助参考，而非最终诊断依据
+- ⚠️ 对敏感病案或高风险场景，必须结合专家意见
+- ⚠️ 部署前需通过内部验证、安全审查和临床测试
+### 数据隐私与合规
+- 🔒 训练数据已完全脱敏处理
+- 🔒 使用时注意患者隐私保护
+- 🔒 生产环境部署需符合医疗数据安全法规（如 HIPAA、GDPR）
+- 🔒 建议在本地部署，避免敏感数据外传
+## 📚 引用
+如果 MedGo 对您的研究或项目有帮助，请引用我们的工作：
+```bibtex
+@misc{openmedzoo_2025,
+	author       = { OpenMedZoo },
+	title        = { MedGo (Revision 640a2e2) },
+	year         = 2025,
+	url          = { https://huggingface.co/OpenMedZoo/MedGo },
+	doi          = { 10.57967/hf/7024 },
+	publisher    = { Hugging Face }
+}
+```
+## 📄 许可证
+本项目采用 [Apache License 2.0](LICENSE) 开源协议。
+**商业使用须知**：
+- ✅ 允许商业使用和修改
+- ✅ 需保留原始许可证和版权声明
+- ✅ 医疗系统集成建议联系我们获取技术支持
+## 🤝 贡献
+我们欢迎社区贡献！以下是参与方式：
+### 贡献类型
+- 🐛 提交 Bug 报告
+- 💡 提出新功能建议
+- 📝 改进文档
+- 🔧 提交代码修复或优化
+- 📊 分享评测结果和使用案例
+## 🙏 致谢
+感谢所有参与 MedGo 项目的人员：
+- 模型研发与微调算法团队
+- 数据标注与质量控制团队
+- 临床专家指导与审核团队
+- 开源社区的支持与反馈
+特别感谢：
+- [Qwen Team](https://github.com/QwenLM/Qwen) 提供优秀的基础模型
+- 所有提供数据和反馈的医疗机构
+## 📧 联系方式
+- **HuggingFace**: [模型主页](https://huggingface.co/OpenMedZoo/MedGo)
+---
+<div align="center">
+[⬆ 回到顶部](#medgo-基于-qwen25-32b-的医疗大模型)
+</div>

main_results.png ADDED

Git LFS Details

SHA256: 87932bc2b934dc9992d8db349cb33a2cd21dba832d7ccbdcdd358848e4a005be
Pointer size: 132 Bytes
Size of remote file: 1.61 MB