## Sliding Window Attention Mechanism

- Adopting a sliding window attention mechanism in some layers to reduce KV Cache memory usage.
- Balancing computational efficiency and performance, especially suitable for long-sequence tasks.
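The pattern above can be sketched with a boolean attention mask: each query attends only to the most recent `window` keys, which is why the KV cache per layer can be capped at `window` entries instead of growing with sequence length. A minimal NumPy sketch (the function name and window size are illustrative, not from the model's code):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: query i may attend to keys in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
# Every row has at most `window` True entries, so only the last
# `window` key/value pairs ever need to stay in the KV cache.
```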
## Optimizing Position Encoding Oscillation

- By increasing the dimensions of some attention heads, RoPE curve oscillation is reduced.
- More stable performance in long-sequence tasks while maintaining the model's ability to capture diverse features.
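The README does not detail the mechanism, but in standard RoPE the rotation frequencies are tied to the head dimension (theta_i = base^(-2i/d)), so widening a head extends the spectrum toward lower, slower-varying frequencies, one plausible reading of why the curve oscillates less. A hedged sketch of that relationship (`rope_freqs` is an illustrative helper, not the model's code):

```python
import numpy as np

def rope_freqs(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE per-pair rotation frequencies: theta_i = base^(-2i / head_dim)."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

# A wider head reaches a lower minimum frequency, so the positional
# signal at long relative distances varies more smoothly.
narrow = rope_freqs(64)
wide = rope_freqs(128)
```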
## High Peak Learning Rate Strategy

- Using the **WSD learning rate scheduling strategy** with high peak learning rates to promote model generalization.
- Significant improvement in benchmark task performance.
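WSD (Warmup-Stable-Decay) warms up to a high peak learning rate, holds it flat for most of training, then decays quickly at the end. A minimal sketch of such a schedule (the warmup/decay fractions and the linear decay shape are assumptions, not the model's actual hyperparameters):

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.1, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay schedule: linear warmup, long plateau, short final decay."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        # linear warmup from 0 to the peak
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        # stable plateau at the high peak learning rate
        return peak_lr
    # short linear decay from peak down to min_lr
    t = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * t
```

The long plateau is what distinguishes WSD from cosine schedules: the model trains at the high peak rate for most of its steps, and checkpoints taken before the decay phase can be decayed later to extend training.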
## Adaptive Gradient Update