DanielWang committed · Commit 28d4b82 · verified · 1 parent: 9cee162

Update README.md

Files changed (1): README.md (+3 -3)
README.md CHANGED

@@ -74,17 +74,17 @@ We conducted meticulous data collection and synthesis for the medical field, inc
 ## Sliding Window Attention Mechanism
 
 - Adopting a sliding window attention mechanism in some layers to reduce KV Cache memory usage.
-- **Optimization**: Balancing computational efficiency and performance, especially suitable for long-sequence tasks.
+- Balancing computational efficiency and performance, especially suitable for long-sequence tasks.
 
 ## Optimizing Position Encoding Oscillation
 
 - By increasing the dimensions of some attention heads, RoPE curve oscillation is reduced.
-- **Result**: More stable performance in long-sequence tasks while maintaining the model's ability to capture diverse features.
+- More stable performance in long-sequence tasks while maintaining the model's ability to capture diverse features.
 
 ## High Peak Learning Rate Strategy
 
 - Using **WSD learning rate scheduling strategy** with high peak learning rates to promote model generalization.
-- **Comparison results**: Significant improvement in benchmark task performance.
+- Significant improvement in benchmark task performance.
 
 ## Adaptive Gradient Update
 
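The README text being diffed mentions sliding-window attention as a way to shrink KV cache memory. As a minimal illustration (the function name and NumPy formulation are mine, not the model's implementation), each query position attends only to the most recent `window` key positions, so the KV cache never needs to hold more than `window` entries:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window attention mask: query position i may only
    attend to key positions in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    # Causal (no future keys) AND within the trailing window.
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
```

In this sketch, every row of `mask` has at most 3 true entries, which is what bounds the per-layer KV cache.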
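The position-encoding bullet says oscillation is reduced by enlarging some attention heads. One way to see the connection is through the standard RoPE frequency formula theta_k = base^(-2k/d): a larger head dimension d adds slower-rotating, lower-frequency components. The sketch below (helper name and dimensions are illustrative assumptions, not the model's actual configuration) just computes those frequencies:

```python
import numpy as np

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: theta_k = base^(-2k / head_dim)
    for k = 0, 1, ..., head_dim/2 - 1."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

f64 = rope_inv_freq(64)    # 32 frequencies
f128 = rope_inv_freq(128)  # 64 frequencies, including slower-rotating ones
```

The larger head dimension yields a denser frequency grid whose slowest component rotates more slowly, which is one plausible mechanism behind the smoother long-context behavior the README claims.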
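The high-peak-learning-rate bullet names a WSD (warmup-stable-decay) schedule. A hedged sketch of the general shape (the warmup/decay fractions and the linear ramps are my assumptions; the model's actual hyperparameters are not given in the README):

```python
def wsd_lr(step: int, total: int, peak: float,
           warmup_frac: float = 0.1, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay schedule: linear warmup to the peak learning
    rate, a long constant (stable) phase, then a linear decay to zero."""
    warmup = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:
        return peak * step / warmup          # linear warmup
    if step < decay_start:
        return peak                          # stable phase at the peak
    return peak * max(0.0, (total - step) / (total - decay_start))  # decay
```

The long stable phase is what lets the schedule hold a high peak learning rate for most of training, which the README credits for better generalization.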