## Sliding Window Attention Mechanism

- Adopting a sliding window attention mechanism in some layers to reduce KV Cache memory usage.
- Balancing computational efficiency and performance, especially suitable for long-sequence tasks.
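The pattern above can be sketched with a boolean attention mask: each query attends only to the most recent `window` keys, which is why the KV cache per layer can be capped at `window` entries instead of growing with sequence length. A minimal NumPy sketch (the function name and window size are illustrative, not from the model's code):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: query i may attend to keys in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
# Every row has at most `window` True entries, so only the last
# `window` key/value pairs ever need to stay in the KV cache.
```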
## Optimizing Position Encoding Oscillation

- By increasing the dimensions of some attention heads, RoPE curve oscillation is reduced.
- More stable performance in long-sequence tasks while maintaining the model's ability to capture diverse features.
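The README does not detail the mechanism, but in standard RoPE the rotation frequencies are tied to the head dimension (theta_i = base^(-2i/d)), so widening a head extends the spectrum toward lower, slower-varying frequencies, one plausible reading of why the curve oscillates less. A hedged sketch of that relationship (`rope_freqs` is an illustrative helper, not the model's code):

```python
import numpy as np

def rope_freqs(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE per-pair rotation frequencies: theta_i = base^(-2i / head_dim)."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

# A wider head reaches a lower minimum frequency, so the positional
# signal at long relative distances varies more smoothly.
narrow = rope_freqs(64)
wide = rope_freqs(128)
```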
## High Peak Learning Rate Strategy

- Using the **WSD learning rate scheduling strategy** with high peak learning rates to promote model generalization.
- Significant improvement in benchmark task performance.
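WSD (Warmup-Stable-Decay) warms up to a high peak learning rate, holds it flat for most of training, then decays quickly at the end. A minimal sketch of such a schedule (the warmup/decay fractions and the linear decay shape are assumptions, not the model's actual hyperparameters):

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.1, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay schedule: linear warmup, long plateau, short final decay."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        # linear warmup from 0 to the peak
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        # stable plateau at the high peak learning rate
        return peak_lr
    # short linear decay from peak down to min_lr
    t = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr + (min_lr - peak_lr) * t
```

The long plateau is what distinguishes WSD from cosine schedules: the model trains at the high peak rate for most of its steps, and checkpoints taken before the decay phase can be decayed later to extend training.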
## Adaptive Gradient Update