### Architecture

| Parameter             | GRAG-PHI-SFT                                                     |
|-----------------------|------------------------------------------------------------------|
| **d_model**           | 3072                                                             |
| **num heads**         | 32                                                               |
| **num layers**        | 32                                                               |
| **MLP ratio**         | 2.66                                                             |
| **LayerNorm type**    | RMSNorm                                                          |
| **pos embeddings**    | RoPE                                                             |
| **attention variant** | Standard multi-head self-attention with a sliding window of 2047 |
| **biases**            | none                                                             |
| **block type**        | sequential                                                       |
| **activation**        | SiLU                                                             |
| **sequence length**   | 131072                                                           |
| **weight tying**      | bfloat16                                                         |
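A few of the figures above can be sanity-checked directly. The per-head dimension follows from `d_model / num heads`, and RMSNorm (the LayerNorm variant named in the table) normalizes by the root-mean-square of the activations without subtracting the mean. A minimal sketch, using only the numbers from the table:

```python
import math

# Figures taken from the architecture table above.
d_model = 3072
num_heads = 32

# Each attention head operates on d_model / num_heads dimensions.
head_dim = d_model // num_heads
print(head_dim)  # 96

# Minimal RMSNorm sketch: scale by the root-mean-square, no mean subtraction.
# (Illustrative only; the model's actual implementation also has a learned gain.)
def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

print([round(v, 3) for v in rms_norm([1.0, 2.0, 2.0])])  # [0.577, 1.155, 1.155]
```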

### Hyperparameters