---
library_name: transformers
tags:
- exaone
language:
- ko
- en
base_model:
- LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct
---
# Model Card for EXAONE-3.5-7.8B-s1-Ko-no-sample-packing

### Base Model

- [beomi/EXAONE-3.5-7.8B-Instruct-Llamafied](https://huggingface.co/beomi/EXAONE-3.5-7.8B-Instruct-Llamafied), a Llamafied version of [LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct)

### Dataset

- [werty1248/s1k-KoEnKo-formatted](https://huggingface.co/datasets/werty1248/s1k-KoEnKo-formatted)
- Derived from [simplescaling/s1K](https://huggingface.co/datasets/simplescaling/s1K): **only the question and attempt** were translated; the thinking part was left untranslated.
- Template: `<think>{thinking_trajectories}</think>{attempt}`
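As a minimal sketch of how the template above combines the two fields into one assistant turn: the helper name `format_response` and the sample strings are hypothetical, and only the template string itself comes from this card.

```python
# Template string as given in this card; the placeholders are the s1K field names.
TEMPLATE = "<think>{thinking_trajectories}</think>{attempt}"


def format_response(thinking_trajectories: str, attempt: str) -> str:
    """Join the untranslated reasoning trace and the translated answer.

    Hypothetical helper for illustration; it is not the dataset's actual
    preprocessing script.
    """
    return TEMPLATE.format(
        thinking_trajectories=thinking_trajectories,
        attempt=attempt,
    )


# Made-up sample values, only to show the resulting layout.
example = format_response(
    thinking_trajectories="16 - 3 - 4 = 9 eggs remain; 9 * 2 = 18 dollars.",
    attempt="Janet은 매일 18달러를 법니다.",
)
print(example)
```

Because `str.format` only parses braces in the template, LaTeX such as `\boxed{18}` inside either field passes through unchanged.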

### Training

- 4x H100 SXM for 1 hour (4 H100 SXM GPU-hours in total)

- Trained with Axolotl, following [simplescaling's training config](https://github.com/simplescaling/s1/blob/main/train/sft.sh) as closely as possible.

- Differences from that config:
  - optimizer: paged_adamw_32bit
  - deepspeed: ./deepspeed_configs/zero3_bf16.json
  - plugins:
    - axolotl.integrations.liger.LigerPlugin

- Note: when sample packing was tried, each epoch had only 12 steps, so training on the first batch did not work properly.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6629154d55d7c289634b8c5d/_Mb7E5y1b66cvhIEtHLrs.png)
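The small step count follows from the effective batch size. The sketch below is a rough estimate only. It assumes 4 GPUs (from the hardware line above), `micro_batch_size: 1` and `gradient_accumulation_steps: 4` (from the Axolotl config in this card), and about 1,000 training examples as in the original s1K. The function name is hypothetical, and the trainer's exact rounding may differ.

```python
import math


def optimizer_steps_per_epoch(num_sequences: int, num_gpus: int,
                              micro_batch_size: int, grad_accum_steps: int) -> int:
    """Estimate optimizer steps per epoch (hypothetical helper, rounds up)."""
    effective_batch = num_gpus * micro_batch_size * grad_accum_steps
    return math.ceil(num_sequences / effective_batch)


NUM_GPUS = 4            # from the hardware line in this card
MICRO_BATCH_SIZE = 1    # from the Axolotl config
GRAD_ACCUM_STEPS = 4    # from the Axolotl config
NUM_EXAMPLES = 1000     # assumption: same row count as the original s1K

# Without packing, every example is its own sequence.
unpacked_steps = optimizer_steps_per_epoch(
    NUM_EXAMPLES, NUM_GPUS, MICRO_BATCH_SIZE, GRAD_ACCUM_STEPS
)

# With packing, 12 steps per epoch implies at most this many packed sequences.
effective_batch = NUM_GPUS * MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS
max_packed_sequences = 12 * effective_batch

print("steps per epoch without packing:", unpacked_steps)
print("packed sequences implied by 12 steps:", max_packed_sequences)
```

Under these assumptions, packing cuts the number of optimizer updates per epoch by roughly a factor of five.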

<details><summary>Axolotl config</summary>

```yaml
base_model: beomi/EXAONE-3.5-7.8B-Instruct-Llamafied
model_type: AutoModelForCausalLM
tokenizer_config: beomi/EXAONE-3.5-7.8B-Instruct-Llamafied
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: werty1248/s1k-KoEnKo-formatted
    field_messages: conversations
    type: chat_template
    chat_template: chatml

dataset_prepared_path: ./data_preparation
output_dir: /workspace/data

hf_use_auth_token: true

sequence_len: 32768
sample_packing: true
pad_to_sequence_len: true

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true

wandb_project:
#wandb_entity:
#wandb_watch:
wandb_name:
#wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 5
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 1.0e-5
adam_beta1: 0.9
adam_beta2: 0.95

train_on_inputs: false
group_by_length: false
bf16: auto
fp16: 
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_ratio: 0.05
eval_table_size:

deepspeed: ./deepspeed_configs/zero3_bf16.json
```

</details>

### Example
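The snippets in this section use `model` and `tokenizer` without defining them. A minimal loading sketch follows, assuming a single CUDA GPU and bf16 weights. `MODEL_ID` is a placeholder, not the confirmed Hub id of this model, and both helper names are hypothetical.

```python
# Placeholder repo id (assumption): replace with the actual Hub id of this model.
MODEL_ID = "your-namespace/EXAONE-3.5-7.8B-s1-Ko-no-sample-packing"


def load_model_and_tokenizer(model_id: str = MODEL_ID):
    """Load the model in bf16 on one CUDA device, plus its tokenizer."""
    # Imported lazily so the light helper below works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return model.to("cuda").eval(), tokenizer


def build_messages(system_prompt: str, question: str) -> list[dict[str, str]]:
    """Build the system + user message list passed to apply_chat_template."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


# Usage (requires a CUDA GPU and the downloaded weights):
# model, tokenizer = load_model_and_tokenizer()
# messages = build_messages("<system prompt>", "<question>")
```

Keeping the message construction in a separate helper allows the same long system prompt to be reused across the three questions below.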

#### Q1: From [ChuGyouk/GSM8k-Ko](https://huggingface.co/datasets/ChuGyouk/GSM8k-Ko)

```python
input = tokenizer.apply_chat_template([ { "content": "Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution. In the Thought section, detail your reasoning process using the specified format: <|begin_of_thought|> {thought with steps separated with '\\n\\n'} <|end_of_thought|> Each step should include detailed considerations such as analisying questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin_of_solution|> {final formatted, precise, and clear solution written in the same language as the question.} <|end_of_solution|> Now, try to solve the following question through the above guidelines", "role": "system" },
                                       { "content": "Janet의 오리는 하루에 16개의 달걀을 낳습니다. 그녀는 매일 아침 3개의 달걀을 먹고, 친구들을 위해 4개의 달걀로 머핀을 굽습니다. 그녀는 남은 달걀을 농산물 시장에서 신선한 오리알 개당 2달러에 판매합니다. 그녀는 농산물 시장에서 매일 얼마나 많은 돈을 벌까요?", "role": "user" },], add_generation_prompt=True, return_tensors='pt')
res = model.generate(input.to('cuda'), max_length = 2048, tokenizer=tokenizer)
tokenizer.decode(res[0])
```

<details><summary>Output (no EOS token)</summary>

```
[|assistant|]다음은 문제를 해결하는 방법입니다.

* **매일 태어나는 달걀의 수:** 오리는 하루에 16개의 달걀을 낳습니다.
* **달걀 사용량:**
    * Janet은 아침에 3개의 달걀을 먹습니다.
    * 그녀는 친구들을 위해 4개의 달걀으로 머핀을 굽습니다.
    * **총 달걀 사용량:** 3 + 4 = 7개
* **판매용으로 남은 달걀 수:**
    * **남은 달걀:** 16 (태어나는 달걀) - 7 (사용한 달걀) = 9개
* **매일 수익:**
    * 그녀는 각 오리알을 2달러에 판매합니다.
    * **매일 수익:** 9 (판매용으로 남은 달걀) x 2 (각 오리알의 가격) = 18달러

**답:** Janet은 농산물 시장에서 매일 18달러를 벌게 됩니다.

**요약:**

* 태어나는 달걀: 16개
* 사용하는 달걀: 3 + 4 = 7개
* 남은 달걀: 16 - 7 = 9개
* 각 남은 달걀의 가격: $2
* 매일 수익: 9 x 2 = $18

Final Answer: The final answer is $\boxed{18}$
```

</details>
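Since the output above ends with a `\boxed{...}` answer rather than the `<|begin_of_solution|>` format requested by the system prompt, a small post-processing helper can be handy when scoring answers. This is a sketch, not part of the original evaluation. The function name is hypothetical, and it assumes the decoded text contains the `[|assistant|]` tag and no nested braces inside `\boxed{}`.

```python
import re
from typing import Optional

ASSISTANT_TAG = "[|assistant|]"


def extract_boxed_answer(decoded: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the assistant turn, if any."""
    # Keep only the text after the last assistant tag, dropping the prompt.
    completion = decoded.split(ASSISTANT_TAG)[-1]
    # Simple pattern: does not handle nested braces such as \boxed{\frac{1}{2}}.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


sample = "[|assistant|]... Final Answer: The final answer is $\\boxed{18}$"
print(extract_boxed_answer(sample))
```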

#### Q2: From [KSM, HAERAE-HUB/HRM8K](https://huggingface.co/datasets/HAERAE-HUB/HRM8K)

```python
input = tokenizer.apply_chat_template([ { "content": "Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution. In the Thought section, detail your reasoning process using the specified format: <|begin_of_thought|> {thought with steps separated with '\\n\\n'} <|end_of_thought|> Each step should include detailed considerations such as analisying questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin_of_solution|> {final formatted, precise, and clear solution written in the same language as the question.} <|end_of_solution|> Now, try to solve the following question through the above guidelines", "role": "system" },
                                       { "content": "양의 실수 \\( t \\)에 대하여 함수 \\( f \\)를 \\[ f(t) = \\int_{0}^{\\sqrt{t}} \\int_{y}^{\\sqrt{t}} \\frac{1}{2 + \\sin(x^2)} \\, dx \\, dy \\] 로 정의할 때, \\( f'\\left(\\frac{\\pi}{2}\\right) \\)의 값을 구하시오.", "role": "user" },], add_generation_prompt=True, return_tensors='pt')
res = model.generate(input.to('cuda'), max_length = 8192, stop_strings = ['<|end_of_solution|>'], tokenizer=tokenizer)
tokenizer.decode(res[0])
```

<details><summary>Output (wrong answer, no think section, no EOS token)</summary>

```
[|assistant|]먼저, 주어진 함수 \( f(t) \)를 다음과 같이 나타낼 수 있습니다:
\[ f(t) = \iint_{D(t)} \frac{1}{2 + \sin(x^2)} \, dA \]
여기서 \( D(t) = [0, \sqrt{t}] \times [\sqrt{t}, \sqrt{t}] \)입니다.

함수 \( f(t) \)는 두 변수 함수의 적분으로 표현될 수 있으므로, 라이베니츠 규칙을 사용하여 도함수를 구할 수 있습니다:
\[ f'(t) = \frac{d}{dt} \left( \iint_{D(t)} \frac{1}{2 + \sin(x^2)} \, dA \right) = \iint_{D'(t)} \frac{\partial}{\partial t} \left( \frac{1}{2 + \sin(x^2)} \right) \, dA \]
여기서 \( D'(t) \)는 영역 \( D(t) \)의 경계가 \( t \)에 대해 미분 가능하게 변할 때, \( t \)에 의존하는 부분입니다.

영역 \( D(t) \)의 경계는 다음과 같습니다:
\( t \ge 0 \)에 대해, \( 0 \le y \le \sqrt{t} \)이고 \( y \le x \le \sqrt{t} \)입니다.
\( t \)에 대한 도함수를 구하기 위해, \( x \)와 \( y \)에 대한 정보를 사용합니다.
\( t \)에 대한 \( y \)의 범위는 \( y = \sqrt{t} \)입니다.
\( t \)에 대한 \( x \)의 범위는 \( x = \sqrt{t} \)입니다.

\( D(t) \)의 경계 \( \sqrt{t} = \sqrt{t} \)와 \( x = y \)로 주어집니다.
\( t \)에 대한 \( x=y \)의 변화는 \( x=y \)를 따라 발생합니다.
또한, 경계 \( y = \sqrt{t} \)를 따라 \( x \)가 증가합니다.

\( f'(t) \)를 구하기 위해, 우리는 다음과 같은 영역에서 적분해야 합니다:
\( D'(t) = [0, \sqrt{t}] \times [\sqrt{t}, \sqrt{t}] \)에서 \( t \)에 의존하는 부분입니다.
\( y = \sqrt{t} \)와 \( x = y \)로 경계가 형성됩니다.

\( t \)에 대한 도함수를 계산해 봅시다:
\( \frac{\partial}{\partial t} \left( \frac{1}{2 + \sin(x^2)} \right) = 0 \)입니다. 이는 \( \frac{1}{2 + \sin(x^2)} \)가 \( t \)와 무관하기 때문입니다.

따라서, \( f'(t) = \iint_{D'(t)} 0 \, dA = 0 \)입니다.
이것은 부분 적분이 항상 0이라는 의미일까요? 아니면 도함수 계산에 문제가 있는 것일까요?

우리는 먼저 \( f(t) \)를 직접 미분할 수 있는지 살펴봐야 합니다.
\( f(t) = \int_{0}^{\sqrt{t}} \int_{y}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \, dx \, dy \)
이것을 Leibniz 적분 규칙을 사용하여 미분하면:
\( f'(t) = \frac{d}{dt} \left( \int_{0}}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \, dx \right) \times \text{상수} + \int_{0}^{\int_{y}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \, dx \, dy} \frac{\partial}{\partial t} \left( \frac{1}{2 + \sin(x^2)} \right) \, dy \)
이것은 약간 복잡하게 들릴 수 있습니다.

두 번 적분하는 또 다른 방법을 고려해 보겠습니다.
\( f(t) = \int_{0}^{\sqrt{t}} \left( \int_{y}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \, dx \right) dy \)
Leibniz 규칙에 의해:
\( f'(t) = \frac{d}{dt} \left( \int_{y}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \, dx \right)_{y=\sqrt{t}} + \int_{0}^{\sqrt{t}} \frac{\partial}{\partial t} \left( \int_{y}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \, dx \right) dy \)
첫 번째 항은 다음과 같습니다:
\( \frac{d}{dt} \left( \int_{y}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \, dx \right) = \frac{1}{2 + \sin((\sqrt{t})^2)} \cdot \frac{d}{dt} (\sqrt{t}) = \frac{1}{2 + \sin(t)} \cdot \frac{1}{2\sqrt{t}} \)
이 항은 \( y = \sqrt{t} \)에서 평가됩니다.

두 번째 항은 다음과 같습니다:
\( \int_{0}^{\sqrt{t}} \frac{\partial}{\partial t} \left( \int_{y}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \, dx \right) dy \)
\( \frac{\partial}{\partial t} \left( \int_{y}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \, dx \right) = -\frac{1}{2 + \sin((\sqrt{t})^2)} \cdot \frac{1}{2\sqrt{t}} = -\frac{1}{2 + \sin(t)} \cdot \frac{1}{2\sqrt{t}} \)
이것을 적분합니다:
\( -\int_{0}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} \cdot \frac{1}{2\sqrt{t}} dx = -\frac{1}{2\sqrt{t}} \int_{0}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} dx \)

\( f'(t) \)에 대한 전체 표현은 다음과 같습니다:
\( f'(t) = \frac{1}{2\sqrt{t}(2 + \sin(t))} - \frac{1}{2\sqrt{t}} \int_{0}^{\sqrt{t}} \frac{1}{2 + \sin(x^2)} dx \)

\( f'\left(\frac{\pi}{2}\right) \)를 구해야 합니다.
\( t = \frac{\pi}{2} \)를 대입하면:
\( f'\left(\frac{\pi}{2}\right) = \frac{1}{2\sqrt{\pi/2}(2 + \sin(\pi/2))} - \frac{1}{2\sqrt{\pi/2}} \int_{0}^{\sqrt{\pi/2}} \frac{1}{2 + \sin(x^2)} dx \)
\( f'\left(\frac{\pi}{2}\right) = \frac{1}{2 \cdot \frac{\sqrt{2\pi}}{2} (2 + 1)} - \frac{1}{\sqrt{2\pi}} \int_{0}^{\sqrt{\pi/2}} \frac{1}{2 + \sin(x^2)} dx \)
\( f'\left(\frac{\pi}{2}\right) = \frac{1}{\sqrt{2\pi} \cdot 3} - \frac{1}{\sqrt{2\pi}} \int_{0}^{\sqrt{\pi/2}} \frac{1}{2 + \sin(x^2)} dx \)
\( f'\left(\frac{\pi}{2}\right) = \frac{1}{3\sqrt{2\pi}} - \frac{1}{\sqrt{2\pi}} \int_{0}^{\sqrt{\pi/2}} \frac{1}{2 + \sin(x^2)} dx \)

또는, \( f'(t) = 0 \)이라고 생각해 봅시다. 이는 경계 \( D'(t) \)가 실제로 존재하고 적분이 0이 아님을 의미하지는 않습니다.
\( D(t) \)의 경계를 \( \partial D(t) \)라고 합시다.
\( f(t) = \iint_{D(t)} g(x, y) \, dA \)라고 합시다. 여기서 \( g(x, y) = \frac{1}{2 + \sin(x^2)} \).
Leibniz 규칙에 의해:
\( f'(t) = \iint_{\partial D(t)} g(x, y) \mathbf{n} \cdot \mathbf{k} \, dS \) + \( \iint_{D(t)} \frac{\partial g}{\partial t} \, dA \)
여기서 \( \mathbf{n} \)은 외부 법선 벡터이고, \( \mathbf{k} \)는 \( t \) 방향의 단위 벡터입니다.
\( \mathbf{n} \cdot \mathbf{k} \)는 경계에서 \( t \)에 대한 단위 법선의 계수입니다.

영역 \( D(t) \)의 경계는 \( y = \sqrt{t} \)와 \( x = y \)로 구성됩니다.
\( t \)에 대한 변화는 \( y = \sqrt{t} \)에서 발생하고, 이는 \( x=y \)를 따라 일어납니다.

경계 \( y = \sqrt{t} \)를 고려해 봅시다. 여기서 외부 법선 벡터는 \( \mathbf{n} \)은 \( \frac{\partial(x, y)}{\partial t} \)의 단위 법선입니다, 즉 \( \mathbf{n} = \frac{(1, 0)}{\sqrt{1 + 0^2}} = (1, 0) \)입니다.
\( \mathbf{n} \cdot \mathbf{k} = 0 \)입니다.

경계 \( x = y \)를 고려해 봅시다. 여기서 \( t \)에 대한 \( x \)의 변화가 발생합니다.
외부 법선 벡터는 \( \frac{\partial(x, y)}{\partial t} \)입니다.
이 경계에서 \( \mathbf{n} \cdot \mathbf{k} \)는 \( \frac{\partial x}{\partial t} \)입니다.
\( x = y \)이므로, \( \frac{\partial x}{\partial t} = \frac{\partial y}{\partial t} = \frac{1}{2\sqrt{t}} \)입니다.

\( f'(t) \)에 대한 표현은 다음과 같습니다:
\( f'(t) = \iint_{\partial D(t)} g(x, y) \cdot 0 \, dS + \iint_{D(t)} \frac{\partial g}{\partial t} \, dA \)
\( f'(t) = \iint_{D(t)} \frac{\partial}{\partial t} \left( \frac{1}{2 + \sin(x^2)} \right) \, dA \)
이는 이전에 얻은 결과와 일치합니다. \( \frac{\partial}{\partial t} \left( \frac{1}{2 + \sin(x^2)} \right) = 0 \)이기 때문입니다.

Final Answer: The final answer is $\boxed{0}$
```

</details>

#### Q3: General Knowledge

```python
input = tokenizer.apply_chat_template([ { "content": "Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution. In the Thought section, detail your reasoning process using the specified format: <|begin_of_thought|> {thought with steps separated with '\\n\\n'} <|end_of_thought|> Each step should include detailed considerations such as analisying questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin_of_solution|> {final formatted, precise, and clear solution written in the same language as the question.} <|end_of_solution|> Now, try to solve the following question through the above guidelines", "role": "system" },
                                       { "content": "세계에서 세 번째로 높은 산은?", "role": "user" },], add_generation_prompt=True, return_tensors='pt')
res = model.generate(input.to('cuda'), max_length = 2048, temperature = 0.5, stop_strings = ['<|end_of_solution|>'], tokenizer=tokenizer)
tokenizer.decode(res[0])
```

- Normal temperature: hallucination (unstable answers)
- Low temperature: repetition
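The trade-off noted above can be illustrated with a toy softmax. The logits are made up, and this is a sketch of how temperature reshapes a token distribution, not a measurement of this model. Note that in `transformers`, `temperature` is only used when sampling is enabled (`do_sample=True`, passed to `generate` or set in the model's generation config).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens.
toy_logits = [2.0, 1.0, 0.5]

for temperature in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token, which tends toward repetitive, near-greedy output. Higher temperatures spread probability to less likely tokens, which makes answers less stable.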

<details><summary>Output (wrong answer, no think section)</summary>

```
[|assistant|]세계에서 세 번째로 높은 산은 **K2(중국명 카라코람 K2)**입니다.

* **높이:** 8,611미터 (28,251피트)
* **위치:** 파키스탄과 중국(신장) 국경, 카라코람 산맥
* **등반 역사:** 1954년 처음 등반 성공 (아드라 등반대)
* **특징:** K2는 가장 높은 산인 에베레스트와 네 번째로 높은 산인 칸첸जunga보다 더 어렵고 위험한 등반으로 알려져 있습니다. "세상에서 가장 위험한 산"이라는 별명을 가지고 있습니다.

**참고:** 산의 높이에 대한 논쟁은 종종 존재하며, 특히 눈 표면의 변동성으로 인해 더욱 그렇습니다. 위에 나열된 높이는 일반적으로 인용되는 값입니다.

세계에서 가장 높은 산은 에베레스트(8,848미터 또는 29,031피트)이고, 두 번째로 높은 산은 칸첸자웅(8,586미터 또는 28,169피트)입니다.
```

</details>