Built with Axolotl

See axolotl config (axolotl version: 0.13.0.dev0):

# ------------------------------------------------------------------
# 0.  Model & Tokeniser
# ------------------------------------------------------------------
base_model:     ibm-granite/granite-4.0-h-micro
trust_remote_code: true

# ------------------------------------------------------------------
# 1.  Precision & Memory
# ------------------------------------------------------------------
bf16:  auto
fp16:
tf32:  false

load_in_8bit:  false
load_in_4bit:  false

# vram helpers
flash_attention:  true
# gradient_checkpointing: true   # <-- uncomment if you want old-style GC instead of FSDP AC
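
In Transformers terms, these precision settings roughly correspond to loading the model in bfloat16 with the FlashAttention-2 kernel. The sketch below is only an illustration of that mapping; Axolotl wires this up internally and the exact kwargs it passes may differ.

```python
# Rough equivalent of `bf16: auto` + `flash_attention: true` at load time
# (illustrative only; Axolotl handles this internally).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-4.0-h-micro",
    torch_dtype=torch.bfloat16,               # bf16 weights/compute
    attn_implementation="flash_attention_2",  # flash_attention: true
    trust_remote_code=True,
)
```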

# ------------------------------------------------------------------
# 2.  FSDP (zero-3 + cpu-offload)
# ------------------------------------------------------------------
fsdp:
  - auto_wrap
  - full_shard

fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: GraniteMoeHybridDecoderLayer
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_activation_checkpointing: true     # disables itself if unsupported
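
For readers who don't speak FSDP config fluently: fsdp_version: 2 selects PyTorch's newer fully_shard API, TRANSFORMER_BASED_WRAP shards the model one GraniteMoeHybridDecoderLayer at a time, and reshard_after_forward frees each layer's gathered parameters right after its forward pass. The sketch below is only a rough illustration of that wrapping (Axolotl and Accelerate do this internally); it assumes a torchrun-initialized process group and the usual Hugging Face model.model.layers layout.

```python
# Illustrative FSDP2 sketch, not Axolotl's actual wrapping code.
# Assumes this runs in a torchrun-launched process with an initialized
# distributed process group.
from torch.distributed.fsdp import fully_shard
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-4.0-h-micro", trust_remote_code=True
)

# Shard each decoder layer as its own FSDP unit, releasing its gathered
# parameters after the forward pass (fsdp_reshard_after_forward: true)...
for layer in model.model.layers:  # assumes the usual HF model.model.layers layout
    fully_shard(layer, reshard_after_forward=True)

# ...then shard whatever remains at the root (embeddings, lm_head, final norm).
fully_shard(model, reshard_after_forward=True)
```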

# ------------------------------------------------------------------
# 3.  Training Schedule
# ------------------------------------------------------------------
num_epochs: 2
learning_rate: 2e-5
lr_scheduler: cosine
warmup_ratio: 0.05
max_grad_norm: 0.1
weight_decay: 0.0
optimizer: adamw_torch_8bit

micro_batch_size: 2
gradient_accumulation_steps: 2
sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

# saves / eval frequency
saves_per_epoch: 4
val_set_size: 0.0
logging_steps: 1
strict: false
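
sample_packing: true concatenates several short training examples into each 8192-token window so little compute is wasted on padding. The snippet below is a toy greedy packer meant only to illustrate the idea; Axolotl's real implementation is more sophisticated and keeps per-sample attention boundaries intact.

```python
# Toy illustration of sample packing: greedily concatenate tokenised examples
# into bins of at most sequence_len tokens. Not Axolotl's actual packer.
sequence_len = 8192

def pack(examples, seq_len=sequence_len):
    bins, current = [], []
    for tokens in examples:
        if current and len(current) + len(tokens) > seq_len:
            bins.append(current)
            current = []
        current.extend(tokens)
    if current:
        bins.append(current)
    return bins

# Three fake "tokenised" examples of different lengths fit into one window.
fake = [[1] * 3000, [2] * 2500, [3] * 2000]
print([len(b) for b in pack(fake)])  # [7500]
```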

# ------------------------------------------------------------------
# 4.  Data & Prompt Template
# ------------------------------------------------------------------
datasets:
  - path: allura-forge/claude-oss-sft
    type: chat_template
    split: train
    field_messages: conversations
    message_field_role: from
    message_field_content: value

chat_template: jinja
chat_template_jinja: |
  {%- for message in messages -%}
    {{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' + message['content'] + '<|end_of_text|>' -}}
    {%- if loop.last and add_generation_prompt -%}
        {{- '<|start_of_role|>assistant<|end_of_role|>' -}}
    {%- endif -%}
  {%- endfor -%}

shuffle_merged_datasets: true
dataset_prepared_path: last_run_prepared
remove_unused_columns: false
train_on_inputs: false
group_by_length: false
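
The dataset rows use ShareGPT-style conversations entries with from/value fields, which Axolotl remaps to role/content (per message_field_role / message_field_content) before applying the Jinja template above. As a sanity check, the snippet below renders one made-up row through that exact template with plain jinja2; the example messages are hypothetical.

```python
# Render a made-up conversation through the chat template above to see the
# Granite-style role tokens it produces. Requires only `pip install jinja2`.
from jinja2 import Template

chat_template = (
    "{%- for message in messages -%}"
    "{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>'"
    " + message['content'] + '<|end_of_text|>' -}}"
    "{%- if loop.last and add_generation_prompt -%}"
    "{{- '<|start_of_role|>assistant<|end_of_role|>' -}}"
    "{%- endif -%}"
    "{%- endfor -%}"
)

# Hypothetical dataset row: `from`/`value` map to role/content.
row = {"conversations": [{"from": "user", "value": "Hello!"}]}
messages = [{"role": m["from"], "content": m["value"]} for m in row["conversations"]]

print(Template(chat_template).render(messages=messages, add_generation_prompt=True))
# -> <|start_of_role|>user<|end_of_role|>Hello!<|end_of_text|><|start_of_role|>assistant<|end_of_role|>
```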

# ------------------------------------------------------------------
# 5.  Plug-ins (memory / speed)
# ------------------------------------------------------------------
plugins:
  - axolotl.integrations.liger.LigerPlugin

# ------------------------------------------------------------------
# 6.  Weights & Biases
# ------------------------------------------------------------------
wandb_project: claumba-micro
wandb_name: woke
wandb_entity:
wandb_watch:
wandb_log_model:

# ------------------------------------------------------------------
# 7.  I/O & Resume
# ------------------------------------------------------------------
output_dir: ./model-output
resume_from_checkpoint:
local_rank:

# ------------------------------------------------------------------
# 8.  Unused / commented-out
# ------------------------------------------------------------------
# evals_per_epoch:
# eval_steps: 100
# eval_sample_packing: false
# early_stopping_patience:
# xformers_attention:


model-output

This model is a fine-tuned version of ibm-granite/granite-4.0-h-micro on the allura-forge/claude-oss-sft dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • total_eval_batch_size: 16
  • optimizer: 8-bit AdamW (adamw_torch_8bit) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 15
  • training_steps: 308
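
The derived values above follow directly from the config: total_train_batch_size is micro_batch_size × gradient_accumulation_steps × num_devices, and the 15 warmup steps come from applying warmup_ratio to the 308 total steps (the exact rounding is an Axolotl implementation detail).

```python
# Reproduce the derived hyperparameters from the raw config values.
micro_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 8
print(micro_batch_size * gradient_accumulation_steps * num_devices)  # 32 = total_train_batch_size

training_steps = 308
warmup_ratio = 0.05
print(int(training_steps * warmup_ratio))  # 15 warmup steps (0.05 * 308 = 15.4, truncated)
```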

Training results

No evaluation metrics were logged for this run (val_set_size is 0.0, so no validation split was held out).

Framework versions

  • Transformers 4.57.1
  • Pytorch 2.8.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.1