Qwen3-1.7B SFT Model

Model Description

This is a fine-tuned version of Qwen3-1.7B using Supervised Fine-Tuning (SFT) with FSDP (Fully Sharded Data Parallel) + QLoRA (Quantized Low-Rank Adaptation) techniques.

Training Details

Base Model

Model: Qwen/Qwen3-1.7B
Architecture: Transformer-based causal language model
Parameters: 1.7 billion

Training Configuration

Method: FSDP + QLoRA
Quantization: 4-bit QLoRA
LoRA Parameters:
- r: 64
- alpha: 16
- dropout: 0.1
- target: linear layers
Hardware: 8x H100 80GB HBM3
Precision: bfloat16
Flash Attention: Enabled
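
The LoRA settings above can be made concrete with some quick arithmetic. The sketch below assumes the standard LoRA scaling of alpha / r (not a variant such as rsLoRA) and uses a hypothetical 2048x2048 linear layer for illustration; that shape is not read from the model.

```python
# Illustrative arithmetic for the LoRA settings listed above (r=64, alpha=16).
# The 2048x2048 layer shape is a hypothetical example, not read from the model.

def lora_scaling(alpha: float, r: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / r

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

scaling = lora_scaling(alpha=16, r=64)
added = lora_param_count(d_in=2048, d_out=2048, r=64)
full = 2048 * 2048  # parameters in the frozen weight of the example layer

print(f"scaling factor: {scaling}")
print(f"adapter params per layer: {added:,}")
print(f"fraction of frozen weight: {added / full:.4f}")
```

With alpha smaller than r, the low-rank update is scaled down before it is added to the frozen weights, which is one reason a relatively large rank can be paired with a moderate learning rate.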

Training Hyperparameters

Epochs: 1
Micro Batch Size: 1
Gradient Accumulation Steps: 16
Learning Rate: 1e-4
Scheduler: Cosine with warmup
Warmup Ratio: 0.03
Optimizer: AdamW
Sequence Length: 1024
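
To show how these hyperparameters interact, here is a minimal sketch of the effective batch size and of a cosine schedule with linear warmup. It assumes all 8 GPUs act as data-parallel ranks and uses a hypothetical total of 1000 optimizer steps, since the real step count depends on the dataset size, which is not stated here. The schedule follows the common "linear warmup, cosine decay to zero" form and may differ in detail from the one used by the training framework.

```python
import math

# Values taken from the configuration above.
NUM_GPUS = 8
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16
SEQ_LEN = 1024
PEAK_LR = 1e-4
WARMUP_RATIO = 0.03

# Hypothetical: the real number of optimizer steps depends on the dataset size.
TOTAL_STEPS = 1000

def effective_batch_size(num_gpus: int, micro_batch: int, grad_accum: int) -> int:
    """Sequences contributing to one optimizer step under data parallelism."""
    return num_gpus * micro_batch * grad_accum

def lr_at(step: int, total_steps: int = TOTAL_STEPS,
          peak_lr: float = PEAK_LR, warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

batch = effective_batch_size(NUM_GPUS, MICRO_BATCH_SIZE, GRAD_ACCUM_STEPS)
print("sequences per optimizer step:", batch)
print("max tokens per optimizer step:", batch * SEQ_LEN)
for s in (0, 15, 30, 515, 1000):
    print(s, lr_at(s))
```

Under these assumptions each optimizer step sees 128 packed sequences, or at most 131,072 tokens.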

Dataset

Custom SFT dataset (SFT_004_origin_4.parquet)
Validation split: 10%
Sample packing enabled for training efficiency
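
As a rough illustration of what sample packing does, the sketch below greedily concatenates tokenized samples into sequences of at most 1024 tokens. This is a simplified stand-in, not the packing code used for this model; the function name and the toy data are hypothetical, and real packers also track sample boundaries so that attention does not cross from one sample into the next.

```python
from typing import List

def pack_samples(samples: List[List[int]], max_len: int = 1024) -> List[List[int]]:
    """Greedily concatenate tokenized samples into sequences of at most max_len tokens.

    Samples longer than max_len are truncated. Sample order is preserved.
    """
    packed: List[List[int]] = []
    current: List[int] = []
    for sample in samples:
        sample = sample[:max_len]
        # Start a new packed sequence when the next sample would not fit.
        if current and len(current) + len(sample) > max_len:
            packed.append(current)
            current = []
        current = current + sample
    if current:
        packed.append(current)
    return packed

# Toy token-id lists standing in for tokenized training examples (hypothetical data).
toy = [[1] * 600, [2] * 300, [3] * 200, [4] * 900, [5] * 100]
packed = pack_samples(toy, max_len=1024)
print([len(seq) for seq in packed])
```

Packing reduces the number of padding tokens per batch, which is where the efficiency gain comes from when the micro batch size is 1.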

Model Performance

The model was fine-tuned for instruction following on a custom dataset and is intended to retain the capabilities of the original Qwen3 model. No benchmark results are reported in this card.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "u-10bei/qwen3-1.7b-sft-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "u-10bei/qwen3-1.7b-sft-merged",
    trust_remote_code=True
)

# Chat format
messages = [
    {"role": "user", "content": "Hello! How can I help you today?"}
]

# Format conversation
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode only the newly generated tokens (skip the prompt)
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)

Direct Chat Format

# Manual chat formatting
prompt = "<|im_start|>user\nHello! How are you?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>")
)

response = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(response)
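
The manual format above can be generated for multi-turn conversations with a small helper. This is a minimal sketch of the `<|im_start|>` / `<|im_end|>` layout shown in the prompt string; `build_chatml_prompt` is a hypothetical name, and the tokenizer's own chat template may add elements this helper does not (for example a system prompt), so `tokenizer.apply_chat_template` remains the safer choice.

```python
from typing import Dict, List

def build_chatml_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Render messages using the <|im_start|> / <|im_end|> layout shown above."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([{"role": "user", "content": "Hello! How are you?"}])
print(repr(prompt))
```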

Special Tokens

BOS Token: <|im_start|>
EOS Token: <|im_end|>
UNK Token: <|endoftext|>
PAD Token: <|endoftext|>

Technical Specifications

Model Architecture

Attention: Flash Attention 2 (training and inference)
Precision: bfloat16 native support
Context Length: 1024 tokens (training), extensible for inference
Vocabulary Size: 151,669 tokens
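
For a sense of scale, the memory needed for the weights alone can be estimated from the parameter count and precision. The sketch below uses the nominal 1.7 billion parameter figure and ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 1.7e9  # nominal parameter count of the base model

bf16 = weight_memory_gib(NUM_PARAMS, 16)
four_bit = weight_memory_gib(NUM_PARAMS, 4)
print(f"bf16 weights: ~{bf16:.2f} GiB")
print(f"4-bit weights: ~{four_bit:.2f} GiB")
```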

Optimization Features

Memory Efficient: FSDP sharding reduces the per-GPU memory footprint during training
Quantization Ready: QLoRA-compatible for efficient fine-tuning
Multi-GPU: Can be placed across multiple GPUs at inference with device_map="auto"

Training Infrastructure

Distributed Training: FSDP (Fully Sharded Data Parallel)
Communication: NCCL over Ethernet
Memory Management: Expandable segments, optimized allocation
Monitoring: Weights & Biases integration

Limitations

This model is optimized for its specific training dataset and may not generalize to all use cases
Fine-tuning used sequences of at most 1024 tokens, so longer contexts were not seen during training
Performance may vary depending on the task and input format

Ethical Considerations

This model inherits the capabilities and limitations of the base Qwen3-1.7B model. Users should be aware of potential biases and use the model responsibly.

Citation

If you use this model, please cite:

@misc{qwen3-1.7b-sft-merged,
  title={Qwen3-1.7B SFT Model with FSDP+QLoRA},
  author={u-10bei},
  year={2025},
  url={https://huggingface.co/u-10bei/qwen3-1.7b-sft-merged}
}

Model Card Authors

u-10bei

Training Date

August 2025

This model was trained with FSDP + QLoRA on 8x H100 80GB GPUs.
