When training Qwen3-30B-A3B-Instruct-2507 using DeepSpeed Zero 3, the training progress bar gets stuck after loading the model and data.
#9, opened by lynn0999
As mentioned in the title, has anyone else run into a similar issue?
Same here, with 2 nodes (H100 x 8), HF Trainer, and DeepSpeed ZeRO-3.
There are no errors with the dense model (Qwen3 32B), but training deadlocks at the first step with Qwen3-30B-A3B-Instruct-2507.
Solved using https://github.com/deepspeedai/DeepSpeed/issues/7461:
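from transformers.models.qwen3_moe.modeling_qwen3_moe import Qwen3MoeSparseMoeBlock
import deepspeed
...
...
# Mark every sparse MoE block as a ZeRO-3 leaf module, so its parameters
# are gathered as a whole instead of one expert at a time.
deepspeed.utils.set_z3_leaf_modules(model, [Qwen3MoeSparseMoeBlock])
# The call above must run before the trainer is created.
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    peft_config=peft_config,
    init_params_for_logging=params_for_log,
)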
I added the set_z3_leaf_modules call before creating the trainer.
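For anyone wondering why this helps, here is a rough toy illustration in plain Python, not DeepSpeed code. It assumes the hang comes from ranks issuing different parameter-gather collectives because their tokens are routed to different experts, and that marking the MoE block as a leaf makes every rank gather the whole block in one go. The function names and the routing data are made up for the example.

```python
# Toy model of the suspected deadlock: a "collective" is just a string here.
from typing import Dict, List


def collectives_per_expert(routed_experts: List[int]) -> List[str]:
    """Hooks on every expert: a rank gathers only the experts its tokens hit."""
    return [f"all_gather(expert_{e})" for e in sorted(set(routed_experts))]


def collectives_leaf_block(routed_experts: List[int]) -> List[str]:
    """MoE block treated as a leaf: one gather for the block, whatever the routing."""
    return ["all_gather(moe_block)"]


def ranks_agree(schedules: Dict[int, List[str]]) -> bool:
    """A collective can only complete if all ranks issue the same sequence."""
    sequences = list(schedules.values())
    return all(seq == sequences[0] for seq in sequences)


# Hypothetical routing: rank 0's tokens hit experts 0 and 2, rank 1's hit 1 and 2.
routing = {0: [0, 2, 2], 1: [1, 2]}

per_expert = {rank: collectives_per_expert(r) for rank, r in routing.items()}
leaf = {rank: collectives_leaf_block(r) for rank, r in routing.items()}

print("per-expert hooks agree:", ranks_agree(per_expert))
print("leaf block agrees:", ranks_agree(leaf))
```

In the per-expert case rank 0 waits on expert_0 while rank 1 waits on expert_1, which would match a progress bar frozen at the first step. A dense model like Qwen3 32B runs the same modules on every rank, so it would not hit this.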