When training Qwen3-30B-A3B-Instruct-2507 with DeepSpeed ZeRO-3, the training progress bar gets stuck after the model and data are loaded.

#9
by lynn0999 - opened

As mentioned in the title, has anyone else run into a similar issue?

Same here, with 2 nodes (H100 x 8), HF Trainer, and DeepSpeed ZeRO-3.
The dense model (Qwen3 32B) trains without errors, but Qwen3-30B-A3B-Instruct-2507 deadlocks at the first step.

Solved with the workaround from https://github.com/deepspeedai/DeepSpeed/issues/7461:

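    import deepspeed
    from transformers.models.qwen3_moe.modeling_qwen3_moe import Qwen3MoeSparseMoeBlock
    ...
    ...
    # Mark each sparse MoE block as a ZeRO-3 leaf module so that its parameters
    # are gathered together instead of per expert submodule.
    # This must run after the model is loaded and before the trainer is created.
    deepspeed.utils.set_z3_leaf_modules(model, [Qwen3MoeSparseMoeBlock])

    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        peft_config=peft_config,
        init_params_for_logging=params_for_log,
    )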

I added this snippet before creating the trainer.
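In case the import path for the MoE block differs between transformers versions, here is a minimal sketch of the same workaround that looks the class up by name on the loaded model instead of importing it. The helper names `find_module_classes` and `mark_moe_blocks_as_z3_leaves` are hypothetical (not part of DeepSpeed or transformers); the only DeepSpeed call used is `deepspeed.utils.set_z3_leaf_modules`, and the sketch assumes `model` exposes `modules()` like a `torch.nn.Module`.

```python
def find_module_classes(model, class_name):
    """Return the distinct module classes in `model` whose class name matches.

    Hypothetical helper: only relies on `model.modules()`, which
    torch.nn.Module provides.
    """
    matches = [type(m) for m in model.modules() if type(m).__name__ == class_name]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(matches))


def mark_moe_blocks_as_z3_leaves(model, class_name="Qwen3MoeSparseMoeBlock"):
    """Flag every matching MoE block class as a ZeRO-3 leaf module.

    Hypothetical helper: call it after loading the model and before
    creating the trainer.
    """
    import deepspeed  # imported lazily so the lookup helper works without it

    classes = find_module_classes(model, class_name)
    if not classes:
        # Fail loudly rather than silently training without the workaround.
        raise ValueError(f"no module of class {class_name!r} found in the model")
    deepspeed.utils.set_z3_leaf_modules(model, classes)
    return classes
```

Usage would be `mark_moe_blocks_as_z3_leaves(model)` in place of the direct `set_z3_leaf_modules` call. Raising when nothing matches is a deliberate choice here, since a silent no-op would look exactly like the original hang.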

@jeinsong did you use Accelerate + DeepSpeed + SFTTrainer? Could you share your configuration? I have the same problem, but the workaround suggested in the PR did not solve it for me.
