When training Qwen3-30B-A3B-Instruct-2507 with DeepSpeed ZeRO-3, the training progress bar gets stuck after the model and data are loaded.

by lynn0999 - opened Sep 7

Discussion

lynn0999

Sep 7

As mentioned in the title, has anyone else run into a similar issue?

jeinsong

Sep 10

•

edited Sep 10

Same here, with 2 nodes (H100 x 8), HF Trainer, and DeepSpeed ZeRO-3.
No errors with the dense model (Qwen3 32B), but training deadlocks at the first step with Qwen3-30B-A3B-Instruct-2507.
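
A likely explanation, offered as an assumption rather than something confirmed in this thread: under ZeRO-3 every parameter fetch is a collective all-gather, and a sparse MoE block may only run the experts that its local tokens are routed to, so different ranks can end up issuing different collectives and wait on each other forever. A dense model runs the same modules on every rank, which would explain why it is unaffected. The toy sketch below is pure Python with no DeepSpeed involved; `gather_sequence` and the routing lists are made up for illustration.

```python
def gather_sequence(routed_experts, num_experts, leaf):
    """Toy model of the parameter gathers one rank issues for one MoE block.

    routed_experts: expert ids chosen for this rank's local tokens (hypothetical data).
    leaf: if True, the whole block is fetched at once, independent of routing.
    """
    if leaf:
        # Every rank gathers all experts in the same fixed order.
        return list(range(num_experts))
    # Otherwise only the experts that actually received tokens are gathered.
    return sorted(set(routed_experts))


# Two ranks see different tokens, so the router picks different experts.
rank0_routing = [0, 3, 3, 5]
rank1_routing = [1, 3, 6, 6]

# Per-expert gathering: the sequences differ, so the collectives cannot pair up.
print(gather_sequence(rank0_routing, 8, leaf=False))  # [0, 3, 5]
print(gather_sequence(rank1_routing, 8, leaf=False))  # [1, 3, 6]

# Whole-block gathering: both ranks issue the identical sequence.
print(gather_sequence(rank0_routing, 8, leaf=True) ==
      gather_sequence(rank1_routing, 8, leaf=True))   # True
```

This only models the ordering problem. It says nothing about the extra memory needed to gather all experts of a block at once.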

jeinsong

Sep 10

•

edited Sep 10

Solved with https://github.com/deepspeedai/DeepSpeed/issues/7461

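    from transformers.models.qwen3_moe.modeling_qwen3_moe import Qwen3MoeSparseMoeBlock
    import deepspeed
    ...
    ...
    # Treat each sparse MoE block as a ZeRO-3 leaf module, so its parameters
    # are gathered for the whole block rather than expert by expert.
    deepspeed.utils.set_z3_leaf_modules(model, [Qwen3MoeSparseMoeBlock])

    # Create the trainer only after the leaf modules have been set.
    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=valid_dataset,
        peft_config=peft_config,
        init_params_for_logging=params_for_log,
    )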

I added this snippet before creating the trainer.
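
For setups where importing the block class directly is inconvenient, the same call can be wrapped in a small helper that looks the class up by name on the loaded model. This is a minimal sketch, not a tested recipe: `find_module_classes` and `mark_moe_blocks_as_leaf` are hypothetical helper names, and the only DeepSpeed call used is `deepspeed.utils.set_z3_leaf_modules` from the snippet above. It assumes `model` is a `torch.nn.Module`, so that `model.modules()` is available.

```python
from typing import Iterable, List


def find_module_classes(modules: Iterable[object], class_name: str) -> List[type]:
    """Return the distinct classes named `class_name` among the given modules."""
    found: List[type] = []
    for module in modules:
        cls = type(module)
        if cls.__name__ == class_name and cls not in found:
            found.append(cls)
    return found


def mark_moe_blocks_as_leaf(model, class_name: str = "Qwen3MoeSparseMoeBlock") -> List[type]:
    """Mark every block of the named class as a ZeRO-3 leaf module.

    Call this after loading the model and before building the trainer.
    """
    import deepspeed  # imported lazily so the lookup helper works without it

    leaf_classes = find_module_classes(model.modules(), class_name)
    if not leaf_classes:
        # Fail loudly: silently skipping would leave the original hang in place.
        raise ValueError(f"no module of class {class_name!r} found in the model")
    deepspeed.utils.set_z3_leaf_modules(model, leaf_classes)
    return leaf_classes
```

Usage would be `mark_moe_blocks_as_leaf(model)` in place of the direct `set_z3_leaf_modules` call. Raising when nothing matches makes a renamed or missing block class visible immediately instead of showing up later as the same hang.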

notTyche

Sep 26

@jeinsong Did you use Accelerate + DeepSpeed + SFTTrainer? Could you share your configuration? I have the same problem, but I couldn't solve it with the workaround suggested in the PR.
