Failure due to unnecessary dependency
Trying to install VoxCPM, I get an error:
running build_ext
building '_pywrapfst' extension
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pynini
Failed to build pynini
Why do you use something that requires MSVC? This is the first time I've ever seen a Python-based project that does, and it's preposterous.
We currently use the WeTextProcessing library for text normalization, but this library depends on pynini, which sometimes fails to build successfully. We are in the process of replacing it with a text normalization library that does not depend on pynini. We've developed a solution and are now in the final stages of testing it. In the meantime, you can try cloning the latest code from https://github.com/OpenBMB/VoxCPM to see if that resolves the issue. We will release a new version of the library once testing is successful.
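For reference, a minimal install-from-source sequence following that suggestion (assuming the repository is a standard pip-installable package):

git clone https://github.com/OpenBMB/VoxCPM
cd VoxCPM
pip install .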
Thank you for the answer and thank you for your effort!
I followed your suggestion, and after installing missing modules, I'm now stuck at:
Traceback (most recent call last):
File "F:\mars5\voxCMP_dwnld.py", line 4, in <module>
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
File "M:\VENV2\lib\site-packages\voxcpm\core.py", line 80, in from_pretrained
return cls(
File "M:\VENV2\lib\site-packages\voxcpm\core.py", line 28, in __init__
from .zipenhancer import ZipEnhancer
File "M:\VENV2\lib\site-packages\voxcpm\zipenhancer.py", line 13, in <module>
from modelscope.pipelines import pipeline
File "M:\VENV2\lib\site-packages\modelscope\pipelines\__init__.py", line 4, in <module>
from .base import Pipeline
File "M:\VENV2\lib\site-packages\modelscope\pipelines\base.py", line 16, in <module>
from modelscope.msdatasets import MsDataset
File "M:\VENV2\lib\site-packages\modelscope\msdatasets\__init__.py", line 2, in <module>
from modelscope.msdatasets.ms_dataset import MsDataset
File "M:\VENV2\lib\site-packages\modelscope\msdatasets\ms_dataset.py", line 25, in <module>
from modelscope.msdatasets.utils.hf_datasets_util import load_dataset_with_ctx
File "M:\VENV2\lib\site-packages\modelscope\msdatasets\utils\hf_datasets_util.py", line 31, in <module>
from datasets.load import (
ImportError: cannot import name 'HubDatasetModuleFactoryWithoutScript' from 'datasets.load' (M:\VENV2\lib\site-packages\datasets\load.py)
I had to install datasets manually with pip install datasets; did I do something wrong here? I'm trying to run the Basic Usage example.
Maybe your datasets version is too high; please downgrade your datasets library: pip install "datasets>=2,<4"
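To double-check which version actually ended up installed, you can run:

python -c "import datasets; print(datasets.__version__)"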
Thank you again, it did work (plus a few more packages to install manually). I finally got the output.
Just to be sure: the inference uses CPU only, and it says:
modelscope - INFO - cuda is not available, using cpu instead.
Is that OK? I've reinstalled torchaudio with --index-url https://download.pytorch.org/whl/cu118, but it didn't help.
Also, voice cloning doesn't seem to work with just an audio clip? I used the same speech clip that worked in other TTS modes, but here it doesn't sound similar.
For the inference device issue, please ensure that python -c "import torch; print(torch.cuda.is_available())" outputs True. Under normal circumstances, inference should be performed using CUDA.
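If that prints False, the torch build itself is CPU-only: reinstalling torchaudio alone is not enough, because torch also has to come from the CUDA index. A generic example (pick the CUDA version matching your driver):

pip install --force-reinstall torch torchaudio --index-url https://download.pytorch.org/whl/cu118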
Regarding the voice cloning issue, did you provide the corresponding prompt text for the prompt audio? If not, voice cloning falls back to plain voice synthesis. If you provided the correct parameters and it still didn't work, please send us your reference audio so we can reproduce the problem and find the root cause.
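For completeness, a minimal voice-cloning call based on the Basic Usage example (the file names are placeholders, and the parameter names should be checked against the current README):

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

# Voice cloning needs both the reference clip and its transcription;
# with the audio alone, generation falls back to plain synthesis.
wav = model.generate(
    text="Text to speak in the cloned voice.",
    prompt_wav_path="reference.wav",                     # placeholder reference clip
    prompt_text="Exact transcription of reference.wav",  # placeholder transcription
)
sf.write("output.wav", wav, 16000)  # sample rate assumed from the model card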
Unfortunately, when I switched to torch with CUDA, it required Triton, which is not available on Windows.
If voice cloning requires a text transcription too, that's a pity, because other models work without it. Audio-only cloning lets me use a clip in Japanese, for example, which would be difficult for me to transcribe. That works well in VibeVoice and some other models, despite them not being trained on Japanese text.
Oh, and it seems voice cloning doesn't even work on CPU and requires CUDA; it errors out as soon as a text transcription is provided.
@notafraud Hi, we've just released an updated version of our inference library. We've added a feature that catches torch.compile errors: when torch.compile isn't supported, the library now falls back to native PyTorch code for inference.
You can try the latest version and see if you're now able to use CUDA for inference.
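The fallback pattern presumably looks something like this (a sketch of the general technique, not the library's actual code):

import torch

def maybe_compile(module, example_input):
    # torch.compile is lazy: backend errors (e.g. a missing Triton) only
    # surface on the first forward pass, so trigger one inside the try block.
    try:
        compiled = torch.compile(module)
        compiled(example_input)  # warm-up forward forces compilation
        return compiled
    except Exception:
        # Fall back to native (eager) PyTorch execution.
        return module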
Thank you, but I don't think it works fully.
On torch with CUDA, I get:
Traceback (most recent call last):
File "F:\mars5\voxCMP_dwnld.py", line 4, in <module>
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
File "M:\VENV2\lib\site-packages\voxcpm\core.py", line 80, in from_pretrained
return cls(
File "M:\VENV2\lib\site-packages\voxcpm\core.py", line 33, in __init__
self.tts_model.generate(
File "M:\VENV2\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "M:\VENV2\lib\site-packages\voxcpm\model\voxcpm.py", line 263, in generate
latent_pred, pred_audio_feat = self.inference(
File "M:\VENV2\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "M:\VENV2\lib\site-packages\voxcpm\model\voxcpm.py", line 545, in inference
pred_feat = self.feat_decoder(
File "M:\VENV2\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "M:\VENV2\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "M:\VENV2\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "M:\VENV2\lib\site-packages\voxcpm\modules\locdit\unified_cfm.py", line 65, in forward
return self.solve_euler(z, t_span=t_span, mu=mu, cond=cond, cfg_value=cfg_value, use_cfg_zero_star=use_cfg_zero_star)
File "M:\VENV2\lib\site-packages\voxcpm\modules\locdit\unified_cfm.py", line 118, in solve_euler
dphi_dt = self.estimator(x_in, mu_in, t_in, cond_in, dt_in)
File "M:\VENV2\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "M:\VENV2\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "M:\VENV2\lib\site-packages\torch\_dynamo\eval_frame.py", line 663, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
File "M:\VENV2\lib\site-packages\torch\_inductor\scheduler.py", line 3957, in create_backend
raise TritonMissing(inspect.currentframe())
torch._inductor.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
As I've mentioned, Triton is not available on Windows. I've also tried upgrading to cu121, but the result is the same.
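A possible interim workaround (untested here, based on PyTorch's TORCHDYNAMO_DISABLE switch, not something the maintainers suggested) is to disable dynamo before torch is imported:

import os
os.environ["TORCHDYNAMO_DISABLE"] = "1"  # must be set before "import torch"
import torch
from voxcpm import VoxCPM  # rest of the script unchanged; inference runs in eager mode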
On torch without CUDA the model starts analyzing the input clip, but then I get:
2025-09-17 20:48:12,651 - modelscope - INFO - cuda is not available, using cpu instead.
Warm up VoxCPMModel...
100%|████████████████████████████████████████████████████████| 10/10 [00:06<00:00, 1.61it/s]
inputs:(1, 112000)
decode_do_segement
padding: 16000
inputs after padding:(1, 128000)
current_idx: 120000 100.00%
M:\VENV2_cpu\lib\site-packages\torchaudio\_backend\utils.py:213: UserWarning: In 2.9, this function's implementation will be changed to use ``torchaudio.load_with_torchcodec`` under the hood. Some parameters like ``normalize``, ``format``, ``buffer_size``, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's decoder instead: https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder.
warnings.warn(
M:\VENV2_cpu\lib\site-packages\torchaudio\_backend\utils.py:337: UserWarning: In 2.9, this function's implementation will be changed to use ``torchaudio.save_with_torchcodec`` under the hood. Some parameters like ``format``, ``encoding``, ``bits_per_sample``, ``buffer_size``, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's encoder instead: https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.encoders.AudioEncoder
warnings.warn(
Traceback (most recent call last):
File "F:\mars5\voxCMP_dwnld.py", line 6, in <module>
wav = model.generate(
File "M:\VENV2_cpu\lib\site-packages\voxcpm\core.py", line 135, in generate
fixed_prompt_cache = self.tts_model.build_prompt_cache(
File "M:\VENV2_cpu\lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "M:\VENV2_cpu\lib\site-packages\voxcpm\model\voxcpm.py", line 320, in build_prompt_cache
audio_feat = self.audio_vae.encode(audio.cuda(), self.sample_rate).cpu()
File "M:\VENV2_cpu\lib\site-packages\torch\cuda\__init__.py", line 403, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
For Torch + CPU, this only happens if I provide a text transcription together with the audio clip. For Torch + CUDA, it cannot generate at all, even without an audio clip.
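The CPU-side failure looks like the hardcoded .cuda() call at voxcpm/model/voxcpm.py line 320 in the traceback above. A device-agnostic sketch of the general fix (not the maintainers' actual patch), assuming audio_vae is a regular nn.Module:

# inside build_prompt_cache
device = next(self.audio_vae.parameters()).device  # wherever the VAE actually lives
audio_feat = self.audio_vae.encode(audio.to(device), self.sample_rate).cpu()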
Just tried commit 1a46c5d1ad724097200d1b3fa713e1baad595cf8 from https://github.com/OpenBMB/VoxCPM; it works correctly with CUDA now, including voice cloning (with a text transcription). I'm going to test it more and will come back with a review.
Thank you very much for your effort!