Failure due to unnecessary dependency
Trying to install VoxCPM, I get an error:
running build_ext
building '_pywrapfst' extension
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pynini
Failed to build pynini
Why do you use something that requires MSVC? This is the first time I've ever seen a Python-based project that does, and it's preposterous.
We currently use the WeTextProcessing library for text normalization, but this library depends on pynini, which sometimes fails to build successfully. We are in the process of replacing it with a text normalization library that does not depend on pynini. We've developed a solution and are now in the final stages of testing it. In the meantime, you can try cloning the latest code from https://github.com/OpenBMB/VoxCPM to see if that resolves the issue. We will release a new version of the library once testing is successful.
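For reference, a minimal install-from-source sequence following that suggestion (assuming the repository is a standard pip-installable package):

git clone https://github.com/OpenBMB/VoxCPM
cd VoxCPM
pip install .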
Thank you for the answer and thank you for your effort!
I followed your suggestion, and after installing missing modules, I'm now stuck at:
Traceback (most recent call last):
File "F:\mars5\voxCMP_dwnld.py", line 4, in <module>
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
File "M:\VENV2\lib\site-packages\voxcpm\core.py", line 80, in from_pretrained
return cls(
File "M:\VENV2\lib\site-packages\voxcpm\core.py", line 28, in __init__
from .zipenhancer import ZipEnhancer
File "M:\VENV2\lib\site-packages\voxcpm\zipenhancer.py", line 13, in <module>
from modelscope.pipelines import pipeline
File "M:\VENV2\lib\site-packages\modelscope\pipelines\__init__.py", line 4, in <module>
from .base import Pipeline
File "M:\VENV2\lib\site-packages\modelscope\pipelines\base.py", line 16, in <module>
from modelscope.msdatasets import MsDataset
File "M:\VENV2\lib\site-packages\modelscope\msdatasets\__init__.py", line 2, in <module>
from modelscope.msdatasets.ms_dataset import MsDataset
File "M:\VENV2\lib\site-packages\modelscope\msdatasets\ms_dataset.py", line 25, in <module>
from modelscope.msdatasets.utils.hf_datasets_util import load_dataset_with_ctx
File "M:\VENV2\lib\site-packages\modelscope\msdatasets\utils\hf_datasets_util.py", line 31, in <module>
from datasets.load import (
ImportError: cannot import name 'HubDatasetModuleFactoryWithoutScript' from 'datasets.load' (M:\VENV2\lib\site-packages\datasets\load.py)
I had to install datasets manually with pip install datasets; did I do something wrong here? I'm trying to run the Basic Usage example.
Maybe your datasets version is too high; please downgrade your datasets library: pip install "datasets>=2,<4"
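To double-check which version actually ended up installed, you can run:

python -c "import datasets; print(datasets.__version__)"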
Thank you again, it did work (plus a few more packages to install manually). I finally got the output.
Just to be sure: the inference uses CPU only, and it says:
modelscope - INFO - cuda is not available, using cpu instead.
Is that OK? I've reinstalled torchaudio with --index-url https://download.pytorch.org/whl/cu118, but it didn't help.
Also, voice cloning doesn't seem to work with just an audio clip? I used the same speech clip that worked in other TTS modes, but here it doesn't sound similar.
For the inference device issue, please ensure that python -c "import torch; print(torch.cuda.is_available())" outputs True. Under normal circumstances, inference should be performed using CUDA.
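If that prints False, the torch build itself is CPU-only: reinstalling torchaudio alone is not enough, because torch also has to come from the CUDA index. A generic example (pick the CUDA version matching your driver):

pip install --force-reinstall torch torchaudio --index-url https://download.pytorch.org/whl/cu118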
Regarding the voice cloning issue, did you provide the corresponding prompt text for the prompt audio? If not, voice cloning falls back to plain voice synthesis. If you provided the correct parameters and it still didn't work, please send us your reference audio so we can reproduce the problem and find the root cause.
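For completeness, a minimal voice-cloning call based on the Basic Usage example (the file names are placeholders, and the parameter names should be checked against the current README):

import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

# Voice cloning needs both the reference clip and its transcription;
# with the audio alone, generation falls back to plain synthesis.
wav = model.generate(
    text="Text to speak in the cloned voice.",
    prompt_wav_path="reference.wav",                     # placeholder reference clip
    prompt_text="Exact transcription of reference.wav",  # placeholder transcription
)
sf.write("output.wav", wav, 16000)  # sample rate assumed from the model card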
Unfortunately, when I switched to torch with CUDA, it required Triton, which is not available on Windows.
If voice cloning requires a text transcription too, that's a pity, because other models work without it. Audio-only cloning lets me use a clip in Japanese, for example, which would be difficult for me to transcribe. That works well in VibeVoice and some other models, despite them not being trained on Japanese text.
Oh, and it seems voice cloning doesn't even work on CPU and requires CUDA; it errors out as soon as a text transcription is provided.
@notafraud Hi, we've just released an updated version of our inference library. We've added a feature that catches torch.compile errors: when torch.compile isn't supported, the library now falls back to native PyTorch code for inference.
You can try the latest version and see if you're now able to use CUDA for inference.
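The fallback pattern presumably looks something like this (a sketch of the general technique, not the library's actual code):

import torch

def maybe_compile(module, example_input):
    # torch.compile is lazy: backend errors (e.g. a missing Triton) only
    # surface on the first forward pass, so trigger one inside the try block.
    try:
        compiled = torch.compile(module)
        compiled(example_input)  # warm-up forward forces compilation
        return compiled
    except Exception:
        # Fall back to native (eager) PyTorch execution.
        return module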
Thank you, but I don't think it works fully.
On torch with CUDA, I get:
Traceback (most recent call last):
File "F:\mars5\voxCMP_dwnld.py", line 4, in <module>
model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")
File "M:\VENV2\lib\site-packages\voxcpm\core.py", line 80, in from_pretrained
return cls(
File "M:\VENV2\lib\site-packages\voxcpm\core.py", line 33, in __init__
self.tts_model.generate(
File "M:\VENV2\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "M:\VENV2\lib\site-packages\voxcpm\model\voxcpm.py", line 263, in generate
latent_pred, pred_audio_feat = self.inference(
File "M:\VENV2\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "M:\VENV2\lib\site-packages\voxcpm\model\voxcpm.py", line 545, in inference
pred_feat = self.feat_decoder(
File "M:\VENV2\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "M:\VENV2\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "M:\VENV2\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "M:\VENV2\lib\site-packages\voxcpm\modules\locdit\unified_cfm.py", line 65, in forward
return self.solve_euler(z, t_span=t_span, mu=mu, cond=cond, cfg_value=cfg_value, use_cfg_zero_star=use_cfg_zero_star)
File "M:\VENV2\lib\site-packages\voxcpm\modules\locdit\unified_cfm.py", line 118, in solve_euler
dphi_dt = self.estimator(x_in, mu_in, t_in, cond_in, dt_in)
File "M:\VENV2\lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "M:\VENV2\lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "M:\VENV2\lib\site-packages\torch\_dynamo\eval_frame.py", line 663, in _fn
raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
File "M:\VENV2\lib\site-packages\torch\_inductor\scheduler.py", line 3957, in create_backend
raise TritonMissing(inspect.currentframe())
torch._inductor.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton
Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
As I've mentioned, Triton is not available on Windows. I've also tried upgrading to cu121, but the result is the same.
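A possible interim workaround (untested here, based on PyTorch's TORCHDYNAMO_DISABLE switch, not something the maintainers suggested) is to disable dynamo before torch is imported:

import os
os.environ["TORCHDYNAMO_DISABLE"] = "1"  # must be set before "import torch"
import torch
from voxcpm import VoxCPM  # rest of the script unchanged; inference runs in eager mode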
On torch without CUDA the model starts analyzing the input clip, but then I get:
2025-09-17 20:48:12,651 - modelscope - INFO - cuda is not available, using cpu instead.
Warm up VoxCPMModel...
100%|████████████████████████████████████████████████████████| 10/10 [00:06<00:00, 1.61it/s]
inputs:(1, 112000)
decode_do_segement
padding: 16000
inputs after padding:(1, 128000)
current_idx: 120000 100.00%
M:\VENV2_cpu\lib\site-packages\torchaudio\_backend\utils.py:213: UserWarning: In 2.9, this function's implementation will be changed to use ``torchaudio.load_with_torchcodec`` under the hood. Some parameters like ``normalize``, ``format``, ``buffer_size``, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's decoder instead: https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder.
warnings.warn(
M:\VENV2_cpu\lib\site-packages\torchaudio\_backend\utils.py:337: UserWarning: In 2.9, this function's implementation will be changed to use ``torchaudio.save_with_torchcodec`` under the hood. Some parameters like ``format``, ``encoding``, ``bits_per_sample``, ``buffer_size``, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's encoder instead: https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.encoders.AudioEncoder
warnings.warn(
Traceback (most recent call last):
File "F:\mars5\voxCMP_dwnld.py", line 6, in <module>
wav = model.generate(
File "M:\VENV2_cpu\lib\site-packages\voxcpm\core.py", line 135, in generate
fixed_prompt_cache = self.tts_model.build_prompt_cache(
File "M:\VENV2_cpu\lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "M:\VENV2_cpu\lib\site-packages\voxcpm\model\voxcpm.py", line 320, in build_prompt_cache
audio_feat = self.audio_vae.encode(audio.cuda(), self.sample_rate).cpu()
File "M:\VENV2_cpu\lib\site-packages\torch\cuda\__init__.py", line 403, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
For Torch + CPU, this only happens if I provide a text transcription together with the audio clip. For Torch + CUDA, it cannot generate at all, even without an audio clip.
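The CPU-side failure looks like the hardcoded .cuda() call at voxcpm/model/voxcpm.py line 320 in the traceback above. A device-agnostic sketch of the general fix (not the maintainers' actual patch), assuming audio_vae is a regular nn.Module:

# inside build_prompt_cache
device = next(self.audio_vae.parameters()).device  # wherever the VAE actually lives
audio_feat = self.audio_vae.encode(audio.to(device), self.sample_rate).cpu()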
Just tried commit 1a46c5d1ad724097200d1b3fa713e1baad595cf8 from https://github.com/OpenBMB/VoxCPM; it works correctly with CUDA now, including voice cloning (with a text transcription). I'm going to test it more and will come back with a review.
Thank you very much for your effort!