HTDemucs (Core ML)

Core ML conversion of Meta's HTDemucs for music source separation on Apple Silicon. Splits a stereo 44.1 kHz mix into 4 stems (drums, bass, other, vocals).

Produced by the pocket-tts-macos conversion pipeline. FP32 throughout, CPU-only dispatch.

Spec

Input mix [1, 2, 343980] fp32 (stereo, 44.1 kHz, 7.8 s chunk)
Output sources [1, 8, 343980] fp32
Channel layout drums(0,1), bass(2,3), other(4,5), vocals(6,7)
Precision fp32 (FLOAT16 catastrophically destroys spectral reconstruction; do not retry)
Compute units .all (CPU+GPU) on macOS 15+; GPU dispatch verified working without watchdog issues. The ANE compiler rejects the loop-unrolled ISTFT graph and falls back to CPU, so .cpuAndNeuralEngine is no faster than CPU-only; use .all to let the GPU take the bulk of the work.
Throughput 3.9× faster on GPU vs CPU: warm chunk latency 0.18 s (.all) vs 0.70 s (.cpuOnly) on M1 Ultra. End-to-end ~42× real-time at .all, ~11.5× at .cpuOnly.
File size ~405 MB unzipped, ~395 MB zipped
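
The input shape is fixed at 343980 samples per channel (7.8 s × 44 100 Hz), so longer audio has to be fed in pieces. The sketch below is one possible way to do that, assuming non-overlapping chunks with a zero-padded tail; this card does not specify a chunking strategy, and `splitIntoChunks` is a hypothetical helper, not part of the model or the conversion pipeline.

```swift
import Foundation

/// Fixed chunk length the model expects: 7.8 s at 44.1 kHz.
let chunkLength = 343_980

/// Splits one channel of samples into chunks of exactly `chunkLength` samples.
/// Assumption: chunks do not overlap and the last one is zero-padded.
func splitIntoChunks(_ samples: [Float], chunkLength: Int) -> [[Float]] {
    guard chunkLength > 0, !samples.isEmpty else { return [] }
    var chunks: [[Float]] = []
    var start = 0
    while start < samples.count {
        let end = min(start + chunkLength, samples.count)
        var chunk = Array(samples[start..<end])
        // Zero-pad the final chunk so every chunk matches the model input length.
        if chunk.count < chunkLength {
            chunk.append(contentsOf: [Float](repeating: 0, count: chunkLength - chunk.count))
        }
        chunks.append(chunk)
        start = end
    }
    return chunks
}

// A 20 s mono signal at 44.1 kHz (882 000 samples) needs three chunks.
let twentySeconds = [Float](repeating: 0.5, count: 882_000)
let chunks = splitIntoChunks(twentySeconds, chunkLength: chunkLength)
```

Hard chunk boundaries can produce audible seams; overlapping chunks with a cross-fade is a common alternative, at the cost of extra inference calls.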

Per-stem fidelity (vs PyTorch fp32 reference)

Validated on a canonical 7.8 s chunk, a real-music listening test (70 s hip-hop track), and a real dialog-over-music listening test (256 s film scene). Numeric results for the canonical chunk:

Stem SI-SDR (dB)
drums 73.66
bass 62.56
other 77.91
vocals 68.47

Conversion gate was ≥25 dB per stem; all four pass by ~37 dB or more.
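
For readers who want to reproduce this kind of check, here is a minimal sketch of SI-SDR, assuming the standard scale-invariant definition (project the estimate onto the reference, then compare target energy to residual energy). The card does not show the validation suite's exact implementation, so `siSDR` is illustrative only.

```swift
import Foundation

/// Scale-invariant SDR in dB between a reference and an estimate of equal length.
/// Assumes the reference is not all zeros (otherwise the scale is undefined).
func siSDR(reference: [Float], estimate: [Float]) -> Double {
    precondition(reference.count == estimate.count && !reference.isEmpty)
    var dot = 0.0, refEnergy = 0.0
    for (r, e) in zip(reference, estimate) {
        dot += Double(r) * Double(e)
        refEnergy += Double(r) * Double(r)
    }
    // Optimal scaling of the reference that best explains the estimate.
    let scale = dot / refEnergy
    var targetEnergy = 0.0, noiseEnergy = 0.0
    for (r, e) in zip(reference, estimate) {
        let target = scale * Double(r)
        let noise = Double(e) - target
        targetEnergy += target * target
        noiseEnergy += noise * noise
    }
    return 10 * log10(targetEnergy / noiseEnergy)
}

// Toy signal: the error is orthogonal to the reference and 100x smaller in amplitude.
let reference: [Float] = [1, 0, -1, 0]
let estimate: [Float] = [1, 0.01, -1, 0.01]
let score = siSDR(reference: reference, estimate: estimate)
let passesGate = score >= 25  // same threshold as the conversion gate above
```

In the toy case the energy ratio is about 10^4, so the score lands near 40 dB and clears the 25 dB gate.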

Usage (Swift)

let config = MLModelConfiguration()
config.computeUnits = .cpuOnly  // conservative default; the Spec above reports .all (CPU+GPU) working on macOS 15+ at about 3.9× lower latency

let model = try MLModel(
    contentsOf: URL(fileURLWithPath: "htdemucs.mlpackage"),
    configuration: config
)

// Input: MLMultiArray, shape [1, 2, 343980], fp32
// Output: MLMultiArray, shape [1, 8, 343980], fp32
// Stems are paired stereo channels in order: drums, bass, other, vocals.
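
Pulling a single stem out of the output follows directly from the channel layout. The sketch below works on a plain `[Float]` copy of the output and assumes the `[1, 8, frames]` data is contiguous in row-major order; with a real `MLMultiArray`, check its `strides` before relying on that. `Stem` and `stereoPair` are hypothetical names introduced for illustration.

```swift
import Foundation

/// Stem order from the channel layout; each stem owns two adjacent channels.
enum Stem: Int, CaseIterable {
    case drums = 0, bass, other, vocals
}

/// Returns the (left, right) channels of one stem from a flat output buffer.
/// Assumption: `flat` holds the [1, 8, frames] output contiguously, row-major.
func stereoPair(for stem: Stem, in flat: [Float], frames: Int) -> (left: [Float], right: [Float]) {
    precondition(flat.count == 8 * frames, "expected 8 channels of `frames` samples")
    let leftStart = (2 * stem.rawValue) * frames   // channel 2k
    let rightStart = leftStart + frames            // channel 2k + 1
    return (Array(flat[leftStart..<rightStart]),
            Array(flat[rightStart..<rightStart + frames]))
}

// Tiny demonstration with 4 frames per channel; channel c is filled with the value c.
let frames = 4
let flat: [Float] = (0..<8).flatMap { c in [Float](repeating: Float(c), count: frames) }
let vocals = stereoPair(for: .vocals, in: flat, frames: frames)
```

With the real model, `frames` would be 343980 and `flat` would be filled from the `sources` output.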

Conversion details

The conversion script applies six monkey-patches before tracing. Each addresses a specific coremltools 9 gap and can be removed as upstream adds native support. Together they preserve eager-PyTorch numerics to within 5e-7 max abs diff.

  • patch_segment_to_float: Fraction(39, 5) → 7.8 (tracer rejects int(Fraction))
  • patch_spec_slice_on_real: slice real tensor before reconstructing complex (coremltools 9's slice rejects complex64)
  • patch_multihead_attention: explicit SDPA path (avoids _native_multi_head_attention)
  • patch_mask_no_view_as_complex: torch.complex(real, imag) instead of view_as_complex + stride-slice
  • patch_ispec_manual_istft: manual ISTFT via torch.fft.irfft + loop-unrolled overlap-add (torch.istft not registered)
  • patch_crosstransformer_pos_embedding: deterministic shift=0 (avoids random.randrange in graph)
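
The overlap-add step named in patch_ispec_manual_istft can be illustrated on its own. The real patch is PyTorch code operating on irfft output; the Swift sketch below only shows the summation arithmetic, under the assumption of equal-length frames placed `hop` samples apart, and it deliberately omits the synthesis window and envelope normalization a full ISTFT needs. `overlapAdd` is a hypothetical helper.

```swift
import Foundation

/// Sums equal-length frames into one signal, with frame f starting at sample f * hop.
/// Windowing and normalization are intentionally left out of this sketch.
func overlapAdd(frames: [[Float]], hop: Int) -> [Float] {
    guard let frameLength = frames.first?.count, hop > 0 else { return [] }
    var output = [Float](repeating: 0, count: hop * (frames.count - 1) + frameLength)
    for (f, frame) in frames.enumerated() {
        precondition(frame.count == frameLength, "all frames must have the same length")
        for (i, sample) in frame.enumerated() {
            output[f * hop + i] += sample
        }
    }
    return output
}

// Three frames of four ones with hop 2: interior samples are covered by two frames.
let summed = overlapAdd(frames: [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]], hop: 2)
```

In a traced graph the outer loop over frames has a fixed trip count, which is why the patch can unroll it; the same unrolled graph is what the Spec says the ANE compiler rejects.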

Full conversion source + validation suite: https://github.com/slaughters85j/pocket-tts-macos (conversion subproject)

Attribution

This is a derivative work of two MIT-licensed upstreams:

  1. HTDemucs. Copyright (c) Meta Platforms, Inc. and affiliates. https://github.com/facebookresearch/demucs (MIT License).
  2. john-rocky/CoreML-Models, convert_htdemucs.py. Copyright (c) john-rocky. https://github.com/john-rocky/CoreML-Models (MIT License). This reference conversion script is the basis for the working surgical-patch version.

License

MIT. See LICENSE in the upstream conversion repo.
