HTDemucs (Core ML)
Core ML conversion of Meta's HTDemucs for music source separation on Apple Silicon. Splits a stereo 44.1 kHz mix into 4 stems (drums, bass, other, vocals).
Produced by the pocket-tts-macos conversion pipeline. FP32 throughout; GPU dispatch via .all is recommended (see Spec).
Spec
| Property | Value |
|---|---|
| Input | mix [1, 2, 343980] fp32 (stereo, 44.1 kHz, 7.8 s chunk) |
| Output | sources [1, 8, 343980] fp32 |
| Channel layout | drums(0,1), bass(2,3), other(4,5), vocals(6,7) |
| Precision | fp32 (FLOAT16 catastrophically destroys spectral reconstruction; do not retry) |
| Compute units | .all (CPU+GPU) on macOS 15+; GPU dispatch verified working without watchdog issues. The ANE compiler rejects the loop-unrolled ISTFT graph and falls back to CPU, so .cpuAndNeuralEngine is no faster than CPU-only; use .all to let the GPU take the bulk of the work. |
| Throughput | 3.9× faster on GPU vs CPU: warm chunk latency 0.18 s (.all) vs 0.70 s (.cpuOnly) on M1 Ultra. End-to-end ~42× real-time at .all, ~11.5× at .cpuOnly. |
| File size | ~405 MB unzipped, ~395 MB zipped |
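The channel layout in the table reduces to simple index arithmetic. The sketch below assumes the `sources` output has already been copied into a flat, row-major `[Float]` of length 8 × frames; the `Stem` enum and `stereoChannels` helper are hypothetical names for illustration, not part of the model package.

```swift
/// Stem order taken from the "Channel layout" row: drums, bass, other, vocals.
enum Stem: Int, CaseIterable {
    case drums = 0, bass, other, vocals

    /// Indices of the left and right channels inside the 8-channel output.
    var channelIndices: (left: Int, right: Int) {
        (2 * rawValue, 2 * rawValue + 1)
    }
}

/// Extracts one stem from a flat, row-major copy of the [1, 8, frames] output.
/// Assumption: channel c occupies sources[c * frames ..< (c + 1) * frames].
func stereoChannels(of stem: Stem, in sources: [Float], frames: Int)
    -> (left: [Float], right: [Float]) {
    precondition(sources.count == 8 * frames, "expected 8 channels of `frames` samples")
    let (l, r) = stem.channelIndices
    let left = Array(sources[(l * frames)..<((l + 1) * frames)])
    let right = Array(sources[(r * frames)..<((r + 1) * frames)])
    return (left, right)
}
```

For a full chunk, `frames` would be 343980, and `stereoChannels(of: .vocals, ...)` would return channels 6 and 7.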
Per-stem fidelity (vs PyTorch fp32 reference)
Validated on a canonical 7.8 s chunk, plus two listening tests: real music (70 s hip-hop
track) and real dialog over music (256 s film scene). Numeric results:
| Stem | SI-SDR (dB) |
|---|---|
| drums | 73.66 |
| bass | 62.56 |
| other | 77.91 |
| vocals | 68.47 |
Conversion gate was ≥25 dB per stem; all four pass by ~37 dB or more.
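For readers unfamiliar with the metric, here is a minimal sketch of SI-SDR in its common form: the estimate is projected onto the reference, and the projected energy is compared with the residual energy. This is an assumption about the definition; the exact variant used by the validation suite (mean removal, per-channel averaging) is not stated here, and `siSDR` is a hypothetical helper name.

```swift
import Foundation

/// Scale-invariant SDR in dB between an estimate and a reference signal.
/// Returns +infinity when the estimate is an exact scaled copy of the reference.
func siSDR(estimate: [Float], reference: [Float]) -> Double {
    precondition(estimate.count == reference.count && !reference.isEmpty)
    var dot = 0.0, referenceEnergy = 0.0
    for (e, r) in zip(estimate, reference) {
        dot += Double(e) * Double(r)
        referenceEnergy += Double(r) * Double(r)
    }
    // Optimal gain that projects the estimate onto the reference.
    let scale = dot / referenceEnergy
    var targetEnergy = 0.0, noiseEnergy = 0.0
    for (e, r) in zip(estimate, reference) {
        let target = scale * Double(r)
        let noise = Double(e) - target
        targetEnergy += target * target
        noiseEnergy += noise * noise
    }
    return 10 * log10(targetEnergy / noiseEnergy)
}
```

Under this definition, a stem would clear the conversion gate when `siSDR(estimate: coreMLStem, reference: pyTorchStem) >= 25`.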
Usage (Swift)
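import CoreML

let config = MLModelConfiguration()
config.computeUnits = .all  // recommended; see the "Compute units" row in Spec

// MLModel loads compiled models (.mlmodelc), so compile the .mlpackage first.
let packageURL = URL(fileURLWithPath: "htdemucs.mlpackage")
let compiledURL = try MLModel.compileModel(at: packageURL)
let model = try MLModel(contentsOf: compiledURL, configuration: config)

// Input "mix": MLMultiArray, shape [1, 2, 343980], fp32
// Output "sources": MLMultiArray, shape [1, 8, 343980], fp32
// Stems are paired stereo channels in order: drums, bass, other, vocals.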
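A hedged sketch of running a single chunk follows. The feature names `mix` and `sources` come from the Spec table; `packChunk` and `separate` are hypothetical helpers, and zero-padding a short final chunk is an assumption rather than a documented behaviour of the pipeline. The Core ML part is guarded so the packing helper can be exercised on its own.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

let chunkFrames = 343_980  // 7.8 s at 44.1 kHz, from the Spec table

/// Packs one stereo chunk into the planar layout [left..., right...] that a
/// row-major [1, 2, frames] array uses, zero-padding chunks shorter than `frames`.
func packChunk(left: [Float], right: [Float], frames: Int = chunkFrames) -> [Float] {
    precondition(left.count == right.count && left.count <= frames)
    let pad = [Float](repeating: 0, count: frames - left.count)
    return left + pad + right + pad
}

#if canImport(CoreML)
/// Runs one packed chunk through the model and returns the "sources" output.
func separate(packed: [Float], with model: MLModel) throws -> MLMultiArray? {
    let shape: [NSNumber] = [1, 2, NSNumber(value: packed.count / 2)]
    let input = try MLMultiArray(shape: shape, dataType: .float32)
    // Assumption: a freshly allocated float32 MLMultiArray is contiguous.
    let pointer = input.dataPointer.bindMemory(to: Float.self, capacity: packed.count)
    for (i, sample) in packed.enumerated() { pointer[i] = sample }
    let features = try MLDictionaryFeatureProvider(
        dictionary: ["mix": MLFeatureValue(multiArray: input)]
    )
    let output = try model.prediction(from: features)
    return output.featureValue(for: "sources")?.multiArrayValue
}
#endif
```

With a loaded `model`, a call would look like `try separate(packed: packChunk(left: l, right: r), with: model)`. When reading the result, check its `strides` rather than assuming a contiguous layout.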
Conversion details
The conversion script applies six monkey-patches before tracing. Each addresses a specific coremltools 9 gap and can be removed as upstream adds native support. Together they preserve eager-PyTorch numerics to within 5e-7 max abs diff.
- `patch_segment_to_float`: `Fraction(39, 5)` → `7.8` (tracer rejects `int(Fraction)`)
- `patch_spec_slice_on_real`: slice real tensor before reconstructing complex (coremltools 9's `slice` rejects complex64)
- `patch_multihead_attention`: explicit SDPA path (avoids `_native_multi_head_attention`)
- `patch_mask_no_view_as_complex`: `torch.complex(real, imag)` instead of `view_as_complex` + stride-slice
- `patch_ispec_manual_istft`: manual ISTFT via `torch.fft.irfft` + loop-unrolled overlap-add (`torch.istft` not registered)
- `patch_crosstransformer_pos_embedding`: deterministic shift=0 (avoids `random.randrange` in graph)
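To illustrate the overlap-add step that the manual ISTFT patch relies on, here is a minimal sketch. It is not the patch itself, which is PyTorch code in the conversion script; it is written in Swift only to match the usage example. It assumes each frame is already the inverse real FFT of one STFT column; `overlapAdd`, its parameters, and the normalisation by the summed squared window are illustrative assumptions.

```swift
/// Overlap-add: sums windowed time-domain frames at multiples of `hop`, then
/// divides by the summed squared window so overlapping regions keep unit gain.
func overlapAdd(frames: [[Float]], window: [Float], hop: Int) -> [Float] {
    guard let frameLength = frames.first?.count, hop > 0 else { return [] }
    precondition(window.count == frameLength)
    let length = frameLength + hop * (frames.count - 1)
    var signal = [Float](repeating: 0, count: length)
    var envelope = [Float](repeating: 0, count: length)
    // When a loop like this is traced, each iteration becomes separate graph
    // ops, which is what "loop-unrolled" refers to.
    for (index, frame) in frames.enumerated() {
        precondition(frame.count == frameLength)
        let start = index * hop
        for n in 0..<frameLength {
            signal[start + n] += frame[n] * window[n]
            envelope[start + n] += window[n] * window[n]
        }
    }
    // Avoid dividing by ~0 where the window envelope vanishes.
    return zip(signal, envelope).map { $1 > 1e-8 ? $0 / $1 : 0 }
}
```

With a rectangular window of length 4 and a hop of 2, two overlapping frames cut from a six-sample signal reconstruct that signal exactly.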
Full conversion source + validation suite: https://github.com/slaughters85j/pocket-tts-macos (conversion subproject)
Attribution
This is a derivative work of two MIT-licensed upstreams:
- HTDemucs: Copyright (c) Meta Platforms, Inc. and affiliates. https://github.com/facebookresearch/demucs (MIT License).
- john-rocky/CoreML-Models, convert_htdemucs.py: Copyright (c) john-rocky. https://github.com/john-rocky/CoreML-Models (MIT License). The reference conversion script is the basis for the working surgical-patch version.
License
MIT. See LICENSE in the upstream conversion repo.