HTDemucs (Core ML)

Core ML conversion of Meta's HTDemucs for music source separation on Apple Silicon. Splits a stereo 44.1 kHz mix into 4 stems (drums, bass, other, vocals).

Produced by the pocket-tts-macos conversion pipeline. FP32 throughout; `.all` is the recommended compute-unit setting (see Spec).

Spec


Input	`mix` `[1, 2, 343980]` fp32 (stereo, 44.1 kHz, 7.8 s chunk)
Output	`sources` `[1, 8, 343980]` fp32
Channel layout	drums(0,1), bass(2,3), other(4,5), vocals(6,7)
Precision	fp32 (FLOAT16 catastrophically destroys spectral reconstruction; do not retry)
Compute units	`.all` on macOS 15+ (in practice the work runs on CPU+GPU). GPU dispatch is verified working without watchdog issues. The ANE compiler rejects the loop-unrolled ISTFT graph and falls back to CPU, so `.cpuAndNeuralEngine` is no faster than `.cpuOnly`; use `.all` so the GPU takes the bulk of the work.
Throughput	3.9× faster on GPU vs CPU: warm chunk latency 0.18 s (`.all`) vs 0.70 s (`.cpuOnly`) on M1 Ultra. End-to-end ~42× real-time at `.all`, ~11.5× at `.cpuOnly`.
File size	~405 MB unzipped, ~395 MB zipped
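
Because the input shape is fixed at 343980 samples per channel, anything longer than 7.8 s has to be fed to the model in chunks. The sketch below shows one simple way to do that, under stated assumptions: audio is already decoded to planar Float buffers at 44.1 kHz, chunks do not overlap, and the last chunk is zero-padded. `chunkStereo` and `chunkLength` are hypothetical names, not part of the model or the conversion pipeline. Non-overlapping chunks may leave audible seams at boundaries, so overlapping chunks with a crossfade may be preferable in practice.

```swift
import Foundation

/// Samples per channel the model expects: 7.8 s * 44_100 Hz.
let chunkLength = 343_980

/// Splits planar stereo audio into fixed-length chunks laid out as
/// [left samples..., right samples...], i.e. a row-major [1, 2, length] tensor.
/// The final chunk is zero-padded. Non-overlapping chunking is an assumption.
func chunkStereo(left: [Float], right: [Float], length: Int = chunkLength) -> [[Float]] {
    precondition(left.count == right.count, "channels must have equal length")
    precondition(length > 0, "chunk length must be positive")
    var chunks: [[Float]] = []
    var start = 0
    while start < left.count {
        let end = min(start + length, left.count)
        // Start from silence so a short final chunk is padded with zeros.
        var chunk = [Float](repeating: 0, count: 2 * length)
        chunk.replaceSubrange(0..<(end - start), with: left[start..<end])
        chunk.replaceSubrange(length..<(length + end - start), with: right[start..<end])
        chunks.append(chunk)
        start = end
    }
    return chunks
}
```

Each returned chunk can be copied into a `[1, 2, 343980]` input array; after separation, the padded tail of the last chunk should be trimmed back to the original length.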

Per-stem fidelity (vs PyTorch fp32 reference)

Validated on a 7.8 s canonical chunk, a real-music ear-test (~70 s hip-hop track), and a real-dialog-over-music ear-test (~256 s film scene). Numeric results:

Stem	SI-SDR (dB)
drums	73.66
bass	62.56
other	77.91
vocals	68.47

Conversion gate was ≥25 dB per stem; all four pass by ~37 dB or more.
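
For readers who want to reproduce the gate check, here is a minimal sketch of SI-SDR using its common definition: project the estimate onto the reference, then compare the energy of that projection with the energy of the residual. Whether the validation suite removes the mean first or averages over channels is not stated here, so treat those details as assumptions. `siSDR` is a hypothetical helper name.

```swift
import Foundation

/// Scale-invariant SDR in dB between an estimate and a reference signal.
/// No mean removal is applied (an assumption; some definitions zero-mean first).
func siSDR(estimate: [Float], reference: [Float]) -> Double {
    precondition(estimate.count == reference.count, "signals must have equal length")
    var dot = 0.0
    var referenceEnergy = 0.0
    for (e, r) in zip(estimate, reference) {
        dot += Double(e) * Double(r)
        referenceEnergy += Double(r) * Double(r)
    }
    // Optimal scaling of the reference that best explains the estimate.
    let scale = dot / referenceEnergy
    var targetEnergy = 0.0
    var residualEnergy = 0.0
    for (e, r) in zip(estimate, reference) {
        let target = scale * Double(r)
        let residual = Double(e) - target
        targetEnergy += target * target
        residualEnergy += residual * residual
    }
    return 10 * log10(targetEnergy / residualEnergy)
}
```

With the Core ML output for one stem as `estimate` and the PyTorch fp32 output as `reference`, the gate corresponds to checking `siSDR(estimate:reference:) >= 25`.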

Usage (Swift)

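import CoreML

let config = MLModelConfiguration()
config.computeUnits = .all  // recommended on macOS 15+; see "Compute units" in Spec

// MLModel(contentsOf:) expects a compiled model (.mlmodelc), so compile the
// .mlpackage first unless Xcode already did so at build time.
// (An async variant of compileModel(at:) also exists.)
let packageURL = URL(fileURLWithPath: "htdemucs.mlpackage")
let compiledURL = try MLModel.compileModel(at: packageURL)
let model = try MLModel(contentsOf: compiledURL, configuration: config)

// Input `mix`: MLMultiArray, shape [1, 2, 343980], fp32
// Output `sources`: MLMultiArray, shape [1, 8, 343980], fp32
// Stems are paired stereo channels in order: drums, bass, other, vocals.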
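
To make the channel layout concrete, the following sketch copies one stereo stem out of a `[1, 8, N]` output array. `stemNames` and `extractStem` are hypothetical helpers, not part of the model. Indexing goes through the `MLMultiArray` subscript, which is slow but does not assume any particular memory strides; a pointer-based copy would be faster if the layout is known to be contiguous.

```swift
import CoreML

/// Stem order from the channel layout: each stem owns two adjacent channels.
let stemNames = ["drums", "bass", "other", "vocals"]

/// Copies the left/right channels of one stem out of a [1, 8, N] array.
/// Returns nil for an unknown stem name or an unexpected shape.
func extractStem(_ name: String, from sources: MLMultiArray) -> (left: [Float], right: [Float])? {
    guard let index = stemNames.firstIndex(of: name),
          sources.shape.count == 3,
          sources.shape[1].intValue == 2 * stemNames.count else { return nil }
    let samples = sources.shape[2].intValue
    var left = [Float](repeating: 0, count: samples)
    var right = [Float](repeating: 0, count: samples)
    let leftChannel = NSNumber(value: 2 * index)
    let rightChannel = NSNumber(value: 2 * index + 1)
    for t in 0..<samples {
        let time = NSNumber(value: t)
        left[t] = sources[[0, leftChannel, time]].floatValue
        right[t] = sources[[0, rightChannel, time]].floatValue
    }
    return (left, right)
}
```

For example, `extractStem("vocals", from: sources)` reads channels 6 and 7 of the output.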

Conversion details

The conversion script applies six monkey-patches before tracing — each addresses a specific coremltools 9 gap, removable as upstream adds native support. They preserve eager-PyTorch numerics to within 5e-7 max abs diff.

patch_segment_to_float — Fraction(39, 5) → 7.8 (tracer rejects int(Fraction))
patch_spec_slice_on_real — slice real tensor before reconstructing complex (coremltools 9's slice rejects complex64)
patch_multihead_attention — explicit SDPA path (avoids _native_multi_head_attention)
patch_mask_no_view_as_complex — torch.complex(real, imag) instead of view_as_complex + stride-slice
patch_ispec_manual_istft — manual ISTFT via torch.fft.irfft + loop-unrolled overlap-add (torch.istft not registered)
patch_crosstransformer_pos_embedding — deterministic shift=0 (avoids random.randrange in graph)
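
The overlap-add step named in patch_ispec_manual_istft can be illustrated in isolation. This is a plain Swift sketch of the general technique, not the PyTorch patch itself: it sums equal-length frames at hop-spaced offsets and omits the inverse FFT, windowing, and window-envelope normalization that a full ISTFT needs. `overlapAdd` is a hypothetical name.

```swift
/// Overlap-add: sums equal-length frames into one signal,
/// with frame k starting at sample offset k * hop.
func overlapAdd(frames: [[Float]], hop: Int) -> [Float] {
    guard let frameLength = frames.first?.count, hop > 0 else { return [] }
    var output = [Float](repeating: 0, count: hop * (frames.count - 1) + frameLength)
    for (k, frame) in frames.enumerated() {
        precondition(frame.count == frameLength, "frames must share one length")
        for (n, sample) in frame.enumerated() {
            // Overlapping regions accumulate contributions from adjacent frames.
            output[k * hop + n] += sample
        }
    }
    return output
}
```

Because the number of frames is fixed by the 343980-sample input, a traced graph can unroll the outer loop into a fixed sequence of slice-and-add operations, which is what "loop-unrolled" refers to above.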

Full conversion source + validation suite: https://github.com/slaughters85j/pocket-tts-macos (conversion subproject)

Attribution

This is a derivative work of two MIT-licensed upstreams:

HTDemucs — Copyright (c) Meta Platforms, Inc. and affiliates. https://github.com/facebookresearch/demucs — MIT License.
john-rocky/CoreML-Models — convert_htdemucs.py — Copyright (c) john-rocky. https://github.com/john-rocky/CoreML-Models — MIT License. The reference conversion script is the basis for the working surgical-patch version.

License

MIT. See LICENSE in the upstream conversion repo.
