Open agents on AWS SageMaker AI with open models from the Hugging Face Hub!
> Deploy an open model from the Hugging Face Hub on SageMaker AI
> Connect the deployed model to Strands Agents
> Add built-in and custom tools for tool calling
> Expose external capabilities through MCP integration
> Bonus: talk to your agent and visualize traces with Gradio
Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!
TL;DR: MoEs can be misleading to reason about from active parameters alone. Each token only activates a subset of experts, but the serving setup still has to account for the full resident memory footprint.
> hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
> Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
> Active params isn't the same as memory footprint, especially for sparse architectures
> Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
> KV cache can still dominate depending on context length, batch size, and concurrency
> Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
> Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
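To make the active-vs-resident distinction concrete, here is a rough back-of-the-envelope sketch. It is not how hf-mem computes its numbers; the `MoEConfig` fields, the default values, and the gated-MLP assumption (three projections per expert) are all hypothetical, picked only so the arithmetic is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # Hypothetical values for illustration; not taken from any specific model.
    num_layers: int = 32
    hidden_size: int = 4096
    expert_intermediate_size: int = 14336
    num_experts: int = 8            # experts resident per layer
    experts_per_token: int = 2      # experts each token is routed to
    num_kv_heads: int = 8
    head_dim: int = 128
    base_params: int = 1_500_000_000  # assumed: attention, embeddings, router, norms
    bytes_per_param: int = 2          # bf16 / fp16 weights
    bytes_per_kv: int = 2             # bf16 / fp16 KV cache entries


def expert_params(cfg: MoEConfig) -> int:
    # Assumes a gated MLP expert: gate, up, and down projections.
    return 3 * cfg.hidden_size * cfg.expert_intermediate_size


def memory_breakdown(cfg: MoEConfig, context_len: int, batch_size: int) -> dict:
    per_expert = expert_params(cfg)
    resident_expert_params = cfg.num_layers * cfg.num_experts * per_expert
    active_expert_params = cfg.num_layers * cfg.experts_per_token * per_expert
    # KV cache: keys and values (factor 2) for every layer, KV head, and token.
    kv_bytes = (
        2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim
        * context_len * batch_size * cfg.bytes_per_kv
    )
    return {
        "base_weights_bytes": cfg.base_params * cfg.bytes_per_param,
        "routed_experts_bytes": resident_expert_params * cfg.bytes_per_param,
        "active_experts_bytes": active_expert_params * cfg.bytes_per_param,
        "kv_cache_bytes": kv_bytes,
    }


if __name__ == "__main__":
    breakdown = memory_breakdown(MoEConfig(), context_len=8192, batch_size=1)
    for name, size in breakdown.items():
        print(f"{name}: {size / 2**30:.2f} GiB")
```

With these made-up numbers the resident expert weights are four times the per-token active expert weights (8 experts loaded, 2 routed), which is the gap that gets lost when you only look at active params. Scaling `context_len` or `batch_size` shows how quickly the KV cache term grows.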
Earlier this month, Apple introduced Simple Self-Distillation: a fine-tuning method that improves models on coding tasks just by sampling from the model and training on its own outputs with plain cross-entropy.
And… it's already supported in TRL, built by Kashif Rasul. You can really feel the pace of development in the team.
Paper by Ruixiang ZHANG, He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang at Apple
How it works: the model generates completions at a training-time temperature (T_train) with top_k/top_p truncation, then is fine-tuned on those completions with plain cross-entropy. No labels or verifier needed.
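The two ingredients in that recipe can be sketched at the single-token level in plain Python. This is a toy illustration, not the TRL implementation: the function names are made up for this example, the logits are a hand-written list rather than model outputs, and applying top_k before top_p is just one common ordering.

```python
import math
import random


def truncated_distribution(logits, temperature, top_k, top_p):
    """Return {token_id: prob} after temperature scaling, top-k, then top-p."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over survivors


def sample_token(logits, temperature, top_k, top_p, rng):
    dist = truncated_distribution(logits, temperature, top_k, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def cross_entropy(logits, target):
    """Plain cross-entropy of one target token under untempered logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_z - logits[target]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical next-token logits
    token = sample_token(toy_logits, temperature=1.0, top_k=3, top_p=0.9, rng=rng)
    print("sampled token:", token)
    print("training loss on it:", cross_entropy(toy_logits, token))
```

In the real method this runs over whole sampled completions, and the loss is averaged over their tokens; the point of the sketch is only that the training target is the model's own (tempered, truncated) sample, with nothing external checking it.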
One neat insight from the paper: T_train and T_eval compose into an effective T_eff = T_train × T_eval, so a broad band of configs works well. Even very noisy samples still help.
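One way to see where the product comes from is a small numerical check under idealized assumptions: no top_k/top_p truncation, and a "student" that fits the sampled distribution perfectly. Neither holds exactly in practice, and this is not the paper's own derivation, just a sanity check with made-up logits.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits and temperatures.
logits = [2.0, 0.5, -1.0, -3.0]
t_train, t_eval = 1.5, 0.6

# Idealized student: its logits are the log-probs of the base model at T_train,
# i.e. it has perfectly absorbed the distribution it was trained on.
student_logits = [math.log(p) for p in softmax(logits, t_train)]

# Decoding the student at T_eval ...
student_at_eval = softmax(student_logits, t_eval)

# ... matches decoding the base model at T_train * T_eval.
base_at_product = softmax(logits, t_train * t_eval)

max_gap = max(abs(a - b) for a, b in zip(student_at_eval, base_at_product))

if __name__ == "__main__":
    print("largest probability gap:", max_gap)
```

The match is exact up to floating-point error because the student's logits are z / T_train minus a constant, and softmax ignores constant shifts, so dividing again by T_eval leaves z / (T_train × T_eval).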