blhf

community
Activity Feed

alvarobartt
posted an update 9 days ago
Open agents on AWS SageMaker AI with open models from the Hugging Face Hub!

> Deploy an open model from the Hugging Face Hub on SageMaker AI
> Connect the deployed model to Strands Agents
> Add built-in and custom tools for tool calling
> Expose external capabilities through MCP integration
> Bonus: talk to your agent and visualize traces with Gradio

https://alvarobartt.com/agents-on-aws-sagemaker
alvarobartt
posted an update 13 days ago
Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR: MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.

🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
🏗️ Dense models usually load and use most weights on every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active params aren't the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving

Check the repository at https://github.com/alvarobartt/hf-mem
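To make the resident-versus-active distinction concrete, here is a minimal Python sketch. It is not how hf-mem computes its breakdown; the `MoEShape` class, the helper names, and every number are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEShape:
    """Hypothetical MoE shape; all numbers passed in below are illustrative."""

    base_params: int          # shared weights (attention, embeddings, router)
    params_per_expert: int    # weights of one routed expert, summed over all layers
    num_experts: int          # routed experts that must stay resident in memory
    experts_per_token: int    # routed experts each token actually uses


def resident_params(shape: MoEShape) -> int:
    """Parameters that must be loaded, regardless of routing."""
    return shape.base_params + shape.num_experts * shape.params_per_expert


def active_params(shape: MoEShape) -> int:
    """Parameters touched by a single token in one forward pass."""
    return shape.base_params + shape.experts_per_token * shape.params_per_expert


def weights_gb(num_params: int, bytes_per_param: float = 2.0) -> float:
    """Weight memory in GB (1e9 bytes); 2 bytes per parameter matches bf16/fp16."""
    return num_params * bytes_per_param / 1e9


# Assumed example: 2B shared parameters, 16 experts of 1B each, 2 experts per token.
shape = MoEShape(
    base_params=2_000_000_000,
    params_per_expert=1_000_000_000,
    num_experts=16,
    experts_per_token=2,
)
print(f"resident: {weights_gb(resident_params(shape)):.1f} GB")  # 36.0 GB
print(f"active:   {weights_gb(active_params(shape)):.1f} GB")    # 8.0 GB
```

In this made-up configuration the accelerator has to hold 36 GB of weights even though each token only touches 8 GB of them, which is the gap the post warns about. KV cache comes on top of both figures.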
alvarobartt
posted an update 3 months ago
Learn how to deploy Microsoft Research VibeVoice ASR on Microsoft Azure Foundry with Hugging Face to generate rich audio transcriptions with Who, When, and What! 💥

> 🕒 60-minute single-pass processing, no chunking or stitching
> 👤 Customized hotwords to guide recognition on domain-specific content
> 📝 Rich transcription: joint ASR + diarization + timestamping in one pass
> 🌍 50+ languages with automatic detection and code-switching support
> 🤗 Deployed on Microsoft Foundry via an OpenAI-compatible Chat Completions API

https://huggingface.co/docs/microsoft-azure/foundry/examples/deploy-vibevoice-asr
alvarobartt
posted an update 4 months ago
💥 hf-mem v0.4.1 now also estimates KV cache memory requirements for any context length and batch size with the --experimental flag!

uvx hf-mem --model-id ... --experimental will automatically pull the required information from the Hugging Face Hub to include the KV cache estimation, when applicable.

💡 Alternatively, you can also set the --max-model-len, --batch-size and --kv-cache-dtype arguments (à la vLLM) manually if preferred.
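For intuition about how context length, batch size, and KV cache dtype interact, here is a small Python sketch of the common back-of-the-envelope estimate for standard multi-head or grouped-query attention. This is an assumption about the general formula, not a description of what hf-mem does internally; the function name and the model shape in the example are hypothetical.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_model_len: int,
    batch_size: int,
    bytes_per_element: int = 2,
) -> int:
    """Rough KV cache size in bytes for plain (non-sliding-window) attention.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_element is 2 for fp16/bf16 and 1 for an 8-bit cache dtype.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * max_model_len * batch_size


# Assumed example shape: 32 layers, 8 KV heads, head dimension 128.
full = kv_cache_bytes(32, 8, 128, max_model_len=8192, batch_size=1)
fp8 = kv_cache_bytes(32, 8, 128, max_model_len=8192, batch_size=1, bytes_per_element=1)
print(full / 1024**3, "GiB with a 16-bit cache")  # 1.0
print(fp8 / 1024**3, "GiB with an 8-bit cache")   # 0.5
```

The estimate scales linearly with both context length and batch size, so doubling either doubles the cache. Architectures with sliding-window or latent attention would need a different formula.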
lysandre
posted an update 9 months ago
We're kick-starting the process of Transformers v5, with @ArthurZ and @cyrilvallez !

v5 should be significant: we're using it as a milestone for performance optimizations, saner defaults, and a much cleaner code base worthy of 2025.

Fun fact: v4.0.0-rc-1 came out on Nov 19, 2020, nearly five years ago!
philschmid
posted an update about 1 year ago
Gemini 2.5 Flash is here! We're excited to launch our first hybrid reasoning Gemini model. In 2.5 Flash, developers can turn thinking off.

**TL;DR:**
- 🧠 Controllable "Thinking" with a thinking budget of up to 24k tokens
- 🌌 1 million token multimodal input context for text, image, video, audio, and PDF
- 🛠️ Function calling, structured output, Google Search & code execution
- 🏦 $0.15 per 1M input tokens; $0.60 (thinking off) or $3.50 (thinking on) per 1M output tokens (thinking tokens are billed as output tokens)
- 💡 Knowledge cutoff: January 2025
- 🚀 Rate limits: free tier at 10 RPM, 500 req/day
- 🏅 Outperforms 2.0 Flash on every benchmark

Try it ⬇️
https://aistudio.google.com/prompts/new_chat?model=gemini-2.5-flash-preview-04-17
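To see what the two output prices mean for a single request, here is a small Python sketch of the arithmetic. The rates are the ones quoted in the post; the function and constant names are hypothetical, and current pricing may differ.

```python
# Prices quoted in the post, in USD per 1M tokens.
INPUT_USD_PER_M = 0.15
OUTPUT_USD_PER_M = 0.60           # thinking off
OUTPUT_THINKING_USD_PER_M = 3.50  # thinking on


def request_cost_usd(input_tokens: int, output_tokens: int, thinking: bool) -> float:
    """Estimate the cost of one request.

    With thinking on, output_tokens should include the thinking tokens,
    since the post says they are billed as output tokens.
    """
    output_rate = OUTPUT_THINKING_USD_PER_M if thinking else OUTPUT_USD_PER_M
    return (input_tokens * INPUT_USD_PER_M + output_tokens * output_rate) / 1_000_000


# Assumed workload: 200k input tokens and 50k output tokens.
print(round(request_cost_usd(200_000, 50_000, thinking=False), 4))
print(round(request_cost_usd(200_000, 50_000, thinking=True), 4))
```

For output-heavy workloads the thinking rate dominates the bill, which is why being able to switch thinking off matters for cost.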
philschmid
posted an update about 1 year ago
Gemini 2.5 Pro, thinking by default! We're excited to launch our best Gemini model for reasoning, multimodal and coding yet! #1 on LMSYS, Humanity's Last Exam, AIME, GPQA and more!

TL;DR:
- 💻 Best Gemini coding model yet, particularly for web development (excels on LiveCodeBench)
- 🧠 Default "Thinking" with up to 64k output tokens
- 🌌 1 million token multimodal input context for text, image, video, audio, and PDF
- 🛠️ Function calling, structured output, Google Search & code execution
- 🏆 #1 on LMArena & SOTA on AIME, GPQA, Humanity's Last Exam
- 💡 Knowledge cutoff: January 2025
- 🤗 Available for free as Experimental in AI Studio, the Gemini API & the Gemini app
- 🚀 Rate limits: free tier at 2 RPM, 50 req/day

Try it ⬇️

https://aistudio.google.com/?model=gemini-2.5-pro-exp-03-25
alvarobartt
posted an update over 1 year ago
🔥 Agents can do anything! @microsoft Research just announced the release of Magma 8B!

Magma is a new Visual Language Model (VLM) with 8B parameters for multi-modal agents designed to handle complex interactions across virtual and real environments; and it's MIT licensed!

Magma comes with exciting new features such as:
- Introduces the Set-of-Mark and Trace-of-Mark techniques for fine-tuning
- Leverages a large amount of unlabeled video data to learn spatial-temporal grounding and planning
- Strong generalization and the ability to be fine-tuned for other agentic tasks
- SOTA on multi-modal benchmarks spanning UI navigation, robotics manipulation, image / video understanding, and spatial understanding and reasoning
- Generates goal-driven visual plans and actions for agentic use cases

Model: microsoft/Magma-8B
Technical Report: Magma: A Foundation Model for Multimodal AI Agents (2502.13130)
lysandre
posted an update over 1 year ago
SmolVLM-2 and SigLIP-2 are now part of transformers in dedicated releases!

They're added on top of the v4.49.0 release, and can be installed from the following tags: v4.49.0-SmolVLM-2 and v4.49.0-SigLIP-2.

This marks a new beginning for the release process of transformers. For the past five years, we've been doing monthly releases featuring many models (v4.49.0, the latest release, features 9 new architectures).

Starting with SmolVLM-2 & SigLIP2, we'll now additionally release tags supporting new models on a stable branch. These models are therefore directly available for use by installing from the tag itself. These tags will continue to be updated with fixes applied to these models.

Going forward, continue expecting software releases following semantic versioning: v4.50.0 will have ~10 new architectures compared to v4.49.0, as well as a myriad of new features, improvements and bug fixes. Accompanying these software releases, we'll release tags offering brand new models as fast as possible, to make them accessible to all immediately.
ArthurZ
posted an update over 1 year ago
Native tensor parallel has landed in transformers!!! https://github.com/huggingface/transformers/pull/34184 thanks a lot to the torch team for their support!

Contributions are welcome to support more models! 🔥
alvarobartt
posted an update almost 2 years ago
🤗 Serving Meta Llama 3.1 405B on Google Cloud is now possible via the Hugging Face Deep Learning Containers (DLCs) for Text Generation Inference (TGI)

In this post, we showcase how to deploy https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 on an A3 instance with 8 x H100 GPUs on Vertex AI

Thanks to the Hugging Face DLCs for TGI and Google Cloud Vertex AI, deploying a high-performance text generation container for serving Large Language Models (LLMs) has never been easier. And we're not going to stop here: stay tuned as we enable more experiences to build AI with open models on Google Cloud!

Read the full post at https://huggingface.co/blog/llama31-on-vertex-ai
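As a rough sanity check on the hardware mentioned above, the following Python sketch compares weight memory against the combined accelerator memory of one node. It assumes 80 GB per H100 and a round 405B parameter count, neither of which is stated in the post, and it ignores KV cache, activations, and runtime overhead; the helper name is hypothetical.

```python
def weights_gb(num_params: int, bytes_per_param: int) -> float:
    """Weight memory in GB (1e9 bytes) for a given storage width per parameter."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 405_000_000_000  # assumed round figure for Llama 3.1 405B
NODE_MEMORY_GB = 8 * 80       # assumption: 8 x H100 with 80 GB each

fp8_gb = weights_gb(NUM_PARAMS, bytes_per_param=1)
bf16_gb = weights_gb(NUM_PARAMS, bytes_per_param=2)

print(f"FP8 weights:  {fp8_gb:.0f} GB, fits: {fp8_gb < NODE_MEMORY_GB}")
print(f"BF16 weights: {bf16_gb:.0f} GB, fits: {bf16_gb < NODE_MEMORY_GB}")
print(f"Headroom with FP8: {NODE_MEMORY_GB - fp8_gb:.0f} GB")
```

Under these assumptions the FP8 weights leave a couple of hundred GB for KV cache and activations on a single node, while 16-bit weights alone would exceed the node's memory.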
alvarobartt
posted an update about 2 years ago
🔥 Prometheus 2 was recently released by KAIST AI as an alternative evaluator that closely mirrors both human and GPT-4 evaluation, surpassing the former Prometheus!

prometheus-eval/prometheus-7b-v2.0
prometheus-eval/prometheus-8x7b-v2.0

🌬️ Fine-tuned on top of mistralai/Mistral-7B-Instruct-v0.2 and mistralai/Mixtral-8x7B-Instruct-v0.1
🗂️ The datasets used for fine-tuning have been publicly released, i.e. prometheus-eval/Feedback-Collection and prometheus-eval/Preference-Collection
🤝🏻 Unified LM evaluator for absolute (a single prompt-completion pair) and relative (two completions for a given prompt) grading, thanks to model merging
❌ No longer needs a mandatory reference / golden answer, although one can still be provided optionally
🔝 Surpasses the former version of Prometheus, and has a high correlation with human, GPT-4, and Claude 3 Opus scores when evaluating LMs
📝 Apache 2.0 license

Long story short, an amazing job from KAIST AI closing the gap between open LLM evaluators and proprietary, bigger models!

This week at Argilla, we decided to add a new task to use Prometheus 2 as an LLM evaluator using distilabel, so we implemented PrometheusEval.

😱 Using PrometheusEval running their 7B variant with vLLM on a single L40 on top of HuggingFaceH4/instruction-dataset, we got the 327 existing prompt-completion pairs evaluated and pushed to the Hub in less than 2 minutes!

Find the generated dataset and the code at distilabel-internal-testing/instruction-dataset-prometheus
alvarobartt
posted an update about 2 years ago
🦫 We have just released argilla/Capybara-Preferences in collaboration with KAIST AI (@JW17, @nlee-208) and Hugging Face (@lewtun)

A new synthetic preference dataset built using distilabel on top of the awesome LDJnr/Capybara from @LDJnr

The current dataset combines the already generated alternative completions from argilla/distilabel-capybara-dpo-7k-binarized, while also adding the remaining ones using the same approach!

Here are some key features on how we built it:

- 🧹 Duplicate removal, keeping the conversation except for the last assistant response, and some slight pre-processing

- 🤖 Generation of alternative completions for the existing conversations (last turn only) with: mlabonne/NeuralBeagle14-7B, argilla/notus-7b-v1, and teknium/OpenHermes-2.5-Mistral-7B

- 👨🏻‍🏫 Running UltraFeedback via GPT-4 to generate the critique, i.e. ratings and rationales, for the last assistant responses

- 🎉 Finally, we selected the chosen and rejected responses based on their UltraFeedback score, and applied some slight post-processing!

Sounds simple, right? Start building your own synthetic datasets with https://github.com/argilla-io/distilabel already!
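The last step, turning rated completions into a preference pair, can be sketched in plain Python. This is not the distilabel API nor necessarily the exact rule used for the dataset: picking the highest-rated completion as chosen, the lowest-rated as rejected, and dropping ties are all assumptions, as are the function name and the dict layout.

```python
from typing import Dict, List, Optional


def to_preference_pair(completions: List[dict]) -> Optional[Dict[str, str]]:
    """Build one chosen/rejected pair from completions rated by a judge.

    Each completion is a dict with a "text" and a numeric "rating".
    Returns None when no preference can be derived (fewer than two
    completions, or the best and worst ratings are tied).
    """
    if len(completions) < 2:
        return None
    ranked = sorted(completions, key=lambda c: c["rating"], reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if chosen["rating"] == rejected["rating"]:
        return None  # a tie carries no preference signal
    return {"chosen": chosen["text"], "rejected": rejected["text"]}


# Made-up ratings for three alternative completions of the same conversation.
example = [
    {"text": "answer A", "rating": 3.0},
    {"text": "answer B", "rating": 5.0},
    {"text": "answer C", "rating": 1.0},
]
print(to_preference_pair(example))
```

Dropping tied examples trades dataset size for a cleaner signal; other selection rules, such as sampling the rejected response at random from the lower-rated ones, are equally plausible.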
philschmid
posted an update about 2 years ago
New state-of-the-art open LLM! 🚀 Databricks just released DBRX, a 132B MoE trained on 12T tokens. It claims to surpass OpenAI GPT-3.5 and to be competitive with Google Gemini 1.0 Pro. 🤯

TL;DR
🧮 132B MoE with 16 experts, 4 of them active during generation
🪟 32,000-token context window
📈 Outperforms open LLMs on common benchmarks, including MMLU
🚀 Up to 2x faster inference than Llama 2 70B
💻 Trained on 12T tokens
🔡 Uses the GPT-4 tokenizer
📜 Custom license, commercially usable

Collection: databricks/dbrx-6601c0852a0cdd3c59f71962
Demo: https://huggingface.co/spaces/databricks/dbrx-instruct

Kudos to the team at Databricks and MosaicML for this strong release in the open community! 🤗
Titus-von-Koeller
posted an update about 2 years ago
🔥 Level up your model training w/ GaLore + Transformers for SOTA results on consumer-grade hardware!

⬇️ 82.5% less optimizer state memory footprint without performance degradation, by projecting the gradients of the weight matrices into a low-rank subspace.

👩🏿‍💻 Install via pip install "transformers>=4.39.0" galore-torch. #ProudlyGpuPoor

The integration of GaLore into the training of large language models (LLMs) marks a significant advancement in the field of deep learning, particularly in terms of memory efficiency and the democratization of AI research. By allowing for the training of billion-parameter models on consumer-grade hardware, reducing memory footprint in optimizer states, and leveraging advanced projection matrix techniques, GaLore opens new horizons for researchers and practitioners with limited access to high-end computational resources.
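To give a feel for where the optimizer-state savings come from, here is a Python sketch that counts Adam state elements for one weight matrix with and without a low-rank gradient projection. It is a simplified accounting under stated assumptions (one projection matrix plus two projected moment buffers), not the galore-torch implementation; the function names, matrix size, and rank are hypothetical, and the result is a per-matrix figure that does not reproduce the 82.5% quoted above.

```python
def adam_state_elements(rows: int, cols: int) -> int:
    """Standard Adam keeps two moment buffers shaped like the weight matrix."""
    return 2 * rows * cols


def low_rank_state_elements(rows: int, cols: int, rank: int) -> int:
    """Low-rank projection: one projector plus two moments in the projected space.

    The gradient is projected along the smaller dimension, so the projector
    has shape (small, rank) and each moment buffer has shape (rank, large).
    """
    small, large = min(rows, cols), max(rows, cols)
    return small * rank + 2 * rank * large


# Assumed example: a 4096 x 4096 weight matrix with projection rank 128.
full = adam_state_elements(4096, 4096)
projected = low_rank_state_elements(4096, 4096, rank=128)
savings = 1 - projected / full
print(f"{full:,} vs {projected:,} elements, {savings:.1%} fewer")
```

The saving grows as the rank shrinks relative to the matrix dimensions. Layers that are not projected, such as embeddings or norms, keep their full optimizer state, so the whole-model figure is lower than the per-matrix one.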

🔬 Find out more about GaLore and investigate lots of juicy technical details: https://huggingface.co/blog/galore

🤗 Huge thanks to everyone involved ❤️:

• authors: @jiaweizhao @Kyriection @beidic Zhangyang Wang @animakumar @tydsh
• community contributors: @hiyouga @mdouglas and others!
• @ybelkada for taking such swift action in composing and coordinating necessary PRs to get this live at ⚡ speed!

🏗️📈 Super rewarding to see how @timdettmers's work with optimizers is being built upon to achieve even greater heights!

🚧 Actually, work is ongoing to integrate GaLore into bitsandbytes and optimize memory efficiency even further 💪. We'll keep you posted!