Instructions to use google/gemma-4-31B-it-assistant with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-4-31B-it-assistant with Transformers:
# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it-assistant") model = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it-assistant") - Notebooks
- Google Colab
- Kaggle
Issues with multiple tool calls in parallel
I'm running this drafter in combination with NVFP4 version of Gemma4 from Nvidia (using vLLM, nightly).
If I use opencode as a client and if it tries to read multiple files in parallel it will fail to generate proper tool calls for each file. Logically (at least to me) it feels this would be the issue with vLLM and parser used there, but when tried to do the same without speculative model (drafter) it worked every time.
Reading files sequentially works for both combinations, no problem there. It still feels like an issue is on vllm side, but why would it work without drafter model (a lot slower that feels like seq?)...
EDIT: I found out there are same or similar issues opened at vLLM, like this one for example: https://github.com/vllm-project/vllm/issues/41967 or this one https://github.com/vllm-project/vllm/pull/42006