Request for Unquantized Model Weights or Separate LoRA Adapter
Hey @dousery !
The current model weights are distributed with BitsAndBytes quantization (`load_in_4bit=True` with NF4 quantization), which makes them incompatible with conversion tools like llama.cpp's `convert_hf_to_gguf.py`.
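For context, a minimal sketch of how a checkpoint in this 4-bit NF4 format is typically loaded with transformers + bitsandbytes; the repo id below is a placeholder, not the actual model id:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder: the actual fine-tuned repo id on the Hub goes here.
FINETUNED_REPO = "..."

# NF4 4-bit configuration of the kind the checkpoint appears to be saved with.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The resulting state dict holds bitsandbytes-packed tensors, which is exactly
# what convert_hf_to_gguf.py cannot dequantize.
model = AutoModelForCausalLM.from_pretrained(
    FINETUNED_REPO,
    quantization_config=bnb_config,
    device_map="auto",
)
```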
Technical Details:
- When attempting conversion, the script fails with `KeyError: ('bnb', '4bit')`
- The error occurs in `gguf-py/gguf/lazy.py` when encountering BitsAndBytes quantized tensors
- llama.cpp's conversion pipeline doesn't support dequantizing the BitsAndBytes format
Request: Could you please provide one of the following:
- Unquantized weights (FP16/BF16) of the fine-tuned model
- Separate LoRA adapter weights that can be applied to the base `openai/gpt-oss-20b` model
Either option would enable conversion to other formats (GGUF, etc.) for broader inference framework compatibility.
Use Case: Deploying this medical reasoning model with llama.cpp for efficient CPU/GPU inference.
Additional Context: The model card indicates this was LoRA fine-tuned with rank 8 and 3.98M trainable parameters. If the LoRA adapter is available separately, that would be the most space-efficient option.
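For illustration, this is roughly how a separately published adapter would be applied to the unquantized base with PEFT; the adapter repo id is a placeholder:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_REPO = "openai/gpt-oss-20b"
ADAPTER_REPO = "..."  # placeholder: the separately published LoRA adapter repo

# Load the base model unquantized (BF16), then attach the rank-8 LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(BASE_REPO, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(BASE_REPO)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
```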
Thanks!
Hi,
I’m unable to provide either of the requested options. The fine-tuned model was trained and saved in 4-bit quantized format due to hardware constraints, and the LoRA adapters were merged into the full model. Therefore, separate FP16/BF16 weights or standalone LoRA adapters are not available.
Thanks for understanding.
Hey! I updated my model. You can access the LoRA adapter now.
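For anyone following along, a rough sketch (adapter repo id is a placeholder) of merging the published adapter into the unquantized base and exporting BF16 weights that `convert_hf_to_gguf.py` can read:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_REPO = "openai/gpt-oss-20b"
ADAPTER_REPO = "..."  # placeholder: the published LoRA adapter repo id

# Attach the adapter to the unquantized base, then fold the LoRA deltas into
# the base weights so the checkpoint contains plain BF16 tensors.
base = AutoModelForCausalLM.from_pretrained(BASE_REPO, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, ADAPTER_REPO).merge_and_unload()

# Save standard safetensors that convert_hf_to_gguf.py can read.
out_dir = "gpt-oss-20b-medical-merged"
merged.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(BASE_REPO).save_pretrained(out_dir)
```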
Very cool - thx @dousery !