Unable to run with vLLM
When I try to serve the model through vLLM, I get the Pydantic error below. Can you please help me resolve this?

```
(APIServer pid=3244) Value error, Model architectures ['LightOnOCRForConditionalGeneration'] are not supported for now.
```
Hi,
You'll need to use the vLLM nightly build, as support for LightOnOCR is not in the latest release yet.
Try the installation steps in the model card to get the latest nightly build.
Can you please give a clear installation script? I mean, one that uses nightly vLLM, or a clear, feasible workaround.
Thanks.
Hi,
These exact commands are tested and verified to work:

```bash
uv venv --python 3.12 --seed
source .venv/bin/activate

# Install the vLLM nightly build together with triton-kernels
uv pip install -U vllm \
    'triton-kernels @ git+https://github.com/triton-lang/[email protected]#subdirectory=python/triton_kernels' \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly \
    --prerelease=allow

# Start the server
vllm serve lightonai/LightOnOCR-1B-1025 \
    --limit-mm-per-prompt '{"image": 1}' \
    --async-scheduling
```
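Once the server is up, you can sanity-check it through the OpenAI-compatible API that vllm serve exposes. Below is a minimal client sketch, not the reference script from the model card: the port, the `page.png` path, and the image-only message layout are assumptions, so adapt them to your setup.

```python
# Minimal check against the vLLM OpenAI-compatible server started above.
# Assumes the server listens on localhost:8000 and that page.png is a
# rendered page image (e.g. exported from your PDF); both are placeholders.
import base64

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="lightonai/LightOnOCR-1B-1025",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                }
            ],
        }
    ],
    max_tokens=2048,
    temperature=0.0,
)

print(response.choices[0].message.content)
```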
Thanks for the clarification; the model is being recognized now. However, I am still unable to serve it successfully, and I suspect the cause is that my GPU and system specs are too low (I can see some memory-related errors in the logs).
I just want to load and run the model on a single-page PDF for testing, using the reference Python script provided (with a one-page PDF for my use case). What would be the minimum system/GPU requirements to run this successfully?
Thanks
Since the model is only 1B parameters, you should be able to run it with 16 GB of VRAM or even less. You could also try reducing --max-model-len in vllm serve to something like 4096.
Here is an example with Hugging Face Transformers running on Colab.
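For reference, a minimal sketch of that Transformers path might look like the following. It assumes a recent Transformers release that ships the LightOnOCR architecture and uses the generic `AutoProcessor` / `AutoModelForImageTextToText` entry points; the image URL is a placeholder, so treat this as a starting point rather than the official Colab example.

```python
# Minimal sketch of running LightOnOCR through Hugging Face Transformers.
# Assumes a recent Transformers release that includes the LightOnOCR
# architecture; the generic Auto classes and the placeholder image URL
# are assumptions, not the official example from the Colab notebook.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "lightonai/LightOnOCR-1B-1025"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A single rendered page image; replace the URL with your own page.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/page.png"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=1024)

new_tokens = generated[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

If the Auto classes don't resolve the model in your installed version, upgrading Transformers or following the Colab notebook exactly is the safer route.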
Thanks for your inputs. I tried the Transformers integration and it works! (I have yet to try the vLLM approach with your suggestions.)
I have a question about the model's capabilities and how it works: I read your blog post, and since this model was designed and trained specifically to perform 'one-pass OCR', I want to check whether my understanding below is right:
- The model will not produce bounding boxes for the detected text, or any other spatial output, since it was designed to avoid the 'pipeline' approach of traditional OCR systems?
- Is it possible to interact with the document through prompts, or is that not possible at all since the model's downstream task is OCR?
Thanks