Docling OCR output

#38
by InformaticsSolutions - opened

Is the OCR of https://huggingface.co/ibm-granite/granite-docling-258M/blob/main/assets/new_arxiv.png supposed to look like this:

<loc_115><loc_27><loc_388><loc_35>Energy Budget of WASP-121 b from JWST/NIRISS Phase Curve <loc_454><loc_28><loc_462><loc_35>9
<loc_41><loc_42><loc_241><loc_88>while the kernel weights are structured as ( N_slice , N_time ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.
<loc_41><loc_89><loc_241><loc_207>To address this, we follow a similar approach to our sinusoidal fits using emcee , but we increase the total number of steps to 100,000 and use 100 walkers. Naïvely, the fit would include 2 N_slice + 1 parameters: N_slice for the albedo values, N_slice for the emission parameters, and one additional scatter parameter, σ . However, since night-side slices do not contribute to the reflected light component, we exclude these albedo values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of walkers per free parameer. Following Coulombe et al. (2025) we set an upper prior limit of 3/2 on all albedo slices as a fully Lambertian sphere ( A_i = 1) corresponds to a geometric albedo of A_g = 2/3. For thermal emission we impose a uniform prior between 0 and 500 ppm for each slice.
<loc_41><loc_208><loc_241><loc_270>We choose to fit our detrended lightcurves considering 4, 6 and 8 longitudinal slices ( N_slice = 4, 6, 8). However, we show the results of the simplest 4 slice model. As in our previous fits, we conduct an initial run with 25,000 steps (25% of the total run) and use the maximumprobability parameters from this preliminary fit as the starting positions for the final 75,000-step run. We then discard the first 60% of the final run as burn-in.
<loc_73><loc_277><loc_209><loc_284>2.5. Planetary Effective Temperature
<loc_41><loc_287><loc_241><loc_348>Phase curves are the only way to probe thermal emission from the day and nightside of an exoplanet and hence determine its global energy budget (Partmentier & Crossfield 2018). The wavelength range of NIRISS/SOSS covers a large portion of the emitted flux of WASP-121 b (~ 50-83%; see Figure 2), enabling a precise and robust constraint of the planet's energy budget.
<loc_41><loc_349><loc_241><loc_364>We convert the fitted F_p / F_* emission spectra to brightness temperature by wavelength,
<loc_60><loc_368><loc_241><loc_388>T _ { b r i g h t } = \frac { b c } { k \lambda } \cdot \left [ \ln \left ( \frac { 2 b c ^ { 2 } } { \lambda ^ { 5 } B _ { \lambda , p l a n e t } } + 1 \right ) \right ] ^ { - 1 } , \quad ( 1 6 )
<loc_41><loc_392><loc_181><loc_399>where the planet's thermal emission is
<loc_85><loc_404><loc_241><loc_419>B _ { \lambda , , p l a n e t } = \frac { F _ { p } / F _ { * } } { ( R _ { p } / R _ { * } ) ^ { 2 } } \cdot B _ { \lambda , , s t a r } , .
<loc_41><loc_425><loc_241><loc_455>There are many ways of converting brightness temperatures to effective temperature, including the ErrorWeighted Mean (EWM), Power-Weighted mean (PWM) and with a Gaussian Process (Schwartz & Cowan 2015;
<loc_273><loc_50><loc_454><loc_134><line_chart><loc_261><loc_141><loc_462><loc_266>Figure 2. Estimated captured flux of the planet assuming the planet radiates as a blackbody. The captured flux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over all wavelengths, i.e., γ = ∫_{λ_min}^{λ_max} B(λ, T) dλ / ∫_0^∞ B(λ, T) dλ. The captured flux fraction is shown for NIRISS SOSS [0.6-2.85 µm] (red line); Hubble WFC3 [1.12-1.64 µm] (dashed green line); NIRSpec G395H [2.7-5.15 µm] (dash dotted blue line). The red-shaded region shows the temperature range on WASP-121 b based on T_eff estimates. Red dashed lines indicate the boundaries of the planet's temperature range within the NIRISS SOSS captured flux fraction. From this we estimate that these observations capture between 55% and 82% of the planet's bolometric flux, depending on orbital phase. Using the minimum temperature from the NAMELESS fit, this estimate decreases to 50%. In either case, the wavelength coverage of NIRISS exceeds that of any other instrument.
<loc_261><loc_274><loc_462><loc_360>Pass et al. 2019). In this work, we elect to compute our effective temperature estimates with a novel method that is essentially a combination of the PWM and EWM. We create the effective temperature by using a simple Monte Carlo process. First, we perturb our F_p / F_s emission spectra at each point in the orbit by a Gaussian based on the measurement uncertainty. Our new emission spectrum is then used to create an estimate of the brightness temperature spectrum. This process is repeated at each orbital phase. We then estimate the effective temperature, T_eff for a given orbital phase as
<loc_317><loc_363><loc_460><loc_382>T _ { \text {eff} } = \frac { \sum _ { i = 1 } ^ { N } w _ { i } T _ { \text {bright} , i } } { \sum _ { i = 1 } ^ { N } w _ { i } } ,
<loc_261><loc_384><loc_462><loc_415>where w_i is the weight for the i -th wavelength given by the fraction of the planet's bolometric flux that falls within that wavelength bin scaled by the inverse variance of the measurement,
<loc_306><loc_418><loc_460><loc_438>w _ { i } = \frac { \int _ { \lambda _ { i } } ^ { \lambda _ { i } + 1 } B ( \lambda _ { i } , T _ { \text {est} } ) , d \lambda } { \int _ { 0 } ^ { \infty } B ( \lambda _ { i } , T _ { \text {est} } ) , d \lambda } \cdot \frac { 1 } { \sigma _ { i } ^ { 2 } } ,
<loc_261><loc_440><loc_462><loc_455>with T_est representing an estimated effective temperature at the orbital phase of interest. When computing

Using danchev/ibm-granite-docling-258M-GGUF with llama-server version: 6710 (74b8fc17f).
llama-server -m ${models}/ibm-granite-docling-258M-bf16.gguf --ctx-size 56000 -ub 2048 -b 4096 --temp 0.1 --min-p 0.05 --top-p 0.95 --top-k 10 --mmproj ${models}/mmproj-ibm-granite-docling-258M-f16.gguf --n-gpu-layers 999
Prompt: Convert this document to docling format.

I have the same issue when I deploy it through vLLM using the chat completions API. Using the vLLM library itself, it seems to work fine. I noticed that the doctag tags are missing from the output when I use the chat completions API.

@momodadi you mean "when using the docling library itself", right?
When I use the docker version of docling, the output is beautiful. But when using the model directly, we are supposed to post-process the output, right? If yes, are there any instructions on how to do it? Those loc_ tags seem to be some kind of coordinates; any idea what they mean and how to use them?

Yep, normally we do indeed have to perform some post-processing using the following:

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(doctags_list, images)
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name=base_filename)

As you have noted, the output of the model is indeed various loc_ tags. I tried to check the intermediary output through vLLM, but there it also seems to be loc_ tags wrapped in doctag tags. I am hence not entirely sure what the correct way to proceed is in our case. It would be nice if one of the maintainers or developers could share their thoughts with us.
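On the loc_ question: as far as I can tell, each `<loc_x0><loc_y0><loc_x1><loc_y1>` quadruple is a bounding box in reading order, quantized onto a fixed grid (0-500 in the SmolDocling/DocTags scheme; treat the grid size as an assumption and check it against your own pages). A rough sketch of recovering pixel boxes from a raw doctags string:

```python
import re

# Assumed grid: DocTags loc tokens quantize page coordinates to 0-500
# (SmolDocling convention); verify against your own renders.
GRID = 500

LOC_RE = re.compile(r"<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>([^<]*)")

def parse_loc_spans(doctags, page_w, page_h):
    """Yield ((x0, y0, x1, y1) in pixels, trailing text) per loc quadruple."""
    for m in LOC_RE.finditer(doctags):
        x0, y0, x1, y1 = (int(g) for g in m.groups()[:4])
        bbox = (x0 / GRID * page_w, y0 / GRID * page_h,
                x1 / GRID * page_w, y1 / GRID * page_h)
        yield bbox, m.group(5).strip()
```

In practice you should not need this by hand; `DocTagsDocument`/`DoclingDocument` interpret the tags for you. It is only useful for debugging raw output like the sample at the top of this thread.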

I don't really want to use vLLM or the entire docling library, since I prefer to have some separation of concerns: running the LLM in a different environment on the server side, and running our API solution elsewhere while using only the docling_core library and the OpenAI completions API on the client side.

So, I think the main reason this is happening is that prompt = processor.apply_chat_template(messages, add_generation_prompt=True) is not being applied on our end; there is a specific way in which the prompt is formatted. I am not entirely sure how to replicate this behavior so that an OpenAI chat completions call would produce the same response. I have tried to recreate it using only the /completions API, but the model pretty much replies with bogus output.

import base64
from typing import Any, Dict

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")


def encode_image(img_path: str) -> str:
    """Read an image file and return its base64-encoded contents."""
    with open(img_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Process a single invoice page image using the /completions endpoint
async def process_single_page(
    img_path: str,
    client,
    model_name: str,
    prompt: str,
    page_num: int,
) -> Dict[str, Any]:
    try:
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": prompt},
                ],
            }
        ]
        # Render the chat template locally so the raw /completions prompt
        # matches what the processor would feed the model.
        formatted_prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

        base64_image = encode_image(img_path)

        # Call vLLM endpoint (OpenAI-compatible /completions API)
        response = await client.completions.create(
            model=model_name,
            prompt=formatted_prompt,
            max_tokens=6192,
            temperature=0.0,
            extra_body={
                "multi_modal_data": {
                    "image": f"data:image/png;base64,{base64_image}"
                }
            },
        )

        doctags_output = response.choices[0].text.strip()

        return {
            "page": page_num,
            "doctags": doctags_output,
            "image_path": img_path,
            "status": "success",
        }

    except Exception as e:
        print(f"Error processing page {page_num}: {e}")
        return {
            "page": page_num,
            "error": str(e),
            "image_path": img_path,
            "status": "error",
        }
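Once each page has been processed, the per-page results still need to be ordered before being handed back to docling_core. A small pure helper for that step (the dict keys match the return values above; the function name is my own):

```python
from typing import Any, Dict, List


def collect_doctags(results: List[Dict[str, Any]]) -> List[str]:
    """Sort successful page results by page number and return their doctags,
    ready to pair with the page images for DocTagsDocument."""
    ok = [r for r in results if r.get("status") == "success"]
    return [r["doctags"] for r in sorted(ok, key=lambda r: r["page"])]
```

The returned list can then be passed as `doctags_list` to `DocTagsDocument.from_doctags_and_image_pairs`, together with the page images in the same order.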

I would advise just using the inference methods mentioned in the model card.

I found this discussion very informative and instructive: https://github.com/docling-project/docling/discussions/354
