yinbq committed
Commit 0341b51 · verified · 1 Parent(s): b35d13a

Add files using upload-large-folder tool

.gitignore ADDED
@@ -0,0 +1,16 @@
1
+ wandb
2
+ __pycache__
3
+ .vscode
4
+ notebooks
5
+ results
6
+ *.ipynb_checkpoints
7
+ eval_results
8
+ tests
9
+ .DS_Store
10
+ gradio.sh
11
+ models
12
+ bagel_example
13
+ Zebra-CoT
14
+ model_bf16.safetensors
15
+ zebra-cot.tar.gz
16
+ reasoning_output*
EVAL.md ADDED
@@ -0,0 +1,387 @@
1
+ # VLM
2
+ We follow [InternVL2](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html) to evaluate the performance on MME, MMBench, MMMU, MMVet, MathVista and MMVP.
3
+
4
+ ## Data preparation
5
+ Please follow [InternVL2](https://internvl.readthedocs.io/en/latest/get_started/eval_data_preparation.html) to prepare the corresponding data, then link the data under `vlm`.
6
+
7
+ The final directory structure is:
8
+ ```shell
9
+ data
10
+ ├── MathVista
11
+ ├── mmbench
12
+ ├── mme
13
+ ├── MMMU
14
+ ├── mm-vet
15
+ └── MMVP
16
+ ```
17
+
18
+ ## Evaluation
19
+
20
+ Directly run `scripts/eval/run_eval_vlm.sh` to evaluate the different benchmarks; the output will be saved in `$output_path`. A sketch of a typical invocation is shown after the list below.
21
+ - Set `$model_path` and `$output_path` to the checkpoint and log paths.
22
+ - Increase `GPUS` if you want to run faster.
23
+ - For MMBench, please use the official [evaluation server](https://mmbench.opencompass.org.cn/mmbench-submission).
24
+ - For MMVet, please use the official [evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator).
25
+ - For MathVista, please set `$openai_api_key` in `scripts/eval/run_eval_vlm.sh` and `your_api_url` in `eval/vlm/eval/mathvista/utilities.py`. The default GPT version is `gpt-4o-2024-11-20`.
26
+ - For MMMU, we use CoT in the report, which improves the accuracy by about 2%. For evaluation of open-ended answers, we use GPT-4o as the judge.
27
+
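+ As a rough sketch, the variables referenced in the list above might be filled in as follows before launching the script. This is only an illustration: whether they are exported in the shell or edited directly inside `scripts/eval/run_eval_vlm.sh` depends on how the script reads them.
+
+ ```shell
+ # Placeholder values; adjust to your environment.
+ export model_path=/path/to/checkpoint
+ export output_path=/path/to/eval_logs
+ export GPUS=8                    # more GPUs -> faster evaluation
+ export openai_api_key=sk-xxxx    # only needed for the GPT-judged benchmarks
+
+ bash scripts/eval/run_eval_vlm.sh
+ ```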
28
+
29
+ # GenEval
30
+ We modify the code in [GenEval](https://github.com/djghosh13/geneval/tree/main) for faster evaluation.
31
+
32
+ ## Setup
33
+ Install the following dependencies:
34
+ ```shell
35
+ pip install open-clip-torch
36
+ pip install clip-benchmark
37
+ pip install --upgrade setuptools
38
+
39
+ sudo pip install -U openmim
40
+ sudo mim install mmengine mmcv-full==1.7.2
41
+
42
+ git clone https://github.com/open-mmlab/mmdetection.git
43
+ cd mmdetection; git checkout 2.x
44
+ pip install -v -e .
45
+ ```
46
+
47
+ Download Detector:
48
+ ```shell
49
+ cd ./eval/gen/geneval
50
+ mkdir model
51
+
52
+ bash ./evaluation/download_models.sh ./model
53
+ ```
54
+
55
+ ## Evaluation
56
+ Directly run `scripts/eval/run_geneval.sh` to evaluate GenEval. The output will be saved in `$output_path`.
57
+ - Set `$model_path` and `$output_path` to the checkpoint and log paths.
58
+ - Set `metadata_file` to `./eval/gen/geneval/prompts/evaluation_metadata.jsonl` to use the original GenEval prompts.
59
+
60
+
61
+ # WISE
62
+ We modify the code in [WISE](https://github.com/PKU-YuanGroup/WISE/tree/main) for faster evaluation.
63
+
64
+
65
+ ## Evaluation
66
+ Directly run `scripts/eval/run_wise.sh` to evaluate WISE. The output will be saved in `$output_path`.
67
+ - Set `$model_path` and `$output_path` to the checkpoint and log paths.
68
+ - Set `$openai_api_key` in `scripts/eval/run_wise.sh` and `your_api_url` in `eval/gen/wise/gpt_eval_mp.py`. The default GPT version is `gpt-4o-2024-11-20`.
69
+ - Use `think` for thinking mode.
70
+
71
+
72
+
73
+ # GEdit-Bench
74
+ We adopt the code in [GEdit-Bench](https://github.com/stepfun-ai/Step1X-Edit/blob/main/GEdit-Bench/EVAL.md) for evaluation.
75
+
76
+ ## Evaluation
77
+
78
+ Modify the model path, the output path, the API key, and the API URL in `scripts/eval/run_gedit.sh`. Then run the following command:
79
+ ```shell
80
+ bash scripts/eval/run_gedit.sh
81
+ ```
82
+ The GPT version for evaluation is `gpt-4.1-2025-04-14`.
83
+
84
+
85
+ # IntelligentBench
86
+ TBD
87
+
88
+
89
+ # KRIS
90
+ We modify the code in [KRIS-Bench](https://github.com/mercurystraw/Kris_Bench) for faster evaluation.
91
+
92
+ ## Data preparation
93
+ Please download the benchmark data from [KRIS-Bench](https://huggingface.co/datasets/Liang0223/KRIS_Bench) and place it in the `KRIS_Bench` directory.
94
+
95
+ The final directory structure is:
96
+ ```shell
97
+ KRIS_Bench
98
+ ├── abstract_reasoning
99
+ ├── anomaly_correction
100
+ ├── biology
101
+ ├── chemistry
102
+ ├── color_change
103
+ ├── count_change
104
+ ├── geography
105
+ ├── humanities
106
+ ├── mathematics
107
+ ├── medicine
108
+ ├── multi-element_composition
109
+ ├── multi-instruction_execution
110
+ ├── part_completion
111
+ ├── physics
112
+ ├── position_movement
113
+ ├── practical_knowledge
114
+ ├── rule-based_reasoning
115
+ ├── size_adjustment
116
+ ├── temporal_prediction
117
+ └── viewpoint_change
118
+ ```
119
+
120
+ ## Evaluation
121
+ Directly run `scripts/eval/run_kris.sh` to evaluate KRIS-Bench. The output will be saved in `$output_path`.
122
+ - Set `$model_path` and `$output_path` to the checkpoint and log paths.
123
+ - Set `$openai_api_key` in `scripts/eval/run_kris.sh` and `your_api_url` in `eval/gen/kris/metrics_xx.py`. The default GPT version is `gpt-4o-2024-11-20`.
124
+ - Use `think` for thinking mode.
125
+ - We set `cfg_text_scale=4` and `cfg_img_scale=1.5` by default. Additionally, `cfg_renorm_min=0` is specified for CFG Renorm.
126
+
127
+ <details>
128
+ <summary><b>Results</b></summary>
129
+ <pre>
130
+ Category, meta-category, and overall average scores (100-point scale):
131
+ Attribute Perception:
132
+ VC: 76.64
133
+ VQ: 74.45
134
+ IF: 41.73
135
+ AVG: 64.27
136
+ Spatial Perception:
137
+ VC: 70.25
138
+ VQ: 80.00
139
+ IF: 37.00
140
+ AVG: 62.42
141
+ Temporal Prediction:
142
+ VC: 36.49
143
+ VQ: 61.82
144
+ IF: 29.05
145
+ AVG: 42.45
146
+ Social Science:
147
+ VC: 76.20
148
+ VQ: 78.80
149
+ IF: 37.00
150
+ KP: 29.60
151
+ AVG: 55.40
152
+ Natural Science:
153
+ VC: 69.59
154
+ VQ: 84.03
155
+ IF: 40.27
156
+ KP: 30.15
157
+ AVG: 56.01
158
+ Logical Reasoning:
159
+ VC: 80.17
160
+ VQ: 85.67
161
+ IF: 26.33
162
+ KP: 18.00
163
+ AVG: 52.54
164
+ Instruction Decomposition:
165
+ VC: 40.17
166
+ VQ: 69.50
167
+ IF: 42.00
168
+ AVG: 50.56
169
+ Factual Knowledge:
170
+ AVG: 60.26
171
+ Conceptual Knowledge:
172
+ AVG: 55.86
173
+ Procedural Knowledge:
174
+ AVG: 51.69
175
+ Overall:
176
+ AVG: 56.21
177
+ </pre>
178
+ </details>
179
+
180
+ <details>
181
+ <summary><b>Results w/ CoT</b></summary>
182
+ <pre>
183
+ Category, meta-category, and overall average scores (100-point scale):
184
+ Attribute Perception:
185
+ VC: 75.09
186
+ VQ: 74.00
187
+ IF: 53.18
188
+ AVG: 67.42
189
+ Spatial Perception:
190
+ VC: 78.75
191
+ VQ: 87.25
192
+ IF: 39.00
193
+ AVG: 68.33
194
+ Temporal Prediction:
195
+ VC: 48.31
196
+ VQ: 81.08
197
+ IF: 46.62
198
+ AVG: 58.67
199
+ Social Science:
200
+ VC: 80.40
201
+ VQ: 79.40
202
+ IF: 51.60
203
+ KP: 42.80
204
+ AVG: 63.55
205
+ Natural Science:
206
+ VC: 67.68
207
+ VQ: 82.95
208
+ IF: 52.10
209
+ KP: 42.88
210
+ AVG: 61.40
211
+ Logical Reasoning:
212
+ VC: 62.83
213
+ VQ: 79.67
214
+ IF: 28.33
215
+ KP: 21.67
216
+ AVG: 48.12
217
+ Instruction Decomposition:
218
+ VC: 47.83
219
+ VQ: 66.83
220
+ IF: 36.00
221
+ AVG: 50.22
222
+ Factual Knowledge:
223
+ AVG: 66.18
224
+ Conceptual Knowledge:
225
+ AVG: 61.92
226
+ Procedural Knowledge:
227
+ AVG: 49.02
228
+ Overall:
229
+ AVG: 60.18
230
+ </pre>
231
+ </details>
232
+
233
+
234
+ # RISE
235
+ We modify the code in [RISEBench](https://github.com/PhoenixZ810/RISEBench) for faster evaluation.
236
+
237
+ ## Data preparation
238
+ Please download the benchmark data from [RISEBench](https://huggingface.co/datasets/PhoenixZ/RISEBench) and place it in the `data` directory.
239
+
240
+ The final directory structure is:
241
+ ```shell
242
+ data
243
+ ├── datav2_total_w_subtask.json
244
+ ├── causal_reasoning_images
245
+ ├── logical_reasoning_images
246
+ ├── spatial_reasoning_images
247
+ └── temporal_reasoning_images
248
+ ```
249
+
250
+ ## Evaluation
251
+ Directly run `scripts/eval/run_rise.sh` to evaluate RISEBench. The output will be saved in `$output_path`.
252
+ - Set `$model_path` and `$output_path` to the checkpoint and log paths.
253
+ - Set `$openai_api_key` in `scripts/eval/run_rise.sh` and `your_api_url` in `eval/gen/rise/gpt_eval.py`. The default GPT version is `gpt-4.1-2025-04-14`.
254
+ - Use `think` for thinking mode.
255
+ - We set `cfg_text_scale=4` and `cfg_img_scale=2.0` by default. Additionally, `cfg_renorm_min=0` is specified for CFG Renorm.
256
+
257
+ <details>
258
+ <summary><b>Results (cfg_img_scale=1.5)</b></summary>
259
+ <pre>
260
+ - Score-Origin Score-Percentage Accuracy
261
+ 0 Overall 2.537778 38.444444 0.061111
262
+ 1 Temporal 2.654118 41.352941 0.023529
263
+ 2 Causal 2.788889 44.722222 0.055556
264
+ 3 Spatial 3.452000 61.300000 0.140000
265
+ 4 Logical 1.080000 2.000000 0.011765
266
+ 5 Overall_Reasoning 2.458333 36.458333 NaN
267
+ 6 Overall_ApprConsistency 3.141643 53.541076 NaN
268
+ 7 Overall_VisualPlausibility_total 3.920000 73.000000 NaN
269
+ 8 Temporal_Reasoning 2.588235 39.705882 NaN
270
+ 9 Temporal_Consistency 3.250000 56.250000 NaN
271
+ 10 Temporal_Quality 3.505882 62.647059 NaN
272
+ 11 Causal_Reasoning 2.733333 43.333333 NaN
273
+ 12 Causal_Consistency 3.579545 64.488636 NaN
274
+ 13 Causal_Quality 3.688889 67.222222 NaN
275
+ 14 Spatial_Reasoning 3.300000 57.500000 NaN
276
+ 15 Spatial_Consistency 3.330000 58.250000 NaN
277
+ 16 Spatial_Quality 4.480000 87.000000 NaN
278
+ 17 Logical_Reasoning 1.047059 1.176471 NaN
279
+ 18 Logical_Consistency 2.364706 34.117647 NaN
280
+ 19 Temp-Life Progression 2.757895 43.947368 0.000000
281
+ 20 Temp-Material Progression 2.500000 37.500000 0.021739
282
+ 21 Temp-Environmental Cycles 3.061538 51.538462 0.076923
283
+ 22 Temp-Societal Transformation 2.628571 40.714286 0.000000
284
+ 23 Causal-Structural Deformation 2.766667 44.166667 0.055556
285
+ 24 Causal-State Transition 3.112000 52.800000 0.080000
286
+ 25 Causal-Chemical and Biological Transformation 2.325000 33.125000 0.062500
287
+ 26 Causal-Physics Manifestation 2.800000 45.000000 0.000000
288
+ 27 Spa-Component Assembly 3.434783 60.869565 0.043478
289
+ 28 Spa-Object Arrangement 2.733333 43.333333 0.000000
290
+ 29 Spa-Viewpoint Generation 3.629630 65.740741 0.222222
291
+ 30 Spa-Structural Inference 4.066667 76.666667 0.133333
292
+ 31 Spa-Layout Reasoning 3.234783 55.869565 0.217391
293
+ 32 Logic-Pattern Prediction 1.035484 0.887097 0.000000
294
+ 33 Logic-Mathematical Derivation 1.350000 8.750000 0.071429
295
+ 34 Logic-Puzzle Solving 1.020000 0.500000 0.000000
296
+ </pre>
297
+ </details>
298
+
299
+ <details>
300
+ <summary><b>Results w/ CoT</b></summary>
301
+ <pre>
302
+ - Score-Origin Score-Percentage Accuracy
303
+ 0 Overall 2.933333 48.333333 0.119444
304
+ 1 Temporal 3.336471 58.411765 0.058824
305
+ 2 Causal 3.608889 65.222222 0.177778
306
+ 3 Spatial 3.492000 62.300000 0.210000
307
+ 4 Logical 1.157647 3.941176 0.011765
308
+ 5 Overall_Reasoning 2.836111 45.902778 NaN
309
+ 6 Overall_ApprConsistency 3.951841 73.796034 NaN
310
+ 7 Overall_VisualPlausibility_total 4.203636 80.090909 NaN
311
+ 8 Temporal_Reasoning 3.188235 54.705882 NaN
312
+ 9 Temporal_Consistency 4.225000 80.625000 NaN
313
+ 10 Temporal_Quality 4.200000 80.000000 NaN
314
+ 11 Causal_Reasoning 3.533333 63.333333 NaN
315
+ 12 Causal_Consistency 4.386364 84.659091 NaN
316
+ 13 Causal_Quality 4.100000 77.500000 NaN
317
+ 14 Spatial_Reasoning 3.350000 58.750000 NaN
318
+ 15 Spatial_Consistency 4.300000 82.500000 NaN
319
+ 16 Spatial_Quality 4.300000 82.500000 NaN
320
+ 17 Logical_Reasoning 1.141176 3.529412 NaN
321
+ 18 Logical_Consistency 2.835294 45.882353 NaN
322
+ 19 Temp-Life Progression 3.526316 63.157895 0.052632
323
+ 20 Temp-Material Progression 3.208696 55.217391 0.086957
324
+ 21 Temp-Environmental Cycles 3.584615 64.615385 0.000000
325
+ 22 Temp-Societal Transformation 3.200000 55.000000 0.000000
326
+ 23 Causal-Structural Deformation 3.750000 68.750000 0.138889
327
+ 24 Causal-State Transition 3.792000 69.800000 0.320000
328
+ 25 Causal-Chemical and Biological Transformation 3.512500 62.812500 0.062500
329
+ 26 Causal-Physics Manifestation 2.984615 49.615385 0.153846
330
+ 27 Spa-Component Assembly 3.652174 66.304348 0.304348
331
+ 28 Spa-Object Arrangement 2.700000 42.500000 0.000000
332
+ 29 Spa-Viewpoint Generation 3.800000 70.000000 0.259259
333
+ 30 Spa-Structural Inference 3.680000 67.000000 0.266667
334
+ 31 Spa-Layout Reasoning 3.260870 56.521739 0.130435
335
+ 32 Logic-Pattern Prediction 1.064516 1.612903 0.000000
336
+ 33 Logic-Mathematical Derivation 1.707143 17.678571 0.071429
337
+ 34 Logic-Puzzle Solving 1.037500 0.937500 0.000000
338
+ </pre>
339
+ </details>
340
+
341
+
342
+ # ImgEdit
343
+ We modify the code in [ImgEdit](https://github.com/PKU-YuanGroup/ImgEdit) for faster evaluation.
344
+
345
+ ## Data preparation
346
+ Please download the benchmark data from [ImgEdit-Bench](https://huggingface.co/datasets/sysuyy/ImgEdit/blob/main/Benchmark.tar) and place it in the `Benchmark` directory.
347
+
348
+ The final directory structure is:
349
+ ```shell
350
+ Benchmark
351
+ ├── hard
352
+ ├── multiturn
353
+ └── singleturn
354
+ ├── judge_prompt.json
355
+ ├── singleturn.json
356
+ ├── animal
357
+ ├── architecture
358
+ ├── clothes
359
+ ├── compose
360
+ ├── daily object
361
+ ├── for_add
362
+ ├── human
363
+ ├── style
364
+ └── transport
365
+ ```
366
+
367
+ ## Evaluation
368
+ Directly run `scripts/eval/run_imgedit.sh` to evaluate ImgEdit-Bench. The output will be saved in `$output_path`.
369
+ - Set `$model_path` and `$output_path` to the checkpoint and log paths.
370
+ - Set `$openai_api_key` in `scripts/eval/run_imgedit.sh` and `your_api_url` in `eval/gen/imgedit/basic_bench.py`. The default GPT version is `gpt-4o-2024-11-20`.
371
+ - We set `cfg_text_scale=4` and `cfg_img_scale=1.5` by default. Additionally, `cfg_renorm_min=0` is specified for CFG Renorm.
372
+
373
+ <details>
374
+ <summary><b>Results</b></summary>
375
+ <pre>
376
+ background: 3.28
377
+ adjust: 3.23
378
+ style: 4.26
379
+ extract: 1.48
380
+ remove: 2.99
381
+ add: 3.45
382
+ replace: 3.76
383
+ compose: 3.18
384
+ action: 4.38
385
+ overall: 3.28
386
+ </pre>
387
+ </details>
LICENSE ADDED
@@ -0,0 +1,201 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction,
10
+ and distribution as defined by Sections 1 through 9 of this document.
11
+
12
+ "Licensor" shall mean the copyright owner or entity authorized by
13
+ the copyright owner that is granting the License.
14
+
15
+ "Legal Entity" shall mean the union of the acting entity and all
16
+ other entities that control, are controlled by, or are under common
17
+ control with that entity. For the purposes of this definition,
18
+ "control" means (i) the power, direct or indirect, to cause the
19
+ direction or management of such entity, whether by contract or
20
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
21
+ outstanding shares, or (iii) beneficial ownership of such entity.
22
+
23
+ "You" (or "Your") shall mean an individual or Legal Entity
24
+ exercising permissions granted by this License.
25
+
26
+ "Source" form shall mean the preferred form for making modifications,
27
+ including but not limited to software source code, documentation
28
+ source, and configuration files.
29
+
30
+ "Object" form shall mean any form resulting from mechanical
31
+ transformation or translation of a Source form, including but
32
+ not limited to compiled object code, generated documentation,
33
+ and conversions to other media types.
34
+
35
+ "Work" shall mean the work of authorship, whether in Source or
36
+ Object form, made available under the License, as indicated by a
37
+ copyright notice that is included in or attached to the work
38
+ (an example is provided in the Appendix below).
39
+
40
+ "Derivative Works" shall mean any work, whether in Source or Object
41
+ form, that is based on (or derived from) the Work and for which the
42
+ editorial revisions, annotations, elaborations, or other modifications
43
+ represent, as a whole, an original work of authorship. For the purposes
44
+ of this License, Derivative Works shall not include works that remain
45
+ separable from, or merely link (or bind by name) to the interfaces of,
46
+ the Work and Derivative Works thereof.
47
+
48
+ "Contribution" shall mean any work of authorship, including
49
+ the original version of the Work and any modifications or additions
50
+ to that Work or Derivative Works thereof, that is intentionally
51
+ submitted to Licensor for inclusion in the Work by the copyright owner
52
+ or by an individual or Legal Entity authorized to submit on behalf of
53
+ the copyright owner. For the purposes of this definition, "submitted"
54
+ means any form of electronic, verbal, or written communication sent
55
+ to the Licensor or its representatives, including but not limited to
56
+ communication on electronic mailing lists, source code control systems,
57
+ and issue tracking systems that are managed by, or on behalf of, the
58
+ Licensor for the purpose of discussing and improving the Work, but
59
+ excluding communication that is conspicuously marked or otherwise
60
+ designated in writing by the copyright owner as "Not a Contribution."
61
+
62
+ "Contributor" shall mean Licensor and any individual or Legal Entity
63
+ on behalf of whom a Contribution has been received by Licensor and
64
+ subsequently incorporated within the Work.
65
+
66
+ 2. Grant of Copyright License. Subject to the terms and conditions of
67
+ this License, each Contributor hereby grants to You a perpetual,
68
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69
+ copyright license to reproduce, prepare Derivative Works of,
70
+ publicly display, publicly perform, sublicense, and distribute the
71
+ Work and such Derivative Works in Source or Object form.
72
+
73
+ 3. Grant of Patent License. Subject to the terms and conditions of
74
+ this License, each Contributor hereby grants to You a perpetual,
75
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76
+ (except as stated in this section) patent license to make, have made,
77
+ use, offer to sell, sell, import, and otherwise transfer the Work,
78
+ where such license applies only to those patent claims licensable
79
+ by such Contributor that are necessarily infringed by their
80
+ Contribution(s) alone or by combination of their Contribution(s)
81
+ with the Work to which such Contribution(s) was submitted. If You
82
+ institute patent litigation against any entity (including a
83
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
84
+ or a Contribution incorporated within the Work constitutes direct
85
+ or contributory patent infringement, then any patent licenses
86
+ granted to You under this License for that Work shall terminate
87
+ as of the date such litigation is filed.
88
+
89
+ 4. Redistribution. You may reproduce and distribute copies of the
90
+ Work or Derivative Works thereof in any medium, with or without
91
+ modifications, and in Source or Object form, provided that You
92
+ meet the following conditions:
93
+
94
+ (a) You must give any other recipients of the Work or
95
+ Derivative Works a copy of this License; and
96
+
97
+ (b) You must cause any modified files to carry prominent notices
98
+ stating that You changed the files; and
99
+
100
+ (c) You must retain, in the Source form of any Derivative Works
101
+ that You distribute, all copyright, patent, trademark, and
102
+ attribution notices from the Source form of the Work,
103
+ excluding those notices that do not pertain to any part of
104
+ the Derivative Works; and
105
+
106
+ (d) If the Work includes a "NOTICE" text file as part of its
107
+ distribution, then any Derivative Works that You distribute must
108
+ include a readable copy of the attribution notices contained
109
+ within such NOTICE file, excluding those notices that do not
110
+ pertain to any part of the Derivative Works, in at least one
111
+ of the following places: within a NOTICE text file distributed
112
+ as part of the Derivative Works; within the Source form or
113
+ documentation, if provided along with the Derivative Works; or,
114
+ within a display generated by the Derivative Works, if and
115
+ wherever such third-party notices normally appear. The contents
116
+ of the NOTICE file are for informational purposes only and
117
+ do not modify the License. You may add Your own attribution
118
+ notices within Derivative Works that You distribute, alongside
119
+ or as an addendum to the NOTICE text from the Work, provided
120
+ that such additional attribution notices cannot be construed
121
+ as modifying the License.
122
+
123
+ You may add Your own copyright statement to Your modifications and
124
+ may provide additional or different license terms and conditions
125
+ for use, reproduction, or distribution of Your modifications, or
126
+ for any such Derivative Works as a whole, provided Your use,
127
+ reproduction, and distribution of the Work otherwise complies with
128
+ the conditions stated in this License.
129
+
130
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
131
+ any Contribution intentionally submitted for inclusion in the Work
132
+ by You to the Licensor shall be under the terms and conditions of
133
+ this License, without any additional terms or conditions.
134
+ Notwithstanding the above, nothing herein shall supersede or modify
135
+ the terms of any separate license agreement you may have executed
136
+ with Licensor regarding such Contributions.
137
+
138
+ 6. Trademarks. This License does not grant permission to use the trade
139
+ names, trademarks, service marks, or product names of the Licensor,
140
+ except as required for reasonable and customary use in describing the
141
+ origin of the Work and reproducing the content of the NOTICE file.
142
+
143
+ 7. Disclaimer of Warranty. Unless required by applicable law or
144
+ agreed to in writing, Licensor provides the Work (and each
145
+ Contributor provides its Contributions) on an "AS IS" BASIS,
146
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147
+ implied, including, without limitation, any warranties or conditions
148
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149
+ PARTICULAR PURPOSE. You are solely responsible for determining the
150
+ appropriateness of using or redistributing the Work and assume any
151
+ risks associated with Your exercise of permissions under this License.
152
+
153
+ 8. Limitation of Liability. In no event and under no legal theory,
154
+ whether in tort (including negligence), contract, or otherwise,
155
+ unless required by applicable law (such as deliberate and grossly
156
+ negligent acts) or agreed to in writing, shall any Contributor be
157
+ liable to You for damages, including any direct, indirect, special,
158
+ incidental, or consequential damages of any character arising as a
159
+ result of this License or out of the use or inability to use the
160
+ Work (including but not limited to damages for loss of goodwill,
161
+ work stoppage, computer failure or malfunction, or any and all
162
+ other commercial damages or losses), even if such Contributor
163
+ has been advised of the possibility of such damages.
164
+
165
+ 9. Accepting Warranty or Additional Liability. While redistributing
166
+ the Work or Derivative Works thereof, You may choose to offer,
167
+ and charge a fee for, acceptance of support, warranty, indemnity,
168
+ or other liability obligations and/or rights consistent with this
169
+ License. However, in accepting such obligations, You may act only
170
+ on Your own behalf and on Your sole responsibility, not on behalf
171
+ of any other Contributor, and only if You agree to indemnify,
172
+ defend, and hold each Contributor harmless for any liability
173
+ incurred by, or claims asserted against, such Contributor by reason
174
+ of your accepting any such warranty or additional liability.
175
+
176
+ END OF TERMS AND CONDITIONS
177
+
178
+ APPENDIX: How to apply the Apache License to your work.
179
+
180
+ To apply the Apache License to your work, attach the following
181
+ boilerplate notice, with the fields enclosed by brackets "[]"
182
+ replaced with your own identifying information. (Don't include
183
+ the brackets!) The text should be enclosed in the appropriate
184
+ comment syntax for the file format. We also recommend that a
185
+ file or class name and description of purpose be included on the
186
+ same "printed page" as the copyright notice for easier
187
+ identification within third-party archives.
188
+
189
+ Copyright [yyyy] [name of copyright owner]
190
+
191
+ Licensed under the Apache License, Version 2.0 (the "License");
192
+ you may not use this file except in compliance with the License.
193
+ You may obtain a copy of the License at
194
+
195
+ http://www.apache.org/licenses/LICENSE-2.0
196
+
197
+ Unless required by applicable law or agreed to in writing, software
198
+ distributed under the License is distributed on an "AS IS" BASIS,
199
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200
+ See the License for the specific language governing permissions and
201
+ limitations under the License.
README.md ADDED
@@ -0,0 +1,139 @@
1
+ # Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning
2
+
3
+ ![Image](assets/zebra_cot_datacard.png)
4
+ ### Training BAGEL on Zebra-CoT
5
+
6
+ This repository is adapted from the [Bagel](https://github.com/ByteDance-Seed/Bagel) repository.
7
+ ### Setup
8
+
9
+ ```bash
10
+ git clone https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT.git
11
+ cd Bagel-Zebra-CoT
12
+ conda create -n bagel python=3.10 -y
13
+ conda activate bagel
14
+ pip install -r requirements.txt
15
+ pip install flash_attn --no-build-isolation
16
+ ```
17
+
18
+ ### Download checkpoint
19
+
20
+ Set `HF_HOME` in `download_model.py` to the directory where the checkpoint should be downloaded.
21
+
22
+ ```bash
23
+ python download_model.py
24
+ ```
25
+
26
+ You can also do this directly from Python if your `HF_HOME` is already set.
27
+ ```python
28
+ from huggingface_hub import snapshot_download
29
+
30
+ snapshot_download(
31
+ repo_id="multimodal-reasoning-lab/Bagel-Zebra-CoT",
32
+ local_dir_use_symlinks=False,
33
+ resume_download=True,
34
+ allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
35
+ )
36
+ ```
37
+
38
+ ### Inference
39
+
40
+ ![Image](assets/bagel-cot-example.png)
41
+
42
+ The inference script (`infz_bf16.py`) natively supports interleaved text and visual reasoning. To customize it for your
43
+ specific use case:
44
+
45
+ ##### 1. Model Checkpoint Path
46
+
47
+ Update the checkpoint path to point to your model:
48
+
49
+ ```python
50
+ checkpoint_dir = "/path/to/your/HF_HOME/models/Bagel-Zebra-CoT"
51
+ ```
52
+
53
+ For example, under the `HF_HOME`, the path to the checkpoint folder is:
54
+
55
+ ```python
56
+ checkpoint_dir = f"{HF_HOME}/models--multimodal-reasoning-lab--Bagel-Zebra-CoT/snapshots/c1ff3c56dd5909841523e3a6b554c77d919c2b28"
57
+ ```
58
+
59
+ You can also use the local dir:
60
+
61
+ ```python
62
+ checkpoint_dir = f"{HF_HOME}/models/Bagel-Zebra-CoT"
63
+ ```
64
+
65
+ ##### 2. Setting up prompt and images
66
+
67
+ Edit the prompt and image variables in `infz_bf16.py` (around lines 203-211):
68
+
69
+ **For single image problems:**
70
+ ```python
71
+ prompt = "Your question here"
72
+ image = Image.open('path/to/your/image.png')
73
+ ```
74
+
75
+ **For multiple image problems:**
76
+ ```python
77
+ prompt = "Your question about multiple images"
78
+ image_1 = Image.open('path/to/image1.jpg')
79
+ image_2 = Image.open('path/to/image2.jpg')
80
+ image_3 = Image.open('path/to/image3.jpg')
81
+ image = [image_1, image_2, image_3] # List of images
82
+ ```
83
+
84
+ **For text-only problems:**
85
+ ```python
86
+ prompt = "Your text-only question"
87
+ image = None
88
+ ```
89
+
90
+ ##### 3. Inference Parameters
91
+
92
+ You can adjust the generation parameters in the `inference_hyper` dictionary:
93
+
94
+ ```python
95
+ inference_hyper = dict(
96
+ do_sample=True,
97
+ text_temperature=0.3,
98
+ cfg_text_scale=4.0,
99
+ cfg_img_scale=2.0,
100
+ cfg_interval=[0.0, 1.0],
101
+ timestep_shift=3.0,
102
+ num_timesteps=50,
103
+ cfg_renorm_min=0.0,
104
+ cfg_renorm_type="text_channel",
105
+ )
106
+ ```
107
+
108
+ For details, refer to the original Jupyter notebook [here](inference.ipynb).
109
+
110
+ #### Example Use Cases
111
+
112
+ ```python
113
+ prompt = "Subtract all cylinders. Add 1 red sphere. How many objects are left?"
114
+ image = Image.open('test_images/image.png')
115
+ ```
116
+
117
+ ### Training
118
+ For training, run
119
+
120
+ ```bash
121
+ bash scripts/train.sh
122
+ ```
123
+
124
+ For details, please refer to the original repo [README](https://github.com/bytedance-seed/BAGEL).
125
+
126
+ The interleaved reasoning data customized for Zebra-CoT can be found in [think_trace_dataset.py](data/interleave_datasets/think_trace_dataset.py).
127
+
128
+ ### Cite
129
+ ```bibtex
130
+ @misc{li2025zebracot,
131
+ title={Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning},
132
+ author={Ang Li and Charles Wang and Kaiyu Yue and Zikui Cai and Ollie Liu and Deqing Fu and Peng Guo and Wang Bill Zhu and Vatsal Sharan and Robin Jia and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
133
+ year={2025},
134
+ eprint={2507.16746},
135
+ archivePrefix={arXiv},
136
+ primaryClass={cs.CV},
137
+ url={https://arxiv.org/abs/2507.16746},
138
+ }
139
+ ```
TRAIN.md ADDED
@@ -0,0 +1,168 @@
1
+ # Data preparation
2
+
3
+ We provide data examples for **T2I**, **Editing**, and **VLM** tasks. The T2I dataset is generated using [FLUX.1‑dev](https://huggingface.co/black-forest-labs/FLUX.1-dev); the editing examples are randomly sampled from [SEED‑Data‑Edit‑Part3](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit-Part2-3); and the VLM set is sourced from [LLaVA‑OneVision‑Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data).
4
+
5
+ We offer examples in both raw-image folder and parquet shard formats. For other data formats, you can use our dataset code as a template and extend it as needed.
6
+
7
+
8
+ 1. **Download the sample dataset**
9
+
10
+ ```bash
11
+ wget -O bagel_example.zip \
12
+ https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/bagel_example.zip
13
+ unzip bagel_example.zip -d /data
14
+ ```
15
+ 2. **Expected hierarchy**
16
+
17
+ ```text
18
+ bagel_example
19
+ ├── t2i/ # text-to-image (parquet)
20
+ ├── editing/ # image editing (parquet)
21
+ │ ├── seedxedit_multi/
22
+ │ └── parquet_info/
23
+ └── vlm/
24
+ ├── images/ # JPEG / PNG frames
25
+ └── llava_ov_si.jsonl # vision‑language SFT conversations
26
+ ```
27
+ 3. Edit every `your_data_path` placeholder in **`data/dataset_info.py`**.
28
+ 4. *(Optional)* Extend `DATASET_INFO` with your own parquet shards or JSONL files to mix extra data.
29
+
30
+ ---
31
+
32
+ # Training
33
+
34
+ The baseline training recipe looks like this (replace environment variables with real paths or values):
35
+
36
+ ```shell
37
+ # Pre-training
38
+ torchrun \
39
+ --nnodes=$num_nodes \
40
+ --node_rank=$node_rank \
41
+ --nproc_per_node=8 \
42
+ --master_addr=$master_addr \
43
+ --master_port=$master_port \
44
+ train/pretrain_unified_navit.py \
45
+ --dataset_config_file ./data/configs/example.yaml \
46
+ --llm_path $llm_path \
47
+ --vae_path $vae_path \
48
+ --vit_path $vit_path \
49
+ --layer_module Qwen2MoTDecoderLayer \
50
+ --use_flex True \
51
+ --resume_from $resume_from \
52
+ --results_dir $output_path \
53
+ --checkpoint_dir $ckpt_path \
54
+ --max_latent_size 64 # 32 for low-resolution pre-training
55
+
56
+ # Fine-tuning
57
+ torchrun \
58
+ --nnodes=$num_nodes \
59
+ --node_rank=$node_rank \
60
+ --nproc_per_node=8 \
61
+ --master_addr=$master_addr \
62
+ --master_port=$master_port \
63
+ train/pretrain_unified_navit.py \
64
+ --dataset_config_file ./data/configs/example.yaml \
65
+ --model_path $model_path \
66
+ --layer_module Qwen2MoTDecoderLayer \
67
+ --max_latent_size 64 \
68
+ --resume-from $model_path \
69
+ --finetune_from_hf True \
70
+ --auto_resume True \
71
+ --resume-model-only True \
72
+ --finetune-from-ema True \
73
+ --log_every 1 \
74
+ --lr 2e-5 \
75
+ --num_worker 1 \
76
+ --expected_num_tokens 10240 \
77
+ --max_num_tokens 11520 \
78
+ --max_num_tokens_per_sample 10240
79
+ ```
80
+
81
+ - **When fine-tuning BAGEL, set `max_latent_size=64` to ensure the correct pretrained weights are loaded.** If this is not set, an out-of-bounds error may occur.
82
+ - The total value of `num_used_data` should be greater than `NUM_GPUS × NUM_WORKERS`. (For toy data, use `num_worker=1`.)
83
+ - For T2I-only fine-tuning, set `visual_und=False`. For VLM-only fine-tuning, set `visual_gen=False`.
84
+ - For debugging purposes, use smaller values for `expected_num_tokens`, `max_num_tokens`, and `max_num_tokens_per_sample`.
85
+ - When fine-tuning on toy data, the loss behaves as follows:
86
+ ```shell
87
+ [2025-05-25 17:01:37] (step=0000000) Train Loss mse: 0.4063, Train Loss ce: 0.5504, Train Steps/Sec: 0.01,
88
+ [2025-05-25 17:01:40] (step=0000001) Train Loss mse: 0.4121, Train Loss ce: 0.8152, Train Steps/Sec: 0.44,
89
+ [2025-05-25 17:01:42] (step=0000002) Train Loss mse: 0.3876, Train Loss ce: 1.3411, Train Steps/Sec: 0.40,
90
+ [2025-05-25 17:01:45] (step=0000003) Train Loss mse: 0.3825, Train Loss ce: 0.7360, Train Steps/Sec: 0.44,
91
+ ```
92
+
93
+
94
+ You are encouraged to adjust any of these hyperparameters to fit your GPU budget and the scale of your dataset. If you encounter any issues, please open an issue for assistance. 🎉
95
+
96
+
97
+ ## Model config
98
+
99
+
100
+ | Argument | Default | Description |
101
+ | ---------------------------- | ------------------------------------------- | --------------------------------------------------------------- |
102
+ | `llm_path` | `hf/Qwen2.5-0.5B-Instruct` | Language‑model backbone (HuggingFace repo or local folder). |
103
+ | `vae_path` | `flux/vae/ae.safetensors` | Pre‑trained VAE checkpoint for latent diffusion. |
104
+ | `vit_path` | `hf/siglip-so400m-14-980-flash-attn2-navit` | SigLIP ViT used for image understanding. |
105
+ | `max_latent_size` | `32` | Maximum latent grid side; defines highest generable resolution. |
106
+ | `latent_patch_size` | `2` | VAE pixels represented by one latent patch. |
107
+ | `vit_max_num_patch_per_side` | `70` | Max ViT patches per image side after resizing. |
108
+ | `text_cond_dropout_prob` | `0.1` | Probability to drop text conditioning while training. |
109
+ | `vae_cond_dropout_prob` | `0.3` | Dropout on VAE latent inputs. |
110
+ | `vit_cond_dropout_prob` | `0.3` | Dropout on visual features. |
111
+
112
+ *(See `ModelArguments` for many more options.)*
113
+
114
+
115
+ ## Data config
116
+
117
+
118
+ | Argument | Default | Description |
119
+ | --------------------------- | --------------------------- | --------------------------------------------------------- |
120
+ | `dataset_config_file` | `data/configs/example.yaml` | YAML that groups datasets and assigns sampling weights. |
121
+ | `num_workers` | `4` | Background workers per rank for the PyTorch `DataLoader`. |
122
+ | `prefetch_factor` | `2` | Batches pre‑fetched by each worker. |
123
+ | `max_num_tokens_per_sample` | `16384` | Skip raw samples longer than this. |
124
+ | `max_num_tokens` | `36864` | Hard cap for a packed batch (prevents OOM). |
125
+ | `max_buffer_size` | `50` | Overflow buffer length for oversized samples. |
126
+ | `data_seed` | `42` | Seed for reproducible shuffling and sampling. |
127
+
128
+
129
+ ## Training config
130
+
131
+ | Argument | Default | Description |
132
+ | -------------------------------------- | ---------------------- | ------------------------------------------------------ |
133
+ | `total_steps` | `500_000` | Optimiser steps to run. |
134
+ | `lr` | `1e-4` | Peak learning rate after warm‑up. |
135
+ | `lr_scheduler` | `constant` | Learning‑rate schedule (`constant` or `cosine`). |
136
+ | `warmup_steps` | `2000` | Linear warm‑up duration. |
137
+ | `ema` | `0.9999` | Exponential moving‑average decay for model weights. |
138
+ | `max_grad_norm` | `1.0` | Gradient‑clipping threshold. |
139
+ | `save_every` | `2000` | Checkpoint frequency (steps). |
140
+ | `visual_gen / visual_und` | `True` | Enable image generation / understanding branches. |
141
+ | `freeze_llm / freeze_vit / freeze_vae` | `False / False / True` | Freeze selected modules to save VRAM or for ablations. |
142
+ | `use_flex` | `True` (in example) | Enable FLEX packing for higher GPU utilisation. |
143
+ | `sharding_strategy` | `HYBRID_SHARD` | FSDP sharding mode. |
144
+ | `num_shard` | `8` | Parameter shards per rank in HYBRID mode. |
145
+
146
+ **Distributed‑launch environment variables**
147
+
148
+ | Var | Meaning |
149
+ | ----------------------------- | --------------------------------- |
150
+ | `num_nodes` / `node_rank` | Multi‑node orchestration indices. |
151
+ | `nproc_per_node` | Number of GPUs per node. |
152
+ | `master_addr` / `master_port` | NCCL rendezvous endpoint. |
153
+
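+ For a single-node run, a minimal setup of these variables (values are placeholders) could look like:
+
+ ```shell
+ num_nodes=1
+ node_rank=0
+ master_addr=127.0.0.1
+ master_port=29500   # any free port
+ ```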
154
+
155
+ ## Logging config
156
+
157
+
158
+ | Argument | Default | Description |
159
+ | ---------------- | --------------------- | ---------------------------------------------------- |
160
+ | `results_dir` | `results` | Root directory for logs and metrics. |
161
+ | `checkpoint_dir` | `results/checkpoints` | Checkpoints are saved here. |
162
+ | `log_every` | `10` | Steps between console / W\&B logs. |
163
+ | `wandb_project` | `bagel` | Weights & Biases project name. |
164
+ | `wandb_name` | `run` | Run name inside the project. |
165
+ | `wandb_offline` | `False` | Switch to offline mode (logs locally, sync later). |
166
+ | `wandb_resume` | `allow` | Resumption policy if an existing run ID is detected. |
167
+
168
+ > **Tip** Export `WANDB_API_KEY` before launching if you want online dashboards.
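+
+ A minimal example (the key value is a placeholder):
+
+ ```shell
+ export WANDB_API_KEY=your_api_key_here   # enables online W&B dashboards
+ ```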
app.py ADDED
@@ -0,0 +1,613 @@
1
+ import gradio as gr
2
+ import numpy as np
3
+ import os
4
+ import torch
5
+ import random
6
+
7
+ from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch, init_empty_weights
8
+ from PIL import Image
9
+
10
+ from data.data_utils import add_special_tokens, pil_img2rgb
11
+ from data.transforms import ImageTransform
12
+ from inferencer import InterleaveInferencer
13
+ from modeling.autoencoder import load_ae
14
+ from modeling.bagel.qwen2_navit import NaiveCache
15
+ from modeling.bagel import (
16
+ BagelConfig, Bagel, Qwen2Config, Qwen2ForCausalLM,
17
+ SiglipVisionConfig, SiglipVisionModel
18
+ )
19
+ from modeling.qwen2 import Qwen2Tokenizer
20
+
21
+ import argparse
22
+ from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model
23
+
24
+
25
+ parser = argparse.ArgumentParser()
26
+ parser.add_argument("--server_name", type=str, default="127.0.0.1")
27
+ parser.add_argument("--server_port", type=int, default=7860)
28
+ parser.add_argument("--share", action="store_true")
29
+ parser.add_argument("--model_path", type=str, default="models/BAGEL-7B-MoT")
30
+ parser.add_argument("--mode", type=int, default=1)
31
+ parser.add_argument("--zh", action="store_true")
32
+ args = parser.parse_args()
33
+
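+ # Example launch (flags defined above; path, address and port are placeholders):
+ #   python app.py --model_path models/BAGEL-7B-MoT --mode 1 --server_name 0.0.0.0 --server_port 7860
+ # --mode 2 loads NF4-quantized weights and --mode 3 loads INT8 weights (see below).
+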
34
+ # Model Initialization
35
+ model_path = args.model_path #Download from https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT to models/BAGEL-7B-MoT
36
+
38
+
39
+ llm_config = Qwen2Config.from_json_file(os.path.join(model_path, "llm_config.json"))
40
+ llm_config.qk_norm = True
41
+ llm_config.tie_word_embeddings = False
42
+ llm_config.layer_module = "Qwen2MoTDecoderLayer"
43
+
44
+ vit_config = SiglipVisionConfig.from_json_file(os.path.join(model_path, "vit_config.json"))
45
+ vit_config.rope = False
46
+ vit_config.num_hidden_layers -= 1
47
+
48
+ vae_model, vae_config = load_ae(local_path=os.path.join(model_path, "ae.safetensors"))
49
+
50
+ config = BagelConfig(
51
+ visual_gen=True,
52
+ visual_und=True,
53
+ llm_config=llm_config,
54
+ vit_config=vit_config,
55
+ vae_config=vae_config,
56
+ vit_max_num_patch_per_side=70,
57
+ connector_act='gelu_pytorch_tanh',
58
+ latent_patch_size=2,
59
+ max_latent_size=64,
60
+ )
61
+
62
+ with init_empty_weights():
63
+ language_model = Qwen2ForCausalLM(llm_config)
64
+ vit_model = SiglipVisionModel(vit_config)
65
+ model = Bagel(language_model, vit_model, config)
66
+ model.vit_model.vision_model.embeddings.convert_conv2d_to_linear(vit_config, meta=True)
67
+
68
+ tokenizer = Qwen2Tokenizer.from_pretrained(model_path)
69
+ tokenizer, new_token_ids, _ = add_special_tokens(tokenizer)
70
+
71
+ vae_transform = ImageTransform(1024, 512, 16)
72
+ vit_transform = ImageTransform(980, 224, 14)
73
+
74
+ # Model Loading and Multi-GPU Inference Preparation
75
+ device_map = infer_auto_device_map(
76
+ model,
77
+ max_memory={i: "80GiB" for i in range(torch.cuda.device_count())},
78
+ no_split_module_classes=["Bagel", "Qwen2MoTDecoderLayer"],
79
+ )
80
+
81
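+ # These small modules are forced onto the same device as the token embeddings below,
+ # so their activations never need cross-device transfers when the model is sharded.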
+ same_device_modules = [
82
+ 'language_model.model.embed_tokens',
83
+ 'time_embedder',
84
+ 'latent_pos_embed',
85
+ 'vae2llm',
86
+ 'llm2vae',
87
+ 'connector',
88
+ 'vit_pos_embed'
89
+ ]
90
+
91
+ if torch.cuda.device_count() == 1:
92
+ first_device = device_map.get(same_device_modules[0], "cuda:0")
93
+ for k in same_device_modules:
94
+ if k in device_map:
95
+ device_map[k] = first_device
96
+ else:
97
+ device_map[k] = "cuda:0"
98
+ else:
99
+ first_device = device_map.get(same_device_modules[0])
100
+ for k in same_device_modules:
101
+ if k in device_map:
102
+ device_map[k] = first_device
103
+
104
+ if args.mode == 1:
105
+ model = load_checkpoint_and_dispatch(
106
+ model,
107
+ checkpoint=os.path.join(model_path, "ema.safetensors"),
108
+ device_map=device_map,
109
+ offload_buffers=True,
110
+ offload_folder="offload",
111
+ dtype=torch.bfloat16,
112
+ force_hooks=True,
113
+ ).eval()
114
+ elif args.mode == 2: # NF4
115
+ bnb_quantization_config = BnbQuantizationConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=False, bnb_4bit_quant_type="nf4")
116
+ model = load_and_quantize_model(
117
+ model,
118
+ weights_location=os.path.join(model_path, "ema.safetensors"),
119
+ bnb_quantization_config=bnb_quantization_config,
120
+ device_map=device_map,
121
+ offload_folder="offload",
122
+ ).eval()
123
+ elif args.mode == 3: # INT8
124
+ bnb_quantization_config = BnbQuantizationConfig(load_in_8bit=True, torch_dtype=torch.float32)
125
+ model = load_and_quantize_model(
126
+ model,
127
+ weights_location=os.path.join(model_path, "ema.safetensors"),
128
+ bnb_quantization_config=bnb_quantization_config,
129
+ device_map=device_map,
130
+ offload_folder="offload",
131
+ ).eval()
132
+ else:
133
+ raise NotImplementedError
134
+
135
+ # Inferencer Preparation
136
+ inferencer = InterleaveInferencer(
137
+ model=model,
138
+ vae_model=vae_model,
139
+ tokenizer=tokenizer,
140
+ vae_transform=vae_transform,
141
+ vit_transform=vit_transform,
142
+ new_token_ids=new_token_ids,
143
+ )
144
+
145
+
146
+ def set_seed(seed):
147
+ """Set random seeds for reproducibility"""
148
+ if seed > 0:
149
+ random.seed(seed)
150
+ np.random.seed(seed)
151
+ torch.manual_seed(seed)
152
+ if torch.cuda.is_available():
153
+ torch.cuda.manual_seed(seed)
154
+ torch.cuda.manual_seed_all(seed)
155
+ torch.backends.cudnn.deterministic = True
156
+ torch.backends.cudnn.benchmark = False
157
+ return seed
158
+
159
+
160
+ # Text to Image function with thinking option and hyperparameters
161
+ def text_to_image(prompt, show_thinking=False, cfg_text_scale=4.0, cfg_interval=0.4,
162
+ timestep_shift=3.0, num_timesteps=50,
163
+ cfg_renorm_min=0.0, cfg_renorm_type="global",
164
+ max_think_token_n=1024, do_sample=False, text_temperature=0.3,
165
+ seed=0, image_ratio="1:1"):
166
+ # Set seed for reproducibility
167
+ set_seed(seed)
168
+
169
+ if image_ratio == "1:1":
170
+ image_shapes = (1024, 1024)
171
+ elif image_ratio == "4:3":
172
+ image_shapes = (768, 1024)
173
+ elif image_ratio == "3:4":
174
+ image_shapes = (1024, 768)
175
+ elif image_ratio == "16:9":
176
+ image_shapes = (576, 1024)
177
+ elif image_ratio == "9:16":
178
+ image_shapes = (1024, 576)
179
+
180
+ # Set hyperparameters
181
+ inference_hyper = dict(
182
+ max_think_token_n=max_think_token_n if show_thinking else 1024,
183
+ do_sample=do_sample if show_thinking else False,
184
+ text_temperature=text_temperature if show_thinking else 0.3,
185
+ cfg_text_scale=cfg_text_scale,
186
+ cfg_interval=[cfg_interval, 1.0], # End fixed at 1.0
187
+ timestep_shift=timestep_shift,
188
+ num_timesteps=num_timesteps,
189
+ cfg_renorm_min=cfg_renorm_min,
190
+ cfg_renorm_type=cfg_renorm_type,
191
+ image_shapes=image_shapes,
192
+ )
193
+
194
+ # Call inferencer with or without think parameter based on user choice
195
+ result = inferencer(text=prompt, think=show_thinking, **inference_hyper)
196
+ return result["image"], result.get("text", None)
197
+
198
+
199
+ # Image Understanding function with thinking option and hyperparameters
200
+ def image_understanding(image: Image.Image, prompt: str, show_thinking=False,
201
+ do_sample=False, text_temperature=0.3, max_new_tokens=512):
202
+ if image is None:
203
+ return "Please upload an image."
204
+
205
+ if isinstance(image, np.ndarray):
206
+ image = Image.fromarray(image)
207
+
208
+ image = pil_img2rgb(image)
209
+
210
+ # Set hyperparameters
211
+ inference_hyper = dict(
212
+ do_sample=do_sample,
213
+ text_temperature=text_temperature,
214
+ max_think_token_n=max_new_tokens, # Set max_length
215
+ )
216
+
217
+ # Use show_thinking parameter to control thinking process
218
+ result = inferencer(image=image, text=prompt, think=show_thinking,
219
+ understanding_output=True, **inference_hyper)
220
+ return result["text"]
221
+
222
+
223
+ # Image Editing function with thinking option and hyperparameters
224
+ def edit_image(image: Image.Image, prompt: str, show_thinking=False, cfg_text_scale=4.0,
225
+ cfg_img_scale=2.0, cfg_interval=0.0,
226
+ timestep_shift=3.0, num_timesteps=50, cfg_renorm_min=0.0,
227
+ cfg_renorm_type="text_channel", max_think_token_n=1024,
228
+ do_sample=False, text_temperature=0.3, seed=0):
229
+ # Set seed for reproducibility
230
+ set_seed(seed)
231
+
232
+ if image is None:
233
+ return "Please upload an image.", ""
234
+
235
+ if isinstance(image, np.ndarray):
236
+ image = Image.fromarray(image)
237
+
238
+ image = pil_img2rgb(image)
239
+
240
+ # Set hyperparameters
241
+ inference_hyper = dict(
242
+ max_think_token_n=max_think_token_n if show_thinking else 1024,
243
+ do_sample=do_sample if show_thinking else False,
244
+ text_temperature=text_temperature if show_thinking else 0.3,
245
+ cfg_text_scale=cfg_text_scale,
246
+ cfg_img_scale=cfg_img_scale,
247
+ cfg_interval=[cfg_interval, 1.0], # End fixed at 1.0
248
+ timestep_shift=timestep_shift,
249
+ num_timesteps=num_timesteps,
250
+ cfg_renorm_min=cfg_renorm_min,
251
+ cfg_renorm_type=cfg_renorm_type,
252
+ )
253
+
254
+ # Include thinking parameter based on user choice
255
+ result = inferencer(image=image, text=prompt, think=show_thinking, **inference_hyper)
256
+ return result["image"], result.get("text", "")
257
+
258
+
259
+ # Helper function to load example images
260
+ def load_example_image(image_path):
261
+ try:
262
+ return Image.open(image_path)
263
+ except Exception as e:
264
+ print(f"Error loading example image: {e}")
265
+ return None
266
+
267
+
268
+ # Gradio UI
269
+ with gr.Blocks() as demo:
270
+ gr.Markdown("""
271
+ <div>
272
+ <img src="https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/banner.png" alt="BAGEL" width="380"/>
273
+ </div>
274
+ """)
275
+
276
+ with gr.Tab("📝 Text to Image"):
277
+ txt_input = gr.Textbox(
278
+ label="Prompt",
279
+ value="A female cosplayer portraying an ethereal fairy or elf, wearing a flowing dress made of delicate fabrics in soft, mystical colors like emerald green and silver. She has pointed ears, a gentle, enchanting expression, and her outfit is adorned with sparkling jewels and intricate patterns. The background is a magical forest with glowing plants, mystical creatures, and a serene atmosphere."
280
+ )
281
+
282
+ with gr.Row():
283
+ show_thinking = gr.Checkbox(label="Thinking", value=False)
284
+
285
+ # Add hyperparameter controls in an accordion
286
+ with gr.Accordion("Inference Hyperparameters", open=False):
287
+ with gr.Group():
288
+ with gr.Row():
289
+ seed = gr.Slider(minimum=0, maximum=1000000, value=0, step=1,
290
+ label="Seed", info="0 for random seed, positive for reproducible results")
291
+ image_ratio = gr.Dropdown(choices=["1:1", "4:3", "3:4", "16:9", "9:16"],
292
+ value="1:1", label="Image Ratio",
293
+ info="The longer size is fixed to 1024")
294
+
295
+ with gr.Row():
296
+ cfg_text_scale = gr.Slider(minimum=1.0, maximum=8.0, value=4.0, step=0.1, interactive=True,
297
+ label="CFG Text Scale", info="Controls how strongly the model follows the text prompt (4.0-8.0)")
298
+ cfg_interval = gr.Slider(minimum=0.0, maximum=1.0, value=0.4, step=0.1,
299
+ label="CFG Interval", info="Start of CFG application interval (end is fixed at 1.0)")
300
+
301
+ with gr.Row():
302
+ cfg_renorm_type = gr.Dropdown(choices=["global", "local", "text_channel"],
303
+ value="global", label="CFG Renorm Type",
304
+ info="If the generated image is blurry, use 'global'")
305
+ cfg_renorm_min = gr.Slider(minimum=0.0, maximum=1.0, value=0.0, step=0.1, interactive=True,
306
+ label="CFG Renorm Min", info="1.0 disables CFG-Renorm")
307
+
308
+ with gr.Row():
309
+ num_timesteps = gr.Slider(minimum=10, maximum=100, value=50, step=5, interactive=True,
310
+ label="Timesteps", info="Total denoising steps")
311
+ timestep_shift = gr.Slider(minimum=1.0, maximum=5.0, value=3.0, step=0.5, interactive=True,
312
+ label="Timestep Shift", info="Higher values for layout, lower for details")
313
+
314
+ # Thinking parameters in a single row
315
+ thinking_params = gr.Group(visible=False)
316
+ with thinking_params:
317
+ with gr.Row():
318
+ do_sample = gr.Checkbox(label="Sampling", value=False, info="Enable sampling for text generation")
319
+ max_think_token_n = gr.Slider(minimum=64, maximum=4006, value=1024, step=64, interactive=True,
320
+ label="Max Think Tokens", info="Maximum number of tokens for thinking")
321
+ text_temperature = gr.Slider(minimum=0.1, maximum=1.0, value=0.3, step=0.1, interactive=True,
322
+ label="Temperature", info="Controls randomness in text generation")
323
+
324
+ thinking_output = gr.Textbox(label="Thinking Process", visible=False)
325
+ img_output = gr.Image(label="Generated Image")
326
+ gen_btn = gr.Button("Generate", variant="primary")
327
+
328
+ # Dynamically show/hide thinking process box and parameters
329
+ def update_thinking_visibility(show):
330
+ return gr.update(visible=show), gr.update(visible=show)
331
+
332
+ show_thinking.change(
333
+ fn=update_thinking_visibility,
334
+ inputs=[show_thinking],
335
+ outputs=[thinking_output, thinking_params]
336
+ )
337
+
338
+ # Process function based on thinking option and hyperparameters
339
+ def process_text_to_image(prompt, show_thinking, cfg_text_scale,
340
+ cfg_interval, timestep_shift,
341
+ num_timesteps, cfg_renorm_min, cfg_renorm_type,
342
+ max_think_token_n, do_sample, text_temperature, seed, image_ratio):
343
+ image, thinking = text_to_image(
344
+ prompt, show_thinking, cfg_text_scale, cfg_interval,
345
+ timestep_shift, num_timesteps,
346
+ cfg_renorm_min, cfg_renorm_type,
347
+ max_think_token_n, do_sample, text_temperature, seed, image_ratio
348
+ )
349
+ return image, thinking if thinking else ""
350
+
351
+ gr.on(
352
+ triggers=[gen_btn.click, txt_input.submit],
353
+ fn=process_text_to_image,
354
+ inputs=[
355
+ txt_input, show_thinking, cfg_text_scale,
356
+ cfg_interval, timestep_shift,
357
+ num_timesteps, cfg_renorm_min, cfg_renorm_type,
358
+ max_think_token_n, do_sample, text_temperature, seed, image_ratio
359
+ ],
360
+ outputs=[img_output, thinking_output]
361
+ )
362
+
363
+ with gr.Tab("🖌️ Image Edit"):
364
+ with gr.Row():
365
+ with gr.Column(scale=1):
366
+ edit_image_input = gr.Image(label="Input Image", value=load_example_image('test_images/women.jpg'))
367
+ edit_prompt = gr.Textbox(
368
+ label="Prompt",
369
+ value="She boards a modern subway, quietly reading a folded newspaper, wearing the same clothes."
370
+ )
371
+
372
+ with gr.Column(scale=1):
373
+ edit_image_output = gr.Image(label="Result")
374
+ edit_thinking_output = gr.Textbox(label="Thinking Process", visible=False)
375
+
376
+ with gr.Row():
377
+ edit_show_thinking = gr.Checkbox(label="Thinking", value=False)
378
+
379
+ # Add hyperparameter controls in an accordion
380
+ with gr.Accordion("Inference Hyperparameters", open=False):
381
+ with gr.Group():
382
+ with gr.Row():
383
+ edit_seed = gr.Slider(minimum=0, maximum=1000000, value=0, step=1, interactive=True,
384
+ label="Seed", info="0 for random seed, positive for reproducible results")
385
+ edit_cfg_text_scale = gr.Slider(minimum=1.0, maximum=8.0, value=4.0, step=0.1, interactive=True,
386
+ label="CFG Text Scale", info="Controls how strongly the model follows the text prompt")
387
+
388
+ with gr.Row():
389
+ edit_cfg_img_scale = gr.Slider(minimum=1.0, maximum=4.0, value=2.0, step=0.1, interactive=True,
390
+ label="CFG Image Scale", info="Controls how much the model preserves input image details")
391
+ edit_cfg_interval = gr.Slider(minimum=0.0, maximum=1.0, value=0.0, step=0.1, interactive=True,
392
+ label="CFG Interval", info="Start of CFG application interval (end is fixed at 1.0)")
393
+
394
+ with gr.Row():
395
+ edit_cfg_renorm_type = gr.Dropdown(choices=["global", "local", "text_channel"],
396
+ value="text_channel", label="CFG Renorm Type",
397
+ info="If the generated image is blurry, use 'global'")
398
+ edit_cfg_renorm_min = gr.Slider(minimum=0.0, maximum=1.0, value=0.0, step=0.1, interactive=True,
399
+ label="CFG Renorm Min", info="1.0 disables CFG-Renorm")
400
+
401
+ with gr.Row():
402
+ edit_num_timesteps = gr.Slider(minimum=10, maximum=100, value=50, step=5, interactive=True,
403
+ label="Timesteps", info="Total denoising steps")
404
+ edit_timestep_shift = gr.Slider(minimum=1.0, maximum=10.0, value=3.0, step=0.5, interactive=True,
405
+ label="Timestep Shift", info="Higher values for layout, lower for details")
406
+
407
+
408
+ # Thinking parameters in a single row
409
+ edit_thinking_params = gr.Group(visible=False)
410
+ with edit_thinking_params:
411
+ with gr.Row():
412
+ edit_do_sample = gr.Checkbox(label="Sampling", value=False, info="Enable sampling for text generation")
413
+ edit_max_think_token_n = gr.Slider(minimum=64, maximum=4006, value=1024, step=64, interactive=True,
414
+ label="Max Think Tokens", info="Maximum number of tokens for thinking")
415
+ edit_text_temperature = gr.Slider(minimum=0.1, maximum=1.0, value=0.3, step=0.1, interactive=True,
416
+ label="Temperature", info="Controls randomness in text generation")
417
+
418
+ edit_btn = gr.Button("Submit", variant="primary")
419
+
420
+ # Dynamically show/hide thinking process box for editing
421
+ def update_edit_thinking_visibility(show):
422
+ return gr.update(visible=show), gr.update(visible=show)
423
+
424
+ edit_show_thinking.change(
425
+ fn=update_edit_thinking_visibility,
426
+ inputs=[edit_show_thinking],
427
+ outputs=[edit_thinking_output, edit_thinking_params]
428
+ )
429
+
430
+ # Process editing with thinking option and hyperparameters
431
+ def process_edit_image(image, prompt, show_thinking, cfg_text_scale,
432
+ cfg_img_scale, cfg_interval,
433
+ timestep_shift, num_timesteps, cfg_renorm_min,
434
+ cfg_renorm_type, max_think_token_n, do_sample,
435
+ text_temperature, seed):
436
+ edited_image, thinking = edit_image(
437
+ image, prompt, show_thinking, cfg_text_scale, cfg_img_scale,
438
+ cfg_interval, timestep_shift,
439
+ num_timesteps, cfg_renorm_min, cfg_renorm_type,
440
+ max_think_token_n, do_sample, text_temperature, seed
441
+ )
442
+
443
+ return edited_image, thinking if thinking else ""
444
+
445
+ gr.on(
446
+ triggers=[edit_btn.click, edit_prompt.submit],
447
+ fn=process_edit_image,
448
+ inputs=[
449
+ edit_image_input, edit_prompt, edit_show_thinking,
450
+ edit_cfg_text_scale, edit_cfg_img_scale, edit_cfg_interval,
451
+ edit_timestep_shift, edit_num_timesteps,
452
+ edit_cfg_renorm_min, edit_cfg_renorm_type,
453
+ edit_max_think_token_n, edit_do_sample, edit_text_temperature, edit_seed
454
+ ],
455
+ outputs=[edit_image_output, edit_thinking_output]
456
+ )
457
+
458
+ with gr.Tab("🖼️ Image Understanding"):
459
+ with gr.Row():
460
+ with gr.Column(scale=1):
461
+ img_input = gr.Image(label="Input Image", value=load_example_image('test_images/meme.jpg'))
462
+ understand_prompt = gr.Textbox(
463
+ label="Prompt",
464
+ value="Can someone explain what's funny about this meme??"
465
+ )
466
+
467
+ with gr.Column(scale=1):
468
+ txt_output = gr.Textbox(label="Result", lines=20)
469
+
470
+ with gr.Row():
471
+ understand_show_thinking = gr.Checkbox(label="Thinking", value=False)
472
+
473
+ # Add hyperparameter controls in an accordion
474
+ with gr.Accordion("Inference Hyperparameters", open=False):
475
+ with gr.Row():
476
+ understand_do_sample = gr.Checkbox(label="Sampling", value=False, info="Enable sampling for text generation")
477
+ understand_text_temperature = gr.Slider(minimum=0.0, maximum=1.0, value=0.3, step=0.05, interactive=True,
478
+ label="Temperature", info="Controls randomness in text generation (0=deterministic, 1=creative)")
479
+ understand_max_new_tokens = gr.Slider(minimum=64, maximum=4096, value=512, step=64, interactive=True,
480
+ label="Max New Tokens", info="Maximum length of generated text, including potential thinking")
481
+
482
+ img_understand_btn = gr.Button("Submit", variant="primary")
483
+
484
+ # Process understanding with thinking option and hyperparameters
485
+ def process_understanding(image, prompt, show_thinking, do_sample,
486
+ text_temperature, max_new_tokens):
487
+ result = image_understanding(
488
+ image, prompt, show_thinking, do_sample,
489
+ text_temperature, max_new_tokens
490
+ )
491
+ return result
492
+
493
+ gr.on(
494
+ triggers=[img_understand_btn.click, understand_prompt.submit],
495
+ fn=process_understanding,
496
+ inputs=[
497
+ img_input, understand_prompt, understand_show_thinking,
498
+ understand_do_sample, understand_text_temperature, understand_max_new_tokens
499
+ ],
500
+ outputs=txt_output
501
+ )
502
+
503
+ gr.Markdown("""
504
+ <div style="display: flex; justify-content: flex-start; flex-wrap: wrap; gap: 10px;">
505
+ <a href="https://bagel-ai.org/">
506
+ <img
507
+ src="https://img.shields.io/badge/BAGEL-Website-0A66C2?logo=safari&logoColor=white"
508
+ alt="BAGEL Website"
509
+ />
510
+ </a>
511
+ <a href="https://arxiv.org/abs/2505.14683">
512
+ <img
513
+ src="https://img.shields.io/badge/BAGEL-Paper-red?logo=arxiv&logoColor=red"
514
+ alt="BAGEL Paper on arXiv"
515
+ />
516
+ </a>
517
+ <a href="https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT">
518
+ <img
519
+ src="https://img.shields.io/badge/BAGEL-Hugging%20Face-orange?logo=huggingface&logoColor=yellow"
520
+ alt="BAGEL on Hugging Face"
521
+ />
522
+ </a>
523
+ <a href="https://demo.bagel-ai.org/">
524
+ <img
525
+ src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=blue"
526
+ alt="BAGEL Demo"
527
+ />
528
+ </a>
529
+ <a href="https://discord.gg/Z836xxzy">
530
+ <img
531
+ src="https://img.shields.io/badge/BAGEL-Discord-5865F2?logo=discord&logoColor=purple"
532
+ alt="BAGEL Discord"
533
+ />
534
+ </a>
535
+ <a href="mailto:[email protected]">
536
+ <img
537
+ src="https://img.shields.io/badge/BAGEL-Email-D14836?logo=gmail&logoColor=red"
538
+ alt="BAGEL Email"
539
+ />
540
+ </a>
541
+ </div>
542
+ """)
543
+
544
+ UI_TRANSLATIONS = {
545
+ "📝 Text to Image":"📝 文生图",
546
+ "Prompt":"提示词",
547
+ "Thinking":"思考模式",
548
+ "Inference Hyperparameters":"推理参数",
549
+ "Seed":"随机种子",
550
+ "0 for random seed, positive for reproducible results":"0为随机种子,正数表示可重复结果",
551
+ "Image Ratio":"图片比例",
552
+ "The longer size is fixed to 1024":"长边固定为1024",
553
+ "CFG Text Scale":"文本CFG强度",
554
+ "Controls how strongly the model follows the text prompt (4.0-8.0)":"控制模型是否遵循文本提示(4.0-8.0)",
555
+ "CFG Interval":"CFG应用间隔",
556
+ "Start of CFG application interval (end is fixed at 1.0)":"CFG应用间隔的开始(结束固定为1.0)",
557
+ "CFG Renorm Type":"CFG 重归一化类型",
558
+ "If the generated image is blurry, use 'global'":"如果生成的图像模糊,请使用'global'",
559
+ "CFG Renorm Min":"CFG 重归一化最小值",
560
+ "1.0 disables CFG-Renorm":"1.0 禁用 CFG 重归一化",
561
+ "Timesteps":"时间步数",
562
+ "Total denoising steps":"总去噪步数",
563
+ "Timestep Shift":"时间步偏移",
564
+ "Higher values for layout, lower for details":"值更大更倾向于调整布局,值更小更倾向于调整细节",
565
+ "Sampling":"采样",
566
+ "Enable sampling for text generation":"为文本生成启用采样",
567
+ "Max Think Tokens":"最大思考token数",
568
+ "Maximum number of tokens for thinking":"思考的最大token数",
569
+ "Temperature":"温度系数",
570
+ "Controls randomness in text generation":"控制文本生成的随机性",
571
+ "Thinking Process":"思考过程",
572
+ "Generated Image":"生成图像",
573
+ "Generate":"开始生成",
574
+ "🖌️ Image Edit":"🖌️ 图像编辑",
575
+ "Input Image":"图像输入",
576
+ "Result":"结果",
577
+ "Controls how strongly the model follows the text prompt":"控制模型是否遵循文本提示的强度",
578
+ "CFG Image Scale":"图像CFG强度",
579
+ "Controls how much the model preserves input image details":"控制模型保留输入图像细节的强度",
580
+ "Submit":"开始生成",
581
+ "🖼️ Image Understanding":"🖼️ 图像理解",
582
+ "Controls randomness in text generation (0=deterministic, 1=creative)":"控制文本生成的随机性(0=确定性,1=创造性)",
583
+ "Max New Tokens":"最大新token数",
584
+ "Maximum length of generated text, including potential thinking":"生成文本的最大长度,包括可能的思考",
585
+ }
586
+
587
+ def apply_localization(block):
588
+ def process_component(component):
589
+ if not component:
590
+ return
591
+
592
+ for attr in ['label', 'info', 'placeholder']:
593
+ if hasattr(component, attr):
594
+ text = getattr(component, attr)
595
+ if text in UI_TRANSLATIONS:
596
+ setattr(component, attr, UI_TRANSLATIONS[text])
597
+
598
+ if hasattr(component, 'children'):
599
+ for child in component.children:
600
+ process_component(child)
601
+
602
+ process_component(block)
603
+ return block
604
+
605
+ if __name__ == "__main__":
606
+ if args.zh:
607
+ demo = apply_localization(demo)
608
+ demo.launch(
609
+ server_name=args.server_name,
610
+ server_port=args.server_port,
611
+ share=args.share,
612
+ inbrowser=True,
613
+ )
bug.log ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Now I see the problem: the model directory contains model_bf16.safetensors, but the code is looking for model.safetensors. Let me check the loading logic in fsdp_utils.py:
2
+
3
+ Read fsdp_utils.py, lines 155 to 165
4
+
5
+ Found it: the code looks for model.safetensors, but the actual filename is model_bf16.safetensors. The loading logic needs to be updated to handle the actual filename.
6
+
7
+
8
+
9
+
10
+ cd /scratch/by2593/Bagel-Zebra-CoT-origin && PYTHONPATH=/scratch/by2593/Bagel-Zebra-CoT-origin
11
+
12
+
13
+
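The fix described in the log above amounts to a filename fallback when resolving the checkpoint. A minimal sketch of that idea, assuming a hypothetical `resolve_and_load_checkpoint` helper and candidate list rather than the actual `fsdp_utils.py` code:

```python
# Illustrative sketch only: try the expected checkpoint name first, then fall back
# to the bf16 variant. Helper name and candidate list are assumptions, not repo code.
import os
from safetensors.torch import load_file


def resolve_and_load_checkpoint(model_dir, candidates=("model.safetensors", "model_bf16.safetensors")):
    for name in candidates:
        path = os.path.join(model_dir, name)
        if os.path.exists(path):
            return load_file(path)
    raise FileNotFoundError(f"None of {candidates} found in {model_dir}")
```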
data/configs/example_smm_semantic.yaml ADDED
@@ -0,0 +1,50 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ block_dataset:
2
+ dataset_names:
3
+ - block_dataset
4
+ jsonl_path_list: ["/scratch/by2593/project/SMM/SMM_data/semantic_block_train_part1.jsonl"]
5
+ num_used_data: None
6
+ image_prefix_dir: "/scratch/by2593/project/SMM/semantic_blocks_part1"
7
+ image_transform_args:
8
+ image_stride: 16
9
+ max_image_size: 512 # VAE uses stride=16, 512/16 = 32 patches
10
+ min_image_size: 512
11
+ vit_image_transform_args:
12
+ image_stride: 14
13
+ max_image_size: 512 # ViT uses stride=14, 512/14 ≈ 36 patches (matches the model's capacity)
14
+ min_image_size: 512
15
+ weight: 1.0
16
+ is_mandatory: true
17
+
18
+ # unified_edit:
19
+ # dataset_names:
20
+ # - seedxedit_multi
21
+ # image_transform_args:
22
+ # image_stride: 16
23
+ # max_image_size: 1024
24
+ # min_image_size: 512
25
+ # vit_image_transform_args:
26
+ # image_stride: 14
27
+ # max_image_size: 518
28
+ # min_image_size: 224
29
+ # is_mandatory: true
30
+ # num_used_data:
31
+ # - 10
32
+ # weight: 1
33
+
34
+ # vlm_sft:
35
+ # dataset_names:
36
+ # - llava_ov
37
+ # image_transform_args:
38
+ # image_stride: 14
39
+ # max_image_size: 980
40
+ # min_image_size: 378
41
+ # max_pixels: 2_007_040
42
+ # frame_sampler_args:
43
+ # max_num_frames: 12
44
+ # min_num_frames: 8
45
+ # is_mandatory: true
46
+ # shuffle_lines: True
47
+ # shuffle_seed: 0
48
+ # num_used_data:
49
+ # - 1000
50
+ # weight: 1
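The stride comments in the config above can be sanity-checked with a couple of lines of arithmetic (illustrative only):

```python
# Patch counts implied by the config comments above, for 512 px inputs.
image_size = 512
print(image_size // 16)  # 32 patches per side for the VAE branch (stride 16)
print(image_size // 14)  # 36 patches per side for the ViT branch (stride 14, floor of 512/14)
```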
data/data_utils.py ADDED
@@ -0,0 +1,177 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
2
+ # SPDX-License-Identifier: Apache-2.0
3
+
4
+
5
+ import math
6
+ import random
7
+ from PIL import Image
8
+
9
+ import torch
10
+ from torch.nn.attention.flex_attention import or_masks, and_masks
11
+
12
+
13
+ def create_sparse_mask(document_lens, split_lens, attn_modes, device):
14
+ def causal_mask(b, h, q_idx, kv_idx):
15
+ return q_idx >= kv_idx
16
+
17
+ def full_and_noise_mask(b, h, q_idx, kv_idx):
18
+ return (full_and_noise_seq_id[q_idx] == full_and_noise_seq_id[kv_idx]) & (full_and_noise_seq_id[q_idx] >= 0)
19
+
20
+ def remove_noise_mask(b, h, q_idx, kv_idx):
21
+ return (~((noise_seq_id[kv_idx] >= 0) & (noise_seq_id[q_idx] != noise_seq_id[kv_idx])))
22
+
23
+ def sample_mask(b, h, q_idx, kv_idx):
24
+ return document_id[q_idx] == document_id[kv_idx]
25
+
26
+ full_and_noise_tmp = []
27
+ noise_tmp = []
28
+
29
+ for i, (length, model) in enumerate(zip(split_lens, attn_modes)):
30
+ value = i if model in ['full', 'noise'] else -1
31
+ full_and_noise_tmp.extend([value] * length)
32
+ value_noise = i if model == 'noise' else -1
33
+ noise_tmp.extend([value_noise] * length)
34
+
35
+ full_and_noise_seq_id = torch.Tensor(full_and_noise_tmp).to(device)
36
+ noise_seq_id = torch.Tensor(noise_tmp).to(device)
37
+
38
+ document_id = torch.cat([torch.full((l,), i) for i, l in enumerate(document_lens, start=1)]).to(device)
39
+
40
+ return and_masks(or_masks(causal_mask, full_and_noise_mask), remove_noise_mask, sample_mask)
41
+
42
+
43
+ def patchify(image, patch_size):
44
+ p = patch_size
45
+ c, h, w = image.shape
46
+ assert h % p == 0 and w % p == 0
47
+ image = image.reshape(c, h // p, p, w // p, p)
48
+ image = torch.einsum("chpwq->hwpqc", image)
49
+ image = image.reshape(-1, p**2 * c)
50
+ return image
51
+
52
+
53
+ def get_flattened_position_ids_extrapolate(img_h, img_w, patch_size, max_num_patches_per_side):
54
+ num_patches_h, num_patches_w = img_h // patch_size, img_w // patch_size
55
+ coords_h = torch.arange(0, num_patches_h)
56
+ coords_w = torch.arange(0, num_patches_w)
57
+ pos_ids = (coords_h[:, None] * max_num_patches_per_side + coords_w).flatten()
58
+ return pos_ids
59
+
60
+
61
+ def get_flattened_position_ids_interpolate(img_h, img_w, patch_size, max_num_patches_per_side):
62
+ num_patches_h, num_patches_w = img_h // patch_size, img_w // patch_size
63
+ boundaries = torch.arange(1 / max_num_patches_per_side, 1.0, 1 / max_num_patches_per_side)
64
+ fractional_coords_h = torch.arange(0, 1 - 1e-6, 1 / num_patches_h)
65
+ fractional_coords_w = torch.arange(0, 1 - 1e-6, 1 / num_patches_w)
66
+ bucket_coords_h = torch.bucketize(fractional_coords_h, boundaries, right=True)
67
+ bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)
68
+ pos_ids = (bucket_coords_h[:, None] * max_num_patches_per_side + bucket_coords_w).flatten()
69
+ return pos_ids
70
+
71
+
72
+ def prepare_attention_mask_per_sample(split_lens, attn_modes, device="cpu"):
73
+ """
74
+ split_lens: A list of ints. Each int indicates the length of a split within
75
+ a sample, where each sample contains multiple splits with different attn modes.
76
+ attn_modes: the attention mode ('causal', 'full', or 'noise') used within each split.
77
+ """
78
+ sample_len = sum(split_lens)
79
+ attention_mask = torch.zeros((sample_len, sample_len), dtype=torch.bool, device=device)
80
+
81
+ csum = 0
82
+ for s, attn_mode in zip(split_lens, attn_modes):
83
+ assert attn_mode in ['causal', 'full', 'noise']
84
+ if attn_mode == "causal":
85
+ attention_mask[csum:csum + s, csum:csum + s] = torch.ones((s, s), device=device).tril()
86
+ attention_mask[csum:csum + s, :csum] = 1
87
+ else:
88
+ attention_mask[csum:csum + s, csum:csum + s] = torch.ones((s, s))
89
+ attention_mask[csum:csum + s, :csum] = 1
90
+ csum += s
91
+
92
+ csum = 0
93
+ for s, attn_mode in zip(split_lens, attn_modes):
94
+ if attn_mode == "noise":
95
+ attention_mask[:, csum : csum + s] = torch.zeros((sample_len, s))
96
+ attention_mask[csum : csum + s, csum : csum + s] = torch.ones((s, s))
97
+ csum += s
98
+
99
+ attention_mask = torch.zeros_like(attention_mask, dtype=torch.float).masked_fill_(
100
+ ~attention_mask, float("-inf")
101
+ )
102
+
103
+ return attention_mask
104
+
105
+
106
+ def split_integer_exp_decay(S, ng_sample_decay=1.0):
107
+ if ng_sample_decay == 1.0:
108
+ N = random.randint(1, S)
109
+ else:
110
+ base = (1 - ng_sample_decay) / (1 - math.pow(ng_sample_decay, S))
111
+ p = [base * math.pow(ng_sample_decay, i) for i in range(S)]
112
+ N = random.choices(list(range(1, S + 1)), p, k=1)[0]
113
+ cumsum = [0] + sorted(random.sample(range(1, S), N - 1)) + [S]
114
+ result = [cumsum[i+1] - cumsum[i] for i in range(len(cumsum) - 1)]
115
+ return result, cumsum
116
+
117
+
118
+ def pil_img2rgb(image):
119
+ if image.mode == "RGBA" or image.info.get("transparency", None) is not None:
120
+ image = image.convert("RGBA")
121
+ white = Image.new(mode="RGB", size=image.size, color=(255, 255, 255))
122
+ white.paste(image, mask=image.split()[3])
123
+ image = white
124
+ else:
125
+ image = image.convert("RGB")
126
+
127
+ return image
128
+
129
+
130
+ def add_special_tokens(tokenizer):
131
+ all_special_tokens = []
132
+ for k, v in tokenizer.special_tokens_map.items():
133
+ if isinstance(v, str):
134
+ all_special_tokens.append(v)
135
+ elif isinstance(v, list):
136
+ all_special_tokens += v
137
+
138
+ new_tokens = []
139
+
140
+ if '<|im_start|>' not in all_special_tokens:
141
+ new_tokens.append('<|im_start|>')
142
+
143
+ if '<|im_end|>' not in all_special_tokens:
144
+ new_tokens.append('<|im_end|>')
145
+
146
+ if '<|vision_start|>' not in all_special_tokens:
147
+ new_tokens.append('<|vision_start|>')
148
+
149
+ if '<|vision_end|>' not in all_special_tokens:
150
+ new_tokens.append('<|vision_end|>')
151
+
152
+ num_new_tokens = tokenizer.add_tokens(new_tokens)
153
+ bos_token_id = tokenizer.convert_tokens_to_ids('<|im_start|>')
154
+ eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>')
155
+ start_of_image = tokenizer.convert_tokens_to_ids('<|vision_start|>')
156
+ end_of_image = tokenizer.convert_tokens_to_ids('<|vision_end|>')
157
+
158
+ new_token_ids = dict(
159
+ bos_token_id=bos_token_id,
160
+ eos_token_id=eos_token_id,
161
+ start_of_image=start_of_image,
162
+ end_of_image=end_of_image,
163
+ )
164
+
165
+ return tokenizer, new_token_ids, num_new_tokens
166
+
167
+
168
+ def len2weight(x, loss_reduction='square'):
169
+ if x == 0:
170
+ return x
171
+ if loss_reduction == 'token':
172
+ return 1
173
+ if loss_reduction == 'sample':
174
+ return 1 / x
175
+ if loss_reduction == 'square':
176
+ return 1 / (x ** 0.5)
177
+ raise NotImplementedError(loss_reduction)
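As a quick illustration of the `prepare_attention_mask_per_sample` helper defined above, a toy call could look like this (the split lengths and modes are made-up values, not taken from the training pipeline):

```python
# Toy example: a 4-token causal text split, a 2-token full-attention split, a 3-token noise split.
mask = prepare_attention_mask_per_sample(
    split_lens=[4, 2, 3],
    attn_modes=["causal", "full", "noise"],
)
print(mask.shape)  # torch.Size([9, 9]); 0.0 where attention is allowed, -inf where it is masked
```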
data/interleave_datasets/__init__.py ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
2
+ # SPDX-License-Identifier: Apache-2.0
3
+
4
+ from .edit_dataset import UnifiedEditIterableDataset
5
+ from .think_trace_dataset import ThinkTraceJSONLIterableDataset
6
+
data/parquet_utils.py ADDED
@@ -0,0 +1,89 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
2
+ # SPDX-License-Identifier: Apache-2.0
3
+
4
+
5
+ import os
6
+ import subprocess
7
+ import logging
8
+
9
+ import pyarrow.fs as pf
10
+ import torch.distributed as dist
11
+
12
+ logger = logging.getLogger(__name__)
13
+
14
+
15
+ def get_parquet_data_paths(data_dir_list, num_sampled_data_paths, rank=0, world_size=1):
16
+ num_data_dirs = len(data_dir_list)
17
+ if world_size > 1:
18
+ chunk_size = (num_data_dirs + world_size - 1) // world_size
19
+ start_idx = rank * chunk_size
20
+ end_idx = min(start_idx + chunk_size, num_data_dirs)
21
+ local_data_dir_list = data_dir_list[start_idx:end_idx]
22
+ local_num_sampled_data_paths = num_sampled_data_paths[start_idx:end_idx]
23
+ else:
24
+ local_data_dir_list = data_dir_list
25
+ local_num_sampled_data_paths = num_sampled_data_paths
26
+
27
+ local_data_paths = []
28
+ for data_dir, num_data_path in zip(local_data_dir_list, local_num_sampled_data_paths):
29
+ if data_dir.startswith("hdfs://"):
30
+ files = hdfs_ls_cmd(data_dir)
31
+ data_paths_per_dir = [
32
+ file for file in files if file.endswith(".parquet")
33
+ ]
34
+ else:
35
+ files = os.listdir(data_dir)
36
+ data_paths_per_dir = [
37
+ os.path.join(data_dir, name)
38
+ for name in files
39
+ if name.endswith(".parquet")
40
+ ]
41
+ repeat = num_data_path // len(data_paths_per_dir)
42
+ data_paths_per_dir = data_paths_per_dir * (repeat + 1)
43
+ local_data_paths.extend(data_paths_per_dir[:num_data_path])
44
+
45
+ if world_size > 1:
46
+ gather_list = [None] * world_size
47
+ dist.all_gather_object(gather_list, local_data_paths)
48
+
49
+ combined_chunks = []
50
+ for chunk_list in gather_list:
51
+ if chunk_list is not None:
52
+ combined_chunks.extend(chunk_list)
53
+ else:
54
+ combined_chunks = local_data_paths
55
+
56
+ return combined_chunks
57
+
58
+
59
+ # NOTE: customize this function for your cluster
60
+ def get_hdfs_host():
61
+ return "hdfs://xxx"
62
+
63
+
64
+ # NOTE: customize this function for your cluster
65
+ def get_hdfs_block_size():
66
+ return 134217728
67
+
68
+
69
+ # NOTE: customize this function for your cluster
70
+ def get_hdfs_extra_conf():
71
+ return None
72
+
73
+
74
+ def init_arrow_pf_fs(parquet_file_path):
75
+ if parquet_file_path.startswith("hdfs://"):
76
+ fs = pf.HadoopFileSystem(
77
+ host=get_hdfs_host(),
78
+ port=0,
79
+ buffer_size=get_hdfs_block_size(),
80
+ extra_conf=get_hdfs_extra_conf(),
81
+ )
82
+ else:
83
+ fs = pf.LocalFileSystem()
84
+ return fs
85
+
86
+
87
+ def hdfs_ls_cmd(dir):
88
+ result = subprocess.run(["hdfs", "dfs", "-ls", dir], capture_output=True, text=True).stdout
89
+ return ['hdfs://' + i.split('hdfs://')[-1].strip() for i in result.split('\n') if 'hdfs://' in i]
data/t2i_dataset.py ADDED
@@ -0,0 +1,128 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
2
+ # SPDX-License-Identifier: Apache-2.0
3
+
4
+ import io
5
+ import json
6
+ import pyarrow.parquet as pq
7
+ import random
8
+ from PIL import Image
9
+
10
+ from .data_utils import pil_img2rgb
11
+ from .distributed_iterable_dataset import DistributedIterableDataset
12
+ from .parquet_utils import get_parquet_data_paths, init_arrow_pf_fs
13
+
14
+ Image.MAX_IMAGE_PIXELS = 20_000_000
15
+
16
+
17
+ class T2IIterableDataset(DistributedIterableDataset):
18
+ def __init__(
19
+ self, dataset_name, transform, tokenizer, data_dir_list, num_used_data,
20
+ local_rank=0, world_size=1, num_workers=8, data_status=None,
21
+ ):
22
+ """
23
+ data_dir_list: list of data directories contains parquet files
24
+ num_used_data: list of number of sampled data paths for each data directory
25
+ """
26
+ super().__init__(dataset_name, local_rank, world_size, num_workers)
27
+ self.transform = transform
28
+ self.tokenizer = tokenizer
29
+ self.data_status = data_status
30
+ self.data_paths = self.get_data_paths(data_dir_list, num_used_data)
31
+ self.set_epoch()
32
+
33
+ def get_data_paths(self, data_dir_list, num_used_data):
34
+ return get_parquet_data_paths(data_dir_list, num_used_data)
35
+
36
+ def __iter__(self):
37
+ data_paths_per_worker, worker_id = self.get_data_paths_per_worker()
38
+ if self.data_status is not None:
39
+ parquet_start_id = self.data_status[worker_id][0]
40
+ row_group_start_id = self.data_status[worker_id][1]
41
+ row_start_id = self.data_status[worker_id][2] + 1
42
+ else:
43
+ parquet_start_id = 0
44
+ row_group_start_id = 0
45
+ row_start_id = 0
46
+ transform_stride = self.transform.stride
47
+
48
+ print(
49
+ f"rank-{self.local_rank} worker-{worker_id} dataset-{self.dataset_name}: "
50
+ f"resuming data at parquet#{parquet_start_id}, rg#{row_group_start_id}, row#{row_start_id}"
51
+ )
52
+
53
+ while True:
54
+ data_paths_per_worker_ = data_paths_per_worker[parquet_start_id:]
55
+ for parquet_idx, parquet_file_path in enumerate(data_paths_per_worker_, start=parquet_start_id):
56
+ fs = init_arrow_pf_fs(parquet_file_path)
57
+ with fs.open_input_file(parquet_file_path) as f:
58
+ fr = pq.ParquetFile(f)
59
+ row_group_ids = list(range(fr.num_row_groups))
60
+ row_group_ids_ = row_group_ids[row_group_start_id:]
61
+
62
+ for row_group_id in row_group_ids_:
63
+ df = fr.read_row_group(row_group_id).to_pandas()
64
+ df = df.iloc[row_start_id:]
65
+
66
+ for row_idx, row in df.iterrows():
67
+ num_tokens = 0
68
+ try:
69
+ image_byte = row['image']
70
+ image = pil_img2rgb(Image.open(io.BytesIO(image_byte)))
71
+ except Exception as e:
72
+ print(f'Error: {e} in rg#{row_group_id}, {parquet_file_path}')
73
+ continue
74
+ image_tensor = self.transform(image)
75
+ height, width = image_tensor.shape[1:]
76
+ num_tokens += width * height // transform_stride ** 2
77
+
78
+ try:
79
+ caption_dict = row['captions']
80
+ caption_dict = json.loads(caption_dict)
81
+ except Exception as e:
82
+ print(f'Error: {e} in rg#{row_group_id}, {parquet_file_path}')
83
+ continue
84
+
85
+ caps_token = [self.tokenizer.encode(v) for _, v in caption_dict.items()]
86
+ if len(caps_token) == 0:
87
+ print(f'no caption in rg#{row_group_id}, {parquet_file_path}')
88
+ caption_token = self.tokenizer.encode(' ')
89
+ else:
90
+ caption_token = random.choice(caps_token)
91
+
92
+ sequence_plan, text_ids_list = [], []
93
+ text_ids = caption_token
94
+ num_tokens += len(caption_token)
95
+ text_ids_list.append(text_ids)
96
+ sequence_plan.append({
97
+ 'type': 'text',
98
+ 'enable_cfg': 1,
99
+ 'loss': 0,
100
+ 'special_token_loss': 0,
101
+ 'special_token_label': None,
102
+ })
103
+
104
+ sequence_plan.append({
105
+ 'type': 'vae_image',
106
+ 'enable_cfg': 0,
107
+ 'loss': 1,
108
+ 'special_token_loss': 0,
109
+ 'special_token_label': None,
110
+ })
111
+
112
+ sample = dict(
113
+ image_tensor_list=[image_tensor],
114
+ text_ids_list=text_ids_list,
115
+ num_tokens=num_tokens,
116
+ sequence_plan=sequence_plan,
117
+ data_indexes={
118
+ "data_indexes": [parquet_idx, row_group_id, row_idx],
119
+ "worker_id": worker_id,
120
+ "dataset_name": self.dataset_name,
121
+ }
122
+ )
123
+ yield sample
124
+
125
+ row_start_id = 0
126
+ row_group_start_id = 0
127
+ parquet_start_id = 0
128
+ print(f"{self.dataset_name} repeat in rank-{self.local_rank} worker-{worker_id}")
data/transforms.py ADDED
@@ -0,0 +1,287 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
2
+ # SPDX-License-Identifier: Apache-2.0
3
+
4
+ import random
5
+ from PIL import Image
6
+
7
+ import cv2
8
+ import numpy as np
9
+ import torch
10
+ from torchvision import transforms
11
+ from torchvision.transforms import functional as F
12
+ from torchvision.transforms import InterpolationMode
13
+
14
+
15
+ class MaxLongEdgeMinShortEdgeResize(torch.nn.Module):
16
+ """Resize the input image so that its longest side and shortest side are within a specified range,
17
+ ensuring that both sides are divisible by a specified stride.
18
+
19
+ Args:
20
+ max_size (int): Maximum size for the longest edge of the image.
21
+ min_size (int): Minimum size for the shortest edge of the image.
22
+ stride (int): Value by which the height and width of the image must be divisible.
23
+ max_pixels (int): Maximum pixels for the full image.
24
+ interpolation (InterpolationMode): Desired interpolation enum defined by
25
+ :class:`torchvision.transforms.InterpolationMode`. Default is ``InterpolationMode.BILINEAR``.
26
+ If input is Tensor, only ``InterpolationMode.NEAREST``, ``InterpolationMode.NEAREST_EXACT``,
27
+ ``InterpolationMode.BILINEAR``, and ``InterpolationMode.BICUBIC`` are supported.
28
+ The corresponding Pillow integer constants, e.g., ``PIL.Image.BILINEAR`` are also accepted.
29
+ antialias (bool, optional): Whether to apply antialiasing (default is True).
30
+ """
31
+
32
+ def __init__(
33
+ self,
34
+ max_size: int,
35
+ min_size: int,
36
+ stride: int,
37
+ max_pixels: int,
38
+ interpolation=InterpolationMode.BICUBIC,
39
+ antialias=True
40
+ ):
41
+ super().__init__()
42
+ self.max_size = max_size
43
+ self.min_size = min_size
44
+ self.stride = stride
45
+ self.max_pixels = max_pixels
46
+ self.interpolation = interpolation
47
+ self.antialias = antialias
48
+
49
+ def _make_divisible(self, value, stride):
50
+ """Ensure the value is divisible by the stride."""
51
+ return max(stride, int(round(value / stride) * stride))
52
+
53
+ def _apply_scale(self, width, height, scale):
54
+ new_width = round(width * scale)
55
+ new_height = round(height * scale)
56
+ new_width = self._make_divisible(new_width, self.stride)
57
+ new_height = self._make_divisible(new_height, self.stride)
58
+ return new_width, new_height
59
+
60
+ def forward(self, img, img_num=1):
61
+ """
62
+ Args:
63
+ img (PIL Image): Image to be resized.
64
+ img_num (int): Number of images, used to change max_tokens.
65
+ Returns:
66
+ PIL Image or Tensor: Rescaled image with divisible dimensions.
67
+ """
68
+ if isinstance(img, torch.Tensor):
69
+ height, width = img.shape[-2:]
70
+ else:
71
+ width, height = img.size
72
+
73
+ scale = min(self.max_size / max(width, height), 1.0)
74
+ scale = max(scale, self.min_size / min(width, height))
75
+ new_width, new_height = self._apply_scale(width, height, scale)
76
+
77
+ # Ensure the number of pixels does not exceed max_pixels
78
+ if new_width * new_height > self.max_pixels / img_num:
79
+ scale = self.max_pixels / img_num / (new_width * new_height)
80
+ new_width, new_height = self._apply_scale(new_width, new_height, scale)
81
+
82
+ # Ensure longest edge does not exceed max_size
83
+ if max(new_width, new_height) > self.max_size:
84
+ scale = self.max_size / max(new_width, new_height)
85
+ new_width, new_height = self._apply_scale(new_width, new_height, scale)
86
+
87
+ return F.resize(img, (new_height, new_width), self.interpolation, antialias=self.antialias)
88
+
89
+
90
+ class ImageTransform:
91
+ def __init__(
92
+ self,
93
+ max_image_size,
94
+ min_image_size,
95
+ image_stride,
96
+ max_pixels=14*14*9*1024,
97
+ image_mean=[0.5, 0.5, 0.5],
98
+ image_std=[0.5, 0.5, 0.5]
99
+ ):
100
+ self.stride = image_stride
101
+
102
+ self.resize_transform = MaxLongEdgeMinShortEdgeResize(
103
+ max_size=max_image_size,
104
+ min_size=min_image_size,
105
+ stride=image_stride,
106
+ max_pixels=max_pixels,
107
+ )
108
+ self.to_tensor_transform = transforms.ToTensor()
109
+ self.normalize_transform = transforms.Normalize(mean=image_mean, std=image_std, inplace=True)
110
+
111
+ def __call__(self, img, img_num=1):
112
+ img = self.resize_transform(img, img_num=img_num)
113
+ img = self.to_tensor_transform(img)
114
+ img = self.normalize_transform(img)
115
+ return img
116
+
117
+
118
+ def decolorization(image):
119
+ gray_image = image.convert('L')
120
+ return Image.merge(image.mode, [gray_image] * 3) if image.mode in ('RGB', 'L') else gray_image
121
+
122
+
123
+ def downscale(image, scale_factor):
124
+ new_width = int(round(image.width * scale_factor))
125
+ new_height = int(round(image.height * scale_factor))
126
+ new_width = max(1, new_width)
127
+ new_height = max(1, new_height)
128
+ return image.resize((new_width, new_height), resample=Image.BICUBIC)
129
+
130
+
131
+ def crop(image, crop_factors):
132
+ target_h, target_w = crop_factors
133
+ img_w, img_h = image.size
134
+
135
+ if target_h > img_h or target_w > img_w:
136
+ raise ValueError("Crop size exceeds image dimensions")
137
+
138
+ x = random.randint(0, img_w - target_w)
139
+ y = random.randint(0, img_h - target_h)
140
+
141
+ return image.crop((x, y, x + target_w, y + target_h)), [[x, y], [x + target_w, y + target_h]]
142
+
143
+
144
+ def motion_blur_opencv(image, kernel_size=15, angle=0):
145
+ # build a linear (horizontal) motion kernel
146
+ kernel = np.zeros((kernel_size, kernel_size), dtype=np.float32)
147
+ kernel[kernel_size // 2, :] = np.ones(kernel_size, dtype=np.float32)
148
+
149
+ # rotate the kernel to the requested angle
150
+ center = (kernel_size / 2 - 0.5, kernel_size / 2 - 0.5)
151
+ M = cv2.getRotationMatrix2D(center, angle, 1)
152
+ rotated_kernel = cv2.warpAffine(kernel, M, (kernel_size, kernel_size))
153
+
154
+ # normalize the kernel
155
+ rotated_kernel /= rotated_kernel.sum() if rotated_kernel.sum() != 0 else 1
156
+
157
+ img = np.array(image)
158
+ if img.ndim == 2:
159
+ blurred = cv2.filter2D(img, -1, rotated_kernel, borderType=cv2.BORDER_REFLECT)
160
+ else:
161
+ # for color images, filter each channel independently
162
+ blurred = np.zeros_like(img)
163
+ for c in range(img.shape[2]):
164
+ blurred[..., c] = cv2.filter2D(img[..., c], -1, rotated_kernel, borderType=cv2.BORDER_REFLECT)
165
+
166
+ return Image.fromarray(blurred.astype(np.uint8))
167
+
168
+
169
+ def shuffle_patch(image, num_splits, gap_size=2):
170
+ """Split the image into patches (sizes need not divide evenly), shuffle them, and reassemble with gaps between patches."""
171
+ h_splits, w_splits = num_splits
172
+ img_w, img_h = image.size
173
+
174
+ base_patch_h = img_h // h_splits
175
+ patch_heights = [base_patch_h] * (h_splits - 1)
176
+ patch_heights.append(img_h - sum(patch_heights))
177
+
178
+ base_patch_w = img_w // w_splits
179
+ patch_widths = [base_patch_w] * (w_splits - 1)
180
+ patch_widths.append(img_w - sum(patch_widths))
181
+
182
+ patches = []
183
+ current_y = 0
184
+ for i in range(h_splits):
185
+ current_x = 0
186
+ patch_h = patch_heights[i]
187
+ for j in range(w_splits):
188
+ patch_w = patch_widths[j]
189
+ patch = image.crop((current_x, current_y, current_x + patch_w, current_y + patch_h))
190
+ patches.append(patch)
191
+ current_x += patch_w
192
+ current_y += patch_h
193
+
194
+ random.shuffle(patches)
195
+
196
+ total_width = sum(patch_widths) + (w_splits - 1) * gap_size
197
+ total_height = sum(patch_heights) + (h_splits - 1) * gap_size
198
+ new_image = Image.new(image.mode, (total_width, total_height), color=(255, 255, 255))
199
+
200
+ current_y = 0 # starting Y coordinate of the current row
201
+ patch_idx = 0 # index of the patch currently being placed
202
+ for i in range(h_splits):
203
+ current_x = 0 # starting X coordinate of the current column
204
+ patch_h = patch_heights[i] # height of patches in the current row
205
+ for j in range(w_splits):
206
+ # take the next shuffled patch
207
+ patch = patches[patch_idx]
208
+ patch_w = patch_widths[j] # width of patches in the current column
209
+ # paste the patch with its top-left corner at (current_x, current_y)
210
+ new_image.paste(patch, (current_x, current_y))
211
+ # advance X (next patch starts after the current patch width plus the gap)
212
+ current_x += patch_w + gap_size
213
+ patch_idx += 1
214
+ # advance Y (next row starts after the current row height plus the gap)
215
+ current_y += patch_h + gap_size
216
+
217
+ return new_image
218
+
219
+
220
+ def inpainting(image, num_splits, blank_ratio=0.3, blank_color=(255, 255, 255)):
221
+ """
222
+ Split the image into patches, blank out a random subset, and reassemble; used for inpainting-style tasks.
223
+
224
+ Args:
225
+ image: PIL.Image, input image (RGB mode)
226
+ h_splits: int, number of patch rows (vertical splits); first element of num_splits
227
+ w_splits: int, number of patch columns (horizontal splits); second element of num_splits
228
+ blank_ratio: float, fraction of patches to blank out (0-1)
229
+ blank_color: tuple, RGB color used for blanked patches, e.g. white (255, 255, 255)
230
+
231
+ Returns:
232
+ PIL.Image: the reassembled image with the selected patches blanked
233
+ """
234
+ h_splits, w_splits = num_splits
235
+ img_w, img_h = image.size
236
+
237
+ base_patch_h = img_h // h_splits
238
+ patch_heights = [base_patch_h] * (h_splits - 1)
239
+ patch_heights.append(img_h - sum(patch_heights))
240
+
241
+ base_patch_w = img_w // w_splits
242
+ patch_widths = [base_patch_w] * (w_splits - 1)
243
+ patch_widths.append(img_w - sum(patch_widths))
244
+
245
+ patches = []
246
+ current_y = 0
247
+ for i in range(h_splits):
248
+ current_x = 0
249
+ patch_h = patch_heights[i]
250
+ for j in range(w_splits):
251
+ patch_w = patch_widths[j]
252
+ patch = image.crop((current_x, current_y, current_x + patch_w, current_y + patch_h))
253
+ patches.append(patch)
254
+ current_x += patch_w
255
+ current_y += patch_h
256
+
257
+ total_patches = h_splits * w_splits
258
+ num_blank = int(total_patches * blank_ratio)
259
+ num_blank = max(0, min(num_blank, total_patches))
260
+ blank_indices = random.sample(range(total_patches), num_blank)
261
+
262
+ processed_patches = []
263
+ for idx, patch in enumerate(patches):
264
+ if idx in blank_indices:
265
+ blank_patch = Image.new("RGB", patch.size, color=blank_color)
266
+ processed_patches.append(blank_patch)
267
+ else:
268
+ processed_patches.append(patch)
269
+
270
+ # create the output image (same size as the input)
271
+ result_image = Image.new("RGB", (img_w, img_h))
272
+ current_y = 0
273
+ patch_idx = 0
274
+ for i in range(h_splits):
275
+ current_x = 0
276
+ patch_h = patch_heights[i]
277
+ for j in range(w_splits):
278
+ # take the processed patch
279
+ patch = processed_patches[patch_idx]
280
+ patch_w = patch_widths[j]
281
+ # paste it back at its original position
282
+ result_image.paste(patch, (current_x, current_y))
283
+ current_x += patch_w
284
+ patch_idx += 1
285
+ current_y += patch_h
286
+
287
+ return result_image
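For reference, the `ImageTransform` pipeline above can be exercised on its own roughly as follows (the 512/16 settings mirror the config earlier; the image path is a placeholder):

```python
# Illustrative usage of ImageTransform; "example.jpg" is a placeholder path.
from PIL import Image

vae_transform = ImageTransform(max_image_size=512, min_image_size=512, image_stride=16)
img = Image.open("example.jpg").convert("RGB")
tensor = vae_transform(img)   # stride-aligned resize, then normalization to roughly [-1, 1]
print(tensor.shape)           # e.g. torch.Size([3, 512, 512])
```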
data/video_utils.py ADDED
@@ -0,0 +1,165 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) 2023 OpenGVLab
2
+ # Copyright (c) 2025 Bytedance Ltd. and/or its affiliates.
3
+ # SPDX-License-Identifier: MIT
4
+ #
5
+ # This file has been modified by ByteDance Ltd. and/or its affiliates. on 2025-05-20.
6
+ #
7
+ # Original file was released under MIT, with the full license text
8
+ # available at https://github.com/OpenGVLab/InternVL/blob/main/LICENSE.
9
+ #
10
+ # This modified file is released under the same license.
11
+
12
+
13
+ import io
14
+ import os
15
+ import random
16
+ import re
17
+
18
+ import numpy as np
19
+ import decord
20
+ from PIL import Image
21
+
22
+
23
+ def get_frame_indices(num_frames, vlen, sample='rand', fix_start=None, input_fps=1, max_num_frames=-1):
24
+ if sample in ['rand', 'middle']: # uniform sampling
25
+ acc_samples = min(num_frames, vlen)
26
+ # split the video into `acc_samples` intervals, and sample from each interval.
27
+ intervals = np.linspace(start=0, stop=vlen, num=acc_samples + 1).astype(int)
28
+ ranges = []
29
+ for idx, interv in enumerate(intervals[:-1]):
30
+ ranges.append((interv, intervals[idx + 1] - 1))
31
+ if sample == 'rand':
32
+ try:
33
+ frame_indices = [random.choice(range(x[0], x[1])) for x in ranges]
34
+ except:
35
+ frame_indices = np.random.permutation(vlen)[:acc_samples]
36
+ frame_indices.sort()
37
+ frame_indices = list(frame_indices)
38
+ elif fix_start is not None:
39
+ frame_indices = [x[0] + fix_start for x in ranges]
40
+ elif sample == 'middle':
41
+ frame_indices = [(x[0] + x[1]) // 2 for x in ranges]
42
+ else:
43
+ raise NotImplementedError
44
+
45
+ if len(frame_indices) < num_frames: # padded with last frame
46
+ padded_frame_indices = [frame_indices[-1]] * num_frames
47
+ padded_frame_indices[:len(frame_indices)] = frame_indices
48
+ frame_indices = padded_frame_indices
49
+ elif 'fps' in sample: # fps0.5, sequentially sample frames at 0.5 fps
50
+ output_fps = float(sample[3:])
51
+ duration = float(vlen) / input_fps
52
+ delta = 1 / output_fps # gap between frames, this is also the clip length each frame represents
53
+ frame_seconds = np.arange(0 + delta / 2, duration + delta / 2, delta)
54
+ frame_indices = np.around(frame_seconds * input_fps).astype(int)
55
+ frame_indices = [e for e in frame_indices if e < vlen]
56
+ if max_num_frames > 0 and len(frame_indices) > max_num_frames:
57
+ frame_indices = frame_indices[:max_num_frames]
58
+ else:
59
+ raise ValueError
60
+ return frame_indices
61
+
62
+
63
+ def read_frames_decord(video_path, num_frames, sample='rand', fix_start=None, clip=None, min_num_frames=4):
64
+ video_reader = decord.VideoReader(video_path, num_threads=1)
65
+ vlen = len(video_reader)
66
+ fps = video_reader.get_avg_fps()
67
+ duration = vlen / float(fps)
68
+ if clip:
69
+ start, end = clip
70
+ duration = end - start
71
+ vlen = int(duration * fps)
72
+ start_index = int(start * fps)
73
+
74
+ t_num_frames = np.random.randint(min_num_frames, num_frames + 1)
75
+
76
+ frame_indices = get_frame_indices(
77
+ t_num_frames, vlen, sample=sample, fix_start=fix_start,
78
+ input_fps=fps
79
+ )
80
+ if clip:
81
+ frame_indices = [f + start_index for f in frame_indices]
82
+ frames = video_reader.get_batch(frame_indices).asnumpy() # (T, H, W, C), np.uint8
83
+ frames = [Image.fromarray(frames[i]) for i in range(frames.shape[0])]
84
+ return frames
85
+
86
+
87
+ def extract_frame_number(filename):
88
+ # Extract the numeric part from the filename using regular expressions
89
+ match = re.search(r'_(\d+).jpg$', filename)
90
+ return int(match.group(1)) if match else -1
91
+
92
+
93
+ def sort_frames(frame_paths):
94
+ # Extract filenames from each path and sort by their numeric part
95
+ return sorted(frame_paths, key=lambda x: extract_frame_number(os.path.basename(x)))
96
+
97
+
98
+ def read_frames_folder(video_path, num_frames, sample='rand', fix_start=None, min_num_frames=4):
99
+ image_list = sort_frames(list(os.listdir(video_path)))
100
+ frames = []
101
+ for image in image_list:
102
+ fp = os.path.join(video_path, image)
103
+ frame = Image.open(fp).convert('RGB')
104
+ frames.append(frame)
105
+ vlen = len(frames)
106
+
107
+ t_num_frames = np.random.randint(min_num_frames, num_frames + 1)
108
+
109
+ if vlen > t_num_frames:
110
+ frame_indices = get_frame_indices(
111
+ t_num_frames, vlen, sample=sample, fix_start=fix_start
112
+ )
113
+ frames = [frames[i] for i in frame_indices]
114
+ return frames
115
+
116
+
117
+ class FrameSampler:
118
+ def __init__(self, max_num_frames=-1, min_num_frames=8, sample='rand'):
119
+ self.max_num_frames = max_num_frames
120
+ self.min_num_frames = min_num_frames
121
+ self.sample = sample
122
+
123
+ def __call__(self, file_name):
124
+ fn = read_frames_folder if file_name.endswith('/') else read_frames_decord
125
+ frames = fn(file_name, num_frames=self.max_num_frames, min_num_frames=self.min_num_frames, sample=self.sample)
126
+ return frames
127
+
128
+
129
+ def decode_video_byte(video_bytes):
130
+ video_stream = io.BytesIO(video_bytes)
131
+ vr = decord.VideoReader(video_stream)
132
+ return vr
133
+
134
+
135
+ def sample_mp4_frames(mp4_p, n_frames=None, fps=None, return_frame_indices=False, random_sample=False):
136
+ if isinstance(mp4_p, str):
137
+ vr = decord.VideoReader(mp4_p, num_threads=1)
138
+ elif isinstance(mp4_p, decord.video_reader.VideoReader):
139
+ vr = mp4_p
140
+ video_fps = vr.get_avg_fps() # get the average video frame rate
141
+ video_duration = len(vr) / video_fps
142
+ if n_frames is not None:
143
+ if random_sample:
144
+ frame_indices = sorted(random.sample(range(len(vr)), n_frames))
145
+ else:
146
+ frame_indices = np.linspace(0, len(vr)-1, n_frames, dtype=int).tolist()
147
+ else:
148
+ frame_indices = [int(i) for i in np.arange(0, len(vr)-1, video_fps/fps)]
149
+ frames = vr.get_batch(frame_indices).asnumpy() # convert to a numpy array
150
+ frames = [Image.fromarray(frame).convert("RGB") for frame in frames]
151
+ if not return_frame_indices:
152
+ return frames, video_duration
153
+ else:
154
+ return frames, video_duration, frame_indices
155
+
156
+
157
+ def sample_mp4_frames_by_indices(mp4_p, frame_indices: list):
158
+ if isinstance(mp4_p, str):
159
+ vr = decord.VideoReader(mp4_p, num_threads=1)
160
+ elif isinstance(mp4_p, decord.video_reader.VideoReader):
161
+ vr = mp4_p
162
+ # sample the frames in frame_indices
163
+ frames = vr.get_batch(frame_indices).asnumpy() # convert to a numpy array
164
+ frames = [Image.fromarray(frame).convert("RGB") for frame in frames]
165
+ return frames
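A minimal usage sketch for the `FrameSampler` helper above (the video path is a placeholder):

```python
# Illustrative usage of FrameSampler; "example_video.mp4" is a placeholder path.
sampler = FrameSampler(max_num_frames=12, min_num_frames=8, sample="rand")
frames = sampler("example_video.mp4")  # a list of PIL.Image frames, between 8 and 12 of them
print(len(frames))
```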
data/vlm_dataset.py ADDED
@@ -0,0 +1,195 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
2
+ # SPDX-License-Identifier: Apache-2.0
3
+
4
+ import json
5
+ import os
6
+ import traceback
7
+ from PIL import Image, ImageFile, PngImagePlugin
8
+
9
+ from .data_utils import pil_img2rgb
10
+ from .distributed_iterable_dataset import DistributedIterableDataset
11
+
12
+
13
+ Image.MAX_IMAGE_PIXELS = 200000000
14
+ ImageFile.LOAD_TRUNCATED_IMAGES = True
15
+ MaximumDecompressedSize = 1024
16
+ MegaByte = 2 ** 20
17
+ PngImagePlugin.MAX_TEXT_CHUNK = MaximumDecompressedSize * MegaByte
18
+
19
+
20
+ class SftJSONLIterableDataset(DistributedIterableDataset):
21
+ def __init__(
22
+ self, dataset_name, transform, tokenizer, frame_sampler,
23
+ jsonl_path_list, data_dir_list, num_used_data,
24
+ local_rank=0, world_size=1, num_workers=8, data_status=None,
25
+ shuffle_lines=False, shuffle_seed=0,
26
+ ):
27
+ """
28
+ jsonl_path_list: list of jsonl file paths
29
+ data_dir_list: list of image directories containing the images of each jsonl file
30
+ num_used_data: list of number of sampled data points for each jsonl
31
+ """
32
+ super().__init__(dataset_name, local_rank, world_size, num_workers)
33
+ self.transform = transform
34
+ self.tokenizer = tokenizer
35
+ self.frame_sampler = frame_sampler
36
+ self.data_status = data_status
37
+ self.data_paths = self.get_data_paths(
38
+ jsonl_path_list,
39
+ data_dir_list,
40
+ num_used_data,
41
+ shuffle_lines,
42
+ shuffle_seed,
43
+ )
44
+ self.set_epoch()
45
+
46
+ def get_data_paths(
47
+ self,
48
+ jsonl_path_list,
49
+ data_dir_list,
50
+ num_used_data,
51
+ shuffle_lines,
52
+ shuffle_seed,
53
+ ):
54
+ data_paths = []
55
+ for jsonl_path, image_dir, num_data_point in zip(
56
+ jsonl_path_list, data_dir_list, num_used_data
57
+ ):
58
+ with open(jsonl_path, 'r') as f:
59
+ raw_data = f.readlines()
60
+ if shuffle_lines:
61
+ self.rng.seed(shuffle_seed)
62
+ self.rng.shuffle(raw_data)
63
+ raw_data = raw_data[:num_data_point]
64
+ data_paths.extend([(json_data, image_dir) for json_data in raw_data])
65
+ return data_paths
66
+
67
+ def change_format(self, data, num_images):
68
+ elements = []
69
+ for conversation in data['conversations']:
70
+ if conversation['from'] == 'human':
71
+ if '<image>' not in conversation['value']:
72
+ elements.append({
73
+ 'type': 'text',
74
+ 'has_loss': 0,
75
+ 'text': conversation['value'],
76
+ })
77
+ else:
78
+ text_list = conversation['value'].split('<image>')
79
+ for idx, text in enumerate(text_list):
80
+ if text.strip() != '':
81
+ elements.append({
82
+ 'type': 'text',
83
+ 'has_loss': 0,
84
+ 'text': text.strip(),
85
+ })
86
+ if (idx != len(text_list) - 1) and (idx < num_images):
87
+ elements.append({'type': 'image',})
88
+ elif conversation['from'] == 'gpt':
89
+ elements.append({
90
+ 'type': 'text',
91
+ 'has_loss': 1,
92
+ 'text': conversation['value'],
93
+ })
94
+ return elements
95
+
96
+ def __iter__(self):
97
+ data_paths_per_worker, worker_id = self.get_data_paths_per_worker()
98
+ if self.data_status is not None:
99
+ row_start_id = self.data_status[worker_id] + 1
100
+ else:
101
+ row_start_id = 0
102
+ transform_stride = self.transform.stride
103
+
104
+ print(
105
+ f"rank-{self.local_rank} worker-{worker_id} dataset-{self.dataset_name}: "
106
+ f"resuming data at row#{row_start_id}"
107
+ )
108
+
109
+ while True:
110
+ data_paths_per_worker_ = data_paths_per_worker[row_start_id:]
111
+ for row_idx, (data, image_dir) in enumerate(data_paths_per_worker_, start=row_start_id):
112
+ num_tokens = 0
113
+ image_tensor_list = []
114
+ text_ids_list = []
115
+ sequence_plan = []
116
+
117
+ try:
118
+ data_item = json.loads(data)
119
+ raw_images = None
120
+ if 'image' in data_item:
121
+ if type(data_item['image']) == list:
122
+ raw_images = [
123
+ pil_img2rgb(Image.open(os.path.join(image_dir, image)))
124
+ for image in data_item['image']
125
+ ]
126
+ else:
127
+ raw_images = [
128
+ pil_img2rgb(Image.open(os.path.join(image_dir, data_item['image'])))
129
+ ]
130
+ elif 'video' in data_item:
131
+ raw_images = self.frame_sampler(os.path.join(image_dir, data_item['video']))
132
+ special_tokens = '<image>' * len(raw_images)
133
+ for item in data_item['conversations']:
134
+ if '<video>' in item['value']:
135
+ item['value'] = item['value'].replace('<video>', special_tokens)
136
+ break
137
+ else:
138
+ raise ValueError("Cannot find <video> in the conversation!")
139
+ except:
140
+ traceback.print_exc()
141
+ continue
142
+
143
+ if raw_images:
144
+ for raw_image in raw_images:
145
+ image_tensor = self.transform(raw_image, img_num=len(raw_images))
146
+ image_tensor_list.append(image_tensor)
147
+ height, width = image_tensor.shape[1:]
148
+ num_tokens += width * height // transform_stride ** 2
149
+
150
+ elements = self.change_format(data_item, len(image_tensor_list))
151
+
152
+ for item in elements:
153
+ if item['type'] == 'text':
154
+ text_data = item['text']
155
+ text_ids = self.tokenizer.encode(text_data)
156
+ if len(text_ids) > 0:
157
+ text_ids_list.append(text_ids)
158
+ num_tokens += len(text_ids)
159
+ current_plan = {
160
+ 'type': 'text',
161
+ 'enable_cfg': 0,
162
+ 'loss': item['has_loss'],
163
+ 'special_token_loss': 0,
164
+ 'special_token_label': None,
165
+ }
166
+ sequence_plan.append(current_plan)
167
+ elif item['type'] == 'image':
168
+ current_plan = {
169
+ 'type': 'vit_image',
170
+ 'enable_cfg': 0,
171
+ 'loss': 0,
172
+ 'special_token_loss': 0,
173
+ 'special_token_label': None,
174
+ }
175
+ sequence_plan.append(current_plan)
176
+
177
+ has_loss = [item['loss'] for item in sequence_plan]
178
+ if sum(has_loss) == 0:
179
+ print(f'No loss defined, skipped.')
180
+ continue
181
+
182
+ yield dict(
183
+ image_tensor_list=image_tensor_list,
184
+ text_ids_list=text_ids_list,
185
+ sequence_plan=sequence_plan,
186
+ num_tokens=num_tokens,
187
+ data_indexes={
188
+ "data_indexes": row_idx,
189
+ "worker_id": worker_id,
190
+ "dataset_name": self.dataset_name,
191
+ }
192
+ )
193
+
194
+ row_start_id = 0
195
+ print(f"{self.dataset_name} repeat in rank-{self.local_rank} worker-{worker_id}")
download_model.py ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from huggingface_hub import snapshot_download
2
+
3
+ HF_HOME = "/mnt/wsfuse/kaiyuyue/cache/huggingface"
4
+ repo_id = "multimodal-reasoning-lab/Bagel-Zebra-CoT"
5
+
6
+ snapshot_download(
7
+ cache_dir=HF_HOME,
8
+ repo_id=repo_id,
9
+ local_dir_use_symlinks=False,
10
+ resume_download=True,
11
+ allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
12
+ )
inference.ipynb ADDED
@@ -0,0 +1,535 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "metadata": {
7
+ "tags": []
8
+ },
9
+ "outputs": [],
10
+ "source": [
11
+ "# Copyright 2025 Bytedance Ltd. and/or its affiliates.\n",
12
+ "# SPDX-License-Identifier: Apache-2.0"
13
+ ]
14
+ },
15
+ {
16
+ "cell_type": "code",
17
+ "execution_count": null,
18
+ "metadata": {
19
+ "tags": []
20
+ },
21
+ "outputs": [],
22
+ "source": [
23
+ "%load_ext autoreload\n",
24
+ "%autoreload 2"
25
+ ]
26
+ },
27
+ {
28
+ "cell_type": "code",
29
+ "execution_count": null,
30
+ "metadata": {
31
+ "tags": []
32
+ },
33
+ "outputs": [],
34
+ "source": [
35
+ "import os\n",
36
+ "from copy import deepcopy\n",
37
+ "from typing import (\n",
38
+ " Any,\n",
39
+ " AsyncIterable,\n",
40
+ " Callable,\n",
41
+ " Dict,\n",
42
+ " Generator,\n",
43
+ " List,\n",
44
+ " NamedTuple,\n",
45
+ " Optional,\n",
46
+ " Tuple,\n",
47
+ " Union,\n",
48
+ ")\n",
49
+ "import requests\n",
50
+ "from io import BytesIO\n",
51
+ "\n",
52
+ "from PIL import Image\n",
53
+ "import torch\n",
54
+ "from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch, init_empty_weights\n",
55
+ "\n",
56
+ "from data.transforms import ImageTransform\n",
57
+ "from data.data_utils import pil_img2rgb, add_special_tokens\n",
58
+ "from modeling.bagel import (\n",
59
+ " BagelConfig, Bagel, Qwen2Config, Qwen2ForCausalLM, SiglipVisionConfig, SiglipVisionModel\n",
60
+ ")\n",
61
+ "from modeling.qwen2 import Qwen2Tokenizer\n",
62
+ "from modeling.bagel.qwen2_navit import NaiveCache\n",
63
+ "from modeling.autoencoder import load_ae\n",
64
+ "from safetensors.torch import load_file"
65
+ ]
66
+ },
67
+ {
68
+ "cell_type": "markdown",
69
+ "metadata": {},
70
+ "source": [
71
+ "## Model Initialization"
72
+ ]
73
+ },
74
+ {
75
+ "cell_type": "code",
76
+ "execution_count": null,
77
+ "metadata": {
78
+ "tags": []
79
+ },
80
+ "outputs": [],
81
+ "source": [
82
+ "model_path = \"/path/to/BAGEL-7B-MoT/weights\" # Download from https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT\n",
83
+ "\n",
84
+ "# LLM config preparing\n",
85
+ "llm_config = Qwen2Config.from_json_file(os.path.join(model_path, \"llm_config.json\"))\n",
86
+ "llm_config.qk_norm = True\n",
87
+ "llm_config.tie_word_embeddings = False\n",
88
+ "llm_config.layer_module = \"Qwen2MoTDecoderLayer\"\n",
89
+ "\n",
90
+ "# ViT config preparing\n",
91
+ "vit_config = SiglipVisionConfig.from_json_file(os.path.join(model_path, \"vit_config.json\"))\n",
92
+ "vit_config.rope = False\n",
93
+ "vit_config.num_hidden_layers = vit_config.num_hidden_layers - 1\n",
94
+ "\n",
95
+ "# VAE loading\n",
96
+ "vae_model, vae_config = load_ae(local_path=os.path.join(model_path, \"ae.safetensors\"))\n",
97
+ "\n",
98
+ "# Bagel config preparing\n",
99
+ "config = BagelConfig(\n",
100
+ " visual_gen=True,\n",
101
+ " visual_und=True,\n",
102
+ " llm_config=llm_config, \n",
103
+ " vit_config=vit_config,\n",
104
+ " vae_config=vae_config,\n",
105
+ " vit_max_num_patch_per_side=70,\n",
106
+ " connector_act='gelu_pytorch_tanh',\n",
107
+ " latent_patch_size=2,\n",
108
+ " max_latent_size=64,\n",
109
+ ")\n",
110
+ "\n",
111
+ "with init_empty_weights():\n",
112
+ " language_model = Qwen2ForCausalLM(llm_config)\n",
113
+ " vit_model = SiglipVisionModel(vit_config)\n",
114
+ " model = Bagel(language_model, vit_model, config)\n",
115
+ " model.vit_model.vision_model.embeddings.convert_conv2d_to_linear(vit_config, meta=True)\n",
116
+ "\n",
117
+ "# Tokenizer Preparing\n",
118
+ "tokenizer = Qwen2Tokenizer.from_pretrained(model_path)\n",
119
+ "tokenizer, new_token_ids, _ = add_special_tokens(tokenizer)\n",
120
+ "\n",
121
+ "# Image Transform Preparing\n",
122
+ "vae_transform = ImageTransform(1024, 512, 16)\n",
123
+ "vit_transform = ImageTransform(980, 224, 14)"
124
+ ]
125
+ },
126
+ {
127
+ "cell_type": "markdown",
128
+ "metadata": {},
129
+ "source": [
130
+ "## Model Loading and Multi-GPU Inference Preparation"
131
+ ]
132
+ },
133
+ {
134
+ "cell_type": "code",
135
+ "execution_count": null,
136
+ "metadata": {
137
+ "tags": []
138
+ },
139
+ "outputs": [],
140
+ "source": [
141
+ "max_mem_per_gpu = \"80GiB\" # Modify it according to your GPU setting. On an A100, 80 GiB is sufficient to load on a single GPU.\n",
142
+ "\n",
143
+ "device_map = infer_auto_device_map(\n",
144
+ " model,\n",
145
+ " max_memory={i: max_mem_per_gpu for i in range(torch.cuda.device_count())},\n",
146
+ " no_split_module_classes=[\"Bagel\", \"Qwen2MoTDecoderLayer\"],\n",
147
+ ")\n",
148
+ "print(device_map)\n",
149
+ "\n",
150
+ "same_device_modules = [\n",
151
+ " 'language_model.model.embed_tokens',\n",
152
+ " 'time_embedder',\n",
153
+ " 'latent_pos_embed',\n",
154
+ " 'vae2llm',\n",
155
+ " 'llm2vae',\n",
156
+ " 'connector',\n",
157
+ " 'vit_pos_embed'\n",
158
+ "]\n",
159
+ "\n",
160
+ "if torch.cuda.device_count() == 1:\n",
161
+ " first_device = device_map.get(same_device_modules[0], \"cuda:0\")\n",
162
+ " for k in same_device_modules:\n",
163
+ " if k in device_map:\n",
164
+ " device_map[k] = first_device\n",
165
+ " else:\n",
166
+ " device_map[k] = \"cuda:0\"\n",
167
+ "else:\n",
168
+ " first_device = device_map.get(same_device_modules[0])\n",
169
+ " for k in same_device_modules:\n",
170
+ " if k in device_map:\n",
171
+ " device_map[k] = first_device\n",
172
+ "\n",
173
+ "# Thanks @onion-liu: https://github.com/ByteDance-Seed/Bagel/pull/8\n",
174
+ "model = load_checkpoint_and_dispatch(\n",
175
+ " model,\n",
176
+ " checkpoint=os.path.join(model_path, \"ema.safetensors\"),\n",
177
+ " device_map=device_map,\n",
178
+ " offload_buffers=True,\n",
179
+ " dtype=torch.bfloat16,\n",
180
+ " force_hooks=True,\n",
181
+ " offload_folder=\"/tmp/offload\"\n",
182
+ ")\n",
183
+ "\n",
184
+ "model = model.eval()\n",
185
+ "print('Model loaded')"
186
+ ]
187
+ },
188
+ {
189
+ "cell_type": "code",
190
+ "execution_count": null,
191
+ "metadata": {},
192
+ "outputs": [],
193
+ "source": []
194
+ },
195
+ {
196
+ "cell_type": "markdown",
197
+ "metadata": {},
198
+ "source": [
199
+ "## Inferencer Preparing "
200
+ ]
201
+ },
202
+ {
203
+ "cell_type": "code",
204
+ "execution_count": null,
205
+ "metadata": {
206
+ "tags": []
207
+ },
208
+ "outputs": [],
209
+ "source": [
210
+ "from inferencer import InterleaveInferencer\n",
211
+ "\n",
212
+ "inferencer = InterleaveInferencer(\n",
213
+ " model=model, \n",
214
+ " vae_model=vae_model, \n",
215
+ " tokenizer=tokenizer, \n",
216
+ " vae_transform=vae_transform, \n",
217
+ " vit_transform=vit_transform, \n",
218
+ " new_token_ids=new_token_ids\n",
219
+ ")"
220
+ ]
221
+ },
222
+ {
223
+ "cell_type": "code",
224
+ "execution_count": null,
225
+ "metadata": {
226
+ "tags": []
227
+ },
228
+ "outputs": [],
229
+ "source": [
230
+ "import random\n",
231
+ "import numpy as np\n",
232
+ "\n",
233
+ "seed = 42\n",
234
+ "random.seed(seed)\n",
235
+ "np.random.seed(seed)\n",
236
+ "torch.manual_seed(seed)\n",
237
+ "if torch.cuda.is_available():\n",
238
+ " torch.cuda.manual_seed(seed)\n",
239
+ " torch.cuda.manual_seed_all(seed)\n",
240
+ "torch.backends.cudnn.deterministic = True\n",
241
+ "torch.backends.cudnn.benchmark = False"
242
+ ]
243
+ },
244
+ {
245
+ "cell_type": "markdown",
246
+ "metadata": {},
247
+ "source": [
248
+ "**About Inference Hyperparameters:**\n",
249
+ "- **`cfg_text_scale`:** Controls how strongly the model follows the text prompt. `1.0` disables text guidance. Typical range: `4.0–8.0`.\n",
250
+ "- **`cfg_image_scale`:** Controls how much the model preserves input image details. `1.0` disables image guidance. Typical range: `1.0–2.0`.\n",
251
+ "- **`cfg_interval`:** Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: `[0.4, 1.0]`.\n",
252
+ "- **`timestep_shift`:** Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).\n",
253
+ "- **`num_timesteps`:** Total denoising steps. Typical: `50`.\n",
254
+ "- **`cfg_renorm_min`:** Minimum value for CFG-Renorm. `1.0` disables renorm. Typical: `0`.\n",
255
+ "- **`cfg_renorm_type`:** CFG-Renorm method: \n",
256
+ " - `global`: Normalize over all tokens and channels (default for T2I).\n",
257
+ " - `channel`: Normalize across channels for each token.\n",
258
+ " - `text_channel`: Like `channel`, but only applies to text condition (good for editing, may cause blur).\n",
259
+ "- **If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min` or decrease `cfg_scale`.**\n"
260
+ ]
261
+ },
262
+ {
263
+ "cell_type": "markdown",
264
+ "metadata": {},
265
+ "source": [
266
+ "## Image Generation"
267
+ ]
268
+ },
269
+ {
270
+ "cell_type": "code",
271
+ "execution_count": null,
272
+ "metadata": {
273
+ "tags": []
274
+ },
275
+ "outputs": [],
276
+ "source": [
277
+ "inference_hyper=dict(\n",
278
+ " cfg_text_scale=4.0,\n",
279
+ " cfg_img_scale=1.0,\n",
280
+ " cfg_interval=[0.4, 1.0],\n",
281
+ " timestep_shift=3.0,\n",
282
+ " num_timesteps=50,\n",
283
+ " cfg_renorm_min=0.0,\n",
284
+ " cfg_renorm_type=\"global\",\n",
285
+ ")"
286
+ ]
287
+ },
288
+ {
289
+ "cell_type": "code",
290
+ "execution_count": null,
291
+ "metadata": {
292
+ "tags": []
293
+ },
294
+ "outputs": [],
295
+ "source": [
296
+ "prompt = \"A female cosplayer portraying an ethereal fairy or elf, wearing a flowing dress made of delicate fabrics in soft, mystical colors like emerald green and silver. She has pointed ears, a gentle, enchanting expression, and her outfit is adorned with sparkling jewels and intricate patterns. The background is a magical forest with glowing plants, mystical creatures, and a serene atmosphere.\"\n",
297
+ "\n",
298
+ "print(prompt)\n",
299
+ "print('-' * 10)\n",
300
+ "output_dict = inferencer(text=prompt, **inference_hyper)\n",
301
+ "display(output_dict['image'])"
302
+ ]
303
+ },
304
+ {
305
+ "cell_type": "markdown",
306
+ "metadata": {
307
+ "tags": []
308
+ },
309
+ "source": [
310
+ "## Image Generation with Think"
311
+ ]
312
+ },
313
+ {
314
+ "cell_type": "code",
315
+ "execution_count": null,
316
+ "metadata": {
317
+ "tags": []
318
+ },
319
+ "outputs": [],
320
+ "source": [
321
+ "inference_hyper=dict(\n",
322
+ " max_think_token_n=1000,\n",
323
+ " do_sample=False,\n",
324
+ " # text_temperature=0.3,\n",
325
+ " cfg_text_scale=4.0,\n",
326
+ " cfg_img_scale=1.0,\n",
327
+ " cfg_interval=[0.4, 1.0],\n",
328
+ " timestep_shift=3.0,\n",
329
+ " num_timesteps=50,\n",
330
+ " cfg_renorm_min=0.0,\n",
331
+ " cfg_renorm_type=\"global\",\n",
332
+ ")"
333
+ ]
334
+ },
335
+ {
336
+ "cell_type": "code",
337
+ "execution_count": null,
338
+ "metadata": {
339
+ "tags": []
340
+ },
341
+ "outputs": [],
342
+ "source": [
343
+ "prompt = 'a car made of small cars'\n",
344
+ "\n",
345
+ "print(prompt)\n",
346
+ "print('-' * 10)\n",
347
+ "output_dict = inferencer(text=prompt, think=True, **inference_hyper)\n",
348
+ "print(output_dict['text'])\n",
349
+ "display(output_dict['image'])"
350
+ ]
351
+ },
352
+ {
353
+ "cell_type": "code",
354
+ "execution_count": null,
355
+ "metadata": {},
356
+ "outputs": [],
357
+ "source": []
358
+ },
359
+ {
360
+ "cell_type": "markdown",
361
+ "metadata": {},
362
+ "source": [
363
+ "## Editing"
364
+ ]
365
+ },
366
+ {
367
+ "cell_type": "code",
368
+ "execution_count": null,
369
+ "metadata": {
370
+ "tags": []
371
+ },
372
+ "outputs": [],
373
+ "source": [
374
+ "inference_hyper=dict(\n",
375
+ " cfg_text_scale=4.0,\n",
376
+ " cfg_img_scale=2.0,\n",
377
+ " cfg_interval=[0.0, 1.0],\n",
378
+ " timestep_shift=3.0,\n",
379
+ " num_timesteps=50,\n",
380
+ " cfg_renorm_min=0.0,\n",
381
+ " cfg_renorm_type=\"text_channel\",\n",
382
+ ")"
383
+ ]
384
+ },
385
+ {
386
+ "cell_type": "code",
387
+ "execution_count": null,
388
+ "metadata": {
389
+ "tags": []
390
+ },
391
+ "outputs": [],
392
+ "source": [
393
+ "image = Image.open('test_images/women.jpg')\n",
394
+ "prompt = 'She boards a modern subway, quietly reading a folded newspaper, wearing the same clothes.'\n",
395
+ "\n",
396
+ "display(image)\n",
397
+ "print(prompt)\n",
398
+ "print('-'*10)\n",
399
+ "output_dict = inferencer(image=image, text=prompt, **inference_hyper)\n",
400
+ "display(output_dict['image'])"
401
+ ]
402
+ },
403
+ {
404
+ "cell_type": "code",
405
+ "execution_count": null,
406
+ "metadata": {},
407
+ "outputs": [],
408
+ "source": []
409
+ },
410
+ {
411
+ "cell_type": "markdown",
412
+ "metadata": {},
413
+ "source": [
414
+ "## Edit with Think"
415
+ ]
416
+ },
417
+ {
418
+ "cell_type": "code",
419
+ "execution_count": null,
420
+ "metadata": {
421
+ "tags": []
422
+ },
423
+ "outputs": [],
424
+ "source": [
425
+ "inference_hyper=dict(\n",
426
+ " max_think_token_n=1000,\n",
427
+ " do_sample=False,\n",
428
+ " # text_temperature=0.3,\n",
429
+ " cfg_text_scale=4.0,\n",
430
+ " cfg_img_scale=2.0,\n",
431
+ " cfg_interval=[0.0, 1.0],\n",
432
+ " timestep_shift=3.0,\n",
433
+ " num_timesteps=50,\n",
434
+ " cfg_renorm_min=0.0,\n",
435
+ " cfg_renorm_type=\"text_channel\",\n",
436
+ ")"
437
+ ]
438
+ },
439
+ {
440
+ "cell_type": "code",
441
+ "execution_count": null,
442
+ "metadata": {
443
+ "tags": []
444
+ },
445
+ "outputs": [],
446
+ "source": [
447
+ "image = Image.open('test_images/octupusy.jpg')\n",
448
+ "prompt = 'Could you display the sculpture that takes after this design?'\n",
449
+ "\n",
450
+ "display(image)\n",
451
+ "print('-'*10)\n",
452
+ "output_dict = inferencer(image=image, text=prompt, think=True, **inference_hyper)\n",
453
+ "print(output_dict['text'])\n",
454
+ "display(output_dict['image'])"
455
+ ]
456
+ },
457
+ {
458
+ "cell_type": "code",
459
+ "execution_count": null,
460
+ "metadata": {},
461
+ "outputs": [],
462
+ "source": []
463
+ },
464
+ {
465
+ "cell_type": "markdown",
466
+ "metadata": {},
467
+ "source": [
468
+ "## Understanding"
469
+ ]
470
+ },
471
+ {
472
+ "cell_type": "code",
473
+ "execution_count": null,
474
+ "metadata": {
475
+ "tags": []
476
+ },
477
+ "outputs": [],
478
+ "source": [
479
+ "inference_hyper=dict(\n",
480
+ " max_think_token_n=1000,\n",
481
+ " do_sample=False,\n",
482
+ " # text_temperature=0.3,\n",
483
+ ")"
484
+ ]
485
+ },
486
+ {
487
+ "cell_type": "code",
488
+ "execution_count": null,
489
+ "metadata": {
490
+ "tags": []
491
+ },
492
+ "outputs": [],
493
+ "source": [
494
+ "image = Image.open('test_images/meme.jpg')\n",
495
+ "prompt = \"Can someone explain what’s funny about this meme??\"\n",
496
+ "\n",
497
+ "display(image)\n",
498
+ "print(prompt)\n",
499
+ "print('-'*10)\n",
500
+ "output_dict = inferencer(image=image, text=prompt, understanding_output=True, **inference_hyper)\n",
501
+ "print(output_dict['text'])"
502
+ ]
503
+ },
504
+ {
505
+ "cell_type": "code",
506
+ "execution_count": null,
507
+ "metadata": {},
508
+ "outputs": [],
509
+ "source": []
510
+ }
511
+ ],
512
+ "metadata": {
513
+ "fileId": "1bfaa82d-51b0-4c13-9e4c-295ba28bcd8a",
514
+ "filePath": "/mnt/bn/seed-aws-va/chaorui/code/cdt-hf/notebooks/chat.ipynb",
515
+ "kernelspec": {
516
+ "display_name": "Python 3 (ipykernel)",
517
+ "language": "python",
518
+ "name": "python3"
519
+ },
520
+ "language_info": {
521
+ "codemirror_mode": {
522
+ "name": "ipython",
523
+ "version": 3
524
+ },
525
+ "file_extension": ".py",
526
+ "mimetype": "text/x-python",
527
+ "name": "python",
528
+ "nbconvert_exporter": "python",
529
+ "pygments_lexer": "ipython3",
530
+ "version": "3.11.2"
531
+ }
532
+ },
533
+ "nbformat": 4,
534
+ "nbformat_minor": 4
535
+ }
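For quick reference, the notebook above exercises two inference presets that differ only in a few fields. The sketch below restates them side by side; the values are copied from the cells, and the names `t2i_hyper` / `edit_hyper` are illustrative rather than part of the repo.

```python
# Presets used in the notebook above (values copied from the cells; names are illustrative).
t2i_hyper = dict(                      # text-to-image generation
    cfg_text_scale=4.0,                # follow the text prompt
    cfg_img_scale=1.0,                 # 1.0 disables image guidance (no input image)
    cfg_interval=[0.4, 1.0],           # apply CFG only on the later denoising steps
    timestep_shift=3.0,
    num_timesteps=50,
    cfg_renorm_min=0.0,
    cfg_renorm_type="global",          # default renorm for T2I
)

edit_hyper = dict(                     # image editing
    cfg_text_scale=4.0,
    cfg_img_scale=2.0,                 # preserve details of the input image
    cfg_interval=[0.0, 1.0],           # CFG over the whole denoising schedule
    timestep_shift=3.0,
    num_timesteps=50,
    cfg_renorm_min=0.0,
    cfg_renorm_type="text_channel",    # editing-friendly; switch to "global" if results blur
)

# e.g. output_dict = inferencer(text=prompt, **t2i_hyper)
#      output_dict = inferencer(image=image, text=prompt, **edit_hyper)
```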
inferencer.py ADDED
@@ -0,0 +1,300 @@
1
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
2
+ # SPDX-License-Identifier: Apache-2.0
3
+
4
+ from copy import deepcopy
5
+ from typing import List, Dict, Optional, Union, Any
6
+
7
+ from PIL import Image
8
+ import torch
9
+
10
+ from data.data_utils import pil_img2rgb
11
+ from modeling.bagel.qwen2_navit import NaiveCache
12
+
13
+
14
+
15
+ VLM_THINK_SYSTEM_PROMPT = '''Generation Instructions: You should first think about the reasoning process in the mind and then provide the user with the answer.
16
+ The reasoning process is enclosed within <think> </think> tags, i.e. <think> reasoning process here </think> answer here'''
17
+
18
+ GEN_THINK_SYSTEM_PROMPT = '''Generation Instructions: You should first think about the planning process in the mind and then generate the image.
19
+ The planning process is enclosed within <think> </think> tags, i.e. <think> planning process here </think> image here'''
20
+
21
+
22
+ class InterleaveInferencer:
23
+ def __init__(self, model, vae_model, tokenizer, vae_transform, vit_transform, new_token_ids):
24
+ self.model = model
25
+ self.vae_model = vae_model
26
+ self.tokenizer = tokenizer
27
+ self.vae_transform = vae_transform
28
+ self.vit_transform = vit_transform
29
+ self.new_token_ids = new_token_ids
30
+
31
+ def init_gen_context(self):
32
+ gen_context = {
33
+ 'kv_lens': [0],
34
+ 'ropes': [0],
35
+ 'past_key_values': NaiveCache(self.model.config.llm_config.num_hidden_layers),
36
+ }
37
+ return gen_context
38
+
39
+ @torch.no_grad()
40
+ def update_context_text(self, text, gen_context):
41
+         # used for interleaved data; currently only supports single-sample inference
42
+
43
+ past_key_values = gen_context['past_key_values']
44
+ kv_lens = gen_context['kv_lens']
45
+ ropes = gen_context['ropes']
46
+ generation_input, kv_lens, ropes = self.model.prepare_prompts(
47
+ curr_kvlens=kv_lens,
48
+ curr_rope=ropes,
49
+ prompts=[text],
50
+ tokenizer=self.tokenizer,
51
+ new_token_ids=self.new_token_ids,
52
+ )
53
+
54
+ past_key_values = self.model.forward_cache_update_text(past_key_values, **generation_input)
55
+ gen_context['kv_lens'] = kv_lens
56
+ gen_context['ropes'] = ropes
57
+ gen_context['past_key_values'] = past_key_values
58
+
59
+ return gen_context
60
+
61
+ @torch.no_grad()
62
+ def update_context_image(self, image, gen_context, vae=True, vit=True):
63
+         # used for interleaved data; currently only supports single-sample inference
64
+
65
+ assert vae or vit
66
+ past_key_values = gen_context['past_key_values']
67
+ kv_lens = gen_context['kv_lens']
68
+ ropes = gen_context['ropes']
69
+
70
+ if vae:
71
+ ## update vae
72
+ generation_input, kv_lens, ropes = self.model.prepare_vae_images(
73
+ curr_kvlens=kv_lens,
74
+ curr_rope=ropes,
75
+ images=[image],
76
+ transforms=self.vae_transform,
77
+ new_token_ids=self.new_token_ids,
78
+ )
79
+ past_key_values = self.model.forward_cache_update_vae(self.vae_model, past_key_values, **generation_input)
80
+
81
+ if vit:
82
+ ## update vit
83
+ generation_input, kv_lens, ropes = self.model.prepare_vit_images(
84
+ curr_kvlens=kv_lens,
85
+ curr_rope=ropes,
86
+ images=[image],
87
+ transforms=self.vit_transform,
88
+ new_token_ids=self.new_token_ids,
89
+ )
90
+ past_key_values = self.model.forward_cache_update_vit(past_key_values, **generation_input)
91
+
92
+ gen_context['kv_lens'] = kv_lens
93
+ gen_context['ropes'] = ropes
94
+ gen_context['past_key_values'] = past_key_values
95
+
96
+ return gen_context
97
+
98
+ @torch.no_grad()
99
+ def gen_image(
100
+ self,
101
+ image_shape,
102
+ gen_context,
103
+ cfg_text_scale=4.0,
104
+ cfg_img_scale=1.5,
105
+
106
+ cfg_text_precontext=None,
107
+ cfg_img_precontext=None,
108
+ cfg_interval=(0.4, 1.0),
109
+ cfg_renorm_min=0.0,
110
+ cfg_renorm_type="global",
111
+
112
+ num_timesteps=50,
113
+ timestep_shift=3.0
114
+ ):
115
+ # print(cfg_renorm_type)
116
+ past_key_values = gen_context['past_key_values']
117
+ kv_lens = gen_context['kv_lens']
118
+ ropes = gen_context['ropes']
119
+ generation_input = self.model.prepare_vae_latent(
120
+ curr_kvlens=kv_lens,
121
+ curr_rope=ropes,
122
+ image_sizes=[image_shape],
123
+ new_token_ids=self.new_token_ids,
124
+ )
125
+
126
+ # text cfg
127
+ cfg_text_past_key_values = cfg_text_precontext['past_key_values']
128
+ kv_lens_cfg = cfg_text_precontext['kv_lens']
129
+ ropes_cfg = cfg_text_precontext['ropes']
130
+ generation_input_cfg_text = self.model.prepare_vae_latent_cfg(
131
+ curr_kvlens=kv_lens_cfg,
132
+ curr_rope=ropes_cfg,
133
+ image_sizes=[image_shape],
134
+ )
135
+
136
+ # img cfg
137
+ cfg_img_past_key_values = cfg_img_precontext['past_key_values']
138
+ kv_lens_cfg = cfg_img_precontext['kv_lens']
139
+ ropes_cfg = cfg_img_precontext['ropes']
140
+ generation_input_cfg_img = self.model.prepare_vae_latent_cfg(
141
+ curr_kvlens=kv_lens_cfg,
142
+ curr_rope=ropes_cfg,
143
+ image_sizes=[image_shape],
144
+ )
145
+
146
+ unpacked_latent = self.model.generate_image(
147
+ past_key_values=past_key_values,
148
+ cfg_text_past_key_values=cfg_text_past_key_values,
149
+ cfg_img_past_key_values=cfg_img_past_key_values,
150
+ num_timesteps=num_timesteps,
151
+ cfg_text_scale=cfg_text_scale,
152
+ cfg_img_scale=cfg_img_scale,
153
+ cfg_interval=cfg_interval,
154
+ cfg_renorm_min=cfg_renorm_min,
155
+ cfg_renorm_type=cfg_renorm_type,
156
+ timestep_shift=timestep_shift,
157
+ **generation_input,
158
+ cfg_text_packed_position_ids=generation_input_cfg_text['cfg_packed_position_ids'],
159
+ cfg_text_packed_query_indexes=generation_input_cfg_text['cfg_packed_query_indexes'],
160
+ cfg_text_key_values_lens=generation_input_cfg_text['cfg_key_values_lens'],
161
+ cfg_text_packed_key_value_indexes=generation_input_cfg_text['cfg_packed_key_value_indexes'],
162
+ cfg_img_packed_position_ids=generation_input_cfg_img['cfg_packed_position_ids'],
163
+ cfg_img_packed_query_indexes=generation_input_cfg_img['cfg_packed_query_indexes'],
164
+ cfg_img_key_values_lens=generation_input_cfg_img['cfg_key_values_lens'],
165
+ cfg_img_packed_key_value_indexes=generation_input_cfg_img['cfg_packed_key_value_indexes'],
166
+ )
167
+
168
+ image = self.decode_image(unpacked_latent[0], image_shape)
169
+ return image
170
+
171
+
172
+ def decode_image(self, latent, image_shape):
173
+ H, W = image_shape
174
+ h, w = H // self.model.latent_downsample, W // self.model.latent_downsample
175
+
176
+ latent = latent.reshape(1, h, w, self.model.latent_patch_size, self.model.latent_patch_size, self.model.latent_channel)
177
+ latent = torch.einsum("nhwpqc->nchpwq", latent)
178
+ latent = latent.reshape(1, self.model.latent_channel, h * self.model.latent_patch_size, w * self.model.latent_patch_size)
179
+ image = self.vae_model.decode(latent)
180
+ image = (image * 0.5 + 0.5).clamp(0, 1)[0].permute(1, 2, 0) * 255
181
+ image = Image.fromarray((image).to(torch.uint8).cpu().numpy())
182
+
183
+ return image
184
+
185
+ @torch.no_grad()
186
+ def gen_text(self, gen_context, max_length: int = 500, do_sample: bool = True, temperature: float = 1.0):
187
+ gen_context = deepcopy(gen_context)
188
+ past_key_values = gen_context['past_key_values']
189
+ kv_lens = gen_context['kv_lens']
190
+ ropes = gen_context['ropes']
191
+
192
+ generation_input = self.model.prepare_start_tokens(kv_lens, ropes, self.new_token_ids)
193
+ unpacked_latent = self.model.generate_text(
194
+ past_key_values=past_key_values,
195
+ max_length=max_length,
196
+ do_sample=do_sample,
197
+ temperature=temperature,
198
+ end_token_id=self.new_token_ids['eos_token_id'],
199
+ # end_token_id=151652,
200
+ **generation_input,
201
+ )
202
+
203
+ output = self.tokenizer.decode(unpacked_latent[:,0])
204
+ return output
205
+
206
+ @torch.no_grad()
207
+ def interleave_inference(
208
+ self,
209
+ input_lists: List[Union[str, Image.Image]],
210
+ understanding_output=False,
211
+ system_prompt=None,
212
+ max_think_token_n=1000,
213
+ do_sample=False,
214
+ text_temperature=0.3,
215
+ cfg_text_scale=3.0,
216
+ cfg_img_scale=1.5,
217
+ cfg_interval=[0.4, 1.0],
218
+ timestep_shift=3.0,
219
+ num_timesteps=50,
220
+ cfg_renorm_min=0.0,
221
+ cfg_renorm_type="global",
222
+ image_shapes=(1024, 1024),
223
+ ) -> List[Union[str, Image.Image]]:
224
+
225
+ output_list = []
226
+ gen_context = self.init_gen_context()
227
+ cfg_text_context = deepcopy(gen_context)
228
+ cfg_img_context = deepcopy(gen_context)
229
+
230
+ with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
231
+ if system_prompt:
232
+ gen_context = self.update_context_text(system_prompt, gen_context)
233
+ cfg_img_context = self.update_context_text(system_prompt, cfg_img_context)
234
+
235
+ for input_term in input_lists:
236
+ if isinstance(input_term, str):
237
+ cfg_text_context = deepcopy(gen_context)
238
+ gen_context = self.update_context_text(input_term, gen_context)
239
+ cfg_img_context = self.update_context_text(input_term, cfg_img_context)
240
+
241
+ elif isinstance(input_term, Image.Image):
242
+ input_term = self.vae_transform.resize_transform(pil_img2rgb(input_term))
243
+ gen_context = self.update_context_image(input_term, gen_context, vae=not understanding_output)
244
+
245
+ image_shapes = input_term.size[::-1]
246
+ cfg_text_context = deepcopy(gen_context)
247
+
248
+ else:
249
+ raise ValueError(f"Unsupported input type: {type(input_term)}")
250
+
251
+ if understanding_output:
252
+ gen_text = self.gen_text(gen_context, do_sample=do_sample, temperature=text_temperature, max_length=max_think_token_n)
253
+ output_list.append(gen_text)
254
+
255
+ else:
256
+ img = self.gen_image(
257
+ image_shapes,
258
+ gen_context,
259
+ cfg_text_precontext=cfg_text_context,
260
+ cfg_img_precontext=cfg_img_context,
261
+
262
+ cfg_text_scale=cfg_text_scale,
263
+ cfg_img_scale=cfg_img_scale,
264
+ cfg_interval=cfg_interval,
265
+ timestep_shift=timestep_shift,
266
+ num_timesteps=num_timesteps,
267
+ cfg_renorm_min=cfg_renorm_min,
268
+ cfg_renorm_type=cfg_renorm_type,
269
+ )
270
+
271
+ output_list.append(img)
272
+
273
+ return output_list
274
+
275
+ def __call__(
276
+ self,
277
+ image: Optional[Image.Image] = None,
278
+ text: Optional[str] = None,
279
+ **kargs
280
+ ) -> Dict[str, Any]:
281
+ output_dict = {'image': None, 'text': None}
282
+
283
+ if image is None and text is None:
284
+ print('Please provide at least one input: either an image or text.')
285
+ return output_dict
286
+
287
+ input_list = []
288
+ if image is not None:
289
+ input_list.append(image)
290
+ if text is not None:
291
+ input_list.append(text)
292
+
293
+ output_list = self.interleave_inference(input_list, **kargs)
294
+
295
+ for i in output_list:
296
+ if isinstance(i, Image.Image):
297
+ output_dict['image'] = i
298
+ elif isinstance(i, str):
299
+ output_dict['text'] = i
300
+ return output_dict
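Assuming the model, VAE, tokenizer, transforms, and `new_token_ids` have already been prepared as in the notebook or `infz_bf16.py`, a minimal sketch of how `InterleaveInferencer` is driven looks like this (the file and prompt names are only examples):

```python
# Minimal usage sketch; model, vae_model, tokenizer, transforms and new_token_ids
# are assumed to come from the loading code shown elsewhere in this commit.
from PIL import Image
from inferencer import InterleaveInferencer

inferencer = InterleaveInferencer(
    model=model, vae_model=vae_model, tokenizer=tokenizer,
    vae_transform=vae_transform, vit_transform=vit_transform,
    new_token_ids=new_token_ids,
)

# Text-to-image: __call__ returns {'image': PIL.Image or None, 'text': str or None}.
out = inferencer(text="a car made of small cars", cfg_text_scale=4.0, num_timesteps=50)
out["image"].save("t2i_result.png")

# Understanding: pass understanding_output=True to get text instead of an image.
img = Image.open("test_images/meme.jpg")
out = inferencer(image=img, text="What is funny about this meme?", understanding_output=True)
print(out["text"])

# Lower-level API: interleave_inference takes a mixed list of strings/images and
# returns a list containing either a generated image or a decoded text string.
outputs = inferencer.interleave_inference([img, "Describe this image."], understanding_output=True)
```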
infz_bf16.py ADDED
@@ -0,0 +1,704 @@
1
+ import os
2
+ import json
3
+ import numpy as np
4
+ from datetime import datetime
5
+ from copy import deepcopy
6
+ from typing import (
7
+ Any,
8
+ AsyncIterable,
9
+ Callable,
10
+ Dict,
11
+ Generator,
12
+ List,
13
+ NamedTuple,
14
+ Optional,
15
+ Tuple,
16
+ Union,
17
+ )
18
+ import requests
19
+ from io import BytesIO
20
+
21
+ from PIL import Image
22
+ import torch
23
+ from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch, init_empty_weights
24
+
25
+ from data.transforms import ImageTransform
26
+ from data.data_utils import pil_img2rgb, add_special_tokens
27
+ from modeling.bagel import (
28
+ BagelConfig, Bagel, Qwen2Config, Qwen2ForCausalLM, SiglipVisionConfig, SiglipVisionModel
29
+ )
30
+ from modeling.qwen2 import Qwen2Tokenizer
31
+ from modeling.bagel.qwen2_navit import NaiveCache
32
+ from modeling.autoencoder import load_ae
33
+
34
+ # Set paths for your trained checkpoint
35
+ # checkpoint_dir = "/scratch/by2593/merged_checkpoint_final"
36
+ origin_checkpoint_dir = "/scratch/by2593/hf_cache/hub/models--multimodal-reasoning-lab--Bagel-Zebra-CoT/snapshots/ebce32410ee2062d073feae484ea2c6c1515fba8"
37
+ checkpoint_dir = "/scratch/by2593/project/Bagel-Zebra-CoT/weights/checkpoints_smm_semantic_part1_reorder_questionimage/0000150"
38
+
39
+
40
+ checkpoint_dir = '/scratch/by2593/project/Bagel-Zebra-CoT/weights/checkpoints_smm_semantic_part1_reorder_v2_test/000010'
41
+ checkpoint_dir = '/scratch/by2593/project/Bagel-Zebra-CoT/weights/checkpoints_smm_semantic_part1_reorder_v2/000150'
42
+ checkpoint_dir = '/scratch/by2593/project/Bagel-Zebra-CoT/weights/checkpoints_smm_semantic_part1_v1_final/0000500'
43
+ checkpoint_dir = "/scratch/by2593/hf_cache/hub/models--multimodal-reasoning-lab--Bagel-Zebra-CoT/snapshots/ebce32410ee2062d073feae484ea2c6c1515fba8"
44
+
45
+ checkpoint_file = "model.safetensors"
46
+ # checkpoint_file = "model_bf16.safetensors"
47
+
48
+ checkpoint_path = os.path.join(checkpoint_dir, checkpoint_file)
49
+ checkpoint_path = "/scratch/by2593/Bagel-Zebra-CoT-origin/results/checkpoints_smm_semantic_part1_v1_origin/0000050/model.safetensors"
50
+
51
+ print(f"Available GPUs: {torch.cuda.device_count()}")
52
+ print("GPU memory per device:")
53
+ for i in range(torch.cuda.device_count()):
54
+ props = torch.cuda.get_device_properties(i)
55
+ print(f" GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB")
56
+
57
+ # LLM config preparing (use base model configs)
58
+ llm_config = Qwen2Config.from_json_file(os.path.join(checkpoint_dir, "llm_config.json"))
59
+ llm_config.qk_norm = True
60
+ llm_config.tie_word_embeddings = False
61
+ llm_config.layer_module = "Qwen2MoTDecoderLayer"
62
+
63
+ # ViT config preparing (use base model configs)
64
+ vit_config = SiglipVisionConfig.from_json_file(os.path.join(checkpoint_dir, "vit_config.json"))
65
+ vit_config.rope = False
66
+ vit_config.num_hidden_layers = vit_config.num_hidden_layers - 1
67
+
68
+ # VAE loading (use base model VAE)
69
+ vae_model, vae_config = load_ae(local_path=os.path.join(origin_checkpoint_dir, "ae.safetensors"))
70
+
71
+ # Bagel config preparing
72
+ config = BagelConfig(
73
+ visual_gen=True,
74
+ visual_und=True,
75
+ llm_config=llm_config,
76
+ vit_config=vit_config,
77
+ vae_config=vae_config,
78
+ vit_max_num_patch_per_side=70,
79
+ connector_act='gelu_pytorch_tanh',
80
+ latent_patch_size=2,
81
+     max_latent_size=64,  # default is 64; set to the actual latent size
82
+ )
83
+
84
+ # Import the position embedding function first
85
+ from modeling.bagel.modeling_utils import get_2d_sincos_pos_embed
86
+
87
+ # Create model with empty weights
88
+ with init_empty_weights():
89
+ language_model = Qwen2ForCausalLM(llm_config)
90
+ vit_model = SiglipVisionModel(vit_config)
91
+ model = Bagel(language_model, vit_model, config)
92
+ model.vit_model.vision_model.embeddings.convert_conv2d_to_linear(vit_config, meta=True)
93
+
94
+ # Initialize position embeddings with proper values BEFORE loading checkpoint
95
+ print("Initializing position embeddings before loading...")
96
+
97
+ # Initialize latent_pos_embed if it exists
98
+ if hasattr(model, 'latent_pos_embed'):
99
+ print("Initializing latent_pos_embed...")
100
+ pos_embed = get_2d_sincos_pos_embed(model.latent_pos_embed.hidden_size, model.latent_pos_embed.max_num_patch_per_side)
101
+ # Create parameter with actual values, not meta
102
+ model.latent_pos_embed.pos_embed = torch.nn.Parameter(
103
+ torch.from_numpy(pos_embed).float(), requires_grad=False
104
+ )
105
+ print(f"latent_pos_embed initialized with shape {model.latent_pos_embed.pos_embed.shape}")
106
+
107
+ # Initialize vit_pos_embed if it exists
108
+ if hasattr(model, 'vit_pos_embed'):
109
+ print("Initializing vit_pos_embed...")
110
+ pos_embed = get_2d_sincos_pos_embed(model.vit_pos_embed.hidden_size, model.vit_pos_embed.max_num_patch_per_side)
111
+ # Create parameter with actual values, not meta
112
+ model.vit_pos_embed.pos_embed = torch.nn.Parameter(
113
+ torch.from_numpy(pos_embed).float(), requires_grad=False
114
+ )
115
+ print(f"vit_pos_embed initialized with shape {model.vit_pos_embed.pos_embed.shape}")
116
+
117
+ print("Position embeddings initialized successfully")
118
+
119
+ # Tokenizer Preparing (use base model tokenizer)
120
+ tokenizer = Qwen2Tokenizer.from_pretrained(checkpoint_dir)
121
+ tokenizer, new_token_ids, _ = add_special_tokens(tokenizer)
122
+
123
+ # Image Transform Preparing
124
+ vae_transform = ImageTransform(1024, 512, 16)
125
+ vit_transform = ImageTransform(980, 512, 14)
126
+
127
+ # Device mapping for 8x80GB GPUs - use bf16 directly
128
+ max_mem_per_gpu = "80GiB"
129
+
130
+ print("Setting up device mapping...")
131
+ device_map = infer_auto_device_map(
132
+ model,
133
+ max_memory={i: max_mem_per_gpu for i in range(torch.cuda.device_count())},
134
+ no_split_module_classes=["Bagel", "Qwen2MoTDecoderLayer"],
135
+ dtype=torch.bfloat16, # Use bf16 for device mapping
136
+ )
137
+
138
+ print("Device map:", device_map)
139
+
140
+ # Handle same-device modules
141
+ same_device_modules = [
142
+ 'language_model.model.embed_tokens',
143
+ 'time_embedder',
144
+ 'latent_pos_embed',
145
+ 'vae2llm',
146
+ 'llm2vae',
147
+ 'connector',
148
+ 'vit_pos_embed'
149
+ ]
150
+
151
+ if torch.cuda.device_count() == 1:
152
+ first_device = device_map.get(same_device_modules[0], "cuda:0")
153
+ for k in same_device_modules:
154
+ if k in device_map:
155
+ device_map[k] = first_device
156
+ else:
157
+ device_map[k] = "cuda:0"
158
+ else:
159
+ first_device = device_map.get(same_device_modules[0])
160
+ if first_device is not None:
161
+ for k in same_device_modules:
162
+ if k in device_map:
163
+ device_map[k] = first_device
164
+
165
+ print("Final device map:", device_map)
166
+
167
+ # Load checkpoint directly in bf16
168
+ print(f"Loading checkpoint directly in bfloat16: {checkpoint_path}")
169
+ print("Loading model from safetensors file...")
170
+
171
+ # Load model directly in bf16
172
+ model = load_checkpoint_and_dispatch(
173
+ model,
174
+ checkpoint=checkpoint_path,
175
+ device_map=device_map,
176
+ offload_buffers=False,
177
+ dtype=torch.bfloat16, # Load directly as bf16
178
+ force_hooks=True,
179
+ )
180
+
181
+ model = model.eval()
182
+
183
+ print('Model loaded directly in bfloat16!')
184
+ print(f"Model dtype: {next(model.parameters()).dtype}")
185
+
186
+ # Position embeddings were already initialized before model loading
187
+ print("Position embeddings were pre-initialized before loading checkpoint")
188
+
189
+ print("Model loading completed successfully!")
190
+
191
+ # Check memory usage
192
+ print("GPU memory usage after loading:")
193
+ for i in range(torch.cuda.device_count()):
194
+ if torch.cuda.memory_allocated(i) > 0:
195
+ allocated = torch.cuda.memory_allocated(i) / 1e9
196
+ cached = torch.cuda.memory_reserved(i) / 1e9
197
+ print(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached")
198
+
199
+ # Rest of inference code
200
+ from inferencer import InterleaveInferencer
201
+
202
+ inferencer = InterleaveInferencer(
203
+ model=model,
204
+ vae_model=vae_model,
205
+ tokenizer=tokenizer,
206
+ vae_transform=vae_transform,
207
+ vit_transform=vit_transform,
208
+ new_token_ids=new_token_ids
209
+ )
210
+
211
+ import random
212
+ import numpy as np
213
+
214
+ seed = 42
215
+ random.seed(seed)
216
+ np.random.seed(seed)
217
+ torch.manual_seed(seed)
218
+ if torch.cuda.is_available():
219
+ torch.cuda.manual_seed(seed)
220
+ torch.cuda.manual_seed_all(seed)
221
+ torch.backends.cudnn.deterministic = True
222
+ torch.backends.cudnn.benchmark = False
223
+
224
+ inference_hyper=dict(
225
+ do_sample=False,
226
+ text_temperature=0.0,
227
+ cfg_text_scale=4.0,
228
+ cfg_img_scale=2.0,
229
+ cfg_interval=[0.0, 1.0],
230
+ timestep_shift=3.0,
231
+ num_timesteps=50,
232
+ cfg_renorm_min=0.0,
233
+ cfg_renorm_type="text_channel",
234
+ )
235
+
236
+ INTERLEAVED_SYSTEM_PROMPT = '''You are an AI reasoning assistant capable of step-by-step interleaved text and visual chain of thought. Think step by step and use visual aids to enhance your problem-solving.'''
237
+ INTERLEAVED_SYSTEM_PROMPT = ''  # overrides the prompt above: this run uses an empty system prompt
238
+
239
+ # Original example (004 case) - commented out
240
+ # prompt = '''My goal is to generate a visual guide for constructing a specific shape using a set of blocks. This involves multiple steps, each requiring the addition of a new block to progressively build the final shape. The initial input includes 2 images of multiple blocks that will be used <image_start>[problem_image_1]<image_end><image_start>[problem_image_2]<image_end> and an image of the final desired shape<image_start>[problem_image_3]<image_end>. I need to imagine and generate images of intermediate steps, leading up to the final construction. Step 0 has been completed: a red arch block has been placed on top of the ground. The image after step 0 is provided<image_start>[problem_image_4]<image_end>. Now I need to generate the image for step 1, considering spatial relationships and stability.'''
241
+
242
+ # Use the new example data (145 case)
243
+ prompt = '''Based on the construction task shown below, follow the instructions to complete the build. Given the final desired shape of blocks shown in the first image<image_start>[problem_image_1]<image_end> which is viewed from a Front45 angle, perform a series of specified manipulations. This involves multiple steps, each requiring the addition of a new block to progressively build the final shape. The initial input also includes 3 images of multiple blocks that will be used.<image_start>[problem_image_2]<image_end><image_start>[problem_image_3]<image_end><image_start>[problem_image_4]<image_end> Step 0 has been completed: a orange cylinder block has been placed on top of the ground. The image after step 0 is provided.<image_start>[problem_image_5]<image_end>'''
244
+
245
+ # Load images from the new example paths (145 case)
246
+ image = []
247
+ base_path = '/scratch/by2593/project/SMM'
248
+ image_paths = [
249
+ f'{base_path}/semantic_blocks_part1/145/final_state/145_final_1.png', # problem_image_1 - final desired shape
250
+ f'{base_path}/SMM_data/each_block_views_diffposes/cylinder_orange.png', # problem_image_2 - orange cylinder
251
+ f'{base_path}/SMM_data/each_block_views_diffposes/cuboid3_yellow.png', # problem_image_3 - yellow cuboid3
252
+ f'{base_path}/SMM_data/each_block_views_diffposes/triangle_orange.png', # problem_image_4 - orange triangle
253
+ f'{base_path}/semantic_blocks_part1/145/steps/view_1/145_step0_1.png', # problem_image_5 - image after step 0
254
+ ]
255
+
256
+ print("Loading input images:")
257
+ for i, img_path in enumerate(image_paths):
258
+ try:
259
+ img = Image.open(img_path).convert('RGB')
260
+ image.append(img)
261
+ print(f" ✓ Loaded problem_image_{i+1}: {img_path}")
262
+ print(f" Image size: {img.size}")
263
+ except Exception as e:
264
+ print(f" ✗ Failed to load {img_path}: {e}")
265
+ # Create a placeholder image if file not found
266
+ img = Image.new('RGB', (512, 512), color='gray')
267
+ image.append(img)
268
+ print(f" ⚠ Using placeholder for problem_image_{i+1}")
269
+
270
+ print(prompt)
271
+ print('-'*50)
272
+
273
+ # Create output folder with timestamp
274
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
275
+ output_folder = f"reasoning_output_example_145_{timestamp}"
276
+ images_folder = os.path.join(output_folder, "images")
277
+ os.makedirs(images_folder, exist_ok=True)
278
+
279
+ print(f"Output will be saved to: {output_folder}")
280
+
281
+ # Save the original problem images if they exist
282
+ problem_image_paths = []
283
+ if image is not None:
284
+ if isinstance(image, list):
285
+ # Handle multiple images
286
+ for i, img in enumerate(image):
287
+ problem_image_path = os.path.join(images_folder, f"problem_image_{i+1}.png")
288
+ relative_path = os.path.join("images", f"problem_image_{i+1}.png")
289
+ img.save(problem_image_path)
290
+ problem_image_paths.append(relative_path)
291
+ print(f"Problem image {i+1} saved at '{problem_image_path}'")
292
+ else:
293
+ # Handle single image
294
+ problem_image_path = os.path.join(images_folder, "problem_image.png")
295
+ relative_path = os.path.join("images", "problem_image.png")
296
+ image.save(problem_image_path)
297
+ problem_image_paths.append(relative_path)
298
+ print(f"Problem image saved at '{problem_image_path}'")
299
+
300
+ reasoning_text = []
301
+ reasoning_images = []
302
+ generated_image_paths = [] # Store relative paths to generated reasoning images
303
+
304
+ # Create input with multiple images properly flattened
305
+ if image is not None:
306
+ if isinstance(image, list):
307
+ current_input = [prompt] + image # Flatten the list of images
308
+ else:
309
+ current_input = [prompt, image]
310
+ else:
311
+ current_input = [prompt]
312
+
313
+ # Loop until no more vision_start tokens
314
+ iteration = 0
315
+ while True:
316
+ # Get understanding output
317
+ print(f"iteration: {iteration}")
318
+ output = inferencer.interleave_inference(current_input, understanding_output=True, system_prompt=INTERLEAVED_SYSTEM_PROMPT, **inference_hyper)
319
+
320
+ # Check for stopping conditions
321
+ has_final_answer = 'Final Answer:' in output[0] or '<answer>' in output[0]
322
+
323
+ # Stop if we have a final answer OR if there's no vision token (no more images to generate)
324
+ # should_stop = has_final_answer or not has_vision_token
325
+ should_stop = has_final_answer
326
+
327
+
328
+ if should_stop:
329
+ if output[0].strip():
330
+ extracted_text = output[0].split('<|im_end|>')[0].split('<|im_start|>')[1]
331
+ reasoning_text.append(extracted_text)
332
+ print(f"{extracted_text}")
333
+ current_input = current_input + [extracted_text]
334
+ break
335
+
336
+ extracted_text = output[0].split('<|im_end|>')[0].split('<|im_start|>')[1]
337
+ reasoning_text.append(extracted_text)
338
+ print(f"{extracted_text}")
339
+
340
+ # Generate image based on current reasoning
341
+ current_input_with_reasoning = current_input + [extracted_text]
342
+ output = inferencer.interleave_inference(current_input_with_reasoning, system_prompt=INTERLEAVED_SYSTEM_PROMPT, **inference_hyper)
343
+ image_output = output[0]
344
+
345
+ # Save and collect the generated image
346
+ reasoning_images.append(image_output)
347
+ image_filename = f'reasoning_image_{iteration + 1}.png'
348
+ image_path = os.path.join(images_folder, image_filename)
349
+ relative_image_path = os.path.join("images", image_filename) # Relative path for JSON
350
+
351
+ image_output.save(image_path)
352
+ generated_image_paths.append(relative_image_path)
353
+ print(f"Image saved at '{image_path}'")
354
+
355
+ # Update input for next iteration
356
+ current_input = current_input_with_reasoning + [image_output]
357
+
358
+ iteration += 1
359
+ print('-'*50)
360
+
361
+ # Save reasoning data to JSON
362
+ reasoning_data = {
363
+ "timestamp": timestamp,
364
+ "prompt": prompt,
365
+ "system_prompt": INTERLEAVED_SYSTEM_PROMPT,
366
+ "problem_image_paths": problem_image_paths if problem_image_paths else None,
367
+ "response": [
368
+ {
369
+ "step": i + 1,
370
+ "text": text,
371
+ "image_path": generated_image_paths[i] if i < len(generated_image_paths) else None
372
+ }
373
+ for i, text in enumerate(reasoning_text)
374
+ ],
375
+ "total_steps": len(reasoning_text),
376
+ "total_images": len(generated_image_paths)
377
+ }
378
+
379
+ # Save JSON file
380
+ json_path = os.path.join(output_folder, "reasoning_data.json")
381
+ with open(json_path, 'w', encoding='utf-8') as f:
382
+ json.dump(reasoning_data, f, indent=2, ensure_ascii=False)
383
+
384
+ print(f"\nReasoning complete!")
385
+ print(f"Output folder: {output_folder}")
386
+ print(f"JSON metadata: {json_path}")
387
+ print(f"Generated {len(generated_image_paths)} images and {len(reasoning_text)} text steps")
388
+
389
+ # python infz_bf16.py
390
+
391
+
392
+ # import os
393
+ # import json
394
+ # from datetime import datetime
395
+ # from copy import deepcopy
396
+ # from typing import (
397
+ # Any,
398
+ # AsyncIterable,
399
+ # Callable,
400
+ # Dict,
401
+ # Generator,
402
+ # List,
403
+ # NamedTuple,
404
+ # Optional,
405
+ # Tuple,
406
+ # Union,
407
+ # )
408
+ # import requests
409
+ # from io import BytesIO
410
+
411
+ # from PIL import Image
412
+ # import torch
413
+ # from accelerate import infer_auto_device_map, load_checkpoint_and_dispatch, init_empty_weights
414
+
415
+ # from data.transforms import ImageTransform
416
+ # from data.data_utils import pil_img2rgb, add_special_tokens
417
+ # from modeling.bagel import (
418
+ # BagelConfig, Bagel, Qwen2Config, Qwen2ForCausalLM, SiglipVisionConfig, SiglipVisionModel
419
+ # )
420
+ # from modeling.qwen2 import Qwen2Tokenizer
421
+ # from modeling.bagel.qwen2_navit import NaiveCache
422
+ # from modeling.autoencoder import load_ae
423
+
424
+ # # Set paths for your trained checkpoint
425
+ # checkpoint_dir = "path/to/your/HF_HOME/models/Bagel-Zebra-CoT"
426
+ # checkpoint_file = "model_bf16.safetensors"
427
+ # checkpoint_path = os.path.join(checkpoint_dir, checkpoint_file)
428
+
429
+
430
+ # print(f"Available GPUs: {torch.cuda.device_count()}")
431
+ # print(f"GPU memory per device:")
432
+ # for i in range(torch.cuda.device_count()):
433
+ # props = torch.cuda.get_device_properties(i)
434
+ # print(f" GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB")
435
+
436
+ # # LLM config preparing (use base model configs)
437
+ # llm_config = Qwen2Config.from_json_file(os.path.join(checkpoint_dir, "llm_config.json"))
438
+ # llm_config.qk_norm = True
439
+ # llm_config.tie_word_embeddings = False
440
+ # llm_config.layer_module = "Qwen2MoTDecoderLayer"
441
+
442
+ # # ViT config preparing (use base model configs)
443
+ # vit_config = SiglipVisionConfig.from_json_file(os.path.join(checkpoint_dir, "vit_config.json"))
444
+ # vit_config.rope = False
445
+ # vit_config.num_hidden_layers = vit_config.num_hidden_layers - 1
446
+
447
+ # # VAE loading (use base model VAE)
448
+ # vae_model, vae_config = load_ae(local_path=os.path.join(checkpoint_dir, "ae.safetensors"))
449
+
450
+ # # Bagel config preparing
451
+ # config = BagelConfig(
452
+ # visual_gen=True,
453
+ # visual_und=True,
454
+ # llm_config=llm_config,
455
+ # vit_config=vit_config,
456
+ # vae_config=vae_config,
457
+ # vit_max_num_patch_per_side=70,
458
+ # connector_act='gelu_pytorch_tanh',
459
+ # latent_patch_size=2,
460
+ # max_latent_size=64,
461
+ # )
462
+
463
+ # # Create model with empty weights - IMPORTANT: Use float32 initially to match checkpoint
464
+ # with init_empty_weights():
465
+ # language_model = Qwen2ForCausalLM(llm_config)
466
+ # vit_model = SiglipVisionModel(vit_config)
467
+ # model = Bagel(language_model, vit_model, config)
468
+ # model.vit_model.vision_model.embeddings.convert_conv2d_to_linear(vit_config, meta=True)
469
+
470
+ # # Tokenizer Preparing (use base model tokenizer)
471
+ # tokenizer = Qwen2Tokenizer.from_pretrained(checkpoint_dir)
472
+ # tokenizer, new_token_ids, _ = add_special_tokens(tokenizer)
473
+
474
+ # # Image Transform Preparing
475
+ # vae_transform = ImageTransform(1024, 512, 16)
476
+ # vit_transform = ImageTransform(980, 512, 14)
477
+
478
+ # # Device mapping for 8x80GB GPUs - use bf16 directly
479
+ # max_mem_per_gpu = "80GiB"
480
+
481
+ # print("Setting up device mapping...")
482
+ # device_map = infer_auto_device_map(
483
+ # model,
484
+ # max_memory={i: max_mem_per_gpu for i in range(torch.cuda.device_count())},
485
+ # no_split_module_classes=["Bagel", "Qwen2MoTDecoderLayer"],
486
+ # dtype=torch.bfloat16, # Use bf16 for device mapping
487
+ # )
488
+
489
+ # print("Device map:", device_map)
490
+
491
+ # # Handle same-device modules
492
+ # same_device_modules = [
493
+ # 'language_model.model.embed_tokens',
494
+ # 'time_embedder',
495
+ # 'latent_pos_embed',
496
+ # 'vae2llm',
497
+ # 'llm2vae',
498
+ # 'connector',
499
+ # 'vit_pos_embed'
500
+ # ]
501
+
502
+ # if torch.cuda.device_count() == 1:
503
+ # first_device = device_map.get(same_device_modules[0], "cuda:0")
504
+ # for k in same_device_modules:
505
+ # if k in device_map:
506
+ # device_map[k] = first_device
507
+ # else:
508
+ # device_map[k] = "cuda:0"
509
+ # else:
510
+ # first_device = device_map.get(same_device_modules[0])
511
+ # if first_device is not None:
512
+ # for k in same_device_modules:
513
+ # if k in device_map:
514
+ # device_map[k] = first_device
515
+
516
+ # print("Final device map:", device_map)
517
+
518
+ # # Load checkpoint directly in bf16
519
+ # print(f"Loading checkpoint directly in bfloat16: {checkpoint_path}")
520
+ # print("Loading model from safetensors file...")
521
+
522
+ # # Load model directly in bf16
523
+ # model = load_checkpoint_and_dispatch(
524
+ # model,
525
+ # checkpoint=checkpoint_path,
526
+ # device_map=device_map,
527
+ # offload_buffers=False,
528
+ # dtype=torch.bfloat16, # Load directly as bf16
529
+ # force_hooks=True,
530
+ # )
531
+
532
+ # model = model.eval()
533
+
534
+ # print('Model loaded directly in bfloat16!')
535
+ # print(f"Model dtype: {next(model.parameters()).dtype}")
536
+ # print("Model loading completed successfully!")
537
+
538
+ # # Check memory usage
539
+ # print("GPU memory usage after loading:")
540
+ # for i in range(torch.cuda.device_count()):
541
+ # if torch.cuda.memory_allocated(i) > 0:
542
+ # allocated = torch.cuda.memory_allocated(i) / 1e9
543
+ # cached = torch.cuda.memory_reserved(i) / 1e9
544
+ # print(f" GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached")
545
+
546
+ # # Rest of inference code
547
+ # from inferencer import InterleaveInferencer
548
+
549
+ # inferencer = InterleaveInferencer(
550
+ # model=model,
551
+ # vae_model=vae_model,
552
+ # tokenizer=tokenizer,
553
+ # vae_transform=vae_transform,
554
+ # vit_transform=vit_transform,
555
+ # new_token_ids=new_token_ids
556
+ # )
557
+
558
+ # import random
559
+ # import numpy as np
560
+
561
+ # seed = 42
562
+ # random.seed(seed)
563
+ # np.random.seed(seed)
564
+ # torch.manual_seed(seed)
565
+ # if torch.cuda.is_available():
566
+ # torch.cuda.manual_seed(seed)
567
+ # torch.cuda.manual_seed_all(seed)
568
+ # torch.backends.cudnn.deterministic = True
569
+ # torch.backends.cudnn.benchmark = False
570
+
571
+ # inference_hyper=dict(
572
+ # do_sample=True,
573
+ # text_temperature=0.3,
574
+ # cfg_text_scale=4.0,
575
+ # cfg_img_scale=2.0,
576
+ # cfg_interval=[0.0, 1.0],
577
+ # timestep_shift=3.0,
578
+ # num_timesteps=50,
579
+ # cfg_renorm_min=0.0,
580
+ # cfg_renorm_type="text_channel",
581
+ # )
582
+
583
+ # INTERLEAVED_SYSTEM_PROMPT = '''You are an AI reasoning assistant capable of step-by-step interleaved text and visual chain of thought. Think step by step and use visual aids to enhance your problem-solving. Provide your final conclusion clearly in the format of "Final Answer: <answer here>"'''
584
+
585
+ # prompt = '''Subtract all cylinders. Add 1 red sphere. How many objects are left?'''
586
+ # image = Image.open('test_images/image.png')
587
+
588
+ # print(prompt)
589
+ # print('-'*50)
590
+
591
+ # # Create output folder with timestamp
592
+ # timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
593
+ # output_folder = f"reasoning_output_{timestamp}"
594
+ # images_folder = os.path.join(output_folder, "images")
595
+ # os.makedirs(images_folder, exist_ok=True)
596
+
597
+ # # Save the original problem images if they exist
598
+ # problem_image_paths = []
599
+ # if image is not None:
600
+ # if isinstance(image, list):
601
+ # # Handle multiple images
602
+ # for i, img in enumerate(image):
603
+ # problem_image_path = os.path.join(images_folder, f"problem_image_{i+1}.png")
604
+ # relative_path = os.path.join("images", f"problem_image_{i+1}.png")
605
+ # img.save(problem_image_path)
606
+ # problem_image_paths.append(relative_path)
607
+ # print(f"Problem image {i+1} saved at '{problem_image_path}'")
608
+ # else:
609
+ # # Handle single image
610
+ # problem_image_path = os.path.join(images_folder, "problem_image.png")
611
+ # relative_path = os.path.join("images", "problem_image.png")
612
+ # image.save(problem_image_path)
613
+ # problem_image_paths.append(relative_path)
614
+ # print(f"Problem image saved at '{problem_image_path}'")
615
+
616
+ # reasoning_text = []
617
+ # reasoning_images = []
618
+ # image_paths = [] # Store relative paths to images
619
+
620
+ # # Create input with multiple images properly flattened
621
+ # if image is not None:
622
+ # if isinstance(image, list):
623
+ # current_input = [prompt] + image # Flatten the list of images
624
+ # else:
625
+ # current_input = [prompt, image]
626
+ # else:
627
+ # current_input = [prompt]
628
+
629
+ # # Loop until no more vision_start tokens
630
+ # iteration = 0
631
+ # while True:
632
+ # # Get understanding output
633
+ # print(f"iteration: {iteration}")
634
+ # output = inferencer.interleave_inference(current_input, understanding_output=True, system_prompt=INTERLEAVED_SYSTEM_PROMPT, **inference_hyper)
635
+
636
+ # # Check for stopping conditions
637
+ # has_final_answer = 'Final Answer:' in output[0] or '<answer>' in output[0]
638
+
639
+ # # Stop if we have a final answer OR if there's no vision token (no more images to generate)
640
+ # # should_stop = has_final_answer or not has_vision_token
641
+ # should_stop = has_final_answer
642
+
643
+
644
+ # if should_stop:
645
+ # if output[0].strip():
646
+ # extracted_text = output[0].split('<|im_end|>')[0].split('<|im_start|>')[1]
647
+ # reasoning_text.append(extracted_text)
648
+ # print(f"{extracted_text}")
649
+ # current_input = current_input + [extracted_text]
650
+ # break
651
+
652
+ # extracted_text = output[0].split('<|im_end|>')[0].split('<|im_start|>')[1]
653
+ # reasoning_text.append(extracted_text)
654
+ # print(f"{extracted_text}")
655
+
656
+ # # Generate image based on current reasoning
657
+ # current_input_with_reasoning = current_input + [extracted_text]
658
+ # output = inferencer.interleave_inference(current_input_with_reasoning, system_prompt=INTERLEAVED_SYSTEM_PROMPT, **inference_hyper)
659
+ # image_output = output[0]
660
+
661
+ # # Save and collect the generated image
662
+ # reasoning_images.append(image_output)
663
+ # image_filename = f'reasoning_image_{iteration + 1}.png'
664
+ # image_path = os.path.join(images_folder, image_filename)
665
+ # relative_image_path = os.path.join("images", image_filename) # Relative path for JSON
666
+
667
+ # image_output.save(image_path)
668
+ # image_paths.append(relative_image_path)
669
+ # print(f"Image saved at '{image_path}'")
670
+
671
+ # # Update input for next iteration
672
+ # current_input = current_input_with_reasoning + [image_output]
673
+
674
+ # iteration += 1
675
+ # print('-'*50)
676
+
677
+ # # Save reasoning data to JSON
678
+ # reasoning_data = {
679
+ # "timestamp": timestamp,
680
+ # "prompt": prompt,
681
+ # "system_prompt": INTERLEAVED_SYSTEM_PROMPT,
682
+ # "problem_image_paths": problem_image_paths if problem_image_paths else None,
683
+ # "response": [
684
+ # {
685
+ # "step": i + 1,
686
+ # "text": text,
687
+ # "image_path": image_paths[i] if i < len(image_paths) else None
688
+ # }
689
+ # for i, text in enumerate(reasoning_text)
690
+ # ],
691
+ # "total_steps": len(reasoning_text),
692
+ # "total_images": len(image_paths)
693
+ # }
694
+
695
+ # # Save JSON file
696
+ # json_path = os.path.join(output_folder, "reasoning_data.json")
697
+ # with open(json_path, 'w', encoding='utf-8') as f:
698
+ # json.dump(reasoning_data, f, indent=2, ensure_ascii=False)
699
+
700
+ # print(f"\nReasoning complete!")
701
+ # print(f"Output folder: {output_folder}")
702
+ # print(f"JSON metadata: {json_path}")
703
+ # print(f"Generated {len(image_paths)} images and {len(reasoning_text)} text steps")
704
+
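For reference, the `reasoning_data.json` file that `infz_bf16.py` writes has the following shape. The field names come from the script; the values below are placeholders only.

```python
# Illustrative layout of reasoning_data.json (placeholder values).
example_reasoning_data = {
    "timestamp": "20250101_120000",
    "prompt": "...",                                   # the interleaved problem prompt
    "system_prompt": "",                               # INTERLEAVED_SYSTEM_PROMPT at run time
    "problem_image_paths": ["images/problem_image_1.png"],
    "response": [
        {"step": 1, "text": "first reasoning step", "image_path": "images/reasoning_image_1.png"},
        {"step": 2, "text": "Final Answer: ...", "image_path": None},  # final step carries the answer, no image
    ],
    "total_steps": 2,
    "total_images": 1,
}
```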
modeling/bagel/__init__.py ADDED
@@ -0,0 +1,18 @@
1
+ # Copyright 2025 Bytedance Ltd. and/or its affiliates.
2
+ # SPDX-License-Identifier: Apache-2.0
3
+
4
+
5
+ from .bagel import BagelConfig, Bagel
6
+ from .qwen2_navit import Qwen2Config, Qwen2Model, Qwen2ForCausalLM
7
+ from .siglip_navit import SiglipVisionConfig, SiglipVisionModel
8
+
9
+
10
+ __all__ = [
11
+ 'BagelConfig',
12
+ 'Bagel',
13
+ 'Qwen2Config',
14
+ 'Qwen2Model',
15
+ 'Qwen2ForCausalLM',
16
+ 'SiglipVisionConfig',
17
+ 'SiglipVisionModel',
18
+ ]
requirements.txt ADDED
@@ -0,0 +1,25 @@
1
+ decord==0.6.0
2
+ einops==0.8.1
3
+ huggingface_hub==0.29.1
4
+ matplotlib==3.7.0
5
+ numpy==1.24.4
6
+ opencv-python-headless
7
+ pyarrow==11.0.0
8
+ PyYAML==6.0.2
9
+ Requests==2.32.3
10
+ safetensors==0.4.5
11
+ scipy==1.10.1
12
+ sentencepiece==0.1.99
13
+ torch==2.5.1
14
+ torchvision==0.20.1
15
+ transformers==4.49.0
16
+ accelerate>=0.34.0
17
+ wandb
18
+ gradio
19
+ setuptools
20
+ wheel
21
+ ninja
22
+ bitsandbytes
23
+ xlsxwriter
24
+ triton ; sys_platform != 'win32'
25
+ triton-windows ; sys_platform == 'win32'
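Most of the pins above are exact versions. A small sanity check like the one below (an assumed helper, not part of the repository) can confirm that the key pinned packages are installed at the expected versions.

```python
# Assumed convenience check, not part of the repo: compare a few requirements.txt
# pins against the installed environment using the standard library.
from importlib.metadata import PackageNotFoundError, version

pins = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "transformers": "4.49.0",
    "safetensors": "0.4.5",
    "numpy": "1.24.4",
}

for pkg, expected in pins.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f"{pkg}: not installed (expected {expected})")
        continue
    note = "ok" if installed == expected else f"mismatch (expected {expected})"
    print(f"{pkg}: {installed} -> {note}")
```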