Teatime666 committed on
Commit 823e49a · verified · 1 Parent(s): 7cf1a79

Add files using upload-large-folder tool

This view is limited to 50 files because the commit contains too many changes.
Files changed (50)
  1. LICENSE +202 -0
  2. README.md +119 -0
  3. dataset2json.py +63 -0
  4. extract_frame.py +38 -0
  5. model_structures.log +1497 -0
  6. myoutput.log +2 -0
  7. nohup.out +0 -0
  8. output.log +2 -0
  9. output/20241207/1929--seed_42-384x512/upper1_00057_00_512x384_3_1929.mp4 +0 -0
  10. output/20241207/2241--seed_42-384x512/3_s_1110342_in_xl_512x384_3_2241.mp4 +0 -0
  11. output/20241207/2241--seed_42-384x512/7_s_1110342_in_xl_512x384_3_2241.mp4 +0 -0
  12. output/20241207/2241--seed_42-384x512/8_s_1009794_in_xl_512x384_3_2241.mp4 +0 -0
  13. output/20241207/2241--seed_42-384x512/8_s_1110342_in_xl_512x384_3_2241.mp4 +0 -0
  14. read.py +39 -0
  15. requirements.txt +29 -0
  16. scripts.sh +7 -0
  17. stage1_nohup.out +0 -0
  18. train_stage_1.py +781 -0
  19. train_stage_2.py +842 -0
  20. vivid.py +229 -0
  21. vividfuxian_motion/20241211/1715/803128_detail_1060638_in_xl.mp4 +0 -0
  22. vividfuxian_motion/20241212/1437/000004-803128_detail_1060638_in_xl.mp4 +0 -0
  23. vividfuxian_motion/20241212/1506/000200-803128_detail_1060638_in_xl.mp4 +0 -0
  24. vividfuxian_motion/20241212/1629/000600-803128_detail_1060638_in_xl.mp4 +0 -0
  25. vividfuxian_valid/stage1/000010-803137_in_xl_812294_in_xl.jpg +0 -0
  26. vividfuxian_valid/stage1/000200-803137_in_xl_812294_in_xl.jpg +0 -0
  27. vividfuxian_valid/stage1/000400-803137_in_xl_812294_in_xl.jpg +0 -0
  28. vividfuxian_valid/stage1/000600-803137_in_xl_812294_in_xl.jpg +0 -0
  29. vividfuxian_valid/stage1/000800-803137_in_xl_812294_in_xl.jpg +0 -0
  30. vividfuxian_valid/stage1/001000-803137_in_xl_812294_in_xl.jpg +0 -0
  31. vividfuxian_valid/stage1/001200-803137_in_xl_812294_in_xl.jpg +0 -0
  32. vividfuxian_valid/stage1/001600-803137_in_xl_812294_in_xl.jpg +0 -0
  33. vividfuxian_valid/stage1/001800-803137_in_xl_812294_in_xl.jpg +0 -0
  34. vividfuxian_valid/stage1/002000-803137_in_xl_812294_in_xl.jpg +0 -0
  35. vividfuxian_valid/stage1/002200-803137_in_xl_812294_in_xl.jpg +0 -0
  36. vividfuxian_valid/stage1/002400-803137_in_xl_812294_in_xl.jpg +0 -0
  37. vividfuxian_valid/stage1/002600-803137_in_xl_812294_in_xl.jpg +0 -0
  38. vividfuxian_valid/stage1/002800-803137_in_xl_812294_in_xl.jpg +0 -0
  39. vividfuxian_valid/stage1/003000-803137_in_xl_812294_in_xl.jpg +0 -0
  40. vividfuxian_valid/stage1/003400-803137_in_xl_812294_in_xl.jpg +0 -0
  41. vividfuxian_valid/stage1/003600-803137_in_xl_812294_in_xl.jpg +0 -0
  42. vividfuxian_valid/stage1/003800-803137_in_xl_812294_in_xl.jpg +0 -0
  43. vividfuxian_valid/stage1/004200-803137_in_xl_812294_in_xl.jpg +0 -0
  44. vividfuxian_valid/stage1/004400-803137_in_xl_812294_in_xl.jpg +0 -0
  45. vividfuxian_valid/stage1/004600-803137_in_xl_812294_in_xl.jpg +0 -0
  46. vividfuxian_valid/stage1/004800-803137_in_xl_812294_in_xl.jpg +0 -0
  47. vividfuxian_valid/stage1/005200-803137_in_xl_812294_in_xl.jpg +0 -0
  48. vividfuxian_valid/stage1/005400-803137_in_xl_812294_in_xl.jpg +0 -0
  49. vividfuxian_valid/stage1/005600-803137_in_xl_812294_in_xl.jpg +0 -0
  50. vividfuxian_valid/stage1/005800-803137_in_xl_812294_in_xl.jpg +0 -0
LICENSE ADDED
@@ -0,0 +1,202 @@
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md ADDED
@@ -0,0 +1,119 @@
+ # ViViD
+ ViViD: Video Virtual Try-on using Diffusion Models
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2405.11794-b31b1b.svg)](https://arxiv.org/abs/2405.11794)
+ [![Project Page](https://img.shields.io/badge/Project-Website-green)](https://alibaba-yuanjing-aigclab.github.io/ViViD)
+ [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/alibaba-yuanjing-aigclab/ViViD)
+
+ ## Dataset
+ Dataset released: [ViViD](https://huggingface.co/datasets/alibaba-yuanjing-aigclab/ViViD)
+
+ ## Installation
+
+ ```
+ git clone https://github.com/alibaba-yuanjing-aigclab/ViViD
+ cd ViViD
+ ```
+
+ ### Environment
+ ```
+ conda create -n vivid python=3.10
+ conda activate vivid
+ conda activate /mnt/pfs-mc0p4k/ssai/cvg/team/envs/vivid
+ pip install -r requirements.txt
+ ```
+
+ ### Weights
+ You can place the weights anywhere you like, for example ```./ckpts```. If you put them somewhere else, you just need to update the paths in ```./configs/prompts/*.yaml```.
+
+ #### Stable Diffusion Image Variations
+ ```
+ cd ckpts
+
+ git lfs install
+ git clone https://huggingface.co/lambdalabs/sd-image-variations-diffusers
+ ```
+ #### SD-VAE-ft-mse
+ ```
+ git lfs install
+ git clone https://huggingface.co/stabilityai/sd-vae-ft-mse
+ ```
+ #### Motion Module
+ Download [mm_sd_v15_v2](https://huggingface.co/guoyww/animatediff/blob/main/mm_sd_v15_v2.ckpt)
+
+ #### ViViD
+ ```
+ git lfs install
+ git clone https://huggingface.co/alibaba-yuanjing-aigclab/ViViD
+ ```
+ ## Inference
+ We provide two demos in ```./configs/prompts/```; run the following commands to try them out 😼.
+
+ ```
+ python vivid.py --config ./configs/prompts/upper1.yaml
+
+ python vivid.py --config ./configs/prompts/lower1.yaml
+ ```
+
+ ## Data
+ As illustrated in ```./data```, the following data should be provided.
+ ```text
+ ./data/
+ |-- agnostic
+ |   |-- video1.mp4
+ |   |-- video2.mp4
+ |   ...
+ |-- agnostic_mask
+ |   |-- video1.mp4
+ |   |-- video2.mp4
+ |   ...
+ |-- cloth
+ |   |-- cloth1.jpg
+ |   |-- cloth2.jpg
+ |   ...
+ |-- cloth_mask
+ |   |-- cloth1.jpg
+ |   |-- cloth2.jpg
+ |   ...
+ |-- densepose
+ |   |-- video1.mp4
+ |   |-- video2.mp4
+ |   ...
+ |-- videos
+ |   |-- video1.mp4
+ |   |-- video2.mp4
+ |   ...
+ ```
+
+ ### Agnostic and agnostic_mask video
+ This part is a bit more involved; you can obtain them in any of the following three ways:
+ 1. Follow [OOTDiffusion](https://github.com/levihsu/OOTDiffusion) to extract them frame by frame (recommended).
+ 2. Use [SAM](https://github.com/facebookresearch/segment-anything) + Gaussian blur (see ```./tools/sam_agnostic.py``` for an example).
+ 3. Use a mask editor tool.
+
+ Note that the shape and size of the agnostic area may affect the try-on results.
+
+ ### Densepose video
+ See [vid2densepose](https://github.com/Flode-Labs/vid2densepose) (thanks to its authors).
+
+ ### Cloth mask
+ Any segmentation tool, such as [SAM](https://github.com/facebookresearch/segment-anything), is fine for obtaining the mask.
+
+ ## BibTeX
+ ```text
+ @misc{fang2024vivid,
+     title={ViViD: Video Virtual Try-on using Diffusion Models},
+     author={Zixun Fang and Wei Zhai and Aimin Su and Hongliang Song and Kai Zhu and Mao Wang and Yu Chen and Zhiheng Liu and Yang Cao and Zheng-Jun Zha},
+     year={2024},
+     eprint={2405.11794},
+     archivePrefix={arXiv},
+     primaryClass={cs.CV}
+ }
+ ```
+
+ ## Contact Us
+ **Zixun Fang**: [[email protected]](mailto:[email protected])
+ **Yu Chen**: [[email protected]](mailto:[email protected])
+
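
The ```./tools/sam_agnostic.py``` script referenced in the README's "Agnostic and agnostic_mask video" section is not part of this commit. As a rough, non-authoritative sketch of the "SAM + Gaussian blur" route it describes, the snippet below blurs the garment region of each frame, given a per-frame binary mask video (which you would obtain from SAM or any other segmenter). The file names, the kernel size, and the blur-based compositing are illustrative assumptions, not the repository's actual preprocessing.

```python
import cv2
import numpy as np

def make_agnostic_video(video_path, mask_path, output_path, ksize=51):
    """Blur the garment region of every frame, given a per-frame binary
    mask video, and save the result as the 'agnostic' video.
    All paths and the kernel size are illustrative placeholders."""
    cap_video = cv2.VideoCapture(video_path)
    cap_mask = cv2.VideoCapture(mask_path)
    fps = cap_video.get(cv2.CAP_PROP_FPS)
    width = int(cap_video.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap_video.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path,
                             cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok_v, frame = cap_video.read()
        ok_m, mask = cap_mask.read()
        if not (ok_v and ok_m):
            break
        # Binarize the mask and heavily blur the masked region so the
        # original garment is no longer recognizable.
        gray = cv2.cvtColor(mask, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
        blurred = cv2.GaussianBlur(frame, (ksize, ksize), 0)
        binary_3c = cv2.merge([binary, binary, binary])
        agnostic = np.where(binary_3c > 0, blurred, frame)
        writer.write(agnostic)
    cap_video.release()
    cap_mask.release()
    writer.release()

if __name__ == "__main__":
    # Hypothetical file names following the ./data layout in the README.
    make_agnostic_video("data/videos/video1.mp4",
                        "data/agnostic_mask/video1.mp4",
                        "data/agnostic/video1.mp4")
```

As the README notes, the shape and size of the agnostic area affect the try-on results, so the OOTDiffusion-style frame-by-frame extraction remains the recommended route; this sketch only illustrates the masking idea.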
dataset2json.py ADDED
@@ -0,0 +1,63 @@
+ import os
+ import json
+
+ def collect_files(data_dir):
+     """
+     Walk the subfolders of the data directory and group file paths into a dict keyed by the first 7 characters of each file name.
+     """
+     file_dict = {}
+
+     # Subfolders to scan
+     subfolders = ['densepose', 'videos', 'cloth', 'cloth_mask', 'agnostic_mask', 'agnostic']
+
+     for subfolder in subfolders:
+         subfolder_path = os.path.join(data_dir, subfolder)
+
+         if not os.path.exists(subfolder_path):
+             print(f"Warning: path {subfolder_path} does not exist")
+             continue
+
+         # Iterate over the files in this subfolder
+         for file_name in os.listdir(subfolder_path):
+             # Use only the first 7 characters of the file name for matching
+             key = file_name[:7]
+             if key not in file_dict:
+                 # Initialize the entry for this 7-character key
+                 file_dict[key] = {}
+
+             # Store the file path under the subfolder's name
+             file_dict[key][subfolder] = os.path.join(subfolder_path, file_name)
+
+     return file_dict
+
+ def generate_json(data_dir, output_file):
+     """
+     Write the file-matching results to a JSON file.
+     """
+     files = collect_files(data_dir)
+     result = []
+
+     # Build a list of records in the expected format
+     for key, paths in files.items():
+         result.append({
+             "densepose": paths.get("densepose", ""),  # missing fields fall back to an empty string
+             "videos": paths.get("videos", ""),
+             "cloth": paths.get("cloth", ""),
+             "cloth_mask": paths.get("cloth_mask", ""),
+             "agnostic_mask": paths.get("agnostic_mask", ""),
+             "agnostic": paths.get("agnostic", "")
+         })
+
+     # Write the result to the given JSON path
+     with open(output_file, "w", encoding="utf-8") as f:
+         json.dump(result, f, indent=4, ensure_ascii=False)
+
+     print(f"JSON file written to: {output_file}")
+
+ if __name__ == "__main__":
+     # Data directory to scan
+     data_dir = "/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/data"
+     # Output JSON file path
+     output_file = "/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/data/vividfuxian_stage1.json"
+
+     generate_json(data_dir, output_file)
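
A note on dataset2json.py: the matching relies on the first 7 characters of each file name being identical across the six subfolders, and any modality that is not found is silently written out as an empty string. A minimal sanity check of the generated JSON, using the output path hard-coded above, might look like this (a sketch, not part of the commit):

```python
import json

# Path matches the output_file hard-coded in dataset2json.py above.
json_path = "/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/data/vividfuxian_stage1.json"

with open(json_path, encoding="utf-8") as f:
    entries = json.load(f)

# Report entries where one of the six modalities was not found, since
# collect_files() fills missing fields with an empty string.
for i, entry in enumerate(entries):
    missing = [key for key, value in entry.items() if value == ""]
    if missing:
        print(f"entry {i}: missing {missing}")

print(f"{len(entries)} entries checked")
```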
extract_frame.py ADDED
@@ -0,0 +1,38 @@
+ import cv2
+ import os
+
+ def extract_frame(video_path, frame_number, output_path):
+     # Open the video file
+     cap = cv2.VideoCapture(video_path)
+
+     if not cap.isOpened():
+         print(f"Could not open video file: {video_path}")
+         return
+
+     # Seek the capture to the requested frame
+     cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
+
+     # Read the requested frame
+     success, frame = cap.read()
+
+     if success:
+         # Save the frame to the given output path
+         cv2.imwrite(output_path, frame)
+         print(f"Extracted frame {frame_number} and saved it to {output_path}")
+     else:
+         print(f"Failed to read frame {frame_number}. Check whether the frame number is out of range.")
+
+     # Release the capture
+     cap.release()
+
+ if __name__ == "__main__":
+     video_file = "/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/dataset/ViViD/dresses/densepose/803137_detail.mp4"  # replace with your MP4 file path
+     frame_to_extract = 24  # frame number to extract
+     output_file = "/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/configs/valid/densepose_images/803137_in_xl.jpg"  # replace with the path you want to save to
+
+     # Create the output directory if it does not exist
+     output_dir = os.path.dirname(output_file)
+     if output_dir and not os.path.exists(output_dir):
+         os.makedirs(output_dir)
+
+     extract_frame(video_file, frame_to_extract, output_file)
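
If the same frame index needs to be pulled from every densepose video (for example, to build validation images like those under vividfuxian_valid/), the function above can be reused in a small loop. This is only a sketch: the directory names are placeholders, and it assumes extract_frame.py is importable from the working directory.

```python
import os
from extract_frame import extract_frame  # the function defined above

# Hypothetical directories; adjust to your own layout.
densepose_dir = "dataset/ViViD/dresses/densepose"
output_dir = "configs/valid/densepose_images"
frame_number = 24  # same frame index as the example in __main__

os.makedirs(output_dir, exist_ok=True)

for name in sorted(os.listdir(densepose_dir)):
    if not name.endswith(".mp4"):
        continue
    video_path = os.path.join(densepose_dir, name)
    output_path = os.path.join(output_dir, os.path.splitext(name)[0] + ".jpg")
    extract_frame(video_path, frame_number, output_path)
```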
model_structures.log ADDED
@@ -0,0 +1,1497 @@
1
+ Denoising UNet structure:
2
+ UNet3DConditionModel(
3
+ (conv_in): InflatedConv3d(9, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
4
+ (time_proj): Timesteps()
5
+ (time_embedding): TimestepEmbedding(
6
+ (linear_1): LoRACompatibleLinear(in_features=320, out_features=1280, bias=True)
7
+ (act): SiLU()
8
+ (linear_2): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
9
+ )
10
+ (down_blocks): ModuleList(
11
+ (0): CrossAttnDownBlock3D(
12
+ (attentions): ModuleList(
13
+ (0-1): 2 x Transformer3DModel(
14
+ (norm): GroupNorm(32, 320, eps=1e-06, affine=True)
15
+ (proj_in): Conv2d(320, 320, kernel_size=(1, 1), stride=(1, 1))
16
+ (transformer_blocks): ModuleList(
17
+ (0): TemporalBasicTransformerBlock(
18
+ (attn1): Attention(
19
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
20
+ (to_k): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
21
+ (to_v): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
22
+ (to_out): ModuleList(
23
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
24
+ (1): Dropout(p=0.0, inplace=False)
25
+ )
26
+ )
27
+ (norm1): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
28
+ (attn2): Attention(
29
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
30
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=320, bias=False)
31
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=320, bias=False)
32
+ (to_out): ModuleList(
33
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
34
+ (1): Dropout(p=0.0, inplace=False)
35
+ )
36
+ )
37
+ (norm2): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
38
+ (ff): FeedForward(
39
+ (net): ModuleList(
40
+ (0): GEGLU(
41
+ (proj): LoRACompatibleLinear(in_features=320, out_features=2560, bias=True)
42
+ )
43
+ (1): Dropout(p=0.0, inplace=False)
44
+ (2): LoRACompatibleLinear(in_features=1280, out_features=320, bias=True)
45
+ )
46
+ )
47
+ (norm3): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
48
+ )
49
+ )
50
+ (proj_out): Conv2d(320, 320, kernel_size=(1, 1), stride=(1, 1))
51
+ )
52
+ )
53
+ (resnets): ModuleList(
54
+ (0-1): 2 x ResnetBlock3D(
55
+ (norm1): InflatedGroupNorm(32, 320, eps=1e-05, affine=True)
56
+ (conv1): InflatedConv3d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
57
+ (time_emb_proj): Linear(in_features=1280, out_features=320, bias=True)
58
+ (norm2): InflatedGroupNorm(32, 320, eps=1e-05, affine=True)
59
+ (dropout): Dropout(p=0.0, inplace=False)
60
+ (conv2): InflatedConv3d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
61
+ (nonlinearity): SiLU()
62
+ )
63
+ )
64
+ (motion_modules): ModuleList(
65
+ (0-1): 2 x VanillaTemporalModule(
66
+ (temporal_transformer): TemporalTransformer3DModel(
67
+ (norm): GroupNorm(32, 320, eps=1e-06, affine=True)
68
+ (proj_in): Linear(in_features=320, out_features=320, bias=True)
69
+ (transformer_blocks): ModuleList(
70
+ (0): TemporalTransformerBlock(
71
+ (attention_blocks): ModuleList(
72
+ (0-1): 2 x VersatileAttention(
73
+ (Module Info) Attention_Mode: Temporal, Is_Cross_Attention: False
74
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
75
+ (to_k): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
76
+ (to_v): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
77
+ (to_out): ModuleList(
78
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
79
+ (1): Dropout(p=0.0, inplace=False)
80
+ )
81
+ (pos_encoder): PositionalEncoding(
82
+ (dropout): Dropout(p=0.0, inplace=False)
83
+ )
84
+ )
85
+ )
86
+ (norms): ModuleList(
87
+ (0-1): 2 x LayerNorm((320,), eps=1e-05, elementwise_affine=True)
88
+ )
89
+ (ff): FeedForward(
90
+ (net): ModuleList(
91
+ (0): GEGLU(
92
+ (proj): LoRACompatibleLinear(in_features=320, out_features=2560, bias=True)
93
+ )
94
+ (1): Dropout(p=0.0, inplace=False)
95
+ (2): LoRACompatibleLinear(in_features=1280, out_features=320, bias=True)
96
+ )
97
+ )
98
+ (ff_norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
99
+ )
100
+ )
101
+ (proj_out): Linear(in_features=320, out_features=320, bias=True)
102
+ )
103
+ )
104
+ )
105
+ (downsamplers): ModuleList(
106
+ (0): Downsample3D(
107
+ (conv): InflatedConv3d(320, 320, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
108
+ )
109
+ )
110
+ )
111
+ (1): CrossAttnDownBlock3D(
112
+ (attentions): ModuleList(
113
+ (0-1): 2 x Transformer3DModel(
114
+ (norm): GroupNorm(32, 640, eps=1e-06, affine=True)
115
+ (proj_in): Conv2d(640, 640, kernel_size=(1, 1), stride=(1, 1))
116
+ (transformer_blocks): ModuleList(
117
+ (0): TemporalBasicTransformerBlock(
118
+ (attn1): Attention(
119
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
120
+ (to_k): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
121
+ (to_v): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
122
+ (to_out): ModuleList(
123
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
124
+ (1): Dropout(p=0.0, inplace=False)
125
+ )
126
+ )
127
+ (norm1): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
128
+ (attn2): Attention(
129
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
130
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=640, bias=False)
131
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=640, bias=False)
132
+ (to_out): ModuleList(
133
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
134
+ (1): Dropout(p=0.0, inplace=False)
135
+ )
136
+ )
137
+ (norm2): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
138
+ (ff): FeedForward(
139
+ (net): ModuleList(
140
+ (0): GEGLU(
141
+ (proj): LoRACompatibleLinear(in_features=640, out_features=5120, bias=True)
142
+ )
143
+ (1): Dropout(p=0.0, inplace=False)
144
+ (2): LoRACompatibleLinear(in_features=2560, out_features=640, bias=True)
145
+ )
146
+ )
147
+ (norm3): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
148
+ )
149
+ )
150
+ (proj_out): Conv2d(640, 640, kernel_size=(1, 1), stride=(1, 1))
151
+ )
152
+ )
153
+ (resnets): ModuleList(
154
+ (0): ResnetBlock3D(
155
+ (norm1): InflatedGroupNorm(32, 320, eps=1e-05, affine=True)
156
+ (conv1): InflatedConv3d(320, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
157
+ (time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
158
+ (norm2): InflatedGroupNorm(32, 640, eps=1e-05, affine=True)
159
+ (dropout): Dropout(p=0.0, inplace=False)
160
+ (conv2): InflatedConv3d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
161
+ (nonlinearity): SiLU()
162
+ (conv_shortcut): InflatedConv3d(320, 640, kernel_size=(1, 1), stride=(1, 1))
163
+ )
164
+ (1): ResnetBlock3D(
165
+ (norm1): InflatedGroupNorm(32, 640, eps=1e-05, affine=True)
166
+ (conv1): InflatedConv3d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
167
+ (time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
168
+ (norm2): InflatedGroupNorm(32, 640, eps=1e-05, affine=True)
169
+ (dropout): Dropout(p=0.0, inplace=False)
170
+ (conv2): InflatedConv3d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
171
+ (nonlinearity): SiLU()
172
+ )
173
+ )
174
+ (motion_modules): ModuleList(
175
+ (0-1): 2 x VanillaTemporalModule(
176
+ (temporal_transformer): TemporalTransformer3DModel(
177
+ (norm): GroupNorm(32, 640, eps=1e-06, affine=True)
178
+ (proj_in): Linear(in_features=640, out_features=640, bias=True)
179
+ (transformer_blocks): ModuleList(
180
+ (0): TemporalTransformerBlock(
181
+ (attention_blocks): ModuleList(
182
+ (0-1): 2 x VersatileAttention(
183
+ (Module Info) Attention_Mode: Temporal, Is_Cross_Attention: False
184
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
185
+ (to_k): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
186
+ (to_v): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
187
+ (to_out): ModuleList(
188
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
189
+ (1): Dropout(p=0.0, inplace=False)
190
+ )
191
+ (pos_encoder): PositionalEncoding(
192
+ (dropout): Dropout(p=0.0, inplace=False)
193
+ )
194
+ )
195
+ )
196
+ (norms): ModuleList(
197
+ (0-1): 2 x LayerNorm((640,), eps=1e-05, elementwise_affine=True)
198
+ )
199
+ (ff): FeedForward(
200
+ (net): ModuleList(
201
+ (0): GEGLU(
202
+ (proj): LoRACompatibleLinear(in_features=640, out_features=5120, bias=True)
203
+ )
204
+ (1): Dropout(p=0.0, inplace=False)
205
+ (2): LoRACompatibleLinear(in_features=2560, out_features=640, bias=True)
206
+ )
207
+ )
208
+ (ff_norm): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
209
+ )
210
+ )
211
+ (proj_out): Linear(in_features=640, out_features=640, bias=True)
212
+ )
213
+ )
214
+ )
215
+ (downsamplers): ModuleList(
216
+ (0): Downsample3D(
217
+ (conv): InflatedConv3d(640, 640, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
218
+ )
219
+ )
220
+ )
221
+ (2): CrossAttnDownBlock3D(
222
+ (attentions): ModuleList(
223
+ (0-1): 2 x Transformer3DModel(
224
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
225
+ (proj_in): Conv2d(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
226
+ (transformer_blocks): ModuleList(
227
+ (0): TemporalBasicTransformerBlock(
228
+ (attn1): Attention(
229
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
230
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
231
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
232
+ (to_out): ModuleList(
233
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
234
+ (1): Dropout(p=0.0, inplace=False)
235
+ )
236
+ )
237
+ (norm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
238
+ (attn2): Attention(
239
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
240
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
241
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
242
+ (to_out): ModuleList(
243
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
244
+ (1): Dropout(p=0.0, inplace=False)
245
+ )
246
+ )
247
+ (norm2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
248
+ (ff): FeedForward(
249
+ (net): ModuleList(
250
+ (0): GEGLU(
251
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
252
+ )
253
+ (1): Dropout(p=0.0, inplace=False)
254
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
255
+ )
256
+ )
257
+ (norm3): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
258
+ )
259
+ )
260
+ (proj_out): Conv2d(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
261
+ )
262
+ )
263
+ (resnets): ModuleList(
264
+ (0): ResnetBlock3D(
265
+ (norm1): InflatedGroupNorm(32, 640, eps=1e-05, affine=True)
266
+ (conv1): InflatedConv3d(640, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
267
+ (time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
268
+ (norm2): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
269
+ (dropout): Dropout(p=0.0, inplace=False)
270
+ (conv2): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
271
+ (nonlinearity): SiLU()
272
+ (conv_shortcut): InflatedConv3d(640, 1280, kernel_size=(1, 1), stride=(1, 1))
273
+ )
274
+ (1): ResnetBlock3D(
275
+ (norm1): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
276
+ (conv1): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
277
+ (time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
278
+ (norm2): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
279
+ (dropout): Dropout(p=0.0, inplace=False)
280
+ (conv2): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
281
+ (nonlinearity): SiLU()
282
+ )
283
+ )
284
+ (motion_modules): ModuleList(
285
+ (0-1): 2 x VanillaTemporalModule(
286
+ (temporal_transformer): TemporalTransformer3DModel(
287
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
288
+ (proj_in): Linear(in_features=1280, out_features=1280, bias=True)
289
+ (transformer_blocks): ModuleList(
290
+ (0): TemporalTransformerBlock(
291
+ (attention_blocks): ModuleList(
292
+ (0-1): 2 x VersatileAttention(
293
+ (Module Info) Attention_Mode: Temporal, Is_Cross_Attention: False
294
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
295
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
296
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
297
+ (to_out): ModuleList(
298
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
299
+ (1): Dropout(p=0.0, inplace=False)
300
+ )
301
+ (pos_encoder): PositionalEncoding(
302
+ (dropout): Dropout(p=0.0, inplace=False)
303
+ )
304
+ )
305
+ )
306
+ (norms): ModuleList(
307
+ (0-1): 2 x LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
308
+ )
309
+ (ff): FeedForward(
310
+ (net): ModuleList(
311
+ (0): GEGLU(
312
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
313
+ )
314
+ (1): Dropout(p=0.0, inplace=False)
315
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
316
+ )
317
+ )
318
+ (ff_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
319
+ )
320
+ )
321
+ (proj_out): Linear(in_features=1280, out_features=1280, bias=True)
322
+ )
323
+ )
324
+ )
325
+ (downsamplers): ModuleList(
326
+ (0): Downsample3D(
327
+ (conv): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
328
+ )
329
+ )
330
+ )
331
+ (3): DownBlock3D(
332
+ (resnets): ModuleList(
333
+ (0-1): 2 x ResnetBlock3D(
334
+ (norm1): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
335
+ (conv1): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
336
+ (time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
337
+ (norm2): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
338
+ (dropout): Dropout(p=0.0, inplace=False)
339
+ (conv2): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
340
+ (nonlinearity): SiLU()
341
+ )
342
+ )
343
+ (motion_modules): ModuleList(
344
+ (0-1): 2 x VanillaTemporalModule(
345
+ (temporal_transformer): TemporalTransformer3DModel(
346
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
347
+ (proj_in): Linear(in_features=1280, out_features=1280, bias=True)
348
+ (transformer_blocks): ModuleList(
349
+ (0): TemporalTransformerBlock(
350
+ (attention_blocks): ModuleList(
351
+ (0-1): 2 x VersatileAttention(
352
+ (Module Info) Attention_Mode: Temporal, Is_Cross_Attention: False
353
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
354
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
355
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
356
+ (to_out): ModuleList(
357
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
358
+ (1): Dropout(p=0.0, inplace=False)
359
+ )
360
+ (pos_encoder): PositionalEncoding(
361
+ (dropout): Dropout(p=0.0, inplace=False)
362
+ )
363
+ )
364
+ )
365
+ (norms): ModuleList(
366
+ (0-1): 2 x LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
367
+ )
368
+ (ff): FeedForward(
369
+ (net): ModuleList(
370
+ (0): GEGLU(
371
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
372
+ )
373
+ (1): Dropout(p=0.0, inplace=False)
374
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
375
+ )
376
+ )
377
+ (ff_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
378
+ )
379
+ )
380
+ (proj_out): Linear(in_features=1280, out_features=1280, bias=True)
381
+ )
382
+ )
383
+ )
384
+ )
385
+ )
386
+ (up_blocks): ModuleList(
387
+ (0): UpBlock3D(
388
+ (resnets): ModuleList(
389
+ (0-2): 3 x ResnetBlock3D(
390
+ (norm1): InflatedGroupNorm(32, 2560, eps=1e-05, affine=True)
391
+ (conv1): InflatedConv3d(2560, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
392
+ (time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
393
+ (norm2): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
394
+ (dropout): Dropout(p=0.0, inplace=False)
395
+ (conv2): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
396
+ (nonlinearity): SiLU()
397
+ (conv_shortcut): InflatedConv3d(2560, 1280, kernel_size=(1, 1), stride=(1, 1))
398
+ )
399
+ )
400
+ (motion_modules): ModuleList(
401
+ (0-2): 3 x VanillaTemporalModule(
402
+ (temporal_transformer): TemporalTransformer3DModel(
403
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
404
+ (proj_in): Linear(in_features=1280, out_features=1280, bias=True)
405
+ (transformer_blocks): ModuleList(
406
+ (0): TemporalTransformerBlock(
407
+ (attention_blocks): ModuleList(
408
+ (0-1): 2 x VersatileAttention(
409
+ (Module Info) Attention_Mode: Temporal, Is_Cross_Attention: False
410
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
411
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
412
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
413
+ (to_out): ModuleList(
414
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
415
+ (1): Dropout(p=0.0, inplace=False)
416
+ )
417
+ (pos_encoder): PositionalEncoding(
418
+ (dropout): Dropout(p=0.0, inplace=False)
419
+ )
420
+ )
421
+ )
422
+ (norms): ModuleList(
423
+ (0-1): 2 x LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
424
+ )
425
+ (ff): FeedForward(
426
+ (net): ModuleList(
427
+ (0): GEGLU(
428
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
429
+ )
430
+ (1): Dropout(p=0.0, inplace=False)
431
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
432
+ )
433
+ )
434
+ (ff_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
435
+ )
436
+ )
437
+ (proj_out): Linear(in_features=1280, out_features=1280, bias=True)
438
+ )
439
+ )
440
+ )
441
+ (upsamplers): ModuleList(
442
+ (0): Upsample3D(
443
+ (conv): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
444
+ )
445
+ )
446
+ )
447
+ (1): CrossAttnUpBlock3D(
448
+ (attentions): ModuleList(
449
+ (0-2): 3 x Transformer3DModel(
450
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
451
+ (proj_in): Conv2d(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
452
+ (transformer_blocks): ModuleList(
453
+ (0): TemporalBasicTransformerBlock(
454
+ (attn1): Attention(
455
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
456
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
457
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
458
+ (to_out): ModuleList(
459
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
460
+ (1): Dropout(p=0.0, inplace=False)
461
+ )
462
+ )
463
+ (norm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
464
+ (attn2): Attention(
465
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
466
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
467
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
468
+ (to_out): ModuleList(
469
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
470
+ (1): Dropout(p=0.0, inplace=False)
471
+ )
472
+ )
473
+ (norm2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
474
+ (ff): FeedForward(
475
+ (net): ModuleList(
476
+ (0): GEGLU(
477
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
478
+ )
479
+ (1): Dropout(p=0.0, inplace=False)
480
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
481
+ )
482
+ )
483
+ (norm3): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
484
+ )
485
+ )
486
+ (proj_out): Conv2d(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
487
+ )
488
+ )
489
+ (resnets): ModuleList(
490
+ (0-1): 2 x ResnetBlock3D(
491
+ (norm1): InflatedGroupNorm(32, 2560, eps=1e-05, affine=True)
492
+ (conv1): InflatedConv3d(2560, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
493
+ (time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
494
+ (norm2): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
495
+ (dropout): Dropout(p=0.0, inplace=False)
496
+ (conv2): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
497
+ (nonlinearity): SiLU()
498
+ (conv_shortcut): InflatedConv3d(2560, 1280, kernel_size=(1, 1), stride=(1, 1))
499
+ )
500
+ (2): ResnetBlock3D(
501
+ (norm1): InflatedGroupNorm(32, 1920, eps=1e-05, affine=True)
502
+ (conv1): InflatedConv3d(1920, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
503
+ (time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
504
+ (norm2): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
505
+ (dropout): Dropout(p=0.0, inplace=False)
506
+ (conv2): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
507
+ (nonlinearity): SiLU()
508
+ (conv_shortcut): InflatedConv3d(1920, 1280, kernel_size=(1, 1), stride=(1, 1))
509
+ )
510
+ )
511
+ (motion_modules): ModuleList(
512
+ (0-2): 3 x VanillaTemporalModule(
513
+ (temporal_transformer): TemporalTransformer3DModel(
514
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
515
+ (proj_in): Linear(in_features=1280, out_features=1280, bias=True)
516
+ (transformer_blocks): ModuleList(
517
+ (0): TemporalTransformerBlock(
518
+ (attention_blocks): ModuleList(
519
+ (0-1): 2 x VersatileAttention(
520
+ (Module Info) Attention_Mode: Temporal, Is_Cross_Attention: False
521
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
522
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
523
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
524
+ (to_out): ModuleList(
525
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
526
+ (1): Dropout(p=0.0, inplace=False)
527
+ )
528
+ (pos_encoder): PositionalEncoding(
529
+ (dropout): Dropout(p=0.0, inplace=False)
530
+ )
531
+ )
532
+ )
533
+ (norms): ModuleList(
534
+ (0-1): 2 x LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
535
+ )
536
+ (ff): FeedForward(
537
+ (net): ModuleList(
538
+ (0): GEGLU(
539
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
540
+ )
541
+ (1): Dropout(p=0.0, inplace=False)
542
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
543
+ )
544
+ )
545
+ (ff_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
546
+ )
547
+ )
548
+ (proj_out): Linear(in_features=1280, out_features=1280, bias=True)
549
+ )
550
+ )
551
+ )
552
+ (upsamplers): ModuleList(
553
+ (0): Upsample3D(
554
+ (conv): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
555
+ )
556
+ )
557
+ )
558
+ (2): CrossAttnUpBlock3D(
559
+ (attentions): ModuleList(
560
+ (0-2): 3 x Transformer3DModel(
561
+ (norm): GroupNorm(32, 640, eps=1e-06, affine=True)
562
+ (proj_in): Conv2d(640, 640, kernel_size=(1, 1), stride=(1, 1))
563
+ (transformer_blocks): ModuleList(
564
+ (0): TemporalBasicTransformerBlock(
565
+ (attn1): Attention(
566
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
567
+ (to_k): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
568
+ (to_v): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
569
+ (to_out): ModuleList(
570
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
571
+ (1): Dropout(p=0.0, inplace=False)
572
+ )
573
+ )
574
+ (norm1): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
575
+ (attn2): Attention(
576
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
577
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=640, bias=False)
578
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=640, bias=False)
579
+ (to_out): ModuleList(
580
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
581
+ (1): Dropout(p=0.0, inplace=False)
582
+ )
583
+ )
584
+ (norm2): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
585
+ (ff): FeedForward(
586
+ (net): ModuleList(
587
+ (0): GEGLU(
588
+ (proj): LoRACompatibleLinear(in_features=640, out_features=5120, bias=True)
589
+ )
590
+ (1): Dropout(p=0.0, inplace=False)
591
+ (2): LoRACompatibleLinear(in_features=2560, out_features=640, bias=True)
592
+ )
593
+ )
594
+ (norm3): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
595
+ )
596
+ )
597
+ (proj_out): Conv2d(640, 640, kernel_size=(1, 1), stride=(1, 1))
598
+ )
599
+ )
600
+ (resnets): ModuleList(
601
+ (0): ResnetBlock3D(
602
+ (norm1): InflatedGroupNorm(32, 1920, eps=1e-05, affine=True)
603
+ (conv1): InflatedConv3d(1920, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
604
+ (time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
605
+ (norm2): InflatedGroupNorm(32, 640, eps=1e-05, affine=True)
606
+ (dropout): Dropout(p=0.0, inplace=False)
607
+ (conv2): InflatedConv3d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
608
+ (nonlinearity): SiLU()
609
+ (conv_shortcut): InflatedConv3d(1920, 640, kernel_size=(1, 1), stride=(1, 1))
610
+ )
611
+ (1): ResnetBlock3D(
612
+ (norm1): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
613
+ (conv1): InflatedConv3d(1280, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
614
+ (time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
615
+ (norm2): InflatedGroupNorm(32, 640, eps=1e-05, affine=True)
616
+ (dropout): Dropout(p=0.0, inplace=False)
617
+ (conv2): InflatedConv3d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
618
+ (nonlinearity): SiLU()
619
+ (conv_shortcut): InflatedConv3d(1280, 640, kernel_size=(1, 1), stride=(1, 1))
620
+ )
621
+ (2): ResnetBlock3D(
622
+ (norm1): InflatedGroupNorm(32, 960, eps=1e-05, affine=True)
623
+ (conv1): InflatedConv3d(960, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
624
+ (time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
625
+ (norm2): InflatedGroupNorm(32, 640, eps=1e-05, affine=True)
626
+ (dropout): Dropout(p=0.0, inplace=False)
627
+ (conv2): InflatedConv3d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
628
+ (nonlinearity): SiLU()
629
+ (conv_shortcut): InflatedConv3d(960, 640, kernel_size=(1, 1), stride=(1, 1))
630
+ )
631
+ )
632
+ (motion_modules): ModuleList(
633
+ (0-2): 3 x VanillaTemporalModule(
634
+ (temporal_transformer): TemporalTransformer3DModel(
635
+ (norm): GroupNorm(32, 640, eps=1e-06, affine=True)
636
+ (proj_in): Linear(in_features=640, out_features=640, bias=True)
637
+ (transformer_blocks): ModuleList(
638
+ (0): TemporalTransformerBlock(
639
+ (attention_blocks): ModuleList(
640
+ (0-1): 2 x VersatileAttention(
641
+ (Module Info) Attention_Mode: Temporal, Is_Cross_Attention: False
642
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
643
+ (to_k): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
644
+ (to_v): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
645
+ (to_out): ModuleList(
646
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
647
+ (1): Dropout(p=0.0, inplace=False)
648
+ )
649
+ (pos_encoder): PositionalEncoding(
650
+ (dropout): Dropout(p=0.0, inplace=False)
651
+ )
652
+ )
653
+ )
654
+ (norms): ModuleList(
655
+ (0-1): 2 x LayerNorm((640,), eps=1e-05, elementwise_affine=True)
656
+ )
657
+ (ff): FeedForward(
658
+ (net): ModuleList(
659
+ (0): GEGLU(
660
+ (proj): LoRACompatibleLinear(in_features=640, out_features=5120, bias=True)
661
+ )
662
+ (1): Dropout(p=0.0, inplace=False)
663
+ (2): LoRACompatibleLinear(in_features=2560, out_features=640, bias=True)
664
+ )
665
+ )
666
+ (ff_norm): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
667
+ )
668
+ )
669
+ (proj_out): Linear(in_features=640, out_features=640, bias=True)
670
+ )
671
+ )
672
+ )
673
+ (upsamplers): ModuleList(
674
+ (0): Upsample3D(
675
+ (conv): InflatedConv3d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
676
+ )
677
+ )
678
+ )
679
+ (3): CrossAttnUpBlock3D(
680
+ (attentions): ModuleList(
681
+ (0-2): 3 x Transformer3DModel(
682
+ (norm): GroupNorm(32, 320, eps=1e-06, affine=True)
683
+ (proj_in): Conv2d(320, 320, kernel_size=(1, 1), stride=(1, 1))
684
+ (transformer_blocks): ModuleList(
685
+ (0): TemporalBasicTransformerBlock(
686
+ (attn1): Attention(
687
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
688
+ (to_k): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
689
+ (to_v): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
690
+ (to_out): ModuleList(
691
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
692
+ (1): Dropout(p=0.0, inplace=False)
693
+ )
694
+ )
695
+ (norm1): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
696
+ (attn2): Attention(
697
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
698
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=320, bias=False)
699
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=320, bias=False)
700
+ (to_out): ModuleList(
701
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
702
+ (1): Dropout(p=0.0, inplace=False)
703
+ )
704
+ )
705
+ (norm2): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
706
+ (ff): FeedForward(
707
+ (net): ModuleList(
708
+ (0): GEGLU(
709
+ (proj): LoRACompatibleLinear(in_features=320, out_features=2560, bias=True)
710
+ )
711
+ (1): Dropout(p=0.0, inplace=False)
712
+ (2): LoRACompatibleLinear(in_features=1280, out_features=320, bias=True)
713
+ )
714
+ )
715
+ (norm3): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
716
+ )
717
+ )
718
+ (proj_out): Conv2d(320, 320, kernel_size=(1, 1), stride=(1, 1))
719
+ )
720
+ )
721
+ (resnets): ModuleList(
722
+ (0): ResnetBlock3D(
723
+ (norm1): InflatedGroupNorm(32, 960, eps=1e-05, affine=True)
724
+ (conv1): InflatedConv3d(960, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
725
+ (time_emb_proj): Linear(in_features=1280, out_features=320, bias=True)
726
+ (norm2): InflatedGroupNorm(32, 320, eps=1e-05, affine=True)
727
+ (dropout): Dropout(p=0.0, inplace=False)
728
+ (conv2): InflatedConv3d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
729
+ (nonlinearity): SiLU()
730
+ (conv_shortcut): InflatedConv3d(960, 320, kernel_size=(1, 1), stride=(1, 1))
731
+ )
732
+ (1-2): 2 x ResnetBlock3D(
733
+ (norm1): InflatedGroupNorm(32, 640, eps=1e-05, affine=True)
734
+ (conv1): InflatedConv3d(640, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
735
+ (time_emb_proj): Linear(in_features=1280, out_features=320, bias=True)
736
+ (norm2): InflatedGroupNorm(32, 320, eps=1e-05, affine=True)
737
+ (dropout): Dropout(p=0.0, inplace=False)
738
+ (conv2): InflatedConv3d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
739
+ (nonlinearity): SiLU()
740
+ (conv_shortcut): InflatedConv3d(640, 320, kernel_size=(1, 1), stride=(1, 1))
741
+ )
742
+ )
743
+ (motion_modules): ModuleList(
744
+ (0-2): 3 x VanillaTemporalModule(
745
+ (temporal_transformer): TemporalTransformer3DModel(
746
+ (norm): GroupNorm(32, 320, eps=1e-06, affine=True)
747
+ (proj_in): Linear(in_features=320, out_features=320, bias=True)
748
+ (transformer_blocks): ModuleList(
749
+ (0): TemporalTransformerBlock(
750
+ (attention_blocks): ModuleList(
751
+ (0-1): 2 x VersatileAttention(
752
+ (Module Info) Attention_Mode: Temporal, Is_Cross_Attention: False
753
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
754
+ (to_k): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
755
+ (to_v): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
756
+ (to_out): ModuleList(
757
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
758
+ (1): Dropout(p=0.0, inplace=False)
759
+ )
760
+ (pos_encoder): PositionalEncoding(
761
+ (dropout): Dropout(p=0.0, inplace=False)
762
+ )
763
+ )
764
+ )
765
+ (norms): ModuleList(
766
+ (0-1): 2 x LayerNorm((320,), eps=1e-05, elementwise_affine=True)
767
+ )
768
+ (ff): FeedForward(
769
+ (net): ModuleList(
770
+ (0): GEGLU(
771
+ (proj): LoRACompatibleLinear(in_features=320, out_features=2560, bias=True)
772
+ )
773
+ (1): Dropout(p=0.0, inplace=False)
774
+ (2): LoRACompatibleLinear(in_features=1280, out_features=320, bias=True)
775
+ )
776
+ )
777
+ (ff_norm): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
778
+ )
779
+ )
780
+ (proj_out): Linear(in_features=320, out_features=320, bias=True)
781
+ )
782
+ )
783
+ )
784
+ )
785
+ )
786
+ (mid_block): UNetMidBlock3DCrossAttn(
787
+ (attentions): ModuleList(
788
+ (0): Transformer3DModel(
789
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
790
+ (proj_in): Conv2d(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
791
+ (transformer_blocks): ModuleList(
792
+ (0): TemporalBasicTransformerBlock(
793
+ (attn1): Attention(
794
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
795
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
796
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
797
+ (to_out): ModuleList(
798
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
799
+ (1): Dropout(p=0.0, inplace=False)
800
+ )
801
+ )
802
+ (norm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
803
+ (attn2): Attention(
804
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
805
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
806
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
807
+ (to_out): ModuleList(
808
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
809
+ (1): Dropout(p=0.0, inplace=False)
810
+ )
811
+ )
812
+ (norm2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
813
+ (ff): FeedForward(
814
+ (net): ModuleList(
815
+ (0): GEGLU(
816
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
817
+ )
818
+ (1): Dropout(p=0.0, inplace=False)
819
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
820
+ )
821
+ )
822
+ (norm3): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
823
+ )
824
+ )
825
+ (proj_out): Conv2d(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
826
+ )
827
+ )
828
+ (resnets): ModuleList(
829
+ (0-1): 2 x ResnetBlock3D(
830
+ (norm1): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
831
+ (conv1): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
832
+ (time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
833
+ (norm2): InflatedGroupNorm(32, 1280, eps=1e-05, affine=True)
834
+ (dropout): Dropout(p=0.0, inplace=False)
835
+ (conv2): InflatedConv3d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
836
+ (nonlinearity): SiLU()
837
+ )
838
+ )
839
+ (motion_modules): ModuleList(
840
+ (0): VanillaTemporalModule(
841
+ (temporal_transformer): TemporalTransformer3DModel(
842
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
843
+ (proj_in): Linear(in_features=1280, out_features=1280, bias=True)
844
+ (transformer_blocks): ModuleList(
845
+ (0): TemporalTransformerBlock(
846
+ (attention_blocks): ModuleList(
847
+ (0-1): 2 x VersatileAttention(
848
+ (Module Info) Attention_Mode: Temporal, Is_Cross_Attention: False
849
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
850
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
851
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
852
+ (to_out): ModuleList(
853
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
854
+ (1): Dropout(p=0.0, inplace=False)
855
+ )
856
+ (pos_encoder): PositionalEncoding(
857
+ (dropout): Dropout(p=0.0, inplace=False)
858
+ )
859
+ )
860
+ )
861
+ (norms): ModuleList(
862
+ (0-1): 2 x LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
863
+ )
864
+ (ff): FeedForward(
865
+ (net): ModuleList(
866
+ (0): GEGLU(
867
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
868
+ )
869
+ (1): Dropout(p=0.0, inplace=False)
870
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
871
+ )
872
+ )
873
+ (ff_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
874
+ )
875
+ )
876
+ (proj_out): Linear(in_features=1280, out_features=1280, bias=True)
877
+ )
878
+ )
879
+ )
880
+ )
881
+ (conv_norm_out): InflatedGroupNorm(32, 320, eps=1e-05, affine=True)
882
+ (conv_act): SiLU()
883
+ (conv_out): InflatedConv3d(320, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
884
+ )
885
+ Reference UNet structure:
886
+ UNet2DConditionModel(
887
+ (conv_in): Conv2d(5, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
888
+ (time_proj): Timesteps()
889
+ (time_embedding): TimestepEmbedding(
890
+ (linear_1): LoRACompatibleLinear(in_features=320, out_features=1280, bias=True)
891
+ (act): SiLU()
892
+ (linear_2): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
893
+ )
894
+ (down_blocks): ModuleList(
895
+ (0): CrossAttnDownBlock2D(
896
+ (attentions): ModuleList(
897
+ (0-1): 2 x Transformer2DModel(
898
+ (norm): GroupNorm(32, 320, eps=1e-06, affine=True)
899
+ (proj_in): LoRACompatibleConv(320, 320, kernel_size=(1, 1), stride=(1, 1))
900
+ (transformer_blocks): ModuleList(
901
+ (0): BasicTransformerBlock(
902
+ (norm1): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
903
+ (attn1): Attention(
904
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
905
+ (to_k): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
906
+ (to_v): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
907
+ (to_out): ModuleList(
908
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
909
+ (1): Dropout(p=0.0, inplace=False)
910
+ )
911
+ )
912
+ (norm2): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
913
+ (attn2): Attention(
914
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
915
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=320, bias=False)
916
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=320, bias=False)
917
+ (to_out): ModuleList(
918
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
919
+ (1): Dropout(p=0.0, inplace=False)
920
+ )
921
+ )
922
+ (norm3): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
923
+ (ff): FeedForward(
924
+ (net): ModuleList(
925
+ (0): GEGLU(
926
+ (proj): LoRACompatibleLinear(in_features=320, out_features=2560, bias=True)
927
+ )
928
+ (1): Dropout(p=0.0, inplace=False)
929
+ (2): LoRACompatibleLinear(in_features=1280, out_features=320, bias=True)
930
+ )
931
+ )
932
+ )
933
+ )
934
+ (proj_out): LoRACompatibleConv(320, 320, kernel_size=(1, 1), stride=(1, 1))
935
+ )
936
+ )
937
+ (resnets): ModuleList(
938
+ (0-1): 2 x ResnetBlock2D(
939
+ (norm1): GroupNorm(32, 320, eps=1e-05, affine=True)
940
+ (conv1): LoRACompatibleConv(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
941
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=320, bias=True)
942
+ (norm2): GroupNorm(32, 320, eps=1e-05, affine=True)
943
+ (dropout): Dropout(p=0.0, inplace=False)
944
+ (conv2): LoRACompatibleConv(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
945
+ (nonlinearity): SiLU()
946
+ )
947
+ )
948
+ (downsamplers): ModuleList(
949
+ (0): Downsample2D(
950
+ (conv): LoRACompatibleConv(320, 320, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
951
+ )
952
+ )
953
+ )
954
+ (1): CrossAttnDownBlock2D(
955
+ (attentions): ModuleList(
956
+ (0-1): 2 x Transformer2DModel(
957
+ (norm): GroupNorm(32, 640, eps=1e-06, affine=True)
958
+ (proj_in): LoRACompatibleConv(640, 640, kernel_size=(1, 1), stride=(1, 1))
959
+ (transformer_blocks): ModuleList(
960
+ (0): BasicTransformerBlock(
961
+ (norm1): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
962
+ (attn1): Attention(
963
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
964
+ (to_k): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
965
+ (to_v): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
966
+ (to_out): ModuleList(
967
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
968
+ (1): Dropout(p=0.0, inplace=False)
969
+ )
970
+ )
971
+ (norm2): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
972
+ (attn2): Attention(
973
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
974
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=640, bias=False)
975
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=640, bias=False)
976
+ (to_out): ModuleList(
977
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
978
+ (1): Dropout(p=0.0, inplace=False)
979
+ )
980
+ )
981
+ (norm3): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
982
+ (ff): FeedForward(
983
+ (net): ModuleList(
984
+ (0): GEGLU(
985
+ (proj): LoRACompatibleLinear(in_features=640, out_features=5120, bias=True)
986
+ )
987
+ (1): Dropout(p=0.0, inplace=False)
988
+ (2): LoRACompatibleLinear(in_features=2560, out_features=640, bias=True)
989
+ )
990
+ )
991
+ )
992
+ )
993
+ (proj_out): LoRACompatibleConv(640, 640, kernel_size=(1, 1), stride=(1, 1))
994
+ )
995
+ )
996
+ (resnets): ModuleList(
997
+ (0): ResnetBlock2D(
998
+ (norm1): GroupNorm(32, 320, eps=1e-05, affine=True)
999
+ (conv1): LoRACompatibleConv(320, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1000
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=640, bias=True)
1001
+ (norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
1002
+ (dropout): Dropout(p=0.0, inplace=False)
1003
+ (conv2): LoRACompatibleConv(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1004
+ (nonlinearity): SiLU()
1005
+ (conv_shortcut): LoRACompatibleConv(320, 640, kernel_size=(1, 1), stride=(1, 1))
1006
+ )
1007
+ (1): ResnetBlock2D(
1008
+ (norm1): GroupNorm(32, 640, eps=1e-05, affine=True)
1009
+ (conv1): LoRACompatibleConv(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1010
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=640, bias=True)
1011
+ (norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
1012
+ (dropout): Dropout(p=0.0, inplace=False)
1013
+ (conv2): LoRACompatibleConv(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1014
+ (nonlinearity): SiLU()
1015
+ )
1016
+ )
1017
+ (downsamplers): ModuleList(
1018
+ (0): Downsample2D(
1019
+ (conv): LoRACompatibleConv(640, 640, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
1020
+ )
1021
+ )
1022
+ )
1023
+ (2): CrossAttnDownBlock2D(
1024
+ (attentions): ModuleList(
1025
+ (0-1): 2 x Transformer2DModel(
1026
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
1027
+ (proj_in): LoRACompatibleConv(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
1028
+ (transformer_blocks): ModuleList(
1029
+ (0): BasicTransformerBlock(
1030
+ (norm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
1031
+ (attn1): Attention(
1032
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1033
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1034
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1035
+ (to_out): ModuleList(
1036
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1037
+ (1): Dropout(p=0.0, inplace=False)
1038
+ )
1039
+ )
1040
+ (norm2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
1041
+ (attn2): Attention(
1042
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1043
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
1044
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
1045
+ (to_out): ModuleList(
1046
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1047
+ (1): Dropout(p=0.0, inplace=False)
1048
+ )
1049
+ )
1050
+ (norm3): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
1051
+ (ff): FeedForward(
1052
+ (net): ModuleList(
1053
+ (0): GEGLU(
1054
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
1055
+ )
1056
+ (1): Dropout(p=0.0, inplace=False)
1057
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
1058
+ )
1059
+ )
1060
+ )
1061
+ )
1062
+ (proj_out): LoRACompatibleConv(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
1063
+ )
1064
+ )
1065
+ (resnets): ModuleList(
1066
+ (0): ResnetBlock2D(
1067
+ (norm1): GroupNorm(32, 640, eps=1e-05, affine=True)
1068
+ (conv1): LoRACompatibleConv(640, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1069
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1070
+ (norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
1071
+ (dropout): Dropout(p=0.0, inplace=False)
1072
+ (conv2): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1073
+ (nonlinearity): SiLU()
1074
+ (conv_shortcut): LoRACompatibleConv(640, 1280, kernel_size=(1, 1), stride=(1, 1))
1075
+ )
1076
+ (1): ResnetBlock2D(
1077
+ (norm1): GroupNorm(32, 1280, eps=1e-05, affine=True)
1078
+ (conv1): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1079
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1080
+ (norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
1081
+ (dropout): Dropout(p=0.0, inplace=False)
1082
+ (conv2): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1083
+ (nonlinearity): SiLU()
1084
+ )
1085
+ )
1086
+ (downsamplers): ModuleList(
1087
+ (0): Downsample2D(
1088
+ (conv): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
1089
+ )
1090
+ )
1091
+ )
1092
+ (3): DownBlock2D(
1093
+ (resnets): ModuleList(
1094
+ (0-1): 2 x ResnetBlock2D(
1095
+ (norm1): GroupNorm(32, 1280, eps=1e-05, affine=True)
1096
+ (conv1): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1097
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1098
+ (norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
1099
+ (dropout): Dropout(p=0.0, inplace=False)
1100
+ (conv2): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1101
+ (nonlinearity): SiLU()
1102
+ )
1103
+ )
1104
+ )
1105
+ )
1106
+ (up_blocks): ModuleList(
1107
+ (0): UpBlock2D(
1108
+ (resnets): ModuleList(
1109
+ (0-2): 3 x ResnetBlock2D(
1110
+ (norm1): GroupNorm(32, 2560, eps=1e-05, affine=True)
1111
+ (conv1): LoRACompatibleConv(2560, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1112
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1113
+ (norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
1114
+ (dropout): Dropout(p=0.0, inplace=False)
1115
+ (conv2): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1116
+ (nonlinearity): SiLU()
1117
+ (conv_shortcut): LoRACompatibleConv(2560, 1280, kernel_size=(1, 1), stride=(1, 1))
1118
+ )
1119
+ )
1120
+ (upsamplers): ModuleList(
1121
+ (0): Upsample2D(
1122
+ (conv): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1123
+ )
1124
+ )
1125
+ )
1126
+ (1): CrossAttnUpBlock2D(
1127
+ (attentions): ModuleList(
1128
+ (0-2): 3 x Transformer2DModel(
1129
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
1130
+ (proj_in): LoRACompatibleConv(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
1131
+ (transformer_blocks): ModuleList(
1132
+ (0): BasicTransformerBlock(
1133
+ (norm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
1134
+ (attn1): Attention(
1135
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1136
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1137
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1138
+ (to_out): ModuleList(
1139
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1140
+ (1): Dropout(p=0.0, inplace=False)
1141
+ )
1142
+ )
1143
+ (norm2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
1144
+ (attn2): Attention(
1145
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1146
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
1147
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
1148
+ (to_out): ModuleList(
1149
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1150
+ (1): Dropout(p=0.0, inplace=False)
1151
+ )
1152
+ )
1153
+ (norm3): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
1154
+ (ff): FeedForward(
1155
+ (net): ModuleList(
1156
+ (0): GEGLU(
1157
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
1158
+ )
1159
+ (1): Dropout(p=0.0, inplace=False)
1160
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
1161
+ )
1162
+ )
1163
+ )
1164
+ )
1165
+ (proj_out): LoRACompatibleConv(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
1166
+ )
1167
+ )
1168
+ (resnets): ModuleList(
1169
+ (0-1): 2 x ResnetBlock2D(
1170
+ (norm1): GroupNorm(32, 2560, eps=1e-05, affine=True)
1171
+ (conv1): LoRACompatibleConv(2560, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1172
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1173
+ (norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
1174
+ (dropout): Dropout(p=0.0, inplace=False)
1175
+ (conv2): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1176
+ (nonlinearity): SiLU()
1177
+ (conv_shortcut): LoRACompatibleConv(2560, 1280, kernel_size=(1, 1), stride=(1, 1))
1178
+ )
1179
+ (2): ResnetBlock2D(
1180
+ (norm1): GroupNorm(32, 1920, eps=1e-05, affine=True)
1181
+ (conv1): LoRACompatibleConv(1920, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1182
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1183
+ (norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
1184
+ (dropout): Dropout(p=0.0, inplace=False)
1185
+ (conv2): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1186
+ (nonlinearity): SiLU()
1187
+ (conv_shortcut): LoRACompatibleConv(1920, 1280, kernel_size=(1, 1), stride=(1, 1))
1188
+ )
1189
+ )
1190
+ (upsamplers): ModuleList(
1191
+ (0): Upsample2D(
1192
+ (conv): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1193
+ )
1194
+ )
1195
+ )
1196
+ (2): CrossAttnUpBlock2D(
1197
+ (attentions): ModuleList(
1198
+ (0-2): 3 x Transformer2DModel(
1199
+ (norm): GroupNorm(32, 640, eps=1e-06, affine=True)
1200
+ (proj_in): LoRACompatibleConv(640, 640, kernel_size=(1, 1), stride=(1, 1))
1201
+ (transformer_blocks): ModuleList(
1202
+ (0): BasicTransformerBlock(
1203
+ (norm1): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
1204
+ (attn1): Attention(
1205
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
1206
+ (to_k): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
1207
+ (to_v): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
1208
+ (to_out): ModuleList(
1209
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
1210
+ (1): Dropout(p=0.0, inplace=False)
1211
+ )
1212
+ )
1213
+ (norm2): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
1214
+ (attn2): Attention(
1215
+ (to_q): LoRACompatibleLinear(in_features=640, out_features=640, bias=False)
1216
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=640, bias=False)
1217
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=640, bias=False)
1218
+ (to_out): ModuleList(
1219
+ (0): LoRACompatibleLinear(in_features=640, out_features=640, bias=True)
1220
+ (1): Dropout(p=0.0, inplace=False)
1221
+ )
1222
+ )
1223
+ (norm3): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
1224
+ (ff): FeedForward(
1225
+ (net): ModuleList(
1226
+ (0): GEGLU(
1227
+ (proj): LoRACompatibleLinear(in_features=640, out_features=5120, bias=True)
1228
+ )
1229
+ (1): Dropout(p=0.0, inplace=False)
1230
+ (2): LoRACompatibleLinear(in_features=2560, out_features=640, bias=True)
1231
+ )
1232
+ )
1233
+ )
1234
+ )
1235
+ (proj_out): LoRACompatibleConv(640, 640, kernel_size=(1, 1), stride=(1, 1))
1236
+ )
1237
+ )
1238
+ (resnets): ModuleList(
1239
+ (0): ResnetBlock2D(
1240
+ (norm1): GroupNorm(32, 1920, eps=1e-05, affine=True)
1241
+ (conv1): LoRACompatibleConv(1920, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1242
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=640, bias=True)
1243
+ (norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
1244
+ (dropout): Dropout(p=0.0, inplace=False)
1245
+ (conv2): LoRACompatibleConv(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1246
+ (nonlinearity): SiLU()
1247
+ (conv_shortcut): LoRACompatibleConv(1920, 640, kernel_size=(1, 1), stride=(1, 1))
1248
+ )
1249
+ (1): ResnetBlock2D(
1250
+ (norm1): GroupNorm(32, 1280, eps=1e-05, affine=True)
1251
+ (conv1): LoRACompatibleConv(1280, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1252
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=640, bias=True)
1253
+ (norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
1254
+ (dropout): Dropout(p=0.0, inplace=False)
1255
+ (conv2): LoRACompatibleConv(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1256
+ (nonlinearity): SiLU()
1257
+ (conv_shortcut): LoRACompatibleConv(1280, 640, kernel_size=(1, 1), stride=(1, 1))
1258
+ )
1259
+ (2): ResnetBlock2D(
1260
+ (norm1): GroupNorm(32, 960, eps=1e-05, affine=True)
1261
+ (conv1): LoRACompatibleConv(960, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1262
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=640, bias=True)
1263
+ (norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
1264
+ (dropout): Dropout(p=0.0, inplace=False)
1265
+ (conv2): LoRACompatibleConv(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1266
+ (nonlinearity): SiLU()
1267
+ (conv_shortcut): LoRACompatibleConv(960, 640, kernel_size=(1, 1), stride=(1, 1))
1268
+ )
1269
+ )
1270
+ (upsamplers): ModuleList(
1271
+ (0): Upsample2D(
1272
+ (conv): LoRACompatibleConv(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1273
+ )
1274
+ )
1275
+ )
1276
+ (3): CrossAttnUpBlock2D(
1277
+ (attentions): ModuleList(
1278
+ (0-2): 3 x Transformer2DModel(
1279
+ (norm): GroupNorm(32, 320, eps=1e-06, affine=True)
1280
+ (proj_in): LoRACompatibleConv(320, 320, kernel_size=(1, 1), stride=(1, 1))
1281
+ (transformer_blocks): ModuleList(
1282
+ (0): BasicTransformerBlock(
1283
+ (norm1): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
1284
+ (attn1): Attention(
1285
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
1286
+ (to_k): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
1287
+ (to_v): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
1288
+ (to_out): ModuleList(
1289
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
1290
+ (1): Dropout(p=0.0, inplace=False)
1291
+ )
1292
+ )
1293
+ (norm2): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
1294
+ (attn2): Attention(
1295
+ (to_q): LoRACompatibleLinear(in_features=320, out_features=320, bias=False)
1296
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=320, bias=False)
1297
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=320, bias=False)
1298
+ (to_out): ModuleList(
1299
+ (0): LoRACompatibleLinear(in_features=320, out_features=320, bias=True)
1300
+ (1): Dropout(p=0.0, inplace=False)
1301
+ )
1302
+ )
1303
+ (norm3): LayerNorm((320,), eps=1e-05, elementwise_affine=True)
1304
+ (ff): FeedForward(
1305
+ (net): ModuleList(
1306
+ (0): GEGLU(
1307
+ (proj): LoRACompatibleLinear(in_features=320, out_features=2560, bias=True)
1308
+ )
1309
+ (1): Dropout(p=0.0, inplace=False)
1310
+ (2): LoRACompatibleLinear(in_features=1280, out_features=320, bias=True)
1311
+ )
1312
+ )
1313
+ )
1314
+ )
1315
+ (proj_out): LoRACompatibleConv(320, 320, kernel_size=(1, 1), stride=(1, 1))
1316
+ )
1317
+ )
1318
+ (resnets): ModuleList(
1319
+ (0): ResnetBlock2D(
1320
+ (norm1): GroupNorm(32, 960, eps=1e-05, affine=True)
1321
+ (conv1): LoRACompatibleConv(960, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1322
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=320, bias=True)
1323
+ (norm2): GroupNorm(32, 320, eps=1e-05, affine=True)
1324
+ (dropout): Dropout(p=0.0, inplace=False)
1325
+ (conv2): LoRACompatibleConv(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1326
+ (nonlinearity): SiLU()
1327
+ (conv_shortcut): LoRACompatibleConv(960, 320, kernel_size=(1, 1), stride=(1, 1))
1328
+ )
1329
+ (1-2): 2 x ResnetBlock2D(
1330
+ (norm1): GroupNorm(32, 640, eps=1e-05, affine=True)
1331
+ (conv1): LoRACompatibleConv(640, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1332
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=320, bias=True)
1333
+ (norm2): GroupNorm(32, 320, eps=1e-05, affine=True)
1334
+ (dropout): Dropout(p=0.0, inplace=False)
1335
+ (conv2): LoRACompatibleConv(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1336
+ (nonlinearity): SiLU()
1337
+ (conv_shortcut): LoRACompatibleConv(640, 320, kernel_size=(1, 1), stride=(1, 1))
1338
+ )
1339
+ )
1340
+ )
1341
+ )
1342
+ (mid_block): UNetMidBlock2DCrossAttn(
1343
+ (attentions): ModuleList(
1344
+ (0): Transformer2DModel(
1345
+ (norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
1346
+ (proj_in): LoRACompatibleConv(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
1347
+ (transformer_blocks): ModuleList(
1348
+ (0): BasicTransformerBlock(
1349
+ (norm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
1350
+ (attn1): Attention(
1351
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1352
+ (to_k): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1353
+ (to_v): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1354
+ (to_out): ModuleList(
1355
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1356
+ (1): Dropout(p=0.0, inplace=False)
1357
+ )
1358
+ )
1359
+ (norm2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
1360
+ (attn2): Attention(
1361
+ (to_q): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=False)
1362
+ (to_k): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
1363
+ (to_v): LoRACompatibleLinear(in_features=768, out_features=1280, bias=False)
1364
+ (to_out): ModuleList(
1365
+ (0): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1366
+ (1): Dropout(p=0.0, inplace=False)
1367
+ )
1368
+ )
1369
+ (norm3): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
1370
+ (ff): FeedForward(
1371
+ (net): ModuleList(
1372
+ (0): GEGLU(
1373
+ (proj): LoRACompatibleLinear(in_features=1280, out_features=10240, bias=True)
1374
+ )
1375
+ (1): Dropout(p=0.0, inplace=False)
1376
+ (2): LoRACompatibleLinear(in_features=5120, out_features=1280, bias=True)
1377
+ )
1378
+ )
1379
+ )
1380
+ )
1381
+ (proj_out): LoRACompatibleConv(1280, 1280, kernel_size=(1, 1), stride=(1, 1))
1382
+ )
1383
+ )
1384
+ (resnets): ModuleList(
1385
+ (0-1): 2 x ResnetBlock2D(
1386
+ (norm1): GroupNorm(32, 1280, eps=1e-05, affine=True)
1387
+ (conv1): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1388
+ (time_emb_proj): LoRACompatibleLinear(in_features=1280, out_features=1280, bias=True)
1389
+ (norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
1390
+ (dropout): Dropout(p=0.0, inplace=False)
1391
+ (conv2): LoRACompatibleConv(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1392
+ (nonlinearity): SiLU()
1393
+ )
1394
+ )
1395
+ )
1396
+ (conv_norm_out): None
1397
+ (conv_act): SiLU()
1398
+ )
1399
+ Pose Guider structure:
1400
+ PoseGuider(
1401
+ (conv_in): InflatedConv3d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1402
+ (blocks): ModuleList(
1403
+ (0): InflatedConv3d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1404
+ (1): InflatedConv3d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
1405
+ (2): InflatedConv3d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1406
+ (3): InflatedConv3d(32, 96, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
1407
+ (4): InflatedConv3d(96, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1408
+ (5): InflatedConv3d(96, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
1409
+ )
1410
+ (conv_out): InflatedConv3d(256, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1411
+ )
1412
+ image_enc:
1413
+ CLIPVisionModelWithProjection(
1414
+ (vision_model): CLIPVisionTransformer(
1415
+ (embeddings): CLIPVisionEmbeddings(
1416
+ (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
1417
+ (position_embedding): Embedding(257, 1024)
1418
+ )
1419
+ (pre_layrnorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
1420
+ (encoder): CLIPEncoder(
1421
+ (layers): ModuleList(
1422
+ (0-23): 24 x CLIPEncoderLayer(
1423
+ (self_attn): CLIPAttention(
1424
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
1425
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
1426
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
1427
+ (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
1428
+ )
1429
+ (layer_norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
1430
+ (mlp): CLIPMLP(
1431
+ (activation_fn): QuickGELUActivation()
1432
+ (fc1): Linear(in_features=1024, out_features=4096, bias=True)
1433
+ (fc2): Linear(in_features=4096, out_features=1024, bias=True)
1434
+ )
1435
+ (layer_norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
1436
+ )
1437
+ )
1438
+ )
1439
+ (post_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
1440
+ )
1441
+ (visual_projection): Linear(in_features=1024, out_features=768, bias=False)
1442
+ )
1443
+ Pose Guider structure:
1444
+ PoseGuider(
1445
+ (conv_in): InflatedConv3d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1446
+ (blocks): ModuleList(
1447
+ (0): InflatedConv3d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1448
+ (1): InflatedConv3d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
1449
+ (2): InflatedConv3d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1450
+ (3): InflatedConv3d(32, 96, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
1451
+ (4): InflatedConv3d(96, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1452
+ (5): InflatedConv3d(96, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
1453
+ )
1454
+ (conv_out): InflatedConv3d(256, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1455
+ )
1456
+ pipe:
1457
+ Pose2VideoPipeline {
1458
+ "_class_name": "Pose2VideoPipeline",
1459
+ "_diffusers_version": "0.24.0",
1460
+ "denoising_unet": [
1461
+ "src.models.unet_3d",
1462
+ "UNet3DConditionModel"
1463
+ ],
1464
+ "image_encoder": [
1465
+ "transformers",
1466
+ "CLIPVisionModelWithProjection"
1467
+ ],
1468
+ "image_proj_model": [
1469
+ null,
1470
+ null
1471
+ ],
1472
+ "pose_guider": [
1473
+ "src.models.pose_guider",
1474
+ "PoseGuider"
1475
+ ],
1476
+ "reference_unet": [
1477
+ "src.models.unet_2d_condition",
1478
+ "UNet2DConditionModel"
1479
+ ],
1480
+ "scheduler": [
1481
+ "diffusers",
1482
+ "DDIMScheduler"
1483
+ ],
1484
+ "text_encoder": [
1485
+ null,
1486
+ null
1487
+ ],
1488
+ "tokenizer": [
1489
+ null,
1490
+ null
1491
+ ],
1492
+ "vae": [
1493
+ "diffusers",
1494
+ "AutoencoderKL"
1495
+ ]
1496
+ }
1497
+
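Note: the module trees in this log are simply PyTorch's textual representation of the instantiated networks. Below is a minimal sketch of how such a dump can be regenerated; it assumes this repo's src.models package is importable, that the ckpts/sd-image-variations-diffusers checkpoint referenced elsewhere in this commit is available locally, and that the unet_additional_kwargs mirror those used in train_stage_1.py (the dump above was printed with motion modules loaded at inference time, so the exact tree may differ slightly).

    from src.models.unet_2d_condition import UNet2DConditionModel
    from src.models.unet_3d import UNet3DConditionModel

    # Reference UNet: a 2D SD UNet whose conv_in is widened to 5 input channels.
    reference_unet = UNet2DConditionModel.from_pretrained_2d(
        "ckpts/sd-image-variations-diffusers",
        subfolder="unet",
        unet_additional_kwargs={"in_channels": 5},
    )

    # Denoising UNet: the inflated 3D UNet with a 9-channel conv_in (motion modules disabled here).
    denoising_unet = UNet3DConditionModel.from_pretrained_2d(
        "ckpts/sd-image-variations-diffusers",
        "",  # motion-module weight path, left empty as in stage-1 training
        subfolder="unet",
        unet_additional_kwargs={
            "in_channels": 9,
            "use_motion_module": False,
            "unet_use_temporal_attention": False,
        },
    )

    # Printing any torch.nn.Module yields the nested tree format shown in this log.
    print("Denoising UNet structure:")
    print(denoising_unet)
    print("Reference UNet structure:")
    print(reference_unet)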
myoutput.log ADDED
@@ -0,0 +1,2 @@
1
+ nohup: ignoring input
2
+ nohup: failed to run command 'CUDA_VISIBLE_DEVICES=2': No such file or directory
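Note: the "failed to run command 'CUDA_VISIBLE_DEVICES=2'" error recorded here (and again in output.log below) comes from nohup itself, not from the training or inference code: nohup treats its first argument as the program to execute, so a bare environment-variable assignment cannot be placed directly after it. A working form of the scripts.sh commands would be, for example, nohup env CUDA_VISIBLE_DEVICES=2 python vivid.py --config <config.yml> &, or exporting CUDA_VISIBLE_DEVICES before invoking nohup.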
nohup.out ADDED
The diff for this file is too large to render. See raw diff
 
output.log ADDED
@@ -0,0 +1,2 @@
1
+ nohup: ignoring input
2
+ nohup: failed to run command 'CUDA_VISIBLE_DEVICES=2': No such file or directory
output/20241207/1929--seed_42-384x512/upper1_00057_00_512x384_3_1929.mp4 ADDED
Binary file (233 kB). View file
 
output/20241207/2241--seed_42-384x512/3_s_1110342_in_xl_512x384_3_2241.mp4 ADDED
Binary file (194 kB). View file
 
output/20241207/2241--seed_42-384x512/7_s_1110342_in_xl_512x384_3_2241.mp4 ADDED
Binary file (196 kB). View file
 
output/20241207/2241--seed_42-384x512/8_s_1009794_in_xl_512x384_3_2241.mp4 ADDED
Binary file (201 kB). View file
 
output/20241207/2241--seed_42-384x512/8_s_1110342_in_xl_512x384_3_2241.mp4 ADDED
Binary file (201 kB). View file
 
read.py ADDED
@@ -0,0 +1,39 @@
1
+ import yaml
2
+ import os
3
+ # The video/image correspondence is assumed to be stored in a plain-text pairs file
4
+ file_pairs_file = "./dataset/ViViD/upper_body/test_pairs.txt"
5
+ output_yaml_path = "./configs/prompts/upper_body2.yaml" # path of the generated YAML file
6
+ videos_dir = "./dataset/ViViD/upper_body/videos"
7
+ images_dir = "./dataset/ViViD/upper_body/images"
8
+ # Prepare the data structure to be written to YAML
9
+ yaml_data = {
10
+ "pretrained_base_model_path": "ckpts/sd-image-variations-diffusers",
11
+ "pretrained_vae_path": "ckpts/sd-vae-ft-mse",
12
+ "image_encoder_path": "ckpts/sd-image-variations-diffusers/image_encoder",
13
+ "denoising_unet_path": "ckpts/ViViD/denoising_unet.pth",
14
+ "reference_unet_path": "ckpts/ViViD/reference_unet.pth",
15
+ "pose_guider_path": "ckpts/ViViD/pose_guider.pth",
16
+ "motion_module_path": "ckpts/MotionModule/mm_sd_v15_v2.ckpt",
17
+ "inference_config": "./configs/inference/inference.yaml",
18
+ "weight_dtype": "fp16",
19
+ "model_video_paths": [],
20
+ "cloth_image_paths": []
21
+ }
22
+
23
+
24
+ # Read the pairs file and populate the YAML data structure
25
+ with open(file_pairs_file, 'r') as file:
26
+ for line in file:
27
+ # Each line is expected to be "<video file name> <corresponding image file name>"
28
+ video_file_name, image_file_name = line.strip().split() # assumed to be whitespace-separated
29
+ # Build the full paths
30
+ video_path = os.path.join(videos_dir, video_file_name) # full video file path
31
+ image_path = os.path.join(images_dir, image_file_name) # full image file path
32
+ yaml_data["model_video_paths"].append(video_path) # append the video file path
33
+ yaml_data["cloth_image_paths"].append(image_path) # append the cloth image path
34
+
35
+ # Write the data to the YAML file
36
+ with open(output_yaml_path, 'w') as yaml_file:
37
+ yaml.dump(yaml_data, yaml_file, default_flow_style=False)
38
+
39
+ print(f"YAML file generated: {output_yaml_path}")
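Note: with default_flow_style=False, yaml.dump writes model_video_paths and cloth_image_paths as block-style lists of the full paths assembled above. PyYAML (5.1+) sorts top-level keys alphabetically by default, so pass sort_keys=False to yaml.dump if the original key order of the template dictionary should be preserved; this is standard PyYAML behaviour rather than anything specific to this script.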
requirements.txt ADDED
@@ -0,0 +1,29 @@
1
+ accelerate==0.21.0
2
+ av==11.0.0
3
+ clip @ https://github.com/openai/CLIP/archive/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1.zip#sha256=b5842c25da441d6c581b53a5c60e0c2127ebafe0f746f8e15561a006c6c3be6a
4
+ decord==0.6.0
5
+ diffusers==0.24.0
6
+ einops==0.4.1
7
+ gradio==3.41.2
8
+ gradio_client==0.5.0
9
+ imageio==2.33.0
10
+ imageio-ffmpeg==0.4.9
11
+ numpy==1.23.5
12
+ omegaconf==2.2.3
13
+ onnxruntime-gpu==1.16.3
14
+ open-clip-torch==2.20.0
15
+ opencv-contrib-python==4.8.1.78
16
+ opencv-python==4.8.1.78
17
+ Pillow==9.5.0
18
+ scikit-image==0.21.0
19
+ scikit-learn==1.3.2
20
+ scipy==1.11.4
21
+ torch==2.0.1
22
+ torchdiffeq==0.2.3
23
+ torchmetrics==1.2.1
24
+ torchsde==0.2.5
25
+ torchvision==0.15.2
26
+ tqdm==4.66.1
27
+ transformers==4.30.2
28
+ mlflow==2.9.2
29
+ xformers==0.0.22
scripts.sh ADDED
@@ -0,0 +1,7 @@
1
+ CUDA_VISIBLE_DEVICES=2 python vivid.py --config /mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/configs/prompts/test_lm_build/cloth_complex_dress.yml
2
+
3
+ CUDA_VISIBLE_DEVICES=2 python vivid.py --config /mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/configs/prompts/test_lm_build/cloth_complex_low.yml
4
+
5
+ CUDA_VISIBLE_DEVICES=2 python vivid.py --config /mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/configs/prompts/test_lm_build/cloth_complex_up.yml
6
+
7
+ CUDA_VISIBLE_DEVICES=2 python vivid.py --config /mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/configs/prompts/test_lm_build/complex_motion.yml
stage1_nohup.out ADDED
The diff for this file is too large to render. See raw diff
 
train_stage_1.py ADDED
@@ -0,0 +1,781 @@
1
+ import argparse
2
+ import logging
3
+ import math
4
+ import os
5
+ import os.path as osp
6
+ import random
7
+ import warnings
8
+ from datetime import datetime
9
+ from pathlib import Path
10
+ from tempfile import TemporaryDirectory
11
+
12
+ import diffusers
13
+ import mlflow
14
+ import numpy as np
15
+ import torch
16
+ import torch.nn as nn
17
+ import torch.nn.functional as F
18
+ import torch.utils.checkpoint
19
+ import transformers
20
+ from accelerate import Accelerator
21
+ from accelerate.logging import get_logger
22
+ from accelerate.utils import DistributedDataParallelKwargs
23
+ from diffusers import AutoencoderKL, DDIMScheduler
24
+ from diffusers.optimization import get_scheduler
25
+ from diffusers.utils import check_min_version
26
+ from diffusers.utils.import_utils import is_xformers_available
27
+ from omegaconf import OmegaConf
28
+ from PIL import Image
29
+ from tqdm.auto import tqdm
30
+ from transformers import CLIPVisionModelWithProjection
31
+
32
+ from src.dataset.dance_image import HumanDanceDataset
33
+ # from src.dwpose import DWposeDetector
34
+ from src.models.mutual_self_attention import ReferenceAttentionControl
35
+ from src.models.pose_guider import PoseGuider
36
+ from src.models.unet_2d_condition import UNet2DConditionModel
37
+ from src.models.unet_3d import UNet3DConditionModel
38
+ from src.pipelines.pipeline_pose2img import Pose2ImagePipeline
39
+ from src.utils.util import delete_additional_ckpt, import_filename, seed_everything
40
+
41
+ warnings.filterwarnings("ignore")
42
+
43
+ # Will raise an error if the minimal version of diffusers is not installed. Remove at your own risk.
44
+ check_min_version("0.10.0.dev0")
45
+
46
+ logger = get_logger(__name__, log_level="INFO")
47
+
48
+
49
+ class Net(nn.Module):
50
+ def __init__(
51
+ self,
52
+ reference_unet: UNet2DConditionModel,
53
+ denoising_unet: UNet3DConditionModel,
54
+ pose_guider: PoseGuider,
55
+ reference_control_writer,
56
+ reference_control_reader,
57
+ ):
58
+ super().__init__()
59
+ self.reference_unet = reference_unet
60
+ self.denoising_unet = denoising_unet
61
+ self.pose_guider = pose_guider
62
+ self.reference_control_writer = reference_control_writer
63
+ self.reference_control_reader = reference_control_reader
64
+
65
+ def forward(
66
+ self,
67
+ noisy_latents,
68
+ timesteps,
69
+ ref_image_latents,
70
+ clip_image_embeds,
71
+ pose_img,
72
+ uncond_fwd: bool = False,
73
+ ):
74
+ pose_cond_tensor = pose_img.to(device="cuda")
75
+ pose_fea = self.pose_guider(pose_cond_tensor)
76
+
77
+ if not uncond_fwd:
78
+ ref_timesteps = torch.zeros_like(timesteps)
79
+ self.reference_unet(
80
+ ref_image_latents,
81
+ ref_timesteps,
82
+ encoder_hidden_states=clip_image_embeds,
83
+ return_dict=False,
84
+ )
85
+ self.reference_control_reader.update(self.reference_control_writer)
86
+
87
+ model_pred = self.denoising_unet(
88
+ noisy_latents,
89
+ timesteps,
90
+ pose_cond_fea=pose_fea,
91
+ encoder_hidden_states=clip_image_embeds,
92
+ ).sample
93
+
94
+ return model_pred
95
+
96
+ def log_validation(
97
+ vae,
98
+ image_enc,
99
+ net,
100
+ scheduler,
101
+ accelerator,
102
+ width,
103
+ height,
104
+ save_dir,
105
+ global_step,
106
+ ):
107
+ logger.info("Running validation... ")
108
+
109
+ ori_net = accelerator.unwrap_model(net)
110
+ reference_unet = ori_net.reference_unet
111
+ denoising_unet = ori_net.denoising_unet
112
+ pose_guider = ori_net.pose_guider
113
+
114
+ # generator = torch.manual_seed(42)
115
+ generator = torch.Generator().manual_seed(42)
116
+ # cast unet dtype
117
+ vae = vae.to(dtype=torch.float32)
118
+ image_enc = image_enc.to(dtype=torch.float32)
119
+
120
+ # pose_detector = DWposeDetector()
121
+ # pose_detector.to(accelerator.device)
122
+
123
+ pipe = Pose2ImagePipeline(
124
+ vae=vae,
125
+ image_encoder=image_enc,
126
+ reference_unet=reference_unet,
127
+ denoising_unet=denoising_unet,
128
+ pose_guider=pose_guider,
129
+ scheduler=scheduler,
130
+ )
131
+ pipe = pipe.to(accelerator.device)
132
+ video_image_paths=["/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/configs/valid/videos/803137_in_xl.jpg"]
133
+ cloth_paths=["/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/configs/valid/cloth/803128_in_xl.jpg"]
134
+ pil_images = []
135
+ for video_image_path in video_image_paths:
136
+ clip_length=1
137
+ for cloth_image_path in cloth_paths:
138
+ agnostic_path=video_image_path.replace("videos","agnostic_images") # e.g. data/videos/upper1.mp4 -> data/agnostic_images/upper1.mp4
139
+ agn_mask_path=video_image_path.replace("videos","agnostic_mask_images")
140
+ densepose_path=video_image_path.replace("videos","densepose_images")
141
+ cloth_mask_path=cloth_image_path.replace("cloth","cloth_mask")
142
+
143
+ video_name = video_image_path.split("/")[-1].replace(".jpg", "")
144
+ cloth_name = cloth_image_path.split("/")[-1].replace(".jpg", "")
145
+
146
+ video_image_pil = Image.open(video_image_path).convert("RGB")
147
+ cloth_image_pil = Image.open(cloth_image_path).convert("RGB")
148
+ cloth_mask_pil = Image.open(cloth_mask_path).convert("RGB")
149
+ agnostic_pil = Image.open(agnostic_path).convert("RGB")
150
+ agn_mask_pil = Image.open(agn_mask_path).convert("RGB")
151
+ densepose_pil = Image.open(densepose_path).convert("RGB")
152
+
153
+ image = pipe(
154
+ agnostic_pil,
155
+ agn_mask_pil,
156
+ cloth_image_pil,
157
+ cloth_mask_pil,
158
+ densepose_pil,
159
+ width,
160
+ height,
161
+ clip_length,
162
+ 20,
163
+ 3.5,
164
+ generator=generator,
165
+ ).images
166
+ image = image[0, :, 0].permute(1, 2, 0).cpu().numpy() # (3, 512, 512)
167
+ res_image_pil = Image.fromarray((image * 255).astype(np.uint8))
168
+ # Save ref_image, src_image and the generated_image
169
+ w, h = res_image_pil.size
170
+ canvas = Image.new("RGB", (w * 4, h), "white")
171
+
172
+ cloth_image_pil = cloth_image_pil.resize((w, h))
173
+ video_image_pil = video_image_pil.resize((w, h))
174
+ agnostic_pil = agnostic_pil.resize((w, h))
175
+
176
+
177
+ canvas.paste(cloth_image_pil, (0, 0))
178
+ canvas.paste(video_image_pil, (w, 0))
179
+ canvas.paste(agnostic_pil, (w * 2, 0))
180
+ canvas.paste(res_image_pil, (w * 3, 0))
181
+
182
+ out_file = os.path.join(
183
+ save_dir, f"{global_step:06d}-{video_name}_{cloth_name}.jpg"
184
+ )
185
+ canvas.save(out_file)
186
+
187
+ vae = vae.to(dtype=torch.float32)
188
+ image_enc = image_enc.to(dtype=torch.float32)
189
+
190
+ del pipe
191
+ torch.cuda.empty_cache()
192
+
193
+ return pil_images
194
+
195
+ def compute_snr(noise_scheduler, timesteps):
196
+ """
197
+ Computes SNR as per
198
+ https://github.com/TiankaiHang/Min-SNR-Diffusion-Training/blob/521b624bd70c67cee4bdf49225915f5945a872e3/guided_diffusion/gaussian_diffusion.py#L847-L849
199
+ """
200
+ alphas_cumprod = noise_scheduler.alphas_cumprod
201
+ sqrt_alphas_cumprod = alphas_cumprod**0.5
202
+ sqrt_one_minus_alphas_cumprod = (1.0 - alphas_cumprod) ** 0.5
203
+
204
+ # Expand the tensors.
205
+ # Adapted from https://github.com/TiankaiHang/Min-SNR-Diffusion-Training/blob/521b624bd70c67cee4bdf49225915f5945a872e3/guided_diffusion/gaussian_diffusion.py#L1026
206
+ sqrt_alphas_cumprod = sqrt_alphas_cumprod.to(device=timesteps.device)[
207
+ timesteps
208
+ ].float()
209
+ while len(sqrt_alphas_cumprod.shape) < len(timesteps.shape):
210
+ sqrt_alphas_cumprod = sqrt_alphas_cumprod[..., None]
211
+ alpha = sqrt_alphas_cumprod.expand(timesteps.shape)
212
+
213
+ sqrt_one_minus_alphas_cumprod = sqrt_one_minus_alphas_cumprod.to(
214
+ device=timesteps.device
215
+ )[timesteps].float()
216
+ while len(sqrt_one_minus_alphas_cumprod.shape) < len(timesteps.shape):
217
+ sqrt_one_minus_alphas_cumprod = sqrt_one_minus_alphas_cumprod[..., None]
218
+ sigma = sqrt_one_minus_alphas_cumprod.expand(timesteps.shape)
219
+
220
+ # Compute SNR.
221
+ snr = (alpha / sigma) ** 2
222
+ return snr
223
+
224
+
225
+ def main(cfg):
226
+ kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
227
+ accelerator = Accelerator(
228
+ gradient_accumulation_steps=cfg.solver.gradient_accumulation_steps,
229
+ mixed_precision=cfg.solver.mixed_precision,
230
+ log_with="mlflow",
231
+ project_dir="./mlruns",
232
+ kwargs_handlers=[kwargs],
233
+ )
234
+
235
+ # Make one log on every process with the configuration for debugging.
236
+ logging.basicConfig(
237
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
238
+ datefmt="%m/%d/%Y %H:%M:%S",
239
+ level=logging.INFO,
240
+ )
241
+ logger.info(accelerator.state, main_process_only=False)
242
+ if accelerator.is_local_main_process:
243
+ transformers.utils.logging.set_verbosity_warning()
244
+ diffusers.utils.logging.set_verbosity_info()
245
+ else:
246
+ transformers.utils.logging.set_verbosity_error()
247
+ diffusers.utils.logging.set_verbosity_error()
248
+
249
+ # If passed along, set the training seed now.
250
+ if cfg.seed is not None:
251
+ seed_everything(cfg.seed)
252
+
253
+ exp_name = cfg.exp_name
254
+ save_dir = f"{cfg.output_dir}/{exp_name}"
255
+ if accelerator.is_main_process and not os.path.exists(save_dir):
256
+ os.makedirs(save_dir)
257
+ save_valid_dir = f"{cfg.valid_dir}/{exp_name}"
258
+ if accelerator.is_main_process and not os.path.exists(save_valid_dir):
259
+ os.makedirs(save_valid_dir)
260
+ validation_dir = save_valid_dir
261
+ if cfg.weight_dtype == "fp16":
262
+ weight_dtype = torch.float16
263
+ elif cfg.weight_dtype == "bf16":
264
+ weight_dtype = torch.bfloat16
265
+ elif cfg.weight_dtype == "fp32":
266
+ weight_dtype = torch.float32
267
+ else:
268
+ raise ValueError(
269
+ f"Unsupported weight dtype during training: {cfg.weight_dtype}"
270
+ )
271
+
272
+ sched_kwargs = OmegaConf.to_container(cfg.noise_scheduler_kwargs)
273
+ if cfg.enable_zero_snr:
274
+ sched_kwargs.update(
275
+ rescale_betas_zero_snr=True,
276
+ timestep_spacing="trailing",
277
+ prediction_type="v_prediction",
278
+ )
279
+ val_noise_scheduler = DDIMScheduler(**sched_kwargs)
280
+ sched_kwargs.update({"beta_schedule": "scaled_linear"})
281
+ train_noise_scheduler = DDIMScheduler(**sched_kwargs)
282
+ vae = AutoencoderKL.from_pretrained(cfg.vae_model_path).to(
283
+ "cuda", dtype=weight_dtype
284
+ )
285
+
286
+ reference_unet = UNet2DConditionModel.from_pretrained_2d(
287
+ cfg.base_model_path,
288
+ subfolder="unet",
289
+ unet_additional_kwargs={
290
+ "in_channels": 5,
291
+ }
292
+ ).to(dtype=weight_dtype, device="cuda")
293
+
294
+ denoising_unet = UNet3DConditionModel.from_pretrained_2d(
295
+ cfg.base_model_path,
296
+ "",
297
+ subfolder="unet",
298
+ unet_additional_kwargs={
299
+ "in_channels": 9,
300
+ "use_motion_module": False,
301
+ "unet_use_temporal_attention": False,
302
+ },
303
+ ).to(device="cuda")
304
+
305
+ image_enc = CLIPVisionModelWithProjection.from_pretrained(
306
+ cfg.image_encoder_path,
307
+ ).to(dtype=weight_dtype, device="cuda")
308
+
309
+ if cfg.pose_guider_path:
310
+ pose_guider = PoseGuider(
311
+ conditioning_embedding_channels=320, block_out_channels=(16, 32, 96, 256)
312
+ ).to(device="cuda")
313
+ # load pretrained controlnet-openpose params for pose_guider
314
+ controlnet_openpose_state_dict = torch.load(cfg.controlnet_openpose_path)
315
+ state_dict_to_load = {}
316
+ for k in controlnet_openpose_state_dict.keys():
317
+ if k.startswith("controlnet_cond_embedding.") and k.find("conv_out") < 0:
318
+ new_k = k.replace("controlnet_cond_embedding.", "")
319
+ state_dict_to_load[new_k] = controlnet_openpose_state_dict[k]
320
+ miss, _ = pose_guider.load_state_dict(state_dict_to_load, strict=False)
321
+ logger.info(f"Missing key for pose guider: {len(miss)}")
322
+ else:
323
+ pose_guider = PoseGuider(
324
+ conditioning_embedding_channels=320,
325
+ ).to(device="cuda")
326
+
327
+ # load pretrained weights
328
+ denoising_unet.load_state_dict(
329
+ torch.load(cfg.denoising_unet_path, map_location="cpu"),
330
+ strict=True,
331
+ )
332
+ reference_unet.load_state_dict(
333
+ torch.load(cfg.reference_unet_path, map_location="cpu"),
334
+ strict=True,
335
+ )
336
+
337
+ pose_guider.load_state_dict(
338
+ torch.load(cfg.pose_guider_path, map_location="cpu"),
339
+ strict=True,
340
+ )
341
+
342
+
343
+ # Freeze
344
+ vae.requires_grad_(False)
345
+ image_enc.requires_grad_(False)
346
+
347
+ # Explicitly declare the models being trained
348
+ denoising_unet.requires_grad_(True)
349
+ # Some top-layer params of reference_unet don't need grad
350
+ for name, param in reference_unet.named_parameters():
351
+ if "up_blocks.3" in name:
352
+ param.requires_grad_(False)
353
+ else:
354
+ param.requires_grad_(True)
355
+
356
+ pose_guider.requires_grad_(True)
357
+
358
+ reference_control_writer = ReferenceAttentionControl(
359
+ reference_unet,
360
+ do_classifier_free_guidance=False,
361
+ mode="write",
362
+ fusion_blocks="full",
363
+ )
364
+ reference_control_reader = ReferenceAttentionControl(
365
+ denoising_unet,
366
+ do_classifier_free_guidance=False,
367
+ mode="read",
368
+ fusion_blocks="full",
369
+ )
370
+
371
+ net = Net(
372
+ reference_unet,
373
+ denoising_unet,
374
+ pose_guider,
375
+ reference_control_writer,
376
+ reference_control_reader,
377
+ )
378
+
379
+ if cfg.solver.enable_xformers_memory_efficient_attention:
380
+ if is_xformers_available():
381
+ reference_unet.enable_xformers_memory_efficient_attention()
382
+ denoising_unet.enable_xformers_memory_efficient_attention()
383
+ else:
384
+ raise ValueError(
385
+ "xformers is not available. Make sure it is installed correctly"
386
+ )
387
+
388
+ if cfg.solver.gradient_checkpointing:
389
+ reference_unet.enable_gradient_checkpointing()
390
+ denoising_unet.enable_gradient_checkpointing()
391
+
392
+ if cfg.solver.scale_lr:
393
+ learning_rate = (
394
+ cfg.solver.learning_rate
395
+ * cfg.solver.gradient_accumulation_steps
396
+ * cfg.data.train_bs
397
+ * accelerator.num_processes
398
+ )
399
+ else:
400
+ learning_rate = cfg.solver.learning_rate
401
+
402
+
403
+ optimizer_cls = torch.optim.AdamW
404
+
405
+ trainable_params = list(filter(lambda p: p.requires_grad, net.parameters()))
406
+ optimizer = optimizer_cls(
407
+ trainable_params,
408
+ lr=learning_rate,
409
+ betas=(cfg.solver.adam_beta1, cfg.solver.adam_beta2),
410
+ weight_decay=cfg.solver.adam_weight_decay,
411
+ eps=cfg.solver.adam_epsilon,
412
+ )
413
+
414
+ # Scheduler
415
+ lr_scheduler = get_scheduler(
416
+ cfg.solver.lr_scheduler,
417
+ optimizer=optimizer,
418
+ num_warmup_steps=cfg.solver.lr_warmup_steps
419
+ * cfg.solver.gradient_accumulation_steps,
420
+ num_training_steps=cfg.solver.max_train_steps
421
+ * cfg.solver.gradient_accumulation_steps,
422
+ )
423
+
424
+ train_dataset = HumanDanceDataset(
425
+ img_size=(cfg.data.train_width, cfg.data.train_height),
426
+ img_scale=(0.9, 1.0),
427
+ data_meta_paths=cfg.data.meta_paths,
428
+ sample_margin=cfg.data.sample_margin,
429
+ )
430
+ train_dataloader = torch.utils.data.DataLoader(
431
+ train_dataset, batch_size=cfg.data.train_bs, shuffle=True, num_workers=4
432
+ )
433
+
434
+ # Prepare everything with our `accelerator`.
435
+ (
436
+ net,
437
+ optimizer,
438
+ train_dataloader,
439
+ lr_scheduler,
440
+ ) = accelerator.prepare(
441
+ net,
442
+ optimizer,
443
+ train_dataloader,
444
+ lr_scheduler,
445
+ )
446
+
447
+ # We need to recalculate our total training steps as the size of the training dataloader may have changed.
448
+ num_update_steps_per_epoch = math.ceil(
449
+ len(train_dataloader) / cfg.solver.gradient_accumulation_steps
450
+ )
451
+ # Afterwards we recalculate our number of training epochs
452
+ num_train_epochs = math.ceil(
453
+ cfg.solver.max_train_steps / num_update_steps_per_epoch
454
+ )
455
+
456
+ # We need to initialize the trackers we use, and also store our configuration.
457
+ # The trackers initialize automatically on the main process.
458
+ if accelerator.is_main_process:
459
+ run_time = datetime.now().strftime("%Y%m%d-%H%M")
460
+ accelerator.init_trackers(
461
+ cfg.exp_name,
462
+ init_kwargs={"mlflow": {"run_name": run_time}},
463
+ )
464
+ # dump config file
465
+ mlflow.log_dict(OmegaConf.to_container(cfg), "config.yaml")
466
+
467
+ # Train!
468
+ total_batch_size = (
469
+ cfg.data.train_bs
470
+ * accelerator.num_processes
471
+ * cfg.solver.gradient_accumulation_steps
472
+ )
473
+
474
+ logger.info("***** Running training *****")
475
+ logger.info(f" Num examples = {len(train_dataset)}")
476
+ logger.info(f" Num Epochs = {num_train_epochs}")
477
+ logger.info(f" Instantaneous batch size per device = {cfg.data.train_bs}")
478
+ logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
479
+ logger.info(f" Gradient Accumulation steps = {cfg.solver.gradient_accumulation_steps}")
480
+ logger.info(f" Total optimization steps = {cfg.solver.max_train_steps}")
481
+ global_step = 0
482
+ first_epoch = 0
483
+
484
+ # Potentially load in the weights and states from a previous save
485
+ if cfg.resume_from_checkpoint:
486
+ if cfg.resume_from_checkpoint != "latest":
487
+ resume_dir = cfg.resume_from_checkpoint
488
+ else:
489
+ resume_dir = save_dir
490
+ # Get the most recent checkpoint
491
+ dirs = os.listdir(resume_dir)
492
+ print(dirs)
493
+ dirs = [d for d in dirs if d.startswith("checkpoint")]
494
+ dirs = sorted(dirs, key=lambda x: int(x.split("-")[1]))
495
+ path = dirs[-1]
496
+ accelerator.load_state(os.path.join(resume_dir, path))
497
+ accelerator.print(f"Resuming from checkpoint {path}")
498
+ global_step = int(path.split("-")[1])
499
+
500
+ first_epoch = global_step // num_update_steps_per_epoch
501
+ resume_step = global_step % num_update_steps_per_epoch
502
+
503
+ # Only show the progress bar once on each machine.
504
+ progress_bar = tqdm(
505
+ range(global_step, cfg.solver.max_train_steps),
506
+ disable=not accelerator.is_local_main_process,
507
+ )
508
+ progress_bar.set_description("Steps")
509
+
510
+ for epoch in range(first_epoch, num_train_epochs):
511
+ train_loss = 0.0
512
+ for step, batch in enumerate(train_dataloader):
513
+ # print(batch.keys())
514
+ with accelerator.accumulate(net):
515
+ # Convert videos to latent space
516
+ pixel_values = batch["tgt_img"].to(weight_dtype)
517
+ masked_pixel_values = batch["agnostic_img"].to(weight_dtype)
518
+ mask_of_pixel_values = batch["agnostic_mask_img"].to(weight_dtype)[:,0:1,:,:]
519
+ with torch.no_grad():
520
+ # print(pixel_values.dtype)
521
+ latents = vae.encode(pixel_values).latent_dist.sample()
522
+ latents = latents.unsqueeze(2) # (b, c, 1, h, w)
523
+ latents = latents * 0.18215
524
+
525
+ masked_latents = vae.encode(masked_pixel_values).latent_dist.sample().unsqueeze(2) * 0.18215
526
+ mask_of_latents = torch.nn.functional.interpolate(mask_of_pixel_values.unsqueeze(2), size=(1,mask_of_pixel_values.shape[-2] // 8, mask_of_pixel_values.shape[-1] // 8))
527
+
528
+
529
+ noise = torch.randn_like(latents)
530
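+ # Noise offset (if enabled): add a small per-sample, per-channel constant to the sampled noise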
+ if cfg.noise_offset > 0.0:
531
+ noise += cfg.noise_offset * torch.randn(
532
+ (noise.shape[0], noise.shape[1], 1, 1, 1),
533
+ device=noise.device,
534
+ )
535
+
536
+ bsz = latents.shape[0]
537
+ # Sample a random timestep for each video
538
+ timesteps = torch.randint(
539
+ 0,
540
+ train_noise_scheduler.num_train_timesteps,
541
+ (bsz,),
542
+ device=latents.device,
543
+ )
544
+ timesteps = timesteps.long()
545
+
546
+ tgt_pose_img = batch["tgt_pose"]
547
+ tgt_pose_img = tgt_pose_img.unsqueeze(2) # (bs, 3, 1, 512, 512)
548
+
549
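+ # Randomly drop the image condition with probability cfg.uncond_ratio (classifier-free guidance training)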
+ uncond_fwd = random.random() < cfg.uncond_ratio
550
+ clip_image_list = []
551
+ ref_image_list = []
552
+ cloth_mask_list = []
553
+ for batch_idx, (ref_img, cloth_mask, clip_img) in enumerate(
554
+ zip(
555
+ batch["cloth_img"],
556
+ batch["cloth_mask"],
557
+ batch["clip_images"],
558
+ )
559
+ ):
560
+ if uncond_fwd:
561
+ clip_image_list.append(torch.zeros_like(clip_img))
562
+ else:
563
+ clip_image_list.append(clip_img)
564
+ ref_image_list.append(ref_img)
565
+ cloth_mask_list.append(cloth_mask)
566
+
567
+ with torch.no_grad():
568
+ ref_img = torch.stack(ref_image_list, dim=0).to(
569
+ dtype=vae.dtype, device=vae.device
570
+ )
571
+ ref_image_latents = vae.encode(
572
+ ref_img
573
+ ).latent_dist.sample() # (bs, d, 64, 64)
574
+ ref_image_latents = ref_image_latents * 0.18215
575
+
576
+ cloth_mask = torch.stack(cloth_mask_list, dim=0).to(
577
+ dtype=vae.dtype, device=vae.device
578
+ )
579
+ cloth_mask = cloth_mask[:,0:1,:,:]
580
+ cloth_mask = torch.nn.functional.interpolate(cloth_mask, size=(cloth_mask.shape[-2] // 8, cloth_mask.shape[-1] // 8))
581
+
582
+
583
+ clip_img = torch.stack(clip_image_list, dim=0).to(
584
+ dtype=image_enc.dtype, device=image_enc.device
585
+ )
586
+ clip_image_embeds = image_enc(
587
+ clip_img.to("cuda", dtype=weight_dtype)
588
+ ).image_embeds
589
+ image_prompt_embeds = clip_image_embeds.unsqueeze(1) # (bs, 1, d)
590
+
591
+ # add noise
592
+ noisy_latents = train_noise_scheduler.add_noise(
593
+ latents, noise, timesteps
594
+ )
595
+
596
+ # Get the target for loss depending on the prediction type
597
+ if train_noise_scheduler.prediction_type == "epsilon":
598
+ target = noise
599
+ elif train_noise_scheduler.prediction_type == "v_prediction":
600
+ target = train_noise_scheduler.get_velocity(
601
+ latents, noise, timesteps
602
+ )
603
+ else:
604
+ raise ValueError(
605
+ f"Unknown prediction type {train_noise_scheduler.prediction_type}"
606
+ )
607
+
608
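+ # Denoising input = noisy latents + agnostic latents + agnostic mask (channel concat); reference input = cloth latents + cloth mask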
+ model_pred = net(
609
+ # noisy_latents,
610
+ torch.cat([noisy_latents,masked_latents,mask_of_latents],dim=1),
611
+ timesteps,
612
+ torch.cat([ref_image_latents, cloth_mask],dim=1),
613
+ image_prompt_embeds,
614
+ tgt_pose_img,
615
+ uncond_fwd,
616
+ )
617
+
618
+ if cfg.snr_gamma == 0:
619
+ loss = F.mse_loss(
620
+ model_pred.float(), target.float(), reduction="mean"
621
+ )
622
+ else:
623
+ snr = compute_snr(train_noise_scheduler, timesteps)
624
+ if train_noise_scheduler.config.prediction_type == "v_prediction":
625
+ # Velocity objective requires that we add one to SNR values before we divide by them.
626
+ snr = snr + 1
627
+ mse_loss_weights = (
628
+ torch.stack(
629
+ [snr, cfg.snr_gamma * torch.ones_like(timesteps)], dim=1
630
+ ).min(dim=1)[0]
631
+ / snr
632
+ )
633
+ loss = F.mse_loss(
634
+ model_pred.float(), target.float(), reduction="none"
635
+ )
636
+ loss = (
637
+ loss.mean(dim=list(range(1, len(loss.shape))))
638
+ * mse_loss_weights
639
+ )
640
+ loss = loss.mean()
641
+
642
+ # Gather the losses across all processes for logging (if we use distributed training).
643
+ avg_loss = accelerator.gather(loss.repeat(cfg.data.train_bs)).mean()
644
+ train_loss += avg_loss.item() / cfg.solver.gradient_accumulation_steps
645
+
646
+ # Backpropagate
647
+ accelerator.backward(loss)
648
+ if accelerator.sync_gradients:
649
+ accelerator.clip_grad_norm_(
650
+ trainable_params,
651
+ cfg.solver.max_grad_norm,
652
+ )
653
+ optimizer.step()
654
+ lr_scheduler.step()
655
+ optimizer.zero_grad()
656
+
657
+ if accelerator.sync_gradients:
658
+ reference_control_reader.clear()
659
+ reference_control_writer.clear()
660
+ progress_bar.update(1)
661
+ global_step += 1
662
+ accelerator.log({"train_loss": train_loss}, step=global_step)
663
+ train_loss = 0.0
664
+
665
+ if global_step % cfg.checkpointing_steps == 0:
666
+ if accelerator.is_main_process:
667
+ save_path = os.path.join(save_dir, f"checkpoint-{global_step}")
668
+ delete_additional_ckpt(save_dir, 1)
669
+ accelerator.save_state(save_path)
670
+
671
+ if global_step % cfg.val.validation_steps == 0:
672
+ if accelerator.is_main_process:
673
+ generator = torch.Generator(device=accelerator.device)
674
+ generator.manual_seed(cfg.seed)
675
+
676
+ log_validation(
677
+ vae=vae,
678
+ image_enc=image_enc,
679
+ net=net,
680
+ scheduler=val_noise_scheduler,
681
+ accelerator=accelerator,
682
+ width=cfg.data.train_width,
683
+ height=cfg.data.train_height,
684
+ save_dir=validation_dir,
685
+ global_step=global_step,
686
+ )
687
+
688
+ # for sample_id, sample_dict in enumerate(sample_dicts):
689
+ # sample_name = sample_dict["name"]
690
+ # img = sample_dict["img"]
691
+ # with TemporaryDirectory() as temp_dir:
692
+ # out_file = Path(
693
+ # f"{temp_dir}/{global_step:06d}-{sample_name}.gif"
694
+ # )
695
+ # img.save(out_file)
696
+ # mlflow.log_artifact(out_file)
697
+
698
+
699
+ logs = {
700
+ "step_loss": loss.detach().item(),
701
+ "lr": lr_scheduler.get_last_lr()[0],
702
+ }
703
+ progress_bar.set_postfix(**logs)
704
+
705
+ if global_step >= cfg.solver.max_train_steps:
706
+ break
707
+
708
+ # save model after each epoch
709
+ if (
710
+ epoch + 1
711
+ ) % cfg.save_model_epoch_interval == 0 and accelerator.is_main_process:
712
+ unwrap_net = accelerator.unwrap_model(net)
713
+ save_checkpoint(
714
+ unwrap_net.reference_unet,
715
+ save_dir,
716
+ "reference_unet",
717
+ global_step,
718
+ total_limit=3,
719
+ )
720
+ save_checkpoint(
721
+ unwrap_net.denoising_unet,
722
+ save_dir,
723
+ "denoising_unet",
724
+ global_step,
725
+ total_limit=3,
726
+ )
727
+ save_checkpoint(
728
+ unwrap_net.pose_guider,
729
+ save_dir,
730
+ "pose_guider",
731
+ global_step,
732
+ total_limit=3,
733
+ )
734
+
735
+ # Create the pipeline using the trained modules and save it.
736
+ accelerator.wait_for_everyone()
737
+ accelerator.end_training()
738
+
739
+
740
+ def save_checkpoint(model, save_dir, prefix, ckpt_num, total_limit=None):
741
+ save_path = osp.join(save_dir, f"{prefix}-{ckpt_num}.pth")
742
+
743
+ if total_limit is not None:
744
+ checkpoints = os.listdir(save_dir)
745
+ checkpoints = [d for d in checkpoints if d.startswith(prefix)]
746
+ checkpoints = sorted(
747
+ checkpoints, key=lambda x: int(x.split("-")[1].split(".")[0])
748
+ )
749
+
750
+ if len(checkpoints) >= total_limit:
751
+ num_to_remove = len(checkpoints) - total_limit + 1
752
+ removing_checkpoints = checkpoints[0:num_to_remove]
753
+ logger.info(
754
+ f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
755
+ )
756
+ logger.info(f"removing checkpoints: {', '.join(removing_checkpoints)}")
757
+
758
+ for removing_checkpoint in removing_checkpoints:
759
+ removing_checkpoint = os.path.join(save_dir, removing_checkpoint)
760
+ os.remove(removing_checkpoint)
761
+
762
+ state_dict = model.state_dict()
763
+ torch.save(state_dict, save_path)
764
+
765
+
766
+ if __name__ == "__main__":
767
+ parser = argparse.ArgumentParser()
768
+ parser.add_argument("--config", type=str, default="./configs/training/stage1.yaml")
769
+ args = parser.parse_args()
770
+
771
+ if args.config[-5:] == ".yaml":
772
+ config = OmegaConf.load(args.config)
773
+ elif args.config[-3:] == ".py":
774
+ config = import_filename(args.config).cfg
775
+ else:
776
+ raise ValueError("Unsupported config file format")
777
+ main(config)
778
+
779
+
780
+ # accelerate launch train_stage_1.py --config configs/train/stage1.yaml
781
+ # accelerate launch train_stage_2.py --config configs/train/stage2.yaml
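
For reference, a minimal self-contained sketch of the Min-SNR loss weighting used in the snr_gamma branch above; the SNR values, gamma, and per-sample losses below are toy placeholders, not values from this repo:

import torch

snr_gamma = 5.0                                  # stands in for cfg.snr_gamma
snr = torch.tensor([0.2, 1.0, 12.0])             # toy per-sample SNR, as compute_snr() would return
# (for v_prediction the training code first adds 1 to the SNR)
mse_loss_weights = (
    torch.stack([snr, snr_gamma * torch.ones_like(snr)], dim=1).min(dim=1)[0] / snr
)
per_sample_mse = torch.tensor([0.9, 0.5, 0.1])   # stands in for the per-sample MSE of the model prediction
loss = (per_sample_mse * mse_loss_weights).mean()
print(mse_loss_weights)                          # tensor([1.0000, 1.0000, 0.4167])

Each sample's weight is min(SNR, gamma) / SNR, so low-SNR (hard) timesteps keep full weight while high-SNR timesteps are capped, which is what the branch above computes batch-wise.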
train_stage_2.py ADDED
@@ -0,0 +1,842 @@
1
+ import argparse
2
+ import copy
3
+ import logging
4
+ import math
5
+ import os
6
+ import os.path as osp
7
+ import random
8
+ import time
9
+ import warnings
10
+ from collections import OrderedDict
11
+ from datetime import datetime
12
+ from pathlib import Path
13
+ from tempfile import TemporaryDirectory
14
+ from src.utils.util import get_fps, read_frames, save_videos_grid
15
+
16
+ import diffusers
17
+ import mlflow
18
+ import torch
19
+ import torch.nn as nn
20
+ import torch.nn.functional as F
21
+ import torch.utils.checkpoint
22
+ import transformers
23
+ from accelerate import Accelerator
24
+ from accelerate.logging import get_logger
25
+ from accelerate.utils import DistributedDataParallelKwargs
26
+ from diffusers import AutoencoderKL, DDIMScheduler
27
+ from diffusers.optimization import get_scheduler
28
+ from diffusers.utils import check_min_version
29
+ from diffusers.utils.import_utils import is_xformers_available
30
+ from einops import rearrange
31
+ from omegaconf import OmegaConf
32
+ from PIL import Image
33
+ from torchvision import transforms
34
+ from tqdm.auto import tqdm
35
+ from transformers import CLIPVisionModelWithProjection
36
+
37
+ from src.dataset.dance_video import HumanDanceVideoDataset
38
+ from src.models.mutual_self_attention import ReferenceAttentionControl
39
+ from src.models.pose_guider import PoseGuider
40
+ from src.models.unet_2d_condition import UNet2DConditionModel
41
+ from src.models.unet_3d import UNet3DConditionModel
42
+ from src.pipelines.pipeline_pose2vid_long import Pose2VideoPipeline
43
+ from src.utils.util import (
44
+ delete_additional_ckpt,
45
+ import_filename,
46
+ read_frames,
47
+ save_videos_grid,
48
+ seed_everything,
49
+ )
50
+
51
+ warnings.filterwarnings("ignore")
52
+
53
+ # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
54
+ check_min_version("0.10.0.dev0")
55
+
56
+ logger = get_logger(__name__, log_level="INFO")
57
+
58
+
59
+ class Net(nn.Module):
60
+ def __init__(
61
+ self,
62
+ reference_unet: UNet2DConditionModel,
63
+ denoising_unet: UNet3DConditionModel,
64
+ pose_guider: PoseGuider,
65
+ reference_control_writer,
66
+ reference_control_reader,
67
+ ):
68
+ super().__init__()
69
+ self.reference_unet = reference_unet
70
+ self.denoising_unet = denoising_unet
71
+ self.pose_guider = pose_guider
72
+ self.reference_control_writer = reference_control_writer
73
+ self.reference_control_reader = reference_control_reader
74
+
75
+ def forward(
76
+ self,
77
+ noisy_latents,
78
+ timesteps,
79
+ ref_image_latents,
80
+ clip_image_embeds,
81
+ pose_img,
82
+ uncond_fwd: bool = False,
83
+ ):
84
+ pose_cond_tensor = pose_img.to(device="cuda")
85
+ pose_fea = self.pose_guider(pose_cond_tensor)
86
+
87
+ if not uncond_fwd:
88
+ ref_timesteps = torch.zeros_like(timesteps)
89
+ self.reference_unet(
90
+ ref_image_latents,
91
+ ref_timesteps,
92
+ encoder_hidden_states=clip_image_embeds,
93
+ return_dict=False,
94
+ )
95
+ self.reference_control_reader.update(self.reference_control_writer)
96
+
97
+ model_pred = self.denoising_unet(
98
+ noisy_latents,
99
+ timesteps,
100
+ pose_cond_fea=pose_fea,
101
+ encoder_hidden_states=clip_image_embeds,
102
+ ).sample
103
+
104
+ return model_pred
105
+
106
+
107
+ def compute_snr(noise_scheduler, timesteps):
108
+ """
109
+ Computes SNR as per
110
+ https://github.com/TiankaiHang/Min-SNR-Diffusion-Training/blob/521b624bd70c67cee4bdf49225915f5945a872e3/guided_diffusion/gaussian_diffusion.py#L847-L849
111
+ """
112
+ alphas_cumprod = noise_scheduler.alphas_cumprod
113
+ sqrt_alphas_cumprod = alphas_cumprod**0.5
114
+ sqrt_one_minus_alphas_cumprod = (1.0 - alphas_cumprod) ** 0.5
115
+
116
+ # Expand the tensors.
117
+ # Adapted from https://github.com/TiankaiHang/Min-SNR-Diffusion-Training/blob/521b624bd70c67cee4bdf49225915f5945a872e3/guided_diffusion/gaussian_diffusion.py#L1026
118
+ sqrt_alphas_cumprod = sqrt_alphas_cumprod.to(device=timesteps.device)[
119
+ timesteps
120
+ ].float()
121
+ while len(sqrt_alphas_cumprod.shape) < len(timesteps.shape):
122
+ sqrt_alphas_cumprod = sqrt_alphas_cumprod[..., None]
123
+ alpha = sqrt_alphas_cumprod.expand(timesteps.shape)
124
+
125
+ sqrt_one_minus_alphas_cumprod = sqrt_one_minus_alphas_cumprod.to(
126
+ device=timesteps.device
127
+ )[timesteps].float()
128
+ while len(sqrt_one_minus_alphas_cumprod.shape) < len(timesteps.shape):
129
+ sqrt_one_minus_alphas_cumprod = sqrt_one_minus_alphas_cumprod[..., None]
130
+ sigma = sqrt_one_minus_alphas_cumprod.expand(timesteps.shape)
131
+
132
+ # Compute SNR.
133
+ snr = (alpha / sigma) ** 2
134
+ return snr
135
+
136
+
137
+ def log_validation(
138
+ vae,
139
+ image_enc,
140
+ net,
141
+ scheduler,
142
+ accelerator,
143
+ width,
144
+ height,
145
+ global_step,
146
+ clip_length=24,
147
+ generator=None,
148
+
149
+ ):
150
+ logger.info("Running validation... ")
151
+
152
+ ori_net = accelerator.unwrap_model(net)
153
+ reference_unet = ori_net.reference_unet
154
+ denoising_unet = ori_net.denoising_unet
155
+ pose_guider = ori_net.pose_guider
156
+
157
+ if generator is None:
158
+ generator = torch.manual_seed(42)
159
+ tmp_denoising_unet = copy.deepcopy(denoising_unet)
160
+ tmp_denoising_unet = tmp_denoising_unet.to(dtype=torch.float16)
161
+
162
+ pipe = Pose2VideoPipeline(
163
+ vae=vae,
164
+ image_encoder=image_enc,
165
+ reference_unet=reference_unet,
166
+ denoising_unet=tmp_denoising_unet,
167
+ pose_guider=pose_guider,
168
+ scheduler=scheduler,
169
+ )
170
+ pipe = pipe.to(accelerator.device)
171
+ date_str = datetime.now().strftime("%Y%m%d")
172
+ time_str = datetime.now().strftime("%H%M")
173
+ save_dir_name = f"{time_str}"
174
+ save_dir = Path(f"vividfuxian_motion/{date_str}/{save_dir_name}")
175
+ save_dir.mkdir(exist_ok=True, parents=True)
176
+
177
+ model_video_paths = ["/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/dataset/ViViD/dresses/videos/803128_detail.mp4"]
178
+ cloth_image_paths=["/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/dataset/ViViD/dresses/images/1060638_in_xl.jpg"]
179
+ transform = transforms.Compose(
180
+ [transforms.Resize((height, width)), transforms.ToTensor()]
181
+ )
182
+ for model_image_path in model_video_paths:
183
+ src_fps = get_fps(model_image_path)
184
+
185
+ model_name = Path(model_image_path).stem
186
+ agnostic_path=model_image_path.replace("videos","agnostic")
187
+ agn_mask_path=model_image_path.replace("videos","agnostic_mask")
188
+ densepose_path=model_image_path.replace("videos","densepose")
189
+
190
+ video_tensor_list=[]
191
+ video_images=read_frames(model_image_path)
192
+
193
+ for vid_image_pil in video_images[:clip_length]:
194
+ video_tensor_list.append(transform(vid_image_pil))
195
+
196
+ video_tensor = torch.stack(video_tensor_list, dim=0) # (f, c, h, w)
197
+ video_tensor = video_tensor.transpose(0, 1)
198
+
199
+ agnostic_list=[]
200
+ agnostic_images=read_frames(agnostic_path)
201
+ for agnostic_image_pil in agnostic_images[:clip_length]:
202
+ agnostic_list.append(agnostic_image_pil)
203
+
204
+ agn_mask_list=[]
205
+ agn_mask_images=read_frames(agn_mask_path)
206
+ for agn_mask_image_pil in agn_mask_images[:clip_length]:
207
+ agn_mask_list.append(agn_mask_image_pil)
208
+
209
+ pose_list=[]
210
+ pose_images=read_frames(densepose_path)
211
+ for pose_image_pil in pose_images[:clip_length]:
212
+ pose_list.append(pose_image_pil)
213
+
214
+ video_tensor = video_tensor.unsqueeze(0)
215
+
216
+
217
+ for cloth_image_path in cloth_image_paths:
218
+ cloth_name = Path(cloth_image_path).stem
219
+ cloth_image_pil = Image.open(cloth_image_path).convert("RGB")
220
+
221
+ cloth_mask_path=cloth_image_path.replace("cloth","cloth_mask")
222
+ cloth_mask_pil = Image.open(cloth_mask_path).convert("RGB")
223
+
224
+ pipeline_output = pipe(
225
+ agnostic_list,
226
+ agn_mask_list,
227
+ cloth_image_pil,
228
+ cloth_mask_pil,
229
+ pose_list,
230
+ width,
231
+ height,
232
+ clip_length,
233
+ 20,
234
+ 3.5,
235
+ generator=generator,
236
+ )
237
+ video = pipeline_output.videos
238
+
239
+ video = torch.cat([video_tensor,video], dim=0)
240
+ save_videos_grid(
241
+ video,
242
+ f"{save_dir}/{global_step:06d}-{model_name}_{cloth_name}.mp4",
243
+ n_rows=2,
244
+ fps=src_fps,
245
+ )
246
+
247
+ del tmp_denoising_unet
248
+ del pipe
249
+ torch.cuda.empty_cache()
250
+
251
+ return video
252
+
253
+
254
+ def main(cfg):
255
+ kwargs = DistributedDataParallelKwargs(find_unused_parameters=False)
256
+ accelerator = Accelerator(
257
+ gradient_accumulation_steps=cfg.solver.gradient_accumulation_steps,
258
+ mixed_precision=cfg.solver.mixed_precision,
259
+ log_with="mlflow",
260
+ project_dir="./mlruns",
261
+ kwargs_handlers=[kwargs],
262
+ )
263
+
264
+ # Make one log on every process with the configuration for debugging.
265
+ logging.basicConfig(
266
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
267
+ datefmt="%m/%d/%Y %H:%M:%S",
268
+ level=logging.INFO,
269
+ )
270
+ logger.info(accelerator.state, main_process_only=False)
271
+ if accelerator.is_local_main_process:
272
+ transformers.utils.logging.set_verbosity_warning()
273
+ diffusers.utils.logging.set_verbosity_info()
274
+ else:
275
+ transformers.utils.logging.set_verbosity_error()
276
+ diffusers.utils.logging.set_verbosity_error()
277
+
278
+ # If passed along, set the training seed now.
279
+ if cfg.seed is not None:
280
+ seed_everything(cfg.seed)
281
+
282
+ exp_name = cfg.exp_name
283
+ save_dir = f"{cfg.output_dir}/{exp_name}"
284
+ if accelerator.is_main_process:
285
+ if not os.path.exists(save_dir):
286
+ os.makedirs(save_dir)
287
+
288
+ # inference_config_path = "./configs/inference/inference_v2.yaml"
289
+ inference_config_path = "./configs/inference/inference.yaml"
290
+ infer_config = OmegaConf.load(inference_config_path)
291
+
292
+ if cfg.weight_dtype == "fp16":
293
+ weight_dtype = torch.float16
294
+ elif cfg.weight_dtype == "bf16":
295
+ weight_dtype = torch.bfloat16
296
+ elif cfg.weight_dtype == "fp32":
297
+ weight_dtype = torch.float32
298
+ else:
299
+ raise ValueError(
300
+ f"Do not support weight dtype: {cfg.weight_dtype} during training"
301
+ )
302
+
303
+ sched_kwargs = OmegaConf.to_container(cfg.noise_scheduler_kwargs)
304
+ if cfg.enable_zero_snr:
305
+ sched_kwargs.update(
306
+ rescale_betas_zero_snr=True,
307
+ timestep_spacing="trailing",
308
+ prediction_type="v_prediction",
309
+ )
310
+ val_noise_scheduler = DDIMScheduler(**sched_kwargs)
311
+ sched_kwargs.update({"beta_schedule": "scaled_linear"})
312
+ train_noise_scheduler = DDIMScheduler(**sched_kwargs)
313
+
314
+ image_enc = CLIPVisionModelWithProjection.from_pretrained(
315
+ cfg.image_encoder_path,
316
+ ).to(dtype=weight_dtype, device="cuda")
317
+ vae = AutoencoderKL.from_pretrained(cfg.vae_model_path).to(
318
+ "cuda", dtype=weight_dtype
319
+ )
320
+ reference_unet = UNet2DConditionModel.from_pretrained_2d(
321
+ cfg.base_model_path,
322
+ subfolder="unet",
323
+ unet_additional_kwargs={
324
+ "in_channels": 5,
325
+ }
326
+ ).to(device="cuda", dtype=weight_dtype)
327
+
328
+ denoising_unet = UNet3DConditionModel.from_pretrained_2d(
329
+ cfg.base_model_path,
330
+ cfg.mm_path,
331
+ subfolder="unet",
332
+ unet_additional_kwargs=OmegaConf.to_container(
333
+ infer_config.unet_additional_kwargs
334
+ ),
335
+ ).to(device="cuda")
336
+
337
+ pose_guider = PoseGuider(
338
+ conditioning_embedding_channels=320, block_out_channels=(16, 32, 96, 256)
339
+ ).to(device="cuda", dtype=weight_dtype)
340
+
341
+ stage1_ckpt_dir = cfg.stage1_ckpt_dir
342
+ stage1_ckpt_step = cfg.stage1_ckpt_step
343
+ denoising_unet.load_state_dict(
344
+ torch.load(
345
+ os.path.join(stage1_ckpt_dir, f"denoising_unet-{stage1_ckpt_step}.pth"),
346
+ map_location="cpu",
347
+ ),
348
+ strict=False,
349
+ )
350
+
351
+ reference_unet.load_state_dict(
352
+ torch.load(
353
+ os.path.join(stage1_ckpt_dir, f"reference_unet-{stage1_ckpt_step}.pth"),
354
+ map_location="cpu",
355
+ ),
356
+ strict=False,
357
+ )
358
+ pose_guider.load_state_dict(
359
+ torch.load(
360
+ os.path.join(stage1_ckpt_dir, f"pose_guider-{stage1_ckpt_step}.pth"),
361
+ map_location="cpu",
362
+ ),
363
+ strict=False,
364
+ )
365
+
366
+
367
+
368
+ # Freeze
369
+ vae.requires_grad_(False)
370
+ image_enc.requires_grad_(False)
371
+ reference_unet.requires_grad_(False)
372
+ denoising_unet.requires_grad_(False)
373
+ pose_guider.requires_grad_(False)
374
+
375
+ # Set motion module learnable
376
+ for name, module in denoising_unet.named_modules():
377
+ if "motion_modules" in name:
378
+ for params in module.parameters():
379
+ params.requires_grad = True
380
+
381
+ reference_control_writer = ReferenceAttentionControl(
382
+ reference_unet,
383
+ do_classifier_free_guidance=False,
384
+ mode="write",
385
+ fusion_blocks="full",
386
+ )
387
+ reference_control_reader = ReferenceAttentionControl(
388
+ denoising_unet,
389
+ do_classifier_free_guidance=False,
390
+ mode="read",
391
+ fusion_blocks="full",
392
+ )
393
+
394
+ net = Net(
395
+ reference_unet,
396
+ denoising_unet,
397
+ pose_guider,
398
+ reference_control_writer,
399
+ reference_control_reader,
400
+ )
401
+
402
+ if cfg.solver.enable_xformers_memory_efficient_attention:
403
+ if is_xformers_available():
404
+ reference_unet.enable_xformers_memory_efficient_attention()
405
+ denoising_unet.enable_xformers_memory_efficient_attention()
406
+ else:
407
+ raise ValueError(
408
+ "xformers is not available. Make sure it is installed correctly"
409
+ )
410
+
411
+ if cfg.solver.gradient_checkpointing:
412
+ reference_unet.enable_gradient_checkpointing()
413
+ denoising_unet.enable_gradient_checkpointing()
414
+
415
+ if cfg.solver.scale_lr:
416
+ learning_rate = (
417
+ cfg.solver.learning_rate
418
+ * cfg.solver.gradient_accumulation_steps
419
+ * cfg.data.train_bs
420
+ * accelerator.num_processes
421
+ )
422
+ else:
423
+ learning_rate = cfg.solver.learning_rate
424
+
425
+ # Initialize the optimizer
426
+ if cfg.solver.use_8bit_adam:
427
+ try:
428
+ import bitsandbytes as bnb
429
+ except ImportError:
430
+ raise ImportError(
431
+ "Please install bitsandbytes to use 8-bit Adam. You can do so by running `pip install bitsandbytes`"
432
+ )
433
+
434
+ optimizer_cls = bnb.optim.AdamW8bit
435
+ else:
436
+ optimizer_cls = torch.optim.AdamW
437
+
438
+ trainable_params = list(filter(lambda p: p.requires_grad, net.parameters()))
439
+ logger.info(f"Total trainable params {len(trainable_params)}")
440
+ optimizer = optimizer_cls(
441
+ trainable_params,
442
+ lr=learning_rate,
443
+ betas=(cfg.solver.adam_beta1, cfg.solver.adam_beta2),
444
+ weight_decay=cfg.solver.adam_weight_decay,
445
+ eps=cfg.solver.adam_epsilon,
446
+ )
447
+
448
+ # Scheduler
449
+ lr_scheduler = get_scheduler(
450
+ cfg.solver.lr_scheduler,
451
+ optimizer=optimizer,
452
+ num_warmup_steps=cfg.solver.lr_warmup_steps
453
+ * cfg.solver.gradient_accumulation_steps,
454
+ num_training_steps=cfg.solver.max_train_steps
455
+ * cfg.solver.gradient_accumulation_steps,
456
+ )
457
+
458
+ train_dataset = HumanDanceVideoDataset(
459
+ width=cfg.data.train_width,
460
+ height=cfg.data.train_height,
461
+ n_sample_frames=cfg.data.n_sample_frames,
462
+ sample_rate=cfg.data.sample_rate,
463
+ img_scale=(1.0, 1.0),
464
+ data_meta_paths=cfg.data.meta_paths,
465
+ )
466
+ train_dataloader = torch.utils.data.DataLoader(
467
+ train_dataset, batch_size=cfg.data.train_bs, shuffle=True, num_workers=4
468
+ )
469
+
470
+ # Prepare everything with our `accelerator`.
471
+ (
472
+ net,
473
+ optimizer,
474
+ train_dataloader,
475
+ lr_scheduler,
476
+ ) = accelerator.prepare(
477
+ net,
478
+ optimizer,
479
+ train_dataloader,
480
+ lr_scheduler,
481
+ )
482
+
483
+ # We need to recalculate our total training steps as the size of the training dataloader may have changed.
484
+ num_update_steps_per_epoch = math.ceil(
485
+ len(train_dataloader) / cfg.solver.gradient_accumulation_steps
486
+ )
487
+ # Afterwards we recalculate our number of training epochs
488
+ num_train_epochs = math.ceil(
489
+ cfg.solver.max_train_steps / num_update_steps_per_epoch
490
+ )
491
+
492
+ # We need to initialize the trackers we use, and also store our configuration.
493
+ # The trackers initialize automatically on the main process.
494
+ if accelerator.is_main_process:
495
+ run_time = datetime.now().strftime("%Y%m%d-%H%M")
496
+ accelerator.init_trackers(
497
+ exp_name,
498
+ init_kwargs={"mlflow": {"run_name": run_time}},
499
+ )
500
+ # dump config file
501
+ mlflow.log_dict(OmegaConf.to_container(cfg), "config.yaml")
502
+
503
+ # Train!
504
+ total_batch_size = (
505
+ cfg.data.train_bs
506
+ * accelerator.num_processes
507
+ * cfg.solver.gradient_accumulation_steps
508
+ )
509
+
510
+ logger.info("***** Running training *****")
511
+ logger.info(f" Num examples = {len(train_dataset)}")
512
+ logger.info(f" Num Epochs = {num_train_epochs}")
513
+ logger.info(f" Instantaneous batch size per device = {cfg.data.train_bs}")
514
+ logger.info(
515
+ f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}"
516
+ )
517
+ logger.info(
518
+ f" Gradient Accumulation steps = {cfg.solver.gradient_accumulation_steps}"
519
+ )
520
+ logger.info(f" Total optimization steps = {cfg.solver.max_train_steps}")
521
+ global_step = 0
522
+ first_epoch = 0
523
+
524
+ # Potentially load in the weights and states from a previous save
525
+ if cfg.resume_from_checkpoint:
526
+ if cfg.resume_from_checkpoint != "latest":
527
+ resume_dir = cfg.resume_from_checkpoint
528
+ else:
529
+ resume_dir = save_dir
530
+ # Get the most recent checkpoint
531
+ dirs = os.listdir(resume_dir)
532
+ dirs = [d for d in dirs if d.startswith("checkpoint")]
533
+ dirs = sorted(dirs, key=lambda x: int(x.split("-")[1]))
534
+ path = dirs[-1]
535
+ accelerator.load_state(os.path.join(resume_dir, path))
536
+ accelerator.print(f"Resuming from checkpoint {path}")
537
+ global_step = int(path.split("-")[1])
538
+
539
+ first_epoch = global_step // num_update_steps_per_epoch
540
+ resume_step = global_step % num_update_steps_per_epoch
541
+
542
+ # Only show the progress bar once on each machine.
543
+ progress_bar = tqdm(
544
+ range(global_step, cfg.solver.max_train_steps),
545
+ disable=not accelerator.is_local_main_process,
546
+ )
547
+ progress_bar.set_description("Steps")
548
+
549
+ for epoch in range(first_epoch, num_train_epochs):
550
+ train_loss = 0.0
551
+ t_data_start = time.time()
552
+ for step, batch in enumerate(train_dataloader):
553
+ t_data = time.time() - t_data_start
554
+ with accelerator.accumulate(net):
555
+ # Convert videos to latent space
556
+ pixel_values_vid = batch["pixel_values_vid"].to(weight_dtype)
557
+ masked_pixel_values = batch["pixel_values_vid_agnostic"].to(weight_dtype)
558
+ # mask_of_pixel_values = batch["pixel_values_vid_agnostic_mask"].to(weight_dtype)
559
+ mask_of_pixel_values = batch["pixel_values_vid_agnostic_mask"].to(weight_dtype)[:,:,0:1,:,:]
560
+ mask_of_pixel_values = mask_of_pixel_values.transpose(1, 2)  # b f c h w -> b c f h w
561
+ with torch.no_grad():
562
+ video_length = pixel_values_vid.shape[1]
563
+
564
+ pixel_values_vid = rearrange(
565
+ pixel_values_vid, "b f c h w -> (b f) c h w"
566
+ )
567
+ latents = vae.encode(pixel_values_vid).latent_dist.sample()
568
+ latents = rearrange(
569
+ latents, "(b f) c h w -> b c f h w", f=video_length
570
+ )
571
+ latents = latents * 0.18215
572
+
573
+ masked_pixel_values = rearrange(
574
+ masked_pixel_values, "b f c h w -> (b f) c h w"
575
+ )
576
+ masked_latents = vae.encode(masked_pixel_values).latent_dist.sample()
577
+ masked_latents = rearrange(
578
+ masked_latents, "(b f) c h w -> b c f h w", f=video_length
579
+ )
580
+ masked_latents = masked_latents * 0.18215
581
+ mask_of_latents = torch.nn.functional.interpolate(mask_of_pixel_values, size=(video_length, mask_of_pixel_values.shape[-2] // 8, mask_of_pixel_values.shape[-1] // 8))  # keep the temporal size equal to video_length instead of hard-coding 24 frames
582
+
583
+
584
+ noise = torch.randn_like(latents)
585
+ if cfg.noise_offset > 0:
586
+ noise += cfg.noise_offset * torch.randn(
587
+ (latents.shape[0], latents.shape[1], 1, 1, 1),
588
+ device=latents.device,
589
+ )
590
+ bsz = latents.shape[0]
591
+ # Sample a random timestep for each video
592
+ timesteps = torch.randint(
593
+ 0,
594
+ train_noise_scheduler.num_train_timesteps,
595
+ (bsz,),
596
+ device=latents.device,
597
+ )
598
+ timesteps = timesteps.long()
599
+
600
+ pixel_values_pose = batch["pixel_values_pose"] # (bs, f, c, H, W)
601
+ pixel_values_pose = pixel_values_pose.transpose(
602
+ 1, 2
603
+ ) # (bs, c, f, H, W)
604
+
605
+ uncond_fwd = random.random() < cfg.uncond_ratio
606
+ clip_image_list = []
607
+ ref_image_list = []
608
+ cloth_mask_list = []
609
+ for batch_idx, (ref_img, cloth_mask, clip_img) in enumerate(
610
+ zip(
611
+ batch["pixel_cloth"],
612
+ batch["pixel_cloth_mask"],
613
+ batch["clip_ref_img"],
614
+ )
615
+ ):
616
+ if uncond_fwd:
617
+ clip_image_list.append(torch.zeros_like(clip_img))
618
+ else:
619
+ clip_image_list.append(clip_img)
620
+ ref_image_list.append(ref_img)
621
+ cloth_mask_list.append(cloth_mask)
622
+
623
+ with torch.no_grad():
624
+ ref_img = torch.stack(ref_image_list, dim=0).to(
625
+ dtype=vae.dtype, device=vae.device
626
+ )
627
+ ref_image_latents = vae.encode(
628
+ ref_img
629
+ ).latent_dist.sample() # (bs, d, 64, 64)
630
+ ref_image_latents = ref_image_latents * 0.18215
631
+
632
+ cloth_mask = torch.stack(cloth_mask_list, dim=0).to(
633
+ dtype=vae.dtype, device=vae.device
634
+ )
635
+ cloth_mask = cloth_mask[:,0:1,:,:]
636
+ cloth_mask = torch.nn.functional.interpolate(cloth_mask, size=(cloth_mask.shape[-2] // 8, cloth_mask.shape[-1] // 8))
637
+
638
+
639
+ clip_img = torch.stack(clip_image_list, dim=0).to(
640
+ dtype=image_enc.dtype, device=image_enc.device
641
+ )
642
+ clip_img = clip_img.to(device="cuda", dtype=weight_dtype)
643
+ clip_image_embeds = image_enc(
644
+ clip_img.to("cuda", dtype=weight_dtype)
645
+ ).image_embeds
646
+ clip_image_embeds = clip_image_embeds.unsqueeze(1) # (bs, 1, d)
647
+
648
+ # add noise
649
+ noisy_latents = train_noise_scheduler.add_noise(
650
+ latents, noise, timesteps
651
+ )
652
+
653
+ # Get the target for loss depending on the prediction type
654
+ if train_noise_scheduler.prediction_type == "epsilon":
655
+ target = noise
656
+ elif train_noise_scheduler.prediction_type == "v_prediction":
657
+ target = train_noise_scheduler.get_velocity(
658
+ latents, noise, timesteps
659
+ )
660
+ else:
661
+ raise ValueError(
662
+ f"Unknown prediction type {train_noise_scheduler.prediction_type}"
663
+ )
664
+ # ---- Forward!!! -----
665
+ model_pred = net(
666
+ # noisy_latents,
667
+ torch.cat([noisy_latents,masked_latents,mask_of_latents],dim=1),
668
+ timesteps,
669
+ # ref_image_latents,
670
+ torch.cat([ref_image_latents, cloth_mask],dim=1),
671
+ clip_image_embeds,
672
+ pixel_values_pose,
673
+ uncond_fwd=uncond_fwd,
674
+ )
675
+
676
+ if cfg.snr_gamma == 0:
677
+ loss = F.mse_loss(
678
+ model_pred.float(), target.float(), reduction="mean"
679
+ )
680
+ else:
681
+ snr = compute_snr(train_noise_scheduler, timesteps)
682
+ if train_noise_scheduler.config.prediction_type == "v_prediction":
683
+ # Velocity objective requires that we add one to SNR values before we divide by them.
684
+ snr = snr + 1
685
+ mse_loss_weights = (
686
+ torch.stack(
687
+ [snr, cfg.snr_gamma * torch.ones_like(timesteps)], dim=1
688
+ ).min(dim=1)[0]
689
+ / snr
690
+ )
691
+ loss = F.mse_loss(
692
+ model_pred.float(), target.float(), reduction="none"
693
+ )
694
+ loss = (
695
+ loss.mean(dim=list(range(1, len(loss.shape))))
696
+ * mse_loss_weights
697
+ )
698
+ loss = loss.mean()
699
+
700
+ # Gather the losses across all processes for logging (if we use distributed training).
701
+ avg_loss = accelerator.gather(loss.repeat(cfg.data.train_bs)).mean()
702
+ train_loss += avg_loss.item() / cfg.solver.gradient_accumulation_steps
703
+
704
+ # Backpropagate
705
+ accelerator.backward(loss)
706
+ if accelerator.sync_gradients:
707
+ accelerator.clip_grad_norm_(
708
+ trainable_params,
709
+ cfg.solver.max_grad_norm,
710
+ )
711
+ optimizer.step()
712
+ lr_scheduler.step()
713
+ optimizer.zero_grad()
714
+
715
+ if accelerator.sync_gradients:
716
+ reference_control_reader.clear()
717
+ reference_control_writer.clear()
718
+ progress_bar.update(1)
719
+ global_step += 1
720
+ accelerator.log({"train_loss": train_loss}, step=global_step)
721
+ train_loss = 0.0
722
+
723
+ if global_step % cfg.val.validation_steps == 0:
724
+ if accelerator.is_main_process:
725
+ generator = torch.Generator(device=accelerator.device)
726
+ generator.manual_seed(cfg.seed)
727
+
728
+ log_validation(
729
+ vae=vae,
730
+ image_enc=image_enc,
731
+ net=net,
732
+ scheduler=val_noise_scheduler,
733
+ accelerator=accelerator,
734
+ width=cfg.data.train_width,
735
+ height=cfg.data.train_height,
736
+ global_step=global_step,
737
+ clip_length=cfg.data.n_sample_frames,
738
+ generator=generator,
739
+
740
+ )
741
+
742
+ # for sample_id, sample_dict in enumerate(sample_dicts):
743
+ # sample_name = sample_dict["name"]
744
+ # vid = sample_dict["vid"]
745
+ # with TemporaryDirectory() as temp_dir:
746
+ # out_file = Path(
747
+ # f"{temp_dir}/{global_step:06d}-{sample_name}.gif"
748
+ # )
749
+ # save_videos_grid(vid, out_file, n_rows=2)
750
+ # mlflow.log_artifact(out_file)
751
+
752
+
753
+ logs = {
754
+ "step_loss": loss.detach().item(),
755
+ "lr": lr_scheduler.get_last_lr()[0],
756
+ "td": f"{t_data:.2f}s",
757
+ }
758
+ t_data_start = time.time()
759
+ progress_bar.set_postfix(**logs)
760
+
761
+ if global_step >= cfg.solver.max_train_steps:
762
+ break
763
+
764
+ # save model after each epoch
765
+ if accelerator.is_main_process:
766
+ save_path = os.path.join(save_dir, f"checkpoint-{global_step}")
767
+ delete_additional_ckpt(save_dir, 1)
768
+ # accelerator.save_state(save_path)
769
+ # save motion module only
770
+ unwrap_net = accelerator.unwrap_model(net)
771
+ save_checkpoint(
772
+ unwrap_net.denoising_unet,
773
+ save_dir,
774
+ "motion_module",
775
+ global_step,
776
+ total_limit=3,
777
+ )
778
+
779
+ # Create the pipeline using the trained modules and save it.
780
+ accelerator.wait_for_everyone()
781
+ accelerator.end_training()
782
+
783
+
784
+ def save_checkpoint(model, save_dir, prefix, ckpt_num, total_limit=None):
785
+ save_path = osp.join(save_dir, f"{prefix}-{ckpt_num}.pth")
786
+
787
+ if total_limit is not None:
788
+ checkpoints = os.listdir(save_dir)
789
+ checkpoints = [d for d in checkpoints if d.startswith(prefix)]
790
+ checkpoints = sorted(
791
+ checkpoints, key=lambda x: int(x.split("-")[1].split(".")[0])
792
+ )
793
+
794
+ if len(checkpoints) >= total_limit:
795
+ num_to_remove = len(checkpoints) - total_limit + 1
796
+ removing_checkpoints = checkpoints[0:num_to_remove]
797
+ logger.info(
798
+ f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
799
+ )
800
+ logger.info(f"removing checkpoints: {', '.join(removing_checkpoints)}")
801
+
802
+ for removing_checkpoint in removing_checkpoints:
803
+ removing_checkpoint = os.path.join(save_dir, removing_checkpoint)
804
+ os.remove(removing_checkpoint)
805
+
806
+ mm_state_dict = OrderedDict()
807
+ state_dict = model.state_dict()
808
+ for key in state_dict:
809
+ if "motion_module" in key:
810
+ mm_state_dict[key] = state_dict[key]
811
+
812
+ torch.save(mm_state_dict, save_path)
813
+
814
+
815
+ def decode_latents(vae, latents):
816
+ video_length = latents.shape[2]
817
+ latents = 1 / 0.18215 * latents
818
+ latents = rearrange(latents, "b c f h w -> (b f) c h w")
819
+ # video = self.vae.decode(latents).sample
820
+ video = []
821
+ for frame_idx in tqdm(range(latents.shape[0])):
822
+ video.append(vae.decode(latents[frame_idx : frame_idx + 1]).sample)
823
+ video = torch.cat(video)
824
+ video = rearrange(video, "(b f) c h w -> b c f h w", f=video_length)
825
+ video = (video / 2 + 0.5).clamp(0, 1)
826
+ # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
827
+ video = video.cpu().float().numpy()
828
+ return video
829
+
830
+
831
+ if __name__ == "__main__":
832
+ parser = argparse.ArgumentParser()
833
+ parser.add_argument("--config", type=str, default="./configs/training/stage2.yaml")
834
+ args = parser.parse_args()
835
+
836
+ if args.config[-5:] == ".yaml":
837
+ config = OmegaConf.load(args.config)
838
+ elif args.config[-3:] == ".py":
839
+ config = import_filename(args.config).cfg
840
+ else:
841
+ raise ValueError("Do not support this format config file")
842
+ main(config)
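
For reference, a quick shape sketch of what the two UNets receive in this stage, assuming the usual 4-channel Stable Diffusion VAE latents (all sizes below are placeholders). This is why reference_unet is built with "in_channels": 5, while the denoising UNet's in_channels is taken from infer_config.unet_additional_kwargs and is presumably 9:

import torch

b, f, h8, w8 = 1, 24, 64, 48                       # batch, frames, latent height/width (placeholders)
noisy_latents   = torch.randn(b, 4, f, h8, w8)     # noised video latents
masked_latents  = torch.randn(b, 4, f, h8, w8)     # agnostic (garment-removed) video latents
mask_of_latents = torch.randn(b, 1, f, h8, w8)     # down-sampled agnostic mask
denoising_input = torch.cat([noisy_latents, masked_latents, mask_of_latents], dim=1)
print(denoising_input.shape)                       # torch.Size([1, 9, 24, 64, 48])

ref_image_latents = torch.randn(b, 4, h8, w8)      # cloth image latents
cloth_mask        = torch.randn(b, 1, h8, w8)      # down-sampled cloth mask
reference_input   = torch.cat([ref_image_latents, cloth_mask], dim=1)
print(reference_input.shape)                       # torch.Size([1, 5, 64, 48])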
vivid.py ADDED
@@ -0,0 +1,229 @@
1
+ import argparse
2
+ from datetime import datetime
3
+ from pathlib import Path
4
+ import sys
5
+ import torch
6
+ import os
7
+ from diffusers import AutoencoderKL, DDIMScheduler
8
+ from omegaconf import OmegaConf
9
+ from PIL import Image
10
+ from torchvision import transforms
11
+ from transformers import CLIPVisionModelWithProjection
12
+
13
+ from src.models.pose_guider import PoseGuider
14
+ from src.models.unet_2d_condition import UNet2DConditionModel
15
+ from src.models.unet_3d import UNet3DConditionModel
16
+ from src.pipelines.pipeline_pose2vid_long import Pose2VideoPipeline
17
+ from src.utils.util import get_fps, read_frames, save_videos_grid
18
+
19
+ def parse_args():
20
+ parser = argparse.ArgumentParser()
21
+ parser.add_argument("--config",type=str,default="/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/configs/prompts/valid.yaml")
22
+ parser.add_argument("-W", type=int, default=384)
23
+ parser.add_argument("-H", type=int, default=512)
24
+ parser.add_argument("-L", type=int, default=24)
25
+
26
+ parser.add_argument("--seed", type=int, default=42)
27
+ parser.add_argument("--cfg", type=float, default=3.5)
28
+ parser.add_argument("--steps", type=int, default=20)
29
+ parser.add_argument("--fps", type=int)
30
+ args = parser.parse_args()
31
+
32
+ return args
33
+
34
+
35
+ def main():
36
+ args = parse_args()
37
+
38
+ config = OmegaConf.load(args.config)
39
+
40
+ if config.weight_dtype == "fp16":
41
+ weight_dtype = torch.float16
42
+ else:
43
+ weight_dtype = torch.float32
44
+
45
+ vae = AutoencoderKL.from_pretrained(
46
+ config.pretrained_vae_path,
47
+ ).to("cuda", dtype=weight_dtype)
48
+
49
+ reference_unet = UNet2DConditionModel.from_pretrained_2d(
50
+ config.pretrained_base_model_path,
51
+ subfolder="unet",
52
+ unet_additional_kwargs={
53
+ "in_channels": 5,
54
+ }
55
+ ).to(dtype=weight_dtype, device="cuda")
56
+
57
+ inference_config_path = config.inference_config #'/mnt/lpai-dione/ssai/cvg/team/wjj/ViViD/configs/inference/inference.yaml'
58
+ infer_config = OmegaConf.load(inference_config_path)
59
+ denoising_unet = UNet3DConditionModel.from_pretrained_2d(
60
+ config.pretrained_base_model_path,
61
+ config.motion_module_path,
62
+ subfolder="unet",
63
+ unet_additional_kwargs=infer_config.unet_additional_kwargs,
64
+ ).to(dtype=weight_dtype, device="cuda")
65
+
66
+ pose_guider = PoseGuider(320, block_out_channels=(16, 32, 96, 256)).to(
67
+ dtype=weight_dtype, device="cuda"
68
+ )
69
+
70
+
71
+ image_enc = CLIPVisionModelWithProjection.from_pretrained(
72
+ config.image_encoder_path
73
+ ).to(dtype=weight_dtype, device="cuda")
74
+
75
+ sched_kwargs = OmegaConf.to_container(infer_config.noise_scheduler_kwargs)
76
+ scheduler = DDIMScheduler(**sched_kwargs)
77
+
78
+ seed = config.get("seed",args.seed)
79
+ generator = torch.manual_seed(seed)
80
+
81
+ width, height = args.W, args.H
82
+ clip_length = config.get("L",args.L)
83
+ steps = args.steps
84
+ guidance_scale = args.cfg
85
+
86
+ # load pretrained weights
87
+ denoising_unet.load_state_dict(
88
+ torch.load(config.denoising_unet_path, map_location="cpu"),
89
+ strict=False,
90
+ )
91
+
92
+ reference_unet.load_state_dict(
93
+ torch.load(config.reference_unet_path, map_location="cpu"),
94
+ )
95
+
96
+
97
+ pose_guider.load_state_dict(
98
+ torch.load(config.pose_guider_path, map_location="cpu"),
99
+ )
100
+
101
+
102
+
103
+ pipe = Pose2VideoPipeline(
104
+ vae=vae,
105
+ image_encoder=image_enc,
106
+ reference_unet=reference_unet,
107
+ denoising_unet=denoising_unet,
108
+ pose_guider=pose_guider,
109
+ scheduler=scheduler,
110
+ )
111
+ # Set the log file path
+ # log_file_path = "model_structures.log"
+ # with open(log_file_path, 'w') as log_file:
+ # # Redirect stdout to the log file
+ # orig_stdout = sys.stdout # save the original stdout
+ # sys.stdout = log_file # redirect stdout to the log file
+
+ # # Print the model structures
+ # print("Denoising UNet structure:")
+ # print(denoising_unet) # print the structure of denoising_unet
+
+ # print("Reference UNet structure:")
+ # print(reference_unet) # print the structure of reference_unet
+
+ # print("Pose Guider structure:")
+ # print(pose_guider) # print the structure of pose_guider
+
+ # print("image_enc:")
+ # print(image_enc)
+
+ # print("Pose Guider structure:")
+ # print(pose_guider)
+
+ # print("pipe:")
+ # print(pipe)
+ # # Restore stdout
+ # sys.stdout = orig_stdout # restore the original stdout
+ # print(f"The model structures have been saved to {log_file_path}.")
139
+ pipe = pipe.to("cuda", dtype=weight_dtype)
140
+
141
+ date_str = datetime.now().strftime("%Y%m%d")
142
+ time_str = datetime.now().strftime("%H%M")
143
+ save_dir_name = f"{time_str}--seed_{seed}-{args.W}x{args.H}"
144
+
145
+ save_dir = Path(f"output/{date_str}/{save_dir_name}")
146
+ save_dir.mkdir(exist_ok=True, parents=True)
147
+
148
+ model_video_paths = config.model_video_paths
149
+ cloth_image_paths = config.cloth_image_paths
150
+
151
+ transform = transforms.Compose(
152
+ [transforms.Resize((height, width)), transforms.ToTensor()]
153
+ )
154
+
155
+
156
+ for model_image_path in model_video_paths:
157
+ # print("model_image_path", model_image_path)
158
+ src_fps = get_fps(model_image_path)
159
+
160
+ model_name = Path(model_image_path).stem
161
+ agnostic_path=model_image_path.replace("videos","agnostic") # data/videos/upper1.mp4 -> data/agnostic/upper1.mp4
162
+ agn_mask_path=model_image_path.replace("videos","agnostic_mask")
163
+ densepose_path=model_image_path.replace("videos","densepose")
164
+
165
+ video_tensor_list=[]
166
+ video_images=read_frames(model_image_path)
167
+
168
+ clip_length = len(video_images) # set clip_length to the total number of frames of the input video
169
+ # clip_length=48
170
+ for vid_image_pil in video_images[:clip_length]: #clip_length=24
171
+ video_tensor_list.append(transform(vid_image_pil))
172
+
173
+ video_tensor = torch.stack(video_tensor_list, dim=0) # (f, c, h, w)
174
+ video_tensor = video_tensor.transpose(0, 1)
175
+
176
+
177
+ agnostic_list=[]
178
+ agnostic_images=read_frames(agnostic_path)
179
+ for agnostic_image_pil in agnostic_images[:clip_length]:
180
+ agnostic_list.append(agnostic_image_pil)
181
+
182
+ agn_mask_list=[]
183
+ agn_mask_images=read_frames(agn_mask_path)
184
+ # print(" agn_mask_images", agn_mask_images)
185
+ for agn_mask_image_pil in agn_mask_images[:clip_length]:
186
+ agn_mask_list.append(agn_mask_image_pil)
187
+
188
+ pose_list=[]
189
+ pose_images=read_frames(densepose_path)
190
+ for pose_image_pil in pose_images[:clip_length]:
191
+ pose_list.append(pose_image_pil)
192
+
193
+ video_tensor = video_tensor.unsqueeze(0)
194
+
195
+
196
+ for cloth_image_path in cloth_image_paths:
197
+ cloth_name = Path(cloth_image_path).stem
198
+ cloth_image_pil = Image.open(cloth_image_path).convert("RGB")
199
+
200
+ cloth_mask_path=cloth_image_path.replace("cloth","cloth_mask")
201
+ cloth_mask_pil = Image.open(cloth_mask_path).convert("RGB")
202
+
203
+ pipeline_output = pipe(
204
+ agnostic_list,
205
+ agn_mask_list,
206
+ cloth_image_pil,
207
+ cloth_mask_pil,
208
+ pose_list,
209
+ width,
210
+ height,
211
+ clip_length,
212
+ steps,
213
+ guidance_scale,
214
+ generator=generator,
215
+ )
216
+ # print("pipeline_output", pipeline_output)
217
+ video = pipeline_output.videos
218
+
219
+ video = torch.cat([video_tensor,video], dim=0)
220
+ save_videos_grid(
221
+ video,
222
+ f"{save_dir}/{model_name}_{cloth_name}_{args.H}x{args.W}_{int(guidance_scale)}_{time_str}.mp4",
223
+ n_rows=2,
224
+ fps=src_fps if args.fps is None else args.fps,
225
+ )
226
+
227
+
228
+ if __name__ == "__main__":
229
+ main()
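
A small sketch of the directory convention that vivid.py (and the stage-2 validation) relies on: each model video is expected to have sibling agnostic, agnostic_mask and densepose renders under folders that differ only in one path component, and each cloth image a matching cloth_mask. The helper and the example path below are illustrative only:

def sibling_inputs(video_path: str) -> dict:
    # Derive the companion inputs for one model video, mirroring the .replace() calls above.
    return {
        "agnostic": video_path.replace("videos", "agnostic"),
        "agnostic_mask": video_path.replace("videos", "agnostic_mask"),
        "densepose": video_path.replace("videos", "densepose"),
    }

print(sibling_inputs("data/videos/upper1.mp4"))
# {'agnostic': 'data/agnostic/upper1.mp4',
#  'agnostic_mask': 'data/agnostic_mask/upper1.mp4',
#  'densepose': 'data/densepose/upper1.mp4'}

Cloth masks follow the same pattern via cloth_image_path.replace("cloth", "cloth_mask"), so cloth images are expected to live under a path containing "cloth".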
vividfuxian_motion/20241211/1715/803128_detail_1060638_in_xl.mp4 ADDED
Binary file (95.1 kB). View file
 
vividfuxian_motion/20241212/1437/000004-803128_detail_1060638_in_xl.mp4 ADDED
Binary file (94 kB). View file
 
vividfuxian_motion/20241212/1506/000200-803128_detail_1060638_in_xl.mp4 ADDED
Binary file (98.1 kB). View file
 
vividfuxian_motion/20241212/1629/000600-803128_detail_1060638_in_xl.mp4 ADDED
Binary file (97.8 kB). View file
 
vividfuxian_valid/stage1/000010-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/000200-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/000400-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/000600-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/000800-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/001000-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/001200-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/001600-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/001800-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/002000-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/002200-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/002400-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/002600-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/002800-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/003000-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/003400-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/003600-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/003800-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/004200-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/004400-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/004600-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/004800-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/005200-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/005400-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/005600-803137_in_xl_812294_in_xl.jpg ADDED
vividfuxian_valid/stage1/005800-803137_in_xl_812294_in_xl.jpg ADDED