float32/CPU inference?
I am curious how the model performs on CPU.
The .jit files appear to work with bfloat16 only and give an error at runtime if loaded with device='cpu' and dtype='float32'.
If loaded with device='cpu' and dtype='bfloat16', the performance is extremely poor (<0.1 megapixels per second encoding throughput). I suspect this is because bfloat16 is being emulated in software on CPU, but I'm not sure.
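To illustrate what I mean, here is a minimal sketch of the kind of loading I'm doing, using torch.jit.load directly rather than the Cosmos wrapper (the encoder.jit filename is assumed from the repo layout, and the output format is a guess):

import torch
from huggingface_hub import snapshot_download

model_path = snapshot_download("nvidia/Cosmos-Tokenize1-DI16x16-360p")

# Load the scripted encoder on CPU (encoder.jit filename assumed from the repo layout).
encoder = torch.jit.load(f'{model_path}/encoder.jit', map_location='cpu').eval()

with torch.no_grad():
    # bfloat16 input matches the dtype the .jit file appears to have been exported with;
    # this runs, but at <0.1 MP/s for me. Trying to run everything in float32 instead
    # is where I hit the runtime error.
    img = torch.rand(1, 3, 512, 512, dtype=torch.bfloat16)
    out = encoder(img)  # exact output format depends on the tokenizer type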
I was able to create a tokenizer with placeholder values:
import torch
from huggingface_hub import snapshot_download

from cosmos_predict1.tokenizer.networks.discrete_image import DiscreteImageTokenizer

model_path = snapshot_download("nvidia/Cosmos-Tokenize1-DI16x16-360p")
checkpoint = torch.load(f'{model_path}/model.pt', map_location='cpu', weights_only=False)

# Placeholder hyperparameters; the actual values aren't documented anywhere I can find.
tokenizer = DiscreteImageTokenizer(
    z_channels=6,
    embedding_dim=3 * 16 * 16,
    in_channels=3,
    channels=3 * 16 * 16,
    channels_mult=[1, 1, 1, 1],
    num_res_blocks=4,
    attn_resolutions=[360, 360, 360, 360],
    dropout=0.0,
    resolution=360,
    spatial_compression=16,
    out_channels=3,
    levels=[1, 1, 1, 1],
    num_quantizers=1,
)
Presumably the model could be loaded on CPU in float32 by choosing the correct hyperparameters and loading the state dict from model.pt, but I can't find the actual values of these hyperparameters anywhere in the paper or the public repo.
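If I did have the right constructor arguments, I would expect something along these lines to work (just a sketch; the layout of model.pt and whether the module's forward accepts a plain image tensor are assumptions on my part):

# Assumption: model.pt holds a plain state dict (it may instead be nested under a
# key, or be a full module; I haven't verified this).
tokenizer.load_state_dict(checkpoint)
tokenizer = tokenizer.float().eval()

with torch.no_grad():
    img = torch.rand(1, 3, 512, 512)  # dummy float32 input, size divisible by 16
    out = tokenizer(img)              # exact output format depends on the network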
I looked in the config.json file, but it only contains { "architectures": ["CosmosTokenizer"] }.
Is it possible to use the cosmos tokenizers with float32? If so, what is the recommended way to load the model?
What should the expected CPU throughput be?
Thanks,
Dan