Parker committed on
Commit 2daa24e · verified · 1 Parent(s): 83a357a

Upload 5 files

4bit/QUANTIZATION_README.md ADDED
@@ -0,0 +1,95 @@
# VibeVoice Quantization Guide

The VibeVoice 7B model has been successfully quantized to both 4-bit and 8-bit versions using bitsandbytes.

## Model Sizes

| Model Version | Size | Memory Usage | Quality |
|---------------|------|--------------|---------|
| Original (fp16/bf16) | 18GB | ~18GB VRAM | Best |
| 8-bit Quantized | 9.9GB | ~10.6GB VRAM | Excellent |
| 4-bit Quantized (nf4) | 6.2GB | ~6.6GB VRAM | Very Good |

## How to Use Pre-Quantized Models

### 1. Loading 4-bit Model

```python
import torch

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load pre-quantized 4-bit model
model_path = "/path/to/VibeVoice-Large-4bit"
processor = VibeVoiceProcessor.from_pretrained(model_path)
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_path,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
)
```

### 2. Loading 8-bit Model

```python
# Same code, just point to the 8-bit model
model_path = "/path/to/VibeVoice-Large-8bit"
# ... rest is the same
```

## Creating Your Own Quantized Models

Use the provided script to quantize models:

```bash
# 4-bit quantization (nf4)
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/4bit \
    --bits 4 \
    --test

# 8-bit quantization
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/8bit \
    --bits 8 \
    --test
```

## Benefits

1. **Pre-quantized models load faster** - No on-the-fly quantization needed
2. **Lower VRAM requirements** - 4-bit uses only ~6.6GB vs 18GB
3. **Shareable** - Upload the quantized folder to share with others
4. **Quality preserved** - nf4 quantization maintains excellent output quality

## Distribution

To share quantized models:

1. Upload the entire quantized model directory (e.g., `VibeVoice-Large-4bit/`); a Hub upload sketch follows this list
2. Include the `quantization_config.json` file (automatically created)
3. Users can load it directly without any quantization setup
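
One possible way to do step 1 with the Hugging Face Hub (a minimal sketch; the repo id is a placeholder and this assumes you are logged in, e.g. via `huggingface-cli login`):

```python
from huggingface_hub import HfApi

api = HfApi()

# Create the target repo if it does not exist yet (repo id is illustrative)
api.create_repo("your-username/VibeVoice-Large-4bit", exist_ok=True)

# Upload the whole quantized directory, including quantization_config.json
api.upload_folder(
    folder_path="/path/to/VibeVoice-Large-4bit",
    repo_id="your-username/VibeVoice-Large-4bit",
)
```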

## Performance Notes

- 4-bit (nf4): Best for memory-constrained systems, minimal quality loss
- 8-bit: Better quality than 4-bit, still significant memory savings
- Both versions maintain the same generation speed as the original
- Flash Attention 2 is supported in all quantized versions (see the sketch after this list)
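
Assuming the loader accepts the standard `transformers` keyword argument for selecting an attention backend (an assumption, not verified against the VibeVoice code), Flash Attention 2 could be requested at load time roughly like this:

```python
# Sketch: requires the flash-attn package and a GPU that supports it.
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_path,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```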

## Troubleshooting

If loading fails:
1. Ensure you have `bitsandbytes` installed: `pip install bitsandbytes`
2. Make sure you're on a CUDA-capable GPU
3. Check that all model files are present in the directory (a quick check is sketched below)
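
A quick way to verify all three points (a sketch; adjust `model_path` to your local copy):

```python
import os
import torch
import bitsandbytes  # raises ImportError if bitsandbytes is missing

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"

model_path = "/path/to/VibeVoice-Large-4bit"
# Should list config.json, model.safetensors*, quantization_config.json, etc.
print(sorted(os.listdir(model_path)))
```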

## Files Created

Each quantized model directory contains:
- `model.safetensors.*` - Quantized model weights
- `config.json` - Model configuration with quantization settings
- `quantization_config.json` - Specific quantization parameters
- `processor/` - Audio processor files
- `load_quantized_Xbit.py` - Example loading script
4bit/quantize_and_save_vibevoice.py ADDED
@@ -0,0 +1,330 @@
#!/usr/bin/env python
"""
Quantize and save VibeVoice model using bitsandbytes.
Creates a pre-quantized model that can be shared and loaded directly.
"""

import os
import json
import shutil
import torch
from pathlib import Path
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
from transformers.utils import logging
from safetensors.torch import save_file

logging.set_verbosity_info()

def quantize_and_save_model(
    model_path: str,
    output_dir: str,
    bits: int = 4,
    quant_type: str = "nf4"
):
    """Quantize VibeVoice model and save it for distribution"""

    print(f"\n{'='*70}")
    print(f"VIBEVOICE QUANTIZATION - {bits}-bit ({quant_type})")
    print(f"{'='*70}")
    print(f"Source: {model_path}")
    print(f"Output: {output_dir}")
    print(f"{'='*70}\n")

    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Configure quantization
    if bits == 4:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type=quant_type
        )
    elif bits == 8:
        # Note: 8-bit mode (LLM.int8()) has no separate compute-dtype option
        bnb_config = BitsAndBytesConfig(
            load_in_8bit=True,
        )
    else:
        raise ValueError(f"Unsupported bit width: {bits}")

    print("🔧 Loading and quantizing model...")

    # Load the model with quantization
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )

    # Get memory usage
    memory_gb = torch.cuda.memory_allocated() / 1e9
    print(f"💾 Quantized model memory usage: {memory_gb:.1f} GB")

    # Save the quantized model
    print("\n📦 Saving quantized model...")

    # Method 1: Try using save_pretrained with quantization info
    try:
        # Save model with quantization config
        model.save_pretrained(
            output_path,
            safe_serialization=True,
            max_shard_size="5GB"
        )

        # Save the quantization config separately
        quant_config_dict = {
            "quantization_config": bnb_config.to_dict(),
            "quantization_method": "bitsandbytes",
            "bits": bits,
            "quant_type": quant_type
        }

        with open(output_path / "quantization_config.json", 'w') as f:
            json.dump(quant_config_dict, f, indent=2)

        print("✅ Model saved with integrated quantization")

    except Exception as e:
        print(f"⚠️ Standard save failed: {e}")
        print("Trying alternative save method...")

        # Method 2: Save state dict with quantized weights
        save_quantized_state_dict(model, output_path, bnb_config)

    # Copy processor files
    print("\n📋 Copying processor files...")
    processor = VibeVoiceProcessor.from_pretrained(model_path)
    processor.save_pretrained(output_path)

    # Copy additional config files
    for file in ["config.json", "generation_config.json"]:
        src = Path(model_path) / file
        if src.exists():
            shutil.copy2(src, output_path / file)

    # Update config to indicate quantization
    config_path = output_path / "config.json"
    if config_path.exists():
        with open(config_path, 'r') as f:
            config = json.load(f)

        config["quantization_config"] = bnb_config.to_dict()
        config["_quantization_method"] = "bitsandbytes"

        with open(config_path, 'w') as f:
            json.dump(config, f, indent=2)

    print(f"\n✅ Quantized model saved to: {output_path}")

    # Create loading script
    create_loading_script(output_path, bits, quant_type)

    return output_path

def save_quantized_state_dict(model, output_path, bnb_config):
    """Alternative method to save quantized weights"""
    print("\n🔧 Saving quantized state dict...")

    # Get the state dict
    state_dict = model.state_dict()

    # Separate quantized and non-quantized parameters
    quantized_state = {}
    metadata = {
        "quantized_modules": [],
        "quantization_config": bnb_config.to_dict()
    }

    for name, param in state_dict.items():
        # Check if this is a quantized parameter
        if hasattr(param, 'quant_state'):
            # Store quantization state
            metadata["quantized_modules"].append(name)
            quantized_state[name] = param.data
        else:
            # Regular parameter
            quantized_state[name] = param

    # Save using safetensors (its metadata must be a flat str-to-str mapping,
    # so the nested dict is serialized to a JSON string)
    save_file(
        quantized_state,
        output_path / "model.safetensors",
        metadata={"quantization": json.dumps(metadata)}
    )

    # Save metadata
    with open(output_path / "quantization_metadata.json", 'w') as f:
        json.dump(metadata, f, indent=2)

def create_loading_script(output_path, bits, quant_type):
    """Create a script to load the quantized model"""

    script_content = f'''#!/usr/bin/env python
"""
Load and use the {bits}-bit quantized VibeVoice model
"""

import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

def load_quantized_model(model_path="{output_path}"):
    """Load the pre-quantized VibeVoice model"""

    print("Loading {bits}-bit quantized VibeVoice model...")

    # The model is already quantized, but we need to specify the config
    # to ensure proper loading of quantized weights
    bnb_config = BitsAndBytesConfig(
        load_in_{bits}bit=True,
        bnb_{bits}bit_compute_dtype=torch.bfloat16,
        {"bnb_4bit_use_double_quant=True," if bits == 4 else ""}
        {"bnb_4bit_quant_type='" + quant_type + "'" if bits == 4 else ""}
    )

    # Load processor
    processor = VibeVoiceProcessor.from_pretrained(model_path)

    # Load model
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )

    model.eval()

    print("✅ Model loaded successfully!")
    print(f"💾 Memory usage: {{torch.cuda.memory_allocated() / 1e9:.1f}} GB")

    return model, processor

# Example usage
if __name__ == "__main__":
    model, processor = load_quantized_model()

    # Generate audio
    text = "Speaker 1: Hello! Speaker 2: Hi there!"
    inputs = processor(
        text=[text],
        voice_samples=[["path/to/voice1.wav", "path/to/voice2.wav"]],
        padding=True,
        return_tensors="pt",
    )

    with torch.no_grad():
        outputs = model.generate(**inputs)

    # Save audio
    processor.save_audio(outputs.speech_outputs[0], "output.wav")
'''

    script_path = output_path / f"load_quantized_{bits}bit.py"
    with open(script_path, 'w') as f:
        f.write(script_content)

    print(f"📝 Created loading script: {script_path}")

def test_quantized_model(model_path):
    """Test loading and generating with the quantized model"""
    print(f"\n🧪 Testing quantized model from: {model_path}")

    try:
        # Load the quantized model
        processor = VibeVoiceProcessor.from_pretrained(model_path)

        # Load with auto-detection of quantization
        model = VibeVoiceForConditionalGenerationInference.from_pretrained(
            model_path,
            device_map='cuda',
            torch_dtype=torch.bfloat16,
        )

        print("✅ Model loaded successfully!")

        # Quick generation test
        test_text = "Speaker 1: Testing quantized model. Speaker 2: It works!"
        print(f"\n🎤 Testing generation with: '{test_text}'")

        # Use demo voices
        voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"
        speaker_voices = [
            os.path.join(voices_dir, "en-Alice_woman.wav"),
            os.path.join(voices_dir, "en-Carter_man.wav")
        ]

        inputs = processor(
            text=[test_text],
            voice_samples=[speaker_voices],
            padding=True,
            return_tensors="pt",
            return_attention_mask=True,
        )

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=None,
                cfg_scale=1.3,
                tokenizer=processor.tokenizer,
                generation_config={'do_sample': False},
            )

        print("✅ Generation successful!")

        # Save test output
        output_path = Path(model_path) / "test_output.wav"
        processor.save_audio(outputs.speech_outputs[0], output_path=str(output_path))
        print(f"🔊 Test audio saved to: {output_path}")

        return True

    except Exception as e:
        print(f"❌ Test failed: {e}")
        return False

def main():
    import argparse
    parser = argparse.ArgumentParser(description="Quantize and save VibeVoice model")
    parser.add_argument("--model_path", default="/home/deveraux/Desktop/vibevoice/VibeVoice-Large-pt",
                        help="Path to the original model")
    parser.add_argument("--output_dir", default="/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit",
                        help="Output directory for quantized model")
    parser.add_argument("--bits", type=int, default=4, choices=[4, 8],
                        help="Quantization bits (4 or 8)")
    parser.add_argument("--quant_type", default="nf4", choices=["nf4", "fp4"],
                        help="4-bit quantization type")
    parser.add_argument("--test", action="store_true",
                        help="Test the quantized model after saving")

    args = parser.parse_args()

    # Update output dir based on bits
    if str(args.bits) not in args.output_dir:
        args.output_dir = args.output_dir.replace("4bit", f"{args.bits}bit")

    # Quantize and save
    output_path = quantize_and_save_model(
        args.model_path,
        args.output_dir,
        args.bits,
        args.quant_type
    )

    # Test if requested
    if args.test:
        test_quantized_model(output_path)

    print(f"\n🎉 Done! Quantized model ready for distribution at: {output_path}")
    print("\n📦 To share this model:")
    print(f"1. Upload the entire '{output_path}' directory")
    print("2. Users can load it with the provided script or directly with transformers")
    print(f"3. The model will load in {args.bits}-bit without additional quantization")

if __name__ == "__main__":
    main()
4bit/test_accurate_vram.py ADDED
@@ -0,0 +1,207 @@
#!/usr/bin/env python
"""
Accurate VRAM measurement for VibeVoice models.
Shows the difference between allocated and reserved memory.
"""

import os
import gc
import torch
import subprocess
import time
from pathlib import Path
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

def get_gpu_memory_info():
    """Get detailed GPU memory information"""
    if not torch.cuda.is_available():
        return {}

    # PyTorch memory stats
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9

    # Get nvidia-smi info
    try:
        result = subprocess.run([
            'nvidia-smi',
            '--query-gpu=memory.used,memory.total',
            '--format=csv,nounits,noheader'
        ], capture_output=True, text=True)

        if result.returncode == 0:
            used, total = map(int, result.stdout.strip().split(','))
            nvidia_used_gb = used / 1024  # Convert MB to GB
            nvidia_total_gb = total / 1024
        else:
            nvidia_used_gb = 0
            nvidia_total_gb = 0
    except Exception:
        nvidia_used_gb = 0
        nvidia_total_gb = 0

    return {
        'allocated': allocated,
        'reserved': reserved,
        'nvidia_smi': nvidia_used_gb,
        'nvidia_total': nvidia_total_gb
    }

def print_memory_report(label, before, after):
    """Print detailed memory usage report"""
    print(f"\n{label}:")
    print(f" PyTorch Allocated: {before['allocated']:.2f} GB → {after['allocated']:.2f} GB "
          f"(+{after['allocated'] - before['allocated']:.2f} GB)")
    print(f" PyTorch Reserved: {before['reserved']:.2f} GB → {after['reserved']:.2f} GB "
          f"(+{after['reserved'] - before['reserved']:.2f} GB)")
    print(f" nvidia-smi Total: {before['nvidia_smi']:.2f} GB → {after['nvidia_smi']:.2f} GB "
          f"(+{after['nvidia_smi'] - before['nvidia_smi']:.2f} GB)")
    print(f" Memory Overhead: {after['reserved'] - after['allocated']:.2f} GB (PyTorch cache)")

def clear_gpu_memory():
    """Aggressively clear GPU memory"""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        # Force memory pool cleanup
        torch.cuda.reset_peak_memory_stats()

def test_model_memory(model_path, model_name):
    """Test model with detailed memory tracking"""
    print(f"\n{'='*70}")
    print(f"Testing {model_name}")
    print(f"{'='*70}")

    # Clear memory and get baseline
    clear_gpu_memory()
    time.sleep(2)  # Let memory settle

    baseline = get_gpu_memory_info()
    print("\nBaseline GPU Memory:")
    print(f" PyTorch Allocated: {baseline['allocated']:.2f} GB")
    print(f" PyTorch Reserved: {baseline['reserved']:.2f} GB")
    print(f" nvidia-smi Shows: {baseline['nvidia_smi']:.2f} GB / {baseline['nvidia_total']:.2f} GB")

    # Load model
    print(f"\nLoading {model_name}...")
    load_start = time.time()

    processor = VibeVoiceProcessor.from_pretrained(model_path)
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )
    model.eval()

    load_time = time.time() - load_start
    print(f"Loaded in {load_time:.1f} s")

    # Get memory after loading
    loaded = get_gpu_memory_info()
    print_memory_report("After Model Loading", baseline, loaded)

    # Test generation to see peak usage
    print("\nTesting generation...")
    test_text = "Speaker 1: Testing memory usage. Speaker 2: Let's see the results!"
    voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"
    speaker_voices = [
        os.path.join(voices_dir, "en-Alice_woman.wav"),
        os.path.join(voices_dir, "en-Carter_man.wav")
    ]

    inputs = processor(
        text=[test_text],
        voice_samples=[speaker_voices],
        padding=True,
        return_tensors="pt",
        return_attention_mask=True,
    )

    # Monitor during generation
    pre_gen = get_gpu_memory_info()

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=None,
            cfg_scale=1.3,
            tokenizer=processor.tokenizer,
            generation_config={'do_sample': False},
        )

    post_gen = get_gpu_memory_info()
    print_memory_report("During Generation", pre_gen, post_gen)

    # Peak memory stats
    if torch.cuda.is_available():
        peak_memory = torch.cuda.max_memory_allocated() / 1e9
        peak_reserved = torch.cuda.max_memory_reserved() / 1e9
        print("\nPeak Memory Usage:")
        print(f" Peak Allocated: {peak_memory:.2f} GB")
        print(f" Peak Reserved: {peak_reserved:.2f} GB")

    # Clean up
    del model
    del processor
    clear_gpu_memory()

    return {
        'name': model_name,
        'allocated': loaded['allocated'] - baseline['allocated'],
        'reserved': loaded['reserved'] - baseline['reserved'],
        'nvidia_smi': loaded['nvidia_smi'] - baseline['nvidia_smi'],
        'peak_allocated': peak_memory,
        'peak_reserved': peak_reserved
    }

def main():
    print("="*70)
    print("ACCURATE VRAM MEASUREMENT FOR VIBEVOICE")
    print("="*70)
    print("\nNote: PyTorch reserves extra memory for efficiency.")
    print("nvidia-smi shows total reserved memory, not just allocated.")

    models = [
        {
            "path": "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-pt",
            "name": "16-bit Original"
        },
        {
            "path": "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit",
            "name": "4-bit Quantized"
        }
    ]

    results = []
    for model_info in models:
        try:
            result = test_model_memory(model_info["path"], model_info["name"])
            results.append(result)
            time.sleep(5)
        except Exception as e:
            print(f"Error testing {model_info['name']}: {e}")

    # Summary
    print("\n" + "="*70)
    print("MEMORY USAGE SUMMARY")
    print("="*70)
    print(f"\n{'Model':<20} {'Allocated':<12} {'Reserved':<12} {'nvidia-smi':<12} {'Peak':<12}")
    print("-"*70)

    for r in results:
        print(f"{r['name']:<20} "
              f"{r['allocated']:<12.2f} "
              f"{r['reserved']:<12.2f} "
              f"{r['nvidia_smi']:<12.2f} "
              f"{r['peak_allocated']:<12.2f}")

    print("\n💡 Key Insights:")
    print("- 'Allocated' = Actual model weights in memory")
    print("- 'Reserved' = Total GPU memory reserved by PyTorch (includes cache)")
    print("- 'nvidia-smi' = What nvidia-smi reports (includes all overhead)")
    print("- The difference is PyTorch's memory pool for efficiency")

if __name__ == "__main__":
    main()
4bit/use_quantized_model.py ADDED
@@ -0,0 +1,70 @@
#!/usr/bin/env python
"""
Simple example of using the pre-quantized VibeVoice model
No need for on-the-fly quantization - loads much faster!
"""

import os
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

def main():
    # Path to the pre-quantized model
    model_path = "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit"

    print("Loading pre-quantized VibeVoice 4-bit model...")

    # Load processor
    processor = VibeVoiceProcessor.from_pretrained(model_path)

    # Load the pre-quantized model
    # The quantization config is already saved in the model
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )
    model.eval()

    # Check memory usage
    memory_gb = torch.cuda.memory_allocated() / 1e9
    print(f"✅ Model loaded! Memory usage: {memory_gb:.1f} GB")

    # Example generation
    text = "Speaker 1: Welcome to our podcast! Speaker 2: Thanks for having me!"

    # Voice samples (using demo voices)
    voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"
    speaker_voices = [
        os.path.join(voices_dir, "en-Alice_woman.wav"),
        os.path.join(voices_dir, "en-Carter_man.wav")
    ]

    # Process inputs
    inputs = processor(
        text=[text],
        voice_samples=[speaker_voices],
        padding=True,
        return_tensors="pt",
        return_attention_mask=True,
    )

    # Generate
    print(f"\nGenerating: '{text}'")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=None,
            cfg_scale=1.3,
            tokenizer=processor.tokenizer,
            generation_config={'do_sample': False},
        )

    # Save output
    output_path = "quantized_output.wav"
    processor.save_audio(outputs.speech_outputs[0], output_path=output_path)
    print(f"✅ Audio saved to: {output_path}")

if __name__ == "__main__":
    main()
4bit/vibevoice_7gb_target.py ADDED
@@ -0,0 +1,196 @@
#!/usr/bin/env python
"""
Load VibeVoice 4-bit in ~7GB VRAM.
Minimize PyTorch's memory pool overhead.
"""

import os
import gc
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# CRITICAL: Set these BEFORE any CUDA operations
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128,expandable_segments:True'

# Reduce memory fraction to force PyTorch to be more conservative
torch.cuda.set_per_process_memory_fraction(0.75)  # This limits reserved memory

def get_memory_stats():
    """Get detailed memory statistics"""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        free = torch.cuda.mem_get_info()[0] / 1e9
        total = torch.cuda.mem_get_info()[1] / 1e9
        return {
            'allocated': allocated,
            'reserved': reserved,
            'free': free,
            'total': total,
            'used': total - free
        }
    return {}

def load_model_minimal(model_path):
    """Load model with absolute minimal memory overhead"""
    print("Loading 4-bit model with minimal overhead...")

    # Start clean
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # Report initial state
    stats = get_memory_stats()
    print("\nInitial state:")
    print(f" GPU total: {stats['total']:.2f} GB")
    print(f" GPU used: {stats['used']:.2f} GB")
    print(f" GPU free: {stats['free']:.2f} GB")

    # Load processor
    processor = VibeVoiceProcessor.from_pretrained(model_path)

    # Load model - let it use the default device map
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
    )

    # Immediately set to eval and disable gradients
    model.eval()
    model.requires_grad_(False)

    # Force cleanup
    gc.collect()
    torch.cuda.empty_cache()

    # Report after loading
    stats = get_memory_stats()
    print("\nAfter loading:")
    print(f" Allocated: {stats['allocated']:.2f} GB (actual model)")
    print(f" Reserved: {stats['reserved']:.2f} GB (PyTorch total)")
    print(f" Overhead: {stats['reserved'] - stats['allocated']:.2f} GB")
    print(f" System reports: {stats['used']:.2f} GB used")

    return model, processor

def generate_minimal(model, processor, text, speaker_voices):
    """Generate with minimal memory overhead"""
    # Process inputs
    inputs = processor(
        text=[text],
        voice_samples=[speaker_voices],
        padding=True,
        return_tensors="pt",
        return_attention_mask=True,
    )

    # Generate without gradients; the KV cache stays enabled since disabling it
    # saves little memory here and slows generation down
    with torch.no_grad():
        # Temporarily reduce memory fragmentation
        torch.cuda.empty_cache()

        outputs = model.generate(
            **inputs,
            max_new_tokens=None,
            cfg_scale=1.3,
            tokenizer=processor.tokenizer,
            generation_config={
                'do_sample': False,
                'use_cache': True,
            },
        )

    # Cleanup
    del inputs
    gc.collect()

    return outputs

def try_memory_reduction_tricks():
    """Additional tricks to reduce memory"""
    print("\n🔧 Applying memory reduction tricks...")

    # 1. Reduce CUDA kernel reservation
    if hasattr(torch.cuda, 'set_allocator_settings'):
        torch.cuda.set_allocator_settings(backend='native')

    # 2. Force synchronization and cleanup
    torch.cuda.synchronize()
    torch.cuda.empty_cache()

    # 3. Try to release unused cached blocks
    allocated_before = torch.cuda.memory_allocated()
    reserved_before = torch.cuda.memory_reserved()

    # This might help
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.empty_cache()

    allocated_after = torch.cuda.memory_allocated()
    reserved_after = torch.cuda.memory_reserved()

    if reserved_before > reserved_after:
        print(f" ✓ Freed {(reserved_before - reserved_after) / 1e9:.2f} GB")

def main():
    # Paths
    model_path = "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit"
    voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"

    print("="*60)
    print("VIBEVOICE 4-BIT - 7GB TARGET MODE")
    print("="*60)

    # Apply tricks before loading
    try_memory_reduction_tricks()

    # Load model
    model, processor = load_model_minimal(model_path)

    # Try to compact memory after loading
    try_memory_reduction_tricks()

    # Test generation
    test_text = "Speaker 1: Testing minimal memory. Speaker 2: Hope it works!"
    speaker_voices = [
        os.path.join(voices_dir, "en-Alice_woman.wav"),
        os.path.join(voices_dir, "en-Carter_man.wav")
    ]

    print("\n🎤 Generating audio...")
    outputs = generate_minimal(model, processor, test_text, speaker_voices)

    # Final stats
    stats = get_memory_stats()
    print("\nFinal memory usage:")
    print(f" Allocated: {stats['allocated']:.2f} GB")
    print(f" Reserved: {stats['reserved']:.2f} GB")
    print(f" Total used: {stats['used']:.2f} GB")

    # Save output
    output_path = "7gb_target_output.wav"
    processor.save_audio(outputs.speech_outputs[0], output_path=output_path)
    print(f"\n✅ Audio saved to: {output_path}")

    # Analysis
    print("\n📊 Analysis:")
    overhead = stats['reserved'] - stats['allocated']
    print(f"The {overhead:.2f} GB overhead comes from:")
    print("- PyTorch memory pool fragmentation")
    print("- CUDA kernel workspace")
    print("- Temporary buffers for operations")
    print("\n💡 The model IS 6.6GB, but PyTorch needs workspace!")

    # Extreme options
    print("\n🚀 To truly get to 7GB total, you could:")
    print("1. Use bnb 3-bit quantization (experimental)")
    print("2. Prune some model layers")
    print("3. Use a custom CUDA allocator")
    print("4. Compile with torch.compile() for memory efficiency")

if __name__ == "__main__":
    main()