Flash Attention
#6 opened by skymeng
Hello,
When I train two separately initialized base models (Qwen2.5-1.5B and Qwen2.5-0.5B) on identical datasets with Flash Attention enabled, the validation loss is consistently and markedly higher than when training without Flash Attention (a minimal sketch of the two configurations is below). Has this behavior been documented or studied in existing research?
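
For reference, a minimal sketch of the two setups being compared, assuming the models are loaded through Hugging Face `transformers`; the dtype and loading details here are illustrative assumptions, not my exact training script:

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-1.5B"  # same comparison done with Qwen/Qwen2.5-0.5B

# Baseline: default attention implementation (eager/SDPA).
baseline_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)

# Variant: Flash Attention 2 (requires the `flash-attn` package and a
# supported GPU; FA2 only runs in fp16/bf16).
flash_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

Everything else (data, hyperparameters, seeds) is held fixed between the two runs; only `attn_implementation` differs.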
Thanks
