Jason Corkill (jasoncorkill)
AI & ML interests: Human data annotation
Recent Activity
- Liked a dataset (1 day ago): Rapidata/text-2-video-human-preferences-sora-2
- Liked a dataset (1 day ago): Rapidata/text-2-video-human-preferences-sora-2-pro
- Liked a dataset (about 1 month ago): Rapidata/HunyuanImage-2.1_t2i_human_preference
 
					
	
		
					
Replied to their post (4 months ago):
Perhaps we can provide a couple of thousand human annotations
 
					
	
		
					
Replied to their post (4 months ago):
Interesting, what kind of data are you collecting?
 
					
	
		
					
Replied to their post (4 months ago):
Funny, we also noticed that these models will almost always revert to the question-and-answer joke format if not prompted otherwise.
 
					
	
		
Posted an update (4 months ago):
"Why did the bee get married?"
"Because he found his honey!"
This was the "funniest" of the 10,000 jokes we generated with LLMs, with 68% of respondents rating it as "funny".
Original jokes are particularly hard for LLMs: humor is nuanced, and a lot of context is needed to judge whether something is "funny", which is something that can only reliably be measured with humans.
LLMs are not equally good at generating jokes in every language. The generated English jokes turned out to be far funnier than the Japanese ones: on average, 46% of English-speaking voters found the generated joke funny. The same statistic for other languages:
Vietnamese: 44%
Portuguese: 40%
Arabic: 37%
Japanese: 28%
There is not much variance in generation quality among models for any fixed language, but Claude Sonnet 4 slightly outperforms the others in Vietnamese, Arabic, and Japanese, and Gemini 2.5 Flash leads in Portuguese and English.
We have released the 1 million (!) native-speaker ratings and the 10,000 jokes as a dataset for anyone to use:
Rapidata/multilingual-llm-jokes-4o-claude-gemini
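If you want to reproduce the per-language numbers from the released ratings, a minimal sketch along the following lines should work with the `datasets` library. The "train" split and the column names used here ("language", "funny_rate") are illustrative assumptions; check the dataset card for the actual schema.

```python
# Minimal sketch: average "funny" rate per language from the released ratings.
# The split name and the "language"/"funny_rate" columns are assumptions; the
# dataset card for Rapidata/multilingual-llm-jokes-4o-claude-gemini has the real schema.
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("Rapidata/multilingual-llm-jokes-4o-claude-gemini", split="train")

totals = defaultdict(lambda: [0.0, 0])  # language -> [sum of funny rates, joke count]
for row in ds:
    entry = totals[row["language"]]
    entry[0] += row["funny_rate"]
    entry[1] += 1

for lang, (rate_sum, count) in sorted(totals.items()):
    print(f"{lang}: {rate_sum / count:.0%} average 'funny' rate")
```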
 
					
	
		
Posted an update (5 months ago):
Imagine you could have an Image Arena score equivalent at each checkpoint during training. We released the first version of just that: Crowd-Eval.
Add one line of code to your training loop and you get a new, real-human loss curve in your W&B dashboard.
Thousands of real humans from around the world rating your model in real time, at the cost of a few dollars per checkpoint, is a game changer.
Check it out here: https://github.com/RapidataAI/crowd-eval
First 5 people to put it in their loop get 100,000 human responses for free! (ping me)
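The integration pattern is roughly the sketch below. Note that this is not crowd-eval's actual interface: `get_human_score` is a stand-in for whatever call the library exposes (see the GitHub repo for real usage). The point is one extra call per checkpoint whose score is logged to W&B alongside the usual training metrics.

```python
# Rough sketch of the integration pattern only; get_human_score() is a placeholder
# for the real crowd-eval call (see https://github.com/RapidataAI/crowd-eval).
import random
import wandb


def train_one_interval(step: int) -> float:
    """Placeholder for your actual training code; returns a dummy loss."""
    return 1.0 / (step + 1)


def get_human_score(step: int) -> float:
    """Stand-in for the crowd-eval call that gathers real human ratings for a checkpoint."""
    return random.random()  # replace with the library's actual evaluation call


wandb.init(project="t2v-training", mode="offline")  # offline so the sketch runs anywhere

for step in range(0, 1000, 100):  # checkpoint (and request human ratings) every 100 steps
    loss = train_one_interval(step)
    human_score = get_human_score(step)  # the extra line added to the loop
    wandb.log({"train_loss": loss, "human_preference": human_score}, step=step)
```

Logged this way, the human preference score shows up in the W&B dashboard as its own curve next to the training loss, which is the "real human loss curve" described above.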
 
					
	
		
					
Replied to their post (5 months ago):
Good catch :) yes, we uploaded them shortly after!
 
					
	
		
					
Replied to their post (5 months ago):
Hey Jackson, can you please elaborate?
 
					
	
		
Posted an update (5 months ago):
Benchmark Update: @google Veo3 (Text-to-Video)
Two months ago, we benchmarked @google's Veo2 model. It fell short, struggling with style consistency and temporal coherence, trailing behind Runway, Pika, @tencent, and even @alibaba-pai.
That's changed.
We just wrapped up benchmarking Veo3, and the improvements are substantial. It outperformed every other model by a wide margin across all key metrics. Not just better: dominating across style, coherence, and prompt adherence. It's rare to see such a clear lead in today's hyper-competitive T2V landscape.
Dataset coming soon. Stay tuned.
 
					
	
		
Posted an update (6 months ago):
Hidream I1 is online!
We just added Hidream I1 to our T2I leaderboard (https://www.rapidata.ai/leaderboard/image-models), benchmarked using 195k+ human responses from 38k+ annotators, all collected in under 24 hours.
It landed #3 overall, right behind:
- @openai 4o
- @black-forest-labs Flux 1 Pro
...and just ahead of @black-forest-labs Flux 1.1 Pro, @xai-org Aurora, and @google Imagen3.
Want to dig into the data? Check out our dataset here:
Rapidata/Hidream_t2i_human_preference
What model should we benchmark next?
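One way to dig into the preference data is to tally pairwise win rates per model. The sketch below assumes rows containing two competing model names and the human-preferred winner; "model_a", "model_b", and "winner" are illustrative names, not the dataset's confirmed schema, so consult the dataset card first.

```python
# Rough sketch: per-model win rates from pairwise human preference votes.
# Column names ("model_a", "model_b", "winner") and the split are assumptions;
# check the Rapidata/Hidream_t2i_human_preference dataset card for the real schema.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("Rapidata/Hidream_t2i_human_preference", split="train")

wins, appearances = Counter(), Counter()
for row in ds:
    appearances[row["model_a"]] += 1
    appearances[row["model_b"]] += 1
    wins[row["winner"]] += 1  # assumed to contain the preferred model's name

for model in sorted(appearances, key=lambda m: wins[m] / appearances[m], reverse=True):
    print(f"{model}: {wins[model] / appearances[model]:.1%} win rate")
```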
 
								 
								 
								


 
					 
					