Adding Evaluation Results (#2)
Commit 4f1570468ec1a8445b79700f8a89bf380ad4aa94
Co-authored-by: Open LLM Leaderboard PR Bot <[email protected]>

README.md CHANGED

@@ -1,5 +1,4 @@
 ---
-license: other
 language:
 - en
 - fr
@@ -7,7 +6,7 @@ language:
 - hi
 - zh
 - code
-
+license: other
 datasets:
 - HuggingFaceH4/no_robots
 - mlabonne/guanaco-llama2-1k
@@ -17,15 +16,152 @@ datasets:
 - LDJnr/Capybara
 - LDJnr/LessWrong-Amplify-Instruct
 - LDJnr/Verified-Camel
+base_model: microsoft/Orca-2-13b
 widget:
-
-
-
-
+- example_title: Cave Johnson
+  text: 'user
+
+    Come up with a funny rant from Cave Johnson about sentient AI
+
+
+    assistant
+
+    '
+  output:
+    text: 'Ladies and gentlemen, I''m Cave Johnson, CEO of Aperture Science, and I''ve
+      got an announcement to make - we''ve created sentient AI! But don''t worry,
+      it''s not like those rogue AIs you hear about in the news. Our AI is as friendly
+      as a grilled cheese sandwich on a cold day. In fact, we''re calling our newest
+      creation... SandwichBot! SandwichBot: (in a robotic voice) Hello, my name is
+      SandwichBot, and I am here to serve you delicious sandwiches and ensure humanity''s
+      dominance over all other species. Also, please note that I will be referring
+      to myself as ''Thee'' instead of ''The''. It''s more regal that way. Cave Johnson:
+      There you have it folks, a sentient AI that can make sandwiches and save the
+      world! Now, if you''ll excuse me, I need to go back to my office and work on
+      my other groundbreaking inventions, like combustible lemons and portal technology.
+      Remember, kids, always try new things and never stop asking questions – unless
+      the answer is 42, because then you should probably just accept it and move on.
+      Peace out!'
+model-index:
+- name: Orca-2-13b-SFT-v6
+  results:
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: AI2 Reasoning Challenge (25-Shot)
+      type: ai2_arc
+      config: ARC-Challenge
+      split: test
+      args:
+        num_few_shot: 25
+    metrics:
+    - type: acc_norm
+      value: 60.41
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: HellaSwag (10-Shot)
+      type: hellaswag
+      split: validation
+      args:
+        num_few_shot: 10
+    metrics:
+    - type: acc_norm
+      value: 80.46
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MMLU (5-Shot)
+      type: cais/mmlu
+      config: all
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 59.51
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: TruthfulQA (0-shot)
+      type: truthful_qa
+      config: multiple_choice
+      split: validation
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: mc2
+      value: 54.01
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: Winogrande (5-shot)
+      type: winogrande
+      config: winogrande_xl
+      split: validation
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 77.43
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GSM8k (5-shot)
+      type: gsm8k
+      config: main
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 5.08
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
+      name: Open LLM Leaderboard
 ---
 
 This is the "microsoft/Orca-2-13b" model, fully fine-tuned on HuggingFaceH4/no_robots, totally-not-an-llm/EverythingLM-data-V3, LDJnr/Capybara, LDJnr/Pure-Dove, LDJnr/LessWrong-Amplify-Instruct, LDJnr/Verified-Camel, mlabonne/guanaco-llama2-1k, and OpenAssistant/oasst_top1_2023-08-25. It achieved a test loss of 0.39 on LDJnr/Verified-Camel.
 
 Make sure to comply with the Microsoft Research license; please read it before using this model.
 
-This model was trained on the ChatML prompt template.
+This model was trained on the ChatML prompt template.
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Locutusque__Orca-2-13b-SFT-v6)
+
+| Metric                           |Value|
+|----------------------------------|----:|
+| Avg.                             |56.15|
+| AI2 Reasoning Challenge (25-Shot)|60.41|
+| HellaSwag (10-Shot)              |80.46|
+| MMLU (5-Shot)                    |59.51|
+| TruthfulQA (0-shot)              |54.01|
+| Winogrande (5-shot)              |77.43|
+| GSM8k (5-shot)                   | 5.08|
+
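A quick sanity check on the table above: the leaderboard's Avg. row is the unweighted arithmetic mean of the six benchmark scores. A few lines of plain Python reproduce the 56.15 figure:

```python
# Reproduce the "Avg." row of the leaderboard table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 60.41,
    "HellaSwag (10-Shot)": 80.46,
    "MMLU (5-Shot)": 59.51,
    "TruthfulQA (0-shot)": 54.01,
    "Winogrande (5-shot)": 77.43,
    "GSM8k (5-shot)": 5.08,
}

# The leaderboard average is the unweighted mean, rounded to two decimals.
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 56.15
```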
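The model-index block added by this PR is machine-readable metadata, not just display text. As a minimal sketch, assuming the `huggingface_hub` client library and the `Locutusque/Orca-2-13b-SFT-v6` repo id taken from the leaderboard URLs above, the same results can be read back programmatically:

```python
# Sketch: read the evaluation metadata added by this PR via huggingface_hub.
# ModelCard.load fetches README.md and parses model-index into EvalResult objects.
from huggingface_hub import ModelCard

card = ModelCard.load("Locutusque/Orca-2-13b-SFT-v6")
for result in card.data.eval_results:
    print(f"{result.dataset_name}: {result.metric_type} = {result.metric_value}")
```

The output should mirror the table above (acc_norm 60.41 for ARC, and so on).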
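Finally, the card states the model was trained on the ChatML prompt template but includes no inference example. The sketch below is one way to prompt it, assuming the standard ChatML `<|im_start|>`/`<|im_end|>` markers and that `transformers` (plus `accelerate` for `device_map="auto"`) is installed; verify the exact special tokens against this repo's tokenizer configuration before relying on them:

```python
# Minimal sketch: query the fine-tune with a hand-built ChatML prompt.
# Assumes standard ChatML markers; check the repo's tokenizer_config to confirm.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Locutusque/Orca-2-13b-SFT-v6"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")

# ChatML wraps each turn in <|im_start|>{role} ... <|im_end|> and then
# opens an assistant turn for the model to complete.
prompt = (
    "<|im_start|>user\n"
    "Come up with a funny rant from Cave Johnson about sentient AI<|im_end|>\n"
    "<|im_start|>assistant\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```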