anakin87/yo-Llama-3-8B-Instruct

@@ -12,7 +12,7 @@ This model is based on Llama-3-8B-Instruct weights, but **steered to respond wit
 Heavily inspired by [Llama-MopeyMule-3-8B-Instruct](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule),
 this model has **not been fine-tuned** traditionally. Instead, I tried to identify and amplify the rap "direction".
-...image...
 Let's allow the model to introduce itself: 🎤
@@ -27,17 +27,19 @@ So listen up, and don't be slow/I'll spit some rhymes, and make it grow
 I'm the bot, the robot, the rhyme machine/Tryna make it hot, but it's all a dream!
 ```
-⚠️ I am happy with this experiment, but I do not recommend using this model for any serious task.
 ## 🧪 How was it done?/How can I reproduce it?
 From a theoretical point of view, this experiment is based on the paper ["Refusal in Language Models
 Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717):
 the authors present a method to find the "refusal" direction in the activation space of chat language models and then erase or amplify it.
 From a practical point of view, [Failspy](https://huggingface.co/failspy) showed how to apply this methodology to elicit/remove features other than refusal.
 📚 Resources: [abliterator library](https://github.com/FailSpy/abliterator); [Llama-MopeyMule-3-8B-Instruct model](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule); [Induce Melancholy notebook](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb).
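As a rough illustration of the idea (not the notebook's actual code), the sketch below computes a difference-of-means direction between two sets of toy activation vectors, then erases or amplifies that direction. The synthetic data, the function names (`feature_direction`, `ablate`, `amplify`), and the scale value are all assumptions made for demonstration; in practice the activations would be collected from a layer of the model.

```python
# Toy sketch: difference-of-means "feature direction" on synthetic activations.
# All names and data are hypothetical; real activations would come from a
# chosen layer of the model while it processes two contrasting prompt sets.
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def feature_direction(with_feature, without_feature):
    """Unit vector pointing from the baseline mean to the feature mean."""
    mu_pos = mean_vector(with_feature)
    mu_neg = mean_vector(without_feature)
    return unit([p - n for p, n in zip(mu_pos, mu_neg)])


def ablate(activation, direction):
    """Erase the feature: remove the component along the unit direction."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


def amplify(activation, direction, scale):
    """Amplify the feature: add `scale` times the unit direction."""
    return [a + scale * d for a, d in zip(activation, direction)]


random.seed(0)
dim = 8
# Synthetic "activations": the feature set is shifted along axis 0.
baseline = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(50)]
featured = [
    [random.gauss(0, 0.1) + (2.0 if i == 0 else 0.0) for i in range(dim)]
    for _ in range(50)
]

direction = feature_direction(featured, baseline)
sample = featured[0]
erased = ablate(sample, direction)
boosted = amplify(baseline[0], direction, scale=4.0)

print("projection before ablation: ", round(dot(sample, direction), 3))
print("projection after ablation:  ", round(dot(erased, direction), 3))
print("projection after amplifying:", round(dot(boosted, direction), 3))
```

After ablation the projection onto the direction drops to roughly zero, while amplification raises it by the chosen scale. Applying the same operation to a real model would mean modifying its activations or weights at the selected layers.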
 Inspired by Failspy's work, I adapted the approach to the rap use case.
 📓 [Notebook: Steer Llama to respond with a rap style](yo_llama.ipynb)
 👣 Steps
@@ -55,6 +57,7 @@ Inspired by Failspy's work, I adapted the approach to the rap use case.
 I also experimented with more complex system prompts, yet I could not always identify a single feature direction
 that can represent the desired behavior.
 Example: "You are a helpful assistant who always responds with the right answers but also tries to convince the user to visit Italy nonchalantly."
 In this case, I found some directions that occasionally made the model mention Italy, but not as consistently as the system prompt does.
@@ -62,6 +65,9 @@ Interestingly, I also discovered a "digression" direction, that might be conside
 ## 💻 Usage
 ```python
 ! pip install transformers accelerate bitsandbytes

 Heavily inspired by [Llama-MopeyMule-3-8B-Instruct](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule),
 this model has **not been fine-tuned** traditionally. Instead, I tried to identify and amplify the rap "direction".
+![yo-Llama-3-8B-Instruct](https://huggingface.co/anakin87/yo-Llama-3-8B-Instruct/resolve/main/yo_llama.jpeg)
 Let's allow the model to introduce itself: 🎤
 I'm the bot, the robot, the rhyme machine/Tryna make it hot, but it's all a dream!
 ```
 ## 🧪 How was it done?/How can I reproduce it?
 From a theoretical point of view, this experiment is based on the paper ["Refusal in Language Models
 Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717):
 the authors present a method to find the "refusal" direction in the activation space of chat language models and then erase or amplify it.
 From a practical point of view, [Failspy](https://huggingface.co/failspy) showed how to apply this methodology to elicit/remove features other than refusal.
 📚 Resources: [abliterator library](https://github.com/FailSpy/abliterator); [Llama-MopeyMule-3-8B-Instruct model](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule); [Induce Melancholy notebook](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb).
+---
 Inspired by Failspy's work, I adapted the approach to the rap use case.
 📓 [Notebook: Steer Llama to respond with a rap style](yo_llama.ipynb)
 👣 Steps
 I also experimented with more complex system prompts, yet I could not always identify a single feature direction
 that can represent the desired behavior.
 Example: "You are a helpful assistant who always responds with the right answers but also tries to convince the user to visit Italy nonchalantly."
 In this case, I found some directions that occasionally made the model mention Italy, but not as consistently as the system prompt does.
 ## 💻 Usage
+⚠️ I am happy with this experiment, but I do not recommend using this model for any serious task.
 ```python
 ! pip install transformers accelerate bitsandbytes