improve readme
README.md
CHANGED
@@ -12,7 +12,7 @@ This model is based on Llama-3-8B-Instruct weights, but **steered to respond wit
Heavily inspired by [Llama-MopeyMule-3-8B-Instruct](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule),
this model has **not been fine-tuned** traditionally. Instead, I tried to identify and amplify the rap "direction".

-

Let's allow the model to introduce itself:
@@ -27,17 +27,19 @@ So listen up, and don't be slow/I'll spit some rhymes, and make it grow
I'm the bot, the robot, the rhyme machine/Tryna make it hot, but it's all a dream!
```

-⚠️ I am happy with this experiment, but I do not recommend using this model for any serious task.
-
## 🧪 How was it done?/How can I reproduce it?
From a theoretical point of view, this experiment is based on the paper ["Refusal in Language Models
Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717):
the authors showed a methodology to find the "refusal" direction in the activation space of Chat Language Models and erase or amplify it.

From a practical point of view, [Failspy](https://huggingface.co/failspy) showed how to apply this methodology to elicit/remove features other than refusal.
Resources: [abliterator library](https://github.com/FailSpy/abliterator); [Llama-MopeyMule-3-8B-Instruct model](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule); [Induce Melancholy notebook](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb).

Inspired by Failspy's work, I adapted the approach to the rap use case.
[Notebook: Steer Llama to respond with a rap style](yo_llama.ipynb)

👣 Steps
@@ -55,6 +57,7 @@ Inspired by Failspy's work, I adapted the approach to the rap use case.
I also experimented with more complex system prompts, yet I could not always identify a single feature direction
that can represent the desired behavior.
Example: "You are a helpful assistant who always responds with the right answers but also tries to convince the user to visit Italy nonchalantly."

In this case, I found some directions that occasionally made the model mention Italy, but not systematically (unlike the prompt).
@@ -62,6 +65,9 @@ Interestingly, I also discovered a "digression" direction, that might be conside
## 💻 Usage
```python
! pip install transformers accelerate bitsandbytes
Heavily inspired by [Llama-MopeyMule-3-8B-Instruct](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule),
this model has **not been fine-tuned** traditionally. Instead, I tried to identify and amplify the rap "direction".

+

Let's allow the model to introduce itself:
I'm the bot, the robot, the rhyme machine/Tryna make it hot, but it's all a dream!
```

## 🧪 How was it done?/How can I reproduce it?
From a theoretical point of view, this experiment is based on the paper ["Refusal in Language Models
Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717):
the authors showed a methodology to find the "refusal" direction in the activation space of Chat Language Models and erase or amplify it.

From a practical point of view, [Failspy](https://huggingface.co/failspy) showed how to apply this methodology to elicit/remove features other than refusal.
+
Resources: [abliterator library](https://github.com/FailSpy/abliterator); [Llama-MopeyMule-3-8B-Instruct model](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule); [Induce Melancholy notebook](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb).
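
As a rough illustration of the idea of finding a direction and then erasing or amplifying it, here is a minimal NumPy sketch on synthetic activations. It is not the code from the notebook or the abliterator library. The function names `feature_direction`, `ablate` and `amplify`, the 16-dimensional toy data and the scaling factor `alpha` are all assumptions made for this example. The direction is estimated as a difference of mean activations, which is my simplified reading of the approach in the paper.

```python
import numpy as np


def feature_direction(acts_with, acts_without):
    """Unit vector pointing from the baseline mean to the feature mean."""
    diff = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts, direction):
    """Erase the feature: remove each activation's component along `direction`."""
    return acts - np.outer(acts @ direction, direction)


def amplify(acts, direction, alpha):
    """Amplify the feature: shift each activation by `alpha` along `direction`."""
    return acts + alpha * direction


# Synthetic stand-ins for activations collected at one layer (hypothetical data).
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.eye(d_model)[0]  # the "style" lives on the first axis in this toy setup
baseline = rng.normal(size=(200, d_model))                  # plain prompts
styled = rng.normal(size=(200, d_model)) + 3.0 * true_dir   # prompts showing the style

direction = feature_direction(styled, baseline)
erased = ablate(styled, direction)                  # style component removed
boosted = amplify(baseline, direction, alpha=4.0)   # style component added
```

In a real model the two activation sets would come from running prompts with and without the target behavior, and the edit would be applied inside the forward pass or baked into the weights rather than to a NumPy array.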

+---
+
Inspired by Failspy's work, I adapted the approach to the rap use case.
+
[Notebook: Steer Llama to respond with a rap style](yo_llama.ipynb)

👣 Steps
I also experimented with more complex system prompts, yet I could not always identify a single feature direction
that can represent the desired behavior.
+
Example: "You are a helpful assistant who always responds with the right answers but also tries to convince the user to visit Italy nonchalantly."

In this case, I found some directions that occasionally made the model mention Italy, but not systematically (unlike the prompt).
## 💻 Usage
+
+⚠️ I am happy with this experiment, but I do not recommend using this model for any serious task.
+
```python
! pip install transformers accelerate bitsandbytes