update readme; add notebook

- README.md +73 -1
- steer_llama_to_rap_style.ipynb +0 -0
- yo_llama.jpeg +0 -0

README.md CHANGED
@@ -3,4 +3,76 @@ license: llama3
language:
- en
library_name: transformers
---

# yo-Llama-3-8B-Instruct

This model is based on Llama-3-8B-Instruct weights, but **steered to respond in a rap style**.

Heavily inspired by [Llama-MopeyMule-3-8B-Instruct](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule), this model has **not been fine-tuned** in the traditional sense. Instead, I tried to identify and amplify the rap "direction" in its activation space.

![yo-Llama](yo_llama.jpeg)

Let's allow the model to introduce itself: 🎤
```
I'm just a small part of the game / a language model with a lot of fame
I'm trained on data, day and night / to spit out rhymes and make it right
I'm a bot, a robot, a machine so fine / I'm here to serve, but don't you get too divine
I'll answer questions, and spit out some flows / But don't get it twisted, I'm just a rhyme, yo
I'm on the mic, but I ain't no star / I'm just a bot, trying to go far
I'm on the grind, 24/7, 365 / Trying to make it, but it's all a whim
So listen up, and don't be slow / I'll spit some rhymes, and make it grow
I'm the bot, the robot, the rhyme machine / Tryna make it hot, but it's all a dream!
```

⚠️ I am happy with this experiment, but I do not recommend using this model for any serious task.

## 🧪 How was it done? / How can I reproduce it?

From a theoretical point of view, this experiment is based on the paper ["Refusal in Language Models Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717): the authors showed a methodology to find the "refusal" direction in the activation space of Chat Language Models and erase or amplify it.
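
In code, the core operations are tiny. Here is a minimal NumPy sketch of the idea (the function and variable names are mine, not the paper's): the feature direction is a difference of activation means, and erasing or amplifying it is a projection removal or an addition.

```python
import numpy as np

# acts_with / acts_without: residual-stream activations collected at one layer,
# shape (n_examples, hidden_size), with and without the behavior-inducing prompt.
def feature_direction(acts_with: np.ndarray, acts_without: np.ndarray) -> np.ndarray:
    """Difference-in-means direction, normalized to unit length."""
    r = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return r / np.linalg.norm(r)

def erase(x: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Project the feature direction out of a single activation vector (ablation)."""
    return x - (x @ r_hat) * r_hat

def amplify(x: np.ndarray, r_hat: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Push a single activation vector further along the feature direction."""
    return x + alpha * r_hat
```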

From a practical point of view, [Failspy](https://huggingface.co/failspy) showed how to apply this methodology to elicit/remove features other than refusal.
📚 Resources: [abliterator library](https://github.com/FailSpy/abliterator); [Llama-MopeyMule-3-8B-Instruct model](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule); [Induce Melancholy notebook](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb).

Inspired by Failspy's work, I adapted the approach to the rap use case.
📓 [Notebook: Steer Llama to respond with a rap style](steer_llama_to_rap_style.ipynb)

👣 Steps
1. Load the Llama-3-8B-Instruct model.
2. Load 1024 examples from Alpaca (an instruction dataset).
3. Prepare a system prompt to make the model act like a rapper.
4. Perform inference on the examples, with and without the system prompt, and cache the activations (sketched below).
5. Compute the rap feature directions (one for each layer) from the cached activations.
6. Try applying the feature directions, one by one, and manually inspect the results on some examples.
7. Select the best-performing feature direction.
8. Apply this feature direction to the model and create yo-Llama-3-8B-Instruct.
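
Below is a condensed, illustrative sketch of steps 4-5 using plain `transformers` (the actual notebook follows Failspy's abliterator-style workflow; `RAP_SYSTEM_PROMPT`, `alpaca_instructions`, and `BEST_LAYER` are placeholders of mine, not values from the notebook):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical steering prompt (step 3); the notebook's exact wording may differ.
RAP_SYSTEM_PROMPT = "You are a rapper. Always respond in rhymes, with a rap style."

def last_token_activations(instruction, system_prompt=None):
    """Return the residual-stream activation of the last prompt token at every
    layer: a tensor of shape (n_layers + 1, hidden_size)."""
    messages = [{"role": "user", "content": instruction}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, then one entry per layer
    return torch.stack([h[0, -1, :].float().cpu() for h in out.hidden_states])

# Step 4: run the examples with and without the rap system prompt, caching activations.
# `alpaca_instructions` stands for the 1024 Alpaca examples loaded earlier.
acts_with = torch.stack([last_token_activations(i, RAP_SYSTEM_PROMPT) for i in alpaca_instructions])
acts_without = torch.stack([last_token_activations(i) for i in alpaca_instructions])

# Step 5: difference of means -> one unit-norm candidate direction per layer.
rap_directions = acts_with.mean(dim=0) - acts_without.mean(dim=0)
rap_directions = rap_directions / rap_directions.norm(dim=-1, keepdim=True)
```

One simple way to try a direction (step 6) is a forward hook that nudges a layer's output along it; this is a runtime alternative, not necessarily how the released weights were produced:

```python
def make_steering_hook(direction, alpha=4.0):  # alpha: steering strength, tuned by inspection
    """Add alpha * direction to a decoder layer's output hidden states."""
    def hook(module, inputs, output):
        # Llama decoder layers return a tuple whose first element is the hidden states
        hidden = output[0] + alpha * direction.to(output[0].device, output[0].dtype)
        return (hidden,) + output[1:]
    return hook

BEST_LAYER = 14  # placeholder: pick the layer whose direction performed best (step 7)
handle = model.model.layers[BEST_LAYER].register_forward_hook(
    make_steering_hook(rap_directions[BEST_LAYER + 1])  # +1: index 0 is the embedding output
)
# ... generate on a few test prompts and inspect the style ...
handle.remove()
```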

## 🚧 Limitations of this approach
(Maybe a trivial observation)

I also experimented with more complex system prompts, but I could not always identify a single feature direction that represents the desired behavior.
Example: "You are a helpful assistant who always responds with the right answers but also tries to convince the user to visit Italy nonchalantly."

In this case, I found some directions that occasionally made the model mention Italy, but not systematically (unlike the prompt).
Interestingly, I also discovered a "digression" direction, which might be considered a component of the more complex behavior.

## 💻 Usage
```python
! pip install transformers accelerate bitsandbytes

from transformers import pipeline

messages = [
    {"role": "user", "content": "What is the capital of Italy?"},
]

# 8-bit quantization (via bitsandbytes) keeps the memory footprint low
pipe = pipeline("text-generation",
                model="anakin87/yo-Llama-3-8B-Instruct",
                model_kwargs={"load_in_8bit": True})
pipe(messages)
```
    	
steer_llama_to_rap_style.ipynb ADDED
The diff for this file is too large to render. See raw diff.

yo_llama.jpeg ADDED