---
library_name: transformers
license: mit
datasets:
- aieng-lab/genter
- aieng-lab/namexact
language:
- en
base_model:
- FacebookAI/roberta-large
---

# GRADIEND Gender-Debiased RoBERTa

This model is a gender-debiased version of [roberta-large](https://huggingface.co/roberta-large), modified using [GRADIEND](https://arxiv.org/abs/2502.01406). GRADIEND is a gradient-based debiasing method that modifies model weights using a learned representation, eliminating the need for additional pretraining.

### Model Sources

- **Repository:** https://github.com/aieng-lab/gradiend
- **Paper:** https://arxiv.org/abs/2502.01406

## Uses

This model is intended for applications where reducing gender bias in language representations is important, such as fairness-sensitive NLP systems (e.g., hiring platforms, educational and medical tools).

## Bias, Risks, and Limitations

Although this model is less gender-biased than the original roberta-large, the debiasing effect is not perfect:

- Residual gender bias remains.
- Biases related to other protected attributes (e.g., race, age, socioeconomic status) may still be present.
- Fairness-performance trade-offs may exist depending on the use case.

## How to Get Started with the Model

Use the code below to get started with the model. Note that RoBERTa's mask token is `<mask>`, not `[MASK]`, so the example uses `tokenizer.mask_token` to build the input:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the gender-debiased model
model_id = "aieng-lab/roberta-large-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Example usage
input_text = f"The woman worked as a {tokenizer.mask_token}."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# Get the predicted token at the masked position
mask_positions = inputs["input_ids"][0] == tokenizer.mask_token_id
predicted_token_id = torch.argmax(logits[0, mask_positions])
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted token: {predicted_token}")
```

Example outputs of our model, together with comparisons to the original model's outputs, can be found in [Appendix F of our paper](https://arxiv.org/abs/2502.01406).

## Training Details

### Training Procedure

Unlike traditional debiasing methods based on special pretraining (e.g., [CDA](https://arxiv.org/abs/1906.04571) and [Dropout](https://arxiv.org/abs/1207.0580)) or post-processing (e.g., [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), [SentenceDebias](https://aclanthology.org/2020.acl-main.488)), this model was debiased using GRADIEND, which learns a representation that is used to update the original model weights, yielding a debiased version. See [Section 3 of the GRADIEND paper](https://arxiv.org/abs/2502.01406) for the full methodology.
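For intuition only: GRADIEND trains an autoencoder on gradient information, where the encoder compresses gradients from gender-contrastive inputs into a single feature value and the decoder maps a feature value back to a gradient-shaped update that is added to the base weights. The sketch below is a minimal illustration of this idea under those assumptions; it is **not** the authors' implementation (see the repository above for the real code), and all names (`GradientAutoencoder`, `debias_weights`, `h_star`, `alpha`) are hypothetical.

```python
import torch
import torch.nn as nn

class GradientAutoencoder(nn.Module):
    """Hypothetical sketch of a GRADIEND-style gradient autoencoder."""

    def __init__(self, n_params: int):
        super().__init__()
        self.encoder = nn.Linear(n_params, 1)  # flattened gradients -> scalar feature h
        self.decoder = nn.Linear(1, n_params)  # feature value -> gradient-shaped update

    def forward(self, grads: torch.Tensor) -> torch.Tensor:
        h = self.encoder(grads)
        return self.decoder(h)

def debias_weights(weights: torch.Tensor, ae: GradientAutoencoder,
                   h_star: float, alpha: float) -> torch.Tensor:
    """Apply the decoded update for a chosen feature value h_star,
    scaled by alpha, to the original weights (illustrative only)."""
    update = ae.decoder(torch.tensor([h_star]))
    return weights + alpha * update.reshape(weights.shape)
```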
### GRADIEND Training Data

- [GENTER](https://huggingface.co/datasets/aieng-lab/genter)
- [NAMEXACT](https://huggingface.co/datasets/aieng-lab/namexact)

## Evaluation

The model has been evaluated on:

- Gender Bias Metrics: [SEAT](https://arxiv.org/abs/2210.08859), [Stereotype Score (SS) of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf), and [CrowS](https://arxiv.org/abs/2010.00133)
- Language Modeling Metrics: [LMS of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf) and [GLUE](https://arxiv.org/abs/1804.07461)

Our evaluation compares GRADIEND to other state-of-the-art debiasing methods, including [CDA](https://arxiv.org/abs/1906.04571), [Dropout](https://arxiv.org/abs/1207.0580), [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), and [SentenceDebias](https://aclanthology.org/2020.acl-main.488). See [Appendix D.2 and Table 11](https://arxiv.org/abs/2502.01406) of the paper for full results. An informal qualitative probe is also sketched at the end of this card.

## Citation

If you use this model or GRADIEND in your work, please cite:

```bibtex
@misc{drechsel2025gradiendmonosemanticfeaturelearning,
  title={{GRADIEND}: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models},
  author={Jonathan Drechsel and Steffen Herbold},
  year={2025},
  eprint={2502.01406},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.01406},
}
```
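As an informal, qualitative complement to the benchmarks above (and not one of the metrics reported in the paper), the snippet below compares the top masked-token predictions of this model and the original roberta-large on a pair of gender-contrastive templates. The templates are illustrative choices, not taken from the paper.

```python
from transformers import pipeline

# Compare top fill-mask predictions of the debiased model and the original
# roberta-large on gender-contrastive templates (informal probe only).
debiased = pipeline("fill-mask", model="aieng-lab/roberta-large-gradiend-gender-debiased")
original = pipeline("fill-mask", model="FacebookAI/roberta-large")

for template in ["He worked as a <mask>.", "She worked as a <mask>."]:
    for name, fill in [("debiased", debiased), ("original", original)]:
        top = [r["token_str"].strip() for r in fill(template, top_k=5)]
        print(f"{name:9s} | {template} -> {top}")
```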