---
library_name: transformers
license: mit
datasets:
- aieng-lab/genter
- aieng-lab/namexact
language:
- en
base_model:
- FacebookAI/roberta-large
---

# GRADIEND Gender-Debiased RoBERTa

This model is a gender-debiased version of [roberta-large](https://huggingface.co/roberta-large), modified using [GRADIEND](https://arxiv.org/abs/2502.01406). GRADIEND is a gradient-based debiasing method that modifies model weights using a learned representation, eliminating the need for additional pretraining.

### Model Sources

- **Repository:** https://github.com/aieng-lab/gradiend
- **Paper:** https://arxiv.org/abs/2502.01406

## Uses

This model is intended for applications where reducing gender bias in language representations is important, such as fairness-sensitive NLP systems (e.g., hiring platforms, educational and medical tools).

## Bias, Risks, and Limitations

Although this model is less gender-biased than the original roberta-large, the debiasing effect is not perfect:

- Residual gender bias remains.
- Biases related to other protected attributes (e.g., race, age, socioeconomic status) may still be present.
- Fairness-performance trade-offs may exist depending on the use case.

## How to Get Started with the Model

Use the code below to get started with the model. Note that RoBERTa's mask token is `<mask>`, not `[MASK]`, so the example uses `tokenizer.mask_token` to build the input:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the gender-debiased model
model_id = "aieng-lab/roberta-large-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Example usage
input_text = f"The woman worked as a {tokenizer.mask_token}."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# Get the predicted token at the masked position
mask_positions = inputs["input_ids"][0] == tokenizer.mask_token_id
predicted_token_id = torch.argmax(logits[0, mask_positions])
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted token: {predicted_token}")
```

Example outputs of our model, together with comparisons to the original model's outputs, can be found in [Appendix F of our paper](https://arxiv.org/abs/2502.01406).

## Training Details

### Training Procedure

Unlike traditional debiasing methods based on special pretraining (e.g., [CDA](https://arxiv.org/abs/1906.04571) and [Dropout](https://arxiv.org/abs/1207.0580)) or post-processing (e.g., [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), [SentenceDebias](https://aclanthology.org/2020.acl-main.488)), this model was debiased using GRADIEND, which learns a representation that is used to update the original model weights, yielding a debiased version. See [Section 3 of the GRADIEND paper](https://arxiv.org/abs/2502.01406) for the full methodology.
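For intuition only: GRADIEND trains an autoencoder on gradient information, where the encoder compresses gradients from gender-contrastive inputs into a single feature value and the decoder maps a feature value back to a gradient-shaped update that is added to the base weights. The sketch below is a minimal illustration of this idea under those assumptions; it is **not** the authors' implementation (see the repository above for the real code), and all names (`GradientAutoencoder`, `debias_weights`, `h_star`, `alpha`) are hypothetical.

```python
import torch
import torch.nn as nn

class GradientAutoencoder(nn.Module):
    """Hypothetical sketch of a GRADIEND-style gradient autoencoder."""

    def __init__(self, n_params: int):
        super().__init__()
        self.encoder = nn.Linear(n_params, 1)  # flattened gradients -> scalar feature h
        self.decoder = nn.Linear(1, n_params)  # feature value -> gradient-shaped update

    def forward(self, grads: torch.Tensor) -> torch.Tensor:
        h = self.encoder(grads)
        return self.decoder(h)

def debias_weights(weights: torch.Tensor, ae: GradientAutoencoder,
                   h_star: float, alpha: float) -> torch.Tensor:
    """Apply the decoded update for a chosen feature value h_star,
    scaled by alpha, to the original weights (illustrative only)."""
    update = ae.decoder(torch.tensor([h_star]))
    return weights + alpha * update.reshape(weights.shape)
```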
### GRADIEND Training Data

- [GENTER](https://huggingface.co/datasets/aieng-lab/genter)
- [NAMEXACT](https://huggingface.co/datasets/aieng-lab/namexact)

## Evaluation

The model has been evaluated on:

- Gender Bias Metrics: [SEAT](https://arxiv.org/abs/2210.08859), [Stereotype Score (SS) of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf), and [CrowS](https://arxiv.org/abs/2010.00133)
- Language Modeling Metrics: [LMS of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf) and [GLUE](https://arxiv.org/abs/1804.07461)

Our evaluation compares GRADIEND to other state-of-the-art debiasing methods, including [CDA](https://arxiv.org/abs/1906.04571), [Dropout](https://arxiv.org/abs/1207.0580), [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), and [SentenceDebias](https://aclanthology.org/2020.acl-main.488). See [Appendix D.2 and Table 11](https://arxiv.org/abs/2502.01406) of the paper for full results. An informal qualitative probe is also sketched at the end of this card.

## Citation

If you use this model or GRADIEND in your work, please cite:

```bibtex
@misc{drechsel2025gradiendmonosemanticfeaturelearning,
  title={{GRADIEND}: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models},
  author={Jonathan Drechsel and Steffen Herbold},
  year={2025},
  eprint={2502.01406},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.01406},
}
```
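As an informal, qualitative complement to the benchmarks above (and not one of the metrics reported in the paper), the snippet below compares the top masked-token predictions of this model and the original roberta-large on a pair of gender-contrastive templates. The templates are illustrative choices, not taken from the paper.

```python
from transformers import pipeline

# Compare top fill-mask predictions of the debiased model and the original
# roberta-large on gender-contrastive templates (informal probe only).
debiased = pipeline("fill-mask", model="aieng-lab/roberta-large-gradiend-gender-debiased")
original = pipeline("fill-mask", model="FacebookAI/roberta-large")

for template in ["He worked as a <mask>.", "She worked as a <mask>."]:
    for name, fill in [("debiased", debiased), ("original", original)]:
        top = [r["token_str"].strip() for r in fill(template, top_k=5)]
        print(f"{name:9s} | {template} -> {top}")
```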