BERTology
There is a growing field of study concerned with investigating the inner workings of large-scale transformers like BERT (that some call “BERTology”). Some good examples of this field are:
- BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: https://arxiv.org/abs/1905.05950
- Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650
- What Does BERT Look At? An Analysis of BERT’s Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341
To help this new field develop, we have added a few additional features to the BERT/GPT/GPT-2 models that give access to their inner representations, mainly adapted from the great work of Paul Michel (https://arxiv.org/abs/1905.10650):
- accessing all the hidden-states of BERT/GPT/GPT-2,
- accessing all the attention weights for each head of BERT/GPT/GPT-2,
- retrieving the heads’ output values and gradients to be able to compute a head importance score and prune heads, as explained in https://arxiv.org/abs/1905.10650 (see the sketches below).
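As a minimal sketch of the first two features (using the `bert-base-uncased` checkpoint purely for illustration), hidden states and attention weights can be requested with the `output_hidden_states` and `output_attentions` arguments of the forward pass, and heads can be removed with `prune_heads`:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERTology studies the inner workings of BERT.", return_tensors="pt")

# Ask the forward pass to return all hidden states and attention weights
outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

# Tuple of num_layers + 1 tensors (embedding output plus one per layer),
# each of shape (batch_size, sequence_length, hidden_size)
hidden_states = outputs.hidden_states

# Tuple of num_layers tensors, each of shape
# (batch_size, num_heads, sequence_length, sequence_length)
attentions = outputs.attentions

# Prune heads in place: {layer index: list of head indices to remove in that layer}
model.prune_heads({0: [0, 2], 11: [5]})
```

The `output_hidden_states` and `output_attentions` arguments work the same way for the GPT and GPT-2 model classes.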
To help you understand and use these features, we have added a specific example script, bertology.py, which extracts information from and prunes a model pre-trained on GLUE.
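To sketch the idea behind the head importance score (a simplified illustration of the head-masking approach of Michel et al., not the full example script): scale each head’s output by a differentiable mask, backpropagate a loss, and take the absolute gradient of the mask as an importance proxy. BERT’s forward pass accepts such a mask through its `head_mask` argument:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Note: the classification head of this checkpoint is randomly initialized;
# in practice you would use a model fine-tuned on your task (e.g. on GLUE).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])

# One differentiable mask value per head (num_layers x num_heads)
head_mask = torch.ones(
    model.config.num_hidden_layers,
    model.config.num_attention_heads,
    requires_grad=True,
)

outputs = model(**inputs, labels=labels, head_mask=head_mask)
outputs.loss.backward()

# |dL/d(mask)| as a proxy for each head's importance (Michel et al., 2019)
importance = head_mask.grad.abs()
print(importance)
```

Heads with consistently low importance across a dataset are the candidates for pruning via `prune_heads`.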