v_0.1

Files changed (8)

.gitattributes +27 -0
README.md +88 -0
config.json +39 -0
pytorch_model.bin +3 -0
sentencepiece.bpe.model +3 -0
special_tokens_map.json +1 -0
tokenizer.json +0 -0
tokenizer_config.json +1 -0

.gitattributes ADDED

	@@ -0,0 +1,27 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zstandard filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
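
The patterns above decide which files in this commit are stored as Git LFS pointers instead of regular blobs. The following is a minimal sketch of that decision, assuming simple basename globbing via `fnmatchcase`; real gitattributes matching has more rules (for example directory patterns such as `saved_model/**/*`), and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase

# A subset of the glob patterns from the .gitattributes hunk above.
LFS_PATTERNS = ["*.bin", "*.bin.*", "*.h5", "*.model", "*.onnx",
                "*.pt", "*.pth", "*tfevents*"]


def is_lfs_tracked(filename, patterns=LFS_PATTERNS):
    """Approximate check: does the file's basename match any LFS glob?"""
    basename = filename.rsplit("/", 1)[-1]
    return any(fnmatchcase(basename, pattern) for pattern in patterns)


# The files touched by this commit, other than .gitattributes itself.
changed_files = ["README.md", "config.json", "pytorch_model.bin",
                 "sentencepiece.bpe.model", "special_tokens_map.json",
                 "tokenizer.json", "tokenizer_config.json"]

for name in changed_files:
    storage = "LFS pointer" if is_lfs_tracked(name) else "regular Git blob"
    print(f"{name}: {storage}")
```

Under these assumptions only `pytorch_model.bin` and `sentencepiece.bpe.model` match, which is consistent with their three-line pointer diffs further down.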

README.md ADDED

	@@ -0,0 +1,88 @@

+---
+language: italian
+tags:
+- text-classification
+- pytorch
+- tensorflow
+datasets:
+- multi_nli
+license: mit
+pipeline_tag: zero-shot-classification
+widget:
+- text: "La seconda guerra mondiale vide contrapporsi, tra il 1939 e il 1945, le cosiddette potenze dell'Asse e gli Alleati che, come già accaduto ai belligeranti della prima guerra mondiale, si combatterono su gran parte del pianeta; il conflitto ebbe inizio il 1º settembre 1939 con l'attacco della Germania nazista alla Polonia e terminò, nel teatro europeo, l'8 maggio 1945 con la resa tedesca e, in quello asiatico, il successivo 2 settembre con la resa dell'Impero giapponese dopo i bombardamenti atomici di Hiroshima e Nagasaki."
+  candidate_labels: "storia, geografia, moda, politica, macchine, cibo"
+  multi_class: true
+---
+# XLM-roBERTa-large-it-mnli
+## Version 0.1
+|                                                                                  | matched-it acc | mismatched-it acc |
+| -------------------------------------------------------------------------------- |----------------|-------------------|
+| XLM-roBERTa-large-it-mnli     | 84.75          | 85.39             |
+## Model Description
+This model takes [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) and fine-tunes it on a subset of NLI data taken from an automatically translated version of the MNLI corpus. It is intended to be used for zero-shot text classification, such as with the Hugging Face [ZeroShotClassificationPipeline](https://huggingface.co/transformers/master/main_classes/pipelines.html#transformers.ZeroShotClassificationPipeline).
+## Intended Usage
+This model is intended to be used for zero-shot text classification of Italian texts.
+Since the base model was pre-trained on 100 different languages, the
+model has shown some effectiveness in languages other than Italian as
+well. See the full list of pre-trained languages in appendix A of the
+[XLM-RoBERTa paper](https://arxiv.org/abs/1911.02116).
+For English-only classification, it is recommended to use
+[bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) or
+[a distilled bart MNLI model](https://huggingface.co/models?filter=pipeline_tag%3Azero-shot-classification&search=valhalla).
+#### With the zero-shot classification pipeline
+The model can be loaded with the `zero-shot-classification` pipeline like so:
+```python
+from transformers import pipeline
+classifier = pipeline("zero-shot-classification",
+                      model="Jiva/xlm-roberta-large-it-mnli", device=0, use_fast=True)
+```
+You can then classify Italian text, or text in any of the other pre-trained languages. You can even pass the labels in one language and the sequence to
+classify in another:
+```python
+# we will classify the following Wikipedia entry about Sardinia
+sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
+# we can specify candidate labels in Italian:
+candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
+classifier(sequence_to_classify, candidate_labels)
+# {'labels': ['geografia', 'moda', 'politica', 'macchine', 'cibo'],
+#  'scores': [0.5027586221694946, 0.19790762662887573, 0.1900099515914917, 0.10961027443408966, 0.07802766561508179]}
+```
+The default hypothesis template is in English: `This example is {}.` If you are working strictly within one language, it
+may be worthwhile to translate this to the language you are working with:
+```python
+sequence_to_classify = "La Sardegna è una regione italiana a statuto speciale di 1 592 730 abitanti con capoluogo Cagliari, la cui denominazione bilingue utilizzata nella comunicazione ufficiale è Regione Autonoma della Sardegna / Regione Autònoma de Sardigna."
+candidate_labels = ["geografia", "politica", "macchine", "cibo", "moda"]
+hypothesis_template = "si parla di {}"
+classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
+# {'labels': ['geografia', 'moda', 'politica', 'macchine', 'cibo'],
+#  'scores': [0.5027586221694946, 0.19790762662887573, 0.1900099515914917, 0.10961027443408966, 0.07802766561508179]}
+```
+#### With manual PyTorch
+```python
+# pose the sequence as an NLI premise and the label as a hypothesis
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+nli_model = AutoModelForSequenceClassification.from_pretrained('Jiva/xlm-roberta-large-it-mnli').to(device)
+tokenizer = AutoTokenizer.from_pretrained('Jiva/xlm-roberta-large-it-mnli')
+
+# illustrative inputs: any Italian text and candidate label will do
+sequence = "La Sardegna è una regione italiana a statuto speciale."
+label = "geografia"
+premise = sequence
+hypothesis = f'si parla di {label}.'
+
+# run through the model fine-tuned on MNLI
+x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
+                     truncation='only_first')
+with torch.no_grad():
+    logits = nli_model(x.to(device))[0]
+
+# per this repo's config.json: 0 = neutral, 1 = entailment, 2 = contradiction.
+# we throw away "neutral" (dim 0) and take the probability of
+# "entailment" (1) as the probability of the label being true
+contradiction_entail_logits = logits[:, [2, 1]]
+probs = contradiction_entail_logits.softmax(dim=1)
+prob_label_is_true = probs[:, 1]
+```
+## Training
+## Version 0.1
+The model has now been retrained on the full training set. Around 1000 sentence pairs have been removed from the set because their translation was botched by the translation model.
+## Version 0.0
+This model was pre-trained on a set of 100 languages, as described in
+[the original paper](https://arxiv.org/abs/1911.02116). It was then fine-tuned on the task of NLI on an Italian translation of the MNLI dataset (85% of the train set only so far). The model used for translating the texts is Helsinki-NLP/opus-mt-en-it, with a max output sequence length of 120. The model has been trained for 1 epoch with learning rate 4e-6 and batch size 80; it currently scores 82% accuracy on the remaining 15% of the training set.
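
The manual PyTorch snippet in the README drops the neutral logit and applies a softmax over the remaining two. The sketch below reproduces that arithmetic in plain Python so it can be checked without downloading the model. The logits are made-up numbers, the function name `entailment_probability` is hypothetical, and the label map is the `id2label` from the `config.json` added in this commit.

```python
import math


def entailment_probability(logits, id2label):
    """Two-way softmax over the contradiction and entailment logits."""
    label2id = {label: idx for idx, label in id2label.items()}
    contradiction = logits[label2id["contradiction"]]
    entailment = logits[label2id["entailment"]]
    # Subtract the max before exponentiating for numerical stability.
    shift = max(contradiction, entailment)
    exp_contradiction = math.exp(contradiction - shift)
    exp_entailment = math.exp(entailment - shift)
    return exp_entailment / (exp_contradiction + exp_entailment)


# Label order as declared in this commit's config.json.
ID2LABEL = {0: "neutral", 1: "entailment", 2: "contradiction"}

# Made-up logits for one (premise, hypothesis) pair, indexed by label id.
example_logits = [0.3, 2.0, -1.0]

prob = entailment_probability(example_logits, ID2LABEL)
print(f"P(label is true) = {prob:.4f}")
```

Looking the indices up by name rather than by position keeps the calculation correct if a checkpoint orders its labels differently.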

config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "_name_or_path": "xlm-roberta-large-it-mnli",
+  "architectures": [
+    "XLMRobertaForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "id2label": {
+    "0": "neutral",
+    "1": "entailment",
+    "2": "contradiction"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "label2id": {
+    "contradiction": 2,
+    "entailment": 1,
+    "neutral": 0
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "xlm-roberta",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "output_past": true,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "torch_dtype": "float32",
+  "transformers_version": "4.13.0",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 250002
+}
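
Because NLI checkpoints do not all order their labels the same way, code that selects the entailment logit is safer reading the index from this config than hard-coding it. A minimal sketch, assuming the two label maps are copied into a string instead of being read from disk:

```python
import json

# The label maps copied from the config.json hunk above.
CONFIG_FRAGMENT = """
{
  "id2label": {"0": "neutral", "1": "entailment", "2": "contradiction"},
  "label2id": {"contradiction": 2, "entailment": 1, "neutral": 0}
}
"""

config = json.loads(CONFIG_FRAGMENT)

# JSON object keys are always strings, so convert the ids back to integers.
id2label = {int(idx): label for idx, label in config["id2label"].items()}
label2id = config["label2id"]

# Sanity check: the two maps should be exact inverses of each other.
maps_are_consistent = all(label2id[label] == idx
                          for idx, label in id2label.items())

entailment_id = label2id["entailment"]
contradiction_id = label2id["contradiction"]
print(f"consistent={maps_are_consistent}, "
      f"entailment={entailment_id}, contradiction={contradiction_id}")
```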

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4db27ff0ea43ef8cee78f2fd988af78b9da15c5020e2a43b418f5a90d80f9161
+size 2239747529
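
The three lines above are a Git LFS pointer, not the weights themselves: the repository stores only the spec version, the SHA-256 of the real file, and its size in bytes. A small sketch of reading such a pointer follows; the helper name `parse_lfs_pointer` is hypothetical, and the parser assumes the simple `key value` layout seen here.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the pytorch_model.bin hunk above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4db27ff0ea43ef8cee78f2fd988af78b9da15c5020e2a43b418f5a90d80f9161
size 2239747529
"""

pointer = parse_lfs_pointer(POINTER)
print(f"{pointer['hash_algorithm']} digest, "
      f"{pointer['size_bytes'] / 2**30:.2f} GiB")
```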

sentencepiece.bpe.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+size 5069051

special_tokens_map.json ADDED

	@@ -0,0 +1 @@


+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1 @@

+ {"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "do_lower_case": false, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "xlm-roberta-large", "tokenizer_class": "XLMRobertaTokenizer"}
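
The tokenizer's `model_max_length` of 512 can look inconsistent with `max_position_embeddings` of 514 in `config.json`. The sketch below works through the arithmetic, assuming the RoBERTa convention used in Hugging Face Transformers that position ids start at `pad_token_id + 1`, so the lowest slots of the position table are never assigned to real tokens.

```python
# Values copied from this commit's config.json and tokenizer_config.json.
MAX_POSITION_EMBEDDINGS = 514  # config.json
PAD_TOKEN_ID = 1               # config.json
MODEL_MAX_LENGTH = 512         # tokenizer_config.json

# Assumption: RoBERTa-style models number positions from pad_token_id + 1,
# so the first real token gets position id 2 rather than 0.
first_position_id = PAD_TOKEN_ID + 1

# Slots left in the position-embedding table for actual tokens.
usable_positions = MAX_POSITION_EMBEDDINGS - first_position_id

print(f"usable positions: {usable_positions}")
print(f"matches tokenizer limit: {usable_positions == MODEL_MAX_LENGTH}")
```

Under that assumption the two files agree: inputs longer than 512 tokens, special tokens included, need truncation.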