---
datasets:
  - eduagarcia/CrawlPT_dedup
language:
  - pt
pipeline_tag: fill-mask
model-index:
  - name: RoBERTaCrawlPT-base
    results:
      - task:
          type: token-classification
        dataset:
          type: lener_br
          name: lener_br
          split: test
        metrics:
          - type: seqeval
            value: 0.8924
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: UlyNER-PL Coarse
          config: UlyssesNER-Br-PL-coarse
          split: test
        metrics:
          - type: seqeval
            value: 0.8822
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: UlyNER-PL Fine
          config: UlyssesNER-Br-PL-fine
          split: test
        metrics:
          - type: seqeval
            value: 0.8658
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: FGV-STF
          config: fgv-coarse
          split: test
        metrics:
          - type: seqeval
            value: 0.7988
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: RRIP
          config: rrip
          split: test
        metrics:
          - type: seqeval
            value: 0.828
            name: F1
            args:
              scheme: IOB2
      - task:
          type: token-classification
        dataset:
          type: eduagarcia/PortuLex_benchmark
          name: PortuLex
          split: test
        metrics:
          - type: seqeval
            value: 0.8483
            name: Average F1
            args:
              scheme: IOB2
license: cc-by-4.0
metrics:
  - seqeval
---
# RoBERTaCrawlPT-base
RoBERTaCrawlPT-base is a generic Portuguese Masked Language Model pretrained from scratch on the CrawlPT corpora, using the same architecture as RoBERTa-base. This model is part of the RoBERTaLexPT work.
- Language(s) (NLP): Portuguese (mainly pt-BR)
- License: Creative Commons Attribution 4.0 International Public License
- Repository: https://github.com/eduagarcia/roberta-legal-portuguese
- Paper: https://aclanthology.org/2024.propor-1.38/
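A minimal usage sketch with the transformers fill-mask pipeline; the repository id below is assumed from the model name and the example sentence is purely illustrative:

```python
from transformers import pipeline

# Repository id assumed from the model name; adjust if the hosted id differs.
unmasker = pipeline("fill-mask", model="eduagarcia/RoBERTaCrawlPT-base")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
for prediction in unmasker("O advogado apresentou o <mask> ao juiz."):
    print(prediction["token_str"], prediction["score"])
```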
## Generic Evaluation
TO-DO...
## Legal Evaluation
The model was evaluated on the PortuLex benchmark, a four-task benchmark designed to assess the quality and performance of language models in the Portuguese legal domain.
Macro F1-score (%) for multiple models evaluated on the PortuLex benchmark test splits:
| Model | LeNER | UlyNER-PL | FGV-STF | RRIP | Average (%) |
|---|---|---|---|---|---|
|  |  | Coarse/Fine | Coarse |  |  |
| BERTimbau-base | 88.34 | 86.39/83.83 | 79.34 | 82.34 | 83.78 | 
| BERTimbau-large | 88.64 | 87.77/84.74 | 79.71 | 83.79 | 84.60 | 
| Albertina-PT-BR-base | 89.26 | 86.35/84.63 | 79.30 | 81.16 | 83.80 | 
| Albertina-PT-BR-xlarge | 90.09 | 88.36/86.62 | 79.94 | 82.79 | 85.08 | 
| BERTikal-base | 83.68 | 79.21/75.70 | 77.73 | 81.11 | 79.99 | 
| JurisBERT-base | 81.74 | 81.67/77.97 | 76.04 | 80.85 | 79.61 | 
| BERTimbauLAW-base | 84.90 | 87.11/84.42 | 79.78 | 82.35 | 83.20 | 
| Legal-XLM-R-base | 87.48 | 83.49/83.16 | 79.79 | 82.35 | 83.24 | 
| Legal-XLM-R-large | 88.39 | 84.65/84.55 | 79.36 | 81.66 | 83.50 | 
| Legal-RoBERTa-PT-large | 87.96 | 88.32/84.83 | 79.57 | 81.98 | 84.02 | 
| Ours |  |  |  |  |  |
| RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 | 
| RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 | 
| RoBERTaCrawlPT-base (this) (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 | 
| RoBERTaLexPT-base (Trained on CrawlPT + LegalPT) | 90.73 | 88.56/86.03 | 80.40 | 83.22 | 85.41 | 
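The scores above are entity-level F1 values computed with seqeval under the IOB2 scheme. A small illustration of that metric on toy tags (not benchmark data):

```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Toy IOB2-tagged sequences; an entity only counts as correct if its full span
# and type match, which is what the strict seqeval F1 in the table measures.
y_true = [["B-ORGANIZACAO", "I-ORGANIZACAO", "O", "B-PESSOA", "O"]]
y_pred = [["B-ORGANIZACAO", "I-ORGANIZACAO", "O", "O", "O"]]

print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2, average="macro"))
```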
## Training Details
RoBERTaCrawlPT is pretrained on:
- CrawlPT: a composition of three general Portuguese corpora: brWaC, the CC100 PT subset, and the OSCAR-2301 PT subset.
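The deduplicated corpus is published as eduagarcia/CrawlPT_dedup (see the YAML metadata above). A streaming access sketch; the configuration name "brwac" is a guess and may differ from the actual subset names on the dataset card:

```python
from datasets import load_dataset

# "brwac" is a hypothetical configuration name for one CrawlPT subset; check
# the dataset card of eduagarcia/CrawlPT_dedup for the exact names.
crawlpt = load_dataset("eduagarcia/CrawlPT_dedup", "brwac", split="train", streaming=True)
for document in crawlpt.take(2):
    print(document)
```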
### Training Procedure
Our pretraining was executed with the Fairseq library v0.10.2 on a DGX-A100 cluster, using a total of 2 Nvidia A100 80 GB GPUs. The complete training of a single configuration took approximately three days.
This computational cost is similar to that of BERTimbau-base, and exposed the model to approximately 65 billion tokens during training.
#### Preprocessing
We deduplicated all subsets of the CrawlPT corpus using the MinHash algorithm and Locality-Sensitive Hashing implementation from the text-dedup library to find clusters of duplicate documents.
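The authors used the text-dedup toolkit; the sketch below only illustrates the underlying MinHash + Locality-Sensitive Hashing idea, using the datasketch library (the document texts, shingle size, and similarity threshold are illustrative assumptions):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from whitespace-tokenized 3-gram shingles."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for ngram in zip(tokens, tokens[1:], tokens[2:]):
        m.update(" ".join(ngram).encode("utf-8"))
    return m

docs = {
    "doc1": "o tribunal julgou o recurso improcedente por unanimidade",
    "doc2": "o tribunal julgou o recurso improcedente por unanimidade .",
    "doc3": "texto completamente diferente sobre outro assunto juridico",
}

# Index signatures in an LSH structure; documents whose estimated Jaccard
# similarity exceeds the threshold land in the same candidate cluster.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
signatures = {key: minhash(text) for key, text in docs.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

for key, sig in signatures.items():
    print(key, "->", lsh.query(sig))  # near-duplicate candidates, including itself
```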
To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from HuggingFace Tokenizers to train a vocabulary for each pretraining corpus.
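A minimal sketch of training a corpus-specific byte-level BPE vocabulary with HuggingFace Tokenizers; the file path, vocabulary size, and special tokens are assumptions rather than the authors' exact settings:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["crawlpt_corpus.txt"],   # hypothetical plain-text dump of the corpus
    vocab_size=50_265,              # RoBERTa-base-sized vocabulary (assumed)
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("roberta-crawlpt-tokenizer")  # target directory must already exist
```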
#### Training Hyperparameters
The model was pretrained for 62,500 steps with a batch size of 2,048 and a peak learning rate of 4e-4, with each sequence containing a maximum of 512 tokens.
The weights were randomly initialized.
We employed the masked language modeling objective, with 15% of the input tokens randomly masked.
Optimization was performed with the AdamW optimizer, using a linear warmup followed by a linear decay learning rate schedule.
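A rough PyTorch/transformers sketch of this setup (the authors' actual pretraining used Fairseq, not this code): a randomly initialized RoBERTa-base, dynamic 15% masking via the MLM data collator, and AdamW with linear warmup and decay. The tokenizer repository id is assumed from the model name:

```python
import torch
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    get_linear_schedule_with_warmup,
)

# Tokenizer trained on the pretraining corpus (repo id assumed from the model name).
tokenizer = RobertaTokenizerFast.from_pretrained("eduagarcia/RoBERTaCrawlPT-base")

# Randomly initialized RoBERTa-base: no pretrained weights are loaded.
config = RobertaConfig(vocab_size=tokenizer.vocab_size)
model = RobertaForMaskedLM(config)

# MLM objective: 15% of input tokens are dynamically masked at training time.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# AdamW with linear warmup and linear decay, matching the table below.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-4, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=6_000, num_training_steps=62_500
)
```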
For the remaining parameters, we adopted the standard RoBERTa-base hyperparameters:
| Hyperparameter | RoBERTa-base | 
|---|---|
| Number of layers | 12 | 
| Hidden size | 768 | 
| FFN inner hidden size | 3072 | 
| Attention heads | 12 | 
| Attention head size | 64 | 
| Dropout | 0.1 | 
| Attention dropout | 0.1 | 
| Warmup steps | 6k | 
| Peak learning rate | 4e-4 | 
| Batch size | 2048 | 
| Weight decay | 0.01 | 
| Maximum training steps | 62.5k | 
| Learning rate decay | Linear | 
| AdamW $$\epsilon$$ | 1e-6 | 
| AdamW $$\beta_1$$ | 0.9 | 
| AdamW $$\beta_2$$ | 0.98 | 
| Gradient clipping | 0.0 | 
## Citation
```bibtex
@inproceedings{garcia-etal-2024-robertalexpt,
    title = "{R}o{BERT}a{L}ex{PT}: A Legal {R}o{BERT}a Model pretrained with deduplication for {P}ortuguese",
    author = "Garcia, Eduardo A. S.  and
      Silva, Nadia F. F.  and
      Siqueira, Felipe  and
      Albuquerque, Hidelberg O.  and
      Gomes, Juliana R. S.  and
      Souza, Ellen  and
      Lima, Eliomar A.",
    editor = "Gamallo, Pablo  and
      Claro, Daniela  and
      Teixeira, Ant{\'o}nio  and
      Real, Livy  and
      Garcia, Marcos  and
      Oliveira, Hugo Gon{\c{c}}alo  and
      Amaro, Raquel",
    booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese",
    month = mar,
    year = "2024",
    address = "Santiago de Compostela, Galicia/Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.propor-1.38",
    pages = "374--383",
}
```
## Acknowledgment
This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).
