GEM-ConvBERT Legal: A Greek Legal Language Model

Model Description

GEM-ConvBERT Legal is a ConvBERT-base model pre-trained from scratch on a 17 GB corpus of Greek legal, parliamentary, and governmental text. It is designed to capture the specialized vocabulary and context of the legal domain in Greece and the EU.

This model was trained as part of a research project and is intended for downstream tasks such as Named Entity Recognition (NER), text classification, and question answering in the legal field. The ConvBERT architecture replaces part of BERT's self-attention with span-based dynamic convolution, a mixed-attention design that reduces computational cost while maintaining comparable performance.

How to Get Started

You can use this model directly with the fill-mask pipeline:

from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-convbert-legal",
    tokenizer="novelcore/gem-convbert-legal"
)

# Example sentence from a legal/parliamentary context:
# "Mr. Mitsotakis <mask> that the government fully respects the decisions of the Council of State."
text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."

# Get predictions
predictions = fill_mask(text)
print(predictions)

For downstream tasks:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification (the classification head is randomly
# initialized and must be fine-tuned on labeled data before use)
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-convbert-legal")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-convbert-legal")
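
Once the classification head has been fine-tuned on labeled data, inference could look roughly like this (the input sentence is a placeholder meaning "Example legal text", and the class ids depend on your fine-tuning setup):

import torch

# Tokenize a (placeholder) legal sentence and pick the highest-scoring class.
text = "Παράδειγμα νομικού κειμένου."  # "Example legal text."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = int(logits.argmax(dim=-1))
print(predicted_class_id)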

Training Data

The model was pre-trained on a comprehensive corpus of approximately 17 GB of Greek text compiled from legal and governmental sources. The corpus was cleaned, normalized to UTF-8, and deduplicated before training to ensure high quality and diversity.
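
The cleaning pipeline itself is not described further; as an illustration of the deduplication step only, a minimal hash-based sketch (the granularity and normalization are assumptions):

import hashlib

def deduplicate(documents):
    """Drop exact duplicate documents by hashing whitespace-normalized text (illustrative only)."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique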

The composition of the training corpus is as follows:

| Corpus Source | Size (GB) | Context |
| --- | --- | --- |
| FEK - Greek Government Gazette (all issues) | 11.0 | Legal |
| Greek Parliament Proceedings | 2.9 | Legal / Parliamentary |
| Political Reports of the Supreme Court | 1.2 | Legal |
| Eur-Lex (Greek Content) | 0.92 | Legal |
| Europarl (Greek Content) | 0.38 | Legal / Parliamentary |
| Raptarchis Legal Dictionary | 0.35 | Legal |
| Total | ~16.75 | |

Training Procedure

Model Architecture

The model uses the ConvBERT architecture with the following configuration (a configuration sketch in code follows this list):

  • Hidden Size: 768
  • Hidden Layers: 12
  • Attention Heads: 12
  • Intermediate Size: 3072
  • Convolutional Kernel Size: 9
  • Number of Convolutional Groups: 1
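
For reference, this corresponds roughly to the following ConvBertConfig. This is a sketch only, not the checkpoint's exact config file: the vocabulary size and maximum sequence length are taken from the Preprocessing section, and any field not listed keeps the transformers library default.

from transformers import ConvBertConfig, ConvBertForMaskedLM

# Sketch of the published configuration; unlisted fields keep ConvBERT-base defaults.
config = ConvBertConfig(
    vocab_size=50264,            # custom WordPiece vocabulary (see Preprocessing)
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    conv_kernel_size=9,
    num_groups=1,
    max_position_embeddings=512,
)
model = ConvBertForMaskedLM(config)  # randomly initialized; the released checkpoint carries the trained weights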

Preprocessing

The text was tokenized using a custom WordPiece tokenizer trained from scratch on the Greek legal corpus. The tokenizer is uncased (does not distinguish between upper and lower case) and uses a vocabulary of 50,264 tokens.
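
The tokenizer training setup is not published in detail; as a rough sketch, a comparable uncased WordPiece tokenizer could be trained with the Hugging Face tokenizers library (the corpus path and all settings other than casing and vocabulary size are assumptions):

from tokenizers import BertWordPieceTokenizer

# Rough sketch only: train an uncased WordPiece tokenizer with a 50,264-token vocabulary.
wp_tokenizer = BertWordPieceTokenizer(lowercase=True)
wp_tokenizer.train(files=["greek_legal_corpus.txt"], vocab_size=50264)  # placeholder corpus file
wp_tokenizer.save_model("gem-convbert-legal-tokenizer")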

The data was then processed into fixed-size chunks of 512 tokens, respecting document boundaries to ensure contextual coherence.
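
A minimal sketch of this chunking step, assuming each document is tokenized and split independently so that no block crosses a document boundary (the helper name and usage are hypothetical):

def chunk_document(token_ids, block_size=512):
    """Split one document's token ids into fixed-size blocks (illustrative only)."""
    return [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]

# Hypothetical usage: chunk each tokenized document separately, so blocks never mix documents.
# blocks = [block for doc in tokenized_documents for block in chunk_document(doc)]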

Pre-training

The model was pre-trained from scratch for 200,000 steps on 8x NVIDIA A100 40GB GPUs, using BFloat16 (bf16) mixed precision for stability and speed. The total training time was approximately 45 hours.

The key hyperparameters were as follows (see the TrainingArguments sketch after this list):

  • Learning Rate: 2e-4 with a linear warmup of 12,000 steps
  • Batch Size: Effective batch size of 768 (per_device_train_batch_size: 32, gradient_accumulation_steps: 3 on 8 GPUs)
  • Optimizer: AdamW with beta1=0.9, beta2=0.98, epsilon=1e-6
  • Weight Decay: 0.01
  • Max Sequence Length: 512
  • Max Steps: 200,000
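
Expressed as Hugging Face TrainingArguments, these settings would look roughly as follows. This is a sketch only: the output directory is a placeholder and any field not listed keeps the library default, since the original training script is not published.

from transformers import TrainingArguments

# Sketch reproducing the listed hyperparameters; paths are placeholders.
training_args = TrainingArguments(
    output_dir="gem-convbert-legal-pretraining",
    max_steps=200_000,
    per_device_train_batch_size=32,   # x 3 accumulation steps x 8 GPUs = effective batch size 768
    gradient_accumulation_steps=3,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=12_000,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    bf16=True,                        # BFloat16 mixed precision
)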

Training Results

The model achieved the following performance metrics:

  • Final Training Loss: 0.6321
  • Final Evaluation Loss: 0.5988
  • Training Infrastructure: 8x NVIDIA A100 40GB GPUs
  • Training Duration: 44:58:15 (hh:mm:ss)
  • Total Training Steps: 200,000