LCA Qwen3 ST Fine-Tuned Model

This directory contains a Sentence Transformers v3 model obtained by fine-tuning Qwen/Qwen3-Embedding-0.6B on a proprietary life-cycle assessment (LCA) corpus. It maps sentences and short paragraphs to 1024-dimensional embeddings for tasks such as semantic search, similarity ranking, and clustering.

Model Details

  • Architecture: Transformer encoder + last-token pooling + L2 normalization (a pooling sketch follows this list)
  • Max sequence length: 1024 tokens
  • Embedding dimension: 1024
  • Similarity function: Cosine similarity
  • Training objective: MultipleNegativesRankingLoss
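
The last-token pooling and normalization steps can be illustrated as follows. This is a minimal sketch assuming right-padded inputs with an attention mask, not the library's internal implementation:

import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Take the hidden state of the last non-padding token in each sequence,
    # then L2-normalize so that dot products equal cosine similarities.
    last_idx = attention_mask.sum(dim=1) - 1             # (batch,)
    batch_idx = torch.arange(hidden_states.size(0))
    pooled = hidden_states[batch_idx, last_idx]          # (batch, hidden_dim)
    return F.normalize(pooled, p=2, dim=1)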

Module Stack

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'Qwen3Model'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
  (2): Normalize()
)

Usage

Install the dependency:

pip install -U sentence-transformers

Load the model (shown here by repository ID; a local path to this directory also works) and encode queries and documents:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BIaoo/lca-qwen3-ft")

queries = [
    "wood residue gasification heat recovery",
    "magnesium alloy diecasting emissions",
]
documents = [
    "Report describing small-scale biomass CHP units used for district heating.",
    "Manufacturing note that summarizes casting emissions for AZ91 components.",
]

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
query_embs = model.encode(queries, normalize_embeddings=True)
doc_embs = model.encode(documents, normalize_embeddings=True)
scores = query_embs @ doc_embs.T
print(scores)
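
For larger document collections, the scores can be turned into a per-query ranking. The snippet below is a small sketch using the library's util.semantic_search helper on the embeddings from the example above:

from sentence_transformers import util

# Rank documents for each query by cosine similarity (top_k here is illustrative).
hits = util.semantic_search(query_embs, doc_embs, top_k=2)
for query, query_hits in zip(queries, hits):
    print(query)
    for hit in query_hits:
        print(f"  doc {hit['corpus_id']}: {hit['score']:.3f}")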

Training Data Overview

  • Pairs: 86,268 (anchor, positive) text pairs
  • Anchor length: short queries (median ≈ 12 tokens)
  • Positive length: paragraph passages (median ≈ 480 tokens)
  • Source: Internally curated LCA documents and structured metadata
  • Data release: Individual passages are proprietary and therefore omitted from this README; the sketch after this list shows only the expected pair layout with placeholder text.
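
As a point of reference, (anchor, positive) pairs for MultipleNegativesRankingLoss are typically laid out as a two-column dataset. The rows below are placeholders, not passages from the proprietary corpus:

from datasets import Dataset

# Placeholder rows illustrating the (anchor, positive) column layout only.
train_dataset = Dataset.from_dict({
    "anchor": [
        "wood residue gasification heat recovery",
        "magnesium alloy diecasting emissions",
    ],
    "positive": [
        "Paragraph-length passage on biomass gasification with heat recovery ...",
        "Paragraph-length passage on die-casting emissions for magnesium alloys ...",
    ],
})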

Training Configuration

  • Epochs: 2
  • Batch size: 16 (NO_DUPLICATES sampler; see the trainer sketch after this list)
  • Learning rate: 1e-5 with linear warmup (10%)
  • Weight decay: 0.01
  • Precision: bfloat16
  • Gradient checkpointing: disabled (single-GPU run)
  • Seed: 42
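
As a rough sketch, the settings above map onto the Sentence Transformers v3 trainer as shown below. The output directory is a placeholder, train_dataset is the two-column layout sketched in the previous section, and the exact training script is not released:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="lca-qwen3-ft",                   # placeholder output path
    num_train_epochs=2,
    per_device_train_batch_size=16,
    batch_sampler=BatchSamplers.NO_DUPLICATES,   # no duplicate texts within a batch
    learning_rate=1e-5,
    warmup_ratio=0.1,                            # linear warmup over the first 10% of steps
    weight_decay=0.01,
    bf16=True,
    seed=42,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,                 # (anchor, positive) pairs as sketched above
    loss=loss,
)
trainer.train()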

Limitations & Notes

  • The model inherits any biases or gaps present in the proprietary LCA corpus.
  • It has been tuned for English technical text; performance may degrade on other languages.
  • While embeddings are normalized, downstream pipelines should still apply task-specific evaluation before deployment.

Files in This Directory

  • config.json, sentence_bert_config.json, modules.json: model definitions
  • model.safetensors: learned weights
  • tokenizer.json, vocab.json, merges.txt, special_tokens_map.json: tokenizer assets
  • 1_Pooling/, 2_Normalize/: Sentence Transformers module metadata