🧬 Fine-tuned E5-small for Korean Drug Product Semantic Embedding

📘 Model Overview

This is a SentenceTransformer model based on intfloat/multilingual-e5-small,
fine-tuned in three stages for the Korean pharmaceutical domain.
Training used drug summary and detail data (drug_summary, drug_details), product type definitions (drug_type_definition), and DUR (Drug Utilization Review) regulation definitions (drug_dur_type_definition).


🧩 Base Model Selection Rationale

This project needs to embed the complex semantic relationships among drug names, indications, and DUR regulations accurately, even in a multilingual setting.
For that reason, intfloat/multilingual-e5-small was chosen from the E5 (multilingual-E5) model family.

The reasons for this choice are as follows:

  1. Multilingual sentence representation

    • It delivers balanced semantic representation across many languages, including Korean, Japanese, Chinese, and German as well as English.
    • Drug data often mixes loanwords with academic terminology, so a multilingual encoder is an advantage.
  2. Efficient parameter count for its performance (Small Variant)

    • The small model has about 118M parameters, most of them in the multilingual token-embedding table, and fine-tuning ran stably in local environments such as M1/M2 MacBooks.
    • FP16 and bfloat16 support allows efficient computation on GPU and MPS backends.
  3. Optimized for sentence-level semantic retrieval

    • E5 models are trained to produce sentence embeddings, so they perform well at matching
      simple queries ("기침약" (cough medicine), "열 내리는 약" (fever reducer)) to product names ("판콜에이" (Pancold A), "타이레놀" (Tylenol)).
  4. Full compatibility with Sentence-Transformers

    • The model is fully compatible with the SentenceTransformer interface, which made it easy to integrate into a PyTorch-based pipeline.
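
One practical detail of the E5 family is relevant to the query-to-product matching described in point 3: the base E5 checkpoints, including intfloat/multilingual-e5-small, expect each input to start with "query: " or "passage: ". This card does not say whether the fine-tuned model was trained with those prefixes, so the helper below is only a sketch under that assumption, and the function names are illustrative.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_query(text: str) -> str:
    """Prefix a search query in the style the base E5 checkpoints expect."""
    text = text.strip()
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text


def as_passages(texts):
    """Prefix each document; texts that already carry the prefix are left alone."""
    cleaned = (t.strip() for t in texts)
    return [t if t.startswith(PASSAGE_PREFIX) else PASSAGE_PREFIX + t for t in cleaned]


print(as_query("기침약"))
print(as_passages(["판콜에이", "타이레놀"]))
```

If the fine-tuning data did not use prefixes, encoding raw text (as in the usage example in this card) is the consistent choice; comparing both variants on a held-out set would settle it.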

🔹 Step 1: Drug Type Semantic Alignment

  • Dataset: drug_type_def_list.csv
  • Goal: learn concept mappings such as "해열제" (antipyretic) → "체온을 낮추는 약" (a drug that lowers body temperature)
  • Output model: /model/fine_tuned_e5_small_drugtype
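
As a rough illustration of how a term-to-definition file could be turned into training pairs, the sketch below reads a CSV into (term, definition) tuples. The column names `drug_type` and `definition` are assumptions, since the actual schema of drug_type_def_list.csv is not given here, and the sample rows are made up.

```python
import csv
import io


def load_pairs(csv_file, term_col="drug_type", text_col="definition"):
    """Read (term, definition) pairs, skipping rows where either field is empty."""
    pairs = []
    for row in csv.DictReader(csv_file):
        term = (row.get(term_col) or "").strip()
        text = (row.get(text_col) or "").strip()
        if term and text:
            pairs.append((term, text))
    return pairs


# In-memory sample standing in for the real file (hypothetical contents).
sample = io.StringIO(
    "drug_type,definition\n"
    "해열제,체온을 낮추는 약\n"
    "진해제,\n"  # missing definition, so this row is skipped
)
pairs = load_pairs(sample)
print(pairs)
```

For a real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`.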

🔹 Step 2: DUR Type Semantic Alignment

  • Dataset: drug_dur_type_similarity_train.csv
  • Goal: learn semantic mappings between DUR types such as "임부금기" (contraindicated in pregnancy), "노인주의" (caution in the elderly), and "병용금기" (contraindicated combination) and their technical descriptions
  • Output model: /model/fine_tuned_e5_small_drugdurtype
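
The dataset name suggests sentence pairs labeled with a similarity score. The loss used for this step is not stated, so the following is an assumption: a common objective for such data regresses the cosine similarity of the two embeddings onto the label with mean squared error, which is what CosineSimilarityLoss in sentence-transformers computes by default. The toy vectors below stand in for embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cosine_mse(triples):
    """Mean squared error between cosine similarity and the gold label."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in triples]
    return sum(errors) / len(errors)


toy = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # identical vectors, label 1: no error
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal vectors, label 0: no error
    ([1.0, 0.0], [0.0, 1.0], 1.0),  # orthogonal but labeled similar: error 1
]
print(cosine_mse(toy))
```

Training lowers this value by pulling together embeddings of pairs labeled as similar and pushing apart those labeled as dissimilar.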

🔹 Step 3: Drug Product Semantic Alignment

  • Dataset: drug_product_similarity_train.csv (a sample of about 3,000 rows)
  • Goal: strengthen semantic matching between real products such as "판콜에이내복액" (Pancold A oral solution) and queries such as "열을 내리는 약" (a drug that reduces fever)
  • Output model: /model/fine_tuned_e5_small_drugproduct_accum
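
The three steps can be read as a chain in which each stage starts from the previous stage's output. That chaining is an assumption based on the "three-stage" description, not something this card spells out. The sketch below plans the chain and shows one hypothetical way to run a stage with the sentence-transformers `fit` API; the loss, batch size, and epoch count are placeholders, not the author's settings.

```python
from dataclasses import dataclass

BASE_MODEL = "intfloat/multilingual-e5-small"


@dataclass(frozen=True)
class Stage:
    name: str
    train_csv: str
    output_dir: str


STAGES = [
    Stage("drug_type", "drug_type_def_list.csv",
          "/model/fine_tuned_e5_small_drugtype"),
    Stage("dur_type", "drug_dur_type_similarity_train.csv",
          "/model/fine_tuned_e5_small_drugdurtype"),
    Stage("drug_product", "drug_product_similarity_train.csv",
          "/model/fine_tuned_e5_small_drugproduct_accum"),
]


def plan(stages, base_model=BASE_MODEL):
    """Return (init_checkpoint, stage) tuples, chaining each stage onto the last."""
    steps, checkpoint = [], base_model
    for stage in stages:
        steps.append((checkpoint, stage))
        checkpoint = stage.output_dir
    return steps


def run_stage(init_checkpoint, stage, pairs, epochs=1, batch_size=16):
    """Fine-tune one stage on (text_a, text_b) positive pairs (placeholder settings)."""
    # Imported lazily so the planning code above runs without these packages.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses

    model = SentenceTransformer(init_checkpoint)
    examples = [InputExample(texts=[a, b]) for a, b in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.MultipleNegativesRankingLoss(model)  # assumed loss, not confirmed
    model.fit(train_objectives=[(loader, loss)], epochs=epochs,
              output_path=stage.output_dir)


for checkpoint, stage in plan(STAGES):
    print(f"{stage.name}: {checkpoint} -> {stage.output_dir}")
```

Only `plan` runs here; `run_stage` would be called once per planned step with that stage's training pairs.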

🧠 Use Case Example

from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned model from the Hugging Face Hub.
model = SentenceTransformer("Yoonyoul/fine-tuned-e5-small-drugproduct")

query = "열을 내리는 약은?"  # "Which drug reduces fever?"
docs = [
    "판콜에이내복액은 해열진통제입니다.",  # "Pancold A oral solution is an antipyretic analgesic."
    "마이암부톨정은 항결핵제입니다.",  # "Myambutol tablets are an antituberculosis agent."
    "지르텍정은 항히스타민제입니다.",  # "Zyrtec tablets are an antihistamine."
]

# Encode the query and the candidate documents as tensors.
emb_q = model.encode(query, convert_to_tensor=True)
emb_d = model.encode(docs, convert_to_tensor=True)

# cos_sim returns a 1 x len(docs) matrix; take the row for the single query.
scores = util.cos_sim(emb_q, emb_d)[0]
for doc, score in zip(docs, scores):
    print(f"{doc} → similarity: {score.item():.4f}")
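
The loop above prints documents in input order. For retrieval, the results are usually sorted by score first. A minimal sketch of that step follows; the helper name is illustrative and the scores are made-up numbers, not model output.

```python
def rank_documents(docs, scores, top_k=None):
    """Pair each document with its score and sort from most to least similar."""
    ranked = sorted(
        zip(docs, (float(s) for s in scores)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]


docs = ["doc A", "doc B", "doc C"]
scores = [0.31, 0.87, 0.42]  # made-up numbers for illustration
for doc, score in rank_documents(docs, scores, top_k=2):
    print(f"{score:.4f}  {doc}")
```

Because each score is passed through `float()`, the same helper also accepts the one-dimensional tensor returned by `util.cos_sim(...)[0]`.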

⚙️ Training Environment

Item                   Version
Python                 3.12.4
torch                  2.4.1
transformers           4.44.2
sentence-transformers  3.0.1
accelerate             0.27.0
pandas                 2.2.3
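
To check a local environment against the versions listed above, a small standard-library script can compare installed package versions with the pins. This is a convenience sketch rather than part of the original training setup, and the function name is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Package pins copied from the training environment list.
PINNED = {
    "torch": "2.4.1",
    "transformers": "4.44.2",
    "sentence-transformers": "3.0.1",
    "accelerate": "0.27.0",
    "pandas": "2.2.3",
}


def find_mismatches(pinned):
    """Return {package: (expected, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

The comparison is an exact string match, so builds that report a local suffix (for example a CUDA build of torch) show up as mismatches even when the base version agrees.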

📅 Release Info
