SentenceTransformer based on BAAI/bge-m3

This is a sentence-transformers model fine-tuned from BAAI/bge-m3. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: BAAI/bge-m3
Maximum Sequence Length: 512 tokens
Output Dimensionality: 1024 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
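
The three modules above can be read as a pipeline: the transformer produces one 1024-dimensional vector per token, the Pooling layer keeps only the first (CLS) token's vector because `pooling_mode_cls_token` is True, and Normalize rescales that vector to unit length. The following is a minimal pure-Python sketch of the last two steps. It uses made-up 4-dimensional token vectors in place of real model output, and the function names are illustrative rather than part of the library.

```python
import math


def cls_pool(token_embeddings):
    """CLS pooling: keep only the first token's vector."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def embed(token_embeddings):
    """Pooling followed by Normalize, as in modules (1) and (2)."""
    return l2_normalize(cls_pool(token_embeddings))


# Toy per-token vectors (hypothetical values, 4 dimensions instead of 1024).
tokens_a = [[3.0, 4.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
tokens_b = [[4.0, 3.0, 0.0, 0.0], [-1.0, 2.0, 5.0, 0.5]]

emb_a = embed(tokens_a)
emb_b = embed(tokens_b)

# For unit-length vectors the dot product is the cosine similarity.
cosine = sum(x * y for x, y in zip(emb_a, emb_b))
print(round(cosine, 4))
```

Because every embedding leaves the model with unit length, a plain dot product between two embeddings gives the same value as their cosine similarity.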

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("comet24082002/ft_bge_newLaw_CachedMultipleNegativeRankingLoss_V1_5epochs")
# Run inference
sentences = [
    'Công ty có quyền giảm lương khi người lao động không đảm bảo hiệu suất công việc?',
    '"Điều 94. Nguyên tắc trả lương\n1. Người sử dụng lao động phải trả lương trực tiếp, đầy đủ, đúng hạn cho người lao động. Trường hợp người lao động không thể nhận lương trực tiếp thì người sử dụng lao động có thể trả lương cho người được người lao động ủy quyền hợp pháp.\n2. Người sử dụng lao động không được hạn chế hoặc can thiệp vào quyền tự quyết chi tiêu lương của người lao động; không được ép buộc người lao động chi tiêu lương vào việc mua hàng hóa, sử dụng dịch vụ của người sử dụng lao động hoặc của đơn vị khác mà người sử dụng lao động chỉ định."',
    'Các biện pháp tăng cường an toàn hoạt động bay\nCục Hàng không Việt Nam áp dụng các biện pháp tăng cường sau:\n1. Phổ biến kinh nghiệm, bài học liên quan trên thế giới và tại Việt Nam cho các tổ chức, cá nhân liên quan trực tiếp đến hoạt động bay bằng các hình thức thích hợp.\n2. Tổ chức thực hiện, giám sát kết quả thực hiện khuyến cáo an toàn của các cuộc điều tra tai nạn tàu bay, sự cố trong lĩnh vực hoạt động bay.\n3. Tổng kết, đánh giá và phân tích định kỳ hàng năm việc thực hiện quản lý an toàn hoạt động bay; tổ chức khắc phục các hạn chế, yêu cầu, đề nghị liên quan nhằm hoàn thiện công tác quản lý an toàn và SMS.\n4. Tổ chức huấn luyện, đào tạo về an toàn hoạt động bay.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
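
`model.similarity` returns a square matrix in which entry (i, j) is the cosine similarity between sentence i and sentence j. For retrieval, one would typically read off the query's row and rank the other texts by score. Below is a sketch of that ranking step in plain Python. The 3×3 score matrix is hypothetical and stands in for real model output, and `rank_candidates` is an illustrative helper, not a library function.

```python
def rank_candidates(similarity_matrix, query_index):
    """Return (index, score) pairs for all other texts, best match first."""
    row = similarity_matrix[query_index]
    candidates = [(j, score) for j, score in enumerate(row) if j != query_index]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Invented scores for illustration: row 0 is the question,
# rows 1 and 2 are the two candidate passages.
scores = [
    [1.00, 0.62, 0.08],
    [0.62, 1.00, 0.11],
    [0.08, 0.11, 1.00],
]

ranking = rank_candidates(scores, query_index=0)
print(ranking)
```

With real embeddings, the same helper could be applied to `similarities.tolist()`.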

Training Details

Training Dataset

Unnamed Dataset

Size: 10,524 training samples
Columns: anchor and positive
Approximate statistics based on the first 1000 samples:

	anchor	positive
type	string	string
details	min: 8 tokens mean: 24.42 tokens max: 49 tokens	min: 18 tokens mean: 272.22 tokens max: 512 tokens

Samples:

anchor	positive
`Người thừa kế theo di chúc gồm những ai?`	`"Điều 613. Người thừa kế Người thừa kế là cá nhân phải là người còn sống vào thời điểm mở thừa kế hoặc sinh ra và còn sống sau thời điểm mở thừa kế nhưng đã thành thai trước khi người để lại di sản chết. Trường hợp người thừa kế theo di chúc không là cá nhân thì phải tồn tại vào thời điểm mở thừa kế."`
`Đầu tư vốn nhà nước vào doanh nghiệp được thực hiện bằng những hình thức nào?`	Hình thức đầu tư vốn nhà nước vào doanh nghiệp 1. Đầu tư vốn nhà nước để thành lập doanh nghiệp do Nhà nước nắm giữ 100% vốn điều lệ. 2. Đầu tư bổ sung vốn điều lệ cho doanh nghiệp do Nhà nước nắm giữ 100% vốn điều lệ đang hoạt động. 3. Đầu tư bổ sung vốn nhà nước để tiếp tục duy trì tỷ lệ cổ phần, vốn góp của Nhà nước tại công ty cổ phần, công ty trách nhiệm hữu hạn hai thành viên trở lên. 4. Đầu tư vốn nhà nước để mua lại một phần hoặc toàn bộ doanh nghiệp.
`Thủ tục thành lập trung tâm hiến máu chữ thập đỏ có quy định như thế nào?`	Thủ tục thành lập cơ sở hiến máu chữ thập đỏ Thủ tục thành lập cơ sở hiến máu chữ thập đỏ thực hiện theo quy định tại Điều 4, Thông tư số 03/2013/TT-BNV ngày 16 tháng 4 năm 2013 của Bộ Nội vụ quy định chi tiết thi hành Nghị định số 45/2010/NĐ-CP ngày 21 tháng 4 năm 2010 của Chính phủ quy định về tổ chức, hoạt động và quản lý hội và Nghị định số 33/2012/NĐ-CP ngày 13 tháng 4 năm 2012 của Chính phủ sửa đổi, bổ sung một số điều của Nghị định số 45/2010/NĐ-CP

Loss: CachedMultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
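
To make the two parameters concrete: `similarity_fct` scores every anchor against every positive in the batch with cosine similarity, and `scale` multiplies those scores before a softmax cross-entropy in which each anchor's own positive is the correct answer and all other positives act as negatives. The sketch below implements that in-batch negatives objective in plain Python on toy 2-dimensional vectors. It deliberately omits the gradient caching described by Gao et al., which reduces memory use but is not meant to change the loss value. The function names are illustrative.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch (hypothetical values): two anchors and their positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive matches its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong anchor

loss_aligned = in_batch_negatives_loss(anchors, aligned)
loss_swapped = in_batch_negatives_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

A larger batch supplies more negatives per anchor, which is one reason this loss is often paired with large batch sizes.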

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 256
learning_rate: 2e-05
num_train_epochs: 5
warmup_ratio: 0.1
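
Combined with the dataset size of 10,524 samples, these values fix the length of the run and the shape of the learning-rate curve. The arithmetic can be sketched as follows, assuming a single training device, no gradient accumulation, the linear scheduler listed under All Hyperparameters, and warmup steps rounded up from `warmup_ratio` times the total step count (the rounding rule is an assumption about the trainer). `linear_schedule_lr` is an illustrative helper, not a library call.

```python
import math

train_samples = 10_524
batch_size = 256        # per_device_train_batch_size
num_epochs = 5          # num_train_epochs
learning_rate = 2e-05
warmup_ratio = 0.1

# The last, partial batch is kept (dataloader_drop_last is False).
steps_per_epoch = math.ceil(train_samples / batch_size)               # 42
total_steps = steps_per_epoch * num_epochs                            # 210
warmup_steps = math.ceil(total_steps * warmup_ratio)                  # 21
last_batch_size = train_samples - (steps_per_epoch - 1) * batch_size  # 28


def linear_schedule_lr(step):
    """Linear warmup to the peak rate, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps, last_batch_size)
```

The 42 steps per epoch and 210 total steps agree with the Step column of the training logs.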

All Hyperparameters

overwrite_output_dir: False
do_predict: False
prediction_loss_only: True
per_device_train_batch_size: 256
per_device_eval_batch_size: 8
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 5
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss
0.0476	2	1.1564
0.0952	4	0.9996
0.1429	6	1.0032
0.1905	8	0.77
0.2381	10	0.6496
0.2857	12	0.435
0.3333	14	0.4408
0.3810	16	0.467
0.4286	18	0.4484
0.4762	20	0.366
0.5238	22	0.3131
0.5714	24	0.3068
0.6190	26	0.3319
0.6667	28	0.2293
0.7143	30	0.3322
0.7619	32	0.2658
0.8095	34	0.2591
0.8571	36	0.3763
0.9048	38	0.2642
0.9524	40	0.2871
1.0	42	0.2005
1.0476	44	0.1757
1.0952	46	0.2309
1.1429	48	0.218
1.1905	50	0.2702
1.2381	52	0.2113
1.2857	54	0.184
1.3333	56	0.2414
1.3810	58	0.1692
1.4286	60	0.2015
1.4762	62	0.2303
1.5238	64	0.1829
1.5714	66	0.216
1.6190	68	0.182
1.6667	70	0.2362
1.7143	72	0.183
1.7619	74	0.239
1.8095	76	0.2207
1.8571	78	0.1848
1.9048	80	0.1828
1.9524	82	0.2324
2.0	84	0.1048
2.0476	86	0.1852
2.0952	88	0.1381
2.1429	90	0.1723
2.1905	92	0.1519
2.2381	94	0.1285
2.2857	96	0.1545
2.3333	98	0.1786
2.3810	100	0.1803
2.4286	102	0.1191
2.4762	104	0.1546
2.5238	106	0.1782
2.5714	108	0.1609
2.6190	110	0.1642
2.6667	112	0.1204
2.7143	114	0.173
2.7619	116	0.1332
2.8095	118	0.1567
2.8571	120	0.124
2.9048	122	0.1768
2.9524	124	0.1776
3.0	126	0.1091
3.0476	128	0.1621
3.0952	130	0.1231
3.1429	132	0.1117
3.1905	134	0.1328
3.2381	136	0.1201
3.2857	138	0.1052
3.3333	140	0.0967
3.3810	142	0.1397
3.4286	144	0.1051
3.4762	146	0.1412
3.5238	148	0.157
3.5714	150	0.1241
3.6190	152	0.1119
3.6667	154	0.1222
3.7143	156	0.1324
3.7619	158	0.1489
3.8095	160	0.1228
3.8571	162	0.1321
3.9048	164	0.1373
3.9524	166	0.1313
4.0	168	0.0746
4.0476	170	0.1188
4.0952	172	0.1443
4.1429	174	0.095
4.1905	176	0.1227
4.2381	178	0.1197
4.2857	180	0.1102
4.3333	182	0.133
4.3810	184	0.0993
4.4286	186	0.1354
4.4762	188	0.1143
4.5238	190	0.1326
4.5714	192	0.0927
4.6190	194	0.1085
4.6667	196	0.1181
4.7143	198	0.1131
4.7619	200	0.1136
4.8095	202	0.1045
4.8571	204	0.1268
4.9048	206	0.1133
4.9524	208	0.1274
5.0	210	0.0607

Framework Versions

Python: 3.10.13
Sentence Transformers: 3.0.1
Transformers: 4.39.3
PyTorch: 2.1.2
Accelerate: 0.29.3
Datasets: 2.18.0
Tokenizers: 0.15.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup}, 
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}