Osama-Rakan-Al-Mraikhat committed on
Commit 6a28e8e · verified · 1 Parent(s): bac1e15

Initial upload: NeoAraBERT_DA

Files changed (10)
  1. README.md +52 -0
  2. config.json +31 -0
  3. model.py +446 -0
  4. model.safetensors +3 -0
  5. rotary.py +61 -0
  6. special_tokens_map.json +10 -0
  7. tokenizer.json +0 -0
  8. tokenizer.py +158 -0
  9. tokenizer_config.json +72 -0
  10. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,52 @@
+ ---
+ license: cc-by-sa-4.0
+ language:
+ - ar
+ base_model:
+ - U4RASD/NeoAraBERT_DA
+ tags:
+ - neoarabert
+ - neobert
+ - bert
+ - Dialect
+ - masked-language-model
+ - custom_code
+ pipeline_tag: feature-extraction
+ library_name: Transformers
+ ---
+ # NeoAraBERT_DA
+ NeoAraBERT is a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pretrain NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic-tailored ablation studies including text normalization, light stemming, and diacritics-aware tokenization. We also performed POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a synonym-based task, [Muradif](https://huggingface.co/datasets/U4RASD/Muradif), that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants rank first in 18 tasks and improve average performance across the full benchmark suite.
+
+ This model was introduced at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). For more information, visit our website: https://www.acrps.ai/neoarabert.
+
+ ### How to Use
+ Install these libraries:
+ ```
+ pip install fast-disambig torch==2.5.1 transformers xformers==0.0.28.post3
+ ```
+ Load the model and use it to generate embeddings:
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ model_name = "U4RASD/NeoAraBERT_DA"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
+
+ # Tokenize input text
+ text = "المركز العربيّ للأبحاث ودراسة السياسات."
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # Generate embeddings
+ outputs = model(**inputs)
+ embedding = outputs.last_hidden_state[:, 0, :]
+ print(embedding.shape)
+ ```
+
+ ### Citation
+ If you use the code, model, or the Muradif benchmark, please reference this work in your paper:
+ ```bibtex
+ The citation will be added here soon.
+ ```
+
+ ### License
+ This model is licensed under the CC BY-SA 4.0 license. The text of the license can be found [here](https://creativecommons.org/licenses/by-sa/4.0/).
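The README example takes the first-token vector of `last_hidden_state` as the text embedding. A minimal sketch of comparing two such vectors follows, assuming cosine similarity is the metric of interest (the README does not state how Muradif scores pairs). The helper name `cosine_similarity` and the toy vectors are illustrative and not part of this commit.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional stand-ins for the 768-dimensional vectors the model returns.
emb_a = [0.2, -0.1, 0.4, 0.3]
emb_b = [0.1, -0.2, 0.5, 0.2]
score = cosine_similarity(emb_a, emb_b)
print(score)
```

With real model output, something like `embedding[0].tolist()` would presumably supply the input lists.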
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "architectures": [
+     "NeoBERTLMHead"
+   ],
+   "auto_map": {
+     "AutoConfig": "model.NeoBERTConfig",
+     "AutoModel": "model.NeoBERT",
+     "AutoModelForMaskedLM": "model.NeoBERTLMHead",
+     "AutoModelForSequenceClassification": "model.NeoBERTForSequenceClassification"
+   },
+   "classifier_init_range": 0.02,
+   "decoder_init_range": 0.02,
+   "dim_head": 64,
+   "embedding_init_range": 0.02,
+   "hidden_size": 768,
+   "intermediate_size": 3072,
+   "kwargs": {
+     "classifier_init_range": 0.02,
+     "trust_remote_code": true
+   },
+   "max_length": 1024,
+   "model_type": "neobert",
+   "norm_eps": 1e-05,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 28,
+   "pad_token_id": 0,
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.2",
+   "trust_remote_code": true,
+   "vocab_size": 65000
+ }
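Two derived sizes follow from these config values: the per-head width stored as `dim_head`, and the SwiGLU hidden width that `EncoderBlock` in model.py computes from `intermediate_size`. The sketch below repeats that arithmetic in plain Python. The variable `swiglu_width` is an illustrative name, and the 2/3 rationale in the comment is the usual explanation for SwiGLU sizing rather than something this commit states.

```python
# Values copied from config.json in this commit.
hidden_size = 768
num_attention_heads = 12
intermediate_size = 3072
multiple_of = 8  # rounding granularity used by EncoderBlock in model.py

# Per-head width; this should match the stored "dim_head" entry.
dim_head = hidden_size // num_attention_heads

# SwiGLU has three projections instead of two, so the nominal FFN width is
# scaled by 2/3 and then rounded up to a multiple of 8.
swiglu_width = int(2 * intermediate_size / 3)
swiglu_width = multiple_of * ((swiglu_width + multiple_of - 1) // multiple_of)

print(dim_head, swiglu_width)
```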
model.py ADDED
@@ -0,0 +1,446 @@
+ # From https://github.com/facebookresearch/llama/blob/main/llama/model.py
+
+ import torch
+ from torch import nn
+
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+ from torch.nn.functional import scaled_dot_product_attention
+
+ from typing import Optional, Tuple
+ import numpy as np
+
+ from xformers.ops import SwiGLU
+
+ try:
+     from flash_attn.flash_attn_interface import flash_attn_varlen_func
+
+     FLASH_ATTN_AVAILABLE = True
+ except ImportError:
+     FLASH_ATTN_AVAILABLE = False
+
+ from transformers import (
+     PreTrainedModel,
+     PretrainedConfig,
+     DataCollatorForLanguageModeling,
+ )
+ from transformers.modeling_outputs import (
+     BaseModelOutput,
+     MaskedLMOutput,
+     SequenceClassifierOutput,
+ )
+
+ try:
+     import logging as _std_logging
+     from transformers.utils import logging as _hf_logging
+
+     class _DropNeoBERTLoadReport(_std_logging.Filter):
+         def filter(self, record):
+             return "LOAD REPORT" not in record.getMessage()
+
+     _hf_logging.get_logger("transformers.modeling_utils").addFilter(_DropNeoBERTLoadReport())
+ except Exception:
+     pass
+
+ from .rotary import precompute_freqs_cis, apply_rotary_emb
+
+
+ class DataCollatorWithPacking(DataCollatorForLanguageModeling):
+     def __init__(self, pack_sequences=False, **kwargs):
+         super().__init__(**kwargs)
+         self.pack_sequences = pack_sequences
+
+     def __call__(self, batch):
+         if self.pack_sequences:
+             # Add position_ids if not present
+             if "position_ids" not in batch[0]:
+                 for item in batch:
+                     item["position_ids"] = list(range(len(item["input_ids"])))
+
+             # Pack the sequences into a single list
+             input_ids_list = [item["input_ids"] for item in batch]
+             position_ids_list = [item["position_ids"] for item in batch]
+             seqlens = np.array([0] + [len(ids) for ids in input_ids_list])
+
+             packed_batch = {
+                 "position_ids": np.concatenate(position_ids_list, axis=0),
+                 "input_ids": np.concatenate(input_ids_list, axis=0),
+                 "cu_seqlens": np.cumsum(seqlens),
+                 "max_seqlen": max(seqlens),
+             }
+
+             batch = super().__call__([packed_batch])
+             batch["cu_seqlens"] = batch["cu_seqlens"].to(torch.int32).squeeze()
+         else:
+             batch = super().__call__(batch)
+             batch["attention_mask"] = batch["attention_mask"].to(torch.bool)
+
+         return batch
+
+
+ class NeoBERTConfig(PretrainedConfig):
+     model_type = "neobert"
+
+     # All config parameters must have a default value.
+     def __init__(
+         self,
+         hidden_size: int = 768,
+         num_hidden_layers: int = 28,
+         num_attention_heads: int = 12,
+         intermediate_size: int = 3072,
+         embedding_init_range: float = 0.02,
+         decoder_init_range: float = 0.02,
+         norm_eps: float = 1e-05,
+         vocab_size: int = 65000,
+         pad_token_id: int = 0,
+         max_length: int = 1024,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         if hidden_size % num_attention_heads != 0:
+             raise ValueError("Hidden size must be divisible by the number of heads.")
+         self.dim_head = hidden_size // num_attention_heads
+         self.intermediate_size = intermediate_size
+         self.embedding_init_range = embedding_init_range
+         self.decoder_init_range = decoder_init_range
+         self.norm_eps = norm_eps
+         self.vocab_size = vocab_size
+         self.pad_token_id = pad_token_id
+         self.max_length = max_length
+         self.kwargs = kwargs
+
+
+ class EncoderBlock(nn.Module):
+     """Transformer encoder block."""
+
+     def __init__(self, config: NeoBERTConfig):
+         super().__init__()
+
+         self.config = config
+
+         # Attention
+         self.qkv = nn.Linear(in_features=config.hidden_size, out_features=config.hidden_size * 3, bias=False)
+         self.wo = nn.Linear(in_features=config.hidden_size, out_features=config.hidden_size, bias=False)
+
+         # Feedforward network
+         multiple_of = 8
+         intermediate_size = int(2 * config.intermediate_size / 3)
+         intermediate_size = multiple_of * ((intermediate_size + multiple_of - 1) // multiple_of)
+         self.ffn = SwiGLU(config.hidden_size, intermediate_size, config.hidden_size, bias=False)
+
+         # Layer norms
+         self.attention_norm = nn.RMSNorm(config.hidden_size, config.norm_eps)
+         self.ffn_norm = nn.RMSNorm(config.hidden_size, config.norm_eps)
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         attention_mask: torch.Tensor,
+         freqs_cis: torch.Tensor,
+         output_attentions: bool,
+         max_seqlen: int = None,
+         cu_seqlens: torch.Tensor = None,
+     ):
+         # Attention
+         attn_output, attn_weights = self._att_block(
+             self.attention_norm(x), attention_mask, freqs_cis, output_attentions, max_seqlen, cu_seqlens
+         )
+
+         # Residual
+         x = x + attn_output
+
+         # Feed-forward
+         x = x + self.ffn(self.ffn_norm(x))
+
+         return x, attn_weights
+
+     def _att_block(
+         self,
+         x: torch.Tensor,
+         attention_mask: torch.Tensor,
+         freqs_cis: torch.Tensor,
+         output_attentions: bool,
+         max_seqlen: int = None,
+         cu_seqlens: torch.Tensor = None,
+     ):
+         batch_size, seq_len, _ = x.shape
+
+         xq, xk, xv = self.qkv(x).view(batch_size, seq_len, self.config.num_attention_heads, self.config.dim_head * 3).chunk(3, axis=-1)
+
+         xq, xk = apply_rotary_emb(xq, xk, freqs_cis)
+
+         # Attn block
+         attn_weights = None
+
+         # Flash attention if the tensors are packed
+         if cu_seqlens is not None:
+             attn = flash_attn_varlen_func(
+                 q=xq.squeeze(0),
+                 k=xk.squeeze(0),
+                 v=xv.squeeze(0),
+                 cu_seqlens_q=cu_seqlens,
+                 cu_seqlens_k=cu_seqlens,
+                 max_seqlen_q=max_seqlen,
+                 max_seqlen_k=max_seqlen,
+                 dropout_p=0.0,
+                 causal=False,
+             )
+         # Eager attention if attention weights are needed in the output
+         elif output_attentions:
+             attn_weights = xq.permute(0, 2, 1, 3) @ xk.permute(0, 2, 3, 1) / (xq.size(-1) ** 0.5)
+             if attention_mask is not None:
+                 attn_weights = attn_weights * attention_mask
+             attn_weights = attn_weights.softmax(-1)
+             attn = attn_weights @ xv.permute(0, 2, 1, 3)
+             attn = attn.transpose(1, 2)
+         # Fall back to SDPA otherwise
+         else:
+             attn = scaled_dot_product_attention(
+                 query=xq.transpose(1, 2),
+                 key=xk.transpose(1, 2),
+                 value=xv.transpose(1, 2),
+                 attn_mask=attention_mask.bool(),
+                 dropout_p=0,
+             ).transpose(1, 2)
+
+         return self.wo(attn.reshape(batch_size, seq_len, self.config.num_attention_heads * self.config.dim_head)), attn_weights
+
+
+ class NeoBERTPreTrainedModel(PreTrainedModel):
+     config_class = NeoBERTConfig
+     base_model_prefix = "model"
+     _supports_cache_class = True
+
+     def _init_weights(self, module):
+         if isinstance(module, nn.Linear):
+             module.weight.data.uniform_(-self.config.decoder_init_range, self.config.decoder_init_range)
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.uniform_(-self.config.embedding_init_range, self.config.embedding_init_range)
+
+
+ class NeoBERT(NeoBERTPreTrainedModel):
+     config_class = NeoBERTConfig
+
+     def __init__(self, config: NeoBERTConfig):
+         super().__init__(config)
+
+         self.config = config
+
+         self.encoder = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
+
+         # Ensures freqs_cis is moved to the same devices as the model. Non-persistent buffers are not saved in the state_dict.
+         freqs_cis = precompute_freqs_cis(config.hidden_size // config.num_attention_heads, config.max_length)
+         self.register_buffer("freqs_cis", freqs_cis, persistent=False)
+
+         self.transformer_encoder = nn.ModuleList()
+         for _ in range(config.num_hidden_layers):
+             self.transformer_encoder.append(EncoderBlock(config))
+
+         self.layer_norm = nn.RMSNorm(config.hidden_size, config.norm_eps)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         position_ids: torch.Tensor = None,
+         max_seqlen: int = None,
+         cu_seqlens: torch.Tensor = None,
+         attention_mask: torch.Tensor = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         output_hidden_states: bool = False,
+         output_attentions: bool = False,
+         **kwargs,
+     ):
+         # Initialize
+         hidden_states, attentions = [], []
+
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+         # Expand and repeat: (Batch, Length) -> (Batch, Heads, Length, Length)
+         if attention_mask is not None:
+             attention_mask = attention_mask.unsqueeze(1).unsqueeze(1).repeat(1, self.config.num_attention_heads, attention_mask.size(-1), 1)
+
+         # Checks to be done if inputs are packed sequences
+         if cu_seqlens is not None:
+             assert (
+                 FLASH_ATTN_AVAILABLE
+             ), "Flash-attention is not available. Please ''pip install flash_attn'', or provide un-packed sequences."
+             assert not output_attentions, "Output attentions is not supported when sequences are packed."
+             assert max_seqlen is not None, "Missing max_seqlen. It must be provided when cu_seqlens are not None."
+             assert (input_ids if input_ids is not None else inputs_embeds).shape[
+                 0
+             ] == 1, "Cumulative sequence lengths are provided but inputs are not packed."
+             assert (
+                 input_ids if input_ids is not None else inputs_embeds
+             ).is_cuda, "Packing uses an implementation of flash-attention and is only supported on GPU."
+
+         # RoPE
+         freqs_cis = (
+             self.freqs_cis[position_ids]
+             if position_ids is not None
+             else self.freqs_cis[: (input_ids if input_ids is not None else inputs_embeds).shape[1]].unsqueeze(0)
+         )
+
+         # Embedding
+         x = self.encoder(input_ids) if input_ids is not None else inputs_embeds
+
+         # Transformer encoder
+         for layer in self.transformer_encoder:
+             x, attn = layer(x, attention_mask, freqs_cis, output_attentions, max_seqlen, cu_seqlens)
+             if output_hidden_states:
+                 hidden_states.append(x)
+             if output_attentions:
+                 attentions.append(attn)
+
+         # Final normalization layer
+         x = self.layer_norm(x)
+
+         # Return the output of the last hidden layer
+         return BaseModelOutput(
+             last_hidden_state=x,
+             hidden_states=hidden_states if output_hidden_states else None,
+             attentions=attentions if output_attentions else None,
+         )
+
+
+ class NeoBERTLMHead(NeoBERTPreTrainedModel):
+     config_class = NeoBERTConfig
+
+     def __init__(self, config: NeoBERTConfig):
+         super().__init__(config)
+
+         self.config = config
+
+         self.model = NeoBERT(config)
+         self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
+
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: torch.Tensor,
+         position_ids: torch.Tensor = None,
+         max_seqlen: int = None,
+         cu_seqlens: torch.Tensor = None,
+         attention_mask: torch.Tensor = None,
+         output_hidden_states: bool = False,
+         output_attentions: bool = False,
+         **kwargs,
+     ):
+
+         output = self.model.forward(
+             input_ids=input_ids,
+             position_ids=position_ids,
+             max_seqlen=max_seqlen,
+             cu_seqlens=cu_seqlens,
+             attention_mask=attention_mask,
+             output_hidden_states=output_hidden_states,
+             output_attentions=output_attentions,
+         )
+         logits = self.decoder(output.last_hidden_state)
+
+         return MaskedLMOutput(
+             hidden_states=output.hidden_states if output_hidden_states else None,
+             attentions=output.attentions if output_attentions else None,
+             logits=logits,
+         )
+
+
+ class NeoBERTForSequenceClassification(NeoBERTPreTrainedModel):
+     config_class = NeoBERTConfig
+
+     def __init__(self, config: NeoBERTConfig):
+         super().__init__(config)
+
+         self.config = config
+
+         self.num_labels = getattr(config, "num_labels", 2)
+         self.classifier_dropout = getattr(config, "classifier_dropout", 0.1)
+         self.classifier_init_range = getattr(config, "classifier_init_range", 0.02)
+
+         self.model = NeoBERT(config)
+
+         self.dense = nn.Linear(self.config.hidden_size, self.config.hidden_size)
+         self.dropout = nn.Dropout(self.classifier_dropout)
+         self.classifier = nn.Linear(self.config.hidden_size, self.num_labels)
+
+         self.post_init()
+
+     def _init_weights(self, module):
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=self.classifier_init_range)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         position_ids: torch.Tensor = None,
+         max_seqlen: int = None,
+         cu_seqlens: torch.Tensor = None,
+         attention_mask: torch.Tensor = None,
+         output_hidden_states: bool = False,
+         output_attentions: bool = False,
+         labels: Optional[torch.Tensor] = None,
+         return_dict: Optional[bool] = None,
+     ):
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         output = self.model.forward(
+             input_ids=input_ids,
+             position_ids=position_ids,
+             max_seqlen=max_seqlen,
+             cu_seqlens=cu_seqlens,
+             attention_mask=attention_mask,
+             output_hidden_states=output_hidden_states,
+             output_attentions=output_attentions,
+         )
+         hidden_states = output.last_hidden_state
+
+         x = hidden_states[:, 0, :]
+         x = self.dropout(x)
+         x = self.dense(x)
+         x = torch.tanh(x)
+         x = self.dropout(x)
+
+         logits = self.classifier(x)
+
+         loss = None
+         if labels is not None:
+             if self.config.problem_type is None:
+                 if self.num_labels == 1:
+                     self.config.problem_type = "regression"
+                 elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                     self.config.problem_type = "single_label_classification"
+                 else:
+                     self.config.problem_type = "multi_label_classification"
+
+             if self.config.problem_type == "regression":
+                 loss_fct = MSELoss()
+                 if self.num_labels == 1:
+                     loss = loss_fct(logits.squeeze(), labels.squeeze())
+                 else:
+                     loss = loss_fct(logits, labels)
+             elif self.config.problem_type == "single_label_classification":
+                 loss_fct = CrossEntropyLoss()
+                 loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+             elif self.config.problem_type == "multi_label_classification":
+                 loss_fct = BCEWithLogitsLoss()
+                 loss = loss_fct(logits, labels)
+
+         if not return_dict:
+             result = (logits,)
+             return ((loss,) + result) if loss is not None else result
+
+         return SequenceClassifierOutput(
+             loss=loss,
+             logits=logits,
+             hidden_states=output.hidden_states if output_hidden_states else None,
+             attentions=output.attentions if output_attentions else None,
+         )
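A note on `EncoderBlock._att_block`: the eager branch (taken when `output_attentions=True`) multiplies the raw scores by the expanded mask, while the SDPA branch passes a boolean mask. For padded batches these two may not give the same weights. The sketch below only illustrates the arithmetic on a toy row of scores in plain Python; it is not a patch to model.py, and the `softmax` helper is written here for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax that tolerates -inf entries."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 3.0]  # raw attention scores for one query
keep = [1, 1, 0]          # 1 = real token, 0 = padding

# Multiplicative masking, as in the eager branch: the padded score becomes 0.0,
# which still receives probability mass after softmax.
multiplied = softmax([s * m for s, m in zip(scores, keep)])

# Fill masking: padded scores are set to -inf before softmax, so their weight is 0.
filled = softmax([s if m else -math.inf for s, m in zip(scores, keep)])

print(multiplied, filled)
```

Whether the difference matters in practice depends on whether callers request attention weights for padded inputs.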
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:98d9e6783d2536af7221de8729cb5b262064df11e3228577cde499f8988979b9
+ size 1192538432
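The pointer's `size` field can be cross-checked against the architecture. The sketch below counts parameters from the shapes in config.json and model.py. It assumes the checkpoint stores the `NeoBERTLMHead` weights in float32, that the decoder is not tied to the embedding, and that the non-persistent `freqs_cis` buffer is not saved. These are assumptions about the file, not facts read from it.

```python
# Shapes taken from config.json and the module definitions in model.py.
hidden, layers, vocab = 768, 28, 65000
ffn_width = 2048  # int(2 * 3072 / 3), already a multiple of 8

embedding = vocab * hidden
per_layer = (
    hidden * 3 * hidden         # fused qkv projection, no bias
    + hidden * hidden           # output projection wo, no bias
    + 3 * hidden * ffn_width    # SwiGLU: three weight matrices, no bias
    + 2 * hidden                # two RMSNorm weight vectors
)
final_norm = hidden
decoder = hidden * vocab + vocab  # nn.Linear with bias

params = embedding + layers * per_layer + final_norm + decoder
expected_bytes = 4 * params  # float32 is 4 bytes per parameter

pointer_size = 1192538432  # "size" field of the LFS pointer
overhead = pointer_size - expected_bytes
print(params, overhead)
```

Under these assumptions the tensor data accounts for nearly all of the file, and the small remainder would be consistent with a safetensors metadata header.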
rotary.py ADDED
@@ -0,0 +1,61 @@
+ # From https://github.com/facebookresearch/llama/blob/main/llama/model.py
+
+ import torch
+ from typing import Tuple
+
+
+ def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
+     """
+     Precompute the frequency tensor for complex exponentials (cis) with given dimensions.
+
+     This function calculates a frequency tensor with complex exponentials using the given dimension 'dim'
+     and the end index 'end'. The 'theta' parameter scales the frequencies.
+     The returned tensor contains complex values in complex64 data type.
+
+     Args:
+         dim (int): Dimension of the frequency tensor.
+         end (int): End index for precomputing frequencies.
+         theta (float, optional): Scaling factor for frequency computation. Defaults to 10000.0.
+
+     Returns:
+         torch.Tensor: Precomputed frequency tensor with complex exponentials.
+     """
+
+     freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
+     t = torch.arange(end, device=freqs.device)
+     freqs = torch.outer(t, freqs).float()
+     return torch.polar(torch.ones_like(freqs), freqs)
+
+
+ def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
+     assert freqs_cis.shape[1:] == (x.shape[1], x.shape[-1])
+     return freqs_cis.contiguous().unsqueeze(2)
+
+
+ def apply_rotary_emb(
+     xq: torch.Tensor,
+     xk: torch.Tensor,
+     freqs_cis: torch.Tensor,
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+     """
+     Apply rotary embeddings to input tensors using the given frequency tensor.
+
+     This function applies rotary embeddings to the given query 'xq' and key 'xk' tensors using the provided
+     frequency tensor 'freqs_cis'. The input tensors are reshaped as complex numbers, and the frequency tensor
+     is reshaped for broadcasting compatibility. The resulting tensors contain rotary embeddings and are
+     returned as real tensors.
+
+     Args:
+         xq (torch.Tensor): Query tensor to apply rotary embeddings.
+         xk (torch.Tensor): Key tensor to apply rotary embeddings.
+         freqs_cis (torch.Tensor): Precomputed frequency tensor for complex exponentials.
+
+     Returns:
+         Tuple[torch.Tensor, torch.Tensor]: Tuple of modified query tensor and key tensor with rotary embeddings.
+     """
+     xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
+     xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
+     freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
+     xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
+     xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
+     return xq_out.type_as(xq), xk_out.type_as(xk)
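The property that makes rotary embeddings useful is that the query-key score depends only on the relative offset between positions. The sketch below reproduces the same rotation with the standard library's `cmath` instead of torch, on a single toy head of width 4, and evaluates the score at two position pairs that share an offset of 3. The function names `freqs_cis`, `rotate` and `score` are illustrative and are not the functions in rotary.py.

```python
import cmath
from typing import List


def freqs_cis(dim: int, end: int, theta: float = 10000.0) -> List[List[complex]]:
    """Unit complex rotations e^{i * t * f_k} for every position t < end."""
    freqs = [1.0 / theta ** (k / dim) for k in range(0, dim, 2)]
    return [[cmath.rect(1.0, t * f) for f in freqs] for t in range(end)]


def rotate(vec: List[float], rot: List[complex]) -> List[complex]:
    """Pair consecutive components into complex numbers and rotate each pair."""
    pairs = [complex(vec[i], vec[i + 1]) for i in range(0, len(vec), 2)]
    return [p * r for p, r in zip(pairs, rot)]


def score(q: List[complex], k: List[complex]) -> float:
    """Real dot product of two vectors stored as complex pairs."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))


table = freqs_cis(dim=4, end=16)
q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]

# Same relative offset (3) at two different absolute positions.
s1 = score(rotate(q, table[5]), rotate(k, table[2]))
s2 = score(rotate(q, table[12]), rotate(k, table[9]))
print(s1, s2)
```

The two scores agree up to floating-point error, which is the relative-position behaviour the rotation is designed to give.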
special_tokens_map.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "additional_special_tokens": [
+     "[+]"
+   ],
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.py ADDED
@@ -0,0 +1,158 @@
+ from typing import List, Tuple
+ from transformers import PreTrainedTokenizerFast
+ import re
+ import fast_disambig
+
+ _TATWEEL_RE = re.compile(r"\u0640")
+ _ALIF_RE = re.compile(r"[آأإٱ]")
+ _ALIF_MAK_RE = re.compile(r"ى")
+ _TEH_MARB_RE = re.compile(r"ة")
+ _ZERO_WIDTH_RE = re.compile(r"[\u200B-\u200D\u200E\u200F\uFEFF]")
+ ARABIC_DIACRITICS = {
+     "ً", "ٌ", "ٍ",
+     "َ", "ُ", "ِ",
+     "ّ", "ْ",
+     "ٗ", "٘", "ٙ", "ٚ", "ٛ", "ٜ", "ٝ", "ٞ", "ٟ",
+     "ؐ", "ؑ", "ؒ", "ؓ", "ؔ", "ؕ", "ؖ", "ؗ", "ؘ", "ؙ", "ؚ",
+     "ۖ", "ۗ", "ۘ", "ۙ", "ۚ", "ۛ", "ۜ", "۟", "۠", "ۡ", "ۢ", "ۣ", "ۤ", "ۧ", "ۨ",
+     "۪", "۫", "۬", "ۭ",
+ }
+
+ def separate_diacritics(text):
+     tokens = re.split(r'(\s+|\[\+\])', text)
+     processed_tokens = []
+
+     for token in tokens:
+         if not token:
+             continue
+         if token.isspace() or token == '[+]':
+             processed_tokens.append(token)
+             continue
+
+         if not any(c in ARABIC_DIACRITICS for c in token):
+             processed_tokens.append(token)
+             continue
+
+         base_chars = []
+         diac_groups = []
+
+         for char in token:
+             if char in ARABIC_DIACRITICS:
+                 if not diac_groups:
+                     base_chars.append(" ")
+                     diac_groups.append([])
+                 diac_groups[-1].append(char)
+             else:
+                 base_chars.append(char)
+                 diac_groups.append([])
+
+         base_word = "".join(base_chars)
+         diac_string = []
+         for group in diac_groups:
+             if group:
+                 diac_string.append("".join(group))
+             else:
+                 diac_string.append("◌")
+
+         processed_tokens.append(base_word + " " + "".join(diac_string))
+     return "".join(processed_tokens)
+
+ def normalize_arabic(text):
+     text = _TATWEEL_RE.sub("", text)
+     text = _ZERO_WIDTH_RE.sub("", text)
+     text = _ALIF_RE.sub("ا", text)
+     text = _ALIF_MAK_RE.sub("ي", text)
+     text = _TEH_MARB_RE.sub("ه", text)
+     return text
+
+ class ArabicMorphTokenizer(PreTrainedTokenizerFast):
+     slow_tokenizer_class = None
+
+     def __init__(self, tokenizer_file=None, apply_stemming=True, **kwargs):
+         super().__init__(tokenizer_file=tokenizer_file, **kwargs)
+         self.apply_stemming = apply_stemming
+         if self.apply_stemming:
+             self.stemmer = fast_disambig.camel.Stemmer()
+
+
+     def _preprocess_one(self, s, do_stem):
+         if isinstance(s, (list, tuple)):
+             return [self._preprocess_one(x, do_stem) for x in s]
+         if do_stem:
+             s = self.stemmer.stem(s, preserve_diacritics=True)
+         s = normalize_arabic(s)
+         s = separate_diacritics(s)
+         return s
+
+     def _preprocess_pair(self, text, text_pair, do_stem):
+         def maybe(s):
+             return self._preprocess_one(s, do_stem) if isinstance(s, str) else s
+         if isinstance(text, (list, tuple)):
+             text = [maybe(x) for x in text]
+         else:
+             text = maybe(text)
+         if isinstance(text_pair, (list, tuple)):
+             text_pair = [maybe(x) for x in text_pair]
+         else:
+             text_pair = maybe(text_pair)
+         return text, text_pair
+
+     def _pop_flag(self, kwargs):
+         v = kwargs.pop("apply_stemming", None)
+         return self.apply_stemming if v is None else bool(v)
+
+     def __call__(self, text=None, text_pair=None, *args, **kwargs):
+         flag = self._pop_flag(kwargs)
+         if not getattr(self, "_processing", False):
+             self._processing = True
+             try:
+                 text, text_pair = self._preprocess_pair(text, text_pair, flag)
+                 return super().__call__(text=text, text_pair=text_pair, *args, **kwargs)
+             finally:
+                 self._processing = False
+         return super().__call__(text=text, text_pair=text_pair, *args, **kwargs)
+
+     def encode(self, text, text_pair=None, *args, **kwargs):
+         flag = self._pop_flag(kwargs)
+         if not getattr(self, "_processing", False):
+             self._processing = True
+             try:
+                 text, text_pair = self._preprocess_pair(text, text_pair, flag)
+                 return super().encode(text, text_pair, *args, **kwargs)
+             finally:
+                 self._processing = False
+         return super().encode(text, text_pair, *args, **kwargs)
+
+     def encode_plus(self, text=None, text_pair=None, *args, **kwargs):
+         flag = self._pop_flag(kwargs)
+         if not getattr(self, "_processing", False):
+             self._processing = True
+             try:
+                 text, text_pair = self._preprocess_pair(text, text_pair, flag)
+                 return super().encode_plus(text=text, text_pair=text_pair, *args, **kwargs)
+             finally:
+                 self._processing = False
+         return super().encode_plus(text=text, text_pair=text_pair, *args, **kwargs)
+
+     def batch_encode_plus(self, batch_text_or_text_pairs=None, *args, **kwargs):
+         flag = self._pop_flag(kwargs)
+         if not getattr(self, "_processing", False):
+             self._processing = True
+             try:
+                 data = batch_text_or_text_pairs
+                 if isinstance(data, (list, tuple)):
+                     new_data = []
+                     for item in data:
+                         if isinstance(item, (list, tuple)) and len(item) == 2:
+                             new_data.append(self._preprocess_pair(item[0], item[1], flag))
+                         else:
+                             new_data.append(self._preprocess_one(item, flag))
+                     batch_text_or_text_pairs = new_data
+                 return super().batch_encode_plus(batch_text_or_text_pairs=batch_text_or_text_pairs, *args, **kwargs)
+             finally:
+                 self._processing = False
+         return super().batch_encode_plus(batch_text_or_text_pairs=batch_text_or_text_pairs, *args, **kwargs)
+
+     def preprocess(self, text, apply_stemming=True):
+         flag = self.apply_stemming if apply_stemming is None else bool(apply_stemming)
+         return self._preprocess_one(text, flag)
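The character normalization in `normalize_arabic` can be read as a single lookup table. The sketch below expresses the same mapping with `str.translate`, assuming the five regex substitutions are independent. They appear to be, since none of the replacement characters is itself matched by a later rule. The name `normalize_arabic_translate` is hypothetical, and the stemming and diacritic-separation steps are left out.

```python
# Same character mapping as normalize_arabic, expressed as one translation table.
# Code points are written as escapes so the table is unambiguous in any editor.
_NORMALIZE = {
    0x0640: None,    # tatweel: removed
    0x0622: 0x0627,  # alif with madda -> bare alif
    0x0623: 0x0627,  # alif with hamza above -> bare alif
    0x0625: 0x0627,  # alif with hamza below -> bare alif
    0x0671: 0x0627,  # alif wasla -> bare alif
    0x0649: 0x064A,  # alif maksura -> yeh
    0x0629: 0x0647,  # teh marbuta -> heh
}
# Zero-width and directional marks matched by _ZERO_WIDTH_RE: removed.
for cp in (0x200B, 0x200C, 0x200D, 0x200E, 0x200F, 0xFEFF):
    _NORMALIZE[cp] = None


def normalize_arabic_translate(text: str) -> str:
    """Apply the whole mapping in one pass over the string."""
    return text.translate(_NORMALIZE)


# A word ending in teh marbuta, and a word with a hamza-alif and a tatweel.
word_a = normalize_arabic_translate("\u0645\u062F\u0631\u0633\u0629")
word_b = normalize_arabic_translate("\u0623\u062D\u0645\u0640\u062F")
print(word_a, word_b)
```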
tokenizer_config.json ADDED
@@ -0,0 +1,72 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "[+]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "[+]"
+   ],
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "ArabicMorphTokenizer",
+   "trust_remote_code": true,
+   "unk_token": "[UNK]",
+   "auto_map": {
+     "AutoTokenizer": ["tokenizer.ArabicMorphTokenizer", null]
+   },
+   "apply_stemming": true
+ }
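The `model_max_length` value here equals `int(1e30)`, which reads as "no limit recorded", whereas config.json sets `max_length` to 1024 and model.py precomputes rotary frequencies for only that many positions. A caller may therefore want to truncate explicitly. The helper `effective_max_length` below is hypothetical and only illustrates picking the tighter of the two limits.

```python
# Values copied from tokenizer_config.json and config.json in this commit.
TOKENIZER_MODEL_MAX_LENGTH = 1000000000000000019884624838656
CONFIG_MAX_LENGTH = 1024


def effective_max_length(tokenizer_limit: int, model_limit: int) -> int:
    """Hypothetical helper: the sequence length a caller could truncate to."""
    return min(tokenizer_limit, model_limit)


limit = effective_max_length(TOKENIZER_MODEL_MAX_LENGTH, CONFIG_MAX_LENGTH)
print(limit)
```

With the real tokenizer this would presumably correspond to passing `truncation=True, max_length=1024` in the call.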
vocab.txt ADDED
The diff for this file is too large to render. See raw diff