	---
	license: apache-2.0
	language:
	- ar
	base_model:
	- UBC-NLP/AraT5v2-base-1024
	library_name: transformers
	tags:
	- TST
	- Arabic
	- Author_Style
	- AraGenEval
	---

	# AraStyleTransfer-21 | 21 Arabic Author Styles. One Model.

	🏆 First Place Winner at AraGenEval 2025 Competition

	AraStyleTransfer-21 is an Arabic text style transfer model that rewrites input text in the writing style of any one of 21 Arabic authors. The target author is selected with a descriptive author token embedded in the prompt.

	## 🔗 Paper Link (ACL Anthology)

	📘 [ANLPers at AraGenEval Shared Task: Descriptive Author Tokens for Transparent Arabic Authorship Style Transfer](https://aclanthology.org/2025.arabicnlp-sharedtasks.8.pdf)

	## 🏗️ Model Architecture

	- Base Model: UBC-NLP/AraT5v2-base-1024
	- Approach: Descriptive Author Tokens + Prompt Engineering
	- Input Format: `"اكتب النص التالي بأسلوب <author:name>: [text]"`
	- Training: Fine-tuned with author-specific tokens

	## 🔬 Technical Details

	### Stylometric Analysis
	The model incorporates comprehensive stylometric analysis including:
	- Lexical Features: Sentence length, word length, vocabulary richness
	- Syntactic Patterns: Definite articles, conjunctions, prepositions
	- Author-Specific Vocabulary: TF-IDF based characteristic words
	- Style Classification: Formality, complexity, emotional intensity
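
The lexical and syntactic features listed above can be approximated with a few lines of plain Python. The following is a minimal sketch, not the authors' analysis code: the function name `lexical_features`, the regular expressions, and the choice of features are assumptions made for illustration.

```python
import re

# Assumed helpers for illustration; the model card does not publish its analysis code.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")   # harakat and tatweel
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")        # runs of Arabic letters
SENTENCE_END = re.compile(r"[.!?\u061F]+")           # includes the Arabic question mark


def lexical_features(text):
    """Compute a few simple stylometric features for an Arabic text."""
    clean = DIACRITICS.sub("", text)
    words = ARABIC_WORD.findall(clean)
    # Keep only segments that actually contain at least one Arabic word.
    sentences = [s for s in SENTENCE_END.split(clean) if ARABIC_WORD.search(s)]
    if not words:
        return {
            "avg_sentence_len": 0.0,
            "avg_word_len": 0.0,
            "type_token_ratio": 0.0,
            "definite_article_ratio": 0.0,
        }
    # Words that begin with the definite article "ال" followed by at least one letter.
    definite = [w for w in words if w.startswith("ال") and len(w) > 2]
    return {
        "avg_sentence_len": len(words) / len(sentences),        # words per sentence
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),       # vocabulary richness
        "definite_article_ratio": len(definite) / len(words),
    }


if __name__ == "__main__":
    print(lexical_features("ذهب الولد إلى المدرسة. عاد الولد."))
```

Comparing these numbers across authors gives a rough picture of how their styles differ. TF-IDF vocabulary and style classification would need a corpus per author and are left out of this sketch.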

	### Prompt Engineering
	- Format: `"اكتب النص التالي بأسلوب <author:يوسف_إدريس>: [original_text]"`
	- Author Tokens: Descriptive tokens like `<author:يوسف_إدريس>`
	- Target: Generated text in author's style
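
The prompt format above can be wrapped in a small helper so that every request uses the same template. This is a sketch under stated assumptions: `build_prompt` and `author_token_name` are hypothetical names, and collapsing whitespace into underscores is inferred from the `<author:يوسف_إدريس>` example rather than documented behavior.

```python
# Template copied from the format shown in this model card.
PROMPT_TEMPLATE = "اكتب النص التالي بأسلوب <author:{token}>: {text}"


def author_token_name(author):
    """Join the parts of an author name with underscores (assumed token convention)."""
    return "_".join(author.split())


def build_prompt(text, author):
    """Return the full model input for one text and one target author."""
    return PROMPT_TEMPLATE.format(token=author_token_name(author), text=text.strip())


if __name__ == "__main__":
    print(build_prompt("لم أقم مطلقًا بالاحتفال بعيد ميلادي منذ طفولتي.", "يوسف إدريس"))
```

The author name must be one of the 21 supported authors. An unseen name would still produce a well-formed prompt, but the model has no token trained for it.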

	## 📚 Supported Authors

	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/qDHUSa6ZvD1LjN9uJs-jp.png" width="600"/>
	</p>


	## 📁 Input File Format

	For batch processing, your input file should have the following format:

	## 📊 Example Snippets from the Dataset

	| id | text_in_msa (partial) | text_in_author_style (partial) |
	|----|------------------------|--------------------------------|
	| 3835 | "لم أقم مطلقًا بالاحتفال بعيد ميلادي... وكنت أتجادل مع كامل الشناوي..." | "عمري ما احتفلت بعيد ميلادي... وأتشاجر مع كامل الشناوي على ذلك الاكتئاب..." |
	| 3836 | "الزمن العام هو العداد الجماعي الذي يسجل السنين... ويبرز الزمن الخاص..." | "الزمن العام يعدّ السنين للناس كلها... أما عدادك الخاص فأنت نادرًا ما تنظر فيه..." |
	| 3837 | "مصر الغنية الراقية... اشتراكية وديمقراطية تتفاعل معًا... أحلام الخمسين..." | "مصر المصنِّعة... الكون مائة زهرة... وحين أبلغ الخمسين أبدأ أعيش وأتعلم الموسيقى..." |
	| 3838 | "غرابة التجربة... طفولة جادة تمامًا بلا مرح... الطفولة كانت عيبًا..." | "غريبة هي الأفكار... كنتُ رجلًا رهيبًا في ثوب طفل... والطفولة تُهمة نخشى الاعتراف بها..." |
	| 3839 | "هذا ليس ندمًا... موجة تفوقك قوة... النصر الحقيقي أن تعيش كما تختار..." | "ليس مرارة ولا ندمًا... أنت تناضل موجة أعتى منك... والحق أن تحيا كما اخترت أنت..." |
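
For batch processing, rows like the ones above can be turned into prompts before generation. The sketch below assumes a CSV file with `id` and `text_in_msa` columns, mirroring the column names in the table. The file layout and the helper name `prompts_from_csv` are assumptions, since this card does not spell out the batch format.

```python
import csv
import io

# Template copied from the prompt format shown in this model card.
PROMPT_TEMPLATE = "اكتب النص التالي بأسلوب <author:{token}>: {text}"


def prompts_from_csv(handle, author, text_column="text_in_msa"):
    """Read rows from an open CSV file and return (id, prompt) pairs for one author."""
    token = author.replace(" ", "_")
    reader = csv.DictReader(handle)
    return [
        (row["id"], PROMPT_TEMPLATE.format(token=token, text=row[text_column].strip()))
        for row in reader
    ]


if __name__ == "__main__":
    # In-memory stand-in for a real file opened with open(path, encoding="utf-8", newline="").
    sample = io.StringIO("id,text_in_msa\n3835,لم أقم مطلقًا بالاحتفال بعيد ميلادي\n")
    for row_id, prompt in prompts_from_csv(sample, "يوسف إدريس"):
        print(row_id, prompt)
```

Each prompt can then be tokenized and passed to `model.generate`, and the outputs written back alongside the original `id` values.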


	## 📊 Performance Metrics

	- BLEU Score: 24.58
	- chrF Score: 59.01
	- Competition: First Place in AraGenEval 2025
	- Supported Authors: 21 Arabic authors

	These are the official results on the AraGenEval 2025 test set, where our prompt engineering system ranked first.
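
To make the chrF number more concrete, the sketch below computes a simplified character n-gram F-score for one hypothesis and one reference. It is an illustration only: `simple_chrf` is a hypothetical helper, it averages precision and recall over n-gram orders 1 to 6 with beta = 2, and it will not reproduce the official shared-task scores or the exact output of standard evaluation tools.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("عمري ما احتفلت بعيد ميلادي", "عمري ما احتفلت بعيد ميلادي أبدًا"))
```

Because it works on characters rather than whole words, a score of this kind still gives partial credit when the output differs from the reference only in affixes or spelling.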

	<p align="left">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/pCfAK4zefvXZ4YI1AvXIG.png" width="400"/>
	</p>

	## 🚀 Quick Start: Style Transfer Example

	```python
	from transformers import T5Tokenizer, T5ForConditionalGeneration
	import torch

	# Load model
	model_name = "Omartificial-Intelligence-Space/AraStyleTransfer-21"

	tokenizer = T5Tokenizer.from_pretrained(model_name)
	model = T5ForConditionalGeneration.from_pretrained(model_name)

	device = "cuda" if torch.cuda.is_available() else "cpu"
	model.to(device)

	# Input text and author
	text = "لم أقم مطلقًا بالاحتفال بعيد ميلادي منذ طفولتي."
	author = "يوسف إدريس"

	# Prompt format
	prompt = f"اكتب النص التالي بأسلوب <author:{author.replace(' ', '_')}>: {text}"

	# Tokenize
	inputs = tokenizer(prompt, return_tensors="pt").to(device)

	# Generate
	output_ids = model.generate(
	    **inputs,
	    max_length=256,
	    num_beams=5,
	    early_stopping=True,
	)

	# Decode
	generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

	print("Original:", text)
	print("Author:", author)
	print("Output:", generated_text)
	```


	## 🎯 Use Cases

	- Content Creation: Generate text in specific author styles
	- Educational Tools: Demonstrate different writing styles
	- Research: Study Arabic literary styles and patterns
	- Creative Writing: Inspire new content in classic styles

	## 🤝 Contributing

	This model was developed for the [AraGenEval 2025](https://ezzini.github.io/AraGenEval/) competition. For questions or contributions, please refer to the competition guidelines.

	## 📄 License

	This model is released under the Apache 2.0 license, the same license as the base AraT5v2 model.


	## BibTeX Citation

	```bibtex
	@inproceedings{nacar2025anlpers,
	title={ANLPers at AraGenEval Shared Task: Descriptive Author Tokens for Transparent Arabic Authorship Style Transfer},
	author={Nacar, Omer and Reda, Mahmoud and Sibaee, Serry and Alhabashi, Yasser and Ammar, Adel and Boulila, Wadii},
	booktitle={Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks},
	pages={49--53},
	year={2025}
	}
	```