SERASA_BERT_OCR / app /preprocess.py


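import re
import unicodedata


def preprocess_text(text):
    """Normalize and clean raw text (for example, OCR output)."""
    # Apply Unicode compatibility normalization (ligatures, full-width forms, etc.).
    text = unicodedata.normalize("NFKC", text)
    # Remove URLs starting with "http" or "www.".
    text = re.sub(r"http\S+|www\.\S+", "", text)
    # Remove HTML-like tags.
    text = re.sub(r"<.*?>", "", text)
    # Replace anything other than word characters, accented Latin letters,
    # the punctuation ?!,. and spaces with a single space.
    text = re.sub(r"[^\wÀ-ÖØ-öø-ÿ?!,. ]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text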
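
The cleaning steps above can be traced one at a time on a sample string. This is only an illustrative sketch: the `sample` text and the `step1` to `step4` names are hypothetical and not part of the original file, and the URL pattern is assumed to be meant as an alternation (`|`) between the `http` and `www.` forms.

```python
import re
import unicodedata

# Hypothetical sample input (not from the original file).
sample = "Olá <b>mundo</b>! Veja http://example.com ou www.example.org #promoção"

# Same steps as preprocess_text, kept separate so each result can be inspected.
step1 = unicodedata.normalize("NFKC", sample)          # Unicode normalization
step2 = re.sub(r"http\S+|www\.\S+", "", step1)         # drop URLs
step3 = re.sub(r"<.*?>", "", step2)                    # drop HTML-like tags
step4 = re.sub(r"[^\wÀ-ÖØ-öø-ÿ?!,. ]", " ", step3)     # drop other symbols, e.g. "#"
result = re.sub(r"\s+", " ", step4).strip()            # collapse whitespace

print(result)  # → Olá mundo! Veja ou promoção
```

Accented Portuguese letters survive because they count as word characters, while the `#` is replaced by a space that the final whitespace step then collapses.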