Sentence simplification — rewriting a sentence to be easier to read without losing what it says — has had English benchmarks for years. For Pashto, Thai, and Tamil it has had none. OasisSimp closes that gap with native-speaker simplifications across English, Sinhala, Tamil, Pashto, and Thai, and uses the new resource to show that today's multilingual LLMs are nowhere near solving simplification once you leave English.
Background
Simplification is a workhorse for accessibility, education, and civic content — turning legalese into plain language, turning a dense newspaper paragraph into one a learner can read. Almost all benchmark progress has been measured in English, against single-reference targets, with metrics like SARI tuned to that setting. The result is a misleading picture: an LLM that looks competent in English may produce near-unusable output in a low-resource language, and the field has had no clean way to say so.
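To make the SARI mention concrete, here is a deliberately simplified, unigram set-based illustration of the metric's core idea: scoring how well a system output adds, keeps, and deletes words relative to the source, judged against reference simplifications. This is not the official implementation (which uses n-grams up to length 4 and per-reference weighting); it is a sketch of the scoring logic only.

```python
def simple_sari(source, prediction, references):
    """Unigram, set-based sketch of SARI: average the scores for words
    correctly added, kept, and deleted relative to the source, judged
    against reference simplifications. Illustrative only."""
    src = set(source.lower().split())
    pred = set(prediction.lower().split())
    refs = set().union(*(set(r.lower().split()) for r in references))

    def f1(hits, pred_side, ref_side):
        if not pred_side and not ref_side:
            return 1.0                      # nothing to do counts as perfect
        p = len(hits) / len(pred_side) if pred_side else 0.0
        r = len(hits) / len(ref_side) if ref_side else 0.0
        return 2 * p * r / (p + r) if (p + r) else 0.0

    add = f1((pred - src) & (refs - src), pred - src, refs - src)
    keep = f1(pred & src & refs, pred & src, refs & src)
    # The official metric scores deletion by precision only.
    deleted, ref_deleted = src - pred, src - refs
    if not deleted and not ref_deleted:
        delete = 1.0
    else:
        delete = len(deleted & ref_deleted) / len(deleted) if deleted else 0.0
    return 100.0 * (add + keep + delete) / 3.0
```

An output that matches a reference exactly scores 100 under this sketch; outputs that miss reference additions or keep words the references dropped are penalized per operation.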
The dataset
Five languages, sourced from corpora that match how each language is actually written and read:
| Language | Source sentences | Avg. refs | Source corpus |
|---|---|---|---|
| English | 2,500 | 2.86 | The Globe and Mail |
| Sinhala | 2,500 | 5.00 | SiTa |
| Tamil | 520 | 4.66 | SiTa |
| Pashto | 2,500 | 3.00 | Wikipedia |
| Thai | 1,499 | 5.06 | ThaiSum |
OasisSimp data composition. Each source sentence has multiple reference simplifications written by trained native speakers; splits are 80% test / 20% validation per language. There is no train split — this is a benchmark, not a training corpus. License: CC BY 4.0.
Annotation was done by 3–6 native speakers per language, all with Bachelor's degrees or higher, after at least three rounds of guideline training covering rewording, splitting, deletion, and reordering. We do not report inter-annotator agreement — that is a known limitation worth flagging honestly rather than hiding.
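The 80% test / 20% validation split described above can be reproduced deterministically per language along these lines. This is an illustrative sketch, not OasisSimp's actual split procedure; the seed and assignment order are assumptions.

```python
import random

def make_splits(sentence_ids, seed=0):
    """Deterministic 80% test / 20% validation split for one language.
    Illustrative sketch: the benchmark's actual assignment may differ."""
    ids = sorted(sentence_ids)          # canonical order before shuffling
    random.Random(seed).shuffle(ids)    # fixed seed keeps the split stable
    cut = int(0.8 * len(ids))
    return {"test": ids[:cut], "validation": ids[cut:]}

splits = make_splits(range(2500))       # e.g. the 2,500 English sources
```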
Results
Eight open-weight multilingual LLMs were evaluated zero-shot and 5-shot: Aya-Expanse-8B, Command-R 7B, DeepSeek-LLM-7B-Chat, EuroLLM-9B-Instruct, Gemma-3-12B-it, Llama-3.2-3B-Instruct, Mistral-7B-Instruct-v0.2, Qwen2.5-7B-Instruct. Best per-language 5-shot scores:
| Language | Best model | SARI | BERTScore |
|---|---|---|---|
| English | Command-R 7B | 44.76 | 56.63 |
| Sinhala | Gemma-3-12B | 39.89 | 73.89 |
| Thai | Llama-3.2-3B | 40.23 | 68.91 |
| Tamil | Gemma-3-12B | 39.34 | 79.70 |
| Pashto | Command-R 7B | 30.95 | 70.52 |
Best 5-shot SARI / BERTScore per language. SARI in the 30s–40s is far from solved; the takeaway is that current LLMs underperform on low-resource simplification, not that they have it covered.
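The zero-shot and 5-shot setups differ only in whether demonstrations precede the test sentence. A minimal prompt-assembly sketch, where the instruction text and the `Sentence:`/`Simple:` labels are assumptions rather than the benchmark's exact wording:

```python
def build_prompt(source, examples=(), instruction="Simplify the following sentence."):
    """Assemble a zero-shot (examples=()) or k-shot simplification prompt.
    The instruction and field labels are placeholders; the evaluation's
    exact prompt wording is not reproduced here."""
    parts = [instruction]
    for src, simp in examples:                      # k demonstrations
        parts.append(f"Sentence: {src}\nSimple: {simp}")
    parts.append(f"Sentence: {source}\nSimple:")    # the model completes this
    return "\n\n".join(parts)

demos = [("The precipitation intensified overnight.",
          "It rained harder during the night.")] * 5
prompt = build_prompt("The statute was subsequently repealed.", demos)
```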
What we found
- Pashto is the hardest language by a clear margin (best SARI just 30.95).
- No model dominates across all five. Gemma-3-12B is the most consistent multilingual simplifier; Command-R 7B wins English and Pashto.
- Zero-shot is unreliable for Pashto and Thai — per-model variance is large, so reporting a single number understates the underlying instability.
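When per-model variance is this large, a mean ± standard deviation summary is a fairer report than a single best score. A small sketch of that reporting style; the scores below are invented for illustration, not results from the benchmark:

```python
from statistics import mean, stdev

# Hypothetical zero-shot SARI scores for eight models on one language
# (numbers made up for illustration only).
zero_shot_sari = [12.1, 28.4, 19.7, 30.2, 9.8, 25.5, 14.0, 27.3]

def summarize(scores):
    """Mean with sample standard deviation: a fairer summary than a
    single best score when per-model variance is large."""
    return mean(scores), stdev(scores)

avg, spread = summarize(zero_shot_sari)
print(f"zero-shot SARI: {avg:.1f} +/- {spread:.1f}")
```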