Sentence simplification — rewriting a sentence to be easier to read without losing what it says — has had English benchmarks for years. For Pashto, Thai, and Tamil it has had none. OasisSimp closes that gap with native-speaker simplifications across English, Sinhala, Tamil, Pashto, and Thai, and uses the new resource to show that today's multilingual LLMs are nowhere near solving simplification once you leave English.
Background
Simplification is a workhorse for accessibility, education, and civic content — turning legalese into plain language, turning a dense newspaper paragraph into one a learner can read. Almost all benchmark progress has been measured in English, against single-reference targets, with metrics like SARI tuned to that setting. The result is a misleading picture: an LLM that looks competent in English may produce near-unusable output in a low-resource language, and the field has had no clean way to say so.
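To make the SARI mention concrete, here is a deliberately simplified, unigram set-based illustration of the metric's core idea: scoring how well a system output adds, keeps, and deletes words relative to the source, judged against reference simplifications. This is not the official implementation (which uses n-grams up to length 4 and per-reference weighting); it is a sketch of the scoring logic only.

```python
def simple_sari(source, prediction, references):
    """Unigram, set-based sketch of SARI: average the scores for words
    correctly added, kept, and deleted relative to the source, judged
    against reference simplifications. Illustrative only."""
    src = set(source.lower().split())
    pred = set(prediction.lower().split())
    refs = set().union(*(set(r.lower().split()) for r in references))

    def f1(hits, pred_side, ref_side):
        if not pred_side and not ref_side:
            return 1.0                      # nothing to do counts as perfect
        p = len(hits) / len(pred_side) if pred_side else 0.0
        r = len(hits) / len(ref_side) if ref_side else 0.0
        return 2 * p * r / (p + r) if (p + r) else 0.0

    add = f1((pred - src) & (refs - src), pred - src, refs - src)
    keep = f1(pred & src & refs, pred & src, refs & src)
    # The official metric scores deletion by precision only.
    deleted, ref_deleted = src - pred, src - refs
    if not deleted and not ref_deleted:
        delete = 1.0
    else:
        delete = len(deleted & ref_deleted) / len(deleted) if deleted else 0.0
    return 100.0 * (add + keep + delete) / 3.0
```

An output that matches a reference exactly scores 100 under this sketch; outputs that miss reference additions or keep words the references dropped are penalized per operation.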
The dataset
Five languages, sourced from corpora that match how each language is actually written and read:
| Language | Source sentences | Avg. refs | Source corpus |
|---|---|---|---|
| English | 2,500 | 2.86 | The Globe and Mail |
| Sinhala | 2,500 | 5.00 | SiTa |
| Tamil | 520 | 4.66 | SiTa |
| Pashto | 2,500 | 3.00 | Wikipedia |
| Thai | 1,499 | 5.06 | ThaiSum |
OasisSimp data composition. Each source sentence has multiple reference simplifications written by trained native speakers; splits are 80% test / 20% validation per language. There is no train split — this is a benchmark, not a training corpus. License: CC BY 4.0.
Annotation was done by 3–6 native speakers per language, all with Bachelor's degrees or higher, after at least three rounds of guideline training covering rewording, splitting, deletion, and reordering. We do not report inter-annotator agreement — that is a known limitation worth flagging honestly rather than hiding.
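The 80% test / 20% validation split described above can be reproduced deterministically per language along these lines. This is an illustrative sketch, not OasisSimp's actual split procedure; the seed and assignment order are assumptions.

```python
import random

def make_splits(sentence_ids, seed=0):
    """Deterministic 80% test / 20% validation split for one language.
    Illustrative sketch: the benchmark's actual assignment may differ."""
    ids = sorted(sentence_ids)          # canonical order before shuffling
    random.Random(seed).shuffle(ids)    # fixed seed keeps the split stable
    cut = int(0.8 * len(ids))
    return {"test": ids[:cut], "validation": ids[cut:]}

splits = make_splits(range(2500))       # e.g. the 2,500 English sources
```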
Results
Eight open-weight multilingual LLMs were evaluated zero-shot and 5-shot: Aya-Expanse-8B, Command-R 7B, DeepSeek-LLM-7B-Chat, EuroLLM-9B-Instruct, Gemma-3-12B-it, Llama-3.2-3B-Instruct, Mistral-7B-Instruct-v0.2, Qwen2.5-7B-Instruct. Best per-language 5-shot scores:
| Language | Best model | SARI | BERTScore |
|---|---|---|---|
| English | Command-R 7B | 44.76 | 56.63 |
| Sinhala | Gemma-3-12B | 39.89 | 73.89 |
| Thai | Llama-3.2-3B | 40.23 | 68.91 |
| Tamil | Gemma-3-12B | 39.34 | 79.70 |
| Pashto | Command-R 7B | 30.95 | 70.52 |
Best 5-shot SARI / BERTScore per language. SARI in the 30s–40s is far from solved; the takeaway is that current LLMs underperform on low-resource simplification, not that they have it covered.
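The zero-shot and 5-shot setups differ only in whether demonstrations precede the test sentence. A minimal prompt-assembly sketch, where the instruction text and the `Sentence:`/`Simple:` labels are assumptions rather than the benchmark's exact wording:

```python
def build_prompt(source, examples=(), instruction="Simplify the following sentence."):
    """Assemble a zero-shot (examples=()) or k-shot simplification prompt.
    The instruction and field labels are placeholders; the evaluation's
    exact prompt wording is not reproduced here."""
    parts = [instruction]
    for src, simp in examples:                      # k demonstrations
        parts.append(f"Sentence: {src}\nSimple: {simp}")
    parts.append(f"Sentence: {source}\nSimple:")    # the model completes this
    return "\n\n".join(parts)

demos = [("The precipitation intensified overnight.",
          "It rained harder during the night.")] * 5
prompt = build_prompt("The statute was subsequently repealed.", demos)
```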
What we found
- Pashto is the hardest language by a clear margin (best SARI just 30.95).
- No model dominates across all five. Gemma-3-12B is the most consistent multilingual simplifier; Command-R 7B wins English and Pashto.
- Zero-shot is unreliable for Pashto and Thai — per-model variance is large, so reporting a single number understates the underlying instability.
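When per-model variance is this large, a mean ± standard deviation summary is a fairer report than a single best score. A small sketch of that reporting style; the scores below are invented for illustration, not results from the benchmark:

```python
from statistics import mean, stdev

# Hypothetical zero-shot SARI scores for eight models on one language
# (numbers made up for illustration only).
zero_shot_sari = [12.1, 28.4, 19.7, 30.2, 9.8, 25.5, 14.0, 27.3]

def summarize(scores):
    """Mean with sample standard deviation: a fairer summary than a
    single best score when per-model variance is large."""
    return mean(scores), stdev(scores)

avg, spread = summarize(zero_shot_sari)
print(f"zero-shot SARI: {avg:.1f} +/- {spread:.1f}")
```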