Frontier vision-language models can read a chess position from FEN and a molecule from SMILES — but ask the same question once as text and once as an image, and they often give different answers. SEAM is a benchmark that pairs semantically equivalent text and image inputs across four expert domains, isolating modality from task and exposing how far models still are from reasoning the same way regardless of how a problem arrives.
## Background
Pre-SEAM, "multimodal" benchmarks usually pit OCR-style images-of-text against text, so any modality gap was confounded with reading-vs-knowing. SEAM uses each domain's native notation on both sides — FEN ↔ chess board, SMILES ↔ molecule diagram, MusicXML ↔ score, edge-list ↔ graph drawing — so the two modalities provably carry the same information. A measured gap is then a model failure, not an information asymmetry.
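The "same information on both sides" claim is easy to see for chess: FEN is a lossless text encoding of the board, so a few lines of code can expand it into the grid a rendered diagram shows. A minimal sketch (function name and ASCII output format are illustrative, not from the benchmark):

```python
def fen_to_board(fen: str) -> str:
    """Expand the piece-placement field of a FEN string into an 8x8 ASCII grid."""
    placement = fen.split()[0]              # first FEN field: piece placement
    rows = []
    for rank in placement.split("/"):       # ranks are separated by "/"
        row = []
        for ch in rank:
            if ch.isdigit():
                row.extend("." * int(ch))   # a digit encodes a run of empty squares
            else:
                row.append(ch)              # a letter is a piece (case = color)
        rows.append(" ".join(row))
    return "\n".join(rows)

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(fen_to_board(start))
```

Because the mapping is deterministic in both directions, any accuracy gap between the FEN condition and the board-image condition is attributable to the model, not to missing information.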
## What's in the benchmark
Four expert domains (chess, chemistry, music, graph theory) × four tasks each = 16 tasks; ~200 base items per task gives 3,200 questions, each evaluated under three input conditions (Language-only, Vision-only, Vision+Language) for 9,600 evaluations total. The authors also apply visual transformations as robustness checks, so the modality gap can't be hand-waved as a rendering artifact.
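The benchmark's size follows directly from its factorial design. A small sketch of the arithmetic (the condition labels are the paper's; the variable names are mine):

```python
# 4 expert domains, each contributing 4 tasks
domains = {"chess": 4, "chemistry": 4, "music": 4, "graph_theory": 4}
tasks = sum(domains.values())                # 16 tasks
base_items_per_task = 200                    # ~200 base items each

# Every question is posed under three input conditions
conditions = ["language_only", "vision_only", "vision_language"]

questions = tasks * base_items_per_task      # 3,200 questions
evaluations = questions * len(conditions)    # 9,600 evaluations per model
print(tasks, questions, evaluations)         # 16 3200 9600
```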
## Results
Twenty-one frontier models were evaluated, from GPT-5, Claude 4.x, Qwen2.5-VL, and InternVL3 down to Llama 3.2-Vision and gemma-3-27b. The top of the leaderboard:
| Model | Average | Language | Vision+Lang |
|---|---|---|---|
| GPT-5 | 0.765 | 0.804 | 0.857 |
| GPT-5-mini | 0.756 | — | — |
| Claude-4.1-Opus | 0.740 | — | — |
*Top of the SEAM leaderboard: average accuracy across the 16 tasks. Live results: SEAM leaderboard.*
## What we found
- Vision systematically lags language across nearly all 21 models, even though information content is identical.
- Cross-modal agreement is sometimes near-random. The same model often gives different answers to the text and image versions of the same question.
- Two distinct error sources: tokenization-driven text-side failures (e.g., SMILES fragmented into chemically meaningless tokens) and visual hallucinations of structures that aren't in the image.
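The text-side failure mode is easy to demonstrate: a tokenizer with no chemical awareness splits SMILES into fragments that cross atom boundaries, e.g. cutting the two-letter element `Cl` into `C` + `l`. A minimal sketch contrasting character-level splitting with a chemistry-aware regex (the pattern is a deliberate simplification for illustration, not the benchmark's or any model's actual tokenizer):

```python
import re

# Simplified SMILES-aware pattern: bracket atoms ([nH], [O-], ...) and
# two-letter elements (Cl, Br) must stay whole; then single-letter atoms,
# bonds, ring digits, and branch parentheses.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|[=#()\d@+\-/\\%]")

def smiles_tokens(s: str) -> list[str]:
    """Tokenize a SMILES string into chemically meaningful units."""
    return SMILES_TOKEN.findall(s)

chlorobenzene = "Clc1ccccc1"
print(smiles_tokens(chlorobenzene))  # 'Cl' kept intact as one chlorine atom
print(list(chlorobenzene))           # naive split: 'Cl' shattered into 'C', 'l'
```

A subword vocabulary learned on web text behaves closer to the naive split: fragments like `C`, `l`, or `cc1` carry no stable chemical meaning, which is exactly the text-side error source described above.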
Today's "VLMs" act more like language reasoners with a weaker visual channel attached. SEAM gives that intuition a number.