Research Benchmark · COLM 2025

SEAM

A semantically equivalent across-modalities benchmark for vision-language models.

arXiv · OpenReview · GitHub · Dataset · Leaderboard
21 models evaluated · 16 tasks · 4 expert domains · 9,600 evaluations

Frontier vision-language models can read a chess position from FEN and a molecule from SMILES — but ask the same question once as text and once as an image, and they often give different answers. SEAM is a benchmark that pairs semantically equivalent text and image inputs across four expert domains, isolating modality from task and exposing how far models still are from reasoning the same way regardless of how a problem arrives.

Background

Pre-SEAM, "multimodal" benchmarks usually pit OCR-style images-of-text against text, so any modality gap was confounded with reading-vs-knowing. SEAM uses each domain's native notation on both sides — FEN ↔ chess board, SMILES ↔ molecule diagram, MusicXML ↔ score, edge-list ↔ graph drawing — so the two modalities provably carry the same information. A measured gap is then a model failure, not an information asymmetry.
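As a concrete illustration of how such a pair can be produced, here is a minimal sketch using python-chess to render a FEN string and RDKit to render a SMILES string. Both libraries and the specific position and molecule are illustrative assumptions; SEAM's actual rendering pipeline may differ.

```python
import chess
import chess.svg
from rdkit import Chem
from rdkit.Chem import Draw

# Chess: the same position as text (FEN) and as an image (SVG board).
fen = "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"
board = chess.Board(fen)
board_svg = chess.svg.board(board)   # image-side input; the text side is `fen` itself

# Chemistry: the same molecule as text (SMILES) and as an image (PNG).
smiles = "CC(=O)Oc1ccccc1C(=O)O"     # aspirin
mol = Chem.MolFromSmiles(smiles)
Draw.MolToFile(mol, "aspirin.png")   # image-side input; the text side is `smiles` itself
```

Because both inputs are generated from the same underlying object, any difference in model behavior between them is attributable to the modality, not the content.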

What's in the benchmark

Four expert domains (chess, chemistry, music, graph theory) × four tasks each = 16 tasks; ~200 base items per task gives 3,200 questions, each evaluated under three input conditions (Language-only, Vision-only, Vision+Language) for 9,600 evaluations per model. We also apply visual transformations as robustness checks, so the modality gap can't be hand-waved away as a rendering artifact.
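To make the three conditions concrete, a hypothetical harness might look like the sketch below. The `SeamItem` schema and `model.query` interface are assumptions for illustration, not SEAM's actual API; the agreement number it returns is the cross-modal consistency discussed under "What we found".

```python
from dataclasses import dataclass

@dataclass
class SeamItem:
    domain: str      # "chess" | "chemistry" | "music" | "graph"
    task: str        # one of the domain's four tasks
    text: str        # native notation: FEN, SMILES, MusicXML, edge list
    image_path: str  # rendering of the same underlying object
    answer: str      # gold label

CONDITIONS = ("language", "vision", "vision+language")

def evaluate(model, items):
    """Run every item under all three input conditions; illustrative only."""
    preds = {c: [] for c in CONDITIONS}
    for item in items:
        for cond in CONDITIONS:
            text = item.text if "language" in cond else None
            image = item.image_path if "vision" in cond else None
            preds[cond].append(model.query(text=text, image=image))
    accuracy = {c: sum(p == it.answer for p, it in zip(preds[c], items)) / len(items)
                for c in CONDITIONS}
    # Cross-modal agreement: same answer to the text and the image version
    # of an item, whether or not that answer is correct.
    agreement = sum(a == b for a, b in
                    zip(preds["language"], preds["vision"])) / len(items)
    return accuracy, agreement

# 4 domains x 4 tasks x ~200 items = 3,200 questions;
# 3,200 questions x 3 conditions = 9,600 evaluations per model.
```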

Results

Twenty-one frontier models were evaluated, from GPT-5, Claude 4.x, Qwen2.5-VL, and InternVL3 down to Llama 3.2-Vision and gemma-3-27b. The top of the leaderboard:

Model             Average   Language   Vision+Lang
GPT-5             0.765     0.804      0.857
GPT-5-mini        0.756     —          —
Claude-4.1-Opus   0.740     —          —

Top of the SEAM leaderboard, average accuracy across 16 tasks. Live results: SEAM leaderboard.

What we found

  • Vision systematically lags language across nearly all 21 models, even though information content is identical.
  • Cross-modal agreement is sometimes near-random: the same model often gives different answers to the text and image versions of the same question (the agreement metric in the harness sketch above).
  • Two distinct error sources: tokenization-driven text-side failures (e.g., SMILES fragmented into chemically meaningless tokens; see the sketch below) and visual hallucinations of structures that aren't in the image.
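The text-side failure mode is easy to reproduce. Below is a minimal sketch using OpenAI's tiktoken with the cl100k_base vocabulary (an assumption: the evaluated models use a variety of tokenizers, but the fragmentation pattern is representative).

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin

# Decode each token id individually to see where the string is split.
pieces = [enc.decode([t]) for t in enc.encode(smiles)]
print(pieces)
# Typical output splits mid-ring and mid-group, e.g. ['CC', '(=', 'O', ')', ...]:
# the pieces correspond to no atom, bond, or ring, so the model has to
# reassemble the chemistry from chemically meaningless fragments.
```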

Today's "VLMs" act more like language reasoners with a weaker visual channel attached. SEAM gives that intuition a number.

Why it matters

  • A controlled testbed for modality-agnostic reasoning. Because both modalities provably carry the same information, gains on SEAM can be attributed to reasoning, not better OCR or richer captioning. That's a precondition for the field to make claims about "multimodal reasoning" at all.
  • A diagnostic, not just a leaderboard. The tokenization-side and perception-side error attribution generalizes to new models: anyone training a VLM can use SEAM to localize where their pipeline is leaking accuracy instead of staring at a single average score.
  • An honest read on where frontier VLMs stand. Even GPT-5 only reaches ~0.80 on the language side, and cross-modal consistency is much worse. Calling current systems "multimodal reasoners" overstates what they can do, and SEAM lets the field say so with numbers.