The same question, drawn as an image or written as text, carries identical information, yet vision-language models often give two different answers. SEAM turns that inconsistency into a controlled measurement.
Launch Highlights
- Problem: OCR-style tests that screenshot text into images cannot tell whether a model fails to see or fails to reason.
- Method: SEAM uses FEN/boards, SMILES/molecules, ABC/sheet music, and graphs/matrices to preserve semantics.
- Finding: Vision usually trails language, and answer agreement across modalities remains far from ideal.
- Why it matters: Researchers can separate perception failures from cross-modal reasoning failures.
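To make the "answer agreement across modalities" idea concrete, here is a minimal Python sketch of how paired results could be scored. It is an illustration under stated assumptions, not SEAM's actual evaluation code: the `PairedResult` record, the `summarize` function, and exact string matching of answers are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PairedResult:
    """One question asked twice: once as text (e.g. FEN) and once as an image (e.g. a board).

    Hypothetical record type for illustration; not taken from the SEAM codebase.
    """
    gold: str          # reference answer
    text_answer: str   # model answer given the text representation
    image_answer: str  # model answer given the image representation


def summarize(results: List[PairedResult]) -> Dict[str, float]:
    """Return per-modality accuracy and the cross-modal agreement rate.

    Assumes answers are compared by exact string equality.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one paired result")
    text_acc = sum(r.text_answer == r.gold for r in results) / n
    image_acc = sum(r.image_answer == r.gold for r in results) / n
    # Agreement ignores the gold label: two identical wrong answers still agree.
    agreement = sum(r.text_answer == r.image_answer for r in results) / n
    return {
        "text_accuracy": text_acc,
        "image_accuracy": image_acc,
        "agreement": agreement,
    }
```

Because agreement is computed without the gold answer, it is a separate signal from accuracy: a model can be consistently wrong in both modalities, or right in one and wrong in the other.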
Continue reading
The research page covers the background, method, key figures, and paper links; for quick sharing, use the illustrated promo copy.