Abstract
Grid-structured visual data such as forms, tables, and game boards require models that pair pixel-level perception with symbolic consistency under global constraints. Recent Pixel Language Models (PLMs) map images to token sequences with promising flexibility, yet we find that they generalize poorly when observable evidence becomes sparse or corrupted. We present GridMNIST-Sudoku, a benchmark that renders large numbers of Sudoku instances with style-diverse handwritten digits and provides parameterized stress tracks for two tasks: Completion (predict missing cells) and Correction (detect and repair incorrect cells), across difficulty levels ranging from 1 to 90 altered positions in a 9 × 9 grid. Attention diagnostics on PLMs trained with conventional one-dimensional positional encodings reveal weak structure awareness outside the natural Sudoku sparsity band. Motivated by these findings, we propose a lightweight Row-Column-Box (RCB) positional prior that injects grid-aligned coordinates, and we combine it with simple sparsity and corruption augmentations. Trained only on the natural distribution, the resulting model substantially improves out-of-distribution accuracy across wide sparsity and corruption ranges while maintaining strong in-distribution performance.
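As a rough illustration of what "grid-aligned coordinates" could mean for a 9 × 9 Sudoku, the sketch below maps a flattened cell index to a (row, column, box) triple. The abstract does not specify how the RCB prior is computed or fed to the model, so the function name `rcb_coordinates`, the row-major flattening, the 0-based indices, and the row-major box numbering are all assumptions made for illustration only.

```python
def rcb_coordinates(cell_index: int, size: int = 9, box: int = 3) -> tuple[int, int, int]:
    """Map a flattened cell index to (row, column, box) indices.

    Hypothetical helper, not the paper's implementation. Assumes the grid
    is flattened in row-major order, all indices are 0-based, and the
    box x box sub-grids are numbered left to right, top to bottom.
    """
    if not 0 <= cell_index < size * size:
        raise ValueError(f"cell_index must be in [0, {size * size})")
    row, col = divmod(cell_index, size)
    boxes_per_row = size // box
    box_id = (row // box) * boxes_per_row + (col // box)
    return row, col, box_id


# One coordinate triple per cell of the 9 x 9 grid.
coords = [rcb_coordinates(i) for i in range(81)]
```

Under these assumptions, each triple could index three separate embedding tables whose outputs are added to a cell's token representation, in place of (or alongside) a single one-dimensional position index. That wiring is likewise a guess at one plausible design, not a description of the published model.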
| Original language | English |
|---|---|
| Article number | 2851 |
| Number of pages | 14 |
| Journal | Mathematics |
| Volume | 13 |
| Issue number | 17 |
| DOIs | |
| Publication status | Published - 4 Sep 2025 |
Fingerprint
Explore the research topics of 'A Benchmark for Symbolic Reasoning from Pixel Sequences: Grid-Level Visual Completion and Correction'. Together they form a unique fingerprint.