Evaluate a vision-language model on archaic Chinese script reading
Train or fine-tune a model on the Seven Chinese Scripts dataset
Compare open-source VLMs against GPT and Gemini on character spotting
Score a model on ancient text parsing using edit distance metrics
Requires downloading a 2,800-image dataset and standing up a vision-language model with enough VRAM to run inference across four tasks.
Chronicles-OCR is a research benchmark: a fixed collection of test images and scoring rules used to compare how well different AI models can read Chinese writing. What makes it distinctive is that it covers the full historical span of Chinese characters, from the earliest carvings on bones and shells more than three thousand years ago up to brush-and-paper calligraphy in styles still used today.
The dataset contains 2,800 images, 400 for each of seven script styles known together as the Seven Chinese Scripts. The README walks through each one: Oracle Bone Script, carved on tortoise shells in the Shang dynasty; Bronze Script, cast on ceremonial vessels; Seal Script, standardised after the Qin unification; Clerical Script, which flattened characters and marks the boundary between ancient and modern forms; Regular Script, the formal style still in use; and Cursive Script and Running Script, which developed for faster informal writing. The first five were each the official script of their era, while the last two are auxiliary styles. The collection was put together with the Key Laboratory of Oracle Bone Inscription Information Processing at Anyang Normal University and with the Palace Museum.
The benchmark defines four evaluation tasks. Character Spotting asks a model to locate each character in an image of an archaic script. Fine-grained Archaic Character Recognition asks the model to name each individual character in the three oldest scripts. Ancient Text Parsing covers all seven scripts and is scored by the string edit distance between the model's output and the correct transcription. Script Classification asks the model which of the seven styles a given image belongs to. Each task has its own metric, ranging from F1 with an intersection-over-union threshold to plain accuracy.
Most of the README is taken up by a large leaderboard that compares many vision-language models on these tasks. Open-source models listed include several sizes of InternVL, Qwen2.5-VL, Qwen3-VL, Qwen3.5, Gemma 4, MiniCPM-V, Molmo, Ovis2.6, GLM-4.5V, and Kimi K2.5. Proprietary models include GPT-4o, GPT-5, several Seed releases, MiMo-V2-Omni, and Gemini. Scores are consistently low on the archaic-script tasks, which shows how hard the benchmark is even for the strongest models. Links to a paper on arXiv and to the dataset on HuggingFace are provided at the top of the README.
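Two of the metrics named above, edit distance for Ancient Text Parsing and F1 with an intersection-over-union threshold for Character Spotting, can be sketched in a few lines of Python. This is an illustrative sketch only and not the benchmark's own scoring code: the function names, the length normalisation of the edit distance, the (x1, y1, x2, y2) box format, the greedy matching strategy, and the 0.5 IoU threshold are all assumptions made here. The official scripts may normalise differently or may also require the predicted character label to match.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, top-left and bottom-right.
Box = Tuple[float, float, float, float]


def edit_distance(pred: str, ref: str) -> int:
    """Levenshtein distance computed with a rolling one-row DP table."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a predicted character
                curr[j - 1] + 1,          # insert a reference character
                prev[j - 1] + (p != r),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer length (assumed normalisation)."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def spotting_f1(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
    """F1 over boxes, greedily matching each prediction to one ground truth."""
    unmatched = list(gts)
    true_pos = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            true_pos += 1
            unmatched.remove(best)  # each ground-truth box is used only once
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(preds)
    recall = true_pos / len(gts)
    return 2 * precision * recall / (precision + recall)
```

For example, `edit_distance("天地玄黄", "天地玄黃")` is 1 because only the last character differs, so the normalised score under this sketch is 0.25. A lower edit distance is better, while a higher F1 is better.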
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.