Why AI Can't Reliably Detect Construction Drawing Changes
AI reads text on drawings at 95% accuracy but understands spatial content at 40-55%. Benchmark data shows why, and what actually works for change detection.

TL;DR
- AI reads text on drawings well but can't understand what changed between revisions. The AECV-bench benchmark shows frontier models hit 95% accuracy on text extraction but only 40-55% on counting architectural symbols like doors and windows.
- "Upload two drawings and ask what changed" doesn't work. AI models hallucinate changes, miss spatial relationships, and can't process construction drawings at full resolution.
- Pixel comparison finds everything but understands nothing. It flags title block updates the same as a relocated structural column.
- What works: isolate first, then analyze. Computer vision identifies specific change boundaries at full resolution, then AI analyzes each isolated area with focused context.
The Misconception
Every few weeks, someone posts a demo of ChatGPT or Claude analyzing a floor plan. The model reads room labels, identifies spaces, maybe counts some doors. The reaction is always the same: "If AI can read a drawing, it can compare two revisions and tell me what changed."
We thought the same thing, and spent months learning why it doesn't work.
This post is less about whether AI is useful on construction drawings (it genuinely is) and more about why the jump from "reading a drawing" to "detecting what changed between revisions" is so much harder than it looks. It's the difference between reading a paragraph and playing spot-the-difference across two 200-million-pixel technical documents.
What the Benchmarks Actually Show
The models are genuinely good at text, and I don't want to undersell that. But the AECV-bench benchmark, a peer-reviewed study testing frontier models on real AEC drawings, makes the gap between text comprehension and spatial understanding obvious.
Text extraction hits up to 95% accuracy on room labels, dimension callouts, and notes. Counting doors on a floor plan? 40-55% accuracy, a completely different category of problem. The best model tested (Gemini 3 Pro) still had a 15% error rate on doors and 20% on windows, GPT-5.2 hit 20% and 25%, and Claude Opus 4.5 reached 31% and 37%.
| Task | Best Model Accuracy | Worst Model Accuracy |
|---|---|---|
| Text extraction (OCR) | ~95% | ~80% |
| Counting doors | ~85% | ~63% |
| Counting windows | ~80% | ~63% |
| Spatial reasoning | Moderate | Poor |
The study concludes:
"Current systems function well as document assistants but lack robust drawing literacy, motivating domain-specific representations and tool-augmented, human-in-the-loop workflows."
AECV-Bench, Benchmarking Multimodal Models on Architectural and Engineering Drawings
AI can read the words on a drawing, but it can't reliably understand the drawing itself, and change detection needs exactly the kind of spatial understanding that models are weakest at.
Why "Just Ask the AI" Fails for Change Detection
We've tried all of these, and so have other teams we've talked to.
Approach 1: Show Both Drawings and Ask What Changed
Upload revision A and revision B, then ask the model to list the differences. The problem is resolution. Construction drawings at 300 DPI are 7,200 x 10,800 pixels per sheet (77.8 million pixels), and most AI models downsample images aggressively before processing. A dimension change from 12'-6" to 12'-4" that's clearly visible at full resolution just disappears after downsampling. The model literally can't see the change.
Even when changes are visible, spatial precision falls apart. The model might say "something changed near grid line C-4" but place a bounding box 200 pixels off. Much like how language models process text as tokens rather than individual characters, vision models process images as visual tokens, not coordinate systems. They weren't built for pixel-level accuracy on dense technical drawings.
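A toy sketch makes the downsampling problem concrete. Assuming a model resizes inputs by roughly 5x via block averaging (the exact resizing strategy varies by model), a 2-pixel-wide dimension line that is pure black at full resolution becomes a faint grey after averaging:

```python
import numpy as np

def block_downsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Downsample by averaging non-overlapping factor x factor blocks,
    roughly what happens when a model resizes a large drawing."""
    h, w = img.shape
    h2, w2 = h - h % factor, w - w % factor
    return img[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor).mean(axis=(1, 3))

# White page with a single 2-pixel-wide black dimension line.
page = np.full((100, 100), 255.0)
page[:, 50:52] = 0.0  # the detail that matters

small = block_downsample(page, 5)  # ~5x reduction, like 10,800 px -> ~2,000 px

print(page.min())   # 0.0   -- full contrast at native resolution
print(small.min())  # 153.0 -- 2 black columns averaged with 3 white ones
```

At full resolution the line has maximum contrast; after a 5x reduction its darkest pixel is a light grey that thresholding or attention can easily miss.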
Approach 2: Show the Overlay and Describe Changes
Generate a pixel overlay (old drawing in one color, new in another) and ask the model to interpret the highlights.
Better, but still flawed. Pixel overlays are noisy because updated dates, shifted title blocks, and minor print variations all show up with the same visual weight as a structural change, and the model can't distinguish meaningful scope changes from cosmetic noise. Overlays also can't capture what's missing: if a note was deleted, the overlay shows a colored smudge, and the model sees "something was here" but can't tell you what got removed or whether it matters.
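For readers unfamiliar with comparison overlays, here's a minimal sketch of how one is typically built (a simplified stand-in, not our production code): old linework tinted one color, new linework another.

```python
import numpy as np

def make_overlay(old: np.ndarray, new: np.ndarray) -> np.ndarray:
    """Classic comparison overlay: old linework in red, new in blue.
    Inputs are greyscale pages (0 = ink, 255 = paper); output is RGB."""
    h, w = old.shape
    overlay = np.full((h, w, 3), 255, dtype=np.uint8)
    old_ink = old < 128
    new_ink = new < 128
    overlay[old_ink] = [255, 0, 0]              # removed/old content -> red
    overlay[new_ink] = [0, 0, 255]              # added/new content -> blue
    overlay[old_ink & new_ink] = [64, 64, 64]   # unchanged linework -> grey
    return overlay

old = np.full((10, 10), 255, dtype=np.uint8); old[2, :] = 0   # a note that was deleted
new = np.full((10, 10), 255, dtype=np.uint8); new[7, :] = 0   # a line that was added
ov = make_overlay(old, new)
# Row 2 renders red, row 7 renders blue. But the red smudge alone can't
# say *what* the deleted note said, which is the limitation described above.
```

Note that the overlay treats every differing pixel identically: a shifted title block and a relocated column produce the same red-and-blue ink.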
Approach 3: Extract Text, Compare Programmatically
Use OCR to extract all text from both revisions, then diff the text programmatically.
This catches note changes and dimension callouts but misses everything graphical: a relocated wall, an added beam, a shifted duct run, none of which produce a text difference. On most construction drawings, the meaningful changes are graphical, not textual.
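The text-diff approach is easy to sketch with the standard library. Assuming the OCR step yields one string per callout (the example strings are invented for illustration):

```python
import difflib

# Text extracted (e.g. via OCR) from the same sheet in two revisions.
rev_a = ["RM 204 OFFICE", "12'-6\"", "GYP. BD. PARTITION, SEE A5.01"]
rev_b = ["RM 204 OFFICE", "12'-4\"", "GYP. BD. PARTITION, SEE A5.01"]

diff = list(difflib.unified_diff(rev_a, rev_b, lineterm=""))
changed = [line for line in diff
           if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]
print(changed)  # ["-12'-6\"", "+12'-4\""] -- the dimension edit is caught...
# ...but a wall relocated without its callout changing produces no diff at all.
```

The dimension change surfaces cleanly; a purely graphical change never enters the pipeline in the first place.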
The Core Problem
Drawing comparison requires three things current models can't do at the same time, and this is where the "just use AI" argument falls apart.
For one, there's resolution. Construction drawings carry meaning at the pixel level, where a line weight difference between 0.25mm and 0.35mm distinguishes a partition wall from a structural wall. Vision models process images at roughly 1,000-2,000 pixels on a side, but construction drawings need 7,000-10,000 pixels to preserve legible detail.
Then there's spatial precision. Change detection requires knowing exactly where something changed, precise enough that a project engineer can act on it. AI models are built for language generation, not coordinate-level spatial reasoning, and the AECV-bench results confirm this (even counting objects is unreliable).
Finally, there's full-sheet context. A dimension change on grid line B-3 only makes sense in context: did a load-bearing wall move, or did a furniture layout shift? You need the entire sheet at full resolution to answer that, and current vision pipelines can't do it.
What Actually Works
The answer isn't a bigger model or a better prompt. It's decomposing the problem so each tool handles the part it's good at.
Step 1: Find the Changes with Computer Vision
Classical computer vision (pixel alignment, image differencing, contour detection) is excellent at finding where something changed. It doesn't hallucinate, it doesn't miss subtle differences, and it works at full resolution.
We automatically align drawings using feature matching and constrained optimization, then generate precise difference maps. This gives us candidate change regions with exact pixel boundaries, not vague descriptions.
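The differencing step can be sketched in a few dozen lines. This is a deliberately simplified stand-in, not our production pipeline: it assumes the two sheets are already aligned (the real system does feature matching and constrained optimization first) and skips noise filtering. It thresholds the pixel difference, then groups changed pixels into connected regions with exact bounding boxes:

```python
from collections import deque
import numpy as np

def change_regions(old: np.ndarray, new: np.ndarray, thresh: int = 50):
    """Difference two aligned greyscale sheets and return bounding boxes
    (x0, y0, x1, y1) of connected changed regions."""
    mask = np.abs(old.astype(int) - new.astype(int)) > thresh
    seen = np.zeros_like(mask, dtype=bool)
    boxes = []
    h, w = mask.shape
    for y, x in zip(*np.nonzero(mask)):
        if seen[y, x]:
            continue
        # BFS to collect one connected component of changed pixels.
        q = deque([(y, x)]); seen[y, x] = True
        ys, xs = [y], [x]
        while q:
            cy, cx = q.popleft()
            for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    q.append((ny, nx)); ys.append(ny); xs.append(nx)
        boxes.append((int(min(xs)), int(min(ys)), int(max(xs)) + 1, int(max(ys)) + 1))
    return boxes

old = np.full((40, 40), 255, dtype=np.uint8); old[5:10, 5:10] = 0    # column at grid A
new = np.full((40, 40), 255, dtype=np.uint8); new[25:30, 25:30] = 0  # column relocated
print(change_regions(old, new))  # [(5, 5, 10, 10), (25, 25, 30, 30)]
```

The relocated element shows up as two precise regions (where it was, and where it is now), and nothing here hallucinates: every box corresponds to actual pixel differences.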
Step 2: Analyze Each Change with AI
Once we've isolated change regions, we hand each one to an AI model with the right context: a cropped view from both revisions, plus surrounding drawing area.
This flips the problem. Instead of asking "what changed on this 78-million-pixel drawing?" we're asking "what does this specific 500x500 pixel change represent?", which is a question models can actually answer well. The model classifies each change (dimensional modification, added element, removed element, note change) and describes it in plain language a PE can act on.
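The cropping step itself is trivial, which is rather the point: once the boundaries exist, producing a focused question for the model is a few lines. A minimal sketch (the helper name and the 100-pixel padding are illustrative, not our actual parameters):

```python
import numpy as np

def crop_with_context(sheet: np.ndarray, box, pad: int = 100) -> np.ndarray:
    """Crop a change region plus surrounding context, clipped to the sheet.
    `box` is (x0, y0, x1, y1) from the detection step."""
    x0, y0, x1, y1 = box
    h, w = sheet.shape[:2]
    return sheet[max(0, y0 - pad):min(h, y1 + pad),
                 max(0, x0 - pad):min(w, x1 + pad)]

sheet = np.zeros((10800, 7200), dtype=np.uint8)  # a full 300 DPI sheet
crop = crop_with_context(sheet, (3000, 4000, 3500, 4500), pad=100)
print(crop.shape)  # (700, 700): a focused question instead of 78M pixels
```

In practice the same crop is taken from both revisions so the model sees before and after side by side, with enough surrounding drawing to anchor the change spatially.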
Why This Works
Computer vision finds every change with exact boundaries, and AI interprets what those changes mean, which is not to say either is perfect alone, but rather that combining them covers each other's weaknesses.
Pixel comparison produces noise without meaning, and AI by itself misses changes and makes things up. But together they give us results teams actually trust.
Key Takeaways
- AI reads text on drawings at 95% accuracy but understands drawing content at 40-55%. The AECV-bench benchmark confirms a wide gap between document understanding and drawing literacy.
- Construction drawings at 300 DPI exceed what any current vision model can process at full resolution. Downsampling destroys the detail that matters.
- "Upload and compare" approaches fail because they combine the hardest parts of the problem: high resolution, spatial precision, and full-sheet context.
- Decomposition is the key. Use computer vision to find change boundaries at full resolution, then AI to analyze each isolated area with focused context.
- Give AI a small, well-defined problem instead of a 78-million-pixel haystack. It performs well for drawing analysis when you scope it right.
FAQ
Will future AI models solve this?
Maybe, but not soon. The bottleneck isn't model intelligence, it's input resolution, and even a perfect model would need to see all 78 million pixels at once. Still, context windows are growing fast, and I suspect decomposition stays necessary for now, though I'd love to be wrong about the timeline.
What about fine-tuning models on construction drawings?
Fine-tuning helps with domain vocabulary but doesn't fix the resolution problem. A fine-tuned model might know "GYP. BD." means gypsum board, but it still can't see a 2-pixel line weight difference after 5x downsampling. The AECV-bench researchers noted that current training data has "relatively little exposure to vector-like artefacts such as CAD plans and engineering diagrams." As this gap in training data gets filled (and it will), I expect static drawing analysis to improve, but the resolution constraint is architectural, not a data problem.
How is this different from Bluebeam?
Bluebeam's overlay still requires a human to review every sheet and interpret every highlight. We automate both: finding the changes and describing what they mean. For a deeper comparison, see Bluebeam vs Bedrock.
Can I use ChatGPT or Claude to analyze individual changes?
Yes, and we'd actually encourage it for certain workflows. For individual, cropped change areas with clear context, general-purpose models give useful descriptions. The hard part is getting there: finding where the changes are, cropping them precisely, providing the right context. That's what we automate.
We built Bedrock to do exactly this: find changes with computer vision, then interpret them with AI. Try it free with 50 comparisons.

Related articles
Insights: The Last Mile Problem: Why AI Agents Can't Do Construction Work Yet
AI agents are automating every industry end-to-end. Construction is stuck because LLMs cannot read the documents that matter most: drawings. That is about to change.
Stan Liu · 8 min read

Insights: Bluebeam vs Bedrock for Drawing Comparison
Bluebeam is the industry standard for PDF markup. Bedrock is purpose-built for drawing comparison. Here's how to choose the right tool for your workflow.
Stan Liu · 6 min read

Engineering: How We Compare 1000-Page Drawing Sets Without Running Out of Memory
Construction drawings at 300 DPI can exceed 200 million pixels per sheet. Here's how we built a system that handles 1000+ page comparisons without crashing.
Stan Liu · 9 min read