How to Check Your Site for Dirty HTML That Confuses AI Crawlers
Dirty HTML (inline styles, redundant span tags, Microsoft Office markup, and broken element nesting) is common on WordPress sites that have been built and edited over time. AI crawlers that encounter it can build an inaccurate model of the page content, which leads to poor citations or no citations at all. This guide walks through the process of finding it and prioritizing the cleanup.
What to look for when checking for dirty HTML
Dirty HTML on WordPress sites falls into three main categories: Word/Docs paste artifacts (inline styles and Microsoft-specific classes), theme or plugin injection (unnecessary wrapper divs, inline style blocks, JavaScript-generated content), and editor formatting artifacts (nested font tags, redundant span tags from formatting operations). Each requires a slightly different detection and remediation approach.
Step 1: Check for Word paste artifacts
- Open any major content page in your browser
- Right-click anywhere and select View Page Source (Ctrl+U)
- Press Ctrl+F to open the browser’s search function
- Search for MsoNormal. If found, that content was pasted from Word
- Search for font-family within the body content area. Excessive occurrences indicate inline font declarations from paste artifacts
- Search for <span style= in the main content area. More than 3 to 4 occurrences indicates formatting noise
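The three manual searches above can also be run against saved page source in one pass. The following is a minimal Python sketch, not a definitive tool: the function name count_paste_artifacts and the sample markup are made up for illustration, and it scans the whole string rather than only the body content area.

```python
import re

# Patterns mirror the three manual searches in Step 1.
ARTIFACT_PATTERNS = {
    "MsoNormal": re.compile(r"MsoNormal"),
    "font-family": re.compile(r"font-family", re.IGNORECASE),
    "span style": re.compile(r"<span\s+style=", re.IGNORECASE),
}


def count_paste_artifacts(html: str) -> dict:
    """Return how many times each paste-artifact marker appears in html."""
    return {name: len(pattern.findall(html))
            for name, pattern in ARTIFACT_PATTERNS.items()}


# Hypothetical sample: one Word-pasted paragraph plus one extra styled span.
sample = (
    '<p class="MsoNormal"><span style="font-family:Calibri">Hello</span></p>'
    '<p><span style="color:red">World</span></p>'
)
print(count_paste_artifacts(sample))
```

To use it on a real page, paste the View Page Source output into a file and pass its contents to count_paste_artifacts. Theme stylesheets embedded in the page can inflate the font-family count, so treat the numbers as a prompt for manual review.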
Step 2: Check heading hierarchy
- In page source, search for <h1 and count the occurrences. There should be exactly one.
- Search for <h2. There should be multiple (2 to 8 depending on page length)
- Note the sequence: does H1 appear before H2, and H2 before H3? Inverted or skipped levels indicate hierarchy problems
- In the Gutenberg editor, use the Document Overview panel (list icon top left) to see heading structure as a nested list — inverted indentation reveals hierarchy errors instantly
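The heading checks in this step can be sketched in Python using the standard library's html.parser module. This is an illustrative sketch under assumptions: the names HeadingCollector and check_headings are hypothetical, and the rules encoded are only the ones listed above (exactly one H1, H1 first, no skipped levels).

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects heading levels (1 to 6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # Matches h1..h6 only; tags such as <hr> or <html> are ignored.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def check_headings(html):
    """Return a list of human-readable heading hierarchy problems."""
    parser = HeadingCollector()
    parser.feed(html)
    levels = parser.levels
    problems = []
    h1_count = levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")
    if levels and levels[0] != 1:
        problems.append(f"first heading is h{levels[0]}, not h1")
    # A jump of more than one level downward (h2 -> h4) is a skipped level.
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            problems.append(f"h{prev} followed by h{cur} skips a level")
    return problems


print(check_headings("<h1>Title</h1><h2>Part</h2><h4>Detail</h4>"))
```

An empty list means none of the three rules were violated. It does not check the suggested count of H2 elements, which depends on page length.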
Step 3: Check for semantic structure
In page source, search for <main, <article, and <nav. If your theme outputs all three, the basic semantic landmark structure is in place. If primary content is wrapped only in <div> containers with class names like “content-area” or “entry-content”, your theme is probably not outputting semantic landmark elements.
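The same landmark check can be sketched in Python with the standard library's html.parser module. The names LandmarkFinder and missing_landmarks are hypothetical, and the sketch only reports whether each element appears at least once; it does not verify that the elements wrap the right content.

```python
from html.parser import HTMLParser

# The three landmark elements named in Step 3.
LANDMARKS = ("main", "article", "nav")


class LandmarkFinder(HTMLParser):
    """Records which landmark elements appear anywhere in the page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag in LANDMARKS:
            self.found.add(tag)


def missing_landmarks(html):
    """Return the landmark element names that never appear in html."""
    finder = LandmarkFinder()
    finder.feed(html)
    return [tag for tag in LANDMARKS if tag not in finder.found]


# A div-only layout like the one described above reports all three as missing.
print(missing_landmarks('<div class="entry-content"><p>Post body</p></div>'))
```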
Tools that automate HTML auditing
W3C Markup Validation Service (validator.w3.org) checks for HTML specification errors including invalid nesting and missing required attributes. WAVE Accessibility Tool (wave.webaim.org) identifies heading structure problems and semantic HTML issues. Screaming Frog (free up to 500 URLs) can crawl your entire site and report heading counts, metadata, and structural issues across all pages in one run, which helps identify the pages that need the most urgent attention.
Use the free TeachMeOptimization scanner to check your site’s ANI signals before and after implementing the techniques in this guide. The scanner evaluates the optimization disciplines together and gives you a trackable score to monitor improvement over time.
How ANI, AEO, GEO, SEO, and ASI work together here
ANI is the technical foundation that makes every other optimization discipline effective. Every improvement you make to your crawler access, HTML structure, or author attribution directly benefits your AEO citation rates, your GEO topical authority recognition, and your SEO technical health simultaneously. ANI work is not siloed — it compounds across all five disciplines at once.
Related ANI guides
Semantic HTML for AI · Correct heading hierarchy · Checking for dirty HTML
The complete ANI guide library at teachmeoptimization.com/ani covers all 24 topics across five categories — from fundamental concepts to step-by-step implementation and quarterly audit processes.
Common mistakes to avoid
A common HTML audit mistake is checking only the homepage. Dirty HTML, heading hierarchy errors, and semantic structure problems are page-specific — the homepage may be perfectly clean while a high-traffic pillar page has significant issues. Prioritize auditing your 10 highest-traffic pages first, then work through the rest of your content library systematically over subsequent audit cycles.
Quick implementation checklist
- Audit top 10 pages by traffic first — these have the highest citation value
- Use Screaming Frog free tier to crawl all pages and flag heading count issues
- Check View Page Source on each pillar page for MsoNormal and inline styles
- Verify the semantic landmark elements from Step 3 (<main>, <article>, <nav>) in page source on all content pages
- Document findings in a spreadsheet with a remediation priority column
- Schedule one page cleanup session per week until all pillar pages are clean
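The spreadsheet item in the checklist can be sketched as a small CSV export in Python. Everything specific here is an assumption for illustration: the column names, the sample URLs, and the priority thresholds in priority_for are hypothetical and should be adjusted to your own audit.

```python
import csv
import io

# Hypothetical column layout for the findings spreadsheet.
FIELDS = ["url", "issue", "occurrences", "priority"]


def priority_for(occurrences):
    """Map an occurrence count to a priority label (assumed thresholds)."""
    if occurrences >= 10:
        return "high"
    if occurrences >= 4:
        return "medium"
    return "low"


def write_findings(findings, out):
    """Write (url, issue, occurrences) tuples as CSV rows with a priority column."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for url, issue, occurrences in findings:
        writer.writerow({
            "url": url,
            "issue": issue,
            "occurrences": occurrences,
            "priority": priority_for(occurrences),
        })


# Made-up findings; in practice these come from your page-by-page audit.
buffer = io.StringIO()
write_findings(
    [("/pillar-page/", "span style", 12), ("/about/", "MsoNormal", 1)],
    buffer,
)
print(buffer.getvalue())
```

Replace the in-memory buffer with open("findings.csv", "w", newline="") to produce a file that opens directly in a spreadsheet application.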
How this connects to the full ANI system
HTML auditing reveals the specific technical barriers preventing accurate AI content extraction on each page. The audit findings directly translate into a prioritized remediation task list that improves citation quality as each page is cleaned up. For the complete ANI implementation guide covering all 24 topics in sequence, see the full ANI guide at teachmeoptimization.com/ani.
Measuring improvement
After implementing the steps in this guide, revisit your server access logs in 2 to 4 weeks to confirm AI crawler visits. Run your site through the free TeachMeOptimization scanner to check your ANI score before and after. Track your AI citation rate monthly using the manual Perplexity and ChatGPT audit process described in the ANI audit guide — citation rate improvement is the ultimate measure of whether your ANI implementation is working.
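The server log check can be sketched in Python as a simple count of lines whose user agent contains a known AI crawler token. This is a rough sketch under assumptions: the token list below (GPTBot, ClaudeBot, PerplexityBot) is an illustrative subset rather than a complete list, the sample log lines are made up, and real log formats vary by host.

```python
from collections import Counter

# Illustrative subset of AI crawler user-agent tokens; extend as needed.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def count_ai_crawler_hits(log_lines):
    """Count access-log lines per AI crawler token (first match wins)."""
    hits = Counter()
    for line in log_lines:
        for token in AI_CRAWLER_TOKENS:
            if token in line:
                hits[token] += 1
                break
    return hits


# Made-up log lines in a common access-log style.
sample_log = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /guide/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '203.0.113.9 - - [01/Jan/2025:10:05:00 +0000] "GET /guide/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '203.0.113.7 - - [01/Jan/2025:10:09:00 +0000] "GET /about/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0; compatible; ClaudeBot/1.0"',
]
print(count_ai_crawler_hits(sample_log))
```

To run it against a real log, pass an open file object instead of the sample list, since iterating a file yields one line at a time.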
Why this matters for your overall optimization strategy
Every ANI improvement compounds with your AEO and GEO work. When AI crawlers can access your site cleanly, read your HTML correctly, and confidently attribute your content to a named, credentialed author, every piece of content you publish starts from a stronger position. The citation rates you earn from well-optimized AEO pages are higher, the topical authority you build through GEO content architecture is more quickly recognized, and the overall efficiency of your optimization investment improves significantly.
The quarterly ANI maintenance habit
ANI is not a set-and-forget discipline. Security plugin updates can add new bot blocking rules. New AI crawlers emerge that need to be added to your robots.txt allow list. Content editing habits can introduce new HTML artifacts over time. A 30-minute quarterly ANI check — reviewing your robots.txt, checking server logs for crawler visits, running the Rich Results Test on a few key pages, and verifying your author box is displaying correctly — keeps your technical AI accessibility foundation solid as your site grows. The quarterly check is a small time investment that protects the much larger time investment you have made in content creation and optimization.
For the complete ANI audit process covering all three technical layers — crawler access, HTML structure, and attribution — see the full ANI audit guide and the ANI checklist. Together they give you the complete framework for verifying every ANI signal is correctly implemented and maintaining it over time.