How to Check Your Site for Dirty HTML That Confuses AI Crawlers

Dirty HTML (inline styles, redundant span tags, Microsoft Office markup, and broken element nesting) is common on WordPress sites that have been built and edited over time. AI crawlers that encounter it can build an inaccurate model of the page content, which leads to poor citations or no citations at all. This guide walks through the complete process of finding and cleaning it.

What to look for when checking for dirty HTML

Dirty HTML on WordPress sites falls into three main categories: Word/Docs paste artifacts (inline styles and Microsoft-specific classes), theme or plugin injection (unnecessary wrapper divs, inline style blocks, JavaScript-generated content), and editor formatting artifacts (nested font tags, redundant span tags from formatting operations). Each requires a slightly different detection and remediation approach.

Step 1: Check for Word paste artifacts

  1. Open any major content page in your browser
  2. Right-click anywhere and select View Page Source (Ctrl+U)
  3. Press Ctrl+F to open the browser’s search function
  4. Search for MsoNormal — if found, that content was pasted from Word
  5. Search for font-family within the body content area — excessive occurrences indicate inline font declarations from paste artifacts
  6. Search for <span style= (more than three or four occurrences in the main content area indicate formatting noise)
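
The manual searches above can also be run as a script once you have saved a page's source. The following is a minimal Python sketch, not a definitive tool. The function names and the threshold of four styled spans are assumptions for illustration, taken from the rule of thumb in this step. It scans whatever string it is given, so pass only the main content area if you want to exclude theme stylesheets from the font-family count.

```python
import re

# Markers from the manual steps. The "span style" pattern mirrors the literal
# search for "<span style=", so it only matches when style is the first attribute.
PASTE_MARKERS = {
    "MsoNormal": re.compile(r"MsoNormal"),
    "font-family": re.compile(r"font-family", re.IGNORECASE),
    "span style": re.compile(r"<span\s+style=", re.IGNORECASE),
}


def count_paste_artifacts(html: str) -> dict:
    """Count how often each paste-artifact marker appears in an HTML string."""
    return {name: len(pattern.findall(html)) for name, pattern in PASTE_MARKERS.items()}


def looks_pasted_from_word(counts: dict, span_threshold: int = 4) -> bool:
    """Flag a page when MsoNormal is present or styled spans exceed the threshold.

    The threshold is an assumption based on the "three or four occurrences" guideline.
    """
    return counts["MsoNormal"] > 0 or counts["span style"] > span_threshold
```

To use it, read a saved page source into a string, call count_paste_artifacts on it, and pass the result to looks_pasted_from_word.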

Step 2: Check heading hierarchy

  1. In page source, search for <h1 — count the occurrences. There should be exactly one.
  2. Search for <h2 — there should be multiple (2 to 8 depending on page length)
  3. Note the sequence: does H1 appear before H2, and H2 before H3? Inverted or skipped levels indicate hierarchy problems
  4. In the Gutenberg editor, use the Document Overview panel (list icon top left) to see heading structure as a nested list — inverted indentation reveals hierarchy errors instantly
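
The heading checks in this step can be sketched in Python with the standard library's html.parser module. This is an illustrative sketch under stated assumptions: the class and function names are hypothetical, and the rules it applies (exactly one H1, H1 first, no skipped levels) are the ones described above, nothing more.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Record heading levels (1 to 6) in the order they appear in the document."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "h1" through "h6" is all we need to match.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def audit_headings(html: str) -> list:
    """Return a list of human-readable heading hierarchy problems (empty if none)."""
    collector = HeadingCollector()
    collector.feed(html)
    levels = collector.levels
    problems = []

    h1_count = levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")
    if levels and levels[0] != 1:
        problems.append(f"first heading is h{levels[0]}, not h1")
    # A jump of more than one level downward (for example h2 to h4) skips a level.
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            problems.append(f"h{previous} followed by h{current} skips a level")
    return problems
```

An empty list means the page passed all three checks. Moving back up the hierarchy (for example from H3 to a new H2 section) is treated as normal.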

Step 3: Check for semantic structure

In page source, search for <main, <article, and <nav. If your theme outputs all three, the basic landmark structure is in place. If primary content is wrapped only in <div> containers with class names like “content-area” or “entry-content”, your theme is probably not outputting semantic landmark elements.
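
The same landmark check can be scripted. The sketch below is a hedged Python example using the standard library parser. The names LANDMARKS, LandmarkCounter, and missing_landmarks are hypothetical, and it only tests whether each element is present, not whether it is used correctly.

```python
from html.parser import HTMLParser

# The three landmark elements this step searches for.
LANDMARKS = ("main", "article", "nav")


class LandmarkCounter(HTMLParser):
    """Count opening tags for each landmark element."""

    def __init__(self):
        super().__init__()
        self.counts = {name: 0 for name in LANDMARKS}

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1


def missing_landmarks(html: str) -> list:
    """Return the landmark elements that never appear in the given HTML."""
    counter = LandmarkCounter()
    counter.feed(html)
    return [name for name in LANDMARKS if counter.counts[name] == 0]
```

A page whose content sits only inside div wrappers returns all three names, which matches the warning sign described above.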

Tools that automate HTML auditing

W3C Markup Validation Service (validator.w3.org) checks for HTML specification errors, including invalid nesting and missing required attributes. WAVE (wave.webaim.org), an accessibility evaluation tool, identifies heading structure problems and semantic HTML issues. Screaming Frog (free for up to 500 URLs) can crawl your entire site and report heading counts, metadata, and structural issues across all pages in one pass, which helps identify the pages that need attention most urgently.

Implementation tip

Use the free TeachMeOptimization scanner to check your site’s ANI signals before and after implementing the techniques in this guide. The scanner evaluates all six optimization disciplines simultaneously and gives you a trackable score to monitor improvement over time.

How ANI, AEO, GEO, SEO, and ASI work together here

ANI is the technical foundation that makes every other optimization discipline effective. Every improvement you make to your crawler access, HTML structure, or author attribution directly benefits your AEO citation rates, your GEO topical authority recognition, and your SEO technical health simultaneously. ANI work is not siloed — it compounds across all five disciplines at once.

Related ANI guides

Semantic HTML for AI · Correct heading hierarchy · Checking for dirty HTML

The complete ANI guide library at teachmeoptimization.com/ani covers all 24 topics across five categories — from fundamental concepts to step-by-step implementation and quarterly audit processes.

Common mistakes to avoid

A common HTML audit mistake is checking only the homepage. Dirty HTML, heading hierarchy errors, and semantic structure problems are page-specific — the homepage may be perfectly clean while a high-traffic pillar page has significant issues. Prioritize auditing your 10 highest-traffic pages first, then work through the rest of your content library systematically over subsequent audit cycles.

Quick implementation checklist

  • Audit top 10 pages by traffic first — these have the highest citation value
  • Use Screaming Frog free tier to crawl all pages and flag heading count issues
  • Check View Page Source on each pillar page for MsoNormal and inline styles
  • Verify semantic landmark elements (for example <main> and <article>) in page source on all content pages
  • Document findings in a spreadsheet with a remediation priority column
  • Schedule one page cleanup session per week until all pillar pages are clean
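
For the spreadsheet item in the checklist, one option is to generate the starting rows automatically. The sketch below is an assumption-laden Python example: the column names and the high/medium/low priority rule are hypothetical choices for illustration, not part of any tool mentioned in this guide. It takes a mapping of URL to page source and returns CSV text you can open in any spreadsheet application.

```python
import csv
import io
import re

FIELDS = ["url", "h1_count", "mso_normal", "inline_spans", "priority"]


def audit_row(url: str, html: str) -> dict:
    """Build one spreadsheet row of findings for a single page."""
    h1_count = len(re.findall(r"<h1[\s>]", html, re.IGNORECASE))
    mso_normal = len(re.findall(r"MsoNormal", html))
    inline_spans = len(re.findall(r"<span\s+style=", html, re.IGNORECASE))

    # Hypothetical priority rule: Word artifacts or a wrong H1 count come first,
    # heavy inline span formatting second, everything else last.
    if mso_normal > 0 or h1_count != 1:
        priority = "high"
    elif inline_spans > 4:
        priority = "medium"
    else:
        priority = "low"

    return {
        "url": url,
        "h1_count": h1_count,
        "mso_normal": mso_normal,
        "inline_spans": inline_spans,
        "priority": priority,
    }


def audit_csv_text(pages: dict) -> str:
    """Return CSV text with one row per page; pages maps URL to HTML source."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for url, html in pages.items():
        writer.writerow(audit_row(url, html))
    return buffer.getvalue()
```

Start with the page source of your 10 highest-traffic pages, write the returned text to a .csv file, and sort by the priority column to plan the weekly cleanup sessions.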

How this connects to the full ANI system

HTML auditing reveals the specific technical barriers preventing accurate AI content extraction on each page. The audit findings directly translate into a prioritized remediation task list that improves citation quality as each page is cleaned up. For the complete ANI implementation guide covering all 24 topics in sequence, see the full ANI guide at teachmeoptimization.com/ani.

Measuring improvement

After implementing the steps in this guide, revisit your server access logs in 2 to 4 weeks to confirm AI crawler visits. Run your site through the free TeachMeOptimization scanner to check your ANI score before and after. Track your AI citation rate monthly using the manual Perplexity and ChatGPT audit process described in the ANI audit guide — citation rate improvement is the ultimate measure of whether your ANI implementation is working.
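
Checking access logs for AI crawler visits can be sketched as a simple substring count. This Python example is a rough sketch under assumptions: the user-agent tokens listed are a few well-known AI crawler names and are not taken from this guide, so adjust the tuple to the crawlers you care about. It works on any log format that includes the user-agent string on each line.

```python
from collections import Counter

# Assumed list of AI crawler user-agent tokens; extend or trim as needed.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


def count_ai_crawler_hits(log_lines) -> Counter:
    """Count log lines whose text contains a known AI crawler token."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each log line at most once
    return hits
```

Pass it an open log file (iterating a file yields lines) and compare the counts from before and after the cleanup window.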

Why this matters for your overall optimization strategy

Every ANI improvement compounds with your AEO and GEO work. When AI crawlers can access your site cleanly, read your HTML correctly, and confidently attribute your content to a named, credentialed author, every piece of content you publish starts from a stronger position. The citation rates you earn from well-optimized AEO pages are higher, the topical authority you build through GEO content architecture is more quickly recognized, and the overall efficiency of your optimization investment improves significantly.

The quarterly ANI maintenance habit

ANI is not a set-and-forget discipline. Security plugin updates can add new bot blocking rules. New AI crawlers emerge that need to be added to your robots.txt allow list. Content editing habits can introduce new HTML artifacts over time. A 30-minute quarterly ANI check — reviewing your robots.txt, checking server logs for crawler visits, running the Rich Results Test on a few key pages, and verifying your author box is displaying correctly — keeps your technical AI accessibility foundation solid as your site grows. The quarterly check is a small time investment that protects the much larger time investment you have made in content creation and optimization.
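
The robots.txt part of the quarterly check can be sketched with Python's standard urllib.robotparser module. This is a minimal example under assumptions: the function name blocked_crawlers is hypothetical, and the crawler names you pass in are your own choice. It reports which user agents are not allowed to fetch a given path according to the rules in the robots.txt text.

```python
from urllib.robotparser import RobotFileParser


def blocked_crawlers(robots_txt: str, user_agents, url: str = "/") -> list:
    """Return the user agents that robots.txt does not allow to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in user_agents if not parser.can_fetch(agent, url)]
```

Run it against the current contents of your robots.txt after each security plugin update to catch newly added blocking rules.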

For the complete ANI audit process covering all three technical layers — crawler access, HTML structure, and attribution — see the full ANI audit guide and the ANI checklist. Together they give you the complete framework for verifying every ANI signal is correctly implemented and maintaining it over time.
