XML Formatter & Validator

XML is still everywhere — SOAP envelopes, Office Open XML (DOCX/XLSX/PPTX internals), Android layouts, Maven and Ant build files, RSS and Atom feeds, OPDS catalogs, KML and GPX route data, SVG markup. Making sense of an unindented blob, or finding the missing close tag in a 200-line document, is exactly what this tool is for.

How parsing works

The input is tokenised left-to-right into element tags (open, close, self-closing), text content, comments, CDATA blocks, processing instructions, and the doctype declaration. A second pass walks the token stream to check well-formedness:

  • Every open tag has a matching close tag (LIFO order — <a><b></b></a> is fine, <a><b></a></b> is not).
  • Exactly one root element.
  • No text content outside the root (whitespace and comments are allowed).

When validation fails, the tool reports the line and column of the offending token. The same tokeniser drives Format and Minify, so any structural error caught in Validate will also surface when formatting.
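
The two passes can be sketched in a few lines of Python. This is an illustration of the idea, not the tool's own tokeniser: the names TOKEN, NAME and check_well_formed are hypothetical, the regex assumes no > inside attribute values or the doctype, and attribute syntax is not checked.

```python
import re
from typing import Optional

# Hypothetical token pattern: comment, CDATA, PI, doctype, any other tag, text.
TOKEN = re.compile(
    r"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>|<!DOCTYPE[^>]*>|<[^>]*>|[^<]+",
    re.DOTALL,
)
NAME = re.compile(r"[^\s/>]*")


def check_well_formed(xml: str) -> Optional[str]:
    """Return None when the three rules hold, otherwise an error message."""
    stack = []          # (name, line) of every element that is still open
    root_seen = False
    for match in TOKEN.finditer(xml):
        tok, start = match.group(), match.start()
        line = xml.count("\n", 0, start) + 1
        col = start - xml.rfind("\n", 0, start)   # 1-based column
        where = f"line {line}, column {col}"
        if tok.startswith("</"):
            name = NAME.match(tok, 2).group()
            if not stack:
                return f"Unexpected </{name}> at {where}"
            open_name, open_line = stack.pop()    # LIFO: last opened, first closed
            if name != open_name:
                return (f"Mismatched tag: expected </{open_name}> (opened at "
                        f"line {open_line}) but got </{name}> at {where}")
        elif tok.startswith(("<!--", "<?", "<!DOCTYPE")):
            continue    # allowed anywhere, no effect on nesting
        elif tok.startswith("<") and not tok.startswith("<!["):
            if root_seen and not stack:
                return f"Multiple root elements at {where}"
            root_seen = True
            if not tok.endswith("/>"):            # self-closing tags never open a scope
                stack.append((NAME.match(tok, 1).group(), line))
        elif tok.strip() and not stack:           # text or CDATA outside the root
            return f"Text outside the root element at {where}"
    if stack:
        name, line = stack[-1]
        return f"Unclosed <{name}> opened at line {line}"
    return None if root_seen else "No root element"
```

Calling check_well_formed("<a><b></b></a>") returns None, while check_well_formed("<a><b></a></b>") returns a message that starts with "Mismatched tag". The stack is what enforces the LIFO rule: a close tag is only ever compared against the most recently opened element.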

Example: cleaning up a SOAP response

Input (one long line):

<?xml version="1.0"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body><Response><Status>OK</Status><Data>...</Data></Response></soap:Body></soap:Envelope>

Format with 2 spaces:

<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Response>
      <Status>OK</Status>
      <Data>...</Data>
    </Response>
  </soap:Body>
</soap:Envelope>

Now you can actually find the <Status> element with eye-grep, not text-search.
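
For the curious, here is a rough Python sketch of how an indenting pass like this could work. It is an assumption about the general technique, not the tool's code: format_xml and TOKEN are hypothetical names, and the regex assumes no > inside attribute values or comments.

```python
import re

# Hypothetical tokeniser: anything in angle brackets is a tag, the rest is text.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def format_xml(xml: str, indent: str = "  ") -> str:
    """Indent one token per line; keep <x>text</x> on a single line."""
    tokens = [t for t in TOKEN.findall(xml) if t.strip()]
    lines, depth, i = [], 0, 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("</"):
            depth -= 1                              # close tag: step back out first
            lines.append(indent * depth + tok)
        elif tok.startswith(("<?", "<!")) or tok.endswith("/>"):
            lines.append(indent * depth + tok)      # PI, comment, doctype, self-closing
        elif tok.startswith("<"):
            text_only = (
                i + 2 < len(tokens)
                and not tokens[i + 1].startswith("<")
                and tokens[i + 2].startswith("</")
            )
            if text_only:                           # open tag + text + close tag
                lines.append(indent * depth + tok + tokens[i + 1] + tokens[i + 2])
                i += 2
            else:
                lines.append(indent * depth + tok)
                depth += 1
        else:
            lines.append(indent * depth + tok.strip())  # text next to child elements
        i += 1
    return "\n".join(lines)


print(format_xml("<a><b>1</b><c/></a>"))
```

Fed the one-line SOAP input above, this sketch reproduces the nine-line output shown. The text_only check is the design choice that keeps <Status>OK</Status> on one line instead of spreading it over three.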

Example: spotting a missing close tag

Input:

<book>
  <title>The Pragmatic Programmer</title>
  <author>Hunt &amp; Thomas
</book>

Validate output: Mismatched tag: expected </author> (opened at line 3) but got </book> — line 4, column 1. The missing </author> is now obvious; the parser pointed at the close that arrived earlier than expected.
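
Any conforming parser stops at the same place, which makes for a handy cross-check. The sketch below uses Python's standard xml.etree.ElementTree; locate_error and BROKEN are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

BROKEN = """<book>
  <title>The Pragmatic Programmer</title>
  <author>Hunt &amp; Thomas
</book>"""


def locate_error(xml: str):
    """Return (line, column) of the first well-formedness error, or None."""
    try:
        ET.fromstring(xml)
    except ET.ParseError as err:
        return err.position   # line is 1-based; expat counts columns from 0
    return None


print(locate_error(BROKEN))
```

The reported line is 4, the line holding </book>, which matches the message above. The column will not read the same as the tool's, because the underlying expat parser counts columns from 0.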

Example: minifying for transport

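Input (formatted, 1.2 KB; excerpt shown):

<library>
  <book id="1">
    <title>Clean Code</title>
  </book>
</library>

For this excerpt, Minify produces <library><book id="1"><title>Clean Code</title></book></library>. Across the full file the content is unchanged and the output is ~30% smaller; the exact saving depends on nesting depth and indent width. Useful before embedding in a JSON string, base64-encoding for transport, or computing a content hash that should not change with indentation.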
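
A minimal sketch of the idea in Python, assuming a plain regex pass rather than the tool's tokeniser (minify_xml is a hypothetical name). The shortcut is cruder than a token-aware pass: it would also collapse a whitespace-only gap between two inline elements, or a matching run inside a comment or CDATA block.

```python
import re


def minify_xml(xml: str) -> str:
    """Drop whitespace-only runs that sit between a '>' and the next '<'."""
    return re.sub(r">\s+<", "><", xml.strip())


formatted = """<library>
  <book id="1">
    <title>Clean Code</title>
  </book>
</library>"""

minified = minify_xml(formatted)
print(minified)
print(len(formatted), "->", len(minified), "characters")
```

For the five-line snippet shown, the count drops from 76 to 64 characters. The saving grows with nesting depth, since every extra level adds another run of indentation to strip.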

Common XML mistakes

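Unescaped & or < in text. <title>Hunt & Thomas</title> is malformed; write Hunt &amp; Thomas. The tokeniser flags this as soon as it hits the bare & inside text content. A bare > is legal in text (except as part of the sequence ]]>), though escaping it as &gt; is conventional.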
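
If the XML is generated by code, the simplest fix is to escape at the point of writing. A short Python sketch using the standard library's xml.sax.saxutils; the sample values are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# Escape text content before placing it between tags.
title = "<title>" + escape("Hunt & Thomas") + "</title>"

# quoteattr() escapes the value and adds matching quote delimiters.
link = "<a href=" + quoteattr("/search?q=xml&page=2") + "/>"

print(title)
print(link)
```

escape() handles &, < and >; quoteattr() additionally picks a quote character that does not clash with the value.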

Mismatched attribute quotes. <a href="x'>... opens with a double quote and closes with a single quote, so it is malformed. Either delimiter works as long as the opening and closing quotes match.

Self-closing vs open/close confusion. <br> is HTML, not XML: XML requires <br/> (or the explicit pair <br></br>). XHTML compatibility trips people up for the same reason, because HTML-style void tags such as <br> are not well-formed XML.

Multiple root elements. Some legacy systems concatenate documents into one stream; the tool will reject that with Multiple root elements. Wrap them: <root>...</root><root>...</root> becomes <wrapper><root>...</root><root>...</root></wrapper>.
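
The wrapping step can be scripted. The sketch below is one way to do it in Python, with a hypothetical helper wrap_roots; it assumes the concatenated documents carry no XML declaration or doctype of their own, since those are not allowed inside an element.

```python
import xml.etree.ElementTree as ET


def wrap_roots(stream: str, wrapper: str = "wrapper") -> ET.Element:
    """Parse a stream of sibling documents by enclosing it in one root."""
    return ET.fromstring(f"<{wrapper}>{stream}</{wrapper}>")


stream = "<root><n>1</n></root><root><n>2</n></root>"
doc = wrap_roots(stream)
print([n.text for n in doc.iter("n")])
```

Parsing stream directly raises a ParseError; once wrapped, the two original roots become ordinary children of the wrapper element.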

What this tool does not do

It doesn’t validate against a schema (XSD, DTD, RelaxNG). For that you need the schema and a validating parser like xmllint.

It doesn’t expand entity references. &amp; stays as &amp; — rewriting to & would break the document.

It doesn’t handle non-UTF-8 input encoding. Convert to UTF-8 first if your file has a different declared encoding.
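
A conversion can be as small as the Python sketch below. to_utf8 is a hypothetical helper, and it assumes you already know the source encoding; it also rewrites the encoding named in the XML declaration so that it matches the new bytes.

```python
import re

# Matches the encoding pseudo-attribute inside the XML declaration.
DECLARED = re.compile(r"""(<\?xml[^>]*encoding=)(["'])[^"']*\2""")


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode with the known source encoding and re-encode as UTF-8."""
    text = raw.decode(source_encoding)
    text = DECLARED.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


latin1 = '<?xml version="1.0" encoding="ISO-8859-1"?><n>caf\u00e9</n>'.encode("iso-8859-1")
print(to_utf8(latin1, "iso-8859-1").decode("utf-8"))
```

Updating the declaration matters: a file that claims ISO-8859-1 but contains UTF-8 bytes will be misread by any parser that trusts the declaration.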

It doesn’t fix malformed XML. Errors are reported with location; you have to fix the source. For the JSON and YAML equivalents, the JSON formatter & validator and YAML formatter & validator run the same pass on their respective formats.

Frequently asked questions

Does this validate against a DTD or XSD?

No. The tool checks **well-formedness** only: every open tag has a matching close tag, attribute syntax is valid, comments and CDATA are properly delimited, and exactly one root element exists. **Validity** (matching a specific DTD, XSD, or RelaxNG schema) requires the schema itself plus a validating parser. For schema validation, use a command-line tool like xmllint (part of libxml2) or an XML library for your language that supports schema validation. The well-formedness check here catches most real-world XML mistakes (typos, copy-paste that lands mid-element, unmatched tags); what remains are content-shape issues that only a schema would catch.

Why are entities like &amp; left alone?

The tool tokenises but does not expand entity references. `&amp;` stays as `&amp;` in both formatted and minified output, because rewriting it to `&` would break the document — the literal `&` is illegal in XML text. Same goes for `&lt;`, `&gt;`, `&quot;`, `&apos;`, and any custom entities defined in the doctype. If you need to read entity-decoded text, your downstream parser will do that for you.
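
To see a downstream parser do the decoding, here is a short Python example using the standard xml.etree.ElementTree; the element is borrowed from the book example and the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

author = ET.fromstring("<author>Hunt &amp; Thomas</author>")

decoded = author.text                               # the parser expands &amp;
encoded = ET.tostring(author, encoding="unicode")   # the serialiser escapes it again

print(decoded)
print(encoded)
```

The decoded text contains a literal &, while the serialised form goes back to &amp;. The escaped form is what belongs in the file; the decoded form only exists in memory.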

What's the difference between formatting and minifying?

Formatting indents the structure with whitespace so humans can read it — line breaks between elements, nesting reflected by indentation. Minifying does the opposite: removes whitespace between elements so the file is as compact as possible. Both preserve content (text inside elements is untouched). Use Format when reading or editing; Minify when transmitting (smaller payload) or when you need a deterministic byte representation for hashing or signing.

How does it handle mixed content (text + tags interleaved)?

Cautiously. A simple text-only element like `<title>Hello</title>` stays inline after formatting. But true mixed content — `<p>This is <em>important</em> text</p>` — gets emitted with the inner element on its own indented line, which inflates whitespace but preserves the parsed structure. For documents that depend on exact whitespace (XHTML, MathML, some signing pipelines) prefer Minify, or skip the formatter entirely and use the original source. Round-trip behaviour is preserved when you minify the formatted output.
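
The effect on mixed content is easy to demonstrate in Python. The formatted string below is an assumption about what an indenting formatter would emit for this input, not captured tool output; text_of is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

original = "<p>This is <em>important</em> text</p>"
# Assumed shape of the formatted output: inner element on its own indented line.
formatted = "<p>\n  This is\n  <em>important</em>\n  text\n</p>"


def text_of(xml: str) -> str:
    """Concatenate every text node in document order."""
    return "".join(ET.fromstring(xml).itertext())


print(repr(text_of(original)))
print(repr(text_of(formatted)))
print(" ".join(text_of(formatted).split()) == text_of(original))
```

The two documents parse to different text because the indentation becomes part of the text nodes. They only compare equal after whitespace is normalised, which is why whitespace-sensitive pipelines should avoid formatting mixed content.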

Why doesn't it use the browser's DOMParser?

Two reasons. First, DOMParser silently 'recovers' from many malformed inputs — it produces a document with `<parsererror>` injected, which is harder to surface as a usable error message than a genuine parse exception. Second, hand-rolled tokenisation gives precise line + column on every error, which DOMParser does not consistently expose. The hand-rolled tokeniser handles the common cases (elements, attributes, comments, CDATA, PIs, doctype) and runs identically in tests and the browser.