XML is still everywhere — SOAP envelopes, Office Open XML (DOCX/XLSX/PPTX internals), Android layouts, Maven and Ant build files, RSS and Atom feeds, OPDS catalogs, KML and GPX route data, SVG markup. Making sense of an unindented blob, or finding the missing close tag in a 200-line document, is exactly what this tool is for.
How parsing works
The input is tokenised left-to-right into element tags (open, close, self-closing), text content, comments, CDATA blocks, processing instructions, and the doctype declaration. A second pass walks the token stream to check well-formedness:
- Every open tag has a matching close tag (LIFO order: <a><b></b></a> is fine, <a><b></a></b> is not).
- Exactly one root element.
- No text content outside the root (whitespace and comments are allowed).
When validation fails, the tool reports the line and column of the offending token. The same tokeniser drives Format and Minify, so any structural error caught in Validate will also surface when formatting.
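The two passes described above can be sketched in a few lines of Python. This is a simplified illustration, not the tool's actual implementation: the names TOKEN, line_col and check are hypothetical, the regex ignores attribute values that contain ">" and DOCTYPE internal subsets, and it does not check for bare & in text.

```python
import re

# Simplified token pattern (an assumption, not the tool's real tokeniser).
TOKEN = re.compile(
    r"<!--.*?-->"              # comment
    r"|<!\[CDATA\[.*?\]\]>"    # CDATA block
    r"|<\?.*?\?>"              # processing instruction / XML declaration
    r"|<![^>]*>"               # doctype without an internal subset
    r"|<(?P<close>/)?(?P<name>[^\s/>]+)[^>]*?(?P<selfclose>/)?>",
    re.DOTALL,
)

def line_col(xml, offset):
    """Convert a character offset to a 1-based (line, column) pair."""
    line = xml.count("\n", 0, offset) + 1
    col = offset - (xml.rfind("\n", 0, offset) + 1) + 1
    return line, col

def check(xml):
    """Return None if the tags nest correctly, else (message, line, column)."""
    stack, roots, pos = [], 0, 0   # stack holds (name, line) of open elements
    for m in TOKEN.finditer(xml):
        gap = xml[pos:m.start()]   # text between the previous token and this one
        if gap.strip() and not stack:
            off = pos + len(gap) - len(gap.lstrip())
            return ("Text outside root element", *line_col(xml, off))
        pos = m.end()
        name = m.group("name")
        if name is None:
            continue               # comment, CDATA, PI or doctype
        line, col = line_col(xml, m.start())
        if m.group("close"):
            if not stack:
                return (f"Unexpected </{name}>", line, col)
            open_name, open_line = stack.pop()
            if open_name != name:
                return (f"Mismatched tag: expected </{open_name}> "
                        f"(opened at line {open_line}) but got </{name}>", line, col)
        else:
            if not stack:
                roots += 1
                if roots > 1:
                    return ("Multiple root elements", line, col)
            if not m.group("selfclose"):
                stack.append((name, line))
    tail = xml[pos:]
    if tail.strip() and not stack:
        off = pos + len(tail) - len(tail.lstrip())
        return ("Text outside root element", *line_col(xml, off))
    if stack:
        name, line = stack[-1]
        return (f"Unclosed <{name}> (opened at line {line})", *line_col(xml, len(xml)))
    if roots == 0:
        return ("No root element", 1, 1)
    return None

print(check("<a><b></b></a>"))   # well-formed nesting
print(check("<a><b></a></b>"))   # close tags arrive in the wrong order
```

The stack is what enforces the LIFO rule: a close tag is only valid if it matches the most recently opened element, and an open tag that arrives while the stack is empty counts as a new root.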
Example: cleaning up a SOAP response
Input (one long line):
<?xml version="1.0"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body><Response><Status>OK</Status><Data>...</Data></Response></soap:Body></soap:Envelope>
Format with 2 spaces:
<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Response>
      <Status>OK</Status>
      <Data>...</Data>
    </Response>
  </soap:Body>
</soap:Envelope>
Now you can actually find the <Status> element with eye-grep, not text-search.
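For comparison, roughly the same result can be produced offline with Python's standard library. This is a sketch using xml.dom.minidom, not what the tool runs internally; the helper name pretty is hypothetical. Note that minidom rewrites the declaration slightly (it emits a space before ?>) and adds blank lines if the input already contains indentation.

```python
from xml.dom import minidom

raw = (
    '<?xml version="1.0"?>'
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Body><Response><Status>OK</Status><Data>...</Data></Response>"
    "</soap:Body></soap:Envelope>"
)

def pretty(xml: str, indent: str = "  ") -> str:
    """Re-indent a one-line XML document; text-only elements stay on one line."""
    return minidom.parseString(xml).toprettyxml(indent=indent)

print(pretty(raw))
```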
Example: spotting a missing close tag
Input:
<book>
  <title>The Pragmatic Programmer</title>
  <author>Hunt &amp; Thomas
</book>
Validate output: Mismatched tag: expected </author> (opened at line 3) but got </book> — line 4, column 1. The missing </author> is now obvious; the parser pointed at the close that arrived earlier than expected.
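If you want to cross-check a report like this in a script, Python's xml.etree.ElementTree exposes the error position too. The sketch below is an independent check, not the tool's code; the helper name first_error is hypothetical, and the ampersand is written as &amp; so that the tag mismatch is the first error the parser meets. The standard-library message is terser than the tool's and does not say where the unclosed element was opened.

```python
import xml.etree.ElementTree as ET

broken = """<book>
  <title>The Pragmatic Programmer</title>
  <author>Hunt &amp; Thomas
</book>"""

def first_error(xml: str):
    """Return (message, line, column) for the first parse error, or None."""
    try:
        ET.fromstring(xml)
    except ET.ParseError as err:
        line, column = err.position   # expat reports a 1-based line and a 0-based column
        return str(err), line, column
    return None

print(first_error(broken))
```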
Example: minifying for transport
Input (formatted, 1.2 KB):
<library>
  <book id="1">
    <title>Clean Code</title>
  </book>
</library>
Minify produces <library><book id="1"><title>Clean Code</title></book></library> — same content, ~30% smaller. Useful before embedding in a JSON string, base64-encoding for transport, or computing a deterministic content hash.
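A naive version of the same idea fits in one regular expression. This is only a sketch under a strong assumption: whitespace between tags is never significant. It would also collapse whitespace inside comments or CDATA sections that happen to contain "> <", which a tokeniser-based minifier can avoid. The function name minify is hypothetical.

```python
import re

def minify(xml: str) -> str:
    """Drop whitespace-only runs between adjacent tags (naive, regex-based)."""
    return re.sub(r">\s+<", "><", xml.strip())

formatted = """<library>
  <book id="1">
    <title>Clean Code</title>
  </book>
</library>"""

small = minify(formatted)
saved = 1 - len(small.encode("utf-8")) / len(formatted.encode("utf-8"))
print(small)
print(f"{saved:.0%} smaller")
```

The saving depends entirely on how deeply the input is indented, so measure it on your own documents rather than relying on a fixed figure.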
Common XML mistakes
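Unescaped &, <, > in text. <title>Hunt & Thomas</title> is malformed; write Hunt &amp; Thomas. The tokeniser flags this as soon as it hits the bare & inside text content. (Strictly, a bare > is legal in text unless it is part of ]]>, but escaping it as &gt; is the safe habit.)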
Single quotes vs double quotes inconsistency. <a href="x'>... (mismatched delimiters) is malformed. Either delimiter works as long as it’s matched.
Self-close vs open / close confusion. <br> is HTML, not XML. XML requires <br/> (or the explicit pair <br></br>). XHTML's compatibility guidelines exist for the same reason.
Multiple root elements. Some legacy systems concatenate documents into one stream; the tool will reject that with Multiple root elements. Wrap them: <root>...</root><root>...</root> becomes <wrapper><root>...</root><root>...</root></wrapper>.
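Two of these fixes are easy to apply at the point where the XML is generated. The sketch below uses Python's standard xml.sax.saxutils helpers for escaping; wrap_roots is a hypothetical helper, and it assumes the concatenated documents carry no XML declarations or doctypes of their own (those would have to be stripped first, since they are only legal at the very start of a document).

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Escape text content before it goes between tags.
title = f"<title>{escape('Hunt & Thomas')}</title>"
print(title)

# quoteattr returns the value already wrapped in a matching pair of quotes.
link = f"<a href={quoteattr('x')}>link</a>"
print(link)

def wrap_roots(stream: str, wrapper: str = "wrapper") -> str:
    """Wrap a stream of concatenated documents in a single root element."""
    return f"<{wrapper}>{stream}</{wrapper}>"

doc = ET.fromstring(wrap_roots("<root>a</root><root>b</root>"))
print([child.text for child in doc])
```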
What this tool does not do
It doesn’t validate against a schema (XSD, DTD, RelaxNG). For that you need the schema and a validating parser like xmllint.
It doesn’t expand entity references. &amp; stays as &amp;; rewriting it to a bare & would break the document.
It doesn’t handle non-UTF-8 input encoding. Convert to UTF-8 first if your file has a different declared encoding.
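One way to do that conversion up front is sketched below, assuming you already know the encoding the file declares. The function name to_utf8 is hypothetical; it decodes the bytes, rewrites the encoding pseudo-attribute in the XML declaration if there is one, and re-encodes as UTF-8.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
DECL_ENCODING = re.compile(r"""(<\?xml[^>]*?encoding\s*=\s*)(["'])[^"']*\2""")

def to_utf8(raw: bytes, declared: str) -> bytes:
    """Transcode XML bytes from the declared encoding to UTF-8."""
    text = raw.decode(declared)
    # Keep the declaration truthful: it must name the encoding actually used.
    text = DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")

latin1 = '<?xml version="1.0" encoding="ISO-8859-1"?><n>caf\u00e9</n>'.encode("latin-1")
print(to_utf8(latin1, "iso-8859-1").decode("utf-8"))
```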
It doesn’t fix malformed XML. Errors are reported with location; you have to fix the source.
For the JSON and YAML equivalents, the JSON formatter & validator and YAML formatter & validator run the same pass on their respective formats.