XPath Validator

Check whether an XPath expression is syntactically valid, evaluate it against a pasted HTML document, and count how many nodes match.

XPath: the W3C query language for XML and HTML trees

XPath (XML Path Language) is a W3C-standardized expression language for selecting nodes in an XML document tree. Because browsers and HTML parsers expose every page as a DOM tree, XPath has become the lingua franca of web scraping, XSLT transformations, Selenium test automation (and Cypress through a plugin) and XQuery data processing.

The standard has gone through four major revisions: XPath 1.0 (1999), the version most tools still implement; XPath 2.0 (2007, second edition 2010), which added typed sequences, regular expressions and a richer function library aligned with XQuery; XPath 3.0 (2014), which introduced higher-order functions and improved error handling; and XPath 3.1 (2017), which added maps, arrays and JSON support. Browsers (document.evaluate()) and most scraping libraries still implement only 1.0; on the JVM, Saxon (HE/EE) is the most widely used engine for 3.1.

Syntax fundamentals: axes, predicates and steps

An XPath expression is a sequence of location steps separated by /. Each step has the shape axis::node-test[predicate]:

  • /root/child: absolute path starting from the document root.
  • //descendant: match the node at any depth. A leading // searches the whole document; .// searches below the context node.
  • *: wildcard for any element; @attr selects an attribute.
  • [predicate]: boolean or positional filter applied to the node set.
  • . and ..: current node and parent node, respectively.

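There are 13 axes in XPath 1.0, including parent, child, descendant, ancestor, following-sibling, preceding-sibling, self and attribute. Most engines accept the abbreviated forms (@ for attribute::, .. for parent::node(), // for /descendant-or-self::node()/), but axes such as ancestor, following-sibling and preceding-sibling have no abbreviation, so the long form is the only way to climb beyond the parent or to move sideways.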
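The abbreviated steps above can be tried with nothing but Python's standard library. This is a minimal sketch, assuming a made-up sample document; xml.etree.ElementTree implements only a small subset of XPath 1.0 (no long axis names such as ancestor::, no functions such as contains()), and it requires relative paths when searching from an element, so it stands in here for a full engine such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
doc = ET.fromstring(
    "<root>"
    "<child id='a'><leaf>one</leaf></child>"
    "<child id='b'><leaf>two</leaf><leaf>three</leaf></child>"
    "</root>"
)

# Child steps: root -> child -> leaf, one level at a time.
leaves = doc.findall("./child/leaf")

# Descendant shortcut: <leaf> at any depth below the context node.
deep = doc.findall(".//leaf")

# Wildcard: every element child of the context node.
children = doc.findall("./*")

# Attribute predicate: only the <child> whose id attribute equals 'b'.
only_b = doc.findall("./child[@id='b']")

# Parent step: from each <leaf>, '..' climbs to its <child> (duplicates removed).
parents = doc.findall(".//leaf/..")

print([e.text for e in leaves])        # ['one', 'two', 'three']
print([e.get("id") for e in parents])  # ['a', 'b']
```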

Practical examples and core functions

A handful of expressions cover 90% of real-world scraping work:

//div[@class='card']                    → every <div class="card">
//a[contains(@href, 'github')]          → links whose href contains "github"
(//input[@type='text'])[1]              → first text input in the document
//tr[td[normalize-space()='Total']]     → rows that contain a cell whose text is "Total"
(//*[@id='main']//p)[position() < 4]    → first three paragraphs in #main
count(//li)                             → number of list items
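One detail of positional predicates is worth checking by hand: a predicate binds to its own step, so //li[1] means "every li that is the first li of its parent", while (//li)[1] means "the first li in the whole document". The sketch below illustrates the per-parent behaviour with the standard library on a made-up page. ElementTree does not support the parenthesized form, so find(), which returns only the first match, is used as a stand-in for (//li)[1].

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two separate lists.
page = ET.fromstring(
    "<body>"
    "<ul><li>a1</li><li>a2</li></ul>"
    "<ul><li>b1</li><li>b2</li></ul>"
    "</body>"
)

# li[1] is evaluated once per parent, so each <ul> contributes one match.
first_per_list = [li.text for li in page.findall(".//li[1]")]

# Stand-in for (//li)[1]: find() stops at the first match in document order.
first_overall = page.find(".//li").text

print(first_per_list)  # ['a1', 'b1']
print(first_overall)   # a1
```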

The most useful XPath 1.0 functions are contains(), starts-with(), normalize-space(), string-length(), count(), position() and last(). text() is often listed alongside them, but it is a node test that selects text nodes, not a function. XPath 2.0+ adds matches() with regex, tokenize(), replace() and full date/time arithmetic.

XPath vs CSS selectors: when each one wins

CSS selectors are shorter, usually faster in browsers and idiomatic to front-end developers. XPath is the better tool whenever you need upward traversal (.. or ancestor::), text content matching (contains(text(), 'Total')), positional logic across a whole result set or complex boolean predicates. Selectors Level 4 is closing the gap with :has() and :is(), but XPath still covers cases CSS cannot, such as matching on text content. Most Selenium projects mix both: CSS for the simple cases, XPath for the hard ones.

Tooling: browsers, scraping libraries and editors

  • Chrome / Firefox DevTools: type $x("//a") in the Console to evaluate XPath against the live DOM.
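  • Python: lxml (libxml2 bindings, typically the fastest option) and parsel (used by Scrapy). BeautifulSoup has no XPath support of its own, even with the lxml backend, so use lxml directly when you need XPath.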
  • Node.js: the xpath npm package on top of xmldom, and Playwright's page.locator("xpath=...").
  • JVM: Saxon-HE (free) and Saxon-EE for full XPath 3.1, plus the built-in javax.xml.xpath for 1.0.
  • Selenium: an XPath locator in every binding (By.XPATH in Python, By.xpath in Java and JavaScript, By.XPath in C#, :xpath in Ruby).
  • Online playgrounds: xpath.in, freeformatter.com/xpath-tester and the W3C XML tutorial sandbox.

Common pitfalls and performance notes

XPath has a few gotchas that bite newcomers:

  • Indexing is 1-based, not 0-based, so //li[1] is the first item.
  • HTML is not XML: invalid markup will break a strict XML parser before any XPath runs, so always feed scraped pages through an HTML-aware parser such as lxml's html.fromstring or jsoup.
  • Namespaces must be bound explicitly (use a namespace context in Java, the namespaces dict in lxml).
  • The // shortcut is expensive on large documents because it walks every descendant, so prefer anchored paths when performance matters.

For very large XML, native engines such as libxml2 are often noticeably faster than pure-Java or pure-JavaScript implementations, although the gap depends on the document and the query.
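The "HTML is not XML" point is easy to reproduce with the standard library alone. The sketch below assumes a made-up snippet with an unclosed <br>: the strict XML parser rejects it, while Python's tolerant html.parser tokenizes it without complaint. Note that html.parser only reports tags and does not build a tree you can query with XPath; for real scraping, the HTML-aware parsers named above are the ones to use.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: valid HTML, but not well-formed XML (<br> is never closed).
snippet = "<div><p>first<br>second</p></div>"


def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records start-tag names; tolerant of void and unclosed elements."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(snippet)
collector.close()

xml_ok = parses_as_xml(snippet)
html_tags = collector.tags

print(xml_ok)     # False
print(html_tags)  # ['div', 'p', 'br']
```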

FAQ

Should I use CSS or XPath in Selenium?

CSS is usually quicker to read and often slightly faster to execute in modern engines. Reach for XPath whenever you need to match by visible text, select an ancestor several levels up or filter a whole result set by position. CSS has no text matching at all, and :has() covers only part of the upward-traversal cases.

Is XPath indexing 0-based or 1-based?

XPath uses 1-based indexing. //li[1] selects the first <li> under each parent, not the second. This trips up many developers coming from JavaScript or Python.
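A quick way to see the off-by-one difference is to compare XPath positions with Python list indices on the same elements. This is a small sketch on a made-up list, using the position predicates that the standard library's ElementTree supports ([n] and [last()]).

```python
import xml.etree.ElementTree as ET

# Hypothetical three-item list.
ul = ET.fromstring("<ul><li>first</li><li>second</li><li>third</li></ul>")
items = ul.findall("li")  # a plain Python list, indexed from 0

xpath_first = ul.find("li[1]").text        # XPath position 1
xpath_second = ul.find("li[2]").text       # XPath position 2
xpath_last = ul.find("li[last()]").text    # last() is the final position

print(xpath_first == items[0].text)   # True
print(xpath_second == items[1].text)  # True
print(xpath_last == items[-1].text)   # True
```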

Can I validate XPath offline?

Yes. Any conforming XPath engine parses the expression locally without contacting the network. This tool runs entirely in your browser using the built-in document.evaluate() API, so nothing is sent to a server.

Why does my expression return zero nodes against an HTML page?

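Three usual suspects: (i) the page uses XHTML namespaces that you have not bound; (ii) the element names in your expression do not match the case of the parsed document: XPath 1.0 name tests are case-sensitive and HTML parsers generally normalize tags to lowercase, so //DIV may match nothing where //div works; (iii) the content is rendered by JavaScript after page load and was not in the original HTML. In that last case, query the rendered DOM, not the response body.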
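The namespace case (i) is the least obvious of the three, so here is a minimal sketch using the standard library. The XHTML document is made up, and the prefix x is an arbitrary choice; only the namespace URI has to match the document. The same idea applies to the namespaces argument in lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML page: the default namespace applies to every element.
xhtml = ET.fromstring(
    "<html xmlns='http://www.w3.org/1999/xhtml'>"
    "<body><p>hello</p></body>"
    "</html>"
)

# Without a binding, 'p' means "p in no namespace" and matches nothing.
unbound = xhtml.findall(".//p")

# Bind an arbitrary prefix to the XHTML namespace URI and use it in the path.
ns = {"x": "http://www.w3.org/1999/xhtml"}
bound = xhtml.findall(".//x:p", ns)

print(len(unbound))              # 0
print([p.text for p in bound])   # ['hello']
```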

What is the difference between / and //?

/ selects a direct child, // any descendant at any depth. /html/body/div matches only top-level divs inside the body; //div matches every div in the document, but at higher CPU cost.
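The difference is easy to confirm on a small tree. The sketch below assumes a made-up page and uses the standard library's ElementTree, which only accepts relative paths when searching from an element, so ./body/div evaluated from the <html> element plays the role of /html/body/div.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one top-level div, one nested div, one div inside a section.
html = ET.fromstring(
    "<html><body>"
    "<div id='top'><div id='nested'/></div>"
    "<section><div id='deep'/></section>"
    "</body></html>"
)

# Child steps only: divs that are direct children of <body>.
direct = [d.get("id") for d in html.findall("./body/div")]

# Descendant shortcut: every div at any depth, in document order.
anywhere = [d.get("id") for d in html.findall(".//div")]

print(direct)    # ['top']
print(anywhere)  # ['top', 'nested', 'deep']
```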
