Database Patterns: Easy XPath against HTML

Wednesday, July 27, 2016

UPDATE. Even easier with xidel:

xidel 'http://example.com' --extract '//title'

ORIGINAL:

curl -L example.com | \
  tidy -asxml -numeric -utf8 | \
  sed -e 's/ xmlns.*=".*"//g' | \
  xml select -t -v "//title" -n

Get the HTML content from http://example.com using curl
Use HTML Tidy to tidy it up, convert it to XHTML, change &entities; to numeric ones, and set the encoding to UTF-8
Use sed to remove the XML namespace, for simpler XPaths
Use XMLStarlet to select by XPath (on some systems the command is installed as xmlstarlet rather than xml)
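
To see what the sed step does in isolation, here is a minimal sketch that runs the same expression over one sample line. The `strip_ns` function name and the sample string are only illustrative assumptions, not part of any of the tools:

```shell
# Namespace-stripping step on its own, using the sed expression from the pipeline.
# strip_ns is a hypothetical helper name chosen for this sketch.
strip_ns() {
  sed -e 's/ xmlns.*=".*"//g'
}

# Hypothetical sample line, similar to the root element tidy -asxml emits.
sample='<html xmlns="http://www.w3.org/1999/xhtml">'

# Prints the tag with the xmlns attribute removed.
printf '%s\n' "$sample" | strip_ns

# If the namespace were left in place, XMLStarlet would need a prefix binding instead:
#   xml select -N x="http://www.w3.org/1999/xhtml" -t -v "//x:title" -n
```

With the namespace gone, a plain `//title` matches; otherwise every step of the XPath would have to carry a prefix such as `x:`.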

You can output multiple columns like so:

curl -L example.com | \
  tidy -asxml -numeric -utf8 | \
  sed -e 's/ xmlns.*=".*"//g' | \
  xml select -t -v "//title" -o ',' -v "//another" -n
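
The two absolute XPaths in that command are evaluated independently against the whole document, so they do not pair values up row by row. If you want one row per element, a possible approach is XMLStarlet's `-m` (match) option, which evaluates the `-v` expressions relative to each matched node. A rough sketch, assuming you want every link as `href,text`; `links_csv` is a made-up helper name:

```shell
# Sketch: print one "href,text" row per link found in XHTML read from stdin.
# links_csv is a hypothetical helper name, not part of XMLStarlet.
links_csv() {
  # The binary is named xmlstarlet on some systems and xml on others.
  xml_bin=$(command -v xmlstarlet || command -v xml) || return 127
  # -m iterates over each matching node; each -v is evaluated relative to that node.
  "$xml_bin" select -t -m "//a" -v "@href" -o ',' -v "normalize-space(.)" -n
}
```

It slots in as the last stage of the same pipeline, for example `curl -L example.com | tidy -asxml -numeric -utf8 | sed -e 's/ xmlns.*=".*"//g' | links_csv`.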
