Wednesday, July 27, 2016

Easy XPath against HTML

UPDATE. Even easier with xidel:

xidel 'http://example.com' --extract '//title'


ORIGINAL:

curl -L example.com | \
  tidy -asxml -numeric -utf8 | \
  sed -e 's/ xmlns.*=".*"//g' | \
  xml select -t -v "//title" -n

  1. Get the HTML content from http://example.com using curl
  2. Use HTML Tidy to tidy it up, covert it to XHTML, change &entities; to numeric ones, and set the encoding as UTF-8
  3. Use sed to remove the XML namespace, for simpler XPaths
  4. Use XML Starlet to select by XPath

You can output multiple columns like so:
curl -L example.com | \
  tidy -asxml -numeric -utf8 | \
  sed -e 's/ xmlns.*=".*"//g' | \
  xml select -t -v "//title" -o ','-v "//another" -n

No comments: