Wednesday, July 27, 2016

Crush Tools convdate

Convert dates on the command line
$ echo '7-25-2015-12:12:12' | convdate --verbose
2016-07-25-12:12:12
$ echo '7/25/16' | convdate --verbose -i '%m/%d/%y' -o '%Y-%m-%d'
2016-07-25

where
i is the input format, as per http://pubs.opengroup.org/onlinepubs/009695399/functions/strptime.html 
o is the output format, as per http://pubs.opengroup.org/onlinepubs/009695399/functions/strftime.html

Crush Tools subtotal

Given foo.csv:
group1,1,2
group1,1,3
group1,1,4
group2,1,5
group2,1,6

We want to subtotal columns 2 and 3 (individually) by group in column1:
subtotal --key 1 --sum 2,3 --delim , -i foo.csv
group1,1,2
group1,1,3
group1,1,4
,3,9

group2,1,5
group2,1,6
,2,11

Easy XPath against HTML

UPDATE. Even easier with xidel:

xidel 'http://example.com' --extract '//title'


ORIGINAL:

curl -L example.com | \
  tidy -asxml -numeric -utf8 | \
  sed -e 's/ xmlns.*=".*"//g' | \
  xml select -t -v "//title" -n

  1. Get the HTML content from http://example.com using curl
  2. Use HTML Tidy to tidy it up, covert it to XHTML, change &entities; to numeric ones, and set the encoding as UTF-8
  3. Use sed to remove the XML namespace, for simpler XPaths
  4. Use XML Starlet to select by XPath

You can output multiple columns like so:
curl -L example.com | \
  tidy -asxml -numeric -utf8 | \
  sed -e 's/ xmlns.*=".*"//g' | \
  xml select -t -v "//title" -o ','-v "//another" -n

Monday, July 25, 2016

Crush Tools dbstream with PostgreSQL

There's a command line library from Google called Crush Tools. It hasn't been updated for a while, but a few commands look useful, like pivot and aggregate.

There's also one called dbstream, which is a perl script to stream to/from a database.

Here's how to run it (on OS X anyways):

Install it (crush-tools, Perl package manager, Perl PostgreSQL driver):
$ brew install crush-tools
$ curl -L https://cpanmin.us | perl - --sudo App::cpanminus
$ sudo cpanm DBD::Pg

And run it:
$ dbstream -de , -ds dbi:Pg:dbname=foo -u alice -p secret -s "select 1,2;"

or insert each record from a file:
$ cut -d, -f1-2 input.csv | dbstream ... -s "insert into tbl values (?,?);"

or look up each record from a file:
$ cut -d, -f1 input.csv | dbstream ... -s "select name from tbl where code=?;"

or update up each record from a file:
$ awk -F, '{print $2","$1}' input.csv | dbstream ... -s "update tbl set name=? where id=?;"

where
-de is the input & output field separator
-ds is a Perl DBI datasource
-u username
-p password
-s SQL

It's here if it's not in your package manager: https://github.com/google/crush-tools

Wednesday, May 25, 2016

Compiling Multicorn on OpenBSD


Install the following:

python 2.7
gmake
git


cd /tmp

git clone git://github.com/Kozea/Multicorn.git
cd Multicorn

Change the first line of preflight-check.sh so can run it with ksh

sed -i 1s/bash/sh/ preflight-check.sh

(I've created a pull request to change this)

Then

gmake && gmake install

Thursday, January 7, 2016

PostgreSQL 9.6 Wish List

First, a huge congratulations and thank you to the PostgreSQL Global Development Group for shipping such great features in 9.5!

I'm teaching myself C (I'm a Java guy) so that I can contribute down the road. 

Here's what I'd love to see in 9.6:

  • Okapi BM-25 as the default relevance measure for text search. SQLite has it, so it should be doable for Postgres. 
  • Overall, make text search a little more Solr-like (n-grams and such)
  • Allow custom, non-C functions (like PLV8) for text search parsing, so I can write my own parser
  • Allow parameters for SET ROLE ? and LISTEN ? so it's a bit safer to call them from a web application
  • Security cookie for SET ROLE, so that user cannot switch roles without permission. See this post for more info.
  • Async LISTEN/NOTIFY in the default JDBC driver.  pgjdbc-ng has it, so it should be doable.

Nice To Have:
  • Kill MD5 forever and use bcrypt or scrypt for password hashing
  • Raise an exception if an identifier is longer than 63 characters, instead of silently truncating it. This was an actual issue with Drupal Commerce
  • Begin SQL:2011 Temporal feature set