Monday, July 28, 2014

Searching multiple columns using PostgreSQL text search

You can do multiple column text search out-of-the-box with PostgreSQL. Well, you can do it on up to four columns.

The trick is to use the tsvector "weight" label as a label representing the column.


Our schema:

create table multi_fields (
  persons_initials text not null,
  phone text not null,
  province char(2) not null
);

insert into multi_fields values
('ON','6045551212','BC'),
('BC','4165551212','ON'),
('ON','2505551212','BC');


Now we create a function to label our columns, and put an index on our table using that function. This example bypasses the text search parser. You'll want probably want to use setweight() and to_tsvector() || to_tsvector() ... in production.

create or replace function three_column_ts_vector(text, text, text) returns tsvector strict immutable language sql as '
  select ( $1 || '':1A '' || $2 || '':1B '' || $3 || '':1C'' )::tsvector;
';

create index on multi_fields using gin( three_column_ts_vector(persons_initials, phone, province) );


Some sample queries:

--where any column contains BC (3 results):
select
 *
from multi_fields
where 
 three_column_ts_vector(persons_initials, phone, province) @@ 'BC'::tsquery;
--where only the province column (label "C") contains BC (2 results):
select
 *
from multi_fields
where 
 three_column_ts_vector(persons_initials, phone, province) @@ 'BC:C'::tsquery;
--where only the persons_initials column (label "A") contains BC (1 result):
select
 *
from multi_fields
where 
 three_column_ts_vector(persons_initials, phone, province) @@ 'BC:A'::tsquery;
--where only the phone column (label "B") contains BC (0 results):
select
 *
from multi_fields
where 
 three_column_ts_vector(persons_initials, phone, province) @@ 'BC:B'::tsquery;
--where province is BC and the name is ON (2 results):
select
 *
from multi_fields
where 
 three_column_ts_vector(persons_initials, phone, province) @@ 'BC:C & ON:A'::tsquery;

Loading Geonames Country Info into Postgres

Download the countryInfo.txt file and strip lines beginning with the "#" character, as Postgres can only skip one line using the COPY command.

mkdir /srv/geonames

cd /srv/geonames

wget http://download.geonames.org/export/dump/countryInfo.txt

sed '/^#/ d' < countryInfo.txt > countryInfoNoComments.txt

In Postgres:

create table country_info (
  iso char(2) primary key,
  iso3 char(3) not null unique,
  iso_numeric char(3) not null,
  fips char(2),
  country text not null unique,
  capital text,
  area_in_sq_km numeric,
  population int not null,
  continent char(2) not null,
  tld text,
  currency_code char(3),
  currency_name text,
  phone text,
  postal_code_format text,
  postal_code_regex text,
  languages text,
  geonameid int,
  neighbours text,
  equivalent_fips_code text
);

copy country_info
from '/srv/geonames/countryInfoNoComments.txt'
with (
  format text,
  null ''
);

Steps to install the pg_similarity extension into Postgres

The pg_similarity extension lets you run a variety of similarity functions, such as Jaro Winkler, in Postgres.

In your shell:

cd /usr/local/src

git clone https://github.com/eulerto/pg_similarity.git

cd pg_similarity

export USE_PGXS=1

make

make install


In Postgres:

create extension pg_similarity;

Test it out:

select jarowinkler('acme, inc', 'acme incorporated');

Output
jarowinkler
double precision
10.871895424836601

Steps to install the smlar extension into Postgres

The smlar extension lets you do TF-IDF and cosine similarity calculations in Postgres.

If you figure out how to compile these on Windows (64-bit), please let me know.

In your shell:

cd /usr/local/src

git clone git://sigaev.ru/smlar.git

cd smlar

export USE_PGXS=1

make

make install


In Postgres:

create extension smlar;

Test it out:

select smlar( '{a,b}'::text[], '{c,a}'::text[] );

Output
smlar
real
10.5

Wednesday, July 23, 2014

Using the same script for Postgres COPY in Windows and Ubuntu

As per my previous blog post, you should store shared application data in %ProgramData% (Windows) or /srv (Ubuntu).

Let's say you download Geonames' allCountries.zip file and unzipped it and put it into your shared application data folder. How can we access these different O/S paths using the same SQL script with Postgres?

One way would be to create a symlink on Windows from /srv to %ProgramData%, then you can use the following script on either Windows or Ubuntu:

copy geonames
from '/srv/geonames/allCountries.txt'
with (
  format text,
  null ''
)

To create a symlink in Windows, run Command Prompt as an admin:

mklink /d c:\srv %ProgramData%

It would be great if the Postgres guys would let us use O/S environment variables (envvars) in file path, like '$SharedAppData/geonames/allCountries.txt' , then you'd only have to define a common envvar in Linux and Windows.

Shared Data files, Java

On Windows (Vista and up, at least) the correct place to store shared data for applications is %ProgramData% . For example, say we want MaxMind's GeoLite2 database file available to several applications. It should go at %ProgramData%\maxmind\GeoLite2-City.mmdb .

On Ubuntu, the correct place to store shared program data is /srv . So GeoLite would go at /srv/maxmind/GeoLite2-City.mmdb .

It would be nice in Java to have a cross-platform system property to access these locations easily. Here's how.

In Windows:


setx JAVA_TOOL_OPTIONS "%JAVA_TOOL_OPTIONS% -Dshared.data=%ProgramData%" -m

In Ubuntu:


sudo nano /etc/environment

Add/update the following:

JAVA_TOOL_OPTIONS=-Dshared.data=/srv


Then in a Java/Spring program you can use the system property like:

${shared.data}/maxmind/GeoLite2-City.mmdb


Sunday, July 6, 2014

A Probably-Secure Tomcat HTTPS Setup

Edit %CATALINA_HOME%/conf/server.xml. We are doing a few things here:

1. Setting up port 443 for HTTPS and port 80 to redirect to it
2. Disabling insecure protocols (SSL)
3. Disabling HTTP compression (HTTP compression with TLS may reveal session ids or CSRF tokens)
4. using O/S User Environment Variables instead of putting passwords into files
5. Using only good ciphers

Warning: Old clients might not work


    <Connector 
        port                            ="80" 
        protocol                        ="org.apache.coyote.http11.Http11NioProtocol"
        redirectPort                    ="443" 
    />
    
    <Connector 
        sslEnabledProtocols            = "TLSv1.2, TLSv1.1, TLSv1"
        compression                    = "off"
        allowUnsafeLegacyRenegotiation = "false"
        port                           = "443" 
        protocol                       = "org.apache.coyote.http11.Http11NioProtocol"
        SSLEnabled                     = "true" 
        scheme                         = "https" 
        secure                         = "true"
        keyAlias                       = "localhost"
        keyPass                        = "${PWD_KEY_LOCALHOST}"
        keystoreFile                   = "${user.home}/.keystore"
        keystorePass                   = "${PWD_KEYSTORE}"
        keystoreType                   = "JKS"
        ciphers                        = "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDH_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDH_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDH_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDH_RSA_WITH_AES_256_CBC_SHA384,TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA384,TLS_ECDH_RSA_WITH_AES_256_CBC_SHA,TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDH_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDH_RSA_WITH_AES_128_CBC_SHA,TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA,TLS_ECDHE_ECDSA_WITH_3DES_EDE_CBC_SHA,TLS_ECDH_RSA_WITH_3DES_EDE_CBC_SHA,TLS_ECDH_ECDSA_WITH_3DES_EDE_CBC_SHA"
 /> 

You would want to add the following to Tomcat's web.xml to redirect http traffic to https:
    <security-constraint>

        <web-resource-collection>
            <web-resource-name>HTTPSOnly</web-resource-name>
            <url-pattern>/*</url-pattern>
        </web-resource-collection>

        <user-data-constraint>
            <transport-guarantee>CONFIDENTIAL</transport-guarantee>
        </user-data-constraint>

    </security-constraint>

To pass O/S User Environment Variables to Tomcat, edit %CATALINA_BASE%/bin/setenv.bat :

set "CATALINA_OPTS=-DPWD_KEYSTORE=%PWD_KEYSTORE% -DPWD_KEY_LOCALHOST=%PWD_KEY_LOCALHOST%"

To create a keypair for your localhost development server, type:

keytool -genkeypair -validity 365 -keyalg EC -keysize 256 -alias localhost

When it (bizarrely) asks for your first and last name, type in localhost


To set O/S User Environment Variables, run cmd as that user, and type, for example SETX PWD_KEYSTORE s3cr3t