...making Linux just a little more fun!

Understanding Full Text Search in PostgreSQL

By Paul Sephton

Those who have yet to enter their first search query into Google, Yahoo, or any of the myriad other search engines available on the web, or who have no interest in using PostgreSQL as a database engine to implement their own text search facility, will find little use in reading further.

Although PostgreSQL's coverage of the topic of Full Text Search is as good as the rest of its excellent documentation, the following might serve as an introduction for those who would pursue the topic further. In the interest of conciseness, this article skips quite a lot of detail and depends on the reader's ability to infer what is not explicitly stated.

Introduction

Full Text Search (or FTS) is not a new technology. The earliest recorded patents related to the search for documents about a given topic were filed in the year 1963, more than forty-five years ago. These patents include "CONTENT ADDRESSABLE MEMORY APPARATUS" (US Pat. 3290659 - Filed Dec 30, 1963), "SCAN CONTROL, AND NORMALIZATION FOR A CHARACTER RECOGNITION SYSTEM" (US Pat. 3295105 - Filed Aug 27, 1964) and "INFORMATION RETRIEVAL SYSTEM AND METHOD" (US Pat. RE26429 - Filed Dec 8, 1964).

To this day, the ability to impose structure upon the mass of what is known as "Unstructured Data", in order to turn this haystack into an information system, continues to challenge the finest minds.

The long list of granted patents continues through 2006 and 2007, including "Device and system for information management" (US Pat. 7376273 - Filed Jun 1, 2007 - Silverbrook Research Pty Ltd), "Metasearching by sending a plurality of queries to a plurality of servers" (US Pat. 7277918 - Filed Jan 16, 2007) and "Distributed internet based speech recognition system with natural language ..." (US Pat. 7203646 - Filed May 22, 2006 - Phoenix Solutions, Inc.).

Of course, we know today that "Full Text Search" is far from the only way to organise documents. A close runner-up, though unable to reproduce the exact features of FTS, is the Bayesian search: a statistical means of indexing documents which determines the probability that two documents are similar. Bayesian search has an almost uncanny knack of deriving the essence of documents rather than depending on a literal text match. The most common use of Bayesian search technology today is in spam filters, although engines such as DeepDyve demonstrate a more ambitious use.

Several SQL databases already have a built-in implementation of FTS, notably servers such as Oracle, SQL Server, and the free MySQL and PostgreSQL servers.

MySQL's native FTS is currently available only for the MyISAM storage back-end, and is not yet available for the more popular InnoDB back-end. For Postgres, the earliest implementation of FTS was the TSearch module, replaced in version 7.4 with Tsearch2, and finally included in the core PostgreSQL engine in version 8.3.

There are many external FTS engines which may be used in conjunction with SQL database engines, or on their own. These include Lucene, a very powerful and popular engine implemented as part of the Apache project, Swish-E, and the Sphinx search engine, which has been steadily gaining in popularity.

Lucene is really a library against which programmers may develop search products. Its site has many links to generic products already developed against the library, including interfaces for Java (the original interface), PHP and .NET. Sphinx is particularly popular amongst MySQL developers, as this engine supports both the MyISAM and InnoDB database back-ends.

PostgreSQL Full Text Search

In a nutshell, Full Text Search is implemented by indexing the words contained in a document, and associating the indexed words with a reference to the document. Subsequently, a boolean search phrase using the operators and, or, not and parentheses may be matched against the index to locate documents containing the words in the phrase. PostgreSQL does not currently support the fuzzy logic operators near, far or strip.

Eliminating redundancy

Clearly, indexing every single word in a document would result in a very large index. This is neither required, nor is it particularly useful. For example, one might convert all text to lower case prior to indexing the words, making searches insensitive to case and simultaneously achieving a smaller index. Then one might eliminate words from the text which carry no real meaning (like 'and', 'or', 'the', 'for', etc.), as these words would likely be found in most documents anyway. The first approach is called normalisation, and the second process is called stop word elimination.

Finally, one might further reduce the size of the index by replacing words with others which have an identical or similar meaning. Thus one may replace instances of 'hungrily' and 'hungry' with 'hunger'. This process is called dictionary substitution.

Further algorithmic measures, such as stemming (see Snowball [Ref. 4]), may be applied to reduce words to their essential meaning before the article is indexed. The replacement of colour names with their hexadecimal equivalents and the reduction of numeric values to a lower precision are other ways of normalising the text.

All of these measures eliminate redundancy from the indexed text, thereby reducing index size and resulting in lower storage requirements, less disk I/O, faster indexing and, consequently, faster searches.

PostgreSQL Dictionaries

To aid in the process of normalisation and elimination of redundancy from text, PostgreSQL provides templates for several types of dictionary for use in a text search configuration. These are the Simple Dictionary, the Synonym Dictionary, the Thesaurus Dictionary, the iSpell Dictionary and the Snowball [Ref. 4] Dictionary.

The Simple Dictionary eliminates stop words and performs case normalisation. The Synonym Dictionary replaces one word with another, Thesaurus Dictionaries provide for industry-specific phrase recognition, and the iSpell template may be used to embed any of the standard iSpell dictionaries available from OpenOffice.org. The Snowball [Ref. 4] dictionaries, which perform algorithmic stemming and stop word elimination, are included by default for a variety of languages in the PostgreSQL installation.

With PostgreSQL, all dictionaries are functionally equivalent, with the possible exception of the thesaurus type. Effectively, they consume a word (or token), and return an array of lexemes, or NULL. When a dictionary returns NULL, the token is considered to be unrecognised. This allows dictionaries to be strung together, with the most general dictionary at the end of the list. Thus, if a dictionary earlier in the list returns a word, further dictionaries in the list are ignored. However, where an earlier dictionary returns NULL, the token is processed by subsequent dictionaries in the list.
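
The ts_lexize() function offers a convenient way of watching this behaviour for a single dictionary. As a small illustration, using the built-in english_stem Snowball dictionary (the comments show the expected results):

-- An ordinary word is reduced to its lexeme
SELECT ts_lexize('english_stem', 'stars');   -- {star}

-- A recognised stop word is discarded: an empty array, not NULL
SELECT ts_lexize('english_stem', 'a');       -- {}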

A lexeme is the equivalent of a token which has been converted to its base form. Before being passed to the dictionary, PostgreSQL converts the document text into an array of tokens by means of a simple parser. The parser is able to identify various types of tokens in the text, such as XML or HTML tags, integer or floating point numbers, version numbers, URLs, host names and so forth. PostgreSQL provides the ability to process different token types specifically, by mapping a given token type to a different dictionary list.
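
To watch the parser and the dictionary mappings at work on a piece of text, the ts_debug() function lists each token, its token type and the lexemes it produces. A quick sketch (output omitted here):

-- Show, for every token, its type (alias), the raw token and the resulting lexemes
SELECT alias, token, lexemes
  FROM ts_debug('english', 'Searching http://example.com for 42 documents');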

Text Search Configuration

Note that although PostgreSQL provides a lot of configurability, the base installation already provides a workable configuration. Unless there is a specific reason, it is probably not necessary to mess with it. Even so, an example text search configuration would be the following:

CREATE TEXT SEARCH CONFIGURATION public.my_config (COPY=pg_catalog.english );

Now, having a new text search configuration, we can create a new dictionary:

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE = ispell,
    DictFile = english,
    AffFile = english,
    StopWords = english
);

In the above, we created a dictionary based on the iSpell template, where DictFile, AffFile and StopWords refer to files in the `pg_config --sharedir`/tsearch_data/ directory, called english.dict, english.affix, and english.stop respectively.

We can then add this dictionary into a configuration like this:

ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_ispell, english_stem;

This adds our new dictionary to our new text search configuration, overriding the mappings for the default token types (asciiword, etc.) to ensure that they are processed through both the english_ispell and english_stem dictionaries. We can then start using our new text search configuration by changing the global parameter default_text_search_config:

SET default_text_search_config = 'public.my_config';

The above parameter will apply for the duration of the session, or until it is changed. To make the configuration permanent, one would need to set it in the data/postgresql.conf file.
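
To confirm which configuration is currently in effect, or to list the configurations available, something along these lines may be used (\dF is a psql meta-command rather than SQL):

-- Show the configuration in effect for this session
SHOW default_text_search_config;

-- From within psql, list the available text search configurations
\dF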

Search Vectors

All of this is very impressive, but how does one turn document content into an array of lexemes using the parser and dictionaries? How does one match a search criterion to body text? PostgreSQL provides a number of functions to do this. The first one we will look at is to_tsvector().

A tsvector is an internal data type containing an array of lexemes with position information. The lexeme positions are used when searching, to rank the search result based on proximity and other information. One may control the ranking by labelling the different portions which make up the search document content; for example, the title, body and abstract may be weighted differently during search by labelling these sections differently. The section labels, quite simply A, B, C and D, are associated with the tsvector at the time it is created, but the weight modifiers associated with those labels may be controlled after the fact.

We can create a tsvector for text like this:

bob=# select to_tsvector('Free text seaRCh is a wonderful Thing');
                     to_tsvector                      
------------------------------------------------------
 'free':1 'text':2 'thing':7 'search':3 'wonderful':6
(1 row)

Assigning Weight Labels

As may be seen, the tsvector is just a list of lexemes with associated positions. The stop words such as 'a' and 'is' have been eliminated, and we have everything in lower case. Another example, adding labels:

bob=# select setweight(to_tsvector('Free text seaRCh is a wonderful Thing'),'A');
                         setweight                         
-----------------------------------------------------------
 'free':1A 'text':2A 'thing':7A 'search':3A 'wonderful':6A
(1 row)



In essence, one may create a new tsvector with lexemes labeled for different sections of the text using setweight for the various sections:

bob=# select                                                                  
bob-# setweight(to_tsvector('All about search'), 'B') ||
bob-# setweight(to_tsvector('Free text seaRCh is a wonderful Thing'),'A');
                           ?column?                            
---------------------------------------------------------------
 'free':4A 'text':5A 'thing':10A 'search':3B,6A 'wonderful':9A
(1 row)

Assuming "All about search" was the title, and "Free text seaRCh is a wonderful Thing" was the body, we have now labelled the lexemes from the title, together with those from the body. Note that the word 'All' and 'About' in the title were both considered to be stop words. Subsequently, one may use the labels to rank the results depending on weight labelss associated with the title and body. We will visit this in more detail later.

Matching Boolean Search Phrases

But first, how does one go about matching a boolean search phrase to the tsvector? This is done using the @@ operator. For example:

bob=# select to_tsvector('Free text seaRCh is a wonderful Thing') @@ 'wonderful';
 ?column? 
----------
 t
(1 row)

On the other hand, 'wonderful' is not exactly a wonderful example of a boolean phrase, now is it? Nope. It's just ordinary text. If that text is in the wrong format, for example if it contains multiple words, the query will in fact fail. PostgreSQL provides two functions which may be used to turn text into a query, represented as the built-in tsquery data type, for subsequent matching against tsvectors. These are the plainto_tsquery() and to_tsquery() functions.

The rather limited plainto_tsquery() function simply turns the text into a tsquery with all the lexemes in that text joined by the '&' (AND) operator:

bob=# select plainto_tsquery('wonderful text');
   plainto_tsquery    
----------------------
 'wonderful' & 'text'
(1 row)

bob=# select to_tsvector('Free text seaRCh is a wonderful Thing') @@ plainto_tsquery('wonderful text');
 ?column? 
----------
 t
(1 row)

While this might be useful, plainto_tsquery() does not provide for boolean operators other than '&'. to_tsquery() goes a lot further toward providing a decent boolean phrase. Unfortunately, it is also rather finicky about the text you throw at it: it will not accept words which are not joined by the '&' or '|' operators (terms may also be negated with '!'). It does accept parentheses, which are used to indicate operator precedence, but two words directly after one another, with no operator between them, will cause to_tsquery() to fail.

bob=# select to_tsquery('wonderful text');
ERROR:  syntax error in tsquery: "wonderful text"
bob=# select to_tsquery('wonderful | text');
      to_tsquery      
----------------------
 'wonderful' | 'text'

Ranking Search Results

Nevertheless, we may use a tsquery boolean search phrase to match against any tsvector, or use a tsquery and a tsvector together to produce a ranking by means of either the ts_rank() or ts_rank_cd() function. These two functions behave slightly differently: ts_rank() is considered the 'standard' ranking function, whilst ts_rank_cd() uses the Cover Density Ranking algorithm [Ref. 6], which is more interested in how closely together the matched terms occur than in the individual terms of the query itself.

To rank a match, one would use:

bob=# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful | thing'));
  ts_rank  
-----------
 0.0607927
(1 row)

bob=# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
  ts_rank  
-----------
 0.0991032
(1 row)

bob=# select ts_rank_cd(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
 ts_rank_cd 
------------
        0.1
(1 row)

Relating this ranking ability back to the subject of tsvector weights, you will recall

bob=# select                                                                  
bob-# setweight(to_tsvector('All about search'), 'B') ||
bob-# setweight(to_tsvector('Free text seaRCh is a wonderful Thing'),'A');
                           ?column?                            
---------------------------------------------------------------
 'free':4A 'text':5A 'thing':10A 'search':3B,6A 'wonderful':9A
(1 row)

where we labelled the text sections of our document? Now we can do the following:

bob=# select ts_rank(
bob-#  array[0.1,0.1,0.9,0.1],
bob-#  setweight(to_tsvector('All about search'), 'B') || 
bob-#  setweight(to_tsvector('Free text seaRCh is a wonderful Thing'),'A'),
bob-#  to_tsquery('wonderful & search'));
 ts_rank  
----------
 0.328337
(1 row)

bob=# select ts_rank(
bob-#  array[0.1,0.1,0.1,0.9],
bob-#  setweight(to_tsvector('All about search'), 'B') ||
bob-#  setweight(to_tsvector('Free text seaRCh is a wonderful Thing'),'A'),
bob-#  to_tsquery('wonderful & search'));
 ts_rank  
----------
 0.907899
(1 row)

The array[0.1,0.1,0.9,0.1] which is passed as the initial argument to ts_rank() lists the weights in the order {D,C,B,A}. Since we labelled our sections A (for the body) and B (for the title), we first assigned B=0.9, A=0.1 and later B=0.1, A=0.9 in the statements above. The results of the ranking function differ accordingly. If not specified, the optional weights array defaults to {0.1, 0.2, 0.4, 1.0}.
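
For comparison, the same ranking query without an explicit weight array falls back to those defaults; a sketch only, with the output omitted:

-- With no weight array, ts_rank uses the defaults {D=0.1, C=0.2, B=0.4, A=1.0}
select ts_rank(
 setweight(to_tsvector('All about search'), 'B') ||
 setweight(to_tsvector('Free text seaRCh is a wonderful Thing'),'A'),
 to_tsquery('wonderful & search'));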

Indexing TSVectors: GiST and GIN

Until now, we have used only the select statement in our examples to demonstrate the matching between tsvector and tsquery, and the subsequent ranking capabilities of PostgreSQL. Moving forward, we will consider the use of indexes to speed up searches, and later tackle the real show-stopper which we have glossed over until now: why is to_tsquery() so pedantic, and how do we deal with that?

The two index types supported by PostgreSQL for full text search are the GiST index, which is based on hashed signatures, and the GIN index, which is based upon the B-tree.

GiST is really fast at creating indexes. It stores a hashed signature of the terms in the tsvector, and uses a hash of the terms in the tsquery to find candidate documents. Unfortunately, since hashes can collide, the search result is not exact: PostgreSQL has to check the results of the search, re-reading each of the articles located and double-checking the match before returning the result. For smaller sets of lexemes, and smaller databases, this works quite well. However, for really large datasets, the overhead of re-reading the data tends to make this approach quite slow.

GIN indexes are exact, so there is no implied overhead after a search using the index, but the downside to GIN is that index creation slows down (roughly logarithmically, thankfully not exponentially) as the number of entries grows. Whichever index method is chosen, the nature of the database and its size should therefore be taken into account.
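
By way of illustration only (this assumes the dedicated tsvector column tsv which is added to the pgweb table a little further on), the two index types are created with the same syntax, differing only in the access method; in practice one would pick one or the other, not both:

-- GiST: quick to build, but matches must be re-checked against the table rows
CREATE INDEX pgweb_tsv_gist ON pgweb USING gist(tsv);

-- GIN: slower to build, exact, and usually faster to query on large tables
CREATE INDEX pgweb_tsv_gin ON pgweb USING gin(tsv);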

There are two approaches to using an index. The first of these creates an index against a function of the data field, and the second approach is to store a tsvector in an additional field in the table; this tsvector field is then indexed. For example,

CREATE INDEX pgweb_idx ON pgweb 
  USING gin(to_tsvector('english', title || ' ' || body));

would create a GIN index on the combination of the title and body fields. This method is simple, but has an implied overhead: index creation has to do additional work, and every query must repeat the same expression in order to use the index. The second approach uses a trigger to populate the dedicated tsvector field whenever a record is inserted into or updated in the table:

ALTER TABLE pgweb ADD COLUMN tsv tsvector;
UPDATE pgweb SET tsv =
     to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
  ON pgweb FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

Now we simply add, update or remove table rows as normal, and the trigger populates the tsvector field in the table row, which in turn updates the index appropriately. Subsequent queries may be against the tsvector field, and that simplifies the query.
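
A typical search against the trigger-maintained column might then look like the following sketch (it assumes the pgweb table, its title column and the tsv column from above):

-- Find documents matching a boolean query against the indexed tsv column,
-- best matches first
SELECT title, ts_rank(tsv, query) AS rank
  FROM pgweb, to_tsquery('wonderful & thing') query
 WHERE tsv @@ query
 ORDER BY rank DESC
 LIMIT 10;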

Search result Markup

Before proceeding any further, we should mention a rather useful feature related to text search. PostgreSQL provides the ability to mark up text based upon the result of a text search. The function here is ts_headline(). This function returns the text with all matching words enclosed in <b></b> HTML tags. It is really easy to use:

bob=# select ts_headline('Free text seaRCh is a wonderful Thing',
        to_tsquery('wonderful & thing'));
                 ts_headline                     
-----------------------------------------------------
 Free text seaRCh is a <b>wonderful</b> <b>Thing</b>
(1 row)

To apply highlights to a matching abstract, one might issue the command:

select ts_headline(abstract, query)
  from pgweb, to_tsquery('wonderful & thing') query
  where query @@ tsv;

Phrase Search

Speaking of features, one allegedly missing feature which some people, particularly users of external text engines, continuously complain about PostgreSQL not having is phrase search. Here we refer to the apparent inability of PostgreSQL's full text search engine to match indexed literal phrases in the text.

Just stop to think about it for a moment. What exactly are we expecting here? How many literal phrases are there in a document of say 10 words? Let's assume for the moment that the words are single character words: a,b,c...j. All the possible phrases in this document are:

a,ab,abc,abcd... (10 phrases)

b,bc,bcd,bcde... ( 9 phrases)

etc...


Put differently, for a document consisting of (n) words, there are (n) phrases starting with the first word, n-1 starting with the second word, n-2 starting with the third word, and so on. This translates to

number of phrases = n + (n-1) + ... + 1 = n/2 * (n+1) = 10/2 * (10+1) = 55

phrases in a document with just 10 words. For a document of 100 words, we have 5050 phrases! Are we seriously suggesting a literal phrase index is a good thing? Remember, such an index would have to contain all words, and could not afford to eliminate redundancy from the text (e.g. stop words). Not really a practical approach.

How then, do external text search engines support index assisted literal phrase search? Why, in exactly the same way PostgreSQL does it! Quite simply, the algorithm goes:

  1. Turn the text phrase into a boolean search phrase, using the '&' operator

  2. Use the full text search engine to locate the matching documents

  3. For only those documents found, do a case-insensitive text match against the original text phrase, eliminating false positives

For PostgreSQL, case sensitive phrase matching is supported using the LIKE operator, and case insensitive matching through the ILIKE operator. Searching for the text


free text and 'Postgres text search'


the query would go something like this:

select title, ts_rank(tsv, query) 
  from pgweb, to_tsquery('free & text & postgres & search') query
    where tsv @@ query and body ilike '%Postgres text search%';

... which uses the FTS index to find articles with body text matching a query, and then for those items finds the intersection matching the literal text.

Simple, eh?

Parsing Human Query Strings

Not quite so simple. How, pray, is a poor developer supposed to turn the 'human' phrase above into the SQL statement that precedes it? This is a non-trivial task and, more's the pity, well beyond the abilities of some developers. Indeed, this is considered to be a 'show-stopper' for many people. It involves building a pre-parser which translates the natural language search phrase entered by the user into an SQL statement formatted similarly to the one above.

To end off this article, we include, with apologies to YACCers, the listing of a pre-parser which does precisely that. The general logic is generic enough to be translated to almost any language, but the listing shown here is in standard C++. It is a priority-biased recursive descent parser with no external dependencies. Feel free to use the code in any way you wish (hereby released under the BSD licence).

[ 'parser.cpp' requires either "g++" or "gcc -lstdc++" to compile cleanly. -- Ben ]



References

  1. http://en.wikipedia.org/wiki/Full_text_search

  2. http://en.wikipedia.org/wiki/Search_engines

  3. http://www.postgresql.org/docs/8.4/static/textsearch.html

  4. http://snowball.tartarus.org/

  5. http://www.google.com/patents

  6. Web Communities By Yanchun Zhang, Jeffrey Xu Yu, Jingyu Hou






[BIO]

Paul works as a Software Architect and Technology Officer for a financial information vendor. After abandoning a career in nuclear chemistry, during which he developed an interest in hardware and software, he joined a firm of brokers as a developer. He first started using Linux in 1994 with version 1.18 of Slackware. His first passion will always be writing software. Other interests are composing music, playing computer games, Neural network simulations and reading.


Copyright © 2009, Paul Sephton. Released under the Open Publication License unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 164 of Linux Gazette, July 2009
