Although my colleagues here on Tilburg University may think that I spend my time fiddling with Linux on a PC that could be put to better uses, they are wrong. The 'fiddling with Linux' I do at home; at my work I only do the bare minimum necessary to keep Linux fed and happy. As most readers of this journal know, this involves making the occasional backup and for the rest: nothing.
When I sit in front of my PC, I work (well, mostly). Linux makes it possible to do my work with a minimum of fuss and a big part of the credit for this goes to Jacques Gelinas, the man who wrote Umsdos: a layer between the Unix operating system and the vanilla MS-DOS 8+3 FAT system. This makes it possible to access the DOS-partition of my hard disk from either operating system. This is good news, because I am totally dependent from two programs: SMART, an indexing and retrieval system and SPSS for Windows to twiddle the data I obtain form SMART. SMART only runs under Unix (and not all Unixes for that matter) and SPSS4Windows, obviously, runs under MS-Windows and whatever the virtues of this operating system may be, you emphatically do not want to use it in any kind of experimental environment.
I suppose that SPSS (Statistical Package for the Social Sciences) will be familiar to most Linux users. If not: SPSS is just what it says, a statistical package but not only for the 'social sciences' but for about everyone who needs statistical analysis of his data. SMART, however, is an indexing and retrieval program for text. What is more: it does not just index the words, it also adds weights to them. It also allows the user to compare the indexed documents in the so-called Vector Space Model and to compute the distances between the documents, or between documents and queries. To understand why this is special we must delve a bit in the typical problems of Information Retrieval, i.e. the storage of books, articles etcetera and the retrieval of those on content.
When at the end of the sixties automatic indexing of texts became a
viable option many people thought that the problems of information
retrieval were solved. Programs like STAIRS (IBM,1972)
enabled the users to file and rapidly retrieve documents on any word
in the text or on boolean combinations (AND, OR, NOT) of those words
and who could ask for more? Then, in 1985 a famous article was
published by two researchers in the field . In
this article they reported on the performance of STAIRS in real life
and they showed that the efficiency of STAIRS and similar systems was,
in fact, much lower than assumed. Even experienced users could not obtain a
recall of more than 20-40% of the relevant documents in a database of
100,000 documents, and worse, they were not aware of the fact.
The problem with all retrieval systems of this type is that human language is so fuzzy. There may be as much as a dozen different terms and words pointing to one and the same object, whereas one word may have widely different meanings. In Information Retrieval this will lead to one of two situations. Either you try to obtain a high precision, when almost all the retrieved documents are relevant (but an unknown number of other relevant documents are not included) or you go for high recall, but then a lot of irrelevant documents will be included in the result. When in a retrieved set of documents the proportion of irrelevant documents is high, the user will probably stop looking at the documents before he or she has found all the relevant ones: in fact his futility-point has been reached. In such a case the net result is equal to the situation in which those relevant records that would be presented after the user reached that futility-point were not retrieved. Therefore the concept of ranking, i.e. the ordering of retrieved documents on relevance, is very important in Information Retrieval.
Modern (and not so modern) research has offered a number of possible solutions to this dilemma. Some of those solutions use the concept of weighted keywords. This means that every keyword-document combination has a weight attached that (hopefully) is an indication of the relevance of that particular keyword for that particular document. SMART does just that: it creates indexes for a database of documents and attaches weights to them. The way that happens may be expressed intuitively as 'the more a word occurs in the less documents, the higher the weight'. Or, if the word 'dog' occurs twenty times in a given document, but in no other documents, you may be relatively certain that this document is about dogs. Information Retrieval addicts like me talk about the tf.idf weight.
Smart offers several options as to how that weight should be arrived at: I generally prefer the so-called atc-variation, because it adjusts for the length of the individual documents.
It calculates the tf.idf in three steps. The first step creates the value for the term-frequency (tf) as
where is the term with the highest frequency in the document. This adjusts for the document-length and the number of terms. Then the weight is calculated as
where N is as before the number of documents and F the document frequency of term t (the number of documents in which term t occurs). Finally the cosine normalization is applied by
where T is the number of terms in the document vector. Now we have a number between zero and one that hopefully correlates with the importance of the word as a keyword for that document. For a detailed discussion of these and similar techniques see e.g. Salton and McGill (). You will love it!
This is not all. When SMART has constructed the index in one of the various ways available, it also can retrieve the documents for you. This is done according to something called ``the Vector Space Model''. This model is best explained using a three-dimensional example of a vector-space; you can add another few thousand dimensions in your own imagination.
Imagine you want to index your documents according to three keywords 'cat', 'dog' and 'horse'; keywords that may or may not occur in your documents. So you draw three axes to get a normal three-dimensional coordinate system. One dimension can be used to indicate the ``cat-ness'' of every document, the other its ``dog-ness'' and the third the ``lion-ness''. To make things easy we only use binary values 0 and 1, although SMART can cope with floats (the 'weights' mentioned before. So if a document is about cats, it scores a one on the corresponding axis, otherwise it scores a zero. Any document may now be drawn in that space according to the occurrence of one or more of the keywords and now we have a relatively easy way to compute the difference between those document. Moreover a query consisting of one or more of the keywords can be drawn in the same space and the documents can be ranked according to the distance to that query. Of course a typical document database has thousands of keywords and accordingly thousands of dimensions, but the arithmetics involved in multi-dimensional distances do not matter much to modern computers, and if they bother you, you just have to smoke something illegal and matters will rapidly become clear. If only till the next morning.
So SMART accepts queries, ranks the documents according to the ``nearness'' to that query and return them to you in that order. Therefore it is still one of the best retrieval systems that are ever written although it lacks the bells and whistles of its more expensive counterparts in some operating systems I could mention. And although it is not really optimized for speed, it runs typically 10-30 times faster than the fastest indexing program I ever saw under MS-Windows.
But I am not using SMART for bread-and-butter retrieval, but for the weights it computes and the indexes it creates. At this point I want to do some other manipulations of these data and again I have to offer my thanks to the developers of unix in general and to Linux in particular. A whole string of ever more complicated and sophisticated shell scripts, the standard unix tools and a few of My Very Own utilities suffice to process the SMART output to a file that is ready for importing in SPSS.
Nevertheless now I have to quit Linux and boot MS-DOS, start
MS-Windows and finally enter SPSS to do the statistics and create some
graphs. I am a newcomer to Unix (indeed it was the fact that Linux
offered a way to use SMART that pulled me over the line two years
ago), but already I am wondering how people can live in the stifling
atmosphere of MS-Windows. The fact that you can't really run two
applications at the same time is not even the worst thing. But who is
responsible for the idea that Icons and Popups were better and more
efficient than the plain old command line? And what happened to pipes
and filters? And a sensible command language? Be that as it may, SPSS
gets the job done and when the output is written to disk I immediately
escape back to Linux to write the final article, report or whatever
On this point I have two messages: one is good, the other bad. I'll
start with the good news. SMART is obtainable by anonymous ftp from
Cornell University and may be used for free for scientific and
experimental purposes. Better yet: it compiles under Linux without
much tweaking and twiddling. Also there exists a fairly active
mailing list for people who use SMART (smart-peoplecs.cornell.edu).
The bad news: the manual. What manual? SMART is not for the faint of
heart; after unpacking and compilation you'll find some extremely
obscure notes and examples and that is it. Nevertheless, if you have
more than just a few megabytes of text to manage AND the stamina to
learn SMART, it certainly is the best solution for your information
retrieval needs. But don't I wish somebody would write a comprehensive
manual! In the meantime you may perhaps be helped by my ``tutorial for
newbees'', to be found at