VM/ESA Site Search Engine
Brian Wade
IBM, VM Development
April 1999
When I noticed our colleagues at www.s390.ibm.com putting together
a search facility for their site, I decided to see if I could build
a VM-based one for www.vm.ibm.com. Doing so would satisfy a
long-standing requirement against our site. I ended up building
a search facility based on the RSK, and it works pretty well.
This web page is kind of a diary that explains what I did.
To explain the solution, let me first give you a couple of
sentences about how the site is organized. The site's files
are stored in SFS in filecontrol directories. The root of the
site is SFS directory VMHOME:EWEBADM.VMPAGE, and all of the site's
directories reside under that. We maintain our HTML in
EBCDIC in files we edit with XEDIT. It's pretty simple,
actually.
The first (naive) idea I had for searching the site was to do
just that: search the site's files each time a request came in.
I spent about an hour hacking together a CGI program that
used LISTDIR and LISTFILE to walk the SFS tree, using Pipelines
to look for the user's targets. It worked but it was miserably
slow. No wonder! All that file I/O. Yuck.
Having just shipped the RSK, I thought about its enrollment API
and how the enrollment API is nothing more than an indexed
access method that keeps all the records in a data space.
When I built the RSK I knew that the enrollment API could be
used for very large numbers of indexed records, but I had always
thought of it in terms of directories, such as phone books.
It turns out that the enrollment API is pretty good as a
dictionary holder, too. Here is how I exploited enrollment
sets to make the RSK one heck of a Web search server.
First I wrote an exec that walks the site's SFS tree and
generates a set of flat CMS files that together house the
relationship between URLs on our site and the words they contain.
The exec, KWDS, places several files in each of the site's
directories. One file in each directory, CMS filename
"INDEX IX-FILES", keeps track of the mapping between .HTM and
.HTML files in the directory and their corresponding "words"
files (filetype "IX-WORDS").
In a given IX-WORDS file, KWDS places every word and every word
prefix found in the corresponding .HTM or .HTML file (one word or
word prefix per record, F format so I can read the file in huge
chunks later). So, if your .HTML file contained the word
"Endicott", its corresponding IX-WORDS file would note that "E",
"EN", "END", "ENDI", and so on, up to the full word "ENDICOTT",
were all present in your HTML file.
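The prefix expansion can be sketched in a few lines of Python. This is only an illustration, not KWDS itself (which is a CMS exec): the function name word_prefixes is hypothetical, and the tokenizing rule (runs of letters and digits, folded to uppercase) is an assumption, since the rule KWDS actually uses is not described here.

```python
import re

def word_prefixes(text):
    """Return the sorted set of uppercased word prefixes found in text.

    Assumption: a "word" is a run of letters and digits. The real
    KWDS tokenizing rule is not documented in this article.
    """
    prefixes = set()
    for word in re.findall(r"[A-Za-z0-9]+", text):
        word = word.upper()
        # Emit every leading substring: E, EN, END, ... up to the whole word.
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return sorted(prefixes)

if __name__ == "__main__":
    # One prefix per output line, like one prefix per IX-WORDS record.
    for prefix in word_prefixes("Endicott"):
        print(prefix)
```

Because the prefixes are collected in a set, a page that mentions "Endicott" and "endless" contributes "E", "EN", and "END" only once each.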
KWDS EXEC took a couple of hours to write, and it took about
an hour and fifteen minutes on our 9672-R61 (9345 DASD)
to index the 1600 files in our site's 200 SFS
directories. Some files had as many as 24,000 different
words or word prefixes in them, but the average was more
like 1500 or 2000.
Note that because KWDS is able to tell whether an IX-WORDS
file is obsolete, subsequent runs take just a few minutes.
Our site runs KWDS automatically every Sunday night, so
as to pick up the previous week's changes.
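How KWDS decides that an IX-WORDS file is obsolete is not spelled out above. One common way to get this kind of incremental behavior is to compare modification times, as a make-style build would. The Python sketch below shows that idea only; the function name and the timestamp rule are assumptions, not a description of the real exec.

```python
import os

def words_file_is_stale(html_path, words_path):
    """Return True if the words file must be regenerated.

    Assumed rule (not taken from KWDS): the words file is stale when
    it is missing or older than the HTML file it was derived from.
    """
    if not os.path.exists(words_path):
        return True
    return os.path.getmtime(words_path) < os.path.getmtime(html_path)
```

With a check like this, a weekly run only has to re-index the pages that changed since the previous run.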
Next I wrote an RSK program that does two things:
- One thing this program can do is walk the site's SFS tree,
reading all the IX-FILES and IX-WORDS files and thereby
accruing knowledge about the relationship between
URLs and words. This knowledge is stored in an RSK enrollment
set, indexed by word.
Continuing the above example, if "/devpages/bkw/personal.html"
were a page that contained "Endicott", the RSK enrollment records
whose indices were "E", "EN", "END", "ENDI", and so on would all
mention /devpages/bkw/personal.html as being a page containing
that word. In fact, as I am sure you realize by now, the RSK
enrollment record whose index is "END" nominates EVERY page on
our site that contains "END" at the beginning of a word:
"Endicott", "endless", "endearing", "ending", and so on.
In this way, the URLs containing words starting with "END"
can be determined immediately, via one call to ssEnrollRecordGet.
When I tried to build this index, one implementation constraint
I found was that RSK enrollment records, even though they can
each be 16 MB long, are too small to cite the URLs directly.
I had to number the URLs as I encountered them and then refer to
them by number in the keyword index. Continuing the example, the
"END" record nominates URLs "1", "17", "48", etc., and then a
second RSK enrollment set decodes the URL numbers to actual URLs.
This means that the RSK can hold about 4 million URLs for each
keyword (a 16 MB record filled with URL numbers of, presumably,
4 bytes each). We could program around this limit if necessary.
The RSK program takes about 11 minutes to read the IX-FILES
and IX-WORDS data from our site's 200 SFS directories. The
IX-WORDS files together contain about 1.3 million words or word
prefixes, and about 210,000 of them are unique. This means that
in those 11 minutes, the RSK reads about 1.3 million records
from SFS. Each SFS record represents one word-URL pair to be
added to the index, so in those 11 minutes the RSK also performs
about 2.6 million transactions against the word index -- 1.3
million calls to ssEnrollRecordGet and 1.3 million associated
calls to ssEnrollRecordInsert. When this process completes, the
keyword index contains about 210,000 records, each one
nominating a list of URLs.
While the RSK program is accruing a new index, it keeps the
current index online and continues to handle search requests
using the current index. When it finishes accruing the new
index, the RSK program puts the new index into service and then
discards the previous index. Thus, after the very first index
is loaded, searching continues uninterrupted, even while a new
index is coming online. Each Sunday night, after the site rescan
completes, the RSK program is instructed to reload its index.
- The other thing the RSK program can do is answer simple AND
and OR questions about desired keywords.
For example, it can answer, "Which pages contain the words
'VM/ESA' and 'Endicott'?" The RSK program does this by performing
one lookup for each search term and then logically combining
the results. After the results are combined, the URL numbers
are decoded and the list of URLs is sent back to the client.
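The two jobs described above (building a word index that refers to URLs by number, then answering AND and OR questions from it) can be modeled in memory with a short Python sketch. It is a stand-in for illustration only: plain dictionaries and sets take the place of the two RSK enrollment sets, the class and method names are hypothetical, and none of this is the RSK API. The URL "/example/other.html" is made up for the demonstration.

```python
class KeywordIndex:
    """In-memory stand-in for the word index and the URL-number decoder."""

    def __init__(self):
        self.url_numbers = {}  # URL -> number, assigned on first sight
        self.urls = []         # number -> URL (plays the "decoder" set)
        self.postings = {}     # word or prefix -> set of URL numbers

    def add(self, url, word):
        """Record one word-URL pair, numbering the URL if it is new."""
        number = self.url_numbers.get(url)
        if number is None:
            number = len(self.urls)
            self.url_numbers[url] = number
            self.urls.append(url)
        # Fetch the word's record (or start one) and add the URL number,
        # loosely mirroring the get-then-insert pair done per record.
        self.postings.setdefault(word.upper(), set()).add(number)

    def query(self, mode, terms):
        """Answer an AND or OR question; return a sorted list of URLs."""
        results = [self.postings.get(t.upper(), set()) for t in terms]
        if not results:
            return []
        if mode == "AND":
            combined = set.intersection(*results)
        elif mode == "OR":
            combined = set.union(*results)
        else:
            raise ValueError("mode must be AND or OR")
        # Decode URL numbers back to URLs only after combining.
        return sorted(self.urls[n] for n in combined)


if __name__ == "__main__":
    index = KeywordIndex()
    pairs = [
        ("/devpages/bkw/personal.html", "END"),
        ("/devpages/bkw/personal.html", "VM"),
        ("/example/other.html", "END"),  # hypothetical page
    ]
    for url, word in pairs:
        index.add(url, word)
    print(index.query("AND", ["end", "vm"]))
    print(index.query("OR", ["end", "vm"]))
```

In a model like this, the uninterrupted reload could be had by building a second KeywordIndex on the side and rebinding the variable that queries use once the new index is complete.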
The rest of the pieces of the puzzle are pretty simple. There
is an HTML form, /search/, that lets
you type in some keywords and select AND or OR. That form
is backed by a CGI program that gathers the form content and
asks the RSK-based server the appropriate questions. The
CGI gathers up the RSK's responses, marks them up with HTML,
and sends the answer to the browser. Believe it or not,
the interface between the CGI and the RSK-based server is the
CP MSG command. One CP MSG from the CGI results in a number of
answers from the RSK-based server. The CGI traps these responses
using Pipelines' starmsg device driver.
I chose CP MSG on purpose because it has very little setup
time for the CGI and none for the RSK-based server.
One thing the CGI program does that is kind of cool is that
it ranks the RSK's responses so that the pages most likely to
be relevant are placed first on your browser screen. This
ranking is done only for OR-type searches. If you ask our server
an OR question, the CGI gets the result for you by issuing
several AND queries to the RSK server and then merging the
results. We do this so that our searcher can tell you about
pages containing ALL of your targets before it tells you about
pages containing only SOME of your targets.
This handling of OR-type searches is best explained by
giving an example. Suppose you want to find all the pages
that contain "birthday", "brian", or "forum". The CGI
will ask the RSK this list of questions:
AND BIRTHDAY BRIAN FORUM
AND BIRTHDAY BRIAN
AND BIRTHDAY FORUM
AND BRIAN FORUM
AND BIRTHDAY
AND BRIAN
AND FORUM
After this, it will merge the results, displaying a given
URL only at its highest matching level.
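The question list above is every non-empty combination of the search terms, largest combinations first. Here is a Python sketch of how such a list could be generated and merged. The function names are hypothetical, and ask stands for whatever sends one question to the index server and returns its URLs; the real CGI does that with CP MSG, which is not modeled here.

```python
from itertools import combinations

def ranked_and_queries(terms):
    """List AND questions from the most terms to the fewest."""
    terms = [t.upper() for t in terms]
    queries = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            queries.append("AND " + " ".join(combo))
    return queries

def merge_ranked(queries, ask):
    """Ask each question in order; keep each URL at its first level.

    ask(query) is an assumed callable returning a list of URLs.
    """
    seen = set()
    ranked = []
    for query in queries:
        for url in ask(query):
            if url not in seen:
                seen.add(url)
                ranked.append((query, url))
    return ranked

if __name__ == "__main__":
    for question in ranked_and_queries(["birthday", "brian", "forum"]):
        print(question)
```

For n search terms this approach issues 2^n - 1 questions (7 for three terms, 15 for four), so it suits the short keyword lists people type into a search form.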
That's about all there is to it. None of this is rocket science.
Having the RSK did speed the development of this search facility.
It took me about 5 hours to write the RSK-based index server.
The CGI and HTML form took an hour or two. So, in about two
days, this search facility was finished, and our content
developers didn't have to do a thing to have their pages
indexed. Viva VM!
To learn more about our search tool, visit
its description,
or read the README file,
or download the package itself.