VM/ESA Site Search Engine

Brian Wade
IBM, VM Development
April 1999

When I noticed our colleagues at www.s390.ibm.com putting together a search facility for their site, I decided to see if I could build a VM-based one for www.vm.ibm.com. Doing so would satisfy a long-standing requirement against our site. I ended up building a search facility based on the RSK, and it works pretty well. This web page is kind of a diary that explains what I did.

To explain the solution, let me first give you a couple of sentences about how the site is organized. The site's files are stored in SFS in filecontrol directories. The root of the site is SFS directory VMHOME:EWEBADM.VMPAGE, and all of the site's directories reside under that. We maintain our HTML in EBCDIC in files we edit with XEDIT. It's pretty simple, actually.

The first (naive) idea I had for searching the site really was to search the site each time a request came in. I spent about an hour hacking together a CGI program that used LISTDIR and LISTFILE to walk the SFS tree, using Pipelines to look for the user's targets. It worked but it was miserably slow. No wonder! All that file I/O. Yuck.

Having just shipped the RSK, I thought about its enrollment API and how the enrollment API is nothing more than an indexed access method that keeps all the records in a data space. When I built the RSK I knew that the enrollment API could be used for very large numbers of indexed records, but I had always thought of it in terms of directories, such as phone books. It turns out that the enrollment API is pretty good as a dictionary holder, too. Here is how I exploited enrollment sets to make the RSK one heck of a Web search server.

First I wrote an exec that walks the site's SFS tree and generates a set of flat CMS files that together house the relationship between URLs on our site and the words they contain. The exec, KWDS, places several files in each of the site's directories. One file in each directory, CMS filename "INDEX IX-FILES", keeps track of the mapping between .HTM and .HTML files in the directory and their corresponding "words" files (filetype "IX-WORDS"). In a given IX-WORDS file, KWDS places every word and every word prefix found in the corresponding .HTM or .HTML file, one per record. The files are F format so that I can read them in huge chunks later. So, if your .HTML file contained the word "Endicott", its corresponding IX-WORDS file would note that "E", "EN", "END", "ENDI", and so on were all present in your HTML file.
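
The prefix expansion that KWDS performs can be sketched in a few lines of Python. This is not the KWDS EXEC itself, just an illustration of the idea. The function name word_prefixes, the crude tag stripping, and the rule that a word is a run of letters and digits are all assumptions made for the sketch.

```python
import re


def word_prefixes(html_text):
    """Return the sorted, uppercased words and word prefixes in html_text."""
    # Crude tag removal, good enough for an illustration only.
    text = re.sub(r"<[^>]*>", " ", html_text)
    found = set()
    # Assumption: a "word" is a run of letters and digits.
    for word in re.findall(r"[A-Za-z0-9]+", text):
        word = word.upper()
        # Record every leading substring: E, EN, END, ...
        for end in range(1, len(word) + 1):
            found.add(word[:end])
    return sorted(found)


if __name__ == "__main__":
    # One word or word prefix per output line, like one per record.
    for entry in word_prefixes("<p>Endicott</p>"):
        print(entry)
```

For a page containing only "Endicott", this yields the eight entries from "E" through "ENDICOTT". It also shows why a single page can contribute thousands of records.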

KWDS EXEC took a couple of hours to write, and it took about an hour and fifteen minutes on our 9672-R61 (9345 DASD) to index the 1600 files in our site's 200 SFS directories. Some files had as many as 24,000 different words or word prefixes in them, but the average was more like 1500 or 2000. Note that because KWDS is able to tell whether an IX-WORDS file is obsolete, subsequent runs take just a few minutes. Our site runs KWDS automatically every Sunday night, so as to pick up the previous week's changes.

Next I wrote an RSK program that does two things:

  • One thing this program can do is walk the site's SFS tree, reading all the IX-FILES and IX-WORDS files and thereby accruing knowledge about the relationship between URLs and words. This knowledge is stored in an RSK enrollment set, indexed by word. Continuing the above example, if "/devpages/bkw/personal.html" were a page that contained "Endicott", the RSK enrollment records whose indices were "E", "EN", "END", "ENDI", and so on would all mention /devpages/bkw/personal.html as being a page containing that word. In fact, as I am sure you realize by now, the RSK enrollment record whose index is "END" nominates EVERY page on our site that contains "END" at the beginning of a word: "Endicott", "endless", "endearing", "ending", and so on. In this way, the URLs containing words starting with "END" can be determined immediately, via one call to ssEnrollRecordGet.

    When I tried to build this index, one implementation constraint I found was that RSK enrollment records, even though they can each be 16 MB long, are too small to cite the URLs directly. I had to number the URLs as I encountered them and then refer to them by number in the keyword index. Continuing the example, the "END" record nominates URLs "1", "17", "48", etc., and then a second RSK enrollment set decodes the URL numbers to actual URLs. Because the numbers are compact (16 MB spread over 4 million entries works out to about 4 bytes per URL number), one record can hold about 4 million URLs for each keyword. We could program around this limit if necessary.

    The RSK program takes about 11 minutes to read the IX-FILES and IX-WORDS data from our site's 200 SFS directories. The IX-WORDS files together contain about 1.3 million words or word prefixes, and about 210,000 of them are unique. This means that in those 11 minutes, the RSK reads about 1.3 million records from SFS. Each SFS record represents one word-URL pair to be added to the index, so in those 11 minutes the RSK also performs about 2.6 million transactions against the word index -- 1.3 million calls to ssEnrollRecordGet and 1.3 million associated calls to ssEnrollRecordInsert. When this process completes, the keyword index contains about 210,000 records, each one nominating a list of URLs.

    While the RSK program is accruing a new index, it keeps the current index online and continues to handle search requests using the current index. When it finishes accruing the new index, the RSK program puts the new index into service and then discards the previous index. Thus, after the very first index is loaded, searching continues uninterrupted, even while a new index is coming online. Each Sunday night, after the site rescan completes, the RSK program is instructed to reload its index.

  • The other thing the RSK program can do is answer simple AND and OR questions about desired keywords. For example, it can answer, "Which pages contain the words 'VM/ESA' and 'Endicott'?" The RSK program does this by performing one lookup for each search term and then logically combining the results. After the results are combined, the URL numbers are decoded and the list of URLs is sent back to the client.
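
The two jobs described above can be sketched together in Python. This is an in-memory illustration, not the RSK program: plain dictionaries stand in for the two enrollment sets, and the names KeywordIndex, add_word, search, and build_demo_index, along with the page /example/endless.html, are hypothetical.

```python
def prefixes(word):
    """All leading substrings of word, uppercased: END -> E, EN, END."""
    word = word.upper()
    return [word[:end] for end in range(1, len(word) + 1)]


class KeywordIndex:
    """In-memory stand-in for the two enrollment sets (an assumption)."""

    def __init__(self):
        self.number_of = {}   # URL -> URL number
        self.url_of = {}      # URL number -> URL (the decoding set)
        self.postings = {}    # word or prefix -> set of URL numbers

    def url_number(self, url):
        """Number URLs in the order they are first encountered, from 1."""
        if url not in self.number_of:
            number = len(self.number_of) + 1
            self.number_of[url] = number
            self.url_of[number] = url
        return self.number_of[url]

    def add_word(self, word, url):
        """Record url under the word and under each of its prefixes."""
        number = self.url_number(url)
        for prefix in prefixes(word):
            self.postings.setdefault(prefix, set()).add(number)

    def search(self, terms, mode="AND"):
        """One lookup per term, combine the sets, then decode the numbers."""
        hits = [self.postings.get(term.upper(), set()) for term in terms]
        if not hits:
            return []
        combine = set.intersection if mode == "AND" else set.union
        return [self.url_of[n] for n in sorted(combine(*hits))]


def build_demo_index():
    """Tiny hypothetical site; only the first URL comes from the text."""
    index = KeywordIndex()
    pages = {
        "/devpages/bkw/personal.html": ["Endicott", "VM/ESA"],
        "/example/endless.html": ["endless", "VM/ESA"],
    }
    for url, words in pages.items():
        for word in words:
            index.add_word(word, url)
    return index


if __name__ == "__main__":
    index = build_demo_index()
    print(index.search(["END"]))                         # both pages
    print(index.search(["VM/ESA", "Endicott"], "AND"))   # personal.html only
    print(index.search(["Endicott", "endless"], "OR"))   # both pages
```

Storing small integers in the keyword records and decoding them only at the end mirrors the URL-numbering workaround described above. It also keeps the AND and OR steps down to plain set operations.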

The rest of the pieces of the puzzle are pretty simple. There is an HTML form, /search/, that lets you type in some keywords and select AND or OR. That form is backed by a CGI program that gathers the form content and asks the RSK-based server the appropriate questions. The CGI gathers up the RSK's responses, marks them up with HTML, and sends the answer to the browser. Believe it or not, the interface between the CGI and the RSK-based server is the CP MSG command. One CP MSG from the CGI results in a number of answers from the RSK-based server. The CGI traps these responses using Pipelines' starmsg device driver. I chose CP MSG on purpose because it has very little setup time for the CGI and none for the RSK-based server.

One thing the CGI program does that is kind of cool is that it ranks the RSK's responses so that the most likely pages are placed first on your browser screen. This ranking is done only for OR-type searches. If you ask our server an OR question, the CGI gets the result for you by issuing several AND queries to the RSK server and then merging the results. We do this so that our searcher can tell you about pages containing ALL of your targets before it tells you about pages containing only SOME of your targets.

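This handling of OR-type searches is best explained by giving an example. Suppose you want to find all the pages that contain "birthday", "brian", or "forum". The CGI will ask the RSK a series of AND questions, starting with the one that names all three words and then moving on to questions that name fewer of them.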

After this, it will merge the results, displaying a given URL only at its highest matching level.
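
The ranking can be sketched in Python as follows. The exact list of questions the CGI asks is not reproduced here, so this sketch assumes it is every non-empty combination of the search terms, largest combinations first. The function names and the sample pages are hypothetical, and and_query stands in for one AND request to the RSK-based server.

```python
from itertools import combinations


def and_questions(terms):
    """Every non-empty combination of terms, largest combinations first."""
    questions = []
    for level in range(len(terms), 0, -1):
        questions.extend(combinations(terms, level))
    return questions


def ranked_or_search(terms, and_query):
    """Return (match_level, url) pairs, best matches first.

    and_query(list_of_terms) must return the URLs containing all the terms.
    Each URL is reported once, at the highest level at which it matched.
    """
    seen = set()
    ranked = []
    for question in and_questions(terms):
        for url in and_query(list(question)):
            if url not in seen:
                seen.add(url)
                ranked.append((len(question), url))
    return ranked


if __name__ == "__main__":
    # Hypothetical pages and the words they contain.
    pages = {
        "/a.html": {"birthday", "brian", "forum"},
        "/b.html": {"brian", "forum"},
        "/c.html": {"forum"},
    }

    def and_query(terms):
        return sorted(u for u, words in pages.items() if set(terms) <= words)

    for level, url in ranked_or_search(["birthday", "brian", "forum"], and_query):
        print(level, url)
```

With three terms this asks seven AND questions. That is cheap when each question costs only a few index lookups, but the number of questions doubles with every added term.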

That's about all there is to it. None of this is rocket science. Having the RSK did speed the development of this search facility. It took me about 5 hours to write the RSK-based index server. The CGI and HTML form took an hour or two. So, in about two days, this search facility was finished, and our content developers didn't have to do a thing to have their pages indexed. Viva VM!

To learn more about our search tool, visit its description, or read the README file, or download the package itself.