(C) Copyright 1999 International Business Machines Corporation
All Rights Reserved

Instructions and Information About the www.vm.ibm.com Searcher

This package contains the searcher we use to index and search the web
pages for www.vm.ibm.com.

The basic idea is that our site is completely contained in an SFS
directory tree. We have a scanner (an exec) that walks the tree to
detect keywords and an RSK-based module that reads the scanner's output
into memory and services lookup requests sent by a simple CGI program.

The searcher is NOT generalized, so you will have to modify it to get it
to work for your site. To modify it you will need to have Rexx,
Pipelines, CGI, and HTML skills.

There are three main parts to the searcher:

1. The Scanner

The scanner -- KWDS EXEC -- is comprised of these pieces:

   KWDS EXEC
   KWDS REXX
   TAGS REXX
   BB CONFIG

This scanner is able to scan the SFS directory tree containing our site
and build a set of CMS files describing the correspondence between
keywords and URLs in which the keywords reside.

To get the scanner to work for you, you will have to:

   a. In KWDS EXEC, change variable 'sitesfsroot' to nominate the root
      of your site.

   b. Create a directory /include/ in your site's directory tree and
      place file BB CONFIG in that directory, OR place BB CONFIG
      somewhere else and modify KWDS EXEC so it will be able to find
      BB CONFIG in the place you put it.

   c. In file BB CONFIG, modify the EXCLUDE clauses at the end of the
      file to record those directories that should be excluded from
      keyword scans. (You can get rid of the rest of the junk therein -
      it is there for another exec we run at our site.)

You must run the scanner in a userid that has write authority to your
entire site. Perhaps a filepool administrator userid is a good choice
for you - that's what we do here.

KWDS EXEC indexes only the part of your file that is between the ""
and "" markers therein.
(Each file on our site contains a lot of boilerplate and the "real
content" is between those HTML comments.) You might want to change
KWDS EXEC to operate differently.

The scanner leaves output files throughout your site. The files it
leaves are:

   INDEX IX-DIRS (left only in root)
      This file names the directories the scanner searched. It is read
      later by the RSK search server.

   INDEX IX-FILES (one in each directory)
      This file names the HTML and HTM files that were scanned and
      records a little data about each scanned file. The most important
      thing recorded herein is the correspondence between scanned files
      and scanner output files.

   nnnnnnnn IX-WORDS (one for each HTM or HTML file scanned)
      This file contains the words found in the scan of one HTM or HTML
      file. The HTM/HTML file this IX-WORDS file corresponds to is
      recorded in INDEX IX-FILES.

2. The Server

The server is comprised of the following files:

   IXSERV MODULE
   PROFILE $EXEC    (rename to PROFILE EXEC to activate)
   PROFILE RSK
   IXSERV BKWUMAP
   IXSERV BKWSGP
   BKWRTE MODULE    (this is actually an RSK part)
   BKWUME TEXT      (this is actually an RSK part)

You will need to create a userid to run IXSERV MODULE disconnected.
This userid will need to have the following statements in its CP
directory entry:

   XCONFIG ADDRSPACE MAXNUMBER 32 TOTSIZE 64G SHARE
   IUCV *MSG MSGLIMIT 65535
   MACHINE XC

The rest is pretty much up to you. Name your userid IXSERV if possible
(if not possible, you must reconfigure file SEARCH CGI - see below).

IXSERV MODULE attempts to allocate some pretty large memory buffers
(4 MB). I would recommend a minimum of a 48 MB virtual machine. Maybe
even 64 MB will be required, depending on what else is going on
(nucleus extensions, for example). If you get an 801 error from
ssMemoryAllocate, make the virtual machine bigger.

Probably the easiest thing for you to do is to put all of the
above-named files on the index server's A-disk.
Then customize the files, as follows:

   PROFILE EXEC    tinker with as desired
   PROFILE RSK     change variable 'sitesfsroot' to nominate the root
                   of your web site

IPL the server machine and issue command "IXSERV" to start the server.
DO THIS ONLY AFTER YOU HAVE RUN THE SCANNER AT LEAST ONCE.

For a complete list of the RSK commands you can type at the IXSERV
console, see the RSK Programmer's Guide. However, here is a very
abbreviated list of commands you might find interesting.

   - CP cmdstring  - issues CP command, writes results to console

   - CMS cmdstring - issues CMS command, writes results to console

   - ENROLL LIST   - shows you some statistics about the keywords and
                     URLs your server has indexed. The "Entries" column
                     is the column of interest. The number of entries
                     in the Uxxxxxxx set is the number of URLs you have
                     indexed. The number of entries in the Kxxxxxxx set
                     is the number of different (distinct) keywords in
                     your index.

3. The CGI

The following files comprise the user interface to your searcher:

   INDEX HTML
   SEARCH CGI
   SEARCH FORM
   GENSRCH REXX
   VMHOME HEADER
   VMHOME TRAILER

To install:

   a. Create a directory /search/ on your site and dump the
      above-mentioned files into it.

   b. Edit INDEX HTML to remove everything EXCEPT what's between ""
      and "". Supply your own HTML at top and bottom instead.

   c. If you want to tinker with the search form's appearance, do so in
      INDEX HTML, making the corresponding changes in SEARCH FORM.

   d. In SEARCH CGI, change the Rexx constants at the top of the exec
      to nominate your own SFS site root, name of search machine, and
      so on.

   e. Tinker with files VMHOME HEADER and VMHOME TRAILER so that
      SEARCH CGI builds HTML to your liking. (Look at label
      "answerbrowser" in SEARCH CGI to see what it does with the header
      and trailer. Adjust the header and trailer files accordingly.)

   f. Make sure your HTTP server machines all have this statement in
      their CP directory entries:

         IUCV *MSG MSGLIMIT 65535

File SEARCH CGI was written for use with EnterpriseWeb/VM.
If you are using Velocity Software's ESAWEB, you might find
SEARCH ESAWEB to be a suitable replacement for SEARCH CGI -- it was
supplied by James Weissman of Velocity Software, Inc. and you should
direct all questions to him: james@velocity-software.com. If you are
using some other HTTP server you will have to customize SEARCH CGI to
work with your server.

3a. Use of TCP

(This section used to be called "use of UDP", but the UDP stuff never
did work quite right, so I deleted it and wrote TCP support instead.)

Here is how you can make SEARCH CGI talk to IXSERV over TCP.

   a. Set up the TCP stack to be ready for IXSERV to use port 85 for
      TCP (changes in PROFILE TCPIP). You could pick a different port
      number if you want, I suppose.

   b. Add the following commands to PROFILE RSK:

         CONFIG NOMAP_TCP ON
         SUBCOM START TCP
         TCP START IXFIND 85 50 0.0.0.0 TCPIP

      (In the TCP START command, replace "TCPIP" with the name of your
      TCP/IP stack machine.)

   c. In SEARCH CGI, change "comm_method" to "TCP".

   d. In SEARCH CGI, change "tcp_p", "tcp_a", and "tcp_s" to be the
      right values for your environment.

4. Ongoing Concerns

Your site's content isn't static, so you need to refresh your keyword
index periodically. The general idea is that you should configure your
system's programmable operator so that it periodically runs KWDS EXEC
and then tells IXSERV to reload its index.

Each time you want the index recomputed, your programmable operator
must issue these commands:

   EXEC KWDS /
   CP MSG IXSERV IXLOAD LOAD 262144 sitesfsroot

where "sitesfsroot" is the fully-qualified SFS directory (that is,
filepool:filespace.directory) that is the root of your site.

(NB: "262144" is the size of an index data space in pages, which is
1 GB at 4 KB per page. If you are running into out-of-space problems,
you can make this number bigger, up to 524288, which is 2 GB.)

5. If You Have Problems

If you have problems getting this to work, contact me:

   Brian Wade
   IBM VM/ESA Development
   bkw at us.ibm.com