Site-wide Search
part of the
ArsDigita Community System
by
Philip Greenspun,
Jin Choi, and
Michael Bryzek
Note: this facility requires interMedia version 8.1.5.1, which you can upgrade to via a patch, or Oracle 8.1.6.
The Big Idea
This is a system for using a full-text search engine bundled with the
RDBMS, e.g., Oracle Intermedia, to permit site-wide searches with
results ranked by descending relevance. So, for example, bulletin board
postings and classified ads are searched at the same time as static
documents.
Under the Hood
We define a
site_wide_index
table which ties together
rows from different tables to a single interMedia index column. We use
the user datastore feature of interMedia which gives us flexibility in
how we index various tables, and saves us from having to copy text we
wish to index.
Deeper Under the Hood
The main action is
create table site_wide_index (
table_name varchar(30) not null,
the_key varchar(700) not null,
one_line_description varchar(4000) not null,
datastore char(1) not null, -- place holder for datastore column
primary key (table_name, the_key)
);
We store the table name from which the content came originally. We
also keep a key from that table. This will typically be an integer
but it might be character data (e.g., the
msg_id
from
bboard
). In some cases, it might be a delimited
composite key (it is up to the application code to pull this out and
use it intelligently). Note that Oracle automatically coerces from
integer to varchar and therefore we don't have to do anything fancy to
make queries work in the typical case (where the primary key of the
fundamental table is an integer).
We must have the primary key constraint on this index because Oracle
Intermedia can't deal with any table that doesn't have a primary key.
We need one_line_description
so that we can display a nice
list of query results to users who do a site-wide search. If we didn't
keep this, we'd have to outer join with every possible table that
contributes to the index or laboriously look up every row with a PL/SQL
procedure. And even then we'd need to keep some record of, for each
table, what was the best way to get a one-line description.
For consistent site-wide user interface, we keep a table of what the
table names in the index should be called as far as sections go. For
example, the table name for the /bboard module is "bboard" but a
publisher might wish this presented in a search form as "Discussion
Forums". This is done by entering rows in the
table_acs_properties
table, described in
the user profiling documentation. We
reproduce the table definition here:
create table table_acs_properties (
table_name varchar(30) primary key,
section_name varchar(100) not null,
user_url_stub varchar(200) not null,
admin_url_stub varchar(200) not null
);
The
user_url_stub
column holds the url necessary to direct the
user from the search results page to the page with the content. For
example, for the discussion forum module, this column would contain
"/bboard/redirect-for-sws?msg_id=". The search
results page will append the value of
the_key
column to
this URL stub.
Triggers that fill the table
Here's some example PL/SQL code to keep the site-wide index table
updated for static pages. It is complicated by the need to observe the
index_p
flag, which is used to mark pages which should not
be indexed.
Note that the datastore
column is just a placeholder; its
value is unimportant (we just happen to use the letter 'a'). Updating
this column causes the index to be rebuilt for a row by calling the
user datastore procedure (see next section).
create or replace trigger static_pages_sws_insert_tr
after insert on static_pages for each row
BEGIN
IF :NEW.index_p = 't' THEN
-- we have to create a new row in the index table for this row.
insert into site_wide_index (table_name, the_key, one_line_description, datastore)
values ('static_pages', :new.page_id, :new.page_title, 'a');
END IF;
END;
/
show errors
CREATE OR replace trigger static_pages_sws_update_tr
after UPDATE ON static_pages FOR each row
BEGIN
IF :old.index_p = 'f' AND :NEW.index_p = 't' THEN
insert into site_wide_index (table_name, the_key, one_line_description, datastore)
values ('static_pages', :new.page_id, :new.page_title, 'a');
ELSIF :old.index_p = 't' AND :NEW.index_p = 'f' THEN
DELETE FROM site_wide_index
WHERE table_name = 'static_pages'
AND the_key = :old.page_id;
ELSIF :NEW.index_p = 't' THEN
update site_wide_index
set the_key = :new.page_id, one_line_description = nvl(:new.page_title, '(no title)'), datastore = 'a'
where the_key = :old.page_id
and table_name = 'static_pages';
END IF;
end;
/
show errors
CREATE OR replace trigger static_pages_sws_delete_tr
after DELETE ON static_pages FOR each row
BEGIN
IF :old.index_p = 't' THEN
DELETE FROM site_wide_index
WHERE the_key = :old.page_id
AND table_name = 'static_pages';
END IF;
END;
/
show errors
User Datastore Procedure
This is the heart of the site wide index. This procedure needs to know
about every section of the ACS which intends to use the site-wide
search. This example indexes
bboard
and
static_pages
.
This procedure is run to gather the text that is to be indexed for any
row of the site_wide_index
table. Its arguments are the
rowid of the row that it is to retrieve the content for, and a
temporary clob in which the results are to be stored.
Different sections can be indexed in different ways. Note that the
bboard section indexes entire threads into a single entry in the site
wide index, instead of indexing each message as its own entry.
The user datastore procedure must be loaded as the ctxsys user, and
must be granted select access on the tables you want to index
from. You must grant execute permissions on the user datastore
procedure to the user who will be using it to create the index.
CREATE OR replace procedure sws_user_datastore_proc ( rid IN ROWID, tlob IN OUT nocopy clob )
IS
v_table_name VARCHAR(30);
v_primary_key VARCHAR(700);
v_one_line VARCHAR(700);
v_static_pages_row photonet.static_pages%ROWTYPE;
cursor bboard_cursor(v_msg_id CHAR) IS
SELECT one_line, message
FROM photonet.bboard
WHERE sort_key LIKE v_msg_id || '%';
BEGIN
-- get various info on table and columns to index
SELECT table_name, the_key, one_line_description
INTO v_table_name, v_primary_key, v_one_line
FROM photonet.site_wide_index
WHERE rid = site_wide_index.ROWID;
-- clean out the clob we're going to stuff
dbms_lob.trim(tlob, 0);
-- handle different sections
IF v_table_name = 'bboard' THEN
-- Get data from every message in the thread.
FOR bboard_record IN bboard_cursor(v_primary_key) LOOP
IF bboard_record.one_line IS NOT NULL THEN
dbms_lob.writeappend(tlob, length(bboard_record.one_line) + 1, bboard_record.one_line || ' ');
END IF;
dbms_lob.append(tlob, bboard_record.message);
END LOOP;
ELSIF v_table_name = 'static_pages' THEN
SELECT * INTO v_static_pages_row
FROM photonet.static_pages
WHERE page_id = v_primary_key;
IF v_static_pages_row.page_title IS NOT NULL THEN
dbms_lob.writeappend(tlob, length(v_static_pages_row.page_title) + 1, v_static_pages_row.page_title || ' ');
END IF;
dbms_lob.append(tlob, v_static_pages_row.PAGE_BODY);
END IF;
END;
/
show errors
grant execute on sws_user_datastore_proc to photonet;
Querying
If you just want to find the most relevant documents across the entire
site:
select
score(10) as the_score,
the_key,
one_line_description,
map.user_url_stub
from site_wide_index swi, table_acs_properties map
where swi.table_name = map.table_name
and contains(indexed_stuff,'about($user_entered_query)',10) > 0
order by score(10) desc
Within the discussion forums, you wouldn't bother to join with
table_acs_properties
since you don't need the URL stub
and you know what the section is called. But you'd probably want to
join with the bboard table in order to restrict to a particular topic,
e.g.,
select
score(10) as the_score,
msg_id,
one-line,
posting_time,
topic
from site_wide_index swi, bboard
where swi.the_key = bboard.msg_id
and swi.table_name = 'bboard'
and contains(indexed_stuff,'about($user_entered_query)',10) > 0
order by score(10) desc
The about()
in the query specifies a theme query, which
takes the words of the query to be "themes" to search for. Oracle
has had geniuses locked up for the last several years dividing up the
English language into a taxonomy of semantic topics. interMedia will
decide which of these themes your documents are most relevant to, and
match those up to the words in your query. Leaving out the
about()
gives you a simple word query, which is more like
what you'd expect.
In practice, feeding a user entered query directly to interMedia turns
out to be very bad, because it is expecting queries to be specified
using their little query language. Any syntactical errors in defining
a query causes interMedia to cough up an error. This is very easy to
do, since various punctuation characters and words like "and" are
special in this query language. We use the PL/SQL function
im_convert
to massage the user input into a form which is
safe to pass to interMedia, and which performs a combination of text
search and theme search to try to bring the most relevant documents.
im_convert()
cannot be called directly from within
contains()
, so we must use a separate database query to
convert the user search string. This may be replaced at a later date
into a Tcl procedure, to make modifications easier.
Tcl processing of the results
Unlike Verity and PLS, ConText doesn't have a good way to refer to
previous searches and say "give me results 100 through 200 from that
last search". Combined with the stateless nature of HTTP, this makes it
hard to implement the kinds of search user interfaces prevalent at most
sites (notably AltaVista). Personally I'm not sorry. I've always hated
sites like those. My personal theory is that the user should get about
25 highly relevant documents back from a search. Additional documents
shouldn't be offered unless they are nearly as relevant as the most
relevant. Here's an example of the Tcl code necessary to implement
this:
set counter 0
while {[ns_db getrow $db $selection]} {
set_variables_after_query
incr counter
if { ![info exists max_score] } {
# first iteration, this is the highest score
set max_score $the_score
}
if { ($counter > 25) && ($the_score < [expr 0.3 * $max_score] ) } {
# we've gotten more than 25 rows AND our relevance score
# is down to 30% of what the maximally relevant row was
ns_db flush $db
break
}
if { ($counter > 50) && ($the_score < [expr 0.5 * $max_score] ) } {
# take a tougher look
ns_db flush $db
break
}
if { ($counter > 100) && ($the_score < [expr 0.8 * $max_score] ) } {
# take a tougher look yet
ns_db flush $db
break
}
...
}
The heuristic cutoff in the above procedure has been packaged into the
Tcl procedure ad_search_results_cutoff
.
For the user who has a PhD in computer science
Suppose that the user doesn't like the rough results that come back from
a phrase relevancy query in ConText. We should give the user an
opportunity to go into an "advanced search" mode where they can exploit
the full power of the ConText query language.
"More Like This"
interMedia doesn't have a built-in facility for doing query by example
("give me more documents like this one"), but we can use theme
searches to build a reasonable facsimile. The Tcl procedure
ad_search_qbe_get_themes
takes a table name and key from
the site wide index table and returns the top five themes associated
with that document. Those themes can be used in an
about()
search to find documents about the same themes.
In practice, the themes returned by interMedia for a document are
fairly general, and occasionally irrelevant (including things like
HTML tags). So we throw in the one_line_description
as
part of the query, which seems to do a better job of bringing up
relevant results. See /search/query-by-example
for an
example of query by example.
Using a Separate Search Server
You may decide you want a separate server to serve up the search
queries, perhaps for performance reasons, or perhaps because Oracle
hasn't made available the patch you need to run Oracle on a machine
with more than 10 IP addresses on the architecture your main server is
on. The two parameters,
BounceQueriesTo
and
BounceResultsTo
in the site-wide-search section of your
parameters file allows you to do this.
BounceQueriesTo
should be set on your main server, and
BounceResultsTo
set on the search server. This will bounce queries and results back
and forth for all search pages.
Keeping the index in sync
If you update a table with a ConText index.... nothing happens. The
index doesn't get updated, unlike any other kind of index in the SQL
world. There are a few ways to update the index:
- run
/bin/ad-context-server.pl
as a cron
job every hour.
- saddle your Unix system administrator with the task of making sure
that the ctxsrv process is kept running at all times (this is in
addition to the six Unix process that constitute a normal Oracle
installation)
- create a
DBMS_JOB
to alter the indexes.
The method used to update the index is independent of the rest of the
site-wide search module. At ArsDigita, we most often use
ad-context-server
because it is easy to configure and it
is robust enough to prevent deadlocks while updating interMedia
indexes.
If you want to manually update the index sws_ctx_index, you can execute:
alter index sws_ctx_index rebuild online parameters('sync');
Be careful that no other process is updating the index at the same
time or your index may become unusable.
Digging for trouble
The best place to begin trying to track down non-obvious problems is to look
for errors in the indexing:
connect ctxsys/[password]
select err_index_owner, count(1) from ctx_index_errors group by
err_index_owner;
Select out the err_text for the index_owners you care about and start
tracking down the problem.
Optimizing interMedia indexes
Another thing that you might have to do periodically is
alter index sws_ctx_index rebuild online
parameters('optimize full maxtime 60');
This gives Oracle one hour (60 minutes) to rebuild an optimized index.
I'm not quite sure whether this simply means more efficient for query
time or better (more relevant) results for users.
If you care about performance
If you don't want to slow down transactions by building up these clob
copies of everything, keep the
site_wide_index
table on a
separate physical disk drive. Also try to put the Intermedia index onto
yet another separate physical disk drive. Since every disk drive on a
24x7 server must be mirrored, that means you need four new disk drives
to implement this glorious module. If you happen to have two more disk
drives, use them for the primary key index of the site_wide_index table.
philg@mit.edu