Just Go

October 22, 2008

Search is a hard problem. I take it for granted because I have things like Google and Lucene available to me. But it is a difficult problem, and it's made more difficult when you're not actually allowed to go around and index everything you want to search. Furthermore, the difficulty is compounded when you want to repeat the search in multiple locations and then combine the results.

Now, performing multiple searches is necessary when you want to provide easy searching over information indexed in multiple places. But as the number of indexes increases, so does the wait time. To try to make this easier and more bearable, I've been working on a Database Finder, an MLibrary Labs project that tries to identify indexes relevant to a search based on the keywords you supply. So far I've included the indexes covered by Search Tools and some of MLibrary's digital collections hosted by DLPS.

Of the two main sources of databases, I have much more information about the databases from Search Tools. It's likely you'll see a great difference in the quality of search results between the two, so keep that in mind when you use the DBFinder.

So, like I said, I understand why federated search has such a hard time being friendly. There are licensing issues, there are diverse schema issues, ranking issues, and overall, it's just a pain to make things play nicely together and keep the response time reasonable. Despite knowing what's hard about federated searching, what I still want is a way to sit down at a website, type some keywords in a box, and get pointers to information relevant to my interest. I mean, Google Scholar does it, why can't I?

Well, I should take a step back. If Google Scholar does it, why should I? The real reason is, there's a problem with completeness of information. Google Scholar hasn't been given permission to index everything the library subscribes to so it isn't a complete index for my purposes.

Back to what I really want: to provide a website with a search box and be able to "just go." I clearly don't have this yet, but I have made steps in that direction. For an MLibrary Labs project, I've been making a layer on top of Search Tools with the intention of simplifying the search process within Search Tools. This really is just a first step, but it is rather fun if you compare it to the current state of affairs.

Let's take a quick look at what the status quo is.

I expect the first thing you'll notice is that you need to select one of eleven different options to search in before you can really make progress. Maybe it's clear which set you want to search, but what if it isn't? Let's say you're like me, and you want to research cognitive modeling. Is that Engineering or Social Science? I don't know, and even after performing the same search in both, I couldn't tell you.

On the other hand, where I'm headed is just a search box with a go button. It looks fairly simple, really just something like this:

That's a contrast to the up-front decision Search Tools asks you to make. Is it better? That's difficult to quantify, and I'm pretty sure the answer will change depending on who you ask and what they are trying to accomplish. I haven't really set out how to evaluate the application other than conducting a few test searches of my own design. I don't expect my searches to be representative of the general population's searches, but I was generally pleased with the results.

With all that said, how does it work? You may have seen the keyword search in Search Tools' Find databases link, and if you've used it, you've more than likely found it a lackluster feature unless you knew what you were looking for. The Database Finder does not use this feature.

I'll start off by saying that Search Tools stores a log of search queries, along with how many results each database found. So, if I consider this to be a measure of how useful these databases are for those particular searches, then I can estimate how useful a resource is for your particular query. This kind of work is often done with probabilities, so:

Let A = You find resource A to be useful in your search
Let B = You supply keywords B in your search

P(A | B) = P(A and B)/P(B)

Since I have log files, I'll estimate the probabilities with counts of articles found by past searches, so:

P(A | B) = count(A and B)/count(B)

So really, this is the number of articles resource A came up with in past searches for keywords B, divided by the number of times keywords B have been searched for. That's pretty straightforward. It turns out that for a given search, count(B) is constant. And if all I care about is how these things rank in comparison, I can omit that factor, so:

utility(A | B) = count(A and B)
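
To make the counting concrete, here is a minimal Python sketch of that ranking. It is not the Database Finder's actual code. The log format (keyword, resource, number of results), the resource names, and the numbers are all made up for illustration, since the real Search Tools log layout isn't described here.

```python
from collections import Counter

# Hypothetical log rows: (keyword B, resource A, results A returned for B).
# Names and numbers are invented for this example.
LOG = [
    ("cognitive", "db_social_science", 120),
    ("cognitive", "db_engineering", 40),
    ("modeling", "db_engineering", 200),
    ("modeling", "db_social_science", 30),
    ("cognitive", "db_social_science", 80),
]


def utility(log, keyword):
    """Return count(A and B) for every resource A, given keyword B."""
    counts = Counter()
    for logged_keyword, resource, hits in log:
        if logged_keyword == keyword:
            counts[resource] += hits
    return counts


def rank(log, keyword):
    """List resources from highest to lowest utility (ties broken by name)."""
    scores = utility(log, keyword)
    return sorted(scores, key=lambda resource: (-scores[resource], resource))


print(rank(LOG, "cognitive"))  # ['db_social_science', 'db_engineering']
```

Dividing every score by count(B) would scale them all by the same positive number, so the order returned by rank would not change, which is why the denominator can be dropped.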

But there's a little more to the story. Maybe it helps, maybe it makes things worse, but you probably know that the coverage of these resources differs wildly. I wanted our search results to reflect this. So if ProQuest has a lot of stuff in it, in general, I figured it might do to further weight the results such that resources which are more generally useful are placed higher in the results ranking. Thus my new estimate for utility is:

utility'(A | B) = count(A) * count(A and B)
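
A small Python sketch shows how this weighting can change the order. The assumptions are mine: I take count(A) to mean the total number of results resource A returned across all logged searches, and the log rows and resource names below are invented.

```python
from collections import Counter

# Hypothetical log rows: (keyword B, resource A, results A returned for B).
LOG = [
    ("cognitive", "small_specialist", 50),
    ("cognitive", "big_general", 30),
    ("history", "big_general", 500),
    ("history", "small_specialist", 5),
]


def weighted_utility(log, keyword):
    """Return count(A) * count(A and B) for each resource A seen with keyword B."""
    total = Counter()  # count(A): results from A over every logged search
    joint = Counter()  # count(A and B): results from A for this keyword only
    for logged_keyword, resource, hits in log:
        total[resource] += hits
        if logged_keyword == keyword:
            joint[resource] += hits
    return {resource: total[resource] * joint[resource] for resource in joint}


scores = weighted_utility(LOG, "cognitive")
print(sorted(scores, key=scores.get, reverse=True))
# ['big_general', 'small_specialist']
```

With the unweighted count, small_specialist would come first for "cognitive" (50 results versus 30). Multiplying by count(A) gives 55 * 50 = 2750 against 530 * 30 = 15900, so the larger resource moves to the top. That is the "maybe it helps, maybe it makes things worse" trade-off in action.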


So, there it is, pretty simple and generally brain-dead, but at the same time it seems to work. If you'd like to try it yourself now that you've read about it, here's another chance to just go:

Posted by bertrama at 10:49 AM.
