Better Searching Through Science
David Voss
Next-generation search tools now under
development will let scientists drill ever deeper into the billion-page
Web
In the beginning, the Web was without form, and
void. Vast heaps of information grew upon the deep, and it was good for one's desktop.
But users across the land were befuddled and could not find their way. There
arose the tribes of the Yahoos, the HotBots, and the AltaVistas to bring order out of chaos.
Google and CiteSeer prospered and lent guidance. But researchers and scientists, learned
ones who had built the Web in their own image, yearned for something more. ...
As myths go, this one may lack staying power, but
there is no doubt that in some sense scientists have been victims of their own success.
The real creation story is that the World Wide Web began as an information-sharing and
-retrieval project at the European particle physics lab CERN, and scientists in all
fields now depend vitally on the Internet to do their jobs. It's only recently that the Web has evolved
into a convenient way to buy stuff. And although this commercial proliferation has been
good for the Web's growth, it has frustrated researchers seeking quality
content and pinpoint results among the noise and spam.
Now, a handful of companies and academic
researchers are working on a new breed of search engines to undo this
second curse of Babel. "I think the real action is in
focused and specialized search engines," says Web researcher Lee Giles
of Pennsylvania State University, University Park.
"This is where we're going to see the most interesting work."
The first generation of search engines was based
on what computer scientists such as Andrei Broder of AltaVista like to
call classic information retrieval. Stick in a key word
or phrase, and the software scurries around looking for matching words
in documents. The more times a word pops up, the
higher the document ranks in the output results.
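The count-the-matches approach can be sketched in a few lines of Python. This is a minimal illustration only, not the code of AltaVista or any other engine; the function names and the three-page toy corpus are hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_by_term_count(query, documents):
    """Rank document names by how often the query words occur in each one."""
    query_words = tokenize(query)
    scores = {}
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        # A missing word counts as zero occurrences.
        scores[name] = sum(counts[word] for word in query_words)
    # Keep only documents with at least one hit, highest count first.
    hits = [name for name in scores if scores[name] > 0]
    return sorted(hits, key=lambda name: scores[name], reverse=True)


# Hypothetical toy corpus, for illustration only.
documents = {
    "physics_page": "quark physics at the lab: quark models and quark data",
    "shop_page": "buy a quark poster today",
    "weather_page": "daily weather report",
}

print(rank_by_term_count("quark", documents))
```

The page that repeats the word most often comes first, whether or not it is the most useful one.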
But ranking by hits did not say how important or
authoritative or useful the pages might be. "The original idea was that
people would patiently look through 10 pages of results
to find what they wanted," says Monika Henzinger, director of research
at Google. "But we soon learned that people only look
at the first set of results, so ranking becomes very important." Some
early services, most famously Yahoo, tried to work
around this problem by using human analysts to construct Web directories
that retained only the most useful or authoritative Web
sites. Metaengines--Web sites that shot a query off to dozens of search engines--gave coverage another boost.
Then, starting about 3 years ago, a second
generation of tools appeared whose software performs link analysis: not
only digesting the content of pages but also scoping
out what the pages point to and what pages point at them. Google is
considered the commercial pioneer in this field, but other
companies such as NEC have funded development of sophisticated Web structure analyzers such as CiteSeer and Inquirus.
And it's still a topic of intense basic research at academic incubators
like the alma mater of Google's founders, Stanford
University.
Nowadays, Broder says, virtually every large
search engine does some form of link crunching and has ranking functions
that order the results. These ranking algorithms are
closely guarded secrets. "They are the magic sauce in the recipe,"
Broder explains. At Google, a system called PageRank
measures the importance of Web pages by "solving an equation of 500
million variables and more than 2 billion terms,"
according to Google's Web site. Says Henzinger, "The idea is that every
link is a vote for a Web page, but the votes are weighted by the
importance of the linking page."
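The weighted-vote idea Henzinger describes can be illustrated with a small iterative calculation. The sketch below is not Google's implementation, which the article notes is a closely guarded secret; the function name, the damping value of 0.85, the fixed number of iterations, and the four-page link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance by repeatedly passing weighted votes along links.

    links maps each page to the list of pages it points to; every linked
    page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its current importance among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page Web: three pages link to "c", which links to "a".
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page "c" collects the most votes, and page "a" ranks close behind because its single incoming link comes from the highly ranked "c".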
According to Broder, the goal of a third
generation of search engines now on the drawing boards "is to figure out
the intent behind the query." By looking at patterns of
searches and incorporating machine intelligence, software may anticipate
what an engine user really wants. That knowledge should
help it narrow and focus the search.
Future search technology will also begin to track
its human users much more closely--for example, divining that a query
about "Mustang" refers to the car, not the animal. In
its Inquirus-2 project, for instance, NEC has been looking at ways to
reformulate a query based on the user's information needs
before zipping it off to other search engines. Other search engines are
starting to present the results in a file cabinet stack of
categorized folders.
Privacy issues aside, the ramp-up in search engine
power is bound to benefit scientists. An example of a specialized search engine for scientists is Scirus, a joint venture
launched in April between FAST, a Norwegian search engine company, and
the Elsevier Science publishing group. Scirus is a
search interface that taps into Elsevier's proprietary journal content
while simultaneously searching the Web for the same key
words. "We found that scientists were searching proprietary databases as well as the Web," says Femke Markus, Elsevier's
project manager for Scirus. "Wouldn't it be ideal to have one search
engine to do both? We [also] would like to let people know
that we have journals that might be useful to them."
FAST's chief technology officer, John Lervik, says
that Scirus was designed to filter search results to present matches
only from Web pages with scientific content. "For the Web
content, we filter on the basis of some attributes like domain. A Web
site ending in '.edu' is more likely to have scientific
content, for instance." More important, he says, "we can do automatic categorization to estimate whether something is
scientific content or not." And like Google, Scirus also searches
content in PDF files, a document format widely used in scientific
research.
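The domain attribute Lervik mentions lends itself to a simple illustration. The sketch below is hypothetical and is not Scirus code: the function names and example URLs are invented, the suffix list contains only the '.edu' case given in the article, and the automatic categorization step he describes is not modeled.

```python
from urllib.parse import urlparse

# Assumed suffix list; the article mentions only '.edu'.
LIKELY_SCIENTIFIC_SUFFIXES = (".edu",)


def looks_scientific(url, suffixes=LIKELY_SCIENTIFIC_SUFFIXES):
    """Return True if the URL's host name ends in one of the given suffixes."""
    host = urlparse(url).hostname or ""
    return host.endswith(suffixes)


def filter_results(urls):
    """Keep only the results whose domain suggests scientific content."""
    return [url for url in urls if looks_scientific(url)]


# Hypothetical search results, for illustration only.
results = [
    "http://physics.example.edu/papers",
    "http://shop.example.com/posters",
]
print(filter_results(results))
```

A real filter would treat the domain as one signal among several rather than as a hard cutoff, since plenty of scientific content lives outside '.edu' sites.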
Such searching power does raise perils. Some users
fear that tailored search engines might promote the Web content of one publisher under the guise of an omniscient search
engine. Queries to Scirus, for example, yield not only free Web content
at universities and research labs but also links to
subscriber-only content in the Elsevier journals and MEDLINE.
Markus says Scirus does not stack the deck. "We've
joked about it: Can't we raise the ranking and make sure the top 20 is always Elsevier?" she says. "But that would be
very bad for us. Everyone would say, 'Hey, you're only pretending to
launch an independent platform.' " Markus says her team is
inviting other publishers, including the Los Alamos physics preprint
server, to have their content indexed on Scirus.
John Lervik at FAST also denies any bias. "We use
the same relevance algorithms for everything, and we don't emphasize ScienceDirect [Elsevier's online journal gateway]
over anything else." Lervik also wants users to speak up if they see
anything fishy.
A SMATTERING OF SEARCH TOOLS
ResearchIndex (CiteSeer), researchindex.org or citeseer.nj.nec.com: Project at NEC Research Institute to use autonomous citation indexing. Can search content in PostScript and PDF files.
Cora, cora.whizbang.com: Special-purpose search engine for computer science resources developed by Carnegie Mellon University.
Scirus, www.scirus.com: Joint venture between FAST and Elsevier to search scientific Web pages and subscription content at ScienceDirect and BioMedNet.
Search4Science, www.search4science.com: Meta-search engine for scientists; includes colleague search and science news headlines.
Another challenge for both the specialized and
general-purpose search engines is the "hidden Web"--databases that
search engines do not index, either because their content
has a short shelf life (such as daily weather reports) or because they
are available to subscribers only. The publishers of
Science, Nature, and other journals charge fees for online access to the
full text of research papers. Although abstracts may be
available, and the citations can readily be discovered by search engines
such as Google, the data and full text may never be seen
by search engines. This is partly why Elsevier cranked up the Scirus
Project. "Because of our firewalls and subscriptions,
engines like Google cannot get in and index us," says Markus.
These barriers pose a dilemma for researchers who
want the stamp of peer-review approval and publication in a high-profile journal but who also want the world to know about
their work. They have also led to a continuing debate about whether
scientific research publications should be free and available
without restriction on the Web (Science, 14 July 2000, p. 223). At the moment, Science and Nature both allow authors to
post copies of papers on their Web pages after a period of time. By
then, however, it may no longer be the breaking news
that researchers are looking for.
Other researchers believe that the highest quality
search tools will come not from rejiggering the search engines but from
a whole new way of creating Web content. One
initiative, called the "semantic Web," is being promoted by a team that
includes Tim Berners-Lee, the father of the World Wide Web,
who is now at the Massachusetts Institute of Technology. The goal is to incorporate "metadata"--a description of what a
document is about--into every Web page, in a form that computers can
easily digest and understand. To scientists wrestling
with information overload, that might mark the first big step toward
paradise regained.