

 Better Searching Through Science

 David Voss

  Next-generation search tools now under development will let scientists drill ever deeper into the billion-page Web

 In the beginning, the Web was without form, and void. Vast heaps of information grew upon the deep, and it was good for one's desktop. But users across the land were befuddled and could not find their way. There arose the tribes of the Yahoos, the HotBots, and the AltaVistas to bring order out of chaos. Google and CiteSeer prospered and lent guidance. But researchers and scientists, learned ones who had built the Web in their own image, yearned for something more. ...

 As myths go, this one may lack staying power, but there is no doubt that in some sense scientists have been victims of their own success. The real creation story is that the World Wide Web began as an information-sharing and -retrieval project at the European particle physics lab CERN, and scientists in all fields now depend vitally on the Internet to do their jobs. Only recently has it evolved into a convenient way to buy stuff. And although this commercial proliferation has been good for the Web's growth, it has frustrated researchers seeking quality content and pinpoint results among the noise and spam.

 Now, a handful of companies and academic researchers are working on a new breed of search engines to undo this second curse of Babel. "I think the real action is in focused and specialized search engines," says Web researcher Lee Giles of Pennsylvania State University, University Park. "This is where we're going to see the most interesting work."

 The first generation of search engines was based on what computer scientists such as Andrei Broder of AltaVista like to call classic information retrieval. Stick in a key word or phrase, and the software scurries around looking for matching words in documents. The more times a word pops up, the higher the document ranks in the output results.
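 The word-counting approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any real engine: the toy corpus, the file names, and the function names `tokenize` and `rank_by_term_count` are all made up for the example, and real systems of the period also weighted terms and normalized for document length.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_by_term_count(query, documents):
    """Return (name, hits) pairs ordered by how often the query words occur.

    `documents` maps a document name to its text. Documents with no
    matching words are left out of the results.
    """
    query_terms = tokenize(query)
    scores = []
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        hits = sum(counts[term] for term in query_terms)
        if hits > 0:
            scores.append((name, hits))
    # Most hits first; ties are broken alphabetically by name.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical three-page corpus, invented for this example.
documents = {
    "physics.html": "Particle physics at CERN: particle beams and particle detectors.",
    "shop.html": "Buy particle board furniture online.",
    "weather.html": "Daily weather report.",
}

results = rank_by_term_count("particle physics", documents)
for name, hits in results:
    print(name, hits)
```

 Note that the furniture page still makes the list simply because it contains the word "particle," which is exactly the weakness of ranking by raw hits.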

 But ranking by hits did not say how important or authoritative or useful the pages might be. "The original idea was that people would patiently look through 10 pages of results to find what they wanted," says Monika Henzinger, director of research at Google. "But we soon learned that people only look at the first set of results, so ranking becomes very important." Some early services, most famously Yahoo, tried to work around this problem by using human analysts to construct Web directories that retained only the most useful or authoritative Web sites. Metaengines--Web sites that shot a query off to dozens of search engines--gave coverage another boost.




 Then, starting about 3 years ago, a second generation of tools appeared whose software performs link analysis: not only digesting the content of pages but also scoping out what the pages point to and what pages point at them. Google is considered the commercial pioneer in this field, but other companies such as NEC have funded development of sophisticated Web structure analyzers such as CiteSeer and Inquirus. And it's still a topic of intense basic research at academic incubators like the alma mater of Google's founders, Stanford University.

 Nowadays, Broder says, virtually every large search engine does some form of link crunching and has ranking functions that order the results. These ranking algorithms are closely guarded secrets. "They are the magic sauce in the recipe," Broder explains. At Google, a system called PageRank measures the importance of Web pages by "solving an equation of 500 million variables and more than 2 billion terms," according to Google's Web site. Says Henzinger, "The idea is that every link is a vote for a Web page, but the votes are weighted by the importance of the linking page."
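 Henzinger's "weighted vote" idea can be illustrated with a small power-iteration sketch. This is a simplified, textbook-style version under stated assumptions, not Google's production system, whose ranking function is a trade secret: the damping factor of 0.85, the handling of pages with no outgoing links, and the four-page toy graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from link structure by repeated voting.

    `links` maps each page to the list of pages it links to. Every page
    splits its current score evenly among the pages it points at, so a
    link from an important page is worth more than one from an obscure page.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # vote evenly over all pages so that no score is lost.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy Web: A, B, and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

 In this toy graph, page A receives only one link, but it comes from the heavily linked page C, so A ends up ranked well above B and D. The real computation has the same shape but runs over hundreds of millions of pages.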

 According to Broder, the goal of a third generation of search engines now on the drawing boards "is to figure out the intent behind the query." By looking at patterns of searches and incorporating machine intelligence, software may anticipate what an engine user really wants. That knowledge should help it narrow and focus the search.

 Future search technology will also begin to track its human users much more closely--for example, divining that a query about "Mustang" refers to the car, not the animal. In its Inquirus-2 project, for instance, NEC has been looking at ways to reformulate a query based on the user's information needs before zipping it off to other search engines. Other search engines are starting to present the results in a file cabinet stack of categorized folders.

 Privacy issues aside, the ramp-up in search engine power is bound to benefit scientists. An example of a specialized search engine for scientists is Scirus, a joint venture launched in April between FAST, a Norwegian search engine company, and the Elsevier Science publishing group. Scirus is a search interface that taps into Elsevier's proprietary journal content while simultaneously searching the Web for the same key words. "We found that scientists were searching proprietary databases as well as the Web," says Femke Markus, Elsevier's project manager for Scirus. "Wouldn't it be ideal to have one search engine to do both? We [also] would like to let people know that we have journals that might be useful to them."

 FAST's chief technology officer, John Lervik, says that Scirus was designed to filter search results to present matches only from Web pages with scientific content. "For the Web content, we filter on the basis of some attributes like domain. A Web site ending in '.edu' is more likely to have scientific content, for instance." More important, he says, "we can do automatic categorization to estimate whether something is scientific content or not." And like Google, Scirus also searches content in PDF files, a document format widely used in scientific research.
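 The domain-based filtering Lervik describes might look something like the sketch below. Only the '.edu' rule comes from his description; the function names, the example URLs, and the idea of a hard keep-or-drop filter are assumptions for illustration, and the automatic categorization step he calls more important is not modeled here.

```python
from urllib.parse import urlparse

# Only '.edu' is named in the article; a real filter would presumably
# use more attributes than the domain suffix alone.
SCIENCE_LIKELY_SUFFIXES = (".edu",)


def looks_scientific(url, suffixes=SCIENCE_LIKELY_SUFFIXES):
    """Guess whether a URL is likely to hold scientific content.

    The guess is based purely on the host name's ending.
    """
    host = urlparse(url).hostname or ""
    return host.lower().endswith(suffixes)


def filter_results(urls):
    """Keep only the search results whose host passes the domain test."""
    return [url for url in urls if looks_scientific(url)]


# Hypothetical search results, invented for this example.
candidates = [
    "http://www.example.edu/physics/preprint.pdf",
    "http://shop.example.com/mustang-parts",
    "http://LAB.EXAMPLE.EDU/genome/",
]
kept = filter_results(candidates)
print(kept)
```

 A suffix test like this is only a rough prior: it keeps a university's sports pages and drops a good paper hosted on a '.com' site, which is presumably why content-based categorization matters more in practice.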

 Such searching power does raise perils. Some users fear that tailored search engines might promote the Web content of one publisher under the guise of an omniscient search engine. Queries to Scirus, for example, yield not only free Web content at universities and research labs but also links to subscriber-only content in the Elsevier journals and MEDLINE.

 Markus says Scirus does not stack the deck. "We've joked about it: Can't we raise the ranking and make sure the top 20 is always Elsevier?" she says. "But that would be very bad for us. Everyone would say, 'Hey, you're only pretending to launch an independent platform.' " Markus says her team is inviting other publishers, including the Los Alamos physics preprint server, to have their content indexed on Scirus.

 John Lervik at FAST also denies any bias. "We use the same relevance algorithms for everything, and we don't emphasize ScienceDirect [Elsevier's online journal gateway] over anything else." Lervik also wants users to speak up if they see anything fishy.




   ResearchIndex (CiteSeer): Project at NEC Research Institute that uses autonomous citation indexing. Can search content in PostScript and PDF files.

   Cora: Special-purpose search engine for computer science resources developed by Carnegie Mellon University.

   Scirus: Joint venture between FAST and Elsevier to search scientific Web pages and subscription content at ScienceDirect and BioMedNet.

   Search4Science: Meta-search engine for scientists; includes colleague search and science news headlines.

   Leiden University: List of specialty search engines.

   SearchAbility: Guide to academic search engines.

   Teoma: Beta-test search engine provides "authoritative" Web page results, grouped into categories, plus "expert" recommendations.

   Search Engine Watch: Lists of specialty search engines and tips for searching; info for search engine researchers.

   Search Engine Guide: Current news about the business; lists of engines.

   Applied Semantics: Ontology-based search engine with meaning-based search.

   Semantic Web Activity: Information about efforts to add machine-readable description data to Web pages.

   WiseNut: Context-dependent Web search engine developed by founder of mySimon comparison shopping site.

   Lasoo: Location-based search tool focuses search on selected geographical area.

 Another challenge for both the specialized and general-purpose search engines is the "hidden Web"--databases that search engines do not index, either because their content has a short shelf life (such as daily weather reports) or because they are available to subscribers only. The publishers of Science, Nature, and other journals charge fees for online access to the full text of research papers. Although abstracts may be available, and the citations can readily be discovered by search engines such as Google, the data and full text may never be seen by search engines. This is partly why Elsevier cranked up the Scirus Project. "Because of our firewalls and subscriptions, engines like Google cannot get in and index us," says Markus.

 These barriers pose a dilemma for researchers who want the stamp of peer-review approval and publication in a high-profile journal but who also want the world to know about their work. It has also led to a continuing debate about whether scientific research publications should be free and available without restriction on the Web (Science, 14 July 2000, p. 223). At the moment, Science and Nature both allow authors to post copies of papers on their Web pages after a period of time. By then, however, it may no longer be the breaking news that researchers are looking for.

 Other researchers believe that the highest quality search tools will come not from rejiggering the search engines but from a whole new way of creating Web content. One initiative, called the "semantic Web," is being promoted by a team that includes Tim Berners-Lee, the father of the World Wide Web, who is now at the Massachusetts Institute of Technology. The goal is to incorporate "metadata"--a description of what a document is about--into every Web page, in a form that computers can easily digest and understand. To scientists wrestling with information overload, that might mark the first big step toward paradise regained.
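 To see why machine-readable metadata could help, consider the "Mustang" ambiguity mentioned earlier. The sketch below is purely illustrative: the JSON layout, the field names `subject` and `type`, and the example URLs are invented here and do not follow RDF or any actual Semantic Web vocabulary.

```python
import json

# Hypothetical pages, each carrying a small machine-readable description.
PAGES_JSON = """
[
  {"url": "http://example.org/mustang-car",
   "metadata": {"title": "Mustang", "subject": "automobile", "type": "review"}},
  {"url": "http://example.org/mustang-horse",
   "metadata": {"title": "Mustang", "subject": "animal", "type": "encyclopedia"}}
]
"""


def find_pages(pages, **criteria):
    """Return the URLs of pages whose metadata matches every criterion."""
    return [
        page["url"]
        for page in pages
        if all(page["metadata"].get(key) == value for key, value in criteria.items())
    ]


pages = json.loads(PAGES_JSON)
print(find_pages(pages, title="Mustang"))
print(find_pages(pages, title="Mustang", subject="automobile"))
```

 A keyword search on the title alone returns both pages, while adding the subject field picks out the car without the software having to guess at meaning from the page text.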