Why catalogue the Internet?
The limitations of Search engines
-
A commonly held view is that, while current search engines have
limitations (too slow, find too much), future versions will find the desired
information in response to a few keywords. 85% of users use search engines
(Georgia Tech).
Search Tools – what do I mean by a "search engine"?
Subject directories (aka
subject gateways, jumpstations, linkfarms…)
Search Engines
-
A software "robot" or "spider" traverses the
web and adds URLs to a database, which can then be searched.
-
Examples: AltaVista, Northern Light, ANZWERS.
Developments in search engines
-
Need to admit that modern search engines are powerful, and
surprisingly effective for specific classes of searches.
-
Could these ideas be used in library catalogues?
-
Some ways in which Search Engines are making effective searches
easier:
Relevancy
-
Search engines judge which items are most relevant to the query,
based on frequency, position, etc. of the search terms. This means we are less
concerned with how many items are found than with what is in the first 20
("first 20 precision").
-
There is often little information on how relevancy is arrived at,
which raises concerns, e.g. advertising bias (GoTo
- "Promote your site").
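As a rough illustration of ranking by frequency and position, here is a minimal sketch. The scoring formula, the function names `relevance_score` and `top_n`, and the sample documents are all assumptions made for illustration; no search engine named here is known to rank this way.

```python
def relevance_score(query_terms, document_text):
    """Toy score: term frequency plus a bonus for an early first occurrence."""
    words = document_text.lower().split()
    if not words:
        return 0.0
    score = 0.0
    for term in query_terms:
        term = term.lower()
        positions = [i for i, word in enumerate(words) if word == term]
        if not positions:
            continue
        frequency = len(positions) / len(words)
        # Bonus is 1.0 when the term is the first word, falling towards 0 later on.
        position_bonus = 1.0 - positions[0] / len(words)
        score += frequency + position_bonus
    return score


def top_n(query_terms, documents, n=20):
    """Return the names of the n highest-scoring documents (the "first 20")."""
    ranked = sorted(
        documents,
        key=lambda name: relevance_score(query_terms, documents[name]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical mini-collection, for illustration only.
documents = {
    "a": "library catalogue of web sites",
    "b": "web sites about cats and the library",
    "c": "nothing here",
}
first_two = top_n(["library"], documents, n=2)
```

Only the order of the first `n` results matters to the searcher, which is why "first 20 precision" is a more useful measure here than the total number of hits.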
Related Pages
-
Searches for items similar (e.g. in terms used, domain, etc.)
to one that the searcher has deemed relevant, e.g. AltaVista.
Grouping
-
Sorts search results by common characteristics, e.g. Northern
Light. Enables a kind of sub-search.
Automatic translation of concepts
-
Synonyms automatically translated to search terms, e.g. "elderly"
also searches "senior citizens" - used by Excite
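Synonym translation of this kind can be sketched as simple query expansion. The table below is a hand-made, hypothetical one holding only the example from the text; how Excite actually built or applied its concept mappings is not described here.

```python
# Hypothetical synonym table; only the "elderly" example comes from the text above.
SYNONYMS = {
    "elderly": ["senior citizens"],
}


def expand_query(terms, synonyms=SYNONYMS):
    """Return the original terms plus any listed synonyms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + synonyms.get(term.lower(), []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded_terms = expand_query(["elderly", "housing"])
```

The expanded list would then be searched as alternatives (an OR search), so a page that only says "senior citizens" is still retrieved.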
Popularity measures
-
Attempts to automatically bias ranking towards the most "popular"
sites. Problems: may be self-fulfilling; time lag; a "popular" site is not
necessarily relevant to a specific search.
-
Hits, e.g. Direct
Hit
-
Boosts relevancy according to whether other searchers followed
the link
-
Links, e.g. Google
-
Boosts relevancy according to number of links to site
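A link-based boost can be sketched as counting inbound links and adding a weighted bonus to a base relevance score. This is a deliberately simplified illustration of "number of links to site", not Google's actual algorithm; the `weight` parameter, the function names and the sample link graph are assumptions.

```python
from collections import Counter


def inbound_link_counts(links):
    """links maps each page to the list of pages it links to."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):  # count each linking page once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


def boost_by_links(base_scores, links, weight=0.1):
    """Add weight * (number of inbound links) to each page's base score."""
    counts = inbound_link_counts(links)
    return {page: score + weight * counts[page] for page, score in base_scores.items()}


# Hypothetical link graph, for illustration only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["c"]}
boosted = boost_by_links({"a": 1.0, "b": 1.0, "c": 1.0}, links, weight=0.5)
```

The sketch also shows why such measures can be self-fulfilling: a page that already ranks well is more likely to be seen, and therefore more likely to be linked to again.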
Databases of common searches, e.g.
Ask
Jeeves
-
Really a form of cataloguing
How do search engines perform?
-
Precision usually good (so users feel satisfied); but recall
(overlap) low.
-
Bar-Ilan 1998 – low recall (max 50%), high precision, little
overlap
-
Clarke 1997 – relative recall 50-60%
-
Hawking 1999 – commercial search engines sacrifice effectiveness
for speed
-
Schwartz 1998 – performance difficult to measure
-
Pollock 1997 – users need understanding of Internet knowledge
structures to carry out effective searches.
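The measures used in these studies can be sketched as set operations on result lists. The definitions below are common textbook forms and an assumption on my part: in particular, "relative recall" is taken here to mean recall measured against a pool of relevant items found by several engines, and overlap is computed as shared results divided by all results. The individual studies may define them differently.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def relative_recall(retrieved, pooled_relevant):
    """Fraction of the pooled relevant items that this engine retrieved."""
    pooled = set(pooled_relevant)
    if not pooled:
        return 0.0
    return len(set(retrieved) & pooled) / len(pooled)


def overlap(results_a, results_b):
    """Shared results as a fraction of all distinct results from both engines."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical result lists, for illustration only.
engine_a = ["u1", "u2", "u3", "u4"]
engine_b = ["u3", "u5"]
pool = {"u1", "u2", "u3", "u5", "u6", "u7"}

p = precision(engine_a, pool)        # 3 of 4 retrieved items are relevant
r = relative_recall(engine_a, pool)  # 3 of 6 pooled relevant items were found
o = overlap(engine_a, engine_b)      # 1 shared item out of 5 distinct items
```

In this made-up case the searcher sees mostly relevant results (high precision) while missing half of what exists (low recall), which is the pattern the studies above report.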
What search engines miss
"indexable web"
-
Lawrence99 estimates 16M web servers, of which only 2.8M are on
the indexable web.
Why pages aren’t indexable:
-
Pages not
linked to others on Web
-
Databases
-
Frames
-
Javascript
-
Image-based information: it is not possible to find many organisations
directly, since their pages include the name only as a graphic logo.
-
Subscription only sites
-
Robot exclusion sites
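Robot exclusion is the most mechanical of the reasons listed above: a site publishes a `robots.txt` file and a well-behaved spider skips the paths it disallows. The sketch below uses Python's standard `urllib.robotparser`; the `robots.txt` content, the `ExampleBot` user agent and the example.org URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the given robots.txt lets this robot fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt that hides one directory from all robots.
robots_txt = "User-agent: *\nDisallow: /private/\n"
candidate_urls = [
    "http://example.org/index.html",
    "http://example.org/private/report.html",
]
indexable = allowed_urls(robots_txt, "ExampleBot", candidate_urls)
```

Pages filtered out in this way never reach the search engine's database, however relevant they may be.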
Coverage of major search engines
-
Of the indexable web, the main search engines combined index 42%; the
best single engine (Northern Light) covers only 16%. (Lawrence99)
-
Search engines are going for speed and profitability rather than
coverage. (AltaVista times out during busy periods)
-
Little overlap between search engines, e.g. Notess00
Language problems
Many different ways even a simple query can be expressed,
e.g.
-
Users will give very different search statements for the
same query (Bates98/Saracevic: only 1.5% of sample of users used same term
for same query)
-
Most people only use 2-3 terms, single word searches common.
Examples from MetaSpy
-
People think they get higher recall than they actually do
(Blair and Maron 1985)
-
Size of vocabulary: a typical vocabulary is 50K words; therefore the
average number of hits for a single-word search of 800M pages of 500 words each
is 800M x 500 / 50K = 8M hits!
-
Mis-spellings (mis-spelt words will retrieve something).
Example
from AltaVista
-
Search engines are open to bias, e.g. some conservative mid-west
sites removed the word "Gay" from Veronica indexes (Poulter97)
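The vocabulary-size estimate in the list above can be checked with a few lines of arithmetic. The sketch assumes, as the estimate implicitly does, that every word in the vocabulary is equally likely to occur; the function name is my own. Strictly, it counts word occurrences rather than distinct pages, and real word frequencies are highly skewed, so it is only an order-of-magnitude guide.

```python
def expected_hits_per_word(pages, words_per_page, vocabulary_size):
    """Average occurrences of one word if all words were equally frequent."""
    total_words = pages * words_per_page
    return total_words / vocabulary_size


# Figures from the estimate above: 800M pages, 500 words each, 50K-word vocabulary.
hits = expected_hits_per_word(800_000_000, 500, 50_000)
```

With these figures the result is 8 million, which matches the estimate and shows why single-word searches are rarely useful on their own.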
Is author-added metadata the solution?
-
If authors add metadata, surely search engines will be able
to utilise this, so having external cataloguers isn't necessary.
Metadata used on few sites (Lawrence99)
-
Metadata frequently doesn’t relate to content
of site
-
34% use any meaningful metadata
-
0.3% use Dublin Core
Metadata can be spammed
-
Therefore many search engines ignore it: "search engines do
not trust metadata. It's fine to talk about how nice it would be if all
web pages were categorized, but the search engines know from experience
that people will lie, mislead or do whatever they can to get on top". (Sullivan
1997)
-
Ignored by Excite, Google, Lycos, NLight
-
Do not boost ranking on AltaVista, Excite, Google, Lycos,
NLight
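To make concrete what "author-added metadata" means for a search engine, the sketch below reads the `<meta>` tags from a page using Python's standard `html.parser`. The sample HTML, the class and function names, and the test for Dublin Core (a `name` beginning with `DC.`, following the usual convention for embedding Dublin Core in HTML) are assumptions for illustration.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            self.meta[name.lower()] = content


def extract_meta(html):
    """Return a dict of lower-cased meta names to their content strings."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.meta


def uses_dublin_core(meta):
    """True if any meta name follows the DC. prefix convention."""
    return any(name.startswith("dc.") for name in meta)


# Hypothetical page header, for illustration only.
sample_html = (
    '<html><head>'
    '<meta name="keywords" content="cataloguing, web">'
    '<meta name="DC.Title" content="Why catalogue the Internet?">'
    '</head></html>'
)
meta = extract_meta(sample_html)
```

Nothing in this process checks that the keywords describe the page, which is why metadata is so easily spammed and why the engines listed above give it little or no weight.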
Summary: how do search engines deal
with "Classic" search types?
Known item
-
E.g. Title, Author
-
It is likely that many "known item" searches can be handled by
search engines, IF the item is in the "indexable" web and the search terms are
sufficiently distinctive.
Subject search
-
Challenge: we need to find terms that define our topic in such
a way that the search engine can present a manageable list of sites closely
matching our topic.
What Can Cataloguing achieve?
Uniform vocabulary access
-
Can potentially map topic to search terms
-
Provide structured access to names of people and organisations
Selection
-
If we try to catalogue everything, we’ll run out of index
terms; but cataloguers have created MARC records for ~40M print items (Weinberg
99), so cataloguing significant web sources should be achievable.
-
Cataloguing "significant" sites may be a form of selection.
Preservation
-
If we catalogue items, they may be easier to find or recover if they
disappear or change location
Next
How should we catalogue the Web?
Standalone web directories vs OPAC integration
References:
-
Bar-Ilan,J (1998): On the overlap, the precision and estimated recall of
search engines. A case study of the query "ERDOS". Scientometrics
42(2), 207-228.
-
Bates,MJ (1998): Indexing and access for digital libraries and the Internet:
human, database, and domain factors. Journal of the American Society
for Information Science 49(13), 1185-1205.
-
Clarke,SJ; Willett,P (1997): Estimating the recall performance of Web search
engines. Aslib Proceedings 49(7, July/August), 184-189.
-
Hawking,D; Craswell,N; Thistlewaite,P; Harman,D (1999): Results and Challenges
in Web Search Evaluation. In: WWW8 Proceedings. http://www8.org/w8-papers/2c-search-discover/results/results.html
-
Lawrence,S; Giles,CL (1999): Accessibility of information on the Web. Nature
400(8 July), 107-109.
-
Pollock,A; Hockley,A (1997): What's Wrong with Internet Searching. D-Lib
Magazine (March)
(http://www.dlib.org/dlib/march97/bt/03pollock.html)
-
Poulter,A (1997): The design of World Wide Web search engines: a critical
review. Program 31(2, April), 131-145.
-
Schwartz,C (1998): Web Search Engines. Journal of the American Society
for Information Science 49(11), 973-982.
-
Weinberg,BH (1999): Improved Internet access: guidance from research on
indexing and classification. Bulletin of the American Society for Information
Science 25(2, December/January), 26-29. (http://www.asis.org/Bulletin/Jan-99/weinberg.html)
Last updated 5 April 2001 by Alastair
Smith