|
Introduction:
Recent years
have seen convergence of work in digital libraries, museums and archives
with a view to resource discovery and opening up access to digital
collections. Various projects are following standards-based approaches
building upon terminology and knowledge organization systems (Hodge
2000). Concurrently, within the Web community, there has been growing
interest in vocabulary-based techniques, with the realization of the
challenges posed by Web searching and retrieval applications. This has
manifested itself in metadata initiatives, such as Dublin Core and the
proposed W3C Resource Description Framework. To support retrieval,
provision is made in such metadata element sets for thematic keywords
from vocabulary tools such as thesauri (ISO 2788, ISO 5964). Ontologies
incorporating thesauri or related semantic models underpin diverse
ongoing projects in remote access, cross domain searching, semantic
interoperability, building RDF models and digital libraries generally (Amann and
Fundulaki 1999; Doerr and
Fundulaki 1998; Koch 2000;
Michard
and Pham-Dac 1998).
This paper
is in two parts. First we discuss a case study that explored the
retrieval potential of an augmented set of thesaurus relationships by
specializing standard relationships into richer subtypes, in particular
hierarchical geographical containment and the associative relationship.
We then locate this work in a broader context by reviewing various
attempts to build taxonomies of thesaurus relationships and conclude by
discussing the feasibility of hierarchically augmenting the core set of
thesaurus relationships, particularly the associative relationship. The
work described here was part of a larger project, OASIS (Ontologically
Augmented Spatial Information System), exploring terminology systems for
thematic and spatial access in digital library applications. One of its
aims concerned the retrieval potential of spatial metadata with rich
place name data but limited locational data (footprint). Such
representations occur in online gazetteers, geographical thesauri or
geographic name servers, when conventional GIS datasets are unavailable,
unnecessary or pose undesirable bandwidth limitations (Hill 2000;
Jones
1997).
Another
aim was to explore the potential of reasoning over the semantic
relationships in thesauri to assist retrieval. The three main thesaurus
relationships are:
- Equivalence (equivalent terms)
- Hierarchical (broader/narrower terms:
BT/NTs)
- Associative (Related Terms: RTs)
Studies support
the use of thesauri in online retrieval and the potential for combining
free text and controlled vocabulary approaches (e.g. Fidel
1991). There are various research challenges, however, including the
'vocabulary problem' - differences in choice of index term at different
times by indexers and searchers (Chen et
al. 1997). Indexer and searcher may be operating at different
levels of specificity, and at different times an indexer(s) may make
different choices from a set of possible term options. While conventional
narrower term expansion may help in some situations, a more systematic
approach to thesaurus term expansion has the potential to improve recall
in such situations. In the work described here, we have employed the
Getty Art and Architecture Thesaurus (AAT) and Thesaurus of Geographic
Names (TGN) vocabularies. Harpring
(1999) gives an overview of the Getty's vocabularies with examples of
their use in Web retrieval interfaces and collection management systems.
Examples are given of their use as a source of variant names of a
concept. It is suggested that the AAT's RT relationships may be helpful
to a user exploring topics around an information need, and the issue of
how to perform query expansion without generating too large a result set
is also raised.
Section 2
discusses our schema, illustrating how the spatial relationships in the
thesaurus can be used to provide more flexible retrieval for queries
incorporating place names. The second topic (sections 3 and 4)
concerns the use of associative thesaurus relationships in retrieval.
Existing collection management systems include access to thesauri for
cataloguing with fairly rudimentary use of thesauri in retrieval (mostly
limited to interactive query expansion/refinement and Narrower Term
expansion). In particular, there is scope for increased use of
associative (RT) relationships in thesaurus-based retrieval tools. There
is a danger that incorporating RTs into retrieval tools with automatic
query expansion may lead to very large sets of query terms. We discuss
experimental scenarios involving semantic distance measures in order to
map key issues affecting use of RTs. Section 5
reviews various taxonomies of augmented thesaurus relationships while
section 6
discusses the potential for a limited extension of the core set of
relationships. Conclusions are outlined in section 7.
2
OASIS Overview and Spatial Access Example
We adopted
collection data from the Royal Commission on the Ancient and Historical Monuments of Scotland (RCAHMS)
database of Scottish archaeological sites and historical buildings (Murray
1997). We indexed this data with thematic terms such as 'arrow',
'bronze', 'axe', 'castle', etc. from the AAT. The spatial data in the
OASIS system includes information on hierarchical and adjacency relations
between named places, in addition to place types, and (centroid)
co-ordinates. This information was taken from the TGN (Harpring
1997), augmented with data derived from the Bartholomew's (Harper
Collins 2000) digital map data for Scotland.

Figure
1. OASIS schema for Place and Artefact
The term
'ontology' has widely differing uses in different domains (Guarino
1995). Our usage here follows that of Amann and
Fundulaki (1999), in that we see an ontology as a conceptualization
of a domain, in effect providing a connecting semantic between thesaurus
hierarchies with specifications of roles for combining thesaurus
elements. The OASIS schema (Figure 1)
encompasses different versions of place names (e.g. current and
historical names, different spellings, etc.), place types (e.g. Town,
Building, Port, River, Hill), latitude and longitude coordinates, and
topological relationships (e.g. meets, part of). The schema is
implemented using the object-oriented Semantic Index System (SIS - Constantopolous
and Doerr 1993), also used to store the data, and which provided the
AAT implementation. The SIS has a meta-modelling capability and an
application interface for querying the schema. Figure 1 shows the
meta-level classification of the classes Place and Artefact.
As we discuss later in relation to RTs, relationships can be instantiated
or subclassed from other relationships. Thus, meets, overlaps,
and partOf are subclasses of Topological Relationships. The
information stored in the OASIS database can be accessed using a set of
functions through which it is possible to find all information related to
a given place, or to find all places with a given spatial relationship,
or to find objects made of a certain material at a particular location.
For example to find all the places that are part of the City of Edinburgh, the system would return a set
of all the places that are linked with a geographical partOf
relationship to the City of Edinburgh.

Figure
2. Classification of the axe artefact NMRS Acc. No. DE 121
Figure 2
shows the OASIS schema for a particular object (axes are a common type of
archaeological artefact in the RCAHMS dataset). OASIS implements a set of
thematic and spatial measures that enables query expansion to find similar
terms. Conventional GIS measures (e.g. zone buffering) could be applied
in situations where a full GIS polygon dataset is available. As discussed
earlier, however, there are situations where a GIS is not available or
unnecessary. Consider the query:
Do you have any information on
axes found in the vicinity of
Leith?
An exact match
to the query would only return axes indexed by the term Leith (a
district of the city of Edinburgh).
To search for axes found in the vicinity of Leith, spatial
distance measures can be applied to expand the geographical term Leith to spatially similar
places, where axes have been found. These places can be ranked by spatial
similarity using the part-of spatial containment relationship,
which in OASIS is based on the spatial hierarchies in the TGN[1]. As we
discuss in section 5,
this relationship is a subtype of the hierarchical thesaurus
relationship. Given the term Leith, the OASIS spatial hierarchy
distance measure would produce the list of places in Table 1,
ranked according to their spatial hierarchical similarity to Leith. Some places such as Corstorphine,
Edinburgh, Currie score highly, since (like Leith)
they are districts within the regionCity of Edinburgh.
Similarly, since Penicuik, Broxburn, Inveresk, etc.,
are places in Scotland,
they would be returned ahead of any axe finds in England.
Table 1. A list of places ranked according to
their similarity to Leith
using the Spatial Hierarchical measure
Omitted
The TGN
also provides centroid coordinate data for places/regions - used by OASIS
in a Euclidean distance measure. Table 2
shows the places of the previous table, now ranked according to their
similarity to Leith
using an integration of the spatial hierarchical and Euclidean distance
measures. In some situations (e.g. queries relating to administrative
responsibilities), administrative hierarchies may be quite relevant but
not an overriding factor in judgements of spatial similarity and thus we
provided a combination of the two measures. It can be seen in Table 2
that the rankings now also take account of Euclidean distance (for
example, Musselburgh compared to Livingstone).
Table 2. A list of places ranked using the
Spatial Hierarchical and Euclidean measures
Omitted
Euclidean
distance between centroid coordinates is less satisfactory for large-area
regions. Voronoi-based techniques can be employed with limited centroid
footprint metadata to yield a richer approximation of spatial regions.
Our larger project has investigated this boundary approximation method,
combining it with geographical thesaurus relationships (Alani et
al. in press). The method has the potential to assist a range of
spatial queries.
3
The retrieval potential of the associative relationship
A thesaurus can
act as a search aid by providing a set of controlled terms that can be
browsed via some form of hypertext representation (e.g. Bruza
1990; Pollitt
1997). This can assist the user to understand the context of a
concept, how it is used in a particular thesaurus and provide feedback on
number of postings for terms (or combinations of terms). The inclusion of
semantic relationships in the index space, moreover, provides the
opportunity for knowledge-based approaches where the system takes a more
active role in building a query by automatic reasoning over the
relationships (Cunliffe
et al. 1997; Tudhope
and Cunliffe 1999). Candidate terms can be suggested for a user to
consider in refining a query and various forms of query expansion are
possible. For example, items indexed by terms semantically close to query
terms can be included in a ranked result list and imprecise matching
between two media items is useful in 'More like this' options. The basis
for such automatic term expansion is some kind of semantic distance
measure, often based on the minimum number of semantic relationships that
must be traversed in order to connect the terms (Rada et
al. 1989). For a review of semantic distance measures and
weighting factors that have been employed, see Alani et
al. (2000).
RTs
represent a class of non-hierarchical relationships, which have been less
clearly understood in thesaurus construction and applicability to
retrieval than the hierarchical relationships. At one extreme, an RT is
sometimes taken to represent nothing more than an extremely vague
'See-also' connection between two concepts. This can lead to uncontrolled
expansion of query term sets when RT relationships are expanded, and a
potential loss in precision. Rada et
al. (1989) argue that semantic distance measures over RT
relationships can be less reliable than over hierarchical relationships,
unless the user's query can be closely linked to the RT relationship. The
basic assumption of a cognitive basis for a semantic distance effect over
thesaurus terms has been investigated by Brooks
(1997), in a series of experiments exploring the relevance
relationships between bibliographic records and topical subject
descriptors. These studies, employing the ERIC database and linked
thesaurus, involved strictly linear hierarchies, as opposed to tree
hierarchical structures (as with the AAT) or indeed poly-hierarchies.
However, the results are suggestive of the existence of some semantic
distance effect, with an inverse correlation between semantic distance
and relevance assessment, dependent on position in the subject hierarchy,
direction of term traversal and other factors. In particular, a definite
effect was observed for RTs (typically less than for hierarchical
traversal. An empirical study by Kristensen
(1993) compared single-step automatic query expansion of synonym,
narrower-term, related term, combined union expansion and no expansion of
thesaurus relationships. Thesaurus expansion was found to improve recall
significantly at some (lesser) cost in precision. Taken separately,
single-step RT expansion results did not differ significantly from NT or
synonym expansion. In another empirical study (Jones et
al. 1995), a log was kept of users' choices of relationships
interactively expanded via thesaurus navigation while entering a query.
In this study of users refining a query, a majority of terms retrieved
from the thesaurus came from RTs (the then INSPEC thesaurus contained
many more RTs than hierarchical relationships).
4
Case study of RT retrieval scenarios
This section maps key issues
affecting use of RTs in term expansion algorithms for retrieval. Results
are given from a series of scenarios applying different versions of a
semantic distance algorithm to terms in the AAT (AAT 2000).
The distance measure employed a branch and bound algorithm, with a depth
factor which reduced costs according to hierarchical depth (Tudhope
and Taylor 1997). It was implemented in C++ using the SIS function
library to query the underlying schema given in Figure 1.
For the purposes of the scenarios, the threshold used to terminate
expansion was 2.33.
Our aim
was to investigate different factors relevant to RT expansion, rather
than relative weighting of relationships. The weights for this experiment
were selected to reflect some broad consensus of previous research (see Alani et
al. 2000). Our weights (BT 3, NT 3, RT 4), taken together with a
depth factor inversely proportional to the hierarchical depth of the destination
term, assign lowest costs to NTs and favour RTs over BTs at higher depths
in the hierarchy (following an AAT editorial observation that RTs appear
to work better at fairly broad levels).
We
developed a series of experimental scenarios based around term
generalization involving RT traversal. Building on the example in section
2,
we first focus on the AAT's Objects Facet: Weapons &
Ammunition and Tools & Equipment hierarchies. The initial
scenario supposes a narrowly defined information need for items
concerning axes used as weapons (mapping to AAT term axes (weapons)).
In this initial scenario, expansion is limited and restricted to NT
relationships only (shown in plain black text in Table 3): tomahawks,
battle-axes, throwing axes, and franciscas.
The second
scenario supposes an information need for items more broadly connected
with axes used as weapons, thus allowing for some flexibility in
expansion. We first consider expansion only over hierarchical
relationships and then discuss expansion with RTs. Table 3
shows results from BT/NT expansion, with semantic distance shown for each
term (terms in green italics result from expansion over both BT and NT
relationships as opposed to strict NT expansion downwards from axes
(weapons)).
Table 3. BT/NT expansion onlyBT/NT expansion
only
|
Term
|
Distance
|
|
axes (weapons)
|
0
|
|
tomahawks
|
0.6
|
|
battle-axes
|
0.6
|
|
edged weapons
|
1
|
|
throwing-axes
|
1.1
|
|
franciscas
|
1.53
|
|
staff weapons
|
1.75
|
|
sword sticks
|
1.75
|
|
harpoons
|
1.75
|
|
bayonets
|
1.75
|
|
daggers (weapons)
|
1.75
|
|
fist-weapons
|
1.75
|
|
knives (weapons)
|
1.75
|
|
swords
|
1.75
|
Table 4
shows the effect of introducing RT expansion (new terms in blue italics).
Staff weapons relating to axes (halberds, pollaxes, gisarmes) move
up the ranking and are now below the threshold. Other new terms (such as axes
(tools), chip axes, ceremonial axes) are also introduced. These terms
could well be relevant to broader information needs or to situations when
the initial thesaurus entry term was mismatched (for example, when an
information need related more to tool use than weapons). In some
situations, however, the new terms could be seen as 'noise' and
finer-grained control of RT expansion would be desirable.
Table 4. RT and BT/NT expansion - terms in
blue italics are introduced with RT expansion

As seen in
section 5,
the ISO standard and other reviews of RTs in thesaurus practice make a
distinction between RTs within the same hierarchy and RTs between
hierarchies (or sometimes facets). One method of achieving finer control
in RT expansion is to filter on the original term's (sub)hierarchy - RTs
to terms within different sub-hierarchies are not traversed. Table 5
shows the comparison with the previous example in Table 4.
Terms in red underline (mostly from the Tools& Equipment hierarchy)
would now be excluded. Thus the effects of RT expansion within the
hierarchy have been retained while the number of additional terms has
been reduced. Other options are possible for semantic distance measures
and term expansion in general. If information on facets and hierarchies
was retained in thesaurus database implementations, it would be possible
to weight RT traversal differentially according to the hierarchies/facets
linked. On the negative side, note that in this scenario axes that are
both tools and weapons (hatchets, machetes) are excluded, since
due to the monohierarchical representation of the AAT[2] they are located within the Tools
& Equipment hierarchy. In many situations this will not be
desirable.
Table 5. RT expansion - red underlined terms
excluded when inter-hierarchical RT traversals are not allowed

The next
scenario explores an alternative approach to control RT expansion based
on selective specialization of the associative relationship according to
retrieval context.[3] The aim is to take advantage
of more structured approaches to thesaurus construction where different
types of RTs are employed. In some circumstances it may be appropriate to
consider all types of associative relationships as a generic RT for
retrieval purposes (as in the above scenarios). However, under other
contexts it may be desirable to treat RT sub-types differently,
permitting some RT traversals but forbidding or penalising (via
weighting) others. Thus, heuristics may selectively guide RT expansion, depending
on query model and session context. The AAT is particularly suited to
investigation of this topic, since its editors followed a systematic,
rule-based approach to the design of RT links (Molholt
1996). The AAT RT editorial manual specifies a set of rules to apply
to the relevant hierarchical context and scope notes in order to identify
valid RT relationships between terms when building the vocabulary or
enhancing it. This includes a set of specializations of the RT
relationships (AAT 1995;
see also extract in Table 6),
following its notation:
- 1A and 1B) Alternate hierarchical (BT/NT)
relationships (since AAT is not polyhierarchical)
- 2A and 2B) Part/Whole relationships
- 3) Several Inter/intra Facet
relationships (e.g. Agents-Activities and Agents-Materials)
- 4) Distinguished From relationship (the
scope note evidences a need to distinguish the sense of two terms)
- 5) frequently Conjuncted terms (e.g. Cups
AND Saucers).
We have extended the original SIS
AAT schema to specialize the associative relationship (see Alani et
al. 2000). RTs in our schema can optionally be treated as
specialized sub-relationships, or as generic RTs.
Table 6. Extract from AAT Related Term
Guidelines. For a full definition, see AAT(1995)

The
editorial rules for creating specific associative relationships are not
retained in electronic implementations of the AAT to date. Therefore, for
this experiment we manually specialized all RT relationships three links
away from axes (weapons) into their corresponding sub-types by following
sample extracts of AAT Editorial Related Term Sheets and applying the
editorial rules. Figure 3
shows the resulting visualization of the concept axes (weapons)
after specializing the RT relationships - note the two different subtypes
of RT. In the next scenario, the distance algorithm was set to filter on
the subtype of RT, only permitting traversal over the Alternate BT and
Alternate NT relationships. Table 7
shows the results (terms now excluded from Table 4
shown in red underline). The effect can be compared with terms excluded
by the hierarchy filtering approach in Table 5.
This scenario might correspond to a reasonably strict information request
but where some terms located in the Tools & Equipment hierarchy
were relevant. For example, an alternate NT relationship exists between tomahawks
and hatchets. Since they are classed as both tools and weapons, hatchets
might well be regarded as relevant to the scenario. Terms, such as machetes
and hatchets from the Tools & Equipment hierarchy,
were excluded when narrowly filtering on the hierarchy but are now
included. The specialization permits the AAT to be treated as
polyhierarchical for retrieval.
Figure 3,
an extract from the AAT's Tools&Equipment and Weapons&Ammunition
hierarchies, focuses on the RT relationships connecting the hierarchies[4] - (see AAT 2000
for a display of the full hierarchies). Two different types of RT are
represented. The AAT Scope Note for axes (weapons) reads:
"Cutting
weapons consisting basically of a relatively heavy, flat blade fixed to a
handle, wielded by either striking or throwing. For axes used for other
purposes, typically having narrower blades, use axes (tools)."
Thus the
associative relationship between axes (weapons) and axes
(tools) is of subtype Distinguished From (see Table 6)
and is not traversed in this scenario when filtering only on alternate
hierarchical RT subtypes. We can see in Table 7
that the term axes (tools) and tool-related terms derived solely
from this link (chip axes, cutting tools, etc.) are
excluded. Under some contexts, such terms might be considered of relevance
but in a stricter weapons-related scenario they might well be seen as
less relevant and can now be suppressed. The point is that this control
can be passed to the retrieval system.

Figure
3. Extract of AAT around axes (weapons) with specialized RT
relationships
Table
7. Filtering by RT subtype (alternate hierarchical RTs only) - red
underlined terms now excluded from Table 4

Other
scenarios illustrate the potential for filtering on other types of RT
relationship. For example, an information need relating to archery and
its equipment might justify traversal of AAT RT inter-facet subtype Activity
- Equipment Needed or Produced. This would yield the terms arrows
and bows (weapons), which could in turn be expanded to terms such
as bolts (arrows), crossbows, composite bows, longbows,
and self bows. The same approach can be applied to scenarios
relating to parts or components of an object, using the RT Whole/Part,
and Part/Whole subtypes. Thus, a query on arrows (Figure 4)
yields the terms listed in Table 8,
using an expansion threshold of 1.3. Again, for this scenario we manually
specialized all RT relationships three links away from arrows into
their corresponding sub-types by following sample extracts of AAT
Editorial Related Term Sheets and applying the editorial rules. The terms
retrieved through Alternate Hierarchical RTs and Whole/Part RTs are shown
in blue italics and green italics (Arial font) respectively. For example,
note that subparts feathers and arrowheads are included in
the results.

Figure
4. AAT visualization of arrows with RT specializations
Table 8. RT expansion: filtering on Alternate Broader/Narrower and
Whole/Part subtypes

5
Review of thesaurus relationship taxonomies
Semantic
modeling occurs in various computing domains. The standards for thesauri
and related knowledge organization systems in information science can be
distinguished from the semantic structures common in AI or database
modeling (e.g. Brachman
1983, Storey
1993[5]) by a particular emphasis on
retrieval and interoperability across different subject domains. A well
established information science tradition[6] allows software for
collection management, thesaurus representation and retrieval
applications to be shared across thesauri in different domains. The
tradition rests on the core set of thesaurus relationships (equivalence,
hierarchical and associative) mentioned earlier. The disciplining of
semantic relationships to this core set makes possible the various
aspects of interoperability by providing a stable, manageable foundation
for different types of application.
Traditional
use has tended to rely on human inspection of thesaurus representations,
for example interactively browsing term hierarchies or manually looking
up thesaurus displays in print form. Recently, with the growth of online
collections, we have seen a move to enhanced machine processing of
thesaurus representations and this has motivated a concern with extending
the core set of relationships. Examples of these new applications include
investigations of query expansion techniques in retrieval (Beaulieu
1997, Tudhope
and Taylor 1997), efforts to devise automated mapping between
different thesauri for cross domain or multi-lingual searching (Doerr
2000) and proposals for RDF representations of thesauri for the
emerging 'semantic Web' (Amann and
Fundulaki 1999; Cross et
al. 2000). Human interpretation of the context to infer the
particular instance of a relationship type or tacit rules underlying
facet structure are no longer a resource in automated traversal of the
semantic network.
Support
for this trend towards an augmented set of relationships can be found in
the ALA Subject Analysis Committee Final Report (ALA 1999),
which supported a richer set of relationships, expressed in hierarchies,
and also in the NISO Report on the Workshop on Electronic Thesauri (Milstead
1999) which advocated a "core set of relationships,
hierarchically organized" (but not any minimal set). A richer set of
relationships would assist efforts in these new application areas for
thesauri by allowing finer grained automated reasoning. It would also
converge with work on broader ontological conceptualizations, attempting
to define more formally the roles played by entities in the schema (e.g. Bechofer
2000; Bechofer
and Goble 1999). However, there is a danger that an undisciplined
expansion of the underlying semantic model would lose the battle for
interoperability. There is also a need for compatibility with the large
number of existing thesauri (the Association for Information Management
has a library of over 600 thesauri). The following discussion connects
the case study discussed in the first half of the paper with
possibilities for limited extensions to the standard set of
relationships, in particular focusing on the problems inherent in
attempting to extend the associative relationship.
5.1
Hierarchical relationships
We first
briefly consider hierarchical thesaurus relationships. There are three
commonly accepted subtypes of the hierarchical relationship (ISO 2788),
which might form a natural second level of hierarchical relationships for
consideration in any standard extension of thesaurus relationships:
- Generic (subclass/superclass)
- Instance (class/instance)
- Whole-Part (partitive): this is a
hierarchical relationship between concepts of the same type, where
'the name of the part implies the name of the possessing whole in
any context'. ISO 2788 allows four partitive cases:
- Systems and organs of the body
- Geographical location (as in our spatial
query expansion examples in section 2)
- Discipline (or field of study)
- Social structures
5.2
Associative relationships
Associative
relationships are more difficult to specify. The notion of distinguishing
subtypes of RTs has surfaced from time to time, usually with respect to
proposed editorial methods for asserting associative relationships
between terms. The 1986 ISO
Guide to establishment and development of monolingual thesauri (ISO 2788)
gives examples of several subtypes of the associative relationship,
although these are intended to be representative examples from practice
rather than any definitive list. The Standard suggests that frequently
one RT term will occur in any definition of the other (e.g. in a Scope
Note). To prevent precision being unnecessarily degraded by RTs unlikely
to be of practical use, the Standard recommends that a term linked by an
associative relationship be strongly implied by the other "according
to the frames of reference shared by users of the index". Thus RT
practice can vary according to the intended purpose of the thesaurus.
The
standard first identifies two types of term that can be linked by an
associative relationship: those belonging to the same 'category', and
those bridging categories. In practice, category is usually taken to be a
thesaurus hierarchy and thus the distinction is essentially intra
versus inter hierarchy (as in the discussion around Table 5
in section 4).
RTs should not usually occur between sibling terms, since there is a
strong hierarchical connection between two siblings. (However, it can be
appropriate to make a RT between two siblings when there exists a
particularly strong relationship between them which does not extend to
the other siblings - the AAT RT subtype Distinguished From often
relates siblings.) Without intending to be exhaustive, the Standard goes
on to list typical examples of inter-hierarchical RTs:

Aitchison
& Gilchrist (1987, p44) suggest some other RT subtypes in their
influential explication of the Standard:

It is
possible to identify broader groupings of the above associative
relationships. For example, we might distinguish broad partitive, causal,
entity-property groups and a group corresponding to the AAT inter-facet
relationships. However, even this simple grouping includes overlapping
categories, illustrating the difficulties of creating a simple
hierarchical arrangement of RT subtypes.
As an example
of a highly specialized thesaurus, the UK National Railway Museum (part
of the National Museum of Science and Industry) and Museum Documentation
Association are constructing a railway terminology thesaurus. As part of
this ongoing effort, we contributed a set of editorial guidelines on RT
construction, drawing on the Getty AAT guidelines for RTs discussed
earlier, which included a tailored version of the RT subtypes. The
initial railway thesaurus will comprise different hierarchies but will
not be a faceted thesaurus. Thus AAT inter-facet relationships were not
relevant for the purposes of the thesaurus. The railway terminology
editorial group agreed on a similar set of RT subtypes to the AAT
guidelines, but collapsed the inter-facet relationships to one shared
operational context and also included a causal relationship.
Medical
information retrieval has seen a significant concentration of thesaurus
related research, with influential medical thesauri like MeSH (MeSH 2000)
forming part of online databases such as Medline. The US National Library
of Medicine Unified Medical Language System® (UMLS 2000)
is a metathesaurus bridging over 50 biomedical vocabularies including
different language versions of MeSH. A metathesaurus concept has
attributes, notably the higher-level semantic type or category to which
it belongs together with its hierarchical context in corresponding source
vocabularies. Examples of semantic types are Biologic function,
Organism, Mammal. A UMLS Semantic Network defines 54 links
(relationships) between the semantic types, the most common being the isa
relationship which establishes hierarchies. There is also a set of five
'non-hierarchical relationships', themselves arranged in a hierarchical
fashion, which may be seen as serving the function of associative
relationships in the metathesaurus:

There are several relationships
subsumed under Functionally related to, of which one is affects,
which captures various relationships associated with medical
intervention or interaction of entities - causal relationships are
important in the medical domain. The affects relationship has six
children:

Thus for
the purposes of the medical metathesaurus, we have a semantic typing of
concepts, more elaborate than the structure of most faceted thesauri,
combined with a fairly deep hierarchy of relationship types, specialized
for the purposes of the domain. Space and time are foregrounded as
primary subtypes of the non-hierarchical set of relationships.
In the
library domain, a subcommittee of the American Library Association (ALA)
produced a Report (ALA 1999)
on relationships between subjects and how they might be represented.
There was a recommendation that systems should include specific
relationships with a view to facilitating the development of more
intelligent subject access approaches to controlled vocabulary retrieval
systems. In particular, Appendix B presents a hierarchical taxonomy of
relationship types, based on Michel's extensive review of the literature,
with definitions of over 100 associative relationships. In the taxonomy,
there are nine first-level subtypes of associative relationships:

These are
in turn broken down further, to varying depth. Two major subtypes echo
the distinction made in ISO2788: Different hierarchy associative
relationships and Same hierarchy associative relationships,
both with relatively deep hierarchies. These are partially expanded below
(note this example does not show details of sublevels for all
relationships)

This large
set of RT subtypes constitutes a valuable resource. It is intended to
capture the rich diversity of thesaurus practice, rather than forming any
structured design proposal. However, the reason for the restriction of Causal
and Part-Whole relationships to Same hierarchy is unclear.
Some subtypes appear to represent fairly weak associative relationships
(eg Combined ideas, Conceptually related terms, Similarity), which
might well be subsumed into a generic high-level RT relationship in any
attempt at mapping to a core subset of RTs for retrieval purposes.
Several subtypes capture different aspects of closely, but not
completely, overlapping meaning (e.g., Meaning overlap, Scope issues,
as does the AAT Distinguished from) and could be grouped under
that broad sub-heading for retrieval purposes. A large number of
relationships reflect pairings of concepts from different
facets/hierarchies, which can be seen as representing different semantic
categories.
6
Is an extended core set of associative relationships possible?
This (not
exhaustive) review of RTs illustrates the practical difficulties in
extending the current loose definition of the associative relationship to
more precise hierarchies of RT relationships. Medical thesauri stand at
the more complex end of a continuum of thesaurus domains, which also
includes specialized smaller thesauri which may have no facet structure
and only specify the three basic thesaurus relationships. If we include
too many subtypes, then interoperability of thesaurus mapping and
retrieval software may be lost. In fact, it may not prove possible to
create one single extension of thesaurus relationships, but instead there
may need to be different standard extensions for different domains, e.g.
medical, digital library, commerce, etc.
However,
one possible approach might be to aim for a limited extension of RTs at a
second level and to expect domain specialization at lower levels. This
would permit some degree of interoperability in advanced thesaurus-based
applications involving automated traversal of a richer set of
relationships for term expansion. If the three standard thesaurus
relationships formed the top level of a hierarchical structure then any
such new applications would retain compatibility with the large number of
existing thesauri (and indexed collections) where it is infeasible to
augment the core relationships.
Many of
the taxonomies of RT subtypes discussed above have their origins in
editorial guidelines for creating RTs, rather than being attempts at
refining the semantics of RTs for retrieval purposes (our concern in this
paper). Therefore, it may prove useful to logically separate practical
heuristics or methods for identifying or creating RTs, such as the
occurrence of one term in another's definition or Scope note, from the
semantic meaning of the relationship. Documenting practical techniques
for creating RTs is important but should be a separate activity. This
might also reduce the need for some RT subtypes, such as Definitional
above[8]. As a very basic
illustration, our RT guidelines (derived from the AAT editorial
guidelines) for non-specialist editors constructing the railway thesaurus
are included the checklist in Appendix
1.
More
fundamentally, in section 5, the
distinction between intra and inter facet/hierarchy relationships is
common but can lead to apparently illogical contrasts between and within
systems. For example, should Part-whole and Causal-type
relationships be assigned to one or the other, or both? Furthermore, in
many systems, several subtypes of RT address various forms of
relationships between categories represented by different facets, for
example Agent-Process, Process-Product, Material-Product, etc.
Rather than making an a priori distinction between intra and inter
facet/hierarchy relationships as the basis for a classification of RT types,
it may be useful to broaden our focus. To this end, we can distinguish
the meaning of a relationship from the semantic category of the two
concepts involved. For example, philatelists and postage stamps
are two terms in the AAT connected by an associative relationship
regarding products used by agents. We can easily separate the underlying
concepts as to their semantic category or type. One concept involves some
notion of an agent or type of person and the other is some kind of
object. Thus we might foreground the category of the concept as an
explicit aspect of the relationship in its own right.
This could
result in a hierarchy of semantic categories, with generic Concepts
at the bottom level (default for terms in simple systems), belonging to
various Hierarchies (and sub-hierarchies/minor facets) at the next level
and with a set of Facets as the broadest level - see examples
below. An explicit representation of the semantic category of a term from
a standard set of categories would be valuable to the new application
areas, mentioned in the introduction to section 5,
which seek to process thesaurus-based metadata automatically.
Separating
out the semantic category of concepts from thesaurus relationships may be
useful in its own right, but could potentially yield an additional
benefit. If (following the faceted approach taken by the AAT editors) the
semantic category of a concept is taken as a dimension separate from the
type of relationship then a smaller number of inter-facet RT
relationships might suffice. The same Causal, Uses/Requires, Spatial
or Temporal relationship might, at a high level, connect various
categories of concepts[9]. Conceivably, this might
permit a restricted second level core set of RT subtypes to be applicable
across some range of thesaurus domains (although this would need
investigation). These relationships could themselves be refined into
richer subtypes when the purposes of the thesaurus warranted. While this
could be seen as shifting effort on to the identification of standard
categories or facets, it can be argued that there is already a fair
degree of agreement in this area.
For
example, the ISO Standard refers to implicit categories of concepts which
can assist an editor, say in compiling hierarchies. Examples of general
categories given by the Standard include Concrete entities (such as
Things and Materials), Abstract Entities (such as Events and Units of
measurement) and Individual entities (proper nouns). When such categories
are represented in thesauri, they are usually identified as facets, each
facet with its own hierarchical sub-divisions. Facet analysis[10] has been a long-standing
technique in thesaurus construction; concepts are decomposed into
elemental classes, or facets, which form homogenous, mutually exclusive
groups (Aitchison
and Gilchrist 1987). Faceted thesauri or classification systems
include MESH, BLISS, PRECIS and the AAT. For example, the AAT ( Soergel
1995) is organized into seven facets (and 33 hierarchies as
subdivisions): Associated concepts, Physical attributes, Styles and
periods, Agents, Activities, Materials, Objects and optional facets for
time and place. Categories such as Agent, Event, Material, Object, Time
and Place are likely to be common to many thesauri.
As one
possible example of an extended set of associative relationships, Table 9
contains a broad grouping of RT subtypes (other groupings are also
possible), which could be combined with a specification of a term's
semantic category.
Table 9. Example of broad groupings of RT
subtypes

To take
examples from the AAT, RT relationships between Agents and Materials,
Agents and Products, Materials and Objects
might all be represented by Causal subtypes in the above grouping. The
semantic categories of the concepts involved further define the nature of
the relationship. For example, an RT of type Causal/Uses (if the grouping
in Table
9 were used) could be applied to various AAT RT subtypes, provided
that additional context was provided by the semantic category of the
concepts involved. An RT (of type 3Q: Activity - Equipment needed or
produced) exists in the AAT between arrows and archery. A
specification of the relationship would include the categories of the two
concepts (Objects/Weapons&Ammunition/arrows and Activities/Physical
Activities/archery). An automated traversal application would at the
least be able to ascertain that the relationship asserted that a
particular kind of object was used in an activity. The relationship could
be refined hierarchically if appropriate for a particular thesaurus.
Similarly, a Causal/Uses RT could be employed for an AAT RT (of type 3T:
Locational setting - Equipment used or produced) connecting airports
(Objects/Built Complexes & Districts) with aircraft
(Objects/Transportation Vehicles).
7
Conclusions
It may be
impractical to expect non-specialist users to manually browse very large
thesauri (for example, there are 1792 terms in the AAT's Tools&Equipment
hierarchy). Semantic distance measures operating over thesaurus
relationships can underpin interactive and automatic query expansion
techniques. Ranked lists of candidate terms can assist query expansion or
automatic ranking of information items in retrieval, thesaurus mapping
and semantic Web applications.
Online
gazetteers and geographical thesauri may not contain coordinate data for
all places and regions or, if they do, associate place names with a
limited spatial footprint (e.g. centroid or minimum bounding rectangle).
In such situations, the ability to rank places within a vicinity
according to hierarchical (or other) relationships in a spatial
terminology system can be useful. Section 2
provides examples of the operation of semantic distance measures over
hierarchical spatial-partitive thesaurus relationships. In contexts where
administrative boundaries are highly relevant, distance measures could
combine quantitative and qualitative spatial relationships.
Related
work has highlighted the contribution of RTs to thesaurus search aids but
has noted the potential for an uncontrolled increase in query term sets
and a loss in precision (in cases where there is a specific search goal).
Experimental scenarios (section 4)
exploring different factors relating to incorporation of RTs in semantic
distance measures suggest a potential for filtering on the hierarchical
context of an RT link in faceted thesauri and for filtering on subtypes
of RT relationships. Specializing RTs allows the possibility of
dynamically linking RT type to query context. In practice, it is likely
that a combination of heuristics will be useful. In general, more control
can be transferred to the retrieval system to selectively traverse RT
relationship or to weight them differently. The ability in retrieval to
either specialize RTs or to treat them as generic retains the advantages
of the standard minimal set of thesaurus relationships for
interoperability purposes, while allowing an option of a richer set of RT
sub-relationships.
We have
suggested the possibility of enriching the specification and semantics of
RT relationships, while maintaining compatibility with traditional
thesauri, via a limited hierarchical extension of the associative (and
hierarchical) relationships. This would be facilitated by distinguishing
the type of term from the (sub)type of relationship and explicitly specifying
semantic categories for terms following a faceted approach. It may also
be useful to make a distinction between heuristics for
identifying/creating RTs in thesaurus construction, such as the
occurrence of one term in another's definition or Scope note, from the
semantic meaning of the relationship for retrieval purposes.
There are
implications for thesaurus developers and implementers. A systematic
approach to RT application in thesaurus design, as in the AAT, has
potential for retrieval systems. Information (such as relationship
subtypes) used in thesaurus design should be retained in data models and
database design for later use in retrieval algorithms. Various
possibilities exist for the user to characterize information need. In
future work, we intend to explore utility and usability issues concerned
with the incorporation of semantic distance controls in the search system
user interface.
Acknowledgements
An early version of this paper was
presented at the European Conference on Digital Libraries, Lisbon,
2000.
We would
like to acknowledge the support of the UK Engineering and Physical
Sciences Research Council (grant GR/M66233/01) and the support of HEFCW
for the Internet Technologies Research Lab. We would like to thank the J.
Paul Getty Trust for provision of their vocabularies and in particular
Patricia Harpring and Alison Chipman for information on the AAT and
Related Terms; Diana Murray and the Royal Commission on the Ancient and
Historical Monuments of Scotland for provision of their dataset; Martin
Doerr and Christos Georgis from the FORTH Institute of Computer Science
for assistance with the SIS; and helpful suggestions from the organisers
and participants at the NKOS Workshop at ECDL2000; and Ceri Binding and
Daniel Cunliffe from the Hypermedia Research Unit at the University of
Glamorgan. Use of the Getty Vocabularies is subject to the terms of their
licenses.
References
AAT (1995) The
AAT Editorial Manual: Related terms. User Friendly, 2(3-4), 6-15.
J. Paul Getty Trust.
AAT (2000) http://www.getty.edu/gri/vocabularies/index.htm
Aitchison,
J. and Gilchrist, A. (1987) Thesaurus construction: a practical manual
(ASLIB: London)
ALA 1999. Final Report
to the ALCTS/CCS Subject Analysis Committee. Greenberg J., Hemmasi H.,
Kuhr P., Michel D., Riel S., Strawn G., Wool G., El-Hoshy L. http://www.ala.org/alcts/organization/ccs/sac/rpt97rev.html
Alani, H., Jones,
C. and Tudhope, D. (2000) "Associative and Spatial Relationships in
Thesaurus-based Retrieval". Proceedings of the Fourth European
Conference on Research and Advanced Technology for Digital Libraries
(ECDL2000), edited by J. Borbinha and T. Baker, Lecture Notes in
Computer Science (Berlin: Springer), pp. 45-58
Alani, H.,
Jones, C. and Tudhope, D. (2001) "Voronoi-based region approximation
for geographical information retrieval with online gazetteers". International
Journal of Geographical Information Science, in press
Amann, B. and
Fundulaki, I. (1999) "Integrating
ontologies and thesauri to build RDF schemas". Proceedings of the
3rd European Conference on Digital Libraries (ECDL'99), edited by S.
Abiteboul and A. Vercoustre, Lecture Notes in Computer Science 1696
(Berlin: Springer-Verlag), pp. 234-253
Beaulieu, M.
(1997) "Experiments on interfaces to support query expansion". Journal
of Documentation, 53(1), 8-19
Bechofer, S.
(2000) "OIL: The Ontology Inference Layer". Special Workshop
on Networked Knowledge Organization Systems, Fourth European
Conference on Research and Advanced Technology for Digital Libraries
(ECDL2000)
Bechofer, S.
and Goble, C. (1999) "Classification Based Navigation and Retreval
for Picture Archives". Proceedings of IFIP WG2.6 Conference on
Data Semantics, Rotorua,
New Zealand
Brachman, R.
(1983) "What IS-A is and isn't: An analysis of taxonomic links in
semantic networks". IEEE Computer, 16(10), 30-36
Brooks, T.
(1997) "The relevance aura of bibliographic records". Information
Processing and Management, 33(1), 69-80
Bruza, P.
(1990) "Hyperindices: A novel aid for searching in hypermedia".
Proceedings of the ACM European Conference on Hypermedia Technology,
pp. 109-122
Chen, H., Ng,
T., Martinez,
J. and Schatz, B. (1997) "A concept space approach to addressing the
vocabulary problem in scientific information retrieval: an experiment on
the Worm Community System". Journal of the American Society for
Information Science, 48(1), 17-31
Constantopolous,
P. and Doerr, M. (1993) "The Semantic Index System - A brief
presentation". Institute
of Computer Science
Technical Report. FORTH-Hellas, GR-71110 Heraklion, Crete
Cross, P.,
Brickley, D. and Koch, T. (2000) Conceptual relationships for encoding
thesauri, classification systems and organised metadata collections and a
proposal for encoding a core set of thesaurus relationships using an RDF
Schema. http://www.desire.org/results/discovery/rdfthesschema.html
Cunliffe,
D., Taylor, C. and Tudhope, D. (1997) "Query-based navigation in
semantically indexed hypermedia". Proceedings of the 8th ACM
Conference on Hypertext, pp. 87-95
Doerr, M.
(2000) "Semantic problems of thesaurus mapping". Special
Workshop on Networked Knowledge Organization Systems, Fourth European
Conference on Research and Advanced Technology for Digital Libraries
(ECDL2000).
Doerr, M. and
Fundulaki, I. (1998) "SIS-TMS: A
thesaurus management system for distributed digital collections". Proceedings
of the 2nd European Conference on Digital Libraries (ECDL'98), edited
by C. Nikolaou and C. Stephanidis, Lecture Notes in Computer Science 1513
(Berlin: Springer-Verlag) pp. 215-234
Fidel, R.
(1991) "Searchers' selection of search keys (I-III)". Journal
of American Society for Information Science, 42(7), 490-527
Guarino, N.
(1995) "Ontologies and knowledge bases: towards a terminological
clarification". In Towards very large knowledge bases: knowledge
building and knowledge sharing (IOS Press), pp. 25-32
Bartholomew
Mapping Solutions. http://www.bartholomewmaps.com/
Harpring, P.
(1997) "The limits of the world: Theoretical and practical issues in
the construction of the Getty Thesaurus of Geographic Names". Proceedings
of the 4th International Conference on Hypermedia and Interactivity in
Museums (ICHIM'97) Archives and Museum Informatics, pp. 237-251
Harpring, P.
(1999) "How forcible are the right words: overview of applications
and interfaces incorporating the Getty vocabularies". Proceedings
of Museums and the Web 1999, Archives and Museum Informatics. http://www.archimuse.com/mw99/papers/harpring/harpring.html
Hill, L. (2000)
"Core elements of digital gazetteers: placenames, categories, and
footprints". Proceedings of the 4th European Conference on
Research and Advanced Technology for Digital Libraries, edited by J.
Borbinha, T. Baker, Lecture Notes in Computer Science (Berlin: Springer),
pp. 280-290
Hodge, G.
(2000) "Systems of Knowledge Organization for Digital Libraries:
Beyond Traditional Authority Files". The Digital Library Federation
Council on Library and Information Resources. http://www.clir.org/pubs/abstract/pub91abst.html
Koch, T. (2000)
"Quality-controlled subject gateways: definitions, typologies,
empirical overview". Online Information Review, 24(1), 24-34
ISO (1986)
"Guidelines for the establishment and development of monolingual
thesauri". ISO 2788 (BS 5723).
Jones, C.
(1997) "Geographic Interfaces to Museum Collections". Proceedings
of the 4th International Conference on Hypermedia and Interactivity in
Museums (ICHIM'97) Archives and Museum Informatics, pp. 226-236
Jones, S.
(1993) "A Thesaurus Data Model for an Intelligent Retrieval
System". Journal of Information Science 19: 167-178
Jones, S.,
Gatford, M., Robertson, S., Hancock-Beaulieu, M., Secker, J. and Walker,
S. (1995) "Interactive Thesaurus Navigation: Intelligence Rules
OK?" Journal of the American Society for Information Science,
46(1), 52-59
Kristensen,
J. (1993) "Expanding end-users' query statements for free text
searching with a search-aid thesaurus". Information Processing and
Management, 29(6), 733-744
MeSH 2000,
Medical Subject Headings. http://www.nlm.nih.gov/mesh/meshhome.html
Michard, A.
and Pham-Dac, G. (1998) "Description of Collections and
Encyclopaedias on the Web using XML". Archives and Museum
Informatics, 12(1), 39-79
Milstead, J.
(1999) Report on NISO Workshop on Electronic Thesauri: Planning for a
Standard. http://www.niso.org/thes99rprt.html
Molholt, P.
(1996) "Standardization of inter-concept links and their
usage". Proceedings of the 4th International ISKO Conference,
Advances in Knowledge Organisation (5), pp. 65-71
Murray, D. (1997)
"GIS in RCAHMS". MDA Information, 2(3): 35-38
Pollitt, A.
(1997) "Interactive information retrieval based on facetted
classification using views". Proceedings of the 6th International
Study Conference on Classification, London
Rada, R., Mili,
H., Bicknell, E. and Blettner, M. (1989) "Development and Application
of a Metric on Semantic Nets". IEEE Transactions on Systems, Man
and Cybernetics, 19(1), 17-30
Rada, R.,
Barlow, J., Potharst, J., Zanstra, P. and Bijstra, D. (1991)
Document ranking using an enriched thesaurus. Journal of
Documentation, 47(3), 240-253
Soergel, D.
(1995) "The Art and Architecture Thesaurus (AAT): a critical
appraisal". Visual Resources, 10(4), 369-400
Storey, V.
(1993) "Understanding semantic relationships". VLDB Journal,
2, 455-488
Tudhope, D.
and Taylor, C. (1997) "Navigation via Similarity: automatic linking
based on semantic closeness". Information Processing and
Management, 33(2), 233-242
Tudhope, D.
and Cunliffe, D. (1999) "Semantic index hypermedia: linking
information disciplines". ACM Computing Surveys, Electronic
Symposium on Hypertext and Hypermedia, 31(4es) http://www.acm.org/pubs/contents/journals/surveys/1999-31/#4es
Tyler, S. (ed.) (1969) Cognitive anthropology (New
York: Holt, Rinehart and Winston)
UMLS 2000,
Unified Medical Language System. http://www.nlm.nih.gov/research/umls/umlsmain.html
Appendix
1. Example of checklist for constructing RTs (our extension of Getty RT
guidelines)

Footnotes
[1]
Note that in the TGN (and other thesauri) hierarchical relationships for
place names represent purely administrative hierarchies and not other
possible nestings, such as watersheds, ecological regions, etc. We have
experimented with a poly-hierarchical similarity measure which takes
account of a place's membership in multiple hierarchical dimensions (such
as when an ecological region intersects several administrative areas),
but more work on the utility of such measures is required.
[2] The AAT is conceptually polyhierarchical but
is currently physically monohierarchical due to the original database
software employed in the project. There are plans to port it into a
polyhierarchical data structure.
[3] This is in keeping with the recommendation of
Rada et
al. (1991) that automatic expansion of non-hierarchical
relationships be restricted to situations where the type of relationship
can be linked with the particular query, and also with Jones'
(1993) discussion of using sub-classifications to help distinguish
relationships according to strength.
[4] Note that staff weapons not connected by an
RT to Axes (weapons) have been omitted due to space restrictions.
[5] Storey
(1993) reviews the use of semantic relationships in data modelling
and automated database design tools. She discusses a taxonomy of seven
types of relationships, including various partitive (meronymic)
relationships, and derives guidelines for representing them in a
relational data model.
[6] The social aspect is also important;
educational material and training practices distribute techniques for
cataloguing and searching widely and promote good practice. Standards
must also facilitate the modification of knowledge organisation systems,
as the corresponding field of study evolves.
[7]
Distinguished from the hierarchical partitive relationship in that the
terms linked belong to different categories.
[8] It may be that editorial RT subtype
definitions would be retained separately in editorial guidelines for
constructing thesauri.
[9] Simplification via the intersection of
different ordering principles has parallels in other domains. For
example, anthropologists have investigated how cultures organise, and group
the cognitive principles underlying social behaviour. Tyler's
(1969) classification of the underlying semantic structures observed
in cognitive anthropology included the familiar taxonomic hierarchical
relationship. He also identified a 'paradigm' ordering principle, a
non-hierarchical ordering which cuts across levels of taxonomic
hierarchies by multiple intersections. For example, the attributes gender
(male, female) and maturity (child, adolescent, adult) intersect with a
mammal hierarchy to yield concepts such as mare, colt, boar, etc.
[10] The faceted approach to subject analysis
began in 1933 with Ranganathan's Colon Classification (Personality,
Matter, Energy, Space and Time) and was subsequently elaborated by the
British Classification Research Group.
|