1 Introduction
In this paper, 'terminology services' is used to describe Web services involving various types of knowledge organization resources, including authority files, subject heading systems, thesauri, Web taxonomies, and classification schemes. The term 'vocabulary' is used to refer to these knowledge organization resources. Vocabularies with associations to other schemes will be a key component of Web-based terminology services. Web services are modular, Web-based, machine-to-machine applications that can be combined in various ways. For background information on Web services, see Gardner (2001) and Tennant (2002). Web services can be accessed at various points in the metadata lifecycle, for example, when a work is authored or created, at the time an object is indexed or cataloged, or during search and retrieval. A Web service that provides mappings from a term in one vocabulary to one or more terms in another vocabulary is an example of a terminology service.
2 Vocabulary compatibility
- Extent of overlap in the subject matter
- Level of specificity of terms
- Degree of pre/post-coordination
- How the vocabulary codes equivalence, hierarchical, and other relationships
- Differences in word use, e.g. common versus scientific names (Doerr 2001, Olson and Strawn 1997)
- Differences in meaning resulting from different classifications of terms (Doerr 2001, Whitehead 1990)
- Direct mapping: establishing equivalence between terms in different controlled vocabularies or between verbal terms and classification numbers.
- Co-occurrence mapping: establishing mappings from the co-occurrence of terms from different schemes in the same metadata or catalog record.
3 OCLC vocabulary projects
3.1 Dewey mappings
In 1994, OCLC staff began linking Library of Congress Subject Headings (LCSH) to the Dewey Decimal Classification (DDC) scheme. DDC/LCSH pairs were generated from OCLC WorldCat records that contained both DDC numbers and LCSH. Co-occurrence mappings were made for frequently occurring pairs. Later, an association measure was introduced in the co-occurrence mapping process to provide a better indicator of association than simple pair frequencies (Vizine-Goetz 1998). Approximately 90,000 co-occurrence mappings have been made in WebDewey, the electronic version of the DDC. An example of DDC/LCSH co-occurrence mappings is shown below for DDC class 617.522 Oral region-surgery:
LC Subject Headings |
---|
Cleft lip |
Cleft lip-Surgery |
Cleft palate |
Mouth-Diseases |
Mouth-Microbiology |
Mouth-Surgery |
Oral medicine |
Temporomandibular joint-Diseases |
The mapped LCSH provide additional indexing vocabulary for the electronic version of the DDC and also assist catalogers in assigning subject headings. These terms are also included in versions of the DDC used in automated classification services.
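The co-occurrence process described in this section can be sketched roughly as follows. This is an illustration only, not OCLC's implementation: the record layout, the function name `cooccurrence_pairs`, and the sample records are invented, and pointwise mutual information (PMI) is used purely as a stand-in because the text does not name the association measure introduced in Vizine-Goetz (1998).

```python
import math
from collections import Counter
from itertools import product

def cooccurrence_pairs(records):
    """Count DDC/LCSH pairs and score each pair with PMI (an assumed measure).

    `records` is an iterable of (ddc_numbers, lcsh_headings) tuples, a
    hypothetical stand-in for catalog records that carry both fields.
    """
    pair_counts, ddc_counts, lcsh_counts = Counter(), Counter(), Counter()
    total = 0
    for ddc_numbers, lcsh_headings in records:
        total += 1
        ddc_set, lcsh_set = set(ddc_numbers), set(lcsh_headings)
        ddc_counts.update(ddc_set)
        lcsh_counts.update(lcsh_set)
        pair_counts.update(product(ddc_set, lcsh_set))
    scored = {}
    for (ddc, lcsh), n in pair_counts.items():
        # PMI = log2( P(ddc, lcsh) / (P(ddc) * P(lcsh)) )
        pmi = math.log2(n * total / (ddc_counts[ddc] * lcsh_counts[lcsh]))
        scored[(ddc, lcsh)] = (n, pmi)
    return scored

# Invented sample records, for illustration only
records = [
    (["617.522"], ["Cleft palate", "Mouth-Surgery"]),
    (["617.522"], ["Mouth-Surgery"]),
    (["617.522"], ["Surgery"]),
    (["617"], ["Surgery"]),
]
scores = cooccurrence_pairs(records)
for pair, (count, pmi) in sorted(scores.items()):
    print(pair, count, round(pmi, 3))
```

In this toy data, "Cleft palate" and "Surgery" each co-occur once with 617.522, yet only "Surgery" receives a negative score, because it is not specific to that class. This is the kind of distinction that simple pair frequencies cannot make.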
3.2 Other mappings
From | To | |||||||
---|---|---|---|---|---|---|---|---|
Vocabulary | DDC | ERIC | GSAFD | LCC | LCSH | LCSHac | MeSH | NLMC |
DDC (Dewey Decimal Classification) | Direct | Direct & Co-occur | Direct & Co-occur | Direct | Direct | |||
ERIC Thesaurus | Direct | |||||||
GSAFD (Genre terms for fiction) | Direct | Direct | ||||||
LCC (Library of Congress Classification) | Direct | |||||||
LCSH (LC Subject Headings) | Direct & Co-occur | Direct | Direct | Co-occur | Direct | |||
LCSHac (LC Children's Headings) | Direct & Co-occur | |||||||
MeSH (Medical Subject Headings) | Direct | Direct | ||||||
NLMC (National Library of Medicine Classification) | Direct |
The GSAFD vocabulary terms with mappings are accessible using the OAI-PMH. The OAI protocol specifies a simple HTTP-based protocol for automated sharing of metadata, but as the OAI-Cat effort has shown, the approach works equally well for sharing other XML content. The content of the GSAFD records is MARC in XML (MARC Standards). The records are accessible to users via a browser (http://alcme.oclc.org/gsafd/) and to machines through the OAI-PMH Web services mechanisms. See Van de Sompel et al. (2003) for a more complete description of how the file can be accessed using the OAI-PMH. The GSAFD/LCSH mapping file can also be downloaded from our project Web site. The file is encoded in MARC in XML and also according to version 0.5 of the Zthes schema. We have also prototyped some experimental Web services using co-occurrence mappings between the GSAFD vocabulary and LCSH.
3.3 Mapping to LCSH
"Despite the weaknesses and the critical assessments that have plagued LCSH over the years, the fact remains that LCSH is the standard vocabulary used by the majority of information resources, especially libraries, in the United States."
She also notes that efforts to improve or replace LCSH must take into account its widespread use and the probability that it will be maintained for a long time. Others have reached similar conclusions. For example, the FAST project sponsored by OCLC selected LCSH as the basis for creating a faceted vocabulary for metadata. O'Neill and Chan (2003) cite the following reasons for choosing the LCSH scheme:
- [LCSH] is by far the most commonly used and widely accepted subject vocabulary for general application.
- It is the de facto universal controlled vocabulary and has been translated or adapted by many countries around the world.
- It is the largest general indexing vocabulary in the English language.
3.4 Vocabulary encoding standards
- Library of Congress Subject Headings (LCSH) > 277,000/263,524 concepts/terms
- The Getty vocabularies (Art & Architecture Thesaurus; Union List of Artist Names; Thesaurus of Geographic Names) > 1.6 million concepts/names/terms
- Medical Subject Headings (MeSH) > 21,973/125,858 concepts/terms
- Canadian Subject Headings (CSH) > 6,000 concepts
- Library of Congress Classification data (LCC) > 595,000 categories
In the remainder of this paper we describe our approach to mapping the ERIC Thesaurus to LCSH. The ERIC Thesaurus was chosen because it is a well-established vocabulary, publicly accessible on the Web, and large enough to provide a meaningful test of our mapping approach. The ERIC Thesaurus is produced by the Educational Resources Information Center, an education information network, sponsored by the U.S. Department of Education, and provides public access to education literature (ERIC 2004).
4 Mapping the ERIC Thesaurus to LCSH
4.1 Converting ERIC to MARC
Multiple instances of broader terms (BT), narrower terms (NT), and related terms (RT) stored in single ERIC fields are encoded as separate fields in the MARC format (Figure 2). The RT field shown below generates 14 fields in the MARC record. These are the fields labeled with MARC tag 550 (without $w subfields). The field labeled UF is similarly converted into two MARC fields (tag 450). One of the terms, Student ability, represents a formerly valid term. The notation in parentheses in the ERIC record indicates this and gives the lifespan of the term. When this data is converted to MARC, a 688 field (Application History Note) is constructed for this data. In the 450 field, subfield $w is added to indicate the term was formerly valid. By encoding the source and target vocabularies in the MARC Authorities Format we are able to standardize the representation of similar information and improve our ability to match vocabularies.
<TERM> | Academic Ability |
<SCOPE> | The degree of actual competence to perform in scholastic or educational activities (Note: For potential competence, use "Academic Aptitude" -- for measured achievement, use "Academic Achievement") |
<RT> | Ability Grouping; Academic Achievement; Academic Aptitude; Academic Aspiration; Academically Gifted; Aptitude Treatment Interaction; Cognitive Ability; College Entrance Examinations; High Risk Students; Intelligence; Scholarship; Spatial Ability; Student Characteristics; Verbal Ability |
<BT> | Ability |
<UF> | Scholastic Ability; Student Ability (1966 1980) |
<GROUP> | 120 |
<TYPE> | Main |
<ADD> | 07/01/1966 |
001 | ERIC00025 |
003 | OCoLC-O |
005 | 20031117154238.0 |
008 | 031118 n|a|znn|bb||||||||||| ||an| ||| d |
040 | $beng$cOCoLC-O$dOCoLC-O$eericd |
072 | $a120 |
150 | $aAcademic Ability |
450 | $aScholastic Ability |
450 | $wa$aStudent Ability |
550 | $aAbility Grouping |
550 | $aAcademic Achievement |
550 | $aAcademic Aptitude |
550 | $aAcademic Aspiration |
550 | $aAcademically Gifted |
550 | $aAptitude Treatment Interaction |
550 | $aCognitive Ability |
550 | $aCollege Entrance Examinations |
550 | $aHigh Risk Students |
550 | $aIntelligence |
550 | $aScholarship |
550 | $aSpatial Ability |
550 | $aStudent Characteristics |
550 | $aVerbal Ability |
550 | $aAbility$wg |
680 | $iThe degree of actual competence to perform in scholastic or educational activities (Note: For potential competence, use "Academic Aptitude" -- for measured achievement, use "Academic Achievement") |
688 | $aStudent Ability (1966 1980) |
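The conversion rules described in this section can be sketched as follows. This is a simplified illustration, not the conversion software used in the project: the function names, the dictionary layout for the ERIC record, and the lifespan pattern are assumptions, and control fields (001 to 040) are omitted. The `$wh` code for narrower terms follows the MARC authority convention and does not appear in the example record above.

```python
import re

# Lifespan note such as "(1966 1980)" at the end of a formerly valid term
LIFESPAN = re.compile(r"\s*\(\d{4} \d{4}\)$")

def split_terms(value):
    """Split a semicolon-delimited ERIC field into individual terms."""
    return [t.strip() for t in value.split(";") if t.strip()]

def eric_to_marc(eric):
    """Return (tag, subfields) pairs for one ERIC thesaurus record."""
    fields, history = [], []
    if eric.get("GROUP"):
        fields.append(("072", "$a" + eric["GROUP"]))
    fields.append(("150", "$a" + eric["TERM"]))
    for term in split_terms(eric.get("UF", "")):
        if LIFESPAN.search(term):
            # Formerly valid term: $wa on the 450, full note kept in a 688
            fields.append(("450", "$wa$a" + LIFESPAN.sub("", term)))
            history.append(("688", "$a" + term))
        else:
            fields.append(("450", "$a" + term))
    for term in split_terms(eric.get("RT", "")):
        fields.append(("550", "$a" + term))
    for term in split_terms(eric.get("BT", "")):
        fields.append(("550", "$a" + term + "$wg"))  # g = broader term
    for term in split_terms(eric.get("NT", "")):
        fields.append(("550", "$a" + term + "$wh"))  # h = narrower term
    if eric.get("SCOPE"):
        fields.append(("680", "$i" + eric["SCOPE"]))
    return fields + history

# The Academic Ability record from the example above (scope note omitted)
record = {
    "TERM": "Academic Ability",
    "RT": ("Ability Grouping; Academic Achievement; Academic Aptitude; "
           "Academic Aspiration; Academically Gifted; "
           "Aptitude Treatment Interaction; Cognitive Ability; "
           "College Entrance Examinations; High Risk Students; Intelligence; "
           "Scholarship; Spatial Ability; Student Characteristics; "
           "Verbal Ability"),
    "BT": "Ability",
    "UF": "Scholastic Ability; Student Ability (1966 1980)",
    "GROUP": "120",
}
marc = eric_to_marc(record)
for tag, subfields in marc:
    print(tag, "|", subfields)
```

Keeping the 688 fields in a separate list and appending them last preserves ascending tag order, which matches the layout of the example record.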
MARC field and subfield statistics are provided in Appendices 1-4 for the following versions of the files:
- ERIC
  - Statistics for complete file (Appendix 1)
  - Statistics for subset without mapping data (Appendix 2)
  - Statistics for subset with mapping data (Appendix 3)
- LCSH
  - Statistics for complete file (Appendix 4)
As these statistics show, LCSH is a large vocabulary with more than 200,000 preferred terms (MARC tag 150) and nearly as many topical non-preferred terms (MARC tag 450). In contrast, the ERIC Thesaurus has about 6,000 preferred terms and 4,500 non-preferred terms. Although these statistics do not provide information about the potential subject overlap between ERIC and LCSH, the sheer size of the LCSH file compared with ERIC leads us to expect a favorable match rate. Statistics are provided for the subset of ERIC records, without and with mapping data, reported in this paper. This subset is described in detail in section 4.2.
4.2 Matching vocabulary terms
ERIC Thesaurus Term | LCSH Term |
---|---|
Alzheimers Disease | Alzheimer's disease |
Nurses Aides | Nurses' aides |
Currently, plural versus singular forms, terms that differ only by the presence or absence of a parenthetical qualifier, and terms with a qualifier introduced by a comma are not being matched. These refinements would likely improve the match rate and will be employed in the next phase of the project.
ERIC Thesaurus Term | LCSH Term |
---|---|
Echolocation | Echolocation (Physiology) |
Crack | Crack (Drug) |
Radiology | Radiology, Medical |
Rh factors | Rh factor |
A total of 3,797 ERIC terms were matched to LCSH and categorized according to the following match types:
- PT/PT - An exact match (after normalization) of a preferred term (PT) in the source vocabulary to a preferred term (PT) in the target vocabulary
- PT/NPT - An exact match of a preferred term (PT) (source) to a non-preferred term (NPT) (target)
- NPT/NPT - An exact match between a non-preferred term (NPT) (source) and a non-preferred term (NPT) (target)
- NPT/PT - An exact match between a non-preferred term (NPT) (source) and a preferred term (PT) (target)
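The four match types above can be illustrated with a small matching sketch. The normalization rules shown here (lowercasing, dropping apostrophes, collapsing whitespace) are an assumption inferred from the matched examples in this section; the project's matching software may apply different or additional rules. The function names and the dictionary layout (preferred term mapped to a list of non-preferred terms) are hypothetical.

```python
import re

def normalize(term):
    """Lowercase, drop apostrophes, and collapse whitespace (assumed rules)."""
    term = term.lower().replace("'", "").replace("\u2019", "")
    return re.sub(r"\s+", " ", term).strip()

def build_index(vocabulary):
    """Map each normalized term to a set of (kind, preferred term) pairs."""
    index = {}
    for preferred, non_preferred in vocabulary.items():
        index.setdefault(normalize(preferred), set()).add(("PT", preferred))
        for npt in non_preferred:
            index.setdefault(normalize(npt), set()).add(("NPT", preferred))
    return index

def match(source, target):
    """Return (match type, source term, target preferred term) triples."""
    target_index = build_index(target)
    results = []
    for preferred, non_preferred in source.items():
        candidates = [("PT", preferred)] + [("NPT", t) for t in non_preferred]
        for kind, term in candidates:
            for target_kind, target_pt in sorted(target_index.get(normalize(term), ())):
                results.append((kind + "/" + target_kind, term, target_pt))
    return results

# Sample terms taken from the examples in this paper
eric = {
    "Alzheimers Disease": [],
    "Adolescents": ["Adolescence", "Teenagers"],
    "Rh factors": [],
}
lcsh = {
    "Alzheimer's disease": [],
    "Teenagers": ["Adolescents"],
    "Adolescence": [],
    "Rh factor": [],
}
results = match(eric, lcsh)
for row in results:
    print(row)
```

Under these assumed rules, "Rh factors" produces no match against "Rh factor", which is consistent with plural and singular forms not being matched in the current phase.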
4.3 Evaluating matches
- Learning & perception (110)
- Individual development & characteristics (120)
- Health & safety (210)
- Disabilities (220)
ERIC Category | Total Preferred Terms | PT/PT Term Match | PT/PT Concept Match | PT/NPT Term Match | PT/NPT Concept Match |
---|---|---|---|---|---|
Learning & perception (110) | 164 | 49 | 49 | 10 | 8 |
Individual development & characteristics (120) | 269 | 83 | 81 | 25 | 23 |
Health & safety (210) | 227 | 129 | 127 | 31 | 30 |
Disabilities (220) | 113 | 37 | 37 | 12 | 10 |
Total | 773 | 298 | 294 | 78 | 71 |
About 99% of PT/PT matches were found to represent equivalent concepts in the two vocabularies and 91% of PT/NPT matches represent equivalent concepts. Very few false matches were observed for these two match types. A false match occurs when terms from the vocabularies are identical but the concepts represented are different. Some examples of false matches are:
Term | ERIC | LCSH |
---|---|---|
Females | For works on human females | For works on female organisms in general. Works on the human female are entered under Women. |
Males | For works on human males | For works on male organisms in general. Works on the human male are entered under Men. |
Radiology | For works on the use of radiation in medical diagnosis and treatment. | For works on radiological physics |
A total of 365 (294 + 71) equivalent concepts were identified. This is 47% (365/773) of the preferred terms in the ERIC subset. All matches in the subset were manually reviewed to determine which matches represented valid mappings. The following guidelines established in the Northwestern University LCSH/MeSH mapping project (Olson and Strawn 1997) were applied in the evaluation:
- Mapped terms should have generally the same scope in both vocabularies. One should not be broader or narrower than the other.
- Source vocabulary terms are mapped only to main terms in LCSH (main headings). An exception was made for non-preferred terms that matched a subdivided LCSH (main heading + subheading). For example:
ERIC | LCSH | Match Type | Valid Mapping |
---|---|---|---|
PT: Ametropia | PT: Eye-Refractive errors; NPT: Ametropia | PT/NPT | Yes |
- One to one mappings are preferred, but a term in the source vocabulary could be mapped to more than one term in the target vocabulary when multiple terms are needed to form an equivalent concept.
ERIC | LCSH | Match Type | Valid Mapping |
---|---|---|---|
PT: Cleft Palate; NPT: Cleft Lip | PT: Cleft Palate | PT/PT | Yes |
(same ERIC entry) | PT: Cleft Lip | NPT/PT | Yes |

ERIC | LCSH | Match Type | Valid Mapping |
---|---|---|---|
PT: Extraversion Introversion; NPT: Ambiversion; NPT: Extroversion; NPT: Introversion | PT: Extraversion; NPT: Extroversion | NPT/NPT | Yes |
(same ERIC entry) | PT: Introversion | NPT/PT | Yes |

The match types guided our review of the matches. Matches were coded by type and each type was assigned a different color. PT/PT (white) matches were reviewed first, followed by PT/NPT (green). Evaluation of these matches was relatively straightforward since most involved one-to-one matches. NPT/NPT (yellow) and NPT/PT (blue) were more complex to review because they often involved matches to multiple terms in the target vocabulary.
ERIC | LCSH | Match Type | Valid Mapping |
---|---|---|---|
PT: Adolescents | PT: Teenagers; NPT: Adolescents | PT/NPT | Yes |
NPT: Adolescence | PT: Adolescence | NPT/PT | No |
NPT: Teenagers | PT: Teenagers | NPT/PT | Yes |
In the example above, the NPT/PT match on the term Adolescence is an invalid mapping because the ERIC term and the LCSH term represent different concepts. The ERIC term Adolescents is for works on young people, 13-17 years of age. The LCSH term Adolescence is for works on the physiological, psychological, or social development of adolescents. The ERIC term, Adolescent Development, is a better match for the latter term. For terms that matched three or more LCSH, e.g. Neurological Impairments, the review could be quite time-consuming and sometimes did not yield a correct mapping. In the subset, NPT/NPT matches represent equivalent concepts about 81% of the time, and NPT/PT matches represent equivalent concepts about 55% of the time. This last set of statistics should be viewed with some caution, given the small number of matches analyzed. Even so, the mapping results do have some interesting implications for future mapping projects.
If the term/concept-mapping rate is constant within a vocabulary, it should be possible to predict the expected mapping rate for a vocabulary from a review of a sample of matches. Further, if the false match rate can be predicted reliably, review of matches with a high term/concept-mapping rate (PT/PT and PT/NPT, Table 2) could be dispensed with when the false match rate is below a particular threshold. Only those types of matches with low term/concept-mapping rates (NPT/NPT and NPT/PT, Table 3) would then need to be reviewed. In addition, for matches requiring review, more experienced reviewers could be assigned to complex matches while less experienced reviewers could be given simpler matches.
ERIC Category | Total NPT matches | NPT/NPT Term Match | NPT/NPT Concept Match | NPT/PT Term Match | NPT/PT Concept Match |
---|---|---|---|---|---|
Learning & perception (110) | 12 | 6 | 6 | 6 | 4 |
Individual development & characteristics (120) | 57 | 16 | 12 | 41 | 22 |
Health & safety (210) | 60 | 22 | n/a | 38 | n/a |
Disabilities (220) | 30 | 15 | n/a | 15 | n/a |
Total | 159 | 59 | 18 | 100 | 26 |
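The percentages quoted in this section can be rechecked from the table counts. The short calculation below is an illustrative sketch with invented variable names. It assumes that the NPT/NPT and NPT/PT rates are computed only over the two categories (110 and 120) for which concept matches are reported, which is the reading consistent with the quoted figures.

```python
# (term matches, concept matches) per match type, taken from the two tables
counts = {
    "PT/PT": (298, 294),
    "PT/NPT": (78, 71),
    # Only categories 110 and 120 have reviewed NPT concept matches
    "NPT/NPT": (6 + 16, 6 + 12),
    "NPT/PT": (6 + 41, 4 + 22),
}

# Share of term matches that were judged to be equivalent concepts
rates = {k: concept / term for k, (term, concept) in counts.items()}

# Equivalent concepts among the 773 preferred terms in the ERIC subset
equivalent = counts["PT/PT"][1] + counts["PT/NPT"][1]
share_of_subset = equivalent / 773

for match_type, r in rates.items():
    print(f"{match_type}: {r:.1%}")
print(f"Equivalent concepts: {equivalent} ({share_of_subset:.1%} of subset)")
```

The results (98.7%, 91.0%, 81.8%, 55.3%, and 47.2% for the subset share) agree with the approximate figures given in the text. The NPT rates rest on only 22 and 47 reviewed matches, which is why they should be treated with caution.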
5 Inter-vocabulary linking
- name or code of the target vocabulary
- mapped term
- control number or unique identifier for the mapped term
- identity of the mapping organization
A legitimate concern about vocabulary mapping is how the mappings will be maintained. Although not a trivial task, mappings can be maintained with the help of software that tracks changes to vocabulary term records. Changes to vocabulary terms are recorded in a number of ways, e.g. by data in a vocabulary record that indicates when the record was last modified, by notes fields that chronicle changes to a vocabulary term (see field 688 in the MARC record examples), and through notifications of additions and changes distributed by vocabulary owners. Depending on the nature of the changes, human review may be needed to determine if mappings are still valid when a vocabulary term changes.
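One of the change signals mentioned above, the date a record was last modified, could be used along the following lines. This is a minimal sketch under stated assumptions: it treats MARC field 005 as the last-modified timestamp, the function names and record layout are hypothetical, and the second timestamp in the sample data is invented for illustration.

```python
from datetime import datetime

def parse_005(value):
    """Parse a MARC 005 timestamp such as '20031117154238.0'."""
    return datetime.strptime(value, "%Y%m%d%H%M%S.%f")

def needs_review(records, mapped_on):
    """Return control numbers (001) of records changed after `mapped_on`."""
    return [r["001"] for r in records if parse_005(r["005"]) > mapped_on]

# Sample data: the second 005 value is invented
records = [
    {"001": "ERIC03056", "005": "20031117154238.0"},
    {"001": "ERIC00025", "005": "20040301120000.0"},
]
flagged = needs_review(records, mapped_on=datetime(2004, 1, 1))
print(flagged)
```

A timestamp comparison only identifies candidates. As noted above, human review may still be needed to decide whether a flagged mapping remains valid.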
001 | ERIC03056 | |
003 | OCoLC-O | |
005 | 20031117154238.0 | |
008 | 031118 n|a|znn|bb||||||||||| ||an| ||| d | |
040 | $beng$cOCoLC-O$dOCoLC-O$eericd | |
072 | 7 | $a110$2ericd |
150 | $aEidetic Imagery | |
450 | $wa$aEidetic Images | |
450 | $aPhotographic Memory | |
550 | $aVisualization | |
550 | $aMemory$wg | |
680 | $iVividly clear, detailed imagery of something (usually visual) that has been previously perceived | |
688 | $aEidetic Images (1967 1980) | |
750 | 0 | $aEidetic imagery$0(DLC)sh 85041379 $5OCoLC-O |
750 | 0 | $aPhotographic memory$0(DLC)sh 00009368 $5OCoLC-O |
750 | 2 | $a Eidetic Imagery$0(DNLM)D004538 |
In this example, the LCSH terms are linked to LC subject authority records accessible through the OAI-Cat framework. These records are accessible to users via a browser and to machines through the OAI-PMH Web services mechanisms. The MeSH link generates a search of the MeSH vocabulary using the search features of the MeSH Browser.
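The four linking elements listed at the start of this section can be read out of a 750 field as sketched below. The function name and output keys are hypothetical. The sketch assumes that the single-digit column in the example record is the second indicator, which in MARC authority linking fields identifies the target vocabulary (0 for LCSH, 2 for MeSH), and it interprets subfield `$5` as the identity of the mapping organization.

```python
import re

# Second-indicator values for MARC authority 7XX linking fields (subset)
THESAURUS = {"0": "LCSH", "2": "MeSH"}

def parse_link(indicator2, subfields):
    """Split a 750 field into the four link elements listed in this section."""
    parsed = {}
    for code, value in re.findall(r"\$(\w)([^$]*)", subfields):
        parsed.setdefault(code, value.strip())
    return {
        "target_vocabulary": THESAURUS.get(indicator2, "unknown"),
        "mapped_term": parsed.get("a"),
        "control_number": parsed.get("0"),
        "mapping_organization": parsed.get("5"),
    }

# Two of the 750 fields from the Eidetic Imagery record above
lcsh_link = parse_link("0", "$aEidetic imagery$0(DLC)sh 85041379 $5OCoLC-O")
mesh_link = parse_link("2", "$a Eidetic Imagery$0(DNLM)D004538")
print(lcsh_link)
print(mesh_link)
```

The MeSH field in the example carries no `$5` subfield, so its mapping organization comes back as `None`.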
6 Next steps
Our plans for the near term include refining the matching software and developing improved tools for reviewers. When the review of the ERIC/LCSH matches is complete, the file of mappings will be made available to other researchers. The file will be available in MARC in XML and also encoded according to version 0.5 of the Zthes schema. We also anticipate making this file available via OAI-PMH and for searching using SRU/SRW and the Zthes profile. See the Terminology Services project Web site for details.