An Evaluation of Document Keyphrase Sets: Jones and Paynter: JoDI

Abstract

Keywords and keyphrases have many useful roles as document surrogates and descriptors, but the manual production of keyphrase metadata for large digital library collections is at best expensive and time-consuming, and at worst logistically impossible. Algorithms for keyphrase extraction like Kea and Extractor produce a set of phrases that are associated with a document. Though these sets are often utilized as a group, keyphrase extraction is usually evaluated by measuring the quality of individual keyphrases. This paper reports an assessment that asks human assessors to rate entire sets of keyphrases produced by Kea, Extractor and document authors. The results provide further evidence that human assessors rate all three sources highly (with some caveats), but show that the relationship between the quality of the phrases in a set and the set as a whole is not always simple. Choosing the best individual phrases will not necessarily produce the best set; combinations of lesser phrases may result in better overall quality.

1 Introduction

Conventional document repositories, such as physical libraries, devote substantial resources to the task of cataloguing their holdings. Books, journals, conference proceedings, novels, papers and other types of holding are assigned, or already have, associated metadata that describe them and form a catalogue entry. The metadata can include items such as title, author, date of publication, subject descriptors, classification labels and keywords. This information eases management and organization of the holdings and supports access to them by end users.

This metadata is equally useful in digital document repositories like digital libraries and Web search engines. When available, it can facilitate document clustering, document retrieval, thesaurus production, browsing mechanisms, and many other access tools. It can be provided to end users to enrich query result sets and document displays, and to help them discriminate between documents.

Unfortunately, digital document repositories often lack sufficient resources to assign detailed metadata. One of the research goals of the New Zealand Digital Library Project (http://www.nzdl.org) is the automatic generation of metadata from source documents, and one aspect of this work is the Kea algorithm, which selects words and phrases from within a document that reflect its content. Some documents, such as scientific papers, are assigned sets of keyphrases (in our use of this term we are also referring to keywords) by authors or professional cataloguers. Many more documents are not. The goal of keyphrase extraction algorithms like Kea (Frank et al. 1999; Witten et al. 1999) and Extractor (Turney 1999; 2000) is to identify automatically a set of keyphrases for a document that approximates the list that might be supplied by a document's author. Identified keyphrases may be exploited in (among others) clustering algorithms (Anick and Vaithynathan 1997; Jones and Mahoui 2000), retrieval algorithms (Arampatzis et al. 1998; Croft et al. 1991), thesaurus construction (Paynter et al. 2000), and browsing interfaces (Gutwin et al. 1999; Jones and Paynter 1999).

This paper reports on an evaluation of the keyphrases produced by document authors, and by the Kea and Extractor algorithms. Kea and Extractor have previously been evaluated in two ways: automatically, by measuring the algorithm's precision and recall of author-supplied keyphrases (Frank et al. 1999; Turney 2000); and subjectively, by asking human assessors to rate the quality of individual keyphrases (Barker and Cornacchia 2000; Jones and Paynter 2001). The evaluation reported here complements these studies.

Jones and Paynter (2001) describe a subjective evaluation of Kea and author keyphrases that asks participants to rate individual keyphrases, then uses these ratings to evaluate the various phrase sources. Although this experiment showed that the phrases in the author and Kea sets were viewed favourably, a limitation is that there is no evidence that the quality of a phrase set as a whole is directly related to the quality of its individual members. In fact, Barker and Cornacchia (2000) report the opposite phenomenon: when evaluating two algorithms, in some cases human assessors preferred the individual phrases from one set, but preferred the other as a whole. The purpose of the present experiment was to evaluate the quality of the phrase sets generated by various sources, and to test the hypothesis that the quality of a phrase set is directly and positively related to the quality of its constituent phrases.

At first glance, this hypothesis appears trivially true, but it is not necessarily so. Consider as a (simple) example a document that discusses two topics, dogs and fish, for which we wish to choose three keyphrases. Our first attempt yields dingoes, wolves, and domestic canines, all of which are sensible phrases, but none of which covers the second topic of the paper. A second attempt might yield poodles, quadrupeds, and ocean life, all of which seem less useful, but which cover both topics. A third attempt might suggest dogs, fish, and electricity, which has two phrases that cover each of the topics perfectly, and one that is unrelated to the document and likely to mislead the user about its content. It is not clear which of the three attempts is the best: this will depend on the way the phrases are employed and on the preferences of the user.

This paper shows that sets comprised of phrases that are very good in isolation are likely to form a very good keyphrase set when it is considered as a whole. However, this is not always the case. The relationship between the quality of phrases and of sets is too complex for the current generation of keyphrase extraction algorithms to fully encompass because they are syntactically-based, and do not understand the semantics of the phrases they extract. Document authors do take this additional information into account, and their phrase sets were viewed most favourably in this evaluation.

In the following section we summarize keyphrase extraction, the Kea algorithm and the results of previous evaluations. We then describe the method of the present study, and go on to present our findings. These results are then discussed and our conclusions summarized.

2 Related Work

We define a keyphrase extraction algorithm as a process that explicitly attempts to identify those phrases in a document that the author or a cataloguer would assign to the document as keyword metadata. There is a growing body of research into automatic keyphrase extraction that encompasses techniques, evaluation methodologies and empirical evaluations.

2.1 Associating Keyphrases with Documents

Keyphrase metadata can be automatically generated in two ways: by assigning keyphrases from a controlled vocabulary to documents (Dumais et al. 1998), or by identifying and selecting the most descriptive phrases in the document. In the first approach, called keyphrase assignment or text categorization, the controlled vocabulary ensures that similar documents are classified consistently, and allows phrases to be assigned even if they are not explicitly mentioned in their text. However, controlled vocabularies are expensive to build and maintain, so are not always available, and potentially useful keyphrases are ignored if they are not in the vocabulary.

In the second approach, called keyphrase extraction, the text of a document is analyzed and its most appropriate words and phrases are identified and associated with the document. This means that every phrase that occurs in the document is a potential keyphrase of the document, and a controlled vocabulary is not required. The keyphrases generated are less consistent, however. Automatically identifying and extracting phrases is a complex task, but a range of techniques for identifying useful, descriptive and meaningful phrases have been suggested (Chen 1999; Krulwich and Burkey 1997; Smeaton and Kelledy 1998; Tolle and Chen 2000; see Jones and Paynter 2002 for an overview). Few of these conform to our definition of a keyphrase extraction algorithm. The distinction between keyphrase extraction algorithms and other phrase extraction algorithms is important because the author keyphrases provide an objective basis for evaluation, as will be discussed below.

Turney (1999; 2000) was the first to frame keyphrase extraction as a supervised learning problem, where all the phrases in a document are potential keyphrases, but only those that match the authors' choices are "correct" keyphrases. Turney devised an algorithm, called Extractor, that uses a set of heuristics and a genetic algorithm to identify the phrases that are most likely to be the authors'. Barker and Cornacchia (2000) suggest an alternative strategy; they identify noun phrases using dictionary lookup, and then consider the frequency of a given noun as a phrase head within a document, discarding those that fall below a given threshold. A third algorithm, Kea, is discussed in greater detail below.

2.2 Kea

Kea is a keyphrase extraction algorithm developed by members of the New Zealand Digital Library Project. A Java implementation, released under the GNU General Public License, is available from http://www.nzdl.org/Kea. The Kea algorithm is described in detail elsewhere (Frank et al. 1999; Witten et al. 1999), and summarized below.

Kea operates in two distinct stages. First, a model is built from a set of training documents with exemplar keyphrases (usually the author keywords, though any authoritative source may be used). Second, documents without keyphrase metadata are presented to Kea, and the model is used to identify those of their phrases that are most likely to be keyphrases; these phrases are "extracted" from the document and provided as output. The process is illustrated in Figure 1.

Figure 1. Kea phrase extraction process

To learn a model, Kea extracts every phrase from each of the training documents. Many phrases are immediately discarded, including proper nouns, those that begin or end with a stopword, those that do not match predefined phrase length constraints, and those that occur only once within a document. The remaining phrases are called the candidate phrases of the document. Three attributes are calculated for each candidate phrase: whether or not it is an author-specified keyphrase of the document, the distance into a document that it first occurs, and how specific it is to the document. The third attribute is represented using the TF·IDF measure. TF is the frequency of the phrase in the document, and is divided by DF, the number of other documents in which the phrase occurs. The candidate phrases from every training document are combined into a single dataset and used to construct a Naïve Bayes classifier (Domingos and Pazzani 1997) that predicts whether or not a phrase is an author keyphrase based on its other attributes.
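The candidate-selection and attribute steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Kea's implementation: the stopword list is a tiny hypothetical one, proper-noun filtering and sentence boundaries are ignored, the TF·IDF weighting is a common variant chosen for illustration, and the names `tokenize`, `candidate_phrases`, `phrase_features`, `other_doc_freq` and `n_docs` are invented for this example.

```python
import math
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_phrases(words, min_len=1, max_len=3):
    """Map each surviving phrase to (occurrence count, index of first occurrence)."""
    seen = {}
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Discard phrases that begin or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            count, first = seen.get(phrase, (0, i))
            seen[phrase] = (count + 1, first)
    # Discard phrases that occur only once in the document.
    return {p: cf for p, cf in seen.items() if cf[0] > 1}


def phrase_features(words, candidates, other_doc_freq, n_docs):
    """Return {phrase: (tfidf, distance)} using an assumed TF-IDF weighting."""
    features = {}
    for phrase, (count, first) in candidates.items():
        tf = count / len(words)
        df = other_doc_freq.get(phrase, 0)  # other documents containing the phrase
        idf = -math.log2((df + 1) / n_docs)  # assumed smoothing, not Kea's formula
        distance = first / len(words)  # relative position of the first occurrence
        features[phrase] = (tf * idf, distance)
    return features


text = ("digital library users search the digital library "
        "for keyphrase metadata and keyphrase sets")
words = tokenize(text)
candidates = candidate_phrases(words)
features = phrase_features(words, candidates, {"library": 3}, n_docs=8)
```

In this toy document only the phrases that occur at least twice and have no stopword at either end survive as candidates; the third attribute, whether the phrase is an author keyphrase, would be attached from the training labels.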

Once a model has been learned from the training documents, it can be used to extract keyphrases from new documents. The candidate phrases are extracted from the new document as described above, and the distance and TF·IDF attributes computed. The Naïve Bayes model uses these attributes to calculate the probability that each candidate phrase is a keyphrase of its document. The most probable candidates for each document are output in ranked order; these are the keyphrases that Kea associates with the document.
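The ranking step can be illustrated with a small Naïve Bayes sketch. It assumes that both attributes have already been discretised into labelled bins (the bin labels, the two-bin default and the add-one smoothing are assumptions made for this example), and the function names `train`, `keyphrase_probability` and `rank_candidates` are hypothetical rather than part of Kea.

```python
from collections import Counter


def train(examples):
    """examples: (tfidf_bin, distance_bin, is_keyphrase) tuples from training documents."""
    class_counts = Counter(label for _, _, label in examples)
    tfidf_counts = Counter((label, t) for t, _, label in examples)
    dist_counts = Counter((label, d) for _, d, label in examples)
    return class_counts, tfidf_counts, dist_counts


def keyphrase_probability(model, tfidf_bin, distance_bin, n_bins=2):
    """Posterior probability that a candidate is a keyphrase (add-one smoothing)."""
    class_counts, tfidf_counts, dist_counts = model
    total = sum(class_counts.values())
    score = {}
    for label in (True, False):
        prior = (class_counts[label] + 1) / (total + 2)
        p_tfidf = (tfidf_counts[(label, tfidf_bin)] + 1) / (class_counts[label] + n_bins)
        p_dist = (dist_counts[(label, distance_bin)] + 1) / (class_counts[label] + n_bins)
        score[label] = prior * p_tfidf * p_dist
    return score[True] / (score[True] + score[False])


def rank_candidates(model, candidates, top_n):
    """candidates: {phrase: (tfidf_bin, distance_bin)}; return the top_n most probable."""
    ordered = sorted(
        candidates,
        key=lambda p: keyphrase_probability(model, *candidates[p]),
        reverse=True,
    )
    return ordered[:top_n]


# Invented training data: keyphrases tend to be specific ("high") and early.
examples = (
    [("high", "early", True)] * 3
    + [("low", "late", False)] * 5
    + [("high", "late", False)]
)
model = train(examples)
top = rank_candidates(model, {"a": ("low", "late"), "b": ("high", "early")}, top_n=1)
```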

Changes to the model building process affect the parameters of the model and the characteristics of the keyphrases that will ultimately be extracted. The simplest and most significant possible change is to vary the training data. In this study we have used two different sets of training documents labelled cstr and aliweb. A question that arises is whether the level of similarity between the training and target documents has a noticeable effect on the quality of the extracted keyphrases. Therefore we have applied a domain-focussed extraction model (cstr) and a non-domain focussed model (aliweb) in our study.

The cstr corpus was drawn from a collection of Computer Science research papers gathered for inclusion in the New Zealand Digital Library (http://www.nzdl.org). This corpus was used as training data for Kea to produce the cstr keyphrase extraction model. Frank et al. (1999) generated the model that we use in this study. The cstr training corpus and keyphrase extraction model may be considered domain-focussed in that all documents discuss Computer Science research, although a range of research areas are represented.

The aliweb corpus contains HTML Web pages gathered by Turney via the Aliweb search engine for his studies on keyphrase extraction (1999; 2000). The documents address a broad variety of topics such as micro-breweries, law libraries, text processing and university departments, and contain author-assigned keyphrases specified within them using the HTML META tag. This corpus was used to train Kea and produce the aliweb keyphrase extraction model. The training material and the consequent extraction model are clearly not domain-focussed. Many Kea users will not have suitable training data that matches the documents from which they wish to extract keyphrase sets, and so it is important to establish the utility of a generic model such as aliweb.

The extraction model can be tailored in further ways. The length of the phrases to be extracted, expressed as a minimum and maximum number of words, can be constrained, as can the number of phrases extracted from each document.

Finally, the model can be extended by the inclusion of the keyphrase-frequency attribute in addition to the three attributes described above. This attribute represents the number of times a phrase has occurred as an author-specified keyphrase in the set of training documents. The effect of including this attribute is to bias the extraction model in favour of the most common author-selected keyphrases in the training corpus. The resulting model is more domain specific and it is suited to situations where there is a strong relationship between the domains of the training and testing documents. This study evaluates a Kea model called cstr-kf that is based on the cstr corpus and employs the keyphrase-frequency attribute.

2.3 Evaluating Keyphrase Extraction

The difference between keyphrase extraction algorithms and other phrase extraction algorithms is that the former explicitly attempt to extract the same phrases that the author would choose. A consequence of this intention is that there is an objective way to measure performance: if an extracted keyphrase is the same as an author keyphrase, then the algorithm has succeeded; if not, then it has failed. We can extend this to evaluate sets of keyphrases with standard information retrieval measures: precision is the proportion of the extracted keyphrases that are also author keyphrases, and recall is the proportion of the author keyphrases that were actually extracted.
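As a minimal sketch of these two measures, the function below compares an extracted set with an author set. Case-insensitive exact matching is an assumption made here for simplicity, the function name `precision_recall` is hypothetical, and the extracted list is invented for illustration; only the author keyphrases are taken from paper 5 in Table 1.

```python
def precision_recall(extracted, author):
    """Return (precision, recall) of extracted keyphrases against author keyphrases."""
    extracted_set = {p.lower() for p in extracted}
    author_set = {p.lower() for p in author}
    matches = extracted_set & author_set
    precision = len(matches) / len(extracted_set) if extracted_set else 0.0
    recall = len(matches) / len(author_set) if author_set else 0.0
    return precision, recall


author = ["History mechanisms", "WWW", "web", "hypertext", "navigation"]
extracted = ["web", "navigation", "browser", "history mechanisms", "pages"]  # invented
precision, recall = precision_recall(extracted, author)
```

Three of the five invented phrases match author keyphrases, and three of the five author keyphrases are recovered, so both measures come to 0.6 in this example.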

Both Kea (Frank et al. 1999) and Extractor (Turney 2000) have been evaluated using precision and recall. These studies show that there is no statistically significant difference between the two algorithms, though Kea's performance can be significantly improved by the keyphrase-frequency attribute. However, the keyphrase-frequency attribute may not always be available and is domain-specific, meaning that it boosts performance only for documents describing a single domain, such as computer science or physics. Further, recent work by Turney has shown that keyphrase frequencies can be detrimental when used in other domains (personal communication).

There are well-known problems with evaluations that rely on the author's keyphrases. Author keyphrases do not always appear in the text of the document to which they belong, and therefore cannot be found by an extraction-based technique. Further, author keyphrases are available for a limited number and type of documents, and authors rarely provide more than a few keyphrases, far fewer than may be extracted automatically. Finally, authors choose keyphrases for purposes other than document description, for example to increase the likelihood of publication. Barker and Cornacchia (2000) observed these deficiencies, and proposed instead a subjective evaluation of Extractor and their own algorithm (called B&C), though their results are limited by the paucity and disparity of assessors and test documents, and a consequent lack of inter-assessor agreement.

Jones and Paynter (2002) describe a more extensive subjective evaluation of Kea and author keyphrases, using the same test material and similar assessors to the study described later in this paper. In this previous evaluation, assessors were presented with scientific papers and an associated list of individual keyphrases for each paper. The keyphrases were produced from a number of sources, including author keyphrases, and three Kea models (aliweb, cstr and cstr-kf, which are described in Section 2.2). Assessors rated each of the keyphrases using a numeric scale, based on how well they reflected the content of the paper.

This study revealed that:

participants thought that author keyphrases were good descriptors of the content of a document;
different Kea models produce keyphrases of varying quality;
phrases produced by the aliweb and cstr models were judged positively and rivaled the quality of the author phrases;
phrases produced by the domain-specific cstr-kf model performed poorly;
author keyphrase sets are ranked: the participants' ratings were higher for those at the start of the list;
Kea's ranking of its results matched participants' assessments of the keyphrases;
participants preferred multiple-word phrases over single words, favouring phrases of two words.

These results are encouraging, but are based on the assessment of individual keyphrases. In some uses of keyphrases, such as when they appear as document surrogates on a search results page, the quality of the keyphrase list as a whole is important, and the quality of the individual phrases is less so. The previous evaluation does not measure the quality of groups of phrases; nor is this information captured by evaluations based on the author's keyphrases.

This limitation is particularly relevant to single-word keyphrases like WWW, Clustering, Categorization, Scripting and navigation, which are too general to describe a document when they appear in isolation, but benefit from the context provided by other phrases. For example, the keyword navigation has many possible interpretations when it appears on its own, but its meaning is plain when it occurs in a keyphrase list with WWW, hypertext and history mechanisms. It is possible that a keyphrase that is rated poorly on its own makes a significant contribution to a keyphrase list, or that a phrase that is rated highly in isolation is redundant when presented with other, similar phrases. Consequently, a list comprised of phrases with high individual ratings is not necessarily better than one whose constituents were less well-regarded.

In fact, Barker and Cornacchia (2000) observed exactly this phenomenon. They determined their participants' preference for phrase sets and for individual phrase scores from two sources (Extractor and B&C). Only half of the phrase set preferences matched those derived from individual phrase scores. They conjecture that the phrase set preferences of the participants were not simply based on the individual phrases that constituted the sets, and that the sets were more (or less) than the sum of their parts.

3 Evaluation of Kea Keyphrase Sets

The remainder of this paper describes a subjective evaluation of keyphrase sets drawn from a variety of sources. The basic procedure is straightforward: participants are shown documents and sets of keyphrases from different sources, and are asked to assign a score to each of the sets.

3.1 Research questions

Our study was designed to answer three questions. First, we wished to know whether author, Kea and Extractor keyphrase sets were viewed positively or negatively when treated as a whole. Second, we wanted to compare the performance of different sources of keyphrase sets. Third, we wanted to determine whether there is a strong positive correlation between the quality of keyphrase sets and the quality of the phrases that make up those sets. To satisfy the third goal, the conditions of the experiment were kept as close as possible to those described by Jones and Paynter (2001; 2002).

Additionally, we were interested in the extent to which the sets of keyphrases covered the content of a document. We wished to determine whether the phrase sources focussed on particular portions of a document for keyphrase extraction, or whether the extracted phrases occurred throughout the document text. This is of particular interest with respect to Kea, given that the distance attribute of the Kea models favours extraction of phrases that occur near to the start of a document.

3.2 Experimental Texts

A set of six English language papers from the Proceedings of the ACM Conference on Human Factors (1997, 1998) was used for the test documents. The papers used are listed in Table 1 and are the same papers employed in our previous subjective evaluation of Kea.

Table 1. Documents used in the experiment, drawn from proceedings of CHI97 and CHI98. Author-specified keyphrases are also shown

Paper ID	Reference	Author keyphrases
1	Borchers, J.O., "WorldBeat: Designing a Baton-Based Interface for an Interactive Music Exhibit", pp. 131-138.	interface design, interactive exhibit, baton, music, education
2	Kandogan, E. and Shneiderman, B., "Elastic Windows: Evaluation of Multi-Window Operations", pp. 250-257.	Window Management, Multi-window operations, Personal Role Management, Tiled Layout, User Interfaces, Information Access and Organization
3	Wilcox, L.D., Schilit, B.N. and Sawhney, N., "Dynomite: a Dynamically Organized Ink and Audio Notebook", pp. 186-193.	Electronic notebook, note-taking, audio interfaces, hand-writing, keyword indexing, ink properties, retrieval, paper-like interfaces, PDA, pen computing
4	Myers, B.A., "Scripting Graphical Applications by Demonstration", pp. 534-541.	Scripting, Macros, Programming by Demonstration (PBD), Command Objects, Toolkits, User Interface Development Environments, Amulet
5	Tauscher, L. and Greenberg, S., "Revisitation Patterns in World Wide Web Navigation", pp. 399-406.	History mechanisms, WWW, web, hypertext, navigation
6	Pitkow, J. and Pirolli, P., "Life, Death and Lawfulness on the Electronic Frontier", pp. 383-390.	Clustering, categorization, co-citation analysis, World Wide Web, hypertext, survival analysis, usage model

The papers were chosen because they contain author-specified keywords and phrases, and provide a good fit with the background and experience of our participants. Each paper was eight pages long, and the authors' keyphrases were removed from each so that they would not influence the participants' responses.

3.3 Participants

As in the previous study, the participants were recruited from a final year course on Human Computer Interaction taken as part of an undergraduate degree programme in Computer Science. Although the actual participants differed from those in previous studies, their characteristics were similar. All but one had completed at least three years of undergraduate education in computer science or a related discipline and were nearing completion of a 15 week course on Human Computer Interaction. English was the first language of 14 of the 20 participants. The youngest participant was 21, the oldest 43, and the mean age was 24. Twelve of the participants were male and eight female.

3.4 Tasks

Each participant was asked to complete two tasks. For each task a participant was provided with one of the six experimental texts, and a collection of nine sets of keyphrases that had been produced for the text. For each keyphrase set the participant was asked to rate on a fixed scale how well the set matched the content of the document. Participants undertook the tasks separately, one after the other, with a short break in between if required.

3.5 Paper Allocation

Two texts were allocated to each of the participants. Texts were allocated randomly to the participants, though presentation order, number of viewings of each paper, and participants' first language were controlled to minimize confounding effects. All participants considered both of the papers allocated to them, completing the tasks within one and a half hours. All paper and keyphrase set pairs were considered by six participants.

3.6 Instructions

For each task the participants were instructed to first read the text fully. They were then told to reveal a collection of keyphrase sets for the text and asked: How well does each of the following sets of phrases represent what the document is either wholly or partly about? Each keyphrase set was presented in the form shown in Figure 2. Nine keyphrase sets were presented for each text, with three sets displayed per A4 page. Presentation order was randomized for each participant. Participants indicated their rating by drawing a circle around the appropriate value. Participants could refer back to the paper and reread it as often as required.

Figure 2. Sample keyphrase set presentation

3.7 Candidate Phrase Sets

Each of the nine phrase sets was generated in a different way. The first was specified by the author(s) of the text, and six more were generated by Kea. These seven sources are the same as in the previous evaluation. A further set was produced by taking the phrases from these sources that were judged to be the best individual phrases in our previous study. A final set was created by the Extractor system (Turney 2000), using the default settings.

Three Kea models were used to extract keyphrases. The first, aliweb, was trained on a set of general content web pages gathered by Turney (1999; 2000). The second, cstr, is derived from a collection of technical reports in a range of computer science subject areas. The third, cstr-kf, was trained on the same documents as cstr, but uses a further attribute which reflects how frequently a phrase occurs as an author keyphrase in a set of training documents. Both cstr and cstr-kf were produced by Frank et al. (1999).

The minimum phrase length was varied for each Kea model. Two phrase sets were produced with each model, corresponding to phrases of 1 to 3 words and 2 to 3 words. The six Kea phrase sets for each text were the same as those used in our previous study.

In our previous study we captured participants' ratings of individual phrases provided by Kea and the authors. In this study we use these data to calculate the mean score across participants for all phrases for each of the texts. For each text we then produced a list of phrases ranked in descending order by the mean of the scores assigned to them by participants; this we called the best-individual set.

The number of phrases in each set was determined by the number of author phrases specified for the text (it is, of course, possible to select more phrases from the other sets). In the case of the Kea and best-individual sets the list of phrases is already ranked and we chose the first N from the beginning of the list. A command-line parameter was used to request the appropriate number of phrases from Extractor.
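The construction of the best-individual set can be sketched as follows. The ratings shown are invented for illustration (the actual participant data is not reproduced here), the function name `best_individual` is hypothetical, and breaking ties alphabetically is an assumption, since the paper does not say how equal means were ordered.

```python
from statistics import mean


def best_individual(ratings, n):
    """ratings: {phrase: [scores from participants]}; return the n best phrases."""
    # Sort by descending mean score; ties are broken alphabetically (assumed).
    ranked = sorted(ratings, key=lambda phrase: (-mean(ratings[phrase]), phrase))
    return ranked[:n]


# Invented ratings for four phrases, each scored by two participants.
ratings = {
    "hypertext": [8, 6],
    "navigation": [5, 7],
    "web": [9, 9],
    "pages": [2, 3],
}
top_two = best_individual(ratings, 2)
```

Here `n` plays the role of the number of author phrases specified for the text, so that every set presented for a paper has the same size.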

The nine phrase sets for each paper were labelled as follows (labels for Kea sets reflect the model used and the length restriction, in words, placed on the phrases):

author
best-individual
Extractor
aliweb 1-3
cstr 1-3
cstr-kf 1-3
aliweb 2-3
cstr 2-3
cstr-kf 2-3

Table 2 provides links to source documents and associated phrase sets.

Table 2. Links to source text and phrase sets for each experimental text

	Document text (requires subscription to ACM Digital Library to access full text)	Phrase sets
Paper 1: Borchers	http://doi.acm.org/10.1145/258549.258632	Borchers phrase set
Paper 2: Kandogan and Shneiderman	http://doi.acm.org/10.1145/258549.258720	Kandogan and Shneiderman phrase set
Paper 3: Wilcox et al.	http://doi.acm.org/10.1145/258549.258700	Wilcox et al. phrase set
Paper 4: Myers	http://doi.acm.org/10.1145/274644.274716	Myers phrase set
Paper 5: Tauscher and Greenberg	http://doi.acm.org/10.1145/258549.258816	Tauscher and Greenberg phrase set
Paper 6: Pitkow and Pirolli	http://doi.acm.org/10.1145/258549.258805	Pitkow and Pirolli phrase set

4 Results

This section describes the participants' responses; the implications of these results are discussed in Section 5.

4.1 Overall Quality of Keyphrase Sets

Our assessment is based on the scores assigned to each keyphrase list by the participants in the experiment, on the mean score for each of the nine keyphrase sources, and on the rank of each of the keyphrase sources (found by ranking the mean scores).

Table 3 shows the ranks and mean scores when data from all six papers are combined, providing an immediate indication of the "best" keyphrase sources. The author phrase sets were judged to be of the highest quality with a mean score of 6.65. They were closely followed by the best-individual sets, which comprised the top-ranking individual phrases from our previous study and received a mean score of 6.63. The Kea cstr 2-3 sets were ranked third overall, followed by Extractor. The aliweb 1-3 and cstr 1-3 sets were ranked fifth and sixth respectively, and on average received positive ratings that were higher than the midpoint on the scale used by the participants. Three sets received negative ratings overall, with mean scores below the midpoint: aliweb 2-3, cstr-kf 2-3 and cstr-kf 1-3.

Table 3. Keyphrase set rankings and mean scores for all papers
	Keyphrase Set
	author	best-individual	cstr 2-3	Extractor	aliweb 1-3	cstr 1-3	aliweb 2-3	cstr-kf 2-3	cstr-kf 1-3
rank	1	2	3	4	5	6	7	8	9
mean score	6.65	6.63	6.20	5.93	5.78	5.53	4.98	4.65	4.25
sd	1.75	1.72	2.26	1.80	2.15	2.22	3.10	1.89	2.33

4.2 Comparisons Between Keyphrase Sources

The ranks and mean scores for the individual papers are shown in Table 4. These scores vary from paper to paper. For example, the author set ranked first for papers 3 and 6, but ranked fifth for paper 2. The cstr 2-3 set ranked first for papers 3 and 5, but ranked sixth for papers 4 and 6. Note that in the case of a tie, more than one keyphrase set may be assigned the same ranking.

Table 4. Keyphrase set rankings and mean scores for each paper
		Keyphrase Set
		author	best-individual	cstr 2-3	Extractor	aliweb 1-3	cstr 1-3	aliweb 2-3	cstr-kf 2-3	cstr-kf 1-3
Paper 1	rank	3	6	2	4	6	5	1	6	6
	mean	6.00	4.86	6.29	5.43	4.86	5.14	8.00	4.86	4.86
	sd	2.00	1.77	2.21	1.62	2.61	2.04	1.15	0.90	1.46
Paper 2	rank	5	1	3	5	7	4	1	8	9
	mean	5.50	6.75	6.63	5.50	5.13	6.00	6.75	4.63	3.38
	sd	1.77	1.04	1.77	1.85	2.10	1.51	0.89	1.19	1.77
Paper 3	rank	1	4	1	4	3	6	8	7	8
	mean	6.71	5.57	6.71	5.57	5.71	5.43	3.00	5.29	3.00
	sd	0.76	1.13	1.70	1.72	2.14	1.81	2.31	1.70	1.83
Paper 4	rank	2	1	6	3	4	4	8	7	9
	mean	6.67	7.50	5.00	6.00	5.33	5.33	3.50	4.17	2.67
	sd	1.37	1.38	0.89	1.67	1.63	1.21	1.52	2.40	1.97
Paper 5	rank	8	2	1	6	5	3	3	9	6
	mean	6.20	8.40	8.80	7.20	7.40	7.80	7.80	5.00	7.20
	sd	0.84	1.14	1.30	0.84	1.67	1.64	1.10	2.12	1.30
Paper 6	rank	1	2	6	4	3	7	9	7	5
	mean	8.86	7.29	4.29	6.29	6.71	4.00	1.14	4.00	5.14
	sd	1.21	1.50	2.81	2.56	2.14	3.37	2.61	2.94	2.79
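The tie handling visible in Table 4 is consistent with giving each set a rank of one plus the number of sets with a strictly higher mean score. The sketch below reproduces the paper 2 ranks from the mean scores in the table; the function name `competition_ranks` is hypothetical, and because the published means are rounded to two decimals, ties are detected on the rounded values.

```python
def competition_ranks(mean_scores):
    """Rank sets by mean score (1 = best); tied sets share the same rank."""
    return {
        name: 1 + sum(other > score for other in mean_scores.values())
        for name, score in mean_scores.items()
    }


# Mean scores for paper 2, copied from Table 4.
paper2 = {
    "author": 5.50,
    "best-individual": 6.75,
    "cstr 2-3": 6.63,
    "Extractor": 5.50,
    "aliweb 1-3": 5.13,
    "cstr 1-3": 6.00,
    "aliweb 2-3": 6.75,
    "cstr-kf 2-3": 4.63,
    "cstr-kf 1-3": 3.38,
}
ranks = competition_ranks(paper2)
```

With this scheme the two sets tied on 6.75 both receive rank 1 and the next set receives rank 3, as in the paper 2 row of the table.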

Table 5 summarises Table 4 by showing how many times each phrase set achieved each ranking. aliweb 2-3 was the most variable: its average rating was below halfway, but twice it was ranked first.

Table 5. Frequency of ranking for each keyphrase set
		Keyphrase Set
		author	cstr 1-3	cstr 2-3	aliweb 1-3	aliweb 2-3	cstr-kf 1-3	cstr-kf 2-3	Extractor	best-individual
Ranking	1	2		2		2				2
	2	1		1		1				2
	3	1	1	1	2				1
	4		2		1				3	1
	5	1	1		1		1		1
	6		1	2	1		2	1	1	1
	7		1		1			3
	8	1				2	1	1
	9					1	2	1

Given the variation observed between papers we asked if there was a significant difference between the quality of the phrase sets for each of the six papers. Using Friedman's two-way analysis of variance by ranks (Siegel and Castellan 1988) we established that at the p=0.05 level there was a significant difference between at least one pair of phrase sets for all papers except for paper 5. (For paper 5 there was a significant difference at the p=0.051 level.) Table 6 shows the results of this analysis. All scores shown were adjusted for ties in the data.

Table 6. Results of Friedman test of significant difference between phrase sets for each paper

Paper	F_r	p
1	23.53	=0.003
2	28.67	<0.001
3	24.34	=0.002
4	26.42	=0.001
5	15.52	=0.051
6	40.15	<0.001
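
For readers unfamiliar with the test, the following Python sketch shows one common way to compute a tie-adjusted Friedman statistic and its chi-square p value. It is a sketch under stated assumptions, not the code used in the study: the function names and the small ratings matrix are hypothetical, ties receive average ranks, and the p value uses a closed form that is valid only for an even number of degrees of freedom (which holds for nine phrase sets, where df = 8).

```python
import math
from collections import Counter


def average_ranks(scores):
    """Rank one assessor's scores (1 = highest); tied scores share the mean rank."""
    counts = Counter(scores)
    first_position = {}
    for position, value in enumerate(sorted(scores, reverse=True), start=1):
        first_position.setdefault(value, position)
    return [first_position[v] + (counts[v] - 1) / 2 for v in scores]


def friedman_statistic(score_rows):
    """Tie-adjusted Friedman statistic; one row of scores per assessor."""
    n, k = len(score_rows), len(score_rows[0])
    ranks = [average_ranks(row) for row in score_rows]
    rank_sums = [sum(r[j] for r in ranks) for j in range(k)]
    unadjusted = 12 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)
    # Each group of t tied scores within a row contributes t^3 - t to the correction.
    tie_term = sum(t ** 3 - t for row in score_rows for t in Counter(row).values())
    return unadjusted / (1 - tie_term / (n * k * (k * k - 1)))


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable; closed form, valid for even df only."""
    if df % 2:
        raise ValueError("this closed form needs an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Hypothetical ratings: three assessors (rows) scoring three phrase sets (columns).
ratings = [[9, 5, 2], [8, 6, 3], [7, 4, 1]]
f_r = friedman_statistic(ratings)
p_value = chi2_upper_tail(f_r, df=len(ratings[0]) - 1)
print(round(f_r, 2), round(p_value, 4))

# Statistic reported for paper 1, evaluated with df = 8 (nine phrase sets).
print(round(chi2_upper_tail(23.53, df=8), 4))
```

Evaluating the statistic reported for paper 1 with eight degrees of freedom gives a p value close to the one reported, which suggests the reported p values rest on the same chi-square approximation.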

We investigated further to identify which phrase sets are significantly different (at p=0.05, excluding data from paper 5). We extended the Friedman analysis by carrying out multiple comparisons between conditions (Siegel and Castellan 1988).

Table 7 summarizes the phrase sources that are significantly different. Each cell of the table shows the papers for which the phrase set labeling the row was significantly better than the phrase set labeling the column. For example, the row labeled cstr 2-3 shows that the cstr 2-3 phrase set was significantly better than the aliweb 2-3 phrase set for papers 2 and 3, and also significantly better than the cstr-kf 1-3 phrase set for paper 3.

Table 7. Summary of papers for which phrase sets to the left were significantly better than phrase sets shown as column headers

	Keyphrase Set
	aliweb 1-3	aliweb 2-3	cstr 1-3	cstr 2-3	cstr-kf 1-3	cstr-kf 2-3	Extractor	best-individual
author		6	6		3,4	6
aliweb 1-3		6
aliweb 2-3					1	2		1
cstr 1-3
cstr 2-3		2,3			3
cstr-kf 1-3
cstr-kf 2-3
Extractor		6
best-individual		4,6			4	4

4.3 Inter-Participant Agreement

The results presented above are only meaningful if we can be confident that the assessors are in broad agreement on how phrase sets should be ranked. If there is significant agreement between the assessors, then we can assume they are interpreting the task in the same way and providing meaningful responses. The Kendall Coefficient of Concordance (W) is a measure of agreement between rankings and is appropriate given that we wish to determine whether participants agreed on which phrase source provided the most effective set of keyphrases, the next best and so on. To allow this, the scores assigned by a participant to each phrase set for a given paper were converted to rankings.

W has a value between 0 (agreement as expected by chance) and 1 (complete agreement). Table 8 shows the agreement values for each paper. In each case, W is non-zero, indicating that there is some level of inter-participant agreement other than that expected by chance. The weakest agreement of 0.39 was observed for paper 5, and the strongest of 0.72 for paper 6. The X² score and the degrees of freedom (eight, one fewer than the nine phrase sets) can be used to determine the level of significance of the W value. The level of agreement is highly significant (to at least the p=0.003 level) for all papers except paper 5, where it narrowly misses significance at the 0.05 level (p=0.0514).

Table 8. Inter-assessor agreement measured by the Kendall Coefficient of Concordance

Paper	W	X²	p
1	0.42	23.53	=0.0027
2	0.45	28.67	=0.0004
3	0.44	24.34	=0.0020
4	0.55	26.41	=0.0009
5	0.39	15.43	=0.0514
6	0.72	40.15	<0.0001
All	0.39	18.91	=0.0153

We also measured overall agreement, across all papers. For each paper we computed the mean score for each phrase set (across all participants), and converted the mean scores to rankings. The resulting W value is also shown in Table 8 and provides evidence of significant agreement. Consequently, we conclude that participants agree sufficiently for us to consider the reported results to be meaningful.
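
The computation of W can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names and the toy score matrices are hypothetical, ties receive average ranks, and the standard tie-corrected formula for W is used. The last function applies the usual relation X² = m(k-1)W for m rankings of k items.

```python
from collections import Counter


def average_ranks(scores):
    """Rank one judge's scores (1 = highest); tied scores share the mean rank."""
    counts = Counter(scores)
    first_position = {}
    for position, value in enumerate(sorted(scores, reverse=True), start=1):
        first_position.setdefault(value, position)
    return [first_position[v] + (counts[v] - 1) / 2 for v in scores]


def kendall_w(score_rows):
    """Tie-corrected Kendall coefficient of concordance; one row per judge."""
    m, k = len(score_rows), len(score_rows[0])
    ranks = [average_ranks(row) for row in score_rows]
    rank_sums = [sum(r[j] for r in ranks) for j in range(k)]
    mean_rank_sum = m * (k + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    tie_term = sum(t ** 3 - t for row in score_rows for t in Counter(row).values())
    return 12 * s / (m * m * (k ** 3 - k) - m * tie_term)


def chi_square_from_w(w, m, k):
    """Chi-square statistic (k - 1 degrees of freedom) implied by W."""
    return m * (k - 1) * w


# Hypothetical scores: rows are judges, columns are phrase sets.
agreeing = [[9, 5, 2], [8, 6, 3], [7, 4, 1]]   # identical orderings
opposed = [[9, 5, 2], [2, 5, 9]]               # exactly reversed orderings
w_agreeing = kendall_w(agreeing)
w_opposed = kendall_w(opposed)
print(w_agreeing, w_opposed)

# Overall row of Table 8: six papers ranking nine phrase sets, W reported as 0.39.
overall_chi_square = chi_square_from_w(0.39, m=6, k=9)
print(round(overall_chi_square, 2))
```

For the overall row, the rounded W of 0.39 implies a statistic of about 18.7, close to the reported 18.91; a gap of this size is what would be expected if W was rounded to two decimal places before reporting.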

5 Analysis and Discussion

This section returns to the goals of the experiment stated in section 3.1, and attempts to answer the questions they raise. We evaluate the quality of the keyphrase sets, compare the relative performance of the different sources, and consider the hypothesis that the quality of a phrase set is directly and positively related to the quality of its constituent phrases. The discussion concludes with some thoughts on the design of our experiment, and of similar experiments.

5.1 Overall Quality of Keyphrase Sets

The most highly-rated phrase set overall was the author list. We have previously established that participants viewed individual author keyphrases positively. The results of this study show that the keyphrase sets supplied by authors are also judged positively. They received mean scores greater than the midpoint (5) of the rating scale for all papers and only six of all author list scores were less than 5. In each of these cases the score was 4, which was just less than the midpoint, and was assigned by participants who tended to award lower scores to the phrase sets from all of the sources. The author list for paper 2 received the lowest mean score (5.5) and the list for paper 6 received the highest (8.86). These sets are shown in Table 1. There is no immediately apparent reason for such a perceived difference between the two phrase sets when they are viewed in isolation. Consequently we believe that the explanation lies in the strength of relationships between the phrase sets and the content of the paper with which they are associated.

After author, the next most highly-rated set was the best-individual set, whose average was only fractionally less than that of the author set. This was followed by one of the Kea sets, cstr 2-3, and the Extractor set. The next two Kea sets, aliweb 1-3 and cstr 1-3, were also rated positively, but it is notable that the Kea phrases are more variable than the author, best-individual and Extractor sets (Table 3, standard deviation row). This indicates that information providers should utilise author keyphrases whenever possible. However, such metadata is often unavailable, and consequently automated extraction is required. In this case, the results indicate that a focussed Kea extraction model (cstr in this case) can be used to provide good quality keyphrase sets where there is a good fit between the model and the target documents. The situation will arise where it is not possible to create a new model to match the target documents and no pre-existing focussed model is appropriate. In such a case, the results indicate that generic models such as Extractor and aliweb 1-3 can provide useful phrase sets.

The two cstr-kf sets performed particularly poorly: they received the lowest average ratings, and were not ranked better than fifth for any of the papers (Table 4). This is consistent with our previous study where the individual phrases produced by this model were significantly worse than those from other sources. As in the previous study, we attribute its poor performance to a mismatch between the domain of the training documents (Computer Science) and of the documents used in the evaluation (Human Computer Interaction). There is mounting evidence that domain-specific models using the keyphrase-frequency attribute perform significantly worse than generic models when applied outside the domain on which they are trained: this effect was visible in both of our subjective evaluations, and has been demonstrated by Turney when Kea is evaluated against author keyphrases (personal communication).

5.2 Comparing Keyphrase Sources

Our earlier study, based on the ratings of individual phrases, found no significant difference between the author keyphrases and Kea models averaged over all the documents. In this study we again found no significant differences across all the papers, though this is not very surprising given the size of the experiment and the variability of the responses (Table 3, standard deviation row).

Some significant differences were detected within individual papers, as described in Table 7. Most of these differences involve one of the top three sets (author, best-individual, cstr 2-3) being significantly better than one of the bottom three sets (aliweb 2-3, cstr-kf 1-3, cstr-kf 2-3). The aliweb 2-3 set is anomalous in that it is both significantly better and significantly worse than other sets for more than one document. This is because the ratings for this set were the most variable, as can be seen from the standard deviation row of Table 3, and in Table 4. The cstr-kf phrase sets, as we have observed, simply perform poorly. In summary, the best sources of keyphrase sets are author, best-individual and cstr 2-3, while the domain-specific Kea models perform poorly. However, these differences are significant only within individual papers, not across the dataset as a whole.

5.3 Comparing Sets to Individual Phrases

Our second research question asks whether there is a correlation between the quality of the keyphrase sets and the quality of the individual phrases that make up those sets.

One approach is to compare the average score assigned to each phrase set in this evaluation to the average score assigned to each constituent phrase in the previous evaluation. In our previous study, the mean score of an individual author phrase was 6.36, which is very similar to the mean score of 6.65 for author phrase sets in this study. These figures, and figures for the other sets, are shown in the second and third columns of Table 9. (Extractor is omitted because it did not appear in the previous experiment.) Column 4 of this table contains the difference between the two scores; the differences are small in all cases.

**Table 9. Individual and group phrase scores (and ranks) for phrase sets in both evaluations**
Phrase set	Mean individual score (Rank)	Mean set score (Rank)	Difference in means	Difference in ranks
best-individual	n/a (1)	6.63 (2)	n/a	-1
author	6.36 (2)	6.65 (1)	0.29	+1
cstr 2-3	5.87 (3)	6.20 (3)	0.33	0
aliweb 2-3	5.65 (4)	4.98 (6)	0.67	-2
cstr 1-3	5.62 (5)	5.53 (5)	0.09	0
cstr-kf 2-3	5.59 (6)	4.65 (7)	0.94	-1
aliweb 1-3	5.25 (7)	5.78 (4)	0.53	+3
cstr-kf 1-3	4.92 (8)	4.25 (8)	0.67	0

Ultimately, the mean individual scores and mean set scores in Table 9 are not directly comparable because they measure different things. Instead, we can consider the performance of each set by each measure relative to the other sets. Table 9 also assigns a rank to each of the sets when they are sorted by mean individual score (from the previous experiment) and by mean set score (from this experiment). The author set, for example, was ranked second (behind best-individual) when sorted by mean individual score, but ranked first when sorted by mean set score. The ranks for each of the other sets are shown in parentheses, while the rightmost column contains the change in rank for each set between the two evaluations.

Generally, the keyphrase sources ranked approximately the same when ranked as sets as when they are ranked individually. Only the phrase sets based on the aliweb model change by more than 1 position in the rank hierarchy. aliweb 2-3 has already proved to be the most variable of the sets, and performed much better when its individual phrases were measured than when they were considered as a group. Conversely, aliweb 1-3 was ranked seventh when its phrases were considered in isolation, but improved to fourth when considered as a group.
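
The rank comparison in Table 9 can be reproduced with a short Python sketch. The data are copied from Table 9; the variable names are hypothetical, and the sign convention (individual rank minus set rank, so that a positive value means the source did better as a set) is inferred from the table's final column rather than stated in the text.

```python
# phrase set: (rank by mean individual score, mean set score), copied from Table 9.
table9 = {
    "best-individual": (1, 6.63),
    "author": (2, 6.65),
    "cstr 2-3": (3, 6.20),
    "aliweb 2-3": (4, 4.98),
    "cstr 1-3": (5, 5.53),
    "cstr-kf 2-3": (6, 4.65),
    "aliweb 1-3": (7, 5.78),
    "cstr-kf 1-3": (8, 4.25),
}

# Rank by mean set score: one more than the number of sets with a higher mean.
set_rank = {
    name: 1 + sum(other_mean > mean for _, other_mean in table9.values())
    for name, (_, mean) in table9.items()
}

# Positive change: the source ranked better as a set than as individual phrases.
rank_change = {name: individual - set_rank[name] for name, (individual, _) in table9.items()}

moved_more_than_one = [name for name, change in rank_change.items() if abs(change) > 1]
print(rank_change)
print(moved_more_than_one)
```

Only the two aliweb sets move by more than one position, in line with the observation above.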

The relationship between the best-individual set and the author set is interesting. best-individual is constructed of the best individual phrases drawn from both the author and Kea sets, and is by definition the top-ranked set when measured by individual phrases. The author set must therefore contain some phrases that are (when considered individually) ranked less highly than those in best-individual, yet the author phrases as a group very slightly outperform the best-individual set. The prominence of the author set suggests that authors take care to choose a set of complementary keyphrases, rather than choosing good phrases in isolation. This can be explained intuitively: if a document describes five topics, the reader is much better served by five adequate phrases that describe each of the five different topics than they are by five excellent phrases that describe only one or two of the topics.
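
The five-topic intuition can be made concrete with a toy Python example. Everything in it is hypothetical: the ratings, the topic labels and the two helper functions are invented for illustration and are not drawn from the experimental data.

```python
# Each phrase is (hypothetical individual rating, hypothetical topic it describes).
excellent_but_narrow = [(9, "A"), (9, "A"), (8, "A"), (8, "B"), (8, "B")]
adequate_but_broad = [(6, "A"), (6, "B"), (6, "C"), (6, "D"), (6, "E")]


def mean_rating(phrase_set):
    """Average of the individual phrase ratings in a set."""
    return sum(rating for rating, _ in phrase_set) / len(phrase_set)


def topics_covered(phrase_set):
    """Number of distinct topics described by at least one phrase in the set."""
    return len({topic for _, topic in phrase_set})


for name, phrase_set in [("narrow", excellent_but_narrow), ("broad", adequate_but_broad)]:
    print(name, mean_rating(phrase_set), topics_covered(phrase_set))
```

The narrow set wins on mean individual rating but describes only two of the five topics, whereas the broad set describes all five. A measure based solely on individual phrase quality would therefore prefer the set that serves the reader less well.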

We conclude from this data that if we compare two sets of keyphrases, it is likely that the set whose constituent phrases are individually the best (in the opinion of a user) will be the best set when the phrases are considered as a whole. However, this is not always the case. Barker and Cornacchia (2000) observed that phrase sets and individual phrases were ranked differently; our experiments suggest this is the exception rather than the rule.

Developers using keyphrase extraction algorithms should consider whether their phrases are to be considered as a group (e.g. when a document is to be summarised), or in isolation (e.g. when indexing a set of documents by topic). Overall, Kea appears to have equivalent performance by both measures, and both Kea and Extractor perform well, but domain-specific models like cstr-kf which have been reported to boost performance in particular domains are demonstrably worse when applied outside their domains--even to seemingly very similar domains like computer science and human-computer interaction.

5.4 Experimental Design

The experiment that we have reported was constructed to mirror the previous experiment (Jones and Paynter 2002). It is possible that the lack of a statistically significant result in Section 5.2 is due to the number of data points that were collected for each phrase source. However, we have observed significant agreement between the participants both for individual papers and across all papers. Therefore we believe that the results are useful in offering responses to the research questions that we posed.

A design goal of both experiments was to maximise the number of judgements per phrase or phrase set given the constraints of the resources available. Other researchers have reported very low levels of agreement in phrase evaluation studies (Barker and Cornacchia 2000; Chen 1999). Such inconsistency is a common difficulty experienced in studies that gather subjective ratings by human assessors. Our previous study of individual keyphrases reported a high level of agreement, which we attributed to the expertise of the participants in the domain of the documents that they considered. We have again observed significant agreement for our current study, with the exception of one paper where agreement was marginally not significant. This strengthens our belief that it is important to match participant knowledge to the documents for which they provide assessments, following the methodology of Tolle and Chen (2000).

6 Compositional Coverage

Keyphrase extraction techniques like Kea, Extractor and B&C are syntactically based: they choose sequences of words from a document based on statistical and lexical attributes, but have no understanding of the meaning of the phrases that they suggest. Contrast this approach to that of the human author, who does know the meaning of each phrase, and uses this information to construct a set of phrases that cover the topics discussed in the document.

Kea and the other algorithms are handicapped by their inability to explicitly identify and label the high-level concepts in a document. Our studies have established that some of the phrase sets extracted by Kea models are good ones, but have not directly addressed how completely those sets cover the topics within each document. However, we assume that participants considered this to some degree in their rating of the phrase sets.

Documents are normally composed of multiple topics related to the main theme, and these may be interwoven to form the document as a whole. This paper makes the topical structure reasonably clear through the use of sections and subsections, but many documents do not contain such cues (and Kea does not use them). Where present, such cues do however suggest that different topics are discussed at different points in the document, and that in order to cover all the topics in a document, we must draw keyphrases from the document's entire length.

We have investigated this idea further by considering the distribution of extracted keyphrases within documents. Each document was split into ten segments of equal size and the number and proportion of occurrences of keyphrases from each keyphrase set within each segment was calculated (the "Keywords" list was removed from the start of each document). The mean proportions of keyphrase occurrences within each segment are shown in Figure 3.

Figure 3. Distribution of keyphrase occurrences within the experimental documents
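
A calculation of this kind can be sketched in Python as follows. The sketch rests on several assumptions that the text does not specify: segments are measured in words, a keyphrase occurrence is an exact case-insensitive word sequence match (no stemming), and an occurrence is assigned to the segment in which it starts. The function name and the sample text are hypothetical.

```python
import re


def segment_proportions(text, keyphrases, n_segments=10):
    """Proportion of keyphrase occurrences that start in each equal-size segment."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = [0] * n_segments
    for phrase in keyphrases:
        target = re.findall(r"[a-z0-9]+", phrase.lower())
        if not target:
            continue
        for i in range(len(words) - len(target) + 1):
            if words[i:i + len(target)] == target:
                # Map the starting word index onto a segment index 0..n_segments-1.
                counts[i * n_segments // len(words)] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Hypothetical 20-word document with the phrase at the very start and very end.
sample_text = "key phrase " + "filler " * 16 + "key phrase"
proportions = segment_proportions(sample_text, ["key phrase"])
print(proportions)
```

Averaging such per-document proportions over the documents, separately for each keyphrase source, would give values of the kind plotted in Figure 3.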

The overall trend is that the occurrence of keyphrases declines as distance into the document increases. (Note that Kea is biased towards phrases that occur early in the text through the distance attribute described earlier.) More than a third (38%) of aliweb 2-3 keyphrase occurrences were in the first 10% of the document text, as were 25% and 26% of those for cstr 2-3 and author respectively. The effect was less evident for aliweb 1-3, cstr 1-3, and each of the cstr-kf phrase sources, all of which tended to be rated less highly by the human assessors. Extractor provided the most even distribution of phrase occurrences across document segments.

Document structure clearly impacts upon these observations. Each of the study documents is a research paper from an ACM conference, and includes an abstract, introduction, conclusion and references section. Consequently, one might expect that the topics discussed later in the document are presented in overview near to the start. The phrases chosen by authors do tend to occur near to the beginning of documents. However, the author phrases do occur throughout the document text, so this emphasis does not exclude selection from, and consequently coverage of topics within, the entire document. It is also noticeable that phrase occurrence displays a regular pattern in the last three sections of the documents, which we attribute to the similar structures of conclusions and reference lists.

7 Conclusions

Our experiment presented human assessors with keyphrase sets from various sources, and asked them to assess these sets of phrases. It complements an earlier experiment in which the same sets were assessed as individual phrases. In the present experiment the author and Extractor sets were rated positively, as were half of the Kea sets, but the other three Kea phrase sets had mean scores below the midpoint. In the previous experiment the mean individual scores were almost always greater than the midpoint, suggesting that Kea phrases are generally more satisfactory in isolation than when considered as a group.

We are interested in testing the hypothesis that when two phrase sets are compared, the set whose constituent phrases are rated highest is also rated highest when considered as a whole. The evidence suggests this is often but not always true, and that the quality of a phrase set depends not only on the quality of its individual phrases, but on the relationships between the phrases, such as whether the terms are synonyms, how many phrases describe the same topic, and how many of the concepts in the document are reflected in the phrase set.

Kea, and other automatic keyphrase generation algorithms, work by attempting to find good individual phrases, but this does not always lead to a phrase set that reflects the composition of topics in the document. Author keyphrases, on the other hand, are consistently judged highly, better even than the set of best individual phrases from our previous evaluation, which suggests that authors take care to get a balanced phrase set, and this care is reflected in the scores assigned to their choices.

The topical composition of the source documents is generally not considered in keyphrase extraction algorithms and evaluations. We have attempted to gauge its effect by examining the distributions of keyphrases throughout the documents. Although Kea favours extraction of phrases occurring near to the start of the document, it selects phrases that occur throughout the document text, providing good coverage of document content.
