1 Introduction
Text categorization is the process of grouping texts into one or more predefined categories based on their content. Automatic text categorization can play an important role in a wide variety of flexible, dynamic and personalized information management tasks, such as: real-time sorting of email or files into folder hierarchies; topic identification to support topic-specific processing operations; structured search and/or browsing; or finding texts that match long-standing interests or more dynamic task-based interests (Anick and Vaithyanathan 1997). A number of statistical classification and machine learning techniques have been applied to text categorization, including regression models, Bayesian classifiers, decision trees, nearest neighbour classifiers, neural networks, and support vector machines (Weston et al. 1997).
Classification technologies should be able to support category structures that are very general, consistent across individuals, and relatively static (e.g. Dewey Decimal or Library of Congress classification systems, Medical Subject Headings (MeSH), or Yahoo!'s topic hierarchy), as well as those that are more dynamic and customized to individual interests or tasks. Until now the first step in text categorization has been to transform texts, which are typically strings of characters, into a representation suitable for the learning algorithm and the classification task. Each document is represented as a vector of words, as is typically done in information retrieval (Salton and McGill 1983). For most text retrieval applications, entries are weighted to reflect the frequency of terms in texts and the distribution of terms across the collection as a whole. For text classification, simpler binary feature values (i.e. a word either occurs or does not occur in a document) are often used instead. For example, in the most commonly used text representation, the vector space model, a vector of words represents each text. A word-by-text matrix A is used for a collection of texts, where each entry represents the occurrence of a word in a text, i.e. A = [a_{ij}], where a_{ij} is the weight of word i in text j. Determining the weight and the frequency of words in a text is common to all existing methods.
Classification of the aforementioned vectors has been based on several learning techniques, including multivariate regression, nearest neighbour classifiers, probabilistic Bayesian models, decision trees, and neural networks. Overviews of this text classification work can be found in Lewis and Hayes (1994), Yang (1999), Joachims (1998) and Dumais et al. (1998). Furthermore, latent semantic analysis (LSA) (Deerwester et al. 1990), a semantic dimensionality reduction technique, is an approach to automatic indexing and information retrieval that attempts to overcome two common problems in information retrieval. The first problem is termed synonymy and refers to the fact that there are many names for the same object. The second problem is called polysemy, and refers to the fact that most words have more than one meaning.
Recently, machine learning algorithms have been applied successfully to automatic text categorization problems with a large number of unique words (over 15,000). However, it has not yet been clarified, particularly for large-scale problems with more than a million feature terms, to what degree the following two factors work to the detriment of each other: the reduction of the classification error using computationally intensive machine learning algorithms, and the loss of information due to the selection of feature terms before applying the expensive learning algorithms.
Specifically, in the learning procedure the algorithm of Slonim and Tishby (2001), which is based on the Bayesian classifier, is O(n^{3}k) in complexity, where n is the total number of words and k is the number of classes. Furthermore, the philosophy of the aforementioned methodology is justified because a Bayesian classifier presents a weakness with point sets of large dimension (Domingos and Pazzani 1997). It is known that the high dimensionality of a classified vector S of words creates classification problems, which arise from the asymptotic error rate of the nearest neighbour rule (Toussaint et al. 1984). This problem is partially solved by the Voronoi diagram of S, which partitions the plane into convex polygonal Voronoi cells.
This paper attempts to improve the performance of naive text categorization methods by using the concept of probability-weighted amount of information, and also by incorporating morphological information and probabilistic language modelling techniques in extracting and computing the weights of feature terms. Thus, our analysis is based on the advantages of the Voronoi diagram, and we reduce the complexity of this transformation by using the onion algorithm of Bremner et al. (2003). The idea behind this implementation is to use the convex hull subroutine recursively to extract the outermost convex hull of the given points (characters), until the smallest convex polygon S1 remains. In particular, we have developed a method based on an onion polygon, each point of which represents a character of the document. For the numerical conversion of each character we used a specific numerical representation of the conversion strings, which converts each character to the double-precision value of its ASCII code.
In this novel way we avoid the problems of vector representation by converting the entire text into a unique set of numbers (more details of this method are given in section 2.1). This conversion took place so that all the features of the characters of a document could be represented on the Cartesian plane and used in computational geometry. Furthermore, a characteristic of the onion layers method (Borgefors et al. 1992) is that all the features are sorted in an ideal way, and specific features of documents, such as spaces and punctuation marks, play a dominant role in the final reduction. This method may be characterized as a continuation of our latest research, such as specific feature extraction for fingerprint verification via onion layers (Poulos et al. 2003), and the study of the encephalogram, where the feature of a convex polygon gave positive results in diagnostic biometrical tests (Poulos et al. 1999).
The aim of this paper is to describe the proposed algorithm in depth and with examples, and, furthermore, to compare it statistically with the similar LSA reduction method in order to extract significant conclusions regarding the statistical assurance of this method.
2 Method
In brief, our proposed method is divided into the following stages:
 Preprocessing stage: conversion of the symbolic expression (in our case an array of characters of a text) to numeric values.
 Processing stage: analysis of the proposed dimensionality reduction technique using an onion algorithm based on computational geometry for text categorization purposes.
 Categorization stage: the result of the above reduction is to obtain the smallest characteristic subset. This is compared geometrically with another characteristic subset from a different text in order to determine their semantic correlation.
 Decision stage: determination of the rules of categorization.
2.1 Preprocessing stage
In this stage we suppose that a selected text is an input vector of characters [c_{1}, c_{2}, ..., c_{n}], where the c_{i} represent the characters of the selected text. Then, using a conversion procedure in which a symbolic expression (in our case an array of characters of a text) is converted from ASCII characters in strings to arithmetic values, we obtained a numerical value vector S = [s_{1}, s_{2}, ..., s_{n}], where these values ranged between 1 and 128.
In our case this conversion was achieved by using the double.m function of the Matlab language. This function converts strings to double precision, which is equivalent to converting each ASCII character to its numerical representation.
For better comprehension we give the following example via Matlab:
>> S = 'This is a message to test the double "command".'
>> double(S)
ans =
Columns 1 through 12
84 104 105 115 32 105 115 32 97 32 109 101
Columns 13 through 24
115 97 103 101 32 116 111 32 116 101 115 116
Columns 25 through 36
32 116 104 101 32 100 111 117 98 108 101 32
Columns 37 through 46
34 99 111 109 109 97 110 100 34 46
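For readers outside Matlab, the same conversion can be sketched in Python using the built-in ord, which returns the same ASCII codes as double.m; this is an illustrative equivalent, not part of the original implementation:

```python
def text_to_codes(text):
    """Convert a string to its list of ASCII code values,
    mirroring Matlab's double() on a character array."""
    return [ord(ch) for ch in text]

S = 'This is a message to test the double "command".'
codes = text_to_codes(S)
print(codes[:12])  # [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 109, 101]
```

The first twelve values reproduce columns 1 through 12 of the Matlab output above.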
2.2 Processing stage
Our proposed method is based on the following proposition:
The set of elements of vector S for each selected text contains a convex subset S_{1}, which has a specific position in relation to the original set (Poulos et al. 1999, Poulos et al. 2003). This position may be determined by using a combination of computational geometry algorithms known as Onion Peeling Algorithms (Bose and Toussaint 1995), with overall complexity O(d·n log n), where d is the depth of the smallest convex layer and n is the number of characters in the numerical representation (in accordance with section 2.1).
Thus, the smallest convex layer S_{1} of the original set S carries specific information. In particular, vector S_{1} may be characterized as a common geometrical area of all the elements of vector S. In our case, this consideration is valuable because this subset may be characterized as representing the significant semantics of the selected text.
Implementation
We consider the set of characters of a selected text to be vector S. The algorithm starts with a finite set of points P_{0} = S in the plane, and the following iterative process is considered. Let P_{1} be the set P_{0} minus all the points on the boundary of the convex hull of P_{0}. Similarly, define P_{i+1} as P_{i} minus all the points on the boundary of the hull of P_{i}. The process continues until the set is empty (see Figure 1). The hulls are called the layers of the set, and the process of peeling away the layers is called onion peeling for obvious reasons (Figure 1). More details of the algorithmic construction of the proposed convex polygon and onion algorithm are presented in the Appendix.
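The iterative peeling process can be sketched in pure Python. This is an illustrative implementation (Andrew's monotone chain for each hull), not the Bremner et al. (2003) routine used here; for simplicity it peels only the hull vertices at each step, so collinear boundary points fall to a later layer:

```python
def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def onion_layers(points):
    """Peel convex layers: the hull of the set, then the hull of what
    remains, and so on, until no points are left."""
    remaining = list(set(points))
    layers = []
    while remaining:
        hull = convex_hull(remaining)
        layers.append(hull)
        hull_set = set(hull)
        remaining = [p for p in remaining if p not in hull_set]
    return layers

# A square of points with a nested square and a centre point: three layers.
pts = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
layers = onion_layers(pts)
```

The last element of `layers` is the smallest (innermost) convex layer used in the categorization stage.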
Figure 1. Onion layers of a set of points
2.3 Categorization stage
In our case we considered that the smallest convex layer that has depth 3 (Figure 1) carries specific information (Poulos et al. 2003), because this position gives a geometrical interpretation of the semantics of the selected text. In other words, the smallest convex polygon (layer) depicts a particular geometrical area in which the values of these semantics range. This structure has also been used, for example, in statistics to define a generalization of the concept of the median for two-dimensional data (O'Rourke et al. 1982). This feature may be characterized as unique to each selected text because the two following conditions (Poulos et al. 1999) are ensured:
 the selected area layer does not intersect any other layer
 the particular depth of the smallest layer is variable in each case.
Thus, two variables were extracted from the proposed text categorization method: the area of the smallest onion layer and the depth (i) of this layer, which is a subset of the original text set S values. In more detail, the smallest onion layer is a matrix S_{1} = [(x_{1}, y_{1}), ..., (x_{m}, y_{m})] that contains the set of convex coordinate values of the smallest onion layer, where x represents the frequency of each converted character and y the numerical value of each character.
Taking into account the specific features of the aforementioned variables it is easy to ascertain that these may be used in the next stage.
2.4 Decision stage of computational geometry method
In this stage subset S_{1} intersects a new subset S_{2}, which came from the processing of another set, on the Cartesian plane (Figure 2).
Figure 2. Decision stage between two different onion algorithm procedures
The correlation of the investigated subsets S_{1} and S_{2} is decided based on the following experimental conditions:
If the above subsets intersect, the degree of correlation is extracted by using the fraction of the common intersection and the area of the original convex set. In more detail, if we consider that S_{12} = S_{1} ∩ S_{2}, and that the corresponding areas of these convex polygons are E_{1}, E_{2} and E_{12}, then the degree of correlation r of subset S_{2} with regard to subset S_{1} is extracted as follows:
r = E_{12} / E_{1}   (1)
In our case, if r = 0, then the system decides that the two compared subsets S_{1} and S_{2} are not correlated (Figures 3 and 4).
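This decision rule can be sketched in Python, assuming both smallest layers are given as counter-clockwise vertex lists; the sketch uses Sutherland-Hodgman clipping for the intersection and the shoelace formula for the areas, with hypothetical function names, and is not the Matlab code used in the experiments:

```python
def polygon_area(poly):
    """Shoelace formula for a simple polygon given as ordered vertices."""
    n = len(poly)
    s = 0.0
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def clip_convex(subject, clip):
    """Sutherland-Hodgman clipping: intersection of two convex polygons,
    both given as counter-clockwise vertex lists."""
    def inside(p, a, b):
        # p lies on the interior (left) side of the directed edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p1, p2, a, b):
        # intersection of segment p1-p2 with the infinite line through a, b
        x1, y1 = p1; x2, y2 = p2; x3, y3 = a; x4, y4 = b
        denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / denom
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        polygon, output = output, []
        for j in range(len(polygon)):
            p, q = polygon[j], polygon[(j + 1) % len(polygon)]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
        if not output:          # no overlap at all
            return []
    return output

def correlation_degree(s1, s2):
    """r = E12 / E1 (Equation 1); r = 0 means the layers do not overlap."""
    inter = clip_convex(s1, s2)
    if len(inter) < 3:
        return 0.0
    return polygon_area(inter) / polygon_area(s1)

# Two unit squares overlapping in a 0.5 x 0.5 region: r = 0.25 / 1.0
s1 = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
s2 = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]
r = correlation_degree(s1, s2)
```

Because the intersection of two convex polygons is itself convex, the shoelace area of the clipped output is well defined.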
Figure 3. Procedure of correlating the two smallest convex polygons, extracted using the onion layer method (see Figure 2)
In this way degree r is the defining factor in categorizing two investigated texts.
Figure 4. Case of categorization in the same category. The green area is the common area of the compared text categorization documents
3 Experimental part
3.1 Data acquisition
In this study we tested the proposed method in three different semantic categories:
 Five documents (A_{1}, A_{2}, A_{3}, A_{4} and A_{5}) concerning Money and Marketing, with the Reuters-21578 collection as a source.
 Five documents (B_{1}, B_{2}, B_{3}, B_{4} and B_{5}) concerning computational geometry, with common keywords: Computational, Geometry, Algorithm, Onion Algorithm and Fingerprint Verification.
 Five documents (C_{1}, C_{2}, C_{3}, C_{4} and C_{5}) concerning the AIDS virus, from the Reuters-21578 collection.
Each text had about 500 characters (with spaces), or about 100 words.
3.2 Implementation of the algorithm
The implementation of the algorithm takes place using the following steps.
In the preprocessing stage we selected a text from each category and submitted it for numerical conversion, implemented via the Matlab function double.m.
For example, we took the following text from the third category (AIDS and Virus), which we called C_{1}:
West German chemical group Bayer AG <BAYG.F> said it had received claims for damages from haemophiliacs who allege they were infected with AIDS through a blood plasma produced by a Bayer subsidiary. Bayer's shares fell 13 marks to close at 292 in Frankfurt today. Dealers said a newspaper report about possible huge claims for damages triggered off the share selling in Bayer, other pharmaceuticals and insurance companies. Bayer said in a statement it was insured against claims and had received two.
This text was submitted to numerical conversion, and a numerical vector S was constructed.
The same procedure was followed for other selected texts from the three categories.
In the processing stage vector S was put on the Cartesian plane and the onion algorithm was applied, thus isolating the characteristic smallest convex polygon (vector S_{1}). This algorithm was implemented via the Matlab function convex2.m.
This procedure was repeated for the other vectors until we finally isolated the 15 smallest convex polygons, or vectors S_{1}^{(n)}, where n = 1 ... 15.
An example of this procedure is presented in Figures 5 and 6.
Figure 5. Implementation of the onion polygon of text C_{1} using vector S
Figure 6. Isolation of the smallest convex polygon of text C_{1}, or coordinate vectors x_{1} and y_{1}
In the categorization stage the coordinate vectors x_{1} and y_{1} of the smallest polygon were intersected with the other vectors. The algorithmic procedure was as follows.
We considered that the intersected coordinate vectors were x_{1} and y_{1} from the first text and x_{2} and y_{2} from the second text. Then the convex polygon, which was produced by their intersection, had coordinate vectors x_{12} and y_{12}. This algorithm was implemented via a Matlab function.
For example, we used the following second text from the third category (AIDS and Virus), which we called C_{2}:
Bristol-Myers Co <BMY> said it would seek Food and Drug Administration permission to test its AIDS vaccine in humans by the end of March. A company spokesman said the company will file an investigational new drug application by the end of the month, requesting the FDA to allow testing of the vaccine in humans. Scientists at the company's Genetic Systems unit created an AIDS vaccine, which Bristol-Myers said produced antibodies to the AIDS virus in mice and monkeys. The vaccine consists of smallpox virus.
We also used the following text, which we called B_{1}, from the second category, Computational Geometry:
In this paper, fingerprint segmentation for secure Internet verification purposes is investigated. The novel application of computational geometry algorithms in the fingerprint segmentation stage showed that the extracted feature (characteristic polygon) may be used as a secure and accurate method for fingerprintbased verification over the Internet. On the other hand, the proposed method promisingly allows very small false acceptance and false rejection rates, as it is based on specific segmentation,
Figures 7 and 8 show the implementation of the intersection of semantic texts from the same category, C_{1} and C_{2}, where we obtained a large convex polygon C_{12} (green) (C_{1} ∩ C_{2} ≠ 0). In contrast, Figures 9 and 10 show that the intersection of the semantic texts C_{1} and B_{1} is zero (B_{1} ∩ C_{1} = 0).
Figure 7. Intersection of two semantic texts, C_{1} and C_{2}_{, } using the onion algorithm
Figure 8. Isolation of the intersection (green) of the two smallest convex polygons of semantic texts C_{1} (blue) and C_{2} (red)
Figure 9. Intersection of two different semantic texts, C_{1} and B_{1}, using the onion algorithm
Figure 10. Zero intersection between the two smallest convex polygons of semantic texts C_{1} (blue) and B_{1} (red)
In the decision stage the degree of correlation r between subsets S_{1} and S_{2} was calculated as follows.
First, we calculated the area of each convex subset, E_{1} and E_{2}, followed by the area of their common intersection, E_{12}. Then degree r was calculated according to Equation (1) as r = E_{12}/E_{1}. This procedure was implemented in Matlab and was repeated for all text cases. The degrees are presented in Table 1.
4 Results
In this part the results of the implementation of the proposed algorithm for text categorization purposes are presented in Table 1. More specifically, we extracted the degrees of correlation using Equation 1. In total we obtained 15 x 15 = 225 degrees of correlation for the three semantic categories.
Table 1: Degree of correlation for three semantic categories (A, B, C) using an onion algorithm
TEXTS  A1  A2  A3  A4  A5  B1  B2  B3  B4  B5  C1  C2  C3  C4  C5
A1  1  0.71  0.42  0.35  0.33  0  0  0  0  0  0  0  0  0  0
A2  0.53  1  0.67  0.22  0.18  0  0  0  0  0  0  0  0  0  0
A3  0.44  0.48  1  0.39  0.87  0  0  0  0  0  0  0  0  0  0
A4  0.67  0.56  0.33  1  0.26  0  0  0  0  0  0  0  0  0  0
A5  0.44  0.18  0.35  0.56  1  0  0  0  0  0  0  0  0  0  0
B1  0  0  0  0  0  1  0.67  0.43  0.38  0.35  0  0  0  0  0
B2  0  0  0  0  0  0.19  1  0.22  0.34  0.44  0  0  0  0  0
B3  0  0  0  0  0  0.54  0.87  1  0.56  0.57  0  0  0  0  0
B4  0  0  0  0  0  0.66  0.56  0.11  1  0.33  0  0  0  0  0
B5  0  0  0  0  0  0.23  0.43  0.31  0.27  1  0  0  0  0  0
C1  0  0  0  0  0  0  0  0  0  0  1  0.44  0.23  0.25  0.33
C2  0  0  0  0  0  0  0  0  0  0  0.81  1  0.12  0.17  0.19
C3  0  0  0  0  0  0  0  0  0  0  0.54  0.41  1  0.33  0.33
C4  0  0  0  0  0  0  0  0  0  0  0.30  0.29  0.46  1  0.27
C5  0  0  0  0  0  0  0  0  0  0  0.23  0.25  0.42  0.25  1
5 Statistical evaluation
In this section a statistical evaluation of the proposed method against the most widely used information retrieval method, latent semantic indexing (LSI), is attempted. In particular, we tested the five texts of the same semantic category C (AIDS virus). To compare the two techniques we used rank correlation among several variables together with the chi-square test. In the following paragraphs we develop the procedure followed in order to yield the normalized matrices for each method and, thereafter, apply them to the selected statistical method.
5.1 Concordance: rank correlation among several variables
The concept of correlation between two variables can be expanded to consider association among more than two, as shown for multiple correlations (Kendall 1962). Such association is readily measured nonparametrically by a statistic known as the coefficient of concordance. In our case this correlation, W, gave the criterion of agreement among the five selected sets of data. In other words, this test investigates whether the data of the columns are in agreement. The correlation W was computed using the following equation:
W = (12 Σ R_{i}^{2} − 3M^{2}n(n + 1)^{2}) / (M^{2}n(n^{2} − 1))   (2)
where M is the number of variables being correlated, n is the number of data per variable, and R_{i} is the sum of the values of each column, for i = 1 ... n.
We can ask whether a calculated sample W is significant, that is, whether it represents an association different from zero in the population that was sampled. A simple way to find the relationship between the Kendall coefficient of concordance, W, and the Friedman chi-square involves using the following formula:
χ_{r}^{2} = M(n − 1)W   (3)
Therefore, we can convert the calculated W to its equivalent χ_{r}^{2} and then employ the critical values of χ^{2}.
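Equations (2) and (3) can be sketched in Python; this uses the standard textbook form of Kendall's W computed from the column rank sums, and the helper names are hypothetical:

```python
def kendall_w(rank_sums, M):
    """Kendall's coefficient of concordance W (Equation 2).
    rank_sums: column sums R_i; M: number of variables being correlated."""
    n = len(rank_sums)
    numerator = 12 * sum(R ** 2 for R in rank_sums) - 3 * M ** 2 * n * (n + 1) ** 2
    return numerator / (M ** 2 * n * (n ** 2 - 1))

def friedman_chi2(W, M, n):
    """Equivalent Friedman chi-square with n - 1 degrees of freedom (Equation 3)."""
    return M * (n - 1) * W

# Perfect agreement: three variables rank three items identically, so the
# column rank sums are 3, 6, 9 and W = 1.
W = kendall_w([3, 6, 9], M=3)
chi2 = friedman_chi2(W, M=3, n=3)   # 3 * (3 - 1) * 1 = 6
```

With identical rankings W reaches its maximum of 1; with rank sums that are all equal, W is 0.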
5.2 Latent semantic indexing
In the first stage we created Table 2 to organize the conceptual content of the documents. In particular, the local weight (L_{ij}) of each term is represented by each row in Table 2, and every document is represented by each column. Thus, an individual entry in Table 2, a_{ij}, represents the weight of term i in document j, where, in our case, i = 1 ... 88 and j = 1 ... 5. More specifically, each column of the term-document matrix can be expressed as a vector consisting of the weights of each term and mapped in a vector space (Tokunaga 1999):
d_{j} = (w_{1j}, w_{2j}, ..., w_{ij}, ...)   (4)
where L_{ij} is the local weighting. This approach defines the weights w_{ij} in each document. Let P_{ij} denote the occurrence probability of term t_{i} (term frequency) in document d_{j}. We ascribe significance to a term's occurrence, on the grounds that it represents a document's topics more than other factors do. Therefore, we base w_{ij} on the term occurrence probability P_{ij} in each document, and we define a local weighting L_{ij} as a function of P_{ij}:
(5)
The LSI method uses an algebraic model of document retrieval, based on a matrix technique known as singular value decomposition (SVD). Thus, using the SVD function of Matlab, we obtained three separate matrices from the original table A, as represented in Table 2, of L_{ij} elements. The first matrix is a term-by-concept matrix B. The second matrix is a concept-by-concept matrix C (see Table 3). The third matrix is a concept-by-document matrix D. One important note is that multiplying B × C × D may not necessarily yield the original Table 2, since many of the concepts may be very small and not worth keeping. In our case, we used the normalized concept-by-concept matrix (Table 3), of dimensionality 5 x 5, in order to conceptually compare the LSI method with our proposed method.
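A minimal sketch of this decomposition, assuming numpy is available; the matrix below is a small toy example, not the paper's Table 2:

```python
import numpy as np

# Toy term-by-document matrix (4 terms x 3 documents); the values are
# illustrative local weights only.
A = np.array([[1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])

# A = B * diag(s) * D: B is term-by-concept, diag(s) is the
# concept-by-concept matrix of singular values, D is concept-by-document.
B, s, D = np.linalg.svd(A, full_matrices=False)

# Keeping every singular value reproduces A exactly; dropping the
# smallest concepts would give the reduced (LSI) representation.
A_rebuilt = B @ np.diag(s) @ D
```

The singular values in `s` come out in descending order, so truncation simply keeps the leading entries.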
Table 2: Mapping of the most frequent common terms (words) from the same five semantic (AIDS) texts, where T/O is the term's occurrence and L_{ij} is the local weighting
Terms  Document C1  Document C2  Document C3  Document C4  Document C5  
T/O  L_{ij}  T/O  L_{ij}  T/O  L_{ij}  T/O  L_{ij}  T/O  L_{ij}  
West  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
German  1  0.0001  0  0.0000  0  0.0000  1  0.0001  0  0.0000 
chemical  1  0.0001  0  0.0000  0  0.0000  1  0.0001  3  0.0011 
group  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
Bayer  4  0.0020  0  0.0000  0  0.0000  1  0.0001  1  0.0001 
said  3  0.0011  3  0.0011  1  0.0001  2  0.0005  1  0.0001 
received  2  0.0005  0  0.0000  0  0.0000  2  0.0005  0  0.0000 
claims  3  0.0011  0  0.0000  1  0.0001  1  0.0001  0  0.0000 
damages  2  0.0005  0  0.0000  0  0.0000  1  0.0001  1  0.0001 
haemophiliacs  1  0.0001  0  0.0000  1  0.0001  2  0.0005  0  0.0000 
allege  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
infected  1  0.0001  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
AIDS  1  0.0001  3  0.0011  2  0.0005  2  0.0005  2  0.0005 
through  1  0.0001  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
blood  1  0.0001  0  0.0000  0  0.0000  1  0.0001  1  0.0001 
plasma  1  0.0001  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
produced  1  0.0001  1  0.0001  0  0.0000  1  0.0001  1  0.0001 
Bayer  3  0.0011  0  0.0000  0  0.0000  2  0.0005  0  0.0000 
subsidiary  1  0.0001  0  0.0000  1  0.0001  1  0.0001  0  0.0000 
shares  2  0.0005  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
fell  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
marks  1  0.0001  0  0.0000  1  0.0001  1  0.0001  1  0.0001 
close  1  0.0001  0  0.0000  0  0.0000  1  0.0001  0  0.0000 
Frankfurt  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
today  1  0.0001  0  0.0000  0  0.0000  1  0.0001  0  0.0000 
Dealers  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
newspaper  1  0.0001  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
report  1  0.0001  0  0.0000  1  0.0001  1  0.0001  0  0.0000 
about  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
possible  1  0.0001  0  0.0000  1  0.0001  1  0.0001  1  0.0001 
huge  1  0.0001  0  0.0000  0  0.0000  1  0.0001  1  0.0001 
triggered  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
selling  1  0.0001  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
other  1  0.0001  0  0.0000  2  0.0005  3  0.0011  1  0.0001 
pharmaceutical  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
insurance  1  0.0001  0  0.0000  1  0.0001  2  0.0005  1  0.0001 
company  1  0.0001  3  0.0011  1  0.0001  2  0.0005  1  0.0001 
statement  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
insured  1  0.0001  0  0.0000  2  0.0005  1  0.0001  1  0.0001 
against  1  0.0001  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
and  1  0.0001  2  0.0005  2  0.0005  2  0.0005  3  0.0011 
two  1  0.0001  0  0.0000  0  0.0000  0  0.0000  1  0.0001 
Bristol-Myers  0  0.0000  2  0.0000  1  0.0001  1  0.0001  1  0.0001 
would  0  0.0000  1  0.0001  0  0.0000  1  0.0001  1  0.0001 
seek  0  0.0000  1  0.0001  0  0.0000  0  0.0000  2  0.0005 
Food  0  0.0000  1  0.0001  1  0.0001  1  0.0001  0  0.0000 
Drug  0  0.0000  2  0.0005  0  0.0000  0  0.0000  1  0.0001 
Administration  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
permission  0  0.0000  1  0.0001  1  0.0001  0  0.0000  0  0.0000 
test  0  0.0000  2  0.0005  0  0.0000  1  0.0001  2  0.0005 
vaccine  0  0.0000  4  0.0020  1  0.0001  1  0.0001  0  0.0000 
humans  0  0.0000  2  0.0005  1  0.0001  1  0.0001  1  0.0001 
end  0  0.0000  2  0.0005  1  0.0001  1  0.0001  1  0.0001 
March.  0  0.0000  1  0.0001  0  0.0000  0  0.0000  1  0.0001 
spokesman  0  0.0000  1  0.0001  1  0.0001  2  0.0005  3  0.0011 
file  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
investigational  0  0.0000  1  0.0001  1  0.0001  1  0.0001  1  0.0001 
new  0  0.0000  1  0.0001  0  0.0000  0  0.0000  1  0.0001 
application  0  0.0000  1  0.0001  1  0.0001  0  0.0000  1  0.0001 
month  0  0.0000  1  0.0001  0  0.0000  1  0.0001  0  0.0000 
requesting  0  0.0000  1  0.0001  2  0.0005  0  0.0000  0  0.0000 
FDA  0  0.0000  1  0.0001  0  0.0000  1  0.0001  1  0.0001 
allow  0  0.0000  1  0.0001  1  0.0001  0  0.0000  1  0.0001 
Scientists  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
Genetic  0  0.0000  1  0.0001  2  0.0005  1  0.0001  1  0.0001 
Systems  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
unit  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
created  0  0.0000  1  0.0001  1  0.0001  0  0.0000  1  0.0001 
which  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
antibodies  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
virus  0  0.0000  2  0.0005  1  0.0001  1  0.0001  0  0.0000 
monkey  0  0.0000  1  0.0001  0  0.0000  0  0.0000  0  0.0000 
mice  0  0.0000  1  0.0001  1  0.0001  1  0.0000  0  0.0000 
smallpox  0  0.0000  1  0.0001  1  0.0001  0  0.0000  0  0.0000 
consists  0  0.0000  1  0.0001  1  0.0001  0  0.0000  1  0.0001 
hospital  0  0.0000  0  0.0000  1  0.0001  1  0.0001  1  0.0001 
general  0  0.0000  0  0.0000  1  0.0001  1  0.0001  0  0.0000 
thus  0  0.0000  0  0.0000  0  0.0000  0  0.0000  0  0.0000 
letter  0  0.0000  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
Growth  0  0.0000  0  0.0000  2  0.0005  1  0.0001  0  0.0000 
agreed  0  0.0000  0  0.0000  0  0.0000  1  0.0001  0  0.0000 
lecture  0  0.0000  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
report  0  0.0000  0  0.0000  1  0.0001  0  0.0000  1  0.0001 
patient  0  0.0000  0  0.0000  2  0.0005  1  0.0001  1  0.0001 
library  0  0.0000  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
food  0  0.0000  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
care  0  0.0000  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
staff  0  0.0000  0  0.0000  1  0.0001  0  0.0000  0  0.0000 
Table 3: Concept-by-concept decomposed matrix using the SVD method
0.4489  0.8667  0.8840  0.8430  0.9810  
0.5386  0.7050  0.6724  0.6230  0.7500  
0.4733  0.7560  0.8038  0.4137  0.7860  
0.5693  0.4980  0.5210  0.6750  0.9238  
0.5220  0.8500  0.6610  1.0000  0.7870  
Rank Score  2.5521  3.6757  3.5422  3.5547  4.2278 
In the next stage the data of Table 3 can be tested using the top-down correlation technique (see Equation 2). The null hypothesis (H_{0}) of this test is that there is agreement regarding the investigated data of the columns of Table 3. In our case, according to Equation (2), M = 5 was the number of investigated documents, n = 5 were the conceptual values after reduction, and R_{i} was the sum of the rank scores (see Table 3). Then W = 0.0059, and thus χ_{r}^{2} = M(n − 1)W = 5 × 4 × 0.0059 = 0.118 (see Equation 3). Comparing it to the chi-square distribution with ν = 5 − 1 = 4 degrees of freedom, we had nonrejection of the null hypothesis (H_{0}) of agreement regarding the conceptuality of the compared documents (0.95 < P < 0.99).
5.3 Onion layers method
In this stage we obtained the five sets of data for each document (Table 4) using the method described in section 3.2. It should be noted that each set contains the points of the smallest convex polygon plus a number of points ranging between 0 and 2 (see Figure 1).
Table 4: Pair points of the last (smallest) polygon of each of the five onion polygons of documents C1, C2, C3, C4, C5
Document C1  Document C2  Document C3  Document C4  Document C5  
Axis X  Axis Y  Axis X  Axis Y  Axis X  Axis Y  Axis X  Axis Y  Axis X  Axis Y 
248  97  255  101  259  99  251  99  232  98 
246  70  242  101  249  100  242  99  230  97 
265  97  235  100  242  97  240  97  220  83 
248  97  249  44  265  32  237  83  243  97 
259  97  267  68  266  98  262  77  232  98 
255  101  259  99  269  97  240  97  
252  101  254  97  251  99  
266  70  260  97 
The next stage was to transform the data of Table 4 in order for each pair of data to be represented by a single value. Thus, for the whole set we calculated the magnitude of each pair point (x_{ij}, y_{ij}) from the origin 0 of the coordinate system using the following formula:
m_{ij} = √(x_{ij}^{2} + y_{ij}^{2})   (6)
where i = 1 ... 5 indexes the elements of each smallest convex layer and j = 1 ... 5 indexes the documents.
We then obtained the first five values (Table 5) for each document in order to compare these statistically with those of the LSI method (Table 3), and normalized the data in Table 6.
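Equation (6) and the normalization used for Table 6 can be sketched in Python (hypothetical helper names; the sample points are the five pair points of document C1 from Table 4):

```python
import math

def magnitudes(points):
    """Distance of each pair point (x, y) from the origin (Equation 6)."""
    return [math.hypot(x, y) for x, y in points]

def normalize(values, maximum):
    """Divide every magnitude by the global maximum, as in Table 6."""
    return [v / maximum for v in values]

# The five pair points of document C1 from Table 4:
c1 = [(248, 97), (246, 70), (265, 97), (248, 97), (259, 97)]
m = magnitudes(c1)             # 266.2949, 255.7655, ... as in Table 5
norm = normalize(m, 283.4784)  # 0.9394, 0.9022, ... as in Table 6
```

Running this reproduces the C1 column of Table 5 and, after division by the maximum 283.4784, the C1 column of Table 6.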
Table 5: Magnitude of all the pair points of documents C1, C2, C3, C4, C5
Document C1  Document C2  Document C3  Document C4  Document C5
266.2949  274.2736  277.2760  269.8185  251.8492
255.7655  262.2308  268.3300  261.4670  249.6177
282.1950  255.3919  260.7163  258.8610  235.1361
266.2949  252.8577  266.9251  251.1135  261.6448
276.5683  275.5231  283.4784  273.0806  251.8492
Table 6: Normalization on the maximum value (283.48) of Table 5
Document C1  Document C2  Document C3  Document C4  Document C5
0.9394  0.9675  0.9781  0.9518  0.8884
0.9022  0.9250  0.9466  0.9224  0.8806
0.9955  0.9009  0.9197  0.9132  0.8295
0.9394  0.8920  0.9416  0.8858  0.9230
0.9756  0.9719  1.0000  0.9633  0.8884
Rank Score  4.7252  4.5768  4.5588  4.5818  4.7992
In this stage the data of Table 6 can be tested using the top-down correlation technique (Equation 2). The null hypothesis (H_{0}) of this test is that there is agreement regarding the investigated data of the columns of Table 6. In our case, according to Equation (2), M = 5 was the number of investigated documents, n = 5 were the conceptual values after reduction, and R_{i} was the sum of the rank scores (see Table 6). Then W = 0.0003, and thus χ_{r}^{2} = M(n − 1)W = 5 × 4 × 0.0003 = 0.006 (see Equation 3). On comparing it to the chi-square distribution with ν = 5 − 1 = 4 degrees of freedom, we had nonrejection of the null hypothesis of agreement regarding the conceptuality of the compared documents (0.95 < P < 0.99).
6 Analysis and discussion
As can be seen from the results in Table 1, the proposed method correctly categorized all tested texts.
The statistical evaluation based on rank correlation of our proposed method (section 5.1) and the LSI method (section 5.2) showed that they have the same significance level (0.95 < P < 0.99), although our proposed method may be considered superior to the LSI method because its reduction technique is independent of the number of documents and the number of terms involved. In more detail, the method promises a great reduction in the original information. For example, a selected text of 500 characters is now represented (at the categorization stage) by a convex polygon defined by two vectors of 39 elements each: a total reduction in information of about 96.4%.
Furthermore, for smaller texts (100 words in length) we avoid the training procedure and, in this way, the complexity of the proposed method is reduced dramatically in comparison with traditional methods.
For training on longer texts this method may be applied as follows:
 The text is divided into smaller subsets of about 100 words in length.
 The procedure of the preprocessing stage (section 3.2) is then applied to each subset in turn.
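The two steps above can be sketched as a simple chunking helper (the function name and word-based splitting are illustrative assumptions; each returned subset would then be passed to the preprocessing stage of section 3.2):

```python
# Divide a longer text into consecutive subsets of about `size` words,
# as described for training on longer texts.
def split_into_subsets(text, size=100):
    words = text.split()
    return [words[i:i + size] for i in range(0, len(words), size)]
```

For example, a 250-word text is divided into three subsets of 100, 100, and 50 words.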
7 Conclusions and future work
7.1 Conclusion
This paper has explained a text categorization technique that combines numeric conversion of a symbolic expression with a computational geometric algorithm. The proposed technique is suitable for the specific partitioning of digital library collections, and for the semantic categorization of emails and abstracts. This method may be characterized as supervised because an unknown text categorization vector (smallest convex polygon) is compared with particular text categorization vectors whose classes are already known. Thus, a practical way to see whether this method can classify an unknown document vector correctly is to test it against a number of different predetermined semantic document vectors and let the system decide which of these vectors is geometrically nearest to the unknown vector according to the method described in section 3.2.
An example of this is the experiment in which five test categories (of 5 x 100 word documents per category) were used to test this technique. Statistical information was collected from the experiment and used to corroborate our findings:
 Statistical evaluation using rank correlation was significant (0.95 < P < 0.99) and showed that the results of the five tested documents are in agreement.
 The proposed method is based on a different philosophy to those methods that depend on the removal of features based on overall frequency counts (Salton and McGill 1983, Jones and Paynter 2002). In contrast to those methods, the proposed method is based on total word participation in the feature extraction. The results showed that this feature is efficient and accurate for semantic text categorization.
 The onion algorithm, with the lowest complexity, O(d*n log n), may be characterized as more efficient than the text categorization methods currently in use, for example the Bayesian classifier with O(n^{3}k) complexity.
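The layer-peeling idea behind the onion algorithm can be sketched as follows. This is a naive version that recomputes a convex hull per layer (worst case O(n^2 log n)); the O(d*n log n) bound cited above corresponds to more sophisticated convex-layers algorithms, not to this sketch.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def onion_layers(points):
    """Peel convex layers: hull of remaining points, remove it, repeat."""
    remaining = list(set(points))
    layers = []
    while remaining:
        hull = convex_hull(remaining)
        layers.append(hull)
        hull_set = set(hull)
        remaining = [p for p in remaining if p not in hull_set]
    return layers
```

For example, the five points (0,0), (4,0), (4,4), (0,4), (2,2) peel into two layers: the square as the outer hull and the centre point as the innermost layer.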
In conclusion, use of the onion algorithm for text categorization purposes may be considered to be a solution to the problems of managing information storage and retrieval in large digital libraries.
7.2 Future work
A way to improve our proposed method in future would be to develop the numeric conversion of a symbolic expression based on a different encoding method. In particular, we propose to extend the encoding to a larger 8-bit conversion, so that each character of each word is described in a more specific way. Phonetic conversion in place of ASCII conversion may provide an even more accurate solution for our proposed method.
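The proposed 8-bit conversion might look like the following minimal sketch (the helper name and direct character-to-code mapping are illustrative assumptions, not the paper's final design):

```python
# Convert each character of a word to its 8-bit code value, so every
# character is described individually.
def encode_word(word):
    return [ord(ch) & 0xFF for ch in word]

encode_word("text")  # -> [116, 101, 120, 116]
```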
The next aim of this study is to increase the text length in each processed document to more than 100 words.
Furthermore, this system has the ability to investigate the relation of the position of one convex polygon to that of another in the Cartesian plane. Thus, in this study we considered the degree a of the intersection of convex polygons as a measure of comparison. However, in future the centre of gravity of a convex polygon may be used as a measure of comparison and treated as a vector (O'Rourke 1995).
In our latest work we modified the proposed method to construct an anti-spam filter.
Finally, this method may be considered a preliminary study, and the next aim is to apply it to a real test collection. However, the statistical conclusions and the results of the 15 tested documents show that we are moving in the right direction.