A Text Categorization Technique based on a Numerical Conversion of a Symbolic Expression and an Onion Layers Algorithm: Poulos et al.: JoDI

Abstract

The dramatic increase in the amount of content available in digital forms gives rise to large-scale digital libraries, targeted at millions of users. As a result, it has become a necessary to categorize large texts (documents). The paper develops a novel method where text categorization is achieved via a reduction in the original data information using numerical conversion of a symbolic expression and an onion layers algorithm. Three different semantic categories were considered and five texts selected from each category for submission to a text categorization procedure using the proposed method. The results and the statistical evaluation of this procedure showed that the proposed method may be characterized as highly accurate for text categorization purposes.

1 Introduction

Text categorization is the process of grouping texts into one or more predefined categories based on their content. Automatic text categorization can play an important role in a wide variety of more flexible, dynamic and personalized information management tasks as well as: real-time sorting of email or files into folder hierarchies; topic identification to support topic-specific processing operations; structured search and/or browsing; or finding texts that match long-standing interests or more dynamic task-based interests (Anick and Vaithyanathan 1997). A number of statistical classification and machine learning techniques have been applied to text categorization, including regression models, Bayesian classifiers, decision trees, nearest neighbour classifiers, neural networks, and support vector machines (Weston et al. 1997).

Classification technologies should be able to support category structures that are very general, consistent across individuals, and relatively static (e.g. Dewey Decimal or Library of Congress classification systems, Medical Subject Headings (MeSH), or Yahoo!'s topic hierarchy), as well as those that are more dynamic and customized to individual interests or tasks. Until now the first step in text categorization has been to transform texts, which are typically strings of characters, into a representation suitable for the learning algorithm and the classification task. Each document is represented as a vector of words, as is typically done in information retrieval (Salton and McGill 1983). For most text retrieval applications, entries are weighted to reflect the frequency of terms in texts and the distribution of terms across the collection as a whole. For text classification, simpler binary feature values (i.e. a word either occurs or does not occur in a document) are often used instead. For example, in the most commonly used text representation, the vector space model, a vector of words represents each text. A word-by-text matrix A is used for a collection of texts, where each entry represents the occurrence of a word in a text, i.e. , where is the weight of word i in text j. Determining the weight and the frequency of words in a text is common to all existing methods.

Classification of the aforementioned vectors has been based on several learning techniques, including multivariate regression, nearest neighbour classifiers, probabilistic Bayesian models, decision trees, and neural networks. Overviews of this text classification work can be found in Lewis and Hayes (1994), Yang (1999), Joachims (1998) and Dumais et al. (1998). Furthermore, in the semantic text categorization reduction technique, latent semantic analysis (LSA) (Deerwester et al. 1990), there is an approach to automatic indexing and information retrieval that attempts to overcome common problems in information retrieval. The first problem is termed synonym and refers to the fact that there are many names for the same object. The second problem is called polysemy, and refers to the fact that most words have more than one meaning.

Recently, machine learning algorithms have been applied successfully to automatic text categorization problems, with a large number of unique words (over 15000). However, it has not yet been clarified, particularly for large-scale problems with more than a million feature terms, to what degree the following two factors work to the detriment of each other: the reduction of the classification error using computationally-intensive machine learning algorithms, and the loss of information due to the selection of feature terms before applying the expensive learning algorithms.

Specifically, in the learning procedure the algorithm of Slonim and Tishby (2001), which is based on the Bayesian classifier, is O(n³k) in complexity where n is the total number of words and k is the number of classes. Furthermore, the philosophy of the aforementioned methodology is justified because a Bayesian classifier presents a weakness in the point sets for large dimensions (Domingos and Pazzani 1997). It is known that high dimensionality, which presents a classified vector S of words, creates classification problems. These arise from the asymptotic error rate of the nearest neighbour rule (Toussaint et al. 1984). This problem is solved partially by the Voronoi diagram of S, which partitions the plane convex polygonal Voronoi cells.

This paper attempts to improve the performance of the naive text categorization methods by using the concept of probability weighted amount of information, and also by incorporating morphological information and probabilistic language modelling techniques in extracting and computing the weights of feature terms. Thus, our analysis is based on the advantages of the Voronoi diagram and we reduce the complexity of this transformation by using the onion algorithm of Bremner et al. (2003). The idea behind this implementation is to use the convex hull subroutine recursively to extract the outmost convex hull (S1 smallest convex polygon) of the given points (characters). In particular, we have developed a method based on an onion polygon each point of which represents a character of the document. For the numerical conversion of each character we used a specific numerical representation of the conversion strings which equates to double precision of ASCII characters.

In this novel way we avoid the problems of vector representation by converting the entire text into a unique set of numbers (more details of this method are given in section 2.1). This conversion took place in order for all the features of the characters of a document to be represented in the Cartesian plane, and used in the computational geometry. Furthermore, a characteristic of the onion layers method (Borgefors et al. 1992) is that all the features are sorted in an ideal way and all specific features of documents, such as spaces and punctuation marks, have a domain role in the final reduction. This method may be characterized as a continuation of our latest research, such as specific feature extraction for fingerprint verification via onion layers (Poulos et al. 2003), and the study of the encephalogram where the feature of a convex polygon gave positive results in diagnostic biometrical tests (Poulos et al. 1999).

The aim of this paper is to describe the proposed algorithm in depth and with examples, and, furthermore, to compare it statistically with the similar LSA reduction method in order to extract significant conclusions regarding the statistical assurance of this method.

2 Method

In brief, our proposed method is divided into the following stages:

Preprocessing stage: conversion of the symbolic expression (in our case an array of characters of a text) to numeric values.
Processing stage: analysis of the proposed dimensionality reduction technique using an onion algorithm based on computational geometry for text categorization purposes.
Categorization stage: the result of the above reduction is to obtain the smallest characteristic subset. This is compared geometrically with another characteristic subset from a different text in order to determine their semantic correlation.
Decision stage: determination of the rules of categorization.

2.1 Preprocessing stage

In this stage we suppose that a selected text is an input vector , where represent the characters of the selected text. Then, using a conversion procedure where a symbolic expression (in our case an array of characters of a text) is converted to ASCII characters in string arithmetic values, we obtained a numerical value vector where these values ranged between 1-128.

In our case this conversion was achieved by using the double.m function of the Matlab language. This function converts strings to double precision and equates with converting an ASCII character to its numerical representation.

For better comprehension we give the following example via Matlab:

>> S = 'This is a message to test the double "command".'

>> double(S)

ans =

Columns 1 through 12

84 104 105 115 32 105 115 32 97 32 109 101

Columns 13 through 24

115 97 103 101 32 116 111 32 116 101 115 116

Columns 25 through 36

32 116 104 101 32 100 111 117 98 108 101 32

Columns 37 through 46

34 99 111 109 109 97 110 100 34 46

2.2 Processing stage

Our proposed method is based on the following proposition:

The set of elements of vector for each selected text contains a convex subset, which has a specific position in relation to the original set (Poulos et al. 1999, Poulos et al. 2003). This position may be determined by using a combination of computational geometry algorithms, which is known as Onion Peeling Algorithms (Bose and Toussaint 1995) with overall complexity O(d*n log n) times, where d is the depth of the smallest convex layer and n is the number of characters in the numerical representation (in accordance with section 2.1)

Thus, the smallest convex layer of the original set of vector carries specific information. In particular, vector may be characterized as a common geometrical area of all the elements of vector . In our case, this consideration is valuable because this subset may be characterized as representing the significant semantics of the selected text.

Implementation

We consider the set of characters of a selected text to be vector . The algorithm starts with a finite set of points in the plane. The following iterative process is considered. Let be the set minus all the points on the boundary of the hull of . Similarly, define . The process continues until the set is (see Figure 1). The hulls are called the layers of the set, and the process of peeling away the layers is called onion peeling for obvious reasons (Figure 1). More details of the algorithmic construction of the proposed convex polygon and onion algorithm are presented in the Appendix.

Figure 1. Onion layers of a set of points

2.3 Categorization stage

In our case we considered that the smallest convex layer that has depth 3 (Figure 1) carries specific information (Poulos et al. 2003), because this position gives a geometrical interpretation of the semantics of the selected text. In other words, the smallest convex polygon (layer) depicts a particular geometrical area in which the values of these semantics range. This structure has also been used for example in statistics to define a generalization of the concept of the median into two-dimensional data (O'Rourke et al. 1982). This feature may be characterized as unique to each selected text because the two following conditions (Poulos et al. 1999) are ensured:

the selected area layer is non-intersected with another layer
the particular depth of the smallest layer is variable in each case.

Thus, two variables were extracted from the proposed text categorization method: the area of the smallest onion layer and the depth (i) of this layer, which is a subset of the original text set S values. In more detail, the smallest onion layer is a matrix that contains the set of convex coordinate values of the smallest onion layer, in other words,

where x represents the frequency of each converted character and y the numerical value of each character.

Taking into account the specific features of the aforementioned variables it is easy to ascertain that these may be used in the next stage.

2.4 Decision stage of computational geometry method

In this stage subset intersects a new subset , which came from the processing of another set , on the Cartesian plane (Figure 2).

Figure 2. Decision stage between two different onion algorithm procedures

The correlation of investigated subsets , is decided based on the following experimental conditions:

If the above subsets are intersected, the degree of correlation is extracted by using the fraction of the common intersection and the area of the original convex set. In more detail, if we consider that and the corresponding areas of these convex polygons are , and , then the degree of the correlation of subset with regard to subset is extracted as follows:

(1)

In our case if , then the system decides that the two compared subsets and are not correlated (Figures 3 and 4).

Figure 3. Procedure of correlating the two smallest convex polygons, extracted using the onion layer method (see Figure 2)

In this way degree is the defining factor in categorizing two investigated texts.

Figure 4. Case of categorization in the same category. The green area is the common area of the compared text categorization documents

3 Experimental part

3.1 Data acquisition

In this study we tested the proposed method in three different semantic categories:

Five documents (A_1, A_2, A_3, A₄ and A₅) concerning Money and Marketing with the Reuters-21578 collection as a source.
Five documents (B₁, B₂, B₃, B₄ and B₅) concerning computational geometry with common keywords: Computational, Geometry, Algorithm, Onion Algorithm and Fingerprint Verification
Five documents (C_1, C_2, C_3, C₄ and C₅) concerning the AIDS virus from the Reuters-21578 collection

Each text had about 500 characters (with spaces), or about 100 words.

3.2 Implementation of the algorithm

The implementation of the algorithm takes place using the following steps.

In the preprocessing stage we selected a text from each category and submitted it for numerical conversion using the following algorithm implemented via Matlab function double.m.

For example, we took the following text from the third category (AIDS and Virus), which we called C₁:

West German chemical group Bayer AG <BAYG.F> said it had received claims for damages from haemophiliacs who allege they were infected with AIDS through a blood plasma produced by a Bayer subsidiary. Bayer's shares fell 13 marks to close at 292 in Frankfurt today. Dealers said a newspaper report about possible huge claims for damages triggered off the share selling in Bayer, other pharmaceuticals and insurance companies. Bayer said in a statement it was insured against claims and had received two.

This text was submitted to numerical conversion, and a vector of a set number was constructed.

The same procedure was followed for other selected texts from the three categories.

In the processing stage vector was put on the Cartesian plane and the onion algorithm was applied, thus isolating the characteristic smallest convex polygon (vector ). This algorithm was implemented via Matlab algorithms convex2.m.

This procedure was repeated for the other vectors until we finally isolated the 15 smallest convex polygons or vectors where n=1 ... 15

An example of this procedure is presented in Figures 5 and 6.

Figure 5. Implementation of the onion polygon of text C₁ using vector

Figure 6. Isolation of the smallest convex polygon of text C₁ or vectors ,

In the categorization stage vectors and of the smallest polygon were intersected with the other vectors. The algorithmic procedure was as follows.

We considered that the intersected coordinate vectors were and from the first text and and from the second text. Then the convex polygon, which was produced by chance intersection, had and coordinate vectors. This algorithm was implemented via the following Matlab function:

For example, we used the following second text from the third category (Aids and Virus), which we called C₂:

Bristol-Myers Co <BMY> said it would seek Food and Drug Administration permission to test its AIDS vaccine in humans by the end of March. A company spokesman said the company will file an investigational new drug application by the end of the month, requesting the FDA to allow testing of the vaccine in humans. Scientists at the company's Genetic Systems unit created an AIDS vaccine, which Bristol-Myers said produced antibodies to the AIDS virus in mice and monkeys. The vaccine consists of smallpox virus.

We also used the following text, which we called B₁, from the second category, Computational Geometry:

In this paper, fingerprint segmentation for secure Internet verification purposes is investigated. The novel application of computational geometry algorithms in the fingerprint segmentation stage showed that the extracted feature (characteristic polygon) may be used as a secure and accurate method for fingerprint-based verification over the Internet. On the other hand, the proposed method promisingly allows very small false acceptance and false rejection rates, as it is based on specific segmentation,

Figures 7 and 8 show the implementation of the intersection of semantic texts from the same categories, C₁ and C₂, where we obtained a large convex polygon C ₁₂ (green) (). In contrast, in Figures 9 and 10 show that the intersection of semantic texts C₁ and B₁ are zero (B₁∩C₁=0).

Figure 7. Intersection of two semantic texts, C₁ and C₂_, using the onion algorithm

Figure 8. Isolation of the intersection (green) of the two smallest convex polygons of semantic texts C₁ (blue) and C₂ (red)

Figure 9. Intersection of two different semantic texts, C₁ and B₁, using the onion algorithm

Figure 10. Zero intersection between the two smallest convex polygons of semantic texts C₁ (blue) and B₁ (red)

In the decision stage degree of correlation between subsets and was calculated as follows.

First, we calculated the area of each convex subset and followed by the area of their common intersection . Then degree was calculated as follows:

and

In Matlab this procedure was calculated as follows:

and

and was repeated for all text cases. The degrees are presented in Table 1.

4 Results

In this part the results of the implementation of the proposed algorithm for text categorization purposes are presented in Table 1. More specifically, we extracted the degrees of correlation using Equation 1. In total we obtained 15 x 15 = 225 degrees of correlation for the three semantic categories.

Table 1: degree of correlation for three semantic categories (A, B, C) using an onion algorithm

TEXTS	A1	A2	A3	A4	A5	B1	B2	B3	B4	B5	C1	C2	C3	C4	C5
A1	1	0.71	0.42	0.35	0.33	0	0	0	0	0	0	0	0	0	0
A2	0.53	1	0.67	0.22	0.18	0	0	0	0	0	0	0	0	0	0
A3	0.44	0.48	1	0.39	0.87	0	0	0	0	0	0	0	0	0	0
A4	0.67	0.56	0.33	1	0.26	0	0	0	0	0	0	0	0	0	0
A5	0.44	0.18	0.35	0.56	1	0	0	0	0	0	0	0	0	0	0
B1	0	0	0	0	0	1	0.67	0.43	0.38	0.35	0	0	0	0	0
B2	0	0	0	0	0	0.19	1	0.22	0.34	0.44	0	0	0	0	0
B3	0	0	0	0	0	0.54	0.87	1	0.56	0.57	0	0	0	0	0
B4	0	0	0	0	0	0.66	0.56	0.11	1	0.33	0	0	0	0	0
B5	0	0	0	0	0	0.23	0.43	0.31	0.27	1	0	0	0	0	0
C1	0	0	0	0	0	0	0	0	0	0	1	0.44	0.23	0.25	0.33
C2	0	0	0	0	0	0	0	0	0	0	0.81	1	0.12	0.17	0.19
C3	0	0	0	0	0	0	0	0	0	0	0.54	0.41	1	0.33	0.33
C4	0	0	0	0	0	0	0	0	0	0	0.30	0.29	0.46	1	0.27
C5	0	0	0	0	0	0	0	0	0	0	0.23	0.25	0.42	0.25	1

5 Statistical evaluation

In this section a statistical evaluation between the most used information retrieval method, latent semantic indexing (LSI), and the proposed method is attempted. In particular, we tested the five texts of the same semantic category C (AIDS-virus). To compare the two techniques we used rank correlation among several variables of the chi-square control. In the following paragraphs we develop the procedure followed in order to yield the normalized matrices for each method and, thereafter, apply them to the selected statistical method.

5.1 Concordance: rank correlation among several variables

The concept of correlation between two variables can be expanded to consider association among more than two, as shown for multiple correlations (Kendall 1962). Such association is readily measured non-parametrically by a statistic known as Coefficients of Concordance. In our case this correlation, W, gave the perfect criterion of agreement among five selected sets of data. In other words, this test investigates if the data of columns are in agreement. The correlation W was employed by the following equation:

(2)

where M is the number of variables being correlated and n is the number of data per variable, R_i is the sum of the values of each column for i=1 ... n.

We can ask whether a calculated sample, W, is significant, that is, whether it represents an association different from zero in the population that was sampled. A simple way to find out the relationship between the Kendall Coefficients of Concordance, W, and the Friedman chi-square involves using the following formula:

(3)

Therefore, we can convert the calculated W to its equivalent and then employ the critical values of .

5.2 Latent semantic indexing

In the first stage we created Table 2 to sort out all the conceptual contents of the documents. In particular, the local weight (L_ij) of each term is represented by each row in Table 2, and every document is represented by each column. Thus, an individual entry in Table 2, a_ij, represents the weights of the terms i in document j, where, in our case, i=1...88 and j=1...5. More specifically, the term document matrix can be expressed as a vector consisting of the weights of each term and mapped in a vector space (Tokunaga 1999):

(4)

where: L_ij is the local weighting. This approach defines the weights w_ij in each document. Let P_ij denote the occurrence probability of t_i (terms frequency) and d_j (documents). We ascribe significance to a term's occurrence, on the grounds that it represents a document's topics more than other factors do. Therefore, we base w _ij on the term occurrence probability P_ij in each document, and we define a local weighting L_ij as follows:

(5)

The LSI method uses an algebraic model of document retrieval, using a matrix technique known as singular value decomposition. Thus, using the SVD function of Matlab, we obtained three separate matrices from the original table A, as represented in Table 2, of L_ij elements. The first matrix is a term by concept matrix B. The second matrix is a concept by concept matrix (see Table 3). The third matrix is a concept by document matrix D. One important note is that multiplying B x C x D may not necessarily yield the original Table 2, since many of the concepts may be very small and not worth keeping. In our case, we used the normalized matrix B (Table 3) of dimensionality 5 x 5 in order to conceptually compare the LSI method with our proposed method.

Table 2: Mapping of the most frequent common terms (words) from the same five semantic (AIDS) texts, where: T/O is the term's occurrence and L_ij is the local weighting

Terms	Document C1		Document C2		Document C3		Document C4		Document C5
Terms	T/O	*L_ij*	T/O	*L_ij*	T/O	*L_ij*	T/O	*L_ij*	T/O	*L_ij*
West	1	0.0001	0	0.0000	0	0.0000	0	0.0000	1	0.0001
German	1	0.0001	0	0.0000	0	0.0000	1	0.0001	0	0.0000
chemical	1	0.0001	0	0.0000	0	0.0000	1	0.0001	3	0.0011
group	1	0.0001	0	0.0000	0	0.0000	0	0.0000	0	0.0000
Bayer	4	0.0020	0	0.0000	0	0.0000	1	0.0001	1	0.0001
said	3	0.0011	3	0.0011	1	0.0001	2	0.0005	1	0.0001
received	2	0.0005	0	0.0000	0	0.0000	2	0.0005	0	0.0000
claims	3	0.0011	0	0.0000	1	0.0001	1	0.0001	0	0.0000
damages	2	0.0005	0	0.0000	0	0.0000	1	0.0001	1	0.0001
hemophiliacs	1	0.0001	0	0.0000	1	0.0001	2	0.0005	0	0.0000
allege	1	0.0001	0	0.0000	0	0.0000	0	0.0000	1	0.0001
infected	1	0.0001	0	0.0000	1	0.0001	0	0.0000	0	0.0000
AIDS	1	0.0001	3	0.0011	2	0.0005	2	0.0005	2	0.0005
through	1	0.0001	0	0.0000	1	0.0001	0	0.0000	0	0.0000
blood	1	0.0001	0	0.0000	0	0.0000	1	0.0001	1	0.0001
plasma	1	0.0001	0	0.0000	1	0.0001	0	0.0000	1	0.0001
produced	1	0.0001	1	0.0001	0	0.0000	1	0.0001	1	0.0001
Bayer	3	0.0011	0	0.0000	0	0.0000	2	0.0005	0	0.0000
subsidiary	1	0.0001	0	0.0000	1	0.0001	1	0.0001	0	0.0000
shares	2	0.0005	0	0.0000	0	0.0000	0	0.0000	1	0.0001
fell	1	0.0001	0	0.0000	0	0.0000	0	0.0000	0	0.0000
marks	1	0.0001	0	0.0000	1	0.0001	1	0.0001	1	0.0001
close	1	0.0001	0	0.0000	0	0.0000	1	0.0001	0	0.0000
Frankfurt	1	0.0001	0	0.0000	0	0.0000	0	0.0000	1	0.0001
today	1	0.0001	0	0.0000	0	0.0000	1	0.0001	0	0.0000
Dealers	1	0.0001	0	0.0000	0	0.0000	0	0.0000	0	0.0000
newspaper	1	0.0001	0	0.0000	1	0.0001	0	0.0000	1	0.0001
report	1	0.0001	0	0.0000	1	0.0001	1	0.0001	0	0.0000
about	1	0.0001	0	0.0000	0	0.0000	0	0.0000	1	0.0001
possible	1	0.0001	0	0.0000	1	0.0001	1	0.0001	1	0.0001
huge	1	0.0001	0	0.0000	0	0.0000	1	0.0001	1	0.0001
triggered	1	0.0001	0	0.0000	0	0.0000	0	0.0000	0	0.0000
selling	1	0.0001	0	0.0000	1	0.0001	0	0.0000	0	0.0000
other	1	0.0001	0	0.0000	2	0.0005	3	0.0011	1	0.0001
harmaceutical	1	0.0001	0	0.0000	0	0.0000	0	0.0000	0	0.0000
insurance	1	0.0001	0	0.0000	1	0.0001	2	0.0005	1	0.0001
company	1	0.0001	3	0.0011	1	0.0001	2	0.0005	1	0.0001
statement	1	0.0001	0	0.0000	0	0.0000	0	0.0000	0	0.0000
insured	1	0.0001	0	0.0000	2	0.0005	1	0.0001	1	0.0001
against	1	0.0001	0	0.0000	0	0.0000	0	0.0000	0	0.0000
and	1	0.0001	2	0.0005	2	0.0005	2	0.0005	3	0.0011
two	1	0.0001	0	0.0000	0	0.0000	0	0.0000	1	0.0001
Bristol-Myers	0	0.0000	2	0.0000	1	0.0001	1	0.0001	1	0.0001
would	0	0.0000	1	0.0001	0	0.0000	1	0.0001	1	0.0001
seek	0	0.0000	1	0.0001	0	0.0000	0	0.0000	2	0.0005
Food	0	0.0000	1	0.0001	1	0.0001	1	0.0001	0	0.0000
Drug	0	0.0000	2	0.0005	0	0.0000	0	0.0000	1	0.0001
Administration	0	0.0000	1	0.0001	0	0.0000	0	0.0000	0	0.0000
permission	0	0.0000	1	0.0001	1	0.0001	0	0.0000	0	0.0000
test	0	0.0000	2	0.0005	0	0.0000	1	0.0001	2	0.0005
vaccine	0	0.0000	4	0.0020	1	0.0001	1	0.0001	0	0.0000
humans	0	0.0000	2	0.0005	1	0.0001	1	0.0001	1	0.0001
end	0	0.0000	2	0.0005	1	0.0001	1	0.0001	1	0.0001
March.	0	0.0000	1	0.0001	0	0.0000	0	0.0000	1	0.0001
spokesman	0	0.0000	1	0.0001	1	0.0001	2	0.0005	3	0.0011
file	0	0.0000	1	0.0001	0	0.0000	0	0.0000	0	0.0000
investigational	0	0.0000	1	0.0001	1	0.0001	1	0.0001	1	0.0001
new	0	0.0000	1	0.0001	0	0.0000	0	0.0000	1	0.0001
application	0	0.0000	1	0.0001	1	0.0001	0	0.0000	1	0.0001
month	0	0.0000	1	0.0001	0	0.0000	1	0.0001	0	0.0000
requesting	0	0.0000	1	0.0001	2	0.0005	0	0.0000	0	0.0000
FDA	0	0.0000	1	0.0001	0	0.0000	1	0.0001	1	0.0001
allow	0	0.0000	1	0.0001	1	0.0001	0	0.0000	1	0.0001
Scientists	0	0.0000	1	0.0001	0	0.0000	0	0.0000	0	0.0000
Genetic	0	0.0000	1	0.0001	2	0.0005	1	0.0001	1	0.0001
Systems	0	0.0000	1	0.0001	0	0.0000	0	0.0000	0	0.0000
unit	0	0.0000	1	0.0001	0	0.0000	0	0.0000	0	0.0000
created	0	0.0000	1	0.0001	1	0.0001	0	0.0000	1	0.0001
which	0	0.0000	1	0.0001	0	0.0000	0	0.0000	0	0.0000
antibodies	0	0.0000	1	0.0001	0	0.0000	0	0.0000	0	0.0000
virus	0	0.0000	2	0.0005	1	0.0001	1	0.0001	0	0.0000
monkey	0	0.0000	1	0.0001	0	0.0000	0	0.0000	0	0.0000
mice	0	0.0000	1	0.0001	1	0.0001	1	0.0000	0	0.0000
smallpox	0	0.0000	1	0.0001	1	0.0001	0	0.0000	0	0.0000
consists	0	0.0000	1	0.0001	1	0.0001	0	0.0000	1	0.0001
hospital	0	0.0000	0	0.0000	1	0.0001	1	0.0001	1	0.0001
general	0	0.0000	0	0.0000	1	0.0001	1	0.0001	0	0.0000
thus	0	0.0000	0	0.0000	0	0.0000	0	0.0000	0	0.0000
letter	0	0.0000	0	0.0000	1	0.0001	0	0.0000	1	0.0001
Growth	0	0.0000	0	0.0000	2	0.0005	1	0.0001	0	0.0000
aggreed	0	0.0000	0	0.0000	0	0.0000	1	0.0001	0	0.0000
lecture	0	0.0000	0	0.0000	1	0.0001	0	0.0000	1	0.0001
report	0	0.0000	0	0.0000	1	0.0001	0	0.0000	1	0.0001
patient	0	0.0000	0	0.0000	2	0.0005	1	0.0001	1	0.0001
library	0	0.0000	0	0.0000	1	0.0001	0	0.0000	0	0.0000
food	0	0.0000	0	0.0000	1	0.0001	0	0.0000	0	0.0000
care	0	0.0000	0	0.0000	1	0.0001	0	0.0000	0	0.0000
staff	0	0.0000	0	0.0000	1	0.0001	0	0.0000	0	0.0000

Table 3: Concept by concept, decompose matrix using SVD method

	0.4489	0.8667	0.8840	0.8430	0.9810
	0.5386	0.7050	0.6724	0.6230	0.7500
	0.4733	0.7560	0.8038	0.4137	0.7860
	0.5693	0.4980	0.5210	0.6750	0.9238
	0.5220	0.8500	0.6610	1.0000	0.7870
Rank Score	2.5521	3.6757	3.5422	3.5547	4.2278

In the next stage the data of Table 3 can be tested using the top-down correlation technique (see Equation 2). The null hypothesis (H₀) of this test is that there is an agreement regarding the investigated data of the columns of Table 3. In our case, according to Equation (2), M=5 was the number of investigated documents, n=5 were the conceptual values after reduction, and R was the sum of the rank score (see Table 3). Then, W=0.0059 and thus (see Equation 3). Comparing it to the chi-square distribution with a degrees of freedom √≠=5-1= 4 and , we had non-rejection of the null hypothesis (H₀) of the agreement regarding the conceptuality of the compared documents (0.99 < P < 0.95).

5.3 Onion layers method

In this stage we obtained the five sets of data for each document (Table 4) using the method described in section 3.2. It should be noted that each set contains the points of the smallest convex polygon plus a number of points ranging between 0-2 (see Figure 1).

Table 4: Pair points of the last (smallest polygon) of the five onion polygons of documents C1, C2, C3, C4, C5

Document C1		Document C2		Document C3		Document C4		Document C5
Axis X	Axis Y	Axis X	Axis Y	Axis X	Axis Y	Axis X	Axis Y	Axis X	Axis Y
248	97	255	101	259	99	251	99	232	98
246	70	242	101	249	100	242	99	230	97
265	97	235	100	242	97	240	97	220	83
248	97	249	44	265	32	237	83	243	97
259	97	267	68	266	98	262	77	232	98
		255	101	259	99	269	97	240	97
		252	101	254	97	251	99
		266	70	260	97

The next stage was to transform the data of Table 4 in order for each pair of data to be represented by a value. Thus, for the whole set we calculated the magnitude of each pair point (x_ij,y_ij) from the origin 0 of the coordinate system using the following formula:

(6)

where: i=1, 5 is the number of elements of each smallest convex layer and j=1, 5 is the number of documents.

We then obtained the first five values (Table 5) for each document in order to compare these statistically with those of the of LSI method (Table 3) and normalized the data in Table 6.

Table 5: Magnitude of all the pair points of documents C1, C2, C3, C4, C5

Document C1	Document C2	Document C3	Document C4	Document C5
266.2949	255.7655	282.1950	266.2949	276.5683
274.2736	262.2308	255.3919	252.8577	275.5231
277.2760	268.3300	260.7163	266.9251	283.4784
269.8185	261.4670	258.8610	251.1135	273.0806
251.8492	249.6177	235.1361	261.6448	251.8492

Table 6: Normalization on the maximum value (283.48) of Table 5

	Document C1	Document C2	Document C3	Document C4	Document C5
	0.9394	0.9022	0.9955	0.9394	0.9756
	0.9675	0.9250	0.9009	0.8920	0.9719
	0.9781	0.9466	0.9197	0.9416	1.0000
	0.9518	0.9224	0.9132	0.8858	0.9633
	0.8884	0.8806	0.8295	0.9230	0.8884
Rank Score	4.7252	4.5768	4.5588	4.5818	4.7992

In this stage the data of Table 6 can be tested using the top-down correlation technique (Equation 2). The null hypothesis (H ₀ ) of this test is that there is an agreement regarding the investigated data of the columns of Table 6. In our case and according to Equation (2), M=5 was the number of investigated documents, n=5 were the conceptual values after reduction, and R was the sum of the rank score (see Table 6). Then W=0.0003 and thus (see Equation 3). On comparing it to the chi-square distribution with degree of freedom √≠=5-1=4 and , we had non-rejection of the null hypothesis of the agreement regarding the conceptuality of the compared documents (0.99 < P < 0.95).

6 Analysis and discussion

As can be seen from the results in Table 1, the proposed method correctly categorized all tested texts.

The statistical evaluation method based on rank correlation (section 5.1) of our proposed method and the LSI method (section 5.2) showed that they have the same level significance (0.99 < P < 0.95), although our proposed method may be considered to be superior to the LSI method because the reduction technique is independent of the number of documents and the number of terms involved. In more detail, the method promises a great reduction in the original information. For example, a selected text of 500 characters is now represented by a convex polygon (categorization stage) of two vectors and that have 3-9 elements each: a total reduction in information of about or 96.4%

Furthermore, for smaller texts (100 words in length) we avoid the training procedure and, in this way, the complexity of the proposed method is reduced dramatically in comparison with traditional methods.

For training on longer texts this method may be applied as follows:

The text is divided into smaller subsets of about 100 words in length.
The procedure of the preprocessing stage (section 3.2) is then applied to each subset in turn.

7 Conclusions and future work

7.1 Conclusion

This paper has explained a text categorization technique that combines numeric conversion of a symbolic expression and a computational geometric algorithm. The proposed technique is suitable for the specific partitioning of digital library collections, and semantic categorization of emails and abstracts. This method may be characterized as a supervised method because it is based on the claim that an unknown text categorization vector (smallest convex polygon) is compared with a particular text categorization vector the class of which is already known. Thus, a practical way to see if this method is able to classify an unknown document vector correctly is to test it with a particular number of different predetermined semantic document vectors and let the system decide which of these vectors is geometrically nearest the unknown vector according to the method described in section 3.2.

An example of this is the experiment in which five test categories (of 5 x 100 word documents per category) were used to test this technique. Statistical information was collected from the experiment and used to corroborate our findings:

Statistical evaluation using rank correlation gave a significant level of accuracy (0.99 < P < 0.95) and showed that the results of the five tested documents are in agreement.
The proposed method is based on a different philosophy to those methods that depend on the removal of features based on overall frequency counts (Salton and McGill 1983, Jones and Paynter 2002). In contrast to those methods, the proposed method is based on total word participation in the feature extraction. The results showed that this feature is efficient and accurate for semantic text categorization.
The onion algorithm may be characterized as a more efficient algorithm for the lowest complexity, O(d*n log n), than the text categorization methods that are used currently, for example the Bayesian classifier with O(n³k) complexity.

In conclusion, use of the onion algorithm for text categorization purposes may be considered to be a solution to the problems of managing information storage and retrieval in large digital libraries.

7.2 Future work

A way to improve our proposed method in future would be to develop numerical conversion of a symbolic expression based on a different encoding method. In this way, we propose to extend the encoding to a larger 8-bit conversion. The aim of this proposition is for each character of each word to be described in a more specific way. Phonetic conversion in place of ASCII conversion may provide an even more accurate solution in our proposed method.

The next aim of this study is to increase the text length in each processed document to more than 100 words.

Furthermore, this system has the ability to investigate the relation of the position of one convex polygon to that of another in the Cartesian plane. Thus, in this study we considered the degree a of the intersection of convex polygons as a measure of comparison. However, in future the centre of gravity of a convex polygon may be used as a measure of comparison and treated as a vector (O'Rourke 1995).

In our latest work modified the proposed method to construct an anti-spam filter.

Finally, this method may be considered to be a preliminary study and the next aim is to apply it to a real test collection. However, the statistical conclusion and the results of the 15 tested documents show that we are moving in the right direction.

A Text Categorization Technique based on a Numerical Conversion of a Symbolic Expression and an Onion Layers Algorithm