Publication facilities have been developed in the recent decades with high
quality attributes. This can be well recognized by looking at the books, newspapers,
magazines and other means of information repositories. Earlier trends had focused
on the information itself rather than the demonstration capabilities and step-by-step
great amount of attention have been oriented towards the printing facilities
of coloring, font design and many other items (Sharma et
Shape analysis and characterization of literature has been a field of intense
activity in computer vision, image processing and pattern recognition (Theodoridis
and Koutroumbas, 2003). Shape of an object represents an important part
of the Information which can be based on a description of the boundaries or
the whole region of the object and implementing OCR techniques. It can be classified
as structural, in which the object is broken into parts and properties of those
parts are encoded. In addition, shape representations can be monolithic or hierarchical.
Monolithic encode only properties at a single scale, while hierarchical consist
of describing the shape at multiple discrete scales. Different methods for shape
similarity and recognition have been developed. For example, Sinha
and Giardina (1990) used morphological function to decompose a given shape
into primitive parts. The properties and relationships among these primitives
were used to describe different objects. Another technique, the potential-based
approach developed by Chuang (1996), identifies the
best match from a selected group of shape templates by measuring the repulsive
force and torque when the template and sample are put to interact through a
potential field. Kupeev and Wolfson (1996) used a weighted
graph to represents the structure and the quantitative elements of a contour
for estimation of shape similarity. Tsang and Tsang (2006)
presented a neighborhood vector representation as shape parameter for binary
images. This method is based on the pixel neighborhood relation. The results
show that the shape parameter can be used as a general method of feature extraction
for problems in image processing and pattern recognition.
Variety of classification techniques uses page segmentation for pattern recognition.
Many of the existing page segmentation algorithms have region classification
modules embedded in the systems. Various ways to segment and classify regions
have been proposed ranging from purely top-down approaches that recursively
split a page into smaller components, to purely bottom-up approaches that attempt
to cluster individually connected components into larger and larger entities.
Top-down approaches usually segment the page into homogenous, extracting features
and then classify them into appropriate categories. A number of different features
have been proposed for region classification. Some of these include the area
of the connected component of a block, number of black pixels in the region,
the mean horizontal black run length, component width height ratio, component
density, mean length of black intervals to the mean length of white intervals,
number of black intervals over a certain length, feature-based interaction map
(Chetverikov et al., 1996), texture discrimination
masks (Pappas et al., 2001), periodicity measures,
the measures of visual attention, i.e., legibility, complexity, attractiveness,
etc (Ryu et al., 2000). Bottom-up approaches
usually classify each image pixel first and then group the pixels into regions,
such as multi-scale texture segmentation (Wu et al.,
1999), texture discrimination masks (Jain and Zhong,
1995), mask based local textural characteristics extraction (Williams
and Alder, 1996). Some approaches are based on geometric relations that
group the pixels according to various patterns and then use simple statistical
features to classify them (Mitchell and Yan, 2001).
Moreover, Non-text regions were examined using neural network based region
identification algorithm as key component of a document recognition system implementing
segmentation. They were classified into text, graphic, photo and other region
types (Andersen and Zhang, 2003), Low quality images
of newspaper documents obtained from microfilmed archives and the results compare
favorably with other results reported elsewhere.
The invading era of computer systems onto this area of application had played
great role in the occurring achievements. With computer implementation, two
media emerged in information handling; either a print out on paper sheet or
a file saved on storage medium (commonly termed as hard or soft copy modes).
Computers with their broad analogous systems of digital instruments are nowadays
regarded as the main data developing tools. In accordance to the nature of these
systems, information generation can be utilized in two different processes;
writing (i.e., typing) and printing (which in turn distinguishes between the
soft and hard copy modes). Along with these digital systems, wide range of word
processors have been developed by Besser (1996). The
ever growing memory sizes support heavily these word processors to demonstrate
the documents with net and plain character generators. The outcome of the devoted
care on these missions exceeds the informative boundaries and stimulates the
artistic attentions in writing to appear as drawing and painting resultants
(DeSoi et al., 1990). Any information repository
these days, despite of its sort whether a magazine newspaper, book, or the like
can be simply specified by the artistic appearance of the cover page or the
text features themselves. Correspondingly, for the sake of standardization,
calls for papers or articles by various institutes or publishers insist on following
up publisher regulations or template in order to be considered for acceptance.
Hence, these regulations could indirectly be considered as identifiers for the
publisher or document generation source.
Based on the last deduction, the present study has adopted an automatic detection system designed to extract the writing specifications of any printed document. The increasing number of information producers with their distinct specifications of writing made the identification activity to invoke a data base utilization in order to integrate the process of identification. By this database the extracted specification can be associated with that of the producer of the information and supports a decision-making activity.
MATERIALS AND METHODS
Broadly speaking, document format is characterized by paragraph layout and
font attributes design. The former involves line spacing, number of text columns
and gap dimension properties, while the latter denotes font type, size and style
selections (Wieser and Pinz, 1993). Based on these two
main categories of document description, the proposed detection system is designed
to involve two successive procedures followed by an identification process.
Therefore, implementation of the proposed system consists of three phases; paragraph
layout detection phase, font attribute detection phase and identification phase
as shown diagrammatically in Fig. 1 and briefly outlined below.
The first two phases determine the attributes which are handed over next process
in order to identify the most probable source of the document.
|| The proposed identification system
Document example 1, (a) input document, (b) row wise pixel
density, (c) row wise line spacing frequency distribution, (d) column wise
pixel density and (e) column wise line spacing frequency distribution
This process is carried out, utilizing a neural network association scheme. It may be added that the design and implementation of the proposed scheme was carried out for about ten months starting in April 2008 at Isra Private University, Jordan.
Phase one: Paragraph layout detection: Principally, the current procedure
depends thoroughly on pixel density that is acquired in two modes of row wise
and column wise calculations. These density distributions are traced so as to
gather the underlined spacing characteristics with the aid of the computed frequency
distribution of line spacing (Kaiman, 1968). Examples
showing the essential data of pixel density acquisition and computed line spacing
frequency distribution for different text contents along the two modes are shown
in Fig. 2a-e and 3a-e.
However, paragraph setup characteristics are determined according to the following
||From row frequency distribution, line spacing measure tagged
to maximum occurrences represents document line spacing attribute 1
||Header and bottom spaces denote attribute 2 and attribute
3. These attributes are computed by tracing and counting the empty rows
initiated with the first (upper) row and with the last (lower) row, respectively
||With the exception of considering the header and bottom spaces,
in-between paragraph measure is that whose related accumulated number should
satisfy Eq. 1
In-between paragraph measure = Amount
This measure defines attribute 4, which is determined throughout two stages.
In the first, paragraph number is computed using the adjacent non empty lines
(separated by attribute 1) counting that is carried out using pixel density
Document example 2, (a) Input document, (b) Row wise pixel
density, (c) Row wise line spacing frequency distribution, (d) Column wise
pixel density, (e) Column wise line spacing frequency distribution
Whereas, in the second stage, any other number of line spacing other than
attribute 1 and satisfies Eq. 1 will be considered as the
target attribute (attribute 4).
In particular circumstances, a verification process resumes execution in order to justify the results gained by this stage. This process is stimulated when in-between paragraph space does not differ from the standard line spacing, attribute 1. Here, a tracing mechanism is started to analyze pixel density diagram extracting attribute 4.
||From column pixel density characteristics both left and right
margins are calculated so as to define attribute 5 and 6, respectively.
Obviously the algorithm used is similar to that used for header and bottom
spaces with the exception of using the column mode density instead of the
row mode acquisition
||Number of written column represents attribute 7. The procedure
used here is also similar to the one used in Paragraph Amount evaluation
but using the column mode instead
||The column in-between measure, considered as attribute 8,
is computed in similar way as that used in addressing the process of evaluating
the In-between line spacing of attribute 4, but with difference of using
the column mode instead of the row mode of data investigation
||Finally, written column width is determined from column pixel
density characteristics and represents the last paragraph layout, attribute
9. It might be noted that the column width is considered by ignoring the
left and right margins
Phase two: Font attributes detection: Due to the great modalities of
font and its associated characteristics, the function of the second procedure
is divided in two sub steps of processing. The first deals with general detection
of the font and can be termed as character recognition stage, whereas the second
step investigates the concise descriptions of the underlined character. The
procedure in whole adopts a small set of alphabetical character patterns to
sketch the overall activities. This set involves the first characters from A
to F. The investigations fixes the capital mode of these characters in order
to minimize the variations as much as possible and to compress the required
database needed for the decision making phase. Moreover, traditional steps of
commonly used recognition system are used to design the different activities
of the first step of the current procedure (Oh et al.,
2001). These activities denote line segmentation and character segmentation
as first trends. Freeman chain technique is adopted for this purpose (Tsang
et al., 1999). The most significant feature of base line position
is contributed in this stage. The obtained chain hence is subjected to feature
extraction operations oriented towards the following features:
||Horizontal propagation of the character
||Vertical propagation of the character
||The relative propagation with the base line
||Whole existence and the count of the existing ones
Based on the above features, the first stage candidates is the character with its global definition. When the gathered features do not convince any available character subset, another character is segmented and processed until a match status is achieved.
The match will in turn commit the character represented by its chain to the second stage of this procedure.
The second stage is designed to investigate the frequency analysis and the statistical
measures of the provided character from previous stage. The statistical analysis
and statistical classification adopted in the implementation of this stage follows
methods outline by Michie et al. (1994) and Härdle
and Simar (2007). All the characters mentioned earlier are traced along their
chains to build the classification repository. This repository is constructed
with different tables, involving the eight Freeman chain frequencies. Figure
4a-c show the basic principle of Freeman chain construction
and how they are applied to represent the character A as an example. Then Table
1 shows such characteristics of font description summary constructed for character
A for a specific font type, style and range of font sizes. For the example of
character A written in Times New Roman font with different sizes, the absolute
direction frequency count summary and the normalized frequency count with reference
to direction 7 is shown in Table 1.
|| Freeman chain extraction for letter A as an example
|| Direction frequency count for character A in Times New Roman
font with different sizes
|| Frequency count of direction in Times New Roman font for
character A with different font sizes
Finally, the direction frequency counts normalized to direction 7 is plotted as a function of the font size for letter A, written in Times New Roman type in Fig. 5. Information of any character provided from the first stage will be compared with the available behaviors of the repository to decide three parameters of its main features of font, i.e., size, font style and font type, taken into consideration that the bold attribute is ignored.
Phase three: The identification process: So far details of the foregoing
two procedures are summarized by 12 different features (attributes). Based on
this amount of features, a neural network design is used as an identification
tool (Crawford-Hines and Anderson, 1997).
The neural network structure: The overall processing of the proposed
identification system is summarized by the final stage of neural network utilization
structure. The main function of the subjected structure relies on the association
capability of the neural network and its generalization power (Feng
et al., 2006; Inohira et al., 2008;
Rasheed and Ali, 2009). It is worth mentioning that
classifiers other than neural networks for pattern analysis and identification
add extra programming burdens on the system, while neural network classifiers
reduce data dimensionality to a great extent. This is due to their reliance
on features instead of abstract data configuration. Feed forward network can
successfully achieve this task (Hagan et al., 2002).
A net with twelve input layers, two hidden layers of five neurons each and five
neurons output layer is used, as shown in Fig. 6.
All neurons of the hidden and output layers get input signal from bias neurons.
Obviously, the number of the neurons at the output layer specifies the number
of the selected publishers identifying the features at the input layer of the
scheme. Hence, 32 binary codes denote the whole domain of the identification
process under consideration. The five outputs of the network are chosen based
on the selected experiment sources of the documents.
On the other hand, input measurements for the features collected have been
normalized to be used as proper data for the training purposes. Then a training
table of twelve input values and five output values sums up all the needed data
of the training. Genetic algorithm search technique has been adapted for the
evaluation of the weights for neural network. Based on the collected data, training
has been smoothly tolerated the weights to a satisfactory error level. This
scheme on the testing phase has showed suitable outcome of identification that
is found easy to be recovered with its associated error value.
Different experiments were conducted with samples of publishers documents for network training purposes. For each experiment, samples of provenances were taken depending on the allowable coding space that covers the selected provenances. Five samples for each publisher were chosen as data for each publisher in order to include as most expected variations as possible in document contents. Therefore, 160 overall samples are involved as training patterns in this prototype design. Genetic algorithm search technique was adopted for the training of the neural network of the identification process.
Although, diagrams, photos, footers and headers are important features of any published document, this research was limited to the most important characterizing properties of publisher identification which is writing and page layout. Therefore, at this stage, diagrams, photos, footers and headers have not been considered. As the designed neural network in the identification process phase have five output, i.e., allowing for 32 different outputs, different samples of only 29 publishers were selected to be experimented with while the remaining three possible outputs (29, 30 and 31) were reserved for unrecognized printing materials or unknown publisher. In order to have a better representation of any publisher printout layout, four pages of published material are selected as full training sample spans for each publisher. This strategy is followed as a measure to avoid unexpected occurrence of non-standard contents, such as the inclusion of introductory paragraph that disturbs the regular print layout.
The main methodology in computing the parameters is based on analysis of the
four pages samples in order to extract the outcome of Fig. 2b.
Then Fig. 2b in turn is transformed into Fig.
2c, which demonstrates the bases of measurement. These measurements produce
the first four parameters (attributes 1 to 4) of the paragraph layout detection
phase. Along the same phase, Fig. 2d is transformed into Fig.
2e, determining the next parameters (attributes 5 to 9). Furthermore, Fig.
3 shows results for different publisher printout sample having two columns
text instead of one.
It is worth mentioning that within the second phase, computation passes through
two stages, namely character recognition and font feature detection processes.
Traditional methodologies of recognition are conducted to identify any selected
character (Oh et al., 2001). This selection depends
on whether its related Dbase is already available or not. In case it is not
available, the algorithm resumes processing to select another character of the
text. Once a character is defined, the second stage determines the remaining
three parameters (i.e., font size, style and type). This process is achieved
with the aid of the familiar Freemann description model (Tsang
et al., 1999).
It is known that fonts can be supplemented to any word processor as additional library. A fast detection of the availability for the defined character can be detected by applying the normalized frequency count of Table 1-b to its content. Once a character is found, concise description can be further extracted through matching its frequency count with that of Table 1. For simplicity and less storage requirement, this prototype study is based on five characters only, therefore five tables of the form shown in Table 1 are constructed and embedded in the system.
For testing purposes, different publishers printout documents were fed to the system. The correct detection of publishers identification achievement was about 95%. Failure of identification can be attributed to the illed processing in phase two. Incorrect recognition of noising text stands for the unsuccessful identifications. Poor quality of printed documents that do not supply adequate data statistics may also result in similar consequences.
Huge differences in publishing styles and artistic products nowadays are attributed to the wide variety of word processing tools and printing hardware provided by computer technology advancement. For various reasons, such varieties necessitate the attempts to recognized and distinguish publishing resources and their classifications.
Although, not many published study is available on the approach of publisher
identification, some relevant techniques might be quoted. For example, Chetverikov
et al. (1996) uses texture features reported text identification
accuracy of 97-98% and non-text identification accuracy of 84-89% while achieving
a total accuracy of 96%. Ryu et al. (2000) used
multi-scale analysis and top-down approach for segmenting and a periodicity
for classification reported an accuracy of 97.1% for image identification. Fast
scan method and classification for pattern extraction of document by Mitchell
and Yan (2001) resulted into a total accuracy of 98%. Pappas
et al. (2001) uses a simple mask that makes use of the different
correlation properties to classify the region reported a total accuracy of 98.3%.
For identification of newspaper documents using neural networks, (Andersen
and Zhang, 2003) reported an identification accuracies in the range from
87.5-98% depending on text and non-text contents in the tested materials. Neighborhood
vector representation as shape parameter for binary images adopted by Tsang
and Tsang (2006) used neural network classifier claimed to be satisfactory,
but it was for hand written characters.
Therefore, the results reported in this article, which produced an overall
accuracy of about 95% can be seen as approximately equal to most of the above
mentioned results, slightly worse than those reported in (Ryu
et al., 2000) but better than some results of (Chetverikov
et al., 1996). The identification achieved in this paper would be
considered encouraging due to the fact that further efforts to include diagrams,
photos, footers and headers as extra attributes at detection and identification
stages would certainly enhance the recognition results.
Furthermore, the study reported in this research involves wide range of computation techniques that includes image processing, segmentation, statistics and neural networks capabilities to be utilized for identification. Within the authors framework, many tasks relied primitively on two objectives, i.e., scanning feature detection and character recognition. Such technique with its analysis procedure can be regarded as a key to investigate the era of style mining of websites. These websites despite their varieties and modalities are classified corresponding to limited publishers. The main criteria appearing to describe the publisher in this application denote the coloring and animation effects.
Having achieved about 95% correct recognition of publisher printouts identification, the proposed prototype scheme appears to be promising and could be improved with the addition of more printing features and/or network elaboration. This system uses only twelve features of the published material and yet may encourage the implementation of the detection system to be adopted with automatic schemes that could be used effectively for document identification in different investigation departments.
Failure of the publisher detection is captured and can be attributed to lack of enough information or some fuzziness in the fed documents the system. This matter is studied and is encompassed with the generalization capability of the used neural network. Furthermore, other document features such as footers, headers, photos, diagrams and colors may be added to obtain more attributes in order to improve system performance.