
Information Technology Journal

Year: 2012 | Volume: 11 | Issue: 4 | Page No.: 408-413
DOI: 10.3923/itj.2012.408.413
Web Information Extraction Based on Visual Characteristics
Ruyue Tan

Abstract: With the explosive development of Internet technology, the web is becoming the world's largest database of information, and the effective management and utilization of web information is currently a hot issue. This study mainly discusses the extraction of web information. Traditional web information extraction is mainly based on DOM tree and HTML tag analysis. Based on VIPS, the study proposes a visual block positioning algorithm for webpage information extraction by generalizing webpage visual features and visual block feature information. Theme-based webpages and BBS webpages are input to VIPS, the VIPS output and the VBT are analyzed, and visual characteristics such as text density and link-text density are then defined. The study puts forward a visual block positioning algorithm, VBPA, which positions the theme information block at one VBT node and then extracts the theme information. Experimental results show that the visual block positioning algorithm based on visual features is superior to the traditional web information extraction algorithm and yields higher-quality information extraction.


How to cite this article
Ruyue Tan, 2012. Web Information Extraction Based on Visual Characteristics. Information Technology Journal, 11: 408-413.

Keywords: visual block positioning, VBPA, VIPS, BBS information extraction, subject extraction

INTRODUCTION

Rich in content presentation and interaction, the web conveys information visually (Cai et al., 2003a). Web visitors subconsciously partition a page layout by its visual characteristics in order to search more efficiently. Hence, layout information and visual features are significant for visual recognition. By analyzing visual information and layout features, the study partitions pages into blocks with VIPS (Cai et al., 2003b), positions specific blocks effectively and then extracts the valuable information.

RELATED WORKS

Information extraction methods can be classified into 2 classes: one is wrapper induction systems with delimiter-based rules; the other includes methods with syntactic/semantic constraints. There are also different taxonomies for extraction toolkits (Chang et al., 2006). For the first, Sarawagi (2002) classified wrappers into 3 kinds by task: record-level, page-level and site-level wrappers. For the second, Laender et al. (2002) categorized tools by wrapper-generating method: languages for wrapper development (e.g., Minerva (Crescenzi and Mecca, 1998)); HTML-aware tools (e.g., W4F (Sahuguet and Azavant, 2001)); NLP-based tools (e.g., WHISK); wrapper induction tools (e.g., WIEN); modeling-based tools (e.g., NoDoSE (Sahuguet and Azavant, 2001)). Vision-based extraction is still developing. Cai et al. (2003b) propose a top-down method that is independent of the tag tree, but it does not locate information automatically; the method of Liu and Meng (2006) extracts automatically but is too complex. By analyzing the visual features of different page areas and their general characteristics, VBPA is proposed.

VIPS (VISION-BASED PAGE SEGMENTATION)

VIPS, proposed by Microsoft Research Asia, is an iterative top-down process. Its description is not repeated here; see Cai et al. (2003b). VIPS parses an input page into a VBT (Liu and Meng, 2006), as shown in Fig. 1.

Fig. 1: An example webpage and its corresponding VBT

A VBT has three characteristics: each node corresponds to a visual block; each node corresponds to a visual area of the page; and parent and child nodes stand in a containment relationship in the corresponding page areas. Cai et al. (2003b) provide a segmentation algorithm but do not deal with extraction. Based on the VIPS of Cai et al. (2003b), the study proposes VBPA.
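The containment property of the VBT can be made concrete with a small data structure. The following is a minimal sketch, not the VIPS output format: the class `VBTNode`, its fields and the helper `is_well_formed` are hypothetical names, and the coordinates are made-up page pixels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VBTNode:
    """One visual block of the page (hypothetical representation)."""
    name: str
    left: float
    top: float
    width: float
    height: float
    children: List["VBTNode"] = field(default_factory=list)

    def contains(self, other: "VBTNode") -> bool:
        """True if `other` lies entirely inside this block's page area."""
        return (self.left <= other.left
                and self.top <= other.top
                and other.left + other.width <= self.left + self.width
                and other.top + other.height <= self.top + self.height)


def is_well_formed(node: VBTNode) -> bool:
    """Check the parent-child containment property over the whole tree."""
    return all(node.contains(c) and is_well_formed(c) for c in node.children)


# Illustrative page: a header block B1 and a body block B2 with two children.
page = VBTNode("root", 0, 0, 1000, 2000, [
    VBTNode("B1", 0, 0, 1000, 100),
    VBTNode("B2", 100, 120, 800, 1500, [
        VBTNode("B2_1", 100, 120, 800, 80),
        VBTNode("B2_2", 100, 220, 800, 1400),
    ]),
])
print(is_well_formed(page))
```

Storing the geometry on every node keeps the later feature computations local to a block and its parent.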

VBPA AND WEB INFORMATION EXTRACTION

With visual blocks obtained by VIPS, information positioning and extraction can be done by VBPA.

The feature values of visual block B: For each visual block B in the VBT, record its location, size, image information and text features. Take the upper left corner vertex of a block as the origin, so that the lower right corner vertex has coordinates (Width, Height), where Width and Height are the width and height of the block; the center of each block is at (CenterX, CenterY). VIPS provides the distance from each B to the top margin of the current page, B_top, and to the left margin, B_left, from which the positions of the horizontal and vertical central axes follow: L_land = B_top + 0.5 Height and L_protrait = B_left + 0.5 Width. The definitions for each B are:

Definition 1: The distance α between the horizontal central axis of B and that of its parent block is the absolute difference between the position L_land of the horizontal central axis of B and the position L_fland of the horizontal central axis of the parent block:

α = |L_land − L_fland|    (1)

Definition 2: The distance β between the vertical central axis of B and that of its parent block is the absolute difference between the position L_protrait of the vertical central axis of B and the position L_fprotrait of the vertical central axis of the parent block:

β = |L_protrait − L_fprotrait|    (2)

Definition 3: The ratio of the area S_B of the semantic block B to the webpage area S_page is γ:

γ = S_B / S_page

Definition 4: The pure text density λ_text is the ratio of the pure text length in B to the area S_B of B:

λ_text = L_textlength / S_B    (3)

L_textlength is the length of the pure text in the visual block B.

Definition 5: The link-text density λ_link of B is the ratio of the link-text length in B to the area S_B of B:

λ_link = L_linklength / S_B    (4)

L_linklength is the length of the link-text in the visual block B.
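Definitions 1 to 5 can be computed directly from the block geometry and text lengths. The sketch below is illustrative only: the class `Block`, the function `block_features` and the sample numbers are assumptions, and returning 0 for α and β when a block has no parent (and for the densities when the area is zero) is a choice made here, not something the definitions specify.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Block:
    top: float            # B_top: distance to the top margin of the page
    left: float           # B_left: distance to the left margin of the page
    width: float
    height: float
    text_length: int = 0  # L_textlength
    link_length: int = 0  # L_linklength
    parent: Optional["Block"] = None

    @property
    def l_land(self) -> float:
        """Position of the horizontal central axis."""
        return self.top + 0.5 * self.height

    @property
    def l_protrait(self) -> float:
        """Position of the vertical central axis (spelling follows the text)."""
        return self.left + 0.5 * self.width

    @property
    def area(self) -> float:
        return self.width * self.height


def block_features(b: Block, page_area: float) -> Dict[str, float]:
    """Return alpha, beta, gamma, lambda_text and lambda_link for block b."""
    p = b.parent
    area = b.area
    return {
        "alpha": abs(b.l_land - p.l_land) if p else 0.0,          # Definition 1
        "beta": abs(b.l_protrait - p.l_protrait) if p else 0.0,   # Definition 2
        "gamma": area / page_area,                                # Definition 3
        "lambda_text": b.text_length / area if area else 0.0,     # Definition 4
        "lambda_link": b.link_length / area if area else 0.0,     # Definition 5
    }


# Made-up example: a body block inside a 1000 x 2000 page.
page = Block(top=0, left=0, width=1000, height=2000)
body = Block(top=200, left=100, width=800, height=1000,
             text_length=4000, link_length=400, parent=page)
features = block_features(body, page.area)
print(features)
```

With these sample values the body block is horizontally centered in its parent (β = 0) and covers 40% of the page (γ = 0.4).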

VBPA for positioning and extracting theme content in theme-based websites: Extraction from theme-based pages requires positioning and extracting the theme block. The three steps are: generate a VBT from a theme-based page with VIPS; position the theme block; extract the theme information.

Rules of visual block B: In theme-based pages, theme area corresponds to a VBT node. Six rules of B are defined:

Rule 1: The ratio of the link-text length of each direct child node of B L_linklength to the pure text length L_textlength satisfies:


Rule 2: The link-text density λ_link of each direct child node of B satisfies:

Z_link is the threshold determined by experiments on different sample webpages.

Rule 3: The length-width ratio of each direct child node b of B is:

the width of B is B_wide, the height of B is B_high, then, ε<0.2 and:

or ε>5 and:


Rule 4: The ratio of the distance β, between the vertical central axis of B and that of its parent block, to the width B'_wide of the parent block satisfies:

T_disprotrait is the threshold determined by measurement.

Rule 5: The ratio γ of the area of B to the area of the webpage satisfies γ ≥ M_webregion, where M_webregion is the threshold determined by experiments on different sample webpages.

Rule 6: If B is a theme block, its pure text density λ_text satisfies:

Z_text is the threshold determined by experiments on different sample webpages.

Theme information block positioning algorithm: The algorithm iterates top-down, layer by layer, starting from the VBT root node.


Fig. 2: The general structure of theme-based websites (LEFT)

Fig. 3: Theme-based website VBT(RIGHT)

Figure 2 illustrates this process. B2, a parent block containing the title and body (in the red frame), is positioned in the second round of VBPA. The next iteration positions the theme block B2_2, which corresponds to a VBT node, as shown in Fig. 3. The theme information can then be extracted from the positioned block.
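The layer-by-layer descent described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm listing: `Node`, `position_theme_block` and the predicate `is_candidate` (a stand-in for checking Rules 1 to 6 with their thresholds) are hypothetical names, and stopping when no child or more than one child qualifies is an assumption.

```python
from typing import Callable, List, Optional


class Node:
    """Minimal VBT node: a name and a list of child nodes (hypothetical)."""

    def __init__(self, name: str, children: Optional[List["Node"]] = None):
        self.name = name
        self.children = children or []


def position_theme_block(root: Node,
                         is_candidate: Callable[[Node], bool]) -> Node:
    """Descend from the root while exactly one direct child qualifies.

    `is_candidate` stands in for the rule checks on a block. The node
    reached when the descent stops is returned as the theme block.
    """
    current = root
    while True:
        matches = [c for c in current.children if is_candidate(c)]
        if len(matches) != 1:  # nothing qualifies, or ambiguous: stop here
            return current
        current = matches[0]


# Tree shaped like the example in the text: root -> B2 -> B2_2.
tree = Node("root", [
    Node("B1"),
    Node("B2", [Node("B2_1"), Node("B2_2")]),
    Node("B3"),
])

# Toy predicate: pretend only B2 and B2_2 pass the rule checks.
theme = position_theme_block(tree, lambda n: n.name in {"B2", "B2_2"})
print(theme.name)
```

Passing the rule check as a function keeps the traversal independent of the thresholds, which the text says are tuned per page type.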

VBPA for extracting post speech information in BBS websites: Based on the theme information block positioning algorithm above, the study provides a VBPA that extracts speech information from the positioned areas. The four steps are: (a) generate a VBT from a BBS page with VIPS; (b) position the theme blocks; (c) cluster the BBS speech areas within the theme blocks; (d) extract information from the sets clustered in (c). VBPA for BBS positions theme areas with the VBPA for theme-based websites and positions speech blocks with a visual block feature clustering algorithm; it therefore builds on the theme-based VBPA but requires a further, more refined round of screening and positioning.

Positioning the theme information block in BBS websites: Theme blocks are positioned in the VBT generated by VIPS. Such blocks, which may contain several speech areas, cover both the speech areas and the users' information areas. Visual blocks that correspond to theme areas satisfy the rules of visual block B defined above, but with different thresholds. The BBS theme block positioning algorithm is based on the theme information block positioning algorithm above: its input is the VBT generated by VIPS and its output is the positioned VBT nodes. The specific steps are the same as in that algorithm.

Visual block similarity of the posts in the BBS website theme area: The theme block positioned in the previous step has many child blocks with different content. Consider two child blocks B1 and B2 with areas S(B1) and S(B2), respectively, whose parent blocks B1' and B2' have areas S(B1') and S(B2'), respectively. Let s = S(B1) + S(B2) and s' = S(B1') + S(B2'). The definitions are as follows:

Definition 6: The position similarity δ of blocks B1 and B2 is the ratio of the distances from B1 and B2 to the left margins of their respective parent blocks, i.e.:

Its weight is δ_w = 0.3.

Definition 7: The area similarity η of blocks B1 and B2 satisfies:

Its weight η_w is the ratio of the area sum s of B1 and B2 to the area sum s' of their parents B1' and B2', i.e.:

η_w = s / s'

Definition 8: Let the numbers of content letters in B1 and B2 be C_num(B1) and C_num(B2), respectively, and let C_min = min{C_num(B1), C_num(B2)} and C_max = max{C_num(B1), C_num(B2)}. The content similarity θ satisfies:

The content weight θ_w is the ratio of the text area sum s_t of B1 and B2 to the area sum s of B1 and B2, i.e.:

θ_w = s_t / s

Definition 9: The similarity sim (B1, B2) of visual blocks B1 and B2 satisfies Eq. 5:

(5)

Definition 10: The similarity sim(Q, B) of a visual block set Q = {Q1, Q2, …, Qn} and a visual block B is the average of the similarities between B and every element of Q, i.e.:

sim(Q, B) = (1/n) Σ_{i=1..n} sim(Q_i, B)

Q_max is the set that has the greatest similarity with B; it satisfies:

sim(Q_max, B) = max_Q sim(Q, B)
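Definition 10 and the choice of Q_max translate into a few lines of code. The sketch below is illustrative: the function names are hypothetical, blocks are represented only by a made-up text length, and the pairwise similarity `toy_sim` (smaller length over larger length) is a stand-in chosen for the example, not Eq. 5.

```python
from typing import Any, Callable, Sequence

Sim = Callable[[Any, Any], float]


def set_similarity(q: Sequence[Any], b: Any, sim: Sim) -> float:
    """Definition 10: mean of sim(Q_i, b) over all elements Q_i of q."""
    if not q:
        raise ValueError("Q must contain at least one block")
    return sum(sim(qi, b) for qi in q) / len(q)


def most_similar_set(sets: Sequence[Sequence[Any]], b: Any,
                     sim: Sim) -> Sequence[Any]:
    """Q_max: the candidate set with the greatest average similarity to b."""
    return max(sets, key=lambda q: set_similarity(q, b, sim))


def toy_sim(x: float, y: float) -> float:
    """Stand-in pairwise similarity in [0, 1]; NOT the paper's Eq. 5."""
    return min(x, y) / max(x, y)


# Two candidate sets of blocks, each block reduced to a text length.
long_posts = [100, 120]
short_posts = [10, 12]
q_max = most_similar_set([long_posts, short_posts], 110, toy_sim)
print(q_max)
```

Because the pairwise similarity is passed in as a function, the same helpers work unchanged once the weighted similarity of Definition 9 is available.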

Visual block clustering algorithm: The blocks of the theme area are processed by a level-by-level iterative clustering algorithm based on similarity. In the first iteration round the algorithm aggregates all speech areas into one group; in the following iterations it extracts BBS page content information at different granularities.

Algorithm: Visual block clustering algorithm in BBS website
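One level of similarity-based clustering can be sketched as below. This is a minimal sketch under stated assumptions, not the paper's listing: the greedy assignment strategy, the function name `cluster_blocks`, the `threshold` parameter and the stand-in similarity `toy_sim` are all hypothetical, and only a single level of the level-by-level iteration is shown.

```python
from typing import Any, Callable, List, Sequence


def cluster_blocks(blocks: Sequence[Any],
                   sim: Callable[[Any, Any], float],
                   threshold: float) -> List[List[Any]]:
    """Greedily assign each block to the most similar existing cluster.

    A block joins the cluster with the highest average similarity
    (the Q_max of Definition 10) if that similarity reaches `threshold`;
    otherwise it starts a new cluster. `threshold` is an assumed parameter.
    """
    clusters: List[List[Any]] = []
    for b in blocks:
        best, best_score = None, threshold
        for q in clusters:
            score = sum(sim(x, b) for x in q) / len(q)
            if score >= best_score:
                best, best_score = q, score
        if best is None:
            clusters.append([b])
        else:
            best.append(b)
    return clusters


def toy_sim(x: float, y: float) -> float:
    """Stand-in pairwise similarity in [0, 1]; NOT the paper's Eq. 5."""
    return min(x, y) / max(x, y)


# Blocks reduced to made-up text lengths: three post bodies, two signatures.
groups = cluster_blocks([100, 110, 10, 95, 12], toy_sim, threshold=0.7)
print(groups)
```

Applying the same function again to the children of each resulting group would give the finer granularities mentioned in the text.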

The theme block can be positioned by the theme information block positioning algorithm, as shown in Fig. 4. The corresponding VBT node B2 is shown in the gray frame in Fig. 5.

Fig. 4: BBS-type structure of webpage (LEFT)

Fig. 5: BBS-based Web VBT(RIGHT)

Table 1: A sample of BBS-based Web information extraction results

The speech information is then clustered by similarity. The set {B2_1, B2_2, B2_3} is built in the first iteration round and the set {B2_1_1, B2_2_1, B2_3_1, B2_1_2, B2_2_2, B2_3_2} in the second, after which extraction at the current granularity can be done. The specific positions in the VBT are shown in Fig. 5.

For one page http://bbs.stardestroyer.net/viewtopic.php?f=4&t=146025 of the BBS website http://bbs.stardestroyer.net/, the result of extraction after three rounds of iteration is shown in Table 1.

EXPERIMENT AND RESULTS

Experimental environment: Intel (R) Core (TM) Duo CPU P8700 at 2.53 GHz × 2, 2 GB RAM; IDE: Visual Studio 2008. Extraction tests were run on theme-based and BBS webpages, respectively.

Experimental tests on theme-based webpages: The data is collected from several famous Chinese portal sites and covers many domains such as society and military, as shown in Table 2. Label templates with HTMLParser are used to semi-automatically extract and mark the theme blocks in the 750 pages. The results are reported in three grades: excellent, good and poor. Excellent means all theme blocks are extracted precisely without interfering information; good means blocks are extracted but interfering information remains or information integrity is lower than 85%; all other results are poor. Table 3 shows the extraction results, where accuracy = (number of pages whose theme content is precisely extracted / total number of pages) × 100%.

Table 2: Experimental data set of theme-based pages

Table 3: Extraction results of theme-based pages

VBPA has been applied in public-sentiment monitoring systems in projects. The study analyzes the extraction results and accuracy for blog-type pages in such systems; 200 pages from each site are analyzed, as shown in Fig. 6.

Experimental tests on BBS webpages: The test data is collected from several famous BBS forums and university forums in China, involving nearly 10 domains. Five types of BBS sites are selected, with pages of 4 theme kinds from each site, for 540 test pages in total. Label templates with HTMLParser are used to semi-automatically extract and mark the speech blocks in the 540 pages, which are then extracted by the BBS VBPA. The results are shown in Table 4.

Because extraction is refined to the speech-block level, it is not precise to evaluate only on correctly positioned pages; further calculation is needed after speech-block-level extraction. Here accuracy is the ratio of correctly extracted speech areas to all extracted areas and recall rate is the ratio of correctly extracted speech areas to all speech areas in the input pages, both expressed as percentages.
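The two speech-block-level measures can be computed as in the sketch below. The function name `extraction_metrics` and the counts in the example are made up for illustration and are not results from the experiments; returning 0 when a denominator is zero is a choice made here.

```python
from typing import Tuple


def extraction_metrics(correct: int, extracted: int,
                       total: int) -> Tuple[float, float]:
    """Return (accuracy, recall) in percent.

    correct   -- correctly extracted speech areas
    extracted -- all extracted areas
    total     -- all speech areas in the input pages
    """
    accuracy = 100.0 * correct / extracted if extracted else 0.0
    recall = 100.0 * correct / total if total else 0.0
    return accuracy, recall


# Hypothetical counts: 90 correct out of 100 extracted, 120 areas in total.
accuracy, recall = extraction_metrics(correct=90, extracted=100, total=120)
print(accuracy, recall)
```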

Fig. 6: Blog-type page extraction accuracies

Table 4: BBS-based page experimental results

Table 5: Comparison of two kinds of extraction algorithm

The results are compared with those of the traditional approach. As shown in Table 5, VBPA extraction achieves a higher recall rate and accuracy than HTML DOM tree extraction (9.15% and 2.33% higher, respectively), indicating that VBPA has a higher recognition ratio and better performance.

CONCLUSIONS

Taking reading habits and visual features into account, the study proposes VBPA, based on VIPS, to extract information from pages of different types and structures. VBPA shows better performance and higher accuracy, and the study also suggests assumptions for a large-scale algorithm on generic pages. Its complexity is lower than that of traditional algorithms, while its efficiency and accuracy are clearly improved. Further research will involve more page types; the next step is to improve accuracy, integrity and generality so that customized information can be extracted from generic pages by changing the thresholds and parameters.

ACKNOWLEDGMENT

I would like to take this chance to express my sincere gratitude to my upperclassman, Xiao Yang, for his kindly assistance and valuable suggestions during my paper writing. My gratitude also extends to all the teachers who taught me for their kind encouragement and patient instructions. Special thanks to dear Sephiroth Chen, for the love, hope, courage and strength you have brought, bring and will bring to me. Last but not least, I would like to offer my particular thanks to my friends and family, for their encouragement and support for the completion of this study.

REFERENCES

  • Cai, D., S. Yu, J.R. Wen and W.Y. Ma, 2003a. Extracting content structure for web pages based on visual representation. Proceedings of the 5th Asia Pacific Web Conference, April 23-25, 2003, Xi'an, China, pp: 406-417.


  • Cai, D., S. Yu, J.R. Wen and W.Y. Ma, 2003b. VIPS: A vision-based page segmentation algorithm. Microsoft Technical Report, MSR-TR-2003-79. http://research.microsoft.com/apps/pubs/default.aspx?id=70027.


  • Liu, W. and X. Meng, 2006. Vision-based web data records extraction. Proceedings of the 9th SIGMOD International Workshop on Web and Databases, June 30, 2006, Chicago.


  • Chang, C.H., M. Kayed, R. Girgis and K.F. Shaalan, 2006. A survey of web information extraction system. Inst. Electr. Electron. Eng. Trans. Knowledge Data Eng., 18: 1411-1428.


  • Sarawagi, S., 2002. Automation in information extraction and integration. Proceedings of the 28th International Conference on Very Large Data Bases, August 20-23, 2002, Hong Kong, China.


  • Laender, A.H.F., B.A. Ribeiro-Neto, A.S. da Silva and J.S. Teixeira, 2002. A brief survey of web data extraction tools. ACM SIGMOD Record, 31: 84-93.


  • Crescenzi, V. and G. Mecca, 1998. Grammars have exceptions. Inform. Syst., 23: 539-565.


  • Sahuguet, A. and F. Azavant, 2001. Building intelligent web applications using lightweight wrappers. Data Knowledge Eng., 36: 283-316.
