Intelligent Web Robot for Content Extraction

Wenxing HONG; Jie LI; Weiwei WANG; Yang WENG

Vol. 6 No. 3 (2019)

Submitted: February 5, 2024
Published: 2024-02-05

Abstract

The main content of a news web page contains large quantities of valuable information and is a source of data for Natural Language Processing (NLP) and Information Retrieval (IR). This paper proposes a method that formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent the HTML document by its DOM tree nodes and derive multiple features from node properties such as text length, tag path, and tag attributes. Because the core of the problem is the classification model, we use XGBoost to select the nodes that belong to the main content. Experimental results show that the proposed approach is effective and efficient in extracting the main content of news web pages: it achieves about 98% accuracy over 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
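
As an illustration of the node-level feature extraction described in the abstract, the following is a minimal sketch and not the authors' implementation. It walks an HTML document with Python's standard `html.parser` module and emits one feature record per element that directly contains text. The class name `NodeFeatureExtractor`, the function `extract_node_features`, and the particular features (`tag_path`, `depth`, `text_length`, `has_class`, `is_paragraph`) are assumptions chosen for the example; the paper's actual feature set and XGBoost configuration are not given on this page.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NodeFeatureExtractor(HTMLParser):
    """Hypothetical helper: one feature record per element with direct text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as (tag, attribute dict, text pieces)
        self.nodes = []   # finished feature records, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, dict(attrs), []))

    def handle_data(self, data):
        # Attach non-blank text to the innermost open element.
        if self.stack and data.strip():
            self.stack[-1][2].append(data.strip())

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1][0] != tag:
            return  # this sketch ignores mismatched closing tags
        path = "/".join(t for t, _, _ in self.stack)
        tag, attrs, pieces = self.stack.pop()
        text = " ".join(pieces)
        if text:
            self.nodes.append({
                "tag_path": path,                  # e.g. "html/body/div/p"
                "depth": path.count("/") + 1,      # number of elements in the path
                "text_length": len(text),          # characters of direct text
                "has_class": int("class" in attrs),
                "is_paragraph": int(tag == "p"),
            })


def extract_node_features(html):
    """Return a list of feature dicts, one per text-bearing element."""
    parser = NodeFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


sample = ("<html><body><div class='nav'><a href='/'>Home</a></div>"
          "<div class='article'><p>First paragraph of the story.</p></div>"
          "</body></html>")
for record in extract_node_features(sample):
    print(record)
```

In a pipeline of the kind the abstract describes, records like these would be converted to numeric vectors, labeled as main content or not, and passed to a gradient-boosted classifier such as XGBoost. That training step is omitted here because its details are not stated in the source.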

Keywords

Content Extraction
DOM
Machine Learning
XGBoost

How to Cite

HONG, W., LI, J., WANG, W., & WENG, Y. (2024). Intelligent Web Robot for Content Extraction. Instrumentation, 6(3). Retrieved from https://editorial.instrumentationjournal.com/index.php/instr/article/view/96

This work is licensed under a Creative Commons Attribution 4.0 International License.
