Intelligent Web Robot for Content Extraction

Wenxing HONG; Jie LI; Weiwei WANG; Yang WENG

Vol. 6 No. 3 (2019)

Submitted: February 5, 2024
Published: February 5, 2024

Abstract

The main content of a news web page contains large quantities of valuable information and serves as a data source for Natural Language Processing (NLP) and Information Retrieval (IR). This paper proposes a method that formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent the HTML document by its DOM tree nodes and derive multiple features from node properties such as text length, tag path, and tag properties. Because the core of the problem is the classification model, we use XGBoost to select the nodes. Experimental results show that the proposed approach is effective and efficient at extracting the main content of news web pages: it achieves about 98% accuracy on 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
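As an illustration of the feature extraction step described in the abstract, the following is a minimal sketch that walks an HTML document and emits one feature row per DOM element. It is not the authors' implementation. The class and function names (`NodeFeatureParser`, `extract_node_features`) are hypothetical, the feature set is only a guess at what "text length, tag path, tag properties" might cover, and `link_density` is an added assumption that does not appear in the abstract. The classification step is not shown; rows like these could be numerically encoded and passed to a gradient-boosted classifier such as XGBoost.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag or text content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NodeFeatureParser(HTMLParser):
    """Collects one feature dict per element node (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, root first
        self.nodes = []  # finished feature dicts, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements carry no text; skipped in this sketch
        path = "/".join([node["tag"] for node in self.stack] + [tag])
        self.stack.append({
            "tag": tag,
            "tag_path": path,                 # e.g. "html/body/div/p"
            "depth": len(self.stack) + 1,
            "attr_count": len(attrs),         # simple "tag properties" features
            "has_class": any(name == "class" for name, _ in attrs),
            "text_length": 0,
            "link_text_length": 0,            # assumption: not named in the abstract
        })

    def handle_data(self, data):
        length = len(data.strip())
        if length == 0:
            return
        in_link = any(node["tag"] == "a" for node in self.stack)
        # Text counts toward every open ancestor, not just the innermost one.
        for node in self.stack:
            node["text_length"] += length
            if in_link:
                node["link_text_length"] += length

    def handle_endtag(self, tag):
        if not any(node["tag"] == tag for node in self.stack):
            return  # stray closing tag; ignore it
        # Pop up to the matching open tag, tolerating unclosed children.
        while self.stack:
            node = self.stack.pop()
            self.nodes.append(node)
            if node["tag"] == tag:
                break


def extract_node_features(html):
    """Return a list of feature dicts, one per element in the document."""
    parser = NodeFeatureParser()
    parser.feed(html)
    parser.close()
    while parser.stack:  # flush elements that were never closed
        parser.nodes.append(parser.stack.pop())
    for node in parser.nodes:
        total = node["text_length"]
        node["link_density"] = node["link_text_length"] / total if total else 0.0
    return parser.nodes


page = (
    "<html><body>"
    "<div class='nav'><a href='/'>Home</a><a href='/world'>World</a></div>"
    "<div class='article'><p>The council approved the new budget on Monday.</p></div>"
    "</body></html>"
)
rows = extract_node_features(page)
for row in rows:
    print(row["tag_path"], row["text_length"], round(row["link_density"], 2))
```

In this toy page the navigation block consists entirely of link text while the article block contains none, which is the kind of contrast a node classifier could learn from labeled examples.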

Keywords

Content Extraction
DOM
Machine Learning
XGBoost

How to Cite

HONG, W., LI, J., WANG, W., & WENG, Y. (2024). Intelligent Web Robot for Content Extraction. Instrumentation, 6(3). Retrieved from https://editorial.instrumentationjournal.com/index.php/instr/article/view/96

This work is licensed under a Creative Commons Attribution 4.0 International License.
