The main content of news web pages is a valuable source of data for Natural Language Processing (NLP) and Information Retrieval (IR). This paper formulates main content extraction as a DOM tree node classification problem. For feature extraction, we represent an HTML document by its DOM tree nodes and derive multiple features from node properties such as text length, tag path, and tag properties. Because the core of the problem is the classification model, we use XGBoost to select the nodes that hold the main content. Experimental results show that the proposed approach is both effective and efficient at extracting the main content of news web pages: it achieves about 98% accuracy on 1083 news pages from 10 different news sites, with an average processing time per page within 10 ms.
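
As an illustration of the feature extraction step, the following minimal sketch computes per-node features of the kind named above (text length, tag path, tag properties) using only the Python standard library. It is not the implementation evaluated in this paper: the names `NodeFeatureParser` and `extract_node_features`, the depth feature, and the exact feature definitions are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they carry no text, so we skip them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling rather than page content.
NON_CONTENT_TAGS = {"script", "style"}


class NodeFeatureParser(HTMLParser):
    """Hypothetical helper: collects simple features for every DOM element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, outermost first
        self.nodes = []  # finished elements, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        path = "/".join([node["tag"] for node in self.stack] + [tag])
        attr_names = {name for name, _ in attrs}
        self.stack.append({
            "tag": tag,
            "tag_path": path,
            "depth": len(self.stack) + 1,
            "text_len": 0,
            "n_attrs": len(attr_names),
            "has_id": "id" in attr_names,
            "has_class": "class" in attr_names,
        })

    def handle_endtag(self, tag):
        # Close the nearest matching open element, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                while len(self.stack) > i:
                    self.nodes.append(self.stack.pop())
                break

    def handle_data(self, data):
        if self.stack and self.stack[-1]["tag"] in NON_CONTENT_TAGS:
            return
        n = len(data.strip())
        # Text counts toward the enclosing element and all of its ancestors.
        for node in self.stack:
            node["text_len"] += n


def extract_node_features(html):
    """Return one feature dict per element in the given HTML string."""
    parser = NodeFeatureParser()
    parser.feed(html)
    parser.close()
    # Flush elements that were never explicitly closed.
    while parser.stack:
        parser.nodes.append(parser.stack.pop())
    return parser.nodes


if __name__ == "__main__":
    page = ("<html><body><div id='nav'><a href='/'>Home</a></div>"
            "<div class='article'><p>Some long news paragraph text here.</p>"
            "</div></body></html>")
    for features in extract_node_features(page):
        print(features)
```

Each feature dictionary would then be encoded numerically (for example, by hashing or one-hot encoding the tag path) and passed, together with a main-content/non-content label, to a classifier such as XGBoost.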