Benchmark Effect of Web Search Engines on Text Mining
Authors : Ahmet Toprak, Metin Turan
Pages : 84-92
Publication Date : 2021-01-15
Article Type : Research
Abstract : Many studies have addressed dictionary creation, using a variety of methods and analyses. With the emergence of the World Wide Web, efforts to create dictionaries from current data have gained importance. The performance of web search engines therefore directly affects any model that uses web documents for automatic dictionary creation. In this research, web search engines were evaluated in terms of the relevance of their suggested documents to the query. For this purpose, a model for automatic dictionary creation from web documents was developed. First, topic seed words are determined from the documents initially presented to the system, and a search is executed with these seed words. The TF-IDF metric is then used to select meaningful words from the first returned document: the n words with the highest TF-IDF values are selected, where the value of n was determined experimentally. These words are added to the dictionary and used to search the web again, and the search engine suggests new documents. By repeating the process, experimental dictionaries of a certain size were obtained. Because the documents suggested by each search engine are generally different, the similarity of the dictionaries produced from the top suggested documents can measure how well each search engine selects relevant documents. Hash similarity was used to evaluate dictionary performance. According to the results, the highest dictionary similarity was 73.9% for the Google search engine, 68.7% for the Bing search engine and 60.5% for the Yandex search engine.
Keywords : Automatic Dictionary Creation, Hash Similarity, Natural Language Processing, Performance of Web, TF-IDF Metric
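The word-selection step described in the abstract (taking the n words with the highest TF-IDF values from a returned document) can be illustrated with a minimal sketch. The abstract does not state which TF-IDF variant, tokenizer or corpus the authors used, so everything below is an assumption for illustration: the function names `tokenize` and `top_n_tfidf` are hypothetical, term frequency is the relative count within the document, and inverse document frequency is the unsmoothed log(N / df) over a small list of documents that includes the target.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, whitespace-split, alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]


def top_n_tfidf(docs, target_index, n):
    """Return the n words of docs[target_index] with the highest TF-IDF.

    docs is a list of strings; the target document is part of it, so
    every word of the target has a document frequency of at least 1.
    """
    tokenized = [tokenize(d) for d in docs]
    target = tokenized[target_index]
    if not target:
        return []

    # Document frequency: number of documents containing each word.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(target)
    total = len(target)
    num_docs = len(tokenized)

    scores = {
        word: (count / total) * math.log(num_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n]
```

In the iterative procedure the abstract outlines, the words returned by such a function would be added to the dictionary and used as the next search query; the search step itself depends on the search engine and is not sketched here.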