Pengembangan Modul Preprocessing Teks untuk Kasus Formalisasi dan Pengecekan Ejaan Bahasa Indonesia pada Aplikasi Web Mining Simple Solution (WMSS)
DOI: https://doi.org/10.20956/jmsk.v15i2.5718
Abstract
Social media data is now widely used for analysis, including sentiment analysis and other related analyses. In practice, data obtained from social media generally contains errors that affect the results of the analysis. These errors take the form of non-standard (informal) words and misspelled words. The solution offered is word formalization and spell checking. To address this problem, a preprocessing module is built to handle these two kinds of errors. Formalization converts words to their formal form based on KBBI, while spell checking uses spelling correction. Three spelling correction methods are considered: edit distance, bigram, and edit distance + rule. In addition to applying both approaches, this study compares the results of the three spelling correction methods. The analysis shows that the edit distance + rule method has the highest accuracy, 83.39%, compared with the edit distance and bigram methods. The edit distance + rule method is also faster than the other two methods. Overall, converting words to their formal form based on KBBI and applying spelling correction were able to overcome the two problems, so the accuracy of the analysis results can be improved.
Keywords: preprocessing, spelling correction, edit distance, bigram
Abstrak
Data media sosial saat ini telah banyak digunakan untuk melakukan analisis baik analisis sentimen maupun analisis terkait lainnya. Nyatanya, data yang diperoleh dari media sosial tersebut pada umumnya memiliki kesalahan yang akan mempengaruhi hasil analisis. Kesalahan tersebut berupa penggunaan kata yang tidak baku dan adanya kesalahan ejaan dalam penulisan kata. Solusi yang ditawarkan berupa formalisasi kata dan pengecekan ejaan. Berdasarkan masalah tersebut, akan dibangun modul preprocessing untuk mengatasi dua kesalahan di atas. Metode yang digunakan pada formalisasi adalah mengubah kata ke bentuk formal berdasarkan KBBI sedangkan metode yang digunakan pada pengecekan ejaan adalah spelling correction. Metode spelling correction tersebut terdiri dari tiga yaitu edit distance, bigram dan edit distance + rule. Pada penelitian ini, selain penerapan kedua metode juga akan dilakukan analisis untuk melihat perbandingan hasil pada metode spelling correction. Dari hasil analisis tersebut, diketahui bahwa metode edit distance + rule memiliki akurasi yang lebih tinggi yaitu sebesar 83,39% dibandingkan dengan kedua metode lainnya yaitu edit distance dan bigram. Selain itu, metode edit distance + rule juga memiliki performa tercepat dibandingkan kedua metode lainnya. Secara keseluruhan, metode mengubah kata ke bentuk formal berdasarkan KBBI dan spelling correction telah mampu mengatasi masalah pada dua kasus di atas sehingga dapat meningkatkan akurasi hasil analisis.
Kata Kunci: preprocessing, spelling correction, edit distance, bigram
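As a rough illustration of the two preprocessing steps named in the abstract, the sketch below first maps an informal word to a formal form through a lookup table and then corrects a misspelled word by choosing the lexicon entry with the smallest edit distance. It is a minimal sketch, not the module described in the article: the `FORMAL_FORMS` table, the `LEXICON` list, the `max_distance` threshold, and all function names are hypothetical stand-ins (a real system would draw on KBBI), and the bigram and edit distance + rule variants are not reproduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical toy data; assumed for illustration only, not taken from KBBI.
FORMAL_FORMS = {"gak": "tidak", "udah": "sudah", "yg": "yang"}
LEXICON = ["tidak", "sudah", "yang", "makan", "minum", "belajar"]


def formalize(word: str) -> str:
    """Replace a known informal word with its formal form, else keep it."""
    return FORMAL_FORMS.get(word, word)


def correct(word: str, lexicon=LEXICON, max_distance: int = 2) -> str:
    """Return the closest lexicon word, or the input if nothing is close."""
    if word in lexicon:
        return word
    best = min(lexicon, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_distance else word


def preprocess(text: str) -> str:
    """Lowercase, split on whitespace, formalize, then spell-correct."""
    return " ".join(correct(formalize(t)) for t in text.lower().split())
```

With this toy data, `preprocess("gak makn")` yields `"tidak makan"`: `gak` is formalized through the table, and `makn` is one insertion away from `makan`. Words farther than `max_distance` from every lexicon entry are left unchanged, which keeps the sketch from forcing unrelated tokens such as names onto dictionary words.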
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Jurnal Matematika, Statistika dan Komputasi is an Open Access journal. All articles are distributed under the terms of the Creative Commons Attribution License, which allows third parties to copy and redistribute the material in any medium or format, and to transform and build upon it, provided the original work is properly cited and its license is stated. This license allows authors and readers to use all articles, data sets, graphics, and appendices in data mining applications, search engines, web sites, blogs, and other platforms, provided appropriate reference is given.