Comparison of M Estimation, S Estimation, with MM Estimation to Get the Best Estimation of Robust Regression in Criminal Cases in Indonesia

In 2019, Indonesia recorded 269,324 criminal cases, according to survey-based crime data drawn from the National Socio-Economic Survey and the Village Potential Data Collection produced by the Central Statistics Agency. The high crime rate is attributed to several factors, including poverty and population density. The most influential factors in criminal acts in Indonesia can be determined with Regression Analysis. One very commonly used Regression Analysis method is the Least Square Method. However, Regression Analysis is valid only if its assumption tests are met, and outliers can cause those tests to fail. The outlier problem can be overcome by using a robust estimation method. This study aims to determine the best estimation method among Maximum Likelihood Type (M) estimation, Scale (S) estimation, and MM estimation in Robust Regression. The best Robust Regression estimate is the one with the smallest Residual Standard Error (RSE) and the largest Adjusted R-square. The analysis of the case study of criminal acts in Indonesia in 2019 showed that the best estimate was the S estimate, with an RSE of 4226 and an Adjusted R-square of 0.98.
 


INTRODUCTION AND PRELIMINARIES
A crime is an act that is prohibited by and contrary to the law. Criminal acts can also be understood as all forms of prohibited action regulated in applicable law. Criminal acts have a broad scope, including immoral acts, corruption, fraud, persecution, theft, and so on. Anyone who violates a prohibition regulated in applicable law can be threatened with a criminal penalty [9].
Based on survey-based crime data sourced from the National Socio-Economic Survey (SUSENAS) and the Village Potential Data Collection (PODES) produced by the Central Agency on Statistics Indonesia (BPS), 269,324 criminal acts were recorded in Indonesia in 2019. Based on these data, 103 out of every 100,000 people in Indonesia were at risk of becoming a victim of crime, and one criminal act occurred every 1 minute 57 seconds [4].
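The interval between crimes quoted above can be checked directly from the 2019 total. The short sketch below assumes a 365-day year; the population figure it prints is only back-calculated from the quoted rate of 103 per 100,000 and is not a number given in the source.

```python
# Check the "crime clock" quoted in the text from the 2019 total.
TOTAL_CASES_2019 = 269_324          # from the text
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

# Average number of seconds between two recorded cases.
interval = SECONDS_PER_YEAR / TOTAL_CASES_2019
minutes, seconds = divmod(int(interval), 60)

# Population implied by a rate of 103 cases per 100,000 people
# (back-calculated here; the text does not state the population).
implied_population = TOTAL_CASES_2019 / 103 * 100_000

print(f"one case every {minutes} min {seconds} s")
print(f"implied population: {implied_population:,.0f}")
```

The result agrees with the interval of 1 minute 57 seconds reported in the text.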

Malecita Nur Atala Singgih, Achmad Fauzan
The high number of criminal acts in Indonesia in 2019 can be attributed to several factors, including poverty and population density. The most influential factors in criminal acts in Indonesia can be determined with Regression Analysis. The Least Square Method is one of the most commonly used Regression Analysis methods [18]. Regression Analysis can be used if the assumption tests are met, namely normality, homoscedasticity, no autocorrelation, and no multicollinearity [8]. In some cases, regression analysis cannot be used because outliers cause the assumptions to fail. The outlier problem can be overcome by using a robust estimation method [11]. Robust regression is a method used when the assumption tests are not met and outliers are present. It is well suited to analyzing data affected by outliers, because it yields a model that is robust, or resistant, to outliers [6]. The benefit of this research is to show how to determine the factors that most influence crime in Indonesia in 2019. This can serve as a reference for the government in forming policies to tackle criminal acts in Indonesia.
Several studies have compared S estimation, LTS estimation, M estimation, and MM estimation in robust regression. Perihatini [12] conducted a comparative study of LTS estimation, S estimation, and M estimation for a case study of car financing at company "X", aiming to produce the best parameter estimation model as judged by the Mean Square Error (MSE) and values. Widodo [19] compared the LTS estimate, the M estimate, and the MM estimate for a case study of farmer exchange rates. The comparison was based on the Residual Standard Error value.
According to previous research, each of these methods has advantages and disadvantages. Based on the characteristics of the data tested in this study, the authors compare the M estimate, the S estimate, and the MM estimate in Robust Regression. The purpose of this research is to determine the best estimate and thereby obtain the best model. The best estimate is selected based on the smallest Residual Standard Error (RSE) value and the largest Adjusted R-square value. The data are processed with the RStudio software.

METHOD
This research was conducted at PT Kedata Indonesia Digital from January 18, 2021, to February 26, 2021. The data used are data on criminal acts in Indonesia in 2019. These are survey-based secondary data sourced from the National Socio-Economic Survey (SUSENAS) and the Village Potential Data Collection (PODES) produced by the Central Statistics Agency (BPS). The variables used are the number of criminal acts (Y) in cases, the number of poor people (X1) in persons, and population density (X2) in persons/km2. The research used Robust Regression analysis to determine the factors that influence criminal acts in Indonesia in 2019 when the data contain outliers and the assumptions are not met. The first step is to input the data. The second step is regression analysis using the Least Square Method. Next, the assumption tests are run; if any assumption is not met, outlier detection is carried out. If the assumptions are not met and outliers are present, the analysis proceeds with Robust Regression using M, S, and MM estimation. Finally, the best of the three estimates is selected, and the best model is obtained.

Regression Analysis with Least Square Method
Ordinary Least Square (OLS) is a method for fitting a regression equation in modeling, as well as for measurement analysis in model validation [2]. In Regression Analysis with the Least Square Method, there are two validation tests: the overall test and the partial test. The overall test, or F test, is used to determine whether the regression model is feasible to use as a model [5]. The test is also used to determine whether the independent variables are simultaneously significant to the dependent variable. Rejecting H0 when the p-value is less than α means that the model is feasible to use.
The partial test, or t test, is used to determine whether each independent variable has a significant effect in the model [20]. It is also used to know whether each independent variable is individually significant to the dependent variable. Rejecting H0 when the p-value is less than α means that the independent variable has a partial influence on the dependent variable.
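For reference, the quantities behind the overall and partial tests can be written in their textbook form. This is a sketch rather than the authors' own notation: the symbols $n$ (number of observations), $k$ (number of independent variables, here 2), the design matrix $X$, and the sums of squares are assumptions introduced for illustration.

```latex
\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y, \qquad
F = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n-k-1)}, \qquad
t_j = \frac{\hat{\beta}_j}{\mathrm{se}(\hat{\beta}_j)},
\quad\text{with}\quad
\mathrm{SSR} = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2, \;
\mathrm{SSE} = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 .
```

Under the null hypothesis, $F$ is compared with an $F$ distribution on $(k,\, n-k-1)$ degrees of freedom and each $t_j$ with a $t$ distribution on $n-k-1$ degrees of freedom.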

Assumption Test
The regression model obtained from OLS has regression coefficients that are unbiased, linear, and best, a property commonly referred to as the Best Linear Unbiased Estimator (BLUE) [1]. A normality test is conducted to test whether the residuals of the regression model are normally distributed [7]. One of the methods used to test for normality is the Shapiro-Wilk test. This test has good power for small samples, that is, fewer than 50 observations [13]. The test statistic, in its standard form, is given by Equation 1:
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (1)$$
where $x_{(i)}$ is the $i$-th smallest observation, $a_i$ are the Shapiro-Wilk coefficients, and $\bar{x}$ is the sample mean. If the p-value < 0.05, then H0 is rejected, meaning that the data are not normally distributed. H0 must therefore fail to be rejected for the assumption to be met. A heteroscedasticity test is carried out to test whether the residual variance of the regression model differs from one observation to another [3]. The method used to test heteroscedasticity is the Breusch-Pagan test. If the p-value < α, then H0 is rejected, meaning that heteroscedasticity occurs; H0 must fail to be rejected for the assumption to be fulfilled. An autocorrelation test is carried out to test the correlation between the residual of one observation and those of previous observations [15]. The method used to perform the autocorrelation test is the Durbin-Watson test. If the p-value < α, then H0 is rejected, meaning that there is autocorrelation in the residuals. For the assumption to be fulfilled, the data must fail to reject H0.
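As an illustration of the autocorrelation check, the Durbin-Watson statistic can be computed from the residuals taken in observation order. The sketch below is a minimal plain-Python version; the function name and the two residual series are hypothetical and are not taken from the study's data.

```python
def durbin_watson(residuals):
    """Return the Durbin-Watson statistic for residuals in observation order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Sum of squared differences between consecutive residuals.
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    # Sum of squared residuals.
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips every step
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # long runs of one sign

dw_alternating = durbin_watson(alternating)  # well above 2
dw_trending = durbin_watson(trending)        # well below 2
print(dw_alternating, dw_trending)
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.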
A multicollinearity test is conducted to see whether the independent variables have a significant relationship with each other. One way to determine the presence or absence of multicollinearity is by looking at the Variance Inflation Factor (VIF) value [17]. If VIF < 10, H0 fails to be rejected, which means that there is no multicollinearity, so the assumption is fulfilled.
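For a model with exactly two independent variables, as in this study, the VIF reduces to 1 / (1 - r^2), where r is the Pearson correlation between the two predictors, so both variables share the same VIF. The sketch below illustrates this under that two-predictor assumption; the helper names and the toy data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Toy data, for illustration only (not the study's data).
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)
print(round(vif, 4))

# Squared correlation implied by a VIF of 1.000667, obtained by
# inverting the formula above.
implied_r2 = 1.0 - 1.0 / 1.000667
print(implied_r2)
```

A VIF close to 1 corresponds to predictors that are almost uncorrelated with each other.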

Outlier detection
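Outliers are data that do not follow the overall data pattern, or that do not follow the general pattern of the resulting regression model [16]. Outliers can be identified using the Cook's Distance method, whose test statistic, in its standard form, is defined by Equation 2:
$$D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2} \qquad (2)$$
where $e_i$ is the $i$-th residual, $p$ is the number of estimated regression coefficients, $s^2$ is the mean squared error of the model, and $h_{ii}$ is the leverage of the $i$-th observation.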
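To make the idea concrete, the sketch below computes Cook's distance for a simplified case with a single predictor, where the leverage has the closed form 1/n + (x_i - mean(x))^2 / Sxx. The single-predictor setting, the function name, and the toy data are assumptions made for illustration; the study itself uses two predictors.

```python
def cooks_distance_simple(x, y):
    """Cook's distance for each point of a one-predictor OLS fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    p = 2                                      # intercept and slope
    s2 = sum(e ** 2 for e in resid) / (n - p)  # mean squared error
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    return [(e ** 2 / (p * s2)) * h / (1 - h) ** 2
            for e, h in zip(resid, leverage)]

# Toy data: y = 2x except for the last point, which is shifted upward.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 40]
d = cooks_distance_simple(x, y)
most_influential = d.index(max(d))  # zero-based index of the largest D
print(most_influential, d[most_influential])
```

In this toy data the shifted last observation has by far the largest distance. Which cutoff the study applied to flag outliers is not stated in the text.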

Robust Regression Analysis
Robust regression is an important tool for analyzing data affected by outliers, producing a model that is robust, or resistant, to outliers. Robust regression serves as an alternative to OLS that aims to overcome such deviations [14]. Three estimates are used in Robust Regression Analysis. M estimation is a simple estimation method, both in theory and in calculation. It is suited to data in which most of the detected outliers are in the dependent variable. M estimation here uses Huber's weighting function [10].
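One common way to compute an M estimate with Huber weights is iteratively reweighted least squares (IRLS). The sketch below shows the idea for a single predictor only. The tuning constant 1.345, the scale estimate (median absolute residual divided by 0.6745), the function names, and the toy data are all assumptions made for illustration, not details taken from the study.

```python
import statistics

HUBER_C = 1.345  # common tuning constant (assumption; not given in the text)

def huber_weight(u, c=HUBER_C):
    """Huber weight: 1 for |u| <= c, c/|u| otherwise."""
    au = abs(u)
    return 1.0 if au <= c else c / au

def weighted_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

def huber_m_fit(x, y, iterations=50, tol=1e-10):
    """IRLS with Huber weights, started from the OLS fit."""
    a, b = weighted_line(x, y, [1.0] * len(x))
    for _ in range(iterations):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust residual scale: median absolute residual / 0.6745.
        scale = statistics.median([abs(e) for e in resid]) / 0.6745
        if scale == 0:
            break  # at least half of the points are fitted exactly
        w = [huber_weight(e / scale) for e in resid]
        a_new, b_new = weighted_line(x, y, w)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b

# Toy data: y = 2x except for the last point, an outlier in y.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 40]
ols_a, ols_b = weighted_line(x, y, [1.0] * len(x))
huber_a, huber_b = huber_m_fit(x, y)
print(ols_b, huber_b)
```

On this toy data the OLS slope is pulled up to 4 by the single outlier, while the Huber fit downweights that point and moves back toward the slope of 2 followed by the other observations.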

MM estimation combines a high breakdown point estimate, namely S estimation, with M estimation. MM estimation performs better than S estimation [21].
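The relation between S estimation and MM estimation can be summarized in the usual textbook form. The block below is a sketch under stated assumptions: the symbols $e_i(\beta)$ (residuals), $\hat{\sigma}$ (robust scale), $K$ (a constant fixing the breakdown point), and the Tukey bisquare choice of $\rho$ with tuning constant $c$ are introduced here for illustration and are not specified in the text.

```latex
% S estimation: choose the coefficients that minimise a robust residual scale
\hat{\beta}_{S} = \arg\min_{\beta}\ \hat{\sigma}\bigl(e_1(\beta),\dots,e_n(\beta)\bigr),
\qquad\text{where } \hat{\sigma} \text{ solves }\quad
\frac{1}{n}\sum_{i=1}^{n}\rho_0\!\left(\frac{e_i(\beta)}{\hat{\sigma}}\right) = K .

% MM estimation: an M step that keeps the S-estimation scale fixed
\hat{\beta}_{MM} = \arg\min_{\beta}\ \sum_{i=1}^{n}
\rho_1\!\left(\frac{e_i(\beta)}{\hat{\sigma}_{S}}\right).

% Tukey bisquare loss, a common choice for both \rho_0 and \rho_1
\rho(u) =
\begin{cases}
\dfrac{c^{2}}{6}\left[1-\left(1-\left(\dfrac{u}{c}\right)^{2}\right)^{3}\right], & |u| \le c,\\[2ex]
\dfrac{c^{2}}{6}, & |u| > c .
\end{cases}
```

The S step supplies the high breakdown point, and the subsequent M step, started from the S estimate, is intended to improve efficiency.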

Descriptive statistics
The data obtained are the number of criminal acts (Y) in cases, the number of poor people (X1) in persons, and population density (X2) in persons/km2. The descriptive analysis is presented in Table 1. Based on Table 1, the average number of criminal acts in Indonesia is 7,921 cases, with a maximum of 31,934 cases and a minimum of 718 cases. The average number of poor people is 739,551, with a maximum of 4,112,250 and a minimum of 48,780. The average population density is 742 persons/km2, with a maximum of 15,900 persons/km2 and a minimum of 9 persons/km2.

Regression Analysis with Ordinary Least Square
The overall test is used to test the feasibility of the model and the parameters jointly. The F test gave a p-value = 2.739× , which is less than α = 0.05, so the conclusion is that the model is feasible to use. The partial test (t-test) is used to determine whether each independent variable has a significant effect on the dependent variable. The p-values of X1 and X2 are each less than 0.05, so it can be concluded that the variables X1 (Poverty) and X2 (Population Density) have a significant effect on the variable Y (Criminal Act). The parameter estimates for the Least Square Method are shown in Table 3.
The coefficient of determination of the model is 0.5341, meaning that the independent variables explain 53.41% of the dependent variable in the model, while the rest is explained or influenced by other variables outside the model. The regression model obtained using OLS is said to be the Best Linear Unbiased Estimator (BLUE) if the assumption tests are met.
When performing regression analysis, several assumption tests must be met: the normality test, the homoscedasticity test, the autocorrelation test, and the multicollinearity test. The normality test was carried out using the Shapiro-Wilk test. It gave a p-value = 3.305×10^-5, which is less than α = 0.05, so the conclusion is that the data are not normally distributed (the assumption is not met). The homoscedasticity test was carried out using the Breusch-Pagan test. It gave a p-value = 0.004816, which is less than α = 0.05, so the conclusion is that the assumption of residual homoscedasticity is not met. The autocorrelation test was performed using the Durbin-Watson test. It gave a p-value = 0.004816, reported as greater than α = 0.05, so the conclusion is that there is no autocorrelation (the assumption is met).
The multicollinearity test was carried out by looking at the Variance Inflation Factor (VIF). The VIF of the variables X1 and X2 is 1.000667, which is less than 10, so the conclusion is that there is no multicollinearity (the assumption is met). Of the assumption tests carried out, two are not met, namely the normality test and the homoscedasticity test.
Because the normality and homoscedasticity assumptions are not met, outlier detection is carried out. Cook's Distance (Cook's D) is used to detect the presence of outliers, and the results are presented in Fig. 2.
Fig. 2. Outlier detection.

Based on Fig. 2, there are five outliers in the data, namely the 2nd, 11th, 12th, 13th, and 15th observations.

Robust Regression
Robust Regression parameter estimation using M estimation is shown in Table 4. Table 4 shows that if there is a one percent increase in X1, then Y will increase by 0.004128021%, while if there is a one percent increase in X2, then Y will increase by 1.692069%. Next, the significance test for variables X1 and X2 is shown in Table 5. Based on Table 5, the p-value of each variable is less than α = 0.05, so it can be concluded that the variables X1 and X2 have a significant effect on the Y variable.
Robust Regression parameter estimation using S estimation is presented in Table 6. Table 6 shows that if there is a one percent increase in X1, then Y will increase by 0.005885%, while if there is a one percent increase in X2, then Y will increase by 1.764%. Furthermore, the significance test for variables X1 and X2 is presented in Table 7.

Based on Table 7, the p-values of the variables X1 and X2, respectively 2.03× and < 2× , are less than 0.05. It can be concluded that the variables X1 and X2 have a significant effect on the Y variable. Robust Regression parameter estimation using MM estimation is shown in Table 8. Table 8 shows that if there is a one percent increase in X1, then Y will increase by 0.005761%, while if there is a one percent increase in X2, then Y will increase by 1.722%. Furthermore, the significance test for variables X1 and X2 is presented in Table 9. Based on Table 9, the p-values of the variables X1 and X2, both < 2× , are less than 0.05, so it can be concluded that the variables X1 and X2 have a significant effect on the Y variable.

Selection of the Best Estimate
The best estimate is the one with the smallest RSE value and the largest Adjusted R-square value; these values are presented in Table 10. The Robust Regression Model with S estimation has an Adjusted R-square of 0.98, meaning that 98% of the dependent variable can be explained by the independent variables in the model, while the rest is explained or influenced by other variables outside the model.
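The two selection criteria can be sketched with their conventional definitions: RSE as the square root of the residual sum of squares over the residual degrees of freedom, and Adjusted R-square as R-square penalized for the number of predictors. The helper names, the toy residuals, and the candidate values below are hypothetical. Robust regression software may report a robust scale estimate instead of this RSE, so the sketch is not necessarily the exact computation behind Table 10.

```python
import math

def rse(residuals, n_predictors):
    """Residual standard error with n - p - 1 degrees of freedom."""
    dof = len(residuals) - n_predictors - 1
    return math.sqrt(sum(e * e for e in residuals) / dof)

def adjusted_r2(y, residuals, n_predictors):
    """Adjusted R-square from the observed values and the residuals."""
    n = len(y)
    my = sum(y) / n
    sst = sum((v - my) ** 2 for v in y)
    sse = sum(e * e for e in residuals)
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

def pick_best(candidates):
    """Name with the smallest RSE; ties go to the larger Adjusted R-square."""
    return min(candidates, key=lambda k: (candidates[k][0], -candidates[k][1]))

# Toy values, for illustration only (not the study's data).
y_toy = [1, 2, 3, 4, 5, 6]
resid_toy = [1, -1, 1, -1, 1, -1]
toy_rse = rse(resid_toy, n_predictors=2)
toy_adj = adjusted_r2(y_toy, resid_toy, n_predictors=2)

# Hypothetical (RSE, Adjusted R-square) pairs for three candidate fits.
candidates = {"fit_a": (10.0, 0.90), "fit_b": (8.0, 0.95), "fit_c": (9.0, 0.93)}
best = pick_best(candidates)
print(toy_rse, toy_adj, best)
```

Ordering by RSE first and Adjusted R-square second is one possible tie-breaking rule; in the study both criteria point to the same estimate.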

CONCLUSION
This study was conducted to overcome the problem of regression analysis when the data assumptions are not met and the data contain outliers, by comparing the M estimate, the S estimate, and the MM estimate from robust regression. Based on the analysis results, it is concluded that the factors that most influence criminal acts in Indonesia in 2019 should be determined using the robust regression method, because several test assumptions are not met and there are outliers in the data. The Robust Regression Model with S estimation is the best model. Its Adjusted R-square is 0.98, meaning that 98% of the dependent variable can be explained by the independent variables in the model, while the rest is explained or influenced by other variables outside the model.