Open Problems in Indonesian Automatic Essay Scoring System

Introduction
Every learning process requires evaluation to measure students' level of understanding. There are many types of evaluation, including multiple-choice questions, short-answer questions, and essay questions. Some studies have shown that essay questions are better than the other types when a student's knowledge must be evaluated thoroughly [1]. However, rating essays is time-consuming: the teacher has to read and evaluate each student answer sentence by sentence. Nowadays, many information technologies are being developed to automate human activities, and in education one example is essay grading. Researchers have studied automated essay scoring (AES) since the 1960s [2]. Automated grading offers many advantages over conventional grading. It is reported that teachers in Britain spend about 30% of their time scoring students' answers, at a cost of about 30 billion pounds per year [3]. The application of an automated essay scoring system would therefore bring many benefits. Automated essay scoring systems have been developed with many different methods. However, no study indicates which method is better for automated essay scoring, especially for Indonesian. Previous research [4] reports the average errors of several methods commonly used in automated essay scoring for Indonesian. The average error of each method is calculated by comparing the scores from human raters with the scores from the system. The methods are Cosine Similarity, Euclidean Distance, and Jaccard. The results show that Jaccard is the best approach, but its average error is still high, at more than 50%. This paper therefore presents several ideas that can be explored further on this issue, taking advantage of the openly available dataset at http://dx.doi.org/10.17632/6gp8m72s9p.1 [5].
Several evaluations can be carried out by changing parameters such as feature extraction, similarity metric, learning algorithm, implementation environment, and performance evaluation. This paper is divided into several chapters. Chapter 1 is the introduction. Chapter 2 summarizes the previous study in English, because the report by Roshinta and Rahutomo [4] is written in Indonesian. Chapter 3 explores further ideas and open problems on this issue. Finally, Chapter 4 concludes the paper.
Roshinta and Rahutomo [4] propose a web-based automated essay scoring system for Indonesian. The research also develops a dataset for performance evaluation [5]. The study consists of several phases. First, the dataset is developed; it contains question texts with corresponding answer texts, and the questions are classified into four categories: lifestyle, politics, sport, and technology. Second, the web-based automated essay scoring system is developed. Third, student respondents answer the questions through the web-based application, and the system calculates the scores with three methods. Fourth, the students' answers are scored manually by three lecturer respondents; the average of the three lecturers' scores is the final score and serves as the gold standard. Finally, the average percentage error between the manual scores and the system scores is calculated for each method. This chapter summarizes that research on an Indonesian essay scoring system in English.

The dataset
The dataset used in this study is described in Table 1. It contains 40 question texts divided into 4 categories (politics, lifestyle, sport, and technology), with 10 question texts per category. Roshinta and Rahutomo [4] also provide the answer key texts for the questions. An example of an Indonesian question text is, "Jelaskan kegunaan karbohidrat untuk tubuh kita". The corresponding Indonesian answer key text is, "Fungsi karbohidrat adalah sebagai pemasok energi, dapat memperlancar proses pada pencernaan, memberikan efek kenyang dengan kandungan selulosa-nya dan penyeimbang asam dan basa dalam tubuh". The respondents answering the questions are around 50 second-year students of the Information Technology Department of State Polytechnic of Malang. An example of an Indonesian student answer to the question above is, "sumber tenaga, pemanis alami, menjaga sistem imun, dan sebagai keseimbangan tubuh". In total, 2,162 Indonesian answer texts were collected from the respondents. Several problems occurred during the answer collection phase, and several measures were taken to address them. Duplicate texts, which arise when a respondent submits the same entry to the system more than once, are filtered down to a single unique text. Sometimes a student answers a question carelessly; further investigation attributes this to the student's lack of knowledge about the general issue raised in the question text. Another condition is that the number of answer texts differs between questions, because the students did not always attend the data collection sessions. The data can be downloaded freely from Mendeley Data [5]. Three lecturer respondents score the students' answers manually, each giving a score between 0 and 100. The final manual score is the average of the three scores.
The value of each vector component depends on the existence of a term in a document. Beyond mere existence, it also depends on the term weight, which can be obtained from the term frequency. Once a document is represented as a vector, mathematical operations can be applied to it.
Preprocessing transforms the text data into numerical data. It consists of several steps: case folding (converting the text to lowercase), tokenizing (splitting the text into words), stemming (reducing words to their root form), stopword removal (removing words that are not necessary), and term frequency calculation. In the previous study, preprocessing is carried out in two ways: with stemming and without stemming. Tokenizing is based on whitespace, with no n-gram consideration. For the evaluation with stemming, the study uses the Nazief and Adriani stemming algorithm [8], which is designed for Indonesian. The study uses the Tala list [9] for Indonesian stopword removal. Finally, the study calculates the term frequencies of the text and generates its term vector.
Term weights can be obtained from global weighting (which considers the other texts or documents) or local weighting (which considers only the text itself). The study uses local weighting, expressed as normalized term frequency. The normalized term frequency is the frequency fij of term i in document j divided by the total frequency of all terms in that document [6] [7]. The local term weight of term i in document j (wij) is defined in Equation 1.

$$w_{ij} = \frac{f_{ij}}{\sum_{k} f_{kj}} \tag{1}$$
Table 2 shows an example of the term frequency weights of the term vectors in a 5x4 matrix. The rows represent the documents and the columns represent the terms. According to Table 2, the terms are term 1, term 2, term 3, and term 4. The documents are the answer key, student1's answer, student2's answer, student3's answer, and student4's answer. A text is therefore represented as a vector by reading the matrix horizontally.
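The preprocessing steps and the normalized term frequency of Equation 1 can be sketched as follows. This is a minimal illustration, not the implementation of the previous study: the stopword set is a tiny hypothetical sample rather than the Tala list [9], the `stem` function is a placeholder that returns the word unchanged, and stripping punctuation from tokens is an added assumption.

```python
from collections import Counter

# Tiny illustrative stopword sample (hypothetical); the study uses the Tala list [9].
STOPWORDS = {"dan", "yang", "untuk", "adalah", "sebagai", "dalam", "pada", "dengan"}


def stem(word):
    """Placeholder stemmer (assumption): returns the word unchanged.

    Replace with a real Indonesian stemmer for the 'with stemming' variant.
    """
    return word


def preprocess(text, use_stemming=False):
    """Case folding, whitespace tokenizing, optional stemming, stopword removal."""
    tokens = text.lower().split()
    # Assumption: strip surrounding punctuation so "tenaga," becomes "tenaga".
    tokens = [t.strip(".,;:!?\"'()") for t in tokens]
    if use_stemming:
        tokens = [stem(t) for t in tokens]
    return [t for t in tokens if t and t not in STOPWORDS]


def normalized_tf(tokens):
    """Local weight w_ij = f_ij / sum_k f_kj for a single document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: freq / total for term, freq in counts.items()}


answer = "sumber tenaga, pemanis alami, menjaga sistem imun, dan sebagai keseimbangan tubuh"
weights = normalized_tf(preprocess(answer))
```

With this sample stopword set, the student answer keeps nine distinct terms, so each term receives the same weight and the weights of the document sum to one.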

Similarity metric
The study uses three similarity metrics between two vectors: Cosine Similarity, Euclidean Distance, and Jaccard. Cosine Similarity is not derived from the lengths of the vectors but from the angle between them [6] [7]. Cosine Similarity can be calculated by Equation 2, where a_i and b_i are the weights of term i in the two term vectors being compared.
$$\cos(\theta) = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}} \tag{2}$$
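The three metrics can be sketched on sparse term vectors stored as dictionaries that map a term to its weight. The definitions below are the standard textbook ones. Using the set-based form of Jaccard (shared terms over all terms) is an assumption, since the exact variant used in the previous study is not stated here, and the example vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for an empty vector
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two sparse term vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def jaccard_similarity(a, b):
    """Set-based Jaccard: shared terms divided by all terms (assumed variant)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical normalized term-frequency vectors.
answer_key = {"energi": 0.5, "tubuh": 0.5}
student = {"energi": 0.5, "imun": 0.5}

cos_value = cosine_similarity(answer_key, student)
euc_value = euclidean_distance(answer_key, student)
jac_value = jaccard_similarity(answer_key, student)
```

Note that Cosine Similarity and Jaccard grow with similarity, while Euclidean Distance shrinks, so a distance has to be converted before it can serve as a score. How the previous study performs that conversion is not described in this summary.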

Percentage error
The percentage errors of Cosine Similarity, Euclidean Distance, and Jaccard are shown in Figure 1. The gold standard in this experiment is the average of the manual scores given by the three lecturers, as described previously. Figure 1 shows that Jaccard with stemming has the lowest percentage error, 52.31%. Jaccard without stemming has an error that is not much different, 56.05%. Cosine Similarity has a slightly higher error than Jaccard, both with and without stemming: the difference is around 1.94% with stemming and around 7.18% without stemming. Euclidean Distance has the highest percentage error, around 339.41% without stemming and around 332.90% with stemming. Figure 2 shows the percentage errors for each question (1-40). The results clearly show that Euclidean Distance always has a higher error than Cosine Similarity and Jaccard. The next subchapters describe other analyses of the experiment results.
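The error calculation can be sketched as follows. The exact formula of the previous study is not given in this summary, so the sketch assumes a mean absolute percentage error relative to the manual score. Under that assumption, an error above 100% is possible whenever a system score is far larger than the manual score. The function names and the scores are hypothetical.

```python
def gold_standard(lecturer_scores):
    """Average of the manual scores given by the lecturers for one answer."""
    return sum(lecturer_scores) / len(lecturer_scores)


def mean_percentage_error(manual_scores, system_scores):
    """Mean of |system - manual| / manual * 100 (assumed error definition)."""
    errors = []
    for manual, system in zip(manual_scores, system_scores):
        if manual == 0:
            continue  # the relative error is undefined for a manual score of 0
        errors.append(abs(system - manual) / manual * 100.0)
    return sum(errors) / len(errors) if errors else 0.0


# Hypothetical scores on a 0-100 scale.
manual = [80.0, 50.0, 20.0]
system = [60.0, 75.0, 90.0]
error = mean_percentage_error(manual, system)
```

In this example the individual errors are 25%, 50%, and 350%, which shows how a single overestimated low score can dominate the average.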

Student stability
The analysis of student stability examines how each student's scores vary across questions. Not all of the student data is shown in this analysis; only several samples are taken. Figure 3 shows a graph of student stability in this study. It shows that a student sometimes has a high score and sometimes a low score, which indicates that the students answered the questions based on their ability. There is no pattern of a student always having a high score or always having a low score.

Percentage error based on question type
The analysis of percentage error based on question type compares the error for questions with a free answer and questions with a definite answer. In this study, there are 6 definite-answer questions and 34 free-answer questions. The numbers of questions are not equal, but this study only looks at the comparison between methods. Figure 4 shows the percentage error based on question type. According to Figure 4, the percentage errors of all methods are lower for definite-answer questions than for free-answer questions. Jaccard with stemming has the lowest error, 25.35%. On the other hand, Euclidean Distance has the highest error, more than 100%.

Each category analysis
The analysis of question categories shows which category has the highest average score. The higher the average score, the better the students' ability in that category. Figure 5 clearly shows that the politics category has a higher average score than lifestyle, sport, and technology. The lifestyle category has the lowest average score. The technology and sport categories have only slightly different values.

Open problems
This chapter presents the open problems in this research area. Because the standard similarity metrics performed poorly in the previous study, further investigation is needed. The ideas are described as follows.

Feature extraction
There are many approaches to feature extraction from texts, and several evaluations can be run on the dataset with different approaches. During tokenization, considering n-grams [10] may be interesting because an Indonesian lemma can consist of more than one word. Bigrams and trigrams are considered best for Indonesian. Standardizing terms with a dictionary [11] is another interesting approach, since a respondent may have made a typographical error. A respondent may also type a slang or nonstandard variant of a correct term. During dictionary-based standardization, terms can also be simplified using the synonym sets of a thesaurus [11] and word sense disambiguation techniques [12]. Removing unnecessary symbols or extracting the correct information with regular expressions is another option. More advanced preprocessing based on the part of speech of terms in a sentence is worth considering, since an Indonesian part-of-speech tagger is already available [13]. The previous study uses the Nazief and Adriani stemming algorithm for Indonesian [8]. The algorithm obtains the root of Indonesian words. Its principle is rather different from that of a famous English stemming algorithm, the Porter stemmer. Further research could use the Indonesian Porter stemmer algorithm [9] and compare the results.
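Word n-gram tokenization can be sketched as follows. The function names and the example tokens are illustrative only, and combining unigrams with higher-order n-grams into one feature list is an assumption about how such features might be used.

```python
def word_ngrams(tokens, n):
    """Return the word n-grams of a token list, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, max_n=3):
    """Unigrams, bigrams, ... up to max_n-grams in a single feature list."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(word_ngrams(tokens, n))
    return features


tokens = ["menjaga", "sistem", "imun", "tubuh"]
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
```

With bigrams, a multiword expression such as "sistem imun" becomes one feature instead of two unrelated terms, at the cost of a larger and sparser vocabulary.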
If dimensionality is a concern, then stopword removal is an important preprocessing step. The previous study uses the Tala list [9]. Other stopword lists are available for Indonesian, namely those developed by Wibisono [14] and Doyle [15]. Filtering terms based on their frequency is another way to reduce the dimension. A specific threshold can be defined, and evaluations with different threshold values would be interesting to investigate further.
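Frequency-based term filtering can be sketched as follows. The text does not say which frequency is meant, so the sketch assumes document frequency (the number of answers that contain a term). The `min_df` threshold and the example answers are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2):
    """Keep only the terms that occur in at least min_df documents."""
    document_frequency = Counter()
    for tokens in docs:
        document_frequency.update(set(tokens))  # count each term once per document
    vocabulary = {t for t, df in document_frequency.items() if df >= min_df}
    filtered = [[t for t in tokens if t in vocabulary] for tokens in docs]
    return filtered, vocabulary


docs = [
    ["sumber", "energi", "tubuh"],
    ["energi", "imun"],
    ["energi", "tenaga", "tubuh"],
]
filtered_docs, vocabulary = filter_by_document_frequency(docs, min_df=2)
```

Raising `min_df` shrinks the vocabulary further, so the same experiment can be repeated over several thresholds to observe the effect on the scoring error.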
For the weighting scheme, a global weight such as inverse document frequency (IDF) [16][17] can be evaluated as well. Many other weighting schemes are available, such as the probabilistic retrieval BM25 family [18]. The word2vec vector scheme [19] is also interesting for this system, since a word2vec matrix is dense, unlike the sparse matrix of conventional term vectors.
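Combining the local weight of the previous study with a global IDF weight can be sketched as follows. The sketch uses the plain form idf = log(N / df) without smoothing, which is one of several common variants and an assumption here. The example documents are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(docs):
    """idf(t) = log(N / df(t)), with N documents and df(t) documents containing t."""
    n_docs = len(docs)
    document_frequency = Counter()
    for tokens in docs:
        document_frequency.update(set(tokens))
    return {t: math.log(n_docs / df) for t, df in document_frequency.items()}


def tf_idf(tokens, idf):
    """Normalized term frequency (local weight) times IDF (global weight)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (f / total) * idf.get(t, 0.0) for t, f in counts.items()}


docs = [
    ["energi", "tubuh"],
    ["energi", "imun"],
    ["energi", "tenaga", "tubuh"],
]
idf = inverse_document_frequency(docs)
weights = tf_idf(docs[0], idf)
```

A term that appears in every answer, such as "energi" in this example, receives a weight of zero, so the comparison between an answer and the answer key is driven by the more distinctive terms.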

Similarity metric
Several distance and similarity metrics other than Cosine, Euclidean, and Jaccard are available as well. For distance, other schemes include Manhattan, Minkowski, Hamming, Jaro-Winkler, Kendall, Lee, and Levenshtein. For similarity, other schemes include Dice and Adamic. Further research on the performance of these metrics would be interesting as well.
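Three of the alternatives listed above can be sketched with their standard definitions: Manhattan distance on term vectors, the Dice coefficient on term sets, and Levenshtein edit distance on strings. The example vectors and the misspelled word are hypothetical.

```python
def manhattan_distance(a, b):
    """Sum of absolute weight differences between two sparse term vectors."""
    return sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in set(a) | set(b))


def dice_similarity(terms_a, terms_b):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|) on term sets."""
    set_a, set_b = set(terms_a), set(terms_b)
    if not set_a and not set_b:
        return 0.0
    return 2.0 * len(set_a & set_b) / (len(set_a) + len(set_b))


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


answer_key = {"energi": 0.5, "tubuh": 0.5}
student = {"energi": 0.5, "imun": 0.5}
manhattan = manhattan_distance(answer_key, student)
dice = dice_similarity(answer_key, student)
typo_distance = levenshtein("karbohidrat", "karbohidart")
```

The vector and set metrics compare whole answers, whereas a character-level metric such as Levenshtein is better suited to matching a misspelled term against a dictionary entry.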
Semantic similarity approaches such as latent semantic analysis (LSA) [20] or explicit semantic analysis (ESA) [21] can be explored as well. The working principle of LSA is based on a statistical approach, namely singular value decomposition. Involving the Indonesian WordNet [22] (if available) with different taxonomy-based similarity schemes, such as Wu and Palmer or Lesk, is also interesting.

Learning algorithm
Machine learning approaches also seem applicable to this problem. A quantitative approach such as linear regression can be used. With an additional threshold for pass or fail, or with marks such as A, B, and C, the problem can be evaluated as categorical classification. Many classifiers are available, such as support vector machine (SVM), naive Bayes classifier (NBC), decision trees and their variations, KNN, or logistic regression [23] [24]. Novel deep learning approaches are tempting to test on this problem as well [25]. The computational cost of evaluating learning algorithms can be reduced by dimensionality reduction or feature selection. Dimensionality reduction with singular value decomposition (SVD) [26] [20] or principal component analysis (PCA) [27] is interesting as well.
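The regression idea can be sketched with a one-variable least-squares fit that maps a similarity value to a manual score, followed by a mapping to marks. The training pairs and the mark thresholds below are hypothetical and only illustrate the mechanics.

```python
def fit_linear_regression(x, y):
    """Ordinary least squares for y = slope * x + intercept with one feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def to_mark(score):
    """Map a 0-100 score to a mark (hypothetical thresholds)."""
    if score >= 80:
        return "A"
    if score >= 65:
        return "B"
    if score >= 50:
        return "C"
    return "fail"


# Hypothetical training pairs: similarity value and average manual score.
similarities = [0.2, 0.4, 0.6, 0.8]
manual_scores = [30.0, 50.0, 70.0, 90.0]
slope, intercept = fit_linear_regression(similarities, manual_scores)
predicted = slope * 0.5 + intercept
mark = to_mark(predicted)
```

Fitting the mapping from similarity to score on part of the dataset, instead of using the raw similarity as the score, is one way such a learning step could reduce the percentage error. It would need to be evaluated on held-out answers.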

Environment implementation
The previous study implements the automatic grading system as a web-based application with the PHP CodeIgniter framework. Developing a mobile or desktop application is possible as well.
Other approaches, such as separating the front end from the back end, or a web service implementation with Node.js and Angular, are also interesting. Several programming languages, such as Python, Java, and VB, are further implementation options to explore.

Performance evaluation
The previous study uses only the average error for performance evaluation. Statistical measures such as the correlation coefficient [28] can be used, as well as other correlation schemes. The standard deviation is interesting as well. Other important performance measures are precision, recall, and accuracy [7].
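Two of these measures can be sketched as follows: the Pearson correlation coefficient between manual and system scores, and precision, recall, and accuracy for a pass/fail decision. The scores and the pass/fail labels are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    spread_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    spread_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0  # convention when one list is constant
    return covariance / (spread_x * spread_y)


def classification_metrics(gold, predicted):
    """Precision, recall, and accuracy for boolean labels (True means pass)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return precision, recall, accuracy


manual = [80.0, 50.0, 20.0, 70.0]
system = [75.0, 55.0, 30.0, 60.0]
correlation = pearson_correlation(manual, system)

gold_pass = [True, True, False, True]
system_pass = [True, False, False, True]
precision, recall, accuracy = classification_metrics(gold_pass, system_pass)
```

Correlation complements the average error: a system whose scores are consistently shifted from the manual scores can have a high error and still rank the students in the same order as the lecturers.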

Conclusion
This paper has summarized the study by Roshinta and Rahutomo in English. It has also described several ideas for further research on the issue, covering feature extraction, similarity metric, learning algorithm, implementation environment, and performance evaluation. Hopefully, this paper motivates other researchers to work on Indonesian automatic essay grading systems and to improve the learning experience in the classroom.