ISSN No:-2456-2165
Abstract:- The increasingly fast internet network and the rapidly growing digital world have a very big influence on every aspect of human life, not limited to people whose jobs are in the field of information technology but also those outside it. The massive development of the digital ecosystem and the arrival of the Industry 4.0 era mean that more and more data is available on the internet. This large amount of available data raises various problems in terms of how to process and analyze it so that it can be useful in human life, both for individuals and companies and in fields such as education and health. With Machine Learning technology, problems in processing and analyzing large amounts of data can be solved more quickly than by doing so manually. The more data that is processed, the better the performance of Machine Learning in conducting the analysis. The chosen algorithm also affects the performance of Machine Learning. The Author will use Google's services in this research, namely Google Colaboratory. The Author will also compare two algorithms, XG Boost and Support Vector Machine, and carry out a feature selection process, using Pearson to find factors that have a high correlation value. Based on the results of predicting employee turnover using Machine Learning with the two algorithms, it can be concluded that XG Boost obtained an accuracy of 86% and Support Vector Machine 84%.

Keywords:- Machine learning, pearson, xg boost, support vector machine, employee turnover.

I. INTRODUCTION

The increasingly fast internet network and the rapidly growing digital world have a very big influence on every aspect of human life, not limited to people whose jobs are in the information system field but also those outside it. The massive development of the digital ecosystem and the arrival of the Industry 4.0 era mean that more and more data is available on the internet. A large amount of available data can be an advantage and can also be detrimental, depending on the perspective from which it is viewed. It raises various problems in terms of how to process and analyze the data so that it can be useful in human life, both for individuals and companies and in fields such as education and health. An example of a problem encountered within a company is employee turnover: if the employee turnover rate in a company is 5% of the total employees, the costs incurred by the company can be estimated at around 1.5 times the annual income of an employee [1].

With Machine Learning technology, problems in processing and analyzing large amounts of data can be solved more quickly than by doing so manually. Machine Learning uses a computer to run a program [2]. The computer in this program learns automatically from an analysis of the data entered, and the analysis is carried out with an algorithm that produces certain mathematical patterns. Patterns determined from the results of this learning process can be used to determine the characteristics of new data by comparing them with patterns that already existed [3]. Machine learning can be defined as computer applications and mathematical algorithms that learn from data and generate predictions about the future [4]. There are three types of machine learning: Unsupervised and Supervised Learning [5], and Reinforcement Learning [6]. The more data that is processed, the better the performance of Machine Learning in carrying out the analysis; the specified algorithm also affects the performance of Machine Learning.

Processing and analyzing data with heavy use of Machine Learning also requires reliable and stable computing resources, so that the data analysis process is more reliable and accurate. To obtain these computing resources, we can rely on cloud computing, a paradigm in which the information technology infrastructure is managed by the service provider [7]. Cloud providers that offer such services globally include Azure Cloud, Amazon Web Services and Google.

There are several studies related to predicting employee turnover. A summary of some of these studies is presented in Table 1 below.
The author will use Google's services in this research, namely Google Colaboratory (Google Colab). Google Colab is a modified form of Jupyter Notebook provided by Google, and the platform is often used for Machine Learning and Deep Learning [13]. Google Colab supports many commands and provides memory and storage per user [14]. The author will compare two algorithms, namely XG Boost and Support Vector Machine, to determine which gives the highest accuracy, and will use the Pearson method to find features that have a high correlation.

II. RESEARCH METHOD

Data may contain more information than is needed to build a model, or may contain wrong information [11]. For example, in a dataset with 50 columns, many columns may contain data duplicated from other columns. This affects the quality and efficiency of the resulting data model.

If the dataset does not contain duplicate data, the next step is to check whether there are missing values. If a row or column with a missing value is found, it is necessary to find a way so that the data is no longer empty. Ways to handle missing values include deleting the affected rows or filling in the empty values.
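The two options for handling missing values named above, deleting rows or filling in the empty values, can be sketched with pandas as follows. This is a minimal illustration only: the DataFrame, its column names and values are hypothetical, and the choice of median and most frequent value as fill values is an assumption, not something stated in this study.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns and values) with two missing entries.
df = pd.DataFrame({
    "Age": [29, np.nan, 41, 35],
    "Department": ["Sales", "HR", None, "Sales"],
})

# Option 1: delete every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill the gaps instead. Assumed fill rule: the median for the
# numeric column and the most frequent value for the categorical column.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Department"] = filled["Department"].fillna(filled["Department"].mode()[0])

print(dropped)
print(filled)
```

Deleting rows is simpler but discards data, while filling keeps every row at the cost of introducing estimated values.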
So when modeling, the author builds the two algorithms and then runs the training process with data that has previously been processed. If the modeling is complete,
Fig. 2: Command for checking column, data type, and null data
Fig. 12: Line graphic plot of the distribution of features against attrition
C. Cleansing Data
The next stage is data cleansing. The dataset first needs to be checked for duplicate data. If the dataset does not contain duplicate data, the next step is to check whether there are missing values. If a row or column with a missing value is found, it is necessary to find a way so that the data is no longer empty. Then proceed with removing features that do not play an important role, if deemed necessary.

The command for checking duplicate data and the results are shown in Fig. 19.
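As an illustration of the duplicate check, the following is a minimal pandas sketch. The toy DataFrame and the `Age` column are hypothetical, and the exact command used in this study is the one shown in the figure, which may differ.

```python
import pandas as pd

# Toy data (hypothetical values): the third row repeats the second one.
df = pd.DataFrame({
    "EmployeeNumber": [1, 2, 2, 3],
    "Age": [29, 41, 41, 35],
})

# duplicated() flags each row that is identical to an earlier row.
n_duplicates = int(df.duplicated().sum())

# If duplicates are found, keep only the first occurrence of each row.
deduplicated = df.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_duplicates)
```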
Furthermore, to carry out the missing-value check, the command used is as shown in the image below.
The results of the check show that the dataset has no missing values.
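A missing-value check of this kind is commonly written in pandas as below. This is a sketch under assumptions: the toy DataFrame is hypothetical and deliberately contains one empty cell so that the check has something to report, whereas the dataset in this study was found to have none.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one missing value in the Age column.
df = pd.DataFrame({
    "Age": [29, np.nan, 41],
    "Attrition": ["Yes", "No", "No"],
})

# Count the missing values in every column, then in the whole dataset.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total missing values:", total_missing)
```

A total of zero corresponds to the result reported here, namely that the dataset has no missing values.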
Then the data is checked again to see whether any features need to be dropped. In this dataset, some features need to be dropped because they have the same value in all rows. The features to be dropped are EmployeeCount, StandardHours and Over18.
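One way to find and drop features that have the same value in all rows is sketched below. The column names EmployeeCount, StandardHours and Over18 come from the text; the `Age` column and all values in the toy DataFrame are hypothetical, and detecting the columns with `nunique()` is an assumption about how the check could be done.

```python
import pandas as pd

# Toy data: only Age varies; the other three columns are constant.
df = pd.DataFrame({
    "Age": [29, 41, 35],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "Over18": ["Y", "Y", "Y"],
})

# A column with a single distinct value carries no information for a model.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]

df = df.drop(columns=constant_cols)
print("dropped:", constant_cols)
```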
After that, the process continues by converting the EmployeeNumber data into a string with the following command.
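The conversion can be sketched in pandas as follows. The toy values are hypothetical. A plausible reason for the conversion, stated here as an assumption, is that EmployeeNumber is an identifier and should not be treated as a numeric quantity.

```python
import pandas as pd

# Toy data (hypothetical values).
df = pd.DataFrame({"EmployeeNumber": [1, 2, 4], "Age": [29, 41, 35]})

# Cast the identifier column from integer to string.
df["EmployeeNumber"] = df["EmployeeNumber"].astype(str)

print(df["EmployeeNumber"].tolist())
```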
D. Transformation Data
After the data cleansing process is complete, the outliers are removed. Outliers are data points whose values are very far from the general values, or in other words, extreme values. The following is the command that is run to remove the outliers.
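The text does not state which rule is used to decide that a value is extreme. As an illustration only, the sketch below assumes the common interquartile range (IQR) rule, which treats values more than 1.5 IQR outside the quartiles as outliers. The `MonthlyIncome` column and its values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): the last value is far from the rest.
df = pd.DataFrame({"MonthlyIncome": [2000, 2500, 2700, 3000, 3200, 3500, 50000]})

# Assumed rule: keep values within 1.5 * IQR of the first and third quartiles.
q1 = df["MonthlyIncome"].quantile(0.25)
q3 = df["MonthlyIncome"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

no_outliers = df[df["MonthlyIncome"].between(lower, upper)]
print(len(df) - len(no_outliers), "outlier(s) removed")
```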
The process then continues with feature encoding. Categorical data is transformed into a form that can be understood by the system: character data is converted into numeric data. The encoding process uses One-Hot Encoding. The following is the command that is executed during the encoding process.
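The text names One-Hot Encoding for this step, while the next step refers to a completed label encoding pass. Assuming that a label-style encoding of two-valued character columns is what is meant here, a minimal sketch is given below. The `OverTime` column and all values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values) with two Yes/No character columns.
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "OverTime": ["No", "Yes", "No"],
})

# Assumed mapping: replace each character label with an integer code.
for col in ["Attrition", "OverTime"]:
    df[col] = df[col].map({"No": 0, "Yes": 1})

print(df)
```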
After the label encoding process is complete, the process continues with data encoding using One-hot Encoding. The command for performing the One-hot Encoding process is shown in Fig. 27.
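One-hot Encoding replaces a categorical column with one 0/1 indicator column per category. A minimal sketch with `pandas.get_dummies` is shown below. The `Department` column and its values are hypothetical, and the command actually used in this study is the one in the figure.

```python
import pandas as pd

# Toy data (hypothetical values) with one multi-valued categorical column.
df = pd.DataFrame({
    "Age": [29, 41, 35],
    "Department": ["Sales", "HR", "Sales"],
})

# Each category of Department becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["Department"], dtype=int)

print(encoded)
```

Unlike integer label codes, the indicator columns do not impose an artificial order on the categories.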
E. Split Data
G. Modeling
After the data processing is complete, the data is ready to be used in the modeling phase. Because two algorithms are compared, namely XG Boost and Support Vector Machine, it is necessary to model both algorithms and then carry out the experimental process with the previously processed data.

If the modeling is complete, the next step is to carry out the trial process. The trial process is expected to show that the system can provide optimal predictive results. The following is the command used to carry out the testing process and its results.
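The modeling and testing steps can be sketched as follows. Everything in this sketch is an assumption made for illustration: it uses synthetic data from scikit-learn instead of the employee dataset, an 80/20 split, default hyperparameters, and feature scaling before the Support Vector Machine. If the `xgboost` package is not installed, the sketch falls back to scikit-learn's gradient boosting classifier, which is a different implementation. The scores it prints are not the results of this study.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

try:
    from xgboost import XGBClassifier
    boosted = XGBClassifier(n_estimators=100, random_state=42)
except ImportError:
    # Fallback (not XG Boost itself): scikit-learn gradient boosting.
    from sklearn.ensemble import GradientBoostingClassifier
    boosted = GradientBoostingClassifier(random_state=42)

# Synthetic stand-in for the processed dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Assumed split: 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "XG Boost": boosted,
    # SVMs are sensitive to feature scale, so scale before fitting.
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)      # training process
    pred = model.predict(X_test)     # testing process
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
    }
    print(name, results[name])
```

Training both models on the same split keeps the comparison of accuracy, precision and recall fair.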
The initial hypothesis of this study is that the two algorithms used (XG Boost and Support Vector Machine) will achieve an accuracy of more than 80%.

V. CONCLUSION

Based on the results of prediction research using Machine Learning comparing XG Boost and Support Vector Machine, it can be concluded that XG Boost obtained an accuracy of 86% and Support Vector Machine 84%. In terms of accuracy, much of the predicted data is correct. The precision indicates a low number of false positives, and the recall indicates a low number of false negatives.

[1] S. Yadav, A. Jain, and D. Singh, “Early Prediction of Employee Attrition using Data Mining Techniques,” Proc. 8th Int. Adv. Comput. Conf. IACC 2018, no. April, pp. 349–354, 2018, doi: 10.1109/IADCC.2018.8692137.
[2] D. Prasetyawan and R. Gatra, “Model Convolutional Neural Network untuk Mengukur Kepuasan Pelanggan Berdasarkan Ekspresi Wajah,” J. Tek. Inform. dan Sist. Inf., vol. 8, no. 3, pp. 661–673, 2022, doi: 10.28932/jutisi.v8i3.5493.
[3] Ö. Çelik, “A Research on Machine Learning Methods and Its Applications,” J. Educ. Technol. Online Learn., vol. 1, no. 3, pp. 25–40, 2018, doi: 10.31681/jetol.457046.