Attrition Analysis Using XG Boost and Support Vector Machine Algorithms

Volume 8, Issue 6, June 2023 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165
Bandung Prihanto
Graduate Student
Atma Jaya Catholic University of Indonesia Jakarta, Indonesia
Catherine Olivia Sereati, Maria A. Kartawidjaja, Marsul Siregar

Department of Electrical Engineering,
Atma Jaya Catholic University of Indonesia Jakarta, Indonesia
Abstract:- The increasingly fast internet and the rapidly growing digital world have a major influence on every aspect of human life, affecting not only jobs in the field of information technology but also those outside it. The massive development of the digital ecosystem and the arrival of the Industry 4.0 era mean that more and more data is available on the internet. This large amount of data raises various problems regarding how to process and analyze it so that it can be useful in human life, both for individuals and companies and in fields such as education and health. With Machine Learning technology, problems in processing and analyzing large amounts of data can be solved more quickly than by doing so manually. The more data that is processed, the better the performance of Machine Learning in conducting the analysis. The chosen algorithm also affects the performance of Machine Learning. In this research, the author uses Google's service Google Colaboratory. The author also compares two algorithms, XG Boost and Support Vector Machine, and carries out a feature selection process, using the Pearson method to find factors that have a high correlation value. Based on the results of predicting employee turnover with Machine Learning by comparing the two algorithms, it can be concluded that XG Boost obtained an accuracy of 86% and Support Vector Machine 84%.

Keywords:- Machine learning, Pearson, XG Boost, support vector machine, employee turnover.

I. INTRODUCTION

The increasingly fast internet and the rapidly growing digital world have a major influence on every aspect of human life, affecting not only jobs in the information systems field but also those outside it. The massive development of the digital ecosystem and the arrival of the Industry 4.0 era mean that more and more data is available on the internet. A large amount of available data can be an advantage, but it can also be detrimental when viewed from different perspectives. It raises various problems regarding how to process and analyze the data so that it can be useful in human life, both for individuals and companies and in fields such as education and health. An example of a problem encountered within a company is employee turnover: if the employee turnover rate in a company is 5% of the total employees, the costs incurred by the company can be estimated at around 1.5 times the annual income of an employee [1].

With Machine Learning technology, problems in processing and analyzing large amounts of data can be solved more quickly than by doing so manually. Machine Learning utilizes a computer to run a processed program [2]. In this program, the computer learns automatically from an analysis of the data entered; the analysis is carried out with an algorithm that produces certain mathematical patterns. The patterns obtained from this learning can be used to determine the characteristics of new data by comparing it with the patterns found previously [3]. Machine learning can be defined as computer applications and mathematical algorithms that learn from data and generate predictions about the future [4]. There are three types of machine learning: Supervised Learning and Unsupervised Learning [5], and Reinforcement Learning [6]. The more data that is processed, the better the performance of Machine Learning in carrying out the analysis; the chosen algorithm also affects this performance.

Processing and analyzing large amounts of data with Machine Learning also requires reliable and stable computing resources, so that the data analysis process is more reliable and accurate. These computing resources can be obtained through cloud computing, a paradigm in which the information technology infrastructure is managed by the service provider [7]. Global cloud providers include Azure Cloud, Amazon Web Services and Google.

There are several studies related to predicting employee turnover. A summary of some of these studies is presented in Table 1 below.
IJISRT23JUN1315 www.ijisrt.com 2096

Table 1: Study Literature
No. | Title | Discussion
1 | HR Analytics: Employee Attrition Analysis Using Logistic Regression [8] | This study discusses the prediction of employee turnover. Using the Logistic Regression algorithm with feature selection based on the Variance Inflation Factor (VIF), the prediction results obtained an accuracy of 75%. The study suggests using other algorithms in order to compare the correct prediction results of the Random Forest algorithm with those of other algorithms.
2 | An Extensive Analytical Approach on Human Resources Using Random Forest Algorithm [9] | This study discusses predicting which employees need to be retained based on several parameters. It used 14 parameters from a dataset sourced from Kaggle, totaling 19258 training data and 2129 test data. The study obtained 100% accuracy with the random forest algorithm [5]. This 100% accuracy can be regarded as overfitting, caused by noise or an unclear dataset. The suggestions of this study are to make predictions with other algorithmic models and to perform data cleansing before learning.
3 | Comparative Study of The Machine Learning Techniques for Predicting The Employee Attrition [10] | This study predicts employee attrition using a total of 35 attributes and the Linear Discriminant Analysis (LDA) algorithm, obtaining an accuracy of 86.39%.
4 | Feature Selection pada Azure Machine Learning untuk Prediksi Calon Mahasiswa Berprestasi [11] | This study predicts the achievements of prospective students using the SVM algorithm on the Azure Machine Learning platform. It obtained an accuracy of 82.7% with the use of 10 attributes.
5 | Employee Attrition Rate Prediction Using Machine Learning Approach [12] | This research develops a model to predict employee attrition, starting with basic exploratory data analysis, continuing with feature engineering, and applying a Random Forest learning model, which has an accuracy of 85% in its predictions. Effort-reward imbalance is most likely the underlying common explanation for attrition. This is especially true of individuals who work longer hours than necessary and who generally have low wages; in such situations it should be investigated whether the organization has an attractive overtime strategy. As future work, a heuristic-based approach could be followed to predict the level of employee attrition and solve real-world problems.

The author will use Google's services in this research, namely Google Colaboratory (Google Colab). Google Colab is a modified form of Jupyter Notebook provided by Google, and the platform is often used for Machine Learning and Deep Learning [13]. Google Colab supports a wide range of commands and provides memory and storage for each user [14]. The author will compare two algorithms, namely XG Boost and Support Vector Machine, to determine which gives the highest accuracy value, and will use the Pearson method to find features that have a high correlation.

II. RESEARCH METHOD

A. Dataset Collection
After the literature study is carried out, the next process is to find the data sources that will be used in the research. These data are collected into what is then called a dataset. In this study, the author will use a secondary dataset from Kaggle.

B. Exploratory Data
Exploratory data analysis is intended to present the spread of the data in the obtained dataset in graphical forms that are easier to see and understand.

C. Cleansing Data
The next process after the dataset is obtained is pre-processing. As explained in the previous chapter, pre-processing consists of several stages. The first stage is data cleansing, in which it is necessary to check whether the dataset contains duplicate data. Data may contain more information than is needed to build a model, or may contain wrong information [11]. For example, in a dataset with 50 columns, many columns may duplicate data from other columns. This affects the quality and efficiency of the resulting data model.

If the dataset does not contain duplicate data, the next step is to check whether there are missing values. If a row/column with a missing value is found, it is necessary to find a way to ensure the data is no longer empty. Ways of handling missing values include deleting the affected rows or filling in the empty values.

D. Transformation Data
Once there are no empty rows/columns of data, the outliers are removed. Outliers are data whose values are very far from the general values, or in other words extreme values. The process then continues with feature encoding. Categorical data is transformed into a form that can be understood by the system: character data is converted into numeric data.

The encoding process uses One-Hot Encoding. One-hot encoding is used as a method for representing categorical data numerically: it creates a new feature for each unique value in a column. It suits nominal features, which are categorical features that cannot be ordered. After the encoding process is complete, it is followed by a data split process that breaks the data into 2 parts, namely test data and training data.

E. Split Data
In the next stage, the existing data is divided into two parts, namely training data and test data. The dataset obtained from Kaggle is divided with a ratio of 0.7 for training data and 0.3 for test data. After the encoding process is complete, the process continues with overcoming data imbalance by oversampling.

F. Pre-processing Data
The process continues with data pre-processing, which consists of handling imbalance in the data using oversampling, which aims to reduce/eliminate outliers in the data. Imbalanced data occurs when the ratio between one class of data and another is unbalanced [15]. Imbalanced data is likely to reduce accuracy in machine learning through misclassification. The decrease in accuracy on imbalanced data is caused by the many noisy points or outliers in the test dataset that come from the minority class. Imbalanced data can be handled using oversampling methods, which add synthetic data to the minority class [16]. After that, data scaling is done using normalization. Using the right data scaling method can optimize the performance of the Machine Learning algorithm [17]. Scaling aims to change the original measurement scale of the data into an accepted form without changing the value of the data. The data presented is already in numerical form.

G. Modeling
Once the data is ready, the next step is to build the models. In this study, the author will compare two algorithms, namely XG Boost and Support Vector Machine.

• XG Boost
Extreme Gradient Boosting (XGBoost) is a decision tree-based algorithm [19]. The model is an ensemble tree algorithm consisting of several classification and regression trees. The XG Boost algorithm performs optimization faster than other implementations of the Gradient Boosting Method in both classification and regression problems [20]. In a tree-based algorithm, the inner nodes represent values for the test attribute and the leaf nodes with scores represent decisions.

• Support Vector Machine
Support Vector Machine (SVM) is a Supervised Learning machine learning model [21]. Support Vector Machine uses a classification algorithm to solve the two-category classification problem [19]. The basic principle of SVM is that of a linear classifier, namely for classification cases that can be separated linearly. SVM can also be applied to non-linear problems. SVM takes the input data and classifies it into two different classes based on a hyperplane.

When modeling, the author builds the two algorithms and then carries out the training process with the data that has previously been processed. Once the modeling is complete, the next step is to carry out the trial process. In the trial process, the system is expected to provide optimal predictive results.

H. Evaluation
After the trial process is complete, an evaluation is carried out to find out whether the trial results are optimal. The evaluation stage also shows whether the performance of the Machine Learning model is good, the accuracy value of each algorithm used, and which features have the most influence on the prediction of employee turnover.

Performance measurement/evaluation uses the Confusion Matrix. The performance evaluation process is carried out for each algorithm model used. A Confusion Matrix is a method that can be used to measure the performance of a classification method. The Confusion Matrix contains information that compares the results of the classification performed by the system with the results the classification should have produced. Based on the number of class outputs, classification systems can be divided into 4 (four) types, namely binary, multi-class, multi-label and hierarchical classification [22].

III. RESULT

This section contains coding designs and research results from predicting employee turnover using Machine Learning.

A. Dataset Collection
This study uses secondary data obtained from Kaggle. The data is used as a dataset, which is then processed to produce predictions regarding employee turnover using Machine Learning. The data obtained consists of 1470 rows and 35 columns. Some examples of sample data can be seen in the appendix. No missing values were found in these data. There are 34 features and 1 target (attrition). The feature data is numerical and categorical: there are 26 numerical features and 9 categorical features. To load the dataset into Google Colaboratory, use the command shown in the following image.

Fig. 1: Command input dataset

Then, to check information related to column names, data types, and null data, use the command shown in Fig. 2; the results are shown in Fig. 3 and Fig. 4.

Fig. 2: Command for checking column, data type, and null data

Fig. 3: Result checking column, data type, and null data
Fig. 4: Result checking numerical and categorical features
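The loading and inspection steps captioned in Fig. 1 to Fig. 4 might be sketched as follows. This is a minimal illustration assuming pandas; the inline CSV text and its column values are hypothetical stand-ins for the Kaggle file, which in Colab would normally be read from an uploaded path.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Kaggle CSV; in Colab one would typically call
# pd.read_csv(<path to the uploaded file>) instead of reading from a string.
csv_text = """Age,Attrition,Department,MonthlyIncome
41,Yes,Sales,5993
49,No,Research & Development,5130
37,Yes,Research & Development,2090
"""
df = pd.read_csv(io.StringIO(csv_text))

# Column names, data types, and non-null counts.
df.info()

# Separate numerical and categorical features by dtype.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

On the full dataset, the lengths of the two lists can be compared with the feature counts reported in Section III.A.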

B. Exploratory Data
Exploratory data analysis is intended to present the data distribution in a graphical form that is easier to see and understand. The following are the commands used to view the feature distribution plots and the plots of feature distribution against attrition, in the form of bar graphs and line graphs.
Fig. 5: Command features distribution
Fig. 6: Graph of feature distribution plots
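A feature distribution plot of this kind could be produced roughly as below. The sketch assumes pandas and matplotlib; the sample values are invented for illustration, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Colab
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; the real dataset has 1470 rows.
df = pd.DataFrame({
    "Age": [22, 25, 31, 38, 45, 52, 29, 34],
    "MonthlyIncome": [2100, 2500, 4000, 5200, 7000, 9800, 3100, 4600],
})

# One histogram per numerical feature.
axes = df.hist(bins=5, figsize=(8, 3))
plt.tight_layout()
```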
Fig. 7: Command features distribution to attrition

Fig. 8: Graphic plot of the distribution of features against attrition
Fig. 9: Command features distribution (graph)

Fig. 10: Line graph of feature distribution plots
Fig. 11: Command features distribution to attrition (graph)

Fig. 12: Line graphic plot of the distribution of features against attrition
The graphic results show that attrition occurs in employees with the following features:
• Young age, around 20-23 years
• Low daily rates
• Farther home
• Lower job satisfaction
• Lower monthly income
• Employees with little experience (4-10 years)
• New employees
• Employees with new roles (less than 4 years)
• New to current manager (less than 5 years)

When the number of observations in each category is illustrated with a bar chart, the graphical appearance is as follows.
Fig. 13: Command count features distribution

Fig. 14: Graph of the feature distribution count plot
Fig. 15: Command count features distribution
Fig. 16: Graph of count plot of feature distribution to attrition
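The counts behind a count plot of a feature against attrition can be tabulated directly. The sketch below assumes pandas; `JobSatisfaction` is an assumed column name and the rows are invented.

```python
import pandas as pd

# Hypothetical sample with an assumed JobSatisfaction column (levels 1-4).
df = pd.DataFrame({
    "JobSatisfaction": [1, 1, 2, 3, 4, 4],
    "Attrition": ["Yes", "Yes", "No", "No", "No", "Yes"],
})

# Number of employees per satisfaction level, split by attrition.
counts = pd.crosstab(df["JobSatisfaction"], df["Attrition"])
print(counts)

# Same table as row-wise proportions, i.e. the attrition rate per level.
rates = pd.crosstab(df["JobSatisfaction"], df["Attrition"], normalize="index")
print(rates)
```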

When this dataset is viewed in heatmap form, it will look like the image below.
Fig. 17: The command displays the distribution features heatmap
Fig. 18: Result from the distribution features heatmap
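The Pearson correlation values underlying such a heatmap could be computed as in the following sketch. It assumes pandas; the column names other than Attrition and all values are hypothetical, and the 0/1 `AttritionFlag` column is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Age": [22, 25, 31, 38, 45, 52],
    "MonthlyIncome": [2100, 2500, 4000, 5200, 7000, 9800],
    "DistanceFromHome": [25, 18, 20, 6, 4, 9],
    "Attrition": ["Yes", "Yes", "No", "No", "No", "No"],
})

# Encode the target as 0/1 so it can take part in the correlation.
df["AttritionFlag"] = (df["Attrition"] == "Yes").astype(int)

# Pearson correlation matrix over the numerical columns.
corr = df.select_dtypes(include="number").corr(method="pearson")

# Features ranked by the absolute value of their correlation with the target.
target_corr = (
    corr["AttritionFlag"]
    .drop("AttritionFlag")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

The matrix `corr` is the kind of input a heatmap plotting function expects.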
C. Cleansing Data
The next stage is data cleansing. The dataset first needs to be checked for duplicate data. If the dataset does not contain duplicate data, the next step is to check whether there are missing values. If a row/column with a missing value is found, it is necessary to find a way to ensure the data is no longer empty. The process then proceeds with removing features that do not play an important role, if deemed necessary.

The command for checking duplicate data and its results are shown in Fig. 19.

Fig. 19: Command checks for data duplication and results
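A duplicate check of this kind might look like the sketch below, assuming pandas and a small invented DataFrame in which one row is repeated.

```python
import pandas as pd

# Hypothetical sample in which the second and third rows are identical.
df = pd.DataFrame({
    "EmployeeNumber": [1, 2, 2, 4],
    "Age": [41, 49, 49, 33],
})

# Count rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())
print("duplicate rows:", n_duplicates)

# Drop the repeats, keeping the first occurrence.
df = df.drop_duplicates().reset_index(drop=True)
```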
Furthermore, to carry out the missing value check, the command used is as shown in the image below. The results of the check show that the dataset has no missing values.
Fig. 20: Command for checking missing values
Fig. 21: Missing value checking results
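The missing value check, together with the two remedies named in Section II.C (deleting rows or filling in values), could be sketched as follows. This assumes pandas; the data is invented, and filling with the median or the mode is one possible choice rather than the study's stated method.

```python
import pandas as pd

# Hypothetical sample with one missing value per column.
df = pd.DataFrame({
    "Age": [41, None, 37],
    "Department": ["Sales", "Sales", None],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: delete rows that contain any missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the median and categorical gaps with the mode.
filled = df.fillna({
    "Age": df["Age"].median(),
    "Department": df["Department"].mode()[0],
})
```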
Then the data is checked again to see whether any features need to be dropped. In this dataset, some features need to be dropped because they have the same value in all rows. The features that need to be dropped are EmployeeCount, StandardHours and Over18.
Fig. 22: Command drop feature data
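Features that hold the same value in every row can be detected and dropped as in this sketch, which assumes pandas. The three column names come from the text above; the values shown are illustrative.

```python
import pandas as pd

# Hypothetical sample; the three constant columns mirror those named in the text.
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "Over18": ["Y", "Y", "Y"],
})

# A column with a single unique value carries no information for prediction.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
print("constant columns:", constant_cols)

df = df.drop(columns=constant_cols)
```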
After that, the process continues by converting the EmployeeNumber data into a string with the following command.

Fig. 23: Command converts EmployeeNumber to a string
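The conversion could be written as below, assuming pandas and invented rows. Treating the identifier as text presumably keeps it out of later numerical steps such as outlier removal and scaling; that motivation is an assumption, not something the text states.

```python
import pandas as pd

# Hypothetical sample.
df = pd.DataFrame({"EmployeeNumber": [1, 2, 4], "Age": [41, 49, 37]})

# Store the identifier as text so it is no longer treated as a numerical feature.
df["EmployeeNumber"] = df["EmployeeNumber"].astype(str)

numeric_after = df.select_dtypes(include="number").columns.tolist()
print(numeric_after)
```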
D. Transformation Data
After the data cleansing process is complete, the outliers are removed. Outliers are data whose values are very far from the general values, or in other words extreme values. The following is the command that is run to remove the outliers.
Fig. 24: The command and results remove outliers
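The text does not state which rule is used to identify outliers. One common choice is the interquartile range (IQR) rule, sketched here under that assumption with pandas; the helper name `remove_outliers_iqr`, the factor `k=1.5`, and the sample values are all hypothetical.

```python
import pandas as pd

def remove_outliers_iqr(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

# Hypothetical sample with one extreme income value.
df = pd.DataFrame({"MonthlyIncome": [2000, 2500, 3000, 3500, 4000, 50000]})
cleaned = remove_outliers_iqr(df, ["MonthlyIncome"])
print(cleaned)
```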
The process then continues with feature encoding. Categorical data is transformed into a form that can be understood by the system: character data is converted into numeric data. The encoding process uses One-Hot Encoding. The following is the command that is executed during the encoding process.
Fig. 25: The command and results labeling encoding

Fig. 26: Command labeling encoding for re-map outliers
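Label encoding replaces each category with an integer code. A minimal sketch assuming pandas follows; the `Gender` column and all values are hypothetical, and the explicit mapping for the target is one possible convention.

```python
import pandas as pd

# Hypothetical sample.
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "Gender": ["Male", "Female", "Male"],
})

# Explicit mapping for the target keeps the meaning of 0 and 1 unambiguous.
df["Attrition"] = df["Attrition"].map({"No": 0, "Yes": 1})

# Generic label encoding: categories are sorted, then numbered from 0.
df["Gender"] = df["Gender"].astype("category").cat.codes
print(df)
```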
After the label encoding process is complete, the process continues with data encoding using One-hot Encoding. The command for performing the One-hot Encoding process is shown in Fig. 27.
Fig. 27: Command One-hot Encoding process and the result
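One-hot encoding of a nominal feature could be sketched with `pandas.get_dummies`, as below. The `Department` values are assumed for illustration.

```python
import pandas as pd

# Hypothetical sample with one nominal feature.
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Department": ["Sales", "Research & Development", "Sales"],
})

# Each unique Department value becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["Department"], dtype=int)
print(encoded)
```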

After the encoding process is complete, it is followed by a data split process that breaks the data into 2 parts, namely test data and training data.

E. Split Data
The dataset obtained from Kaggle is divided into 2 parts to obtain training data and test data, with a ratio of 0.7 for training data and 0.3 for test data.
Fig. 28: Command split data process
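A 0.7/0.3 split might be performed as in the sketch below, which assumes scikit-learn and synthetic data. The `stratify` argument and the `random_state` value are assumptions added here to keep the class ratio similar in both parts and to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 4 features, 20% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([1] * 20 + [0] * 80)

# 70% training data, 30% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```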
F. Pre-processing Data
The data pre-processing stage consists of handling imbalanced data and scaling the data. The imbalance of the data is handled using oversampling, which aims to reduce/eliminate outliers in the data.
Fig. 29: Command handling imbalance data process
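The text does not specify which oversampling variant is applied. As one possibility, random oversampling of the minority class can be sketched with pandas alone; the data is invented, and applying the step to the training portion only is an assumption made here.

```python
import pandas as pd

# Hypothetical training data: 6 majority rows (0) and 2 minority rows (1).
train = pd.DataFrame({
    "MonthlyIncome": [5000, 5200, 4800, 6100, 5900, 7000, 2100, 2300],
    "Attrition": [0, 0, 0, 0, 0, 0, 1, 1],
})

n_majority = train["Attrition"].value_counts().max()

parts = []
for label, group in train.groupby("Attrition"):
    if len(group) < n_majority:
        # Resample minority rows with replacement until the classes are equal in size.
        extra = group.sample(n=n_majority - len(group), replace=True, random_state=0)
        group = pd.concat([group, extra])
    parts.append(group)

# Shuffle so the duplicated rows are not grouped together.
balanced = pd.concat(parts).sample(frac=1, random_state=0).reset_index(drop=True)
print(balanced["Attrition"].value_counts())
```

Methods that generate synthetic minority samples, as described in [16], would replace the resampling line with a synthetic-data step.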

After that, data scaling is done using normalization. It aims to change the original measurement scale of the data into an accepted form without changing the value of the data. The data presented is already in numerical form. The following is the command used to perform data scaling using normalization.
Fig. 30: Command for data scaling
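Normalization is often implemented as min-max scaling, x' = (x - min) / (max - min), which maps each feature to the range [0, 1]. The sketch below assumes that variant and scikit-learn's `MinMaxScaler`; the arrays are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test features (age, monthly income).
X_train = np.array([[20.0, 2000.0], [30.0, 4000.0], [40.0, 6000.0]])
X_test = np.array([[25.0, 5000.0]])

scaler = MinMaxScaler()
# Learn min and max from the training data only, then reuse them for the test data.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```

Fitting the scaler on the training data alone avoids leaking information from the test set into the model.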
G. Modeling
After the data processing is complete, the data is ready to be used in the modeling phase. Because two algorithms are compared, namely XG Boost and Support Vector Machine, it is necessary to build a model for each algorithm and then carry out the experimental process with the previously processed data.

Once the modeling is complete, the next step is to carry out the trial process. In the trial process, the system is expected to provide optimal predictive results. The following is the command used to carry out the testing process and its results.
Fig. 31: Command for the testing process
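Training and testing the two models could be sketched as follows. The example assumes scikit-learn and, where installed, the `xgboost` package; if `xgboost` is missing it falls back to scikit-learn's `GradientBoostingClassifier`, which is a different gradient boosting implementation and only a stand-in. The synthetic data and all hyperparameters are assumptions, not the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

try:
    from xgboost import XGBClassifier
    boosted = XGBClassifier(n_estimators=100, random_state=42)
except ImportError:
    # Stand-in only: scikit-learn gradient boosting, not XGBoost itself.
    from sklearn.ensemble import GradientBoostingClassifier
    boosted = GradientBoostingClassifier(random_state=42)

# Synthetic binary classification data in place of the processed dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {"XG Boost": boosted, "Support Vector Machine": SVC(kernel="rbf")}
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)               # training on the 70% portion
    predictions = model.predict(X_test)       # trial on the 30% portion
    accuracies[name] = accuracy_score(y_test, predictions)

print(accuracies)
```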

H. Evaluation
After the trial process was completed, it was found that, of the two algorithm models, the best accuracy was obtained by the XG Boost model with an accuracy of 86%, followed by the Support Vector Machine with 84%.

Performance measurement/evaluation uses the Confusion Matrix. The performance evaluation process is carried out for each algorithm model used. The command used, along with the results of the Confusion Matrix, is shown in the image below.
Fig. 37: Command of XG Boost performance evaluation process
Fig. 38: Command of SVM performance evaluation process
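An evaluation based on the Confusion Matrix might be sketched as below, assuming scikit-learn. The label vectors are invented and do not reproduce the study's results; 1 stands for attrition and 0 for no attrition.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true labels and model predictions.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
print(tn, fp, fn, tp, accuracy, precision, recall)
```

Precision falls as false positives increase and recall falls as false negatives increase, which is how the two measures are read in the conclusion.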
IV. DISCUSSION

The initial hypothesis of this study is that the two algorithms used (XG Boost and Support Vector Machine) will obtain accuracy results that reach >80%.

V. CONCLUSION

Based on the results of the prediction research using Machine Learning, comparing XG Boost and Support Vector Machine, it can be concluded that XG Boost obtained an accuracy of 86% and Support Vector Machine 84%. In terms of accuracy, much of the predicted data is correct. The precision indicates a low number of false positives, and the recall indicates a low number of false negatives.

REFERENCES

[1.] S. Yadav, A. Jain, and D. Singh, “Early Prediction of Employee Attrition using Data Mining Techniques,” Proc. 8th Int. Adv. Comput. Conf. IACC 2018, no. April, pp. 349–354, 2018, doi: 10.1109/IADCC.2018.8692137.
[2.] D. Prasetyawan and R. Gatra, “Model Convolutional Neural Network untuk Mengukur Kepuasan Pelanggan Berdasarkan Ekspresi Wajah,” J. Tek. Inform. dan Sist. Inf., vol. 8, no. 3, pp. 661–673, 2022, doi: 10.28932/jutisi.v8i3.5493.
[3.] Ö. Çelik, “A Research on Machine Learning Methods and Its Applications,” J. Educ. Technol. Online Learn., vol. 1, no. 3, pp. 25–40, 2018, doi: 10.31681/jetol.457046.
[4.] A. Roihan, P. A. Sunarya, and A. S. Rafika, “Pemanfaatan Machine Learning dalam Berbagai Bidang: Review paper,” IJCIT (Indonesian J. Comput. Inf. Technol.), vol. 5, no. 1, pp. 75–82, 2020, doi: 10.31294/ijcit.v5i1.7951.
[5.] P. Santoso, H. Abijono, and N. L. Anggreini, “Algoritma Supervised Learning Dan Unsupervised Learning Dalam Pengolahan Data,” J. Teknol. Terap. G-Tech, vol. 4, no. 2, pp. 315–318, 2021, doi: 10.33379/gtech.v4i2.635.
[6.] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, “Applications of Deep Learning and Reinforcement Learning to Biological Data,” IEEE Trans. Neural Networks Learn. Syst., vol. 29, no. 6, pp. 1–33, 2018, doi: 10.1109/TNNLS.2018.2790388.
[7.] W. Punlumjeak, N. Rachburee, and J. Arunrerk, “Big Data Analytics: Student Performance Prediction Using Feature Selection and Machine Learning on Microsoft Azure Platform,” J. Telecommun. Electron. Comput. Eng., vol. 9, no. 1–4, pp. 113–117, 2017.
[8.] Setiawan, S. Suprihanto, A. C. Nugraha, and J. Hutahaean, “HR Analytics: Employee Attrition Analysis Using Logistic Regression,” IOP Conf. Ser. Mater. Sci. Eng., vol. 830, no. 3, pp. 1–7, 2020, doi: 10.1088/1757-899X/830/3/032001.
[9.] S. L. V. Papineni, A. Mallikarjuna Reddy, S. Yarlagadda, S. Yarlagadda, and H. Akkineni, “An Extensive Analytical Approach on Human Resources Using Random Forest Algorithm,” Int. J. Eng. Trends Technol., vol. 69, no. 5, pp. 119–127, 2021, doi: 10.14445/22315381/IJETT-V69I5P217.
[10.] K. Bhuva and K. Srivastava, “Comparative Study of The Machine Learning Techniques for Predicting The Employee Attrition,” IJRAR, vol. 5, no. 3, pp. 568–577, 2018, [Online]. Available: www.ijrar.org.
[11.] H. Ariesta and M. A. Kartawidjaja, “Feature Selection pada Azure Machine Learning untuk Prediksi Calon Mahasiswa Berprestasi,” TESLA J. Tek. Elektro, vol. 20, no. 2, pp. 166–174, 2018, doi: 10.24912/tesla.v20i2.2993.
[12.] A. Sethy and D. A. K. Raut, “Employee Attrition Rate Prediction Using Machine Learning Approach,” Turkish J. Physiother. Rehabil., vol. 32, no. 3, pp. 14024–14031, 2020, [Online]. Available: www.turkjphysiotherrehabil.org.
[13.] D. F. Sengkey, F. D. Kambey, S. P. Lengkong, S. R. Joshua, and H. V. F. Kainde, “Pemanfaatan Platform Pemrograman Daring dalam Pembelajaran Probabilitas dan Statistika di Masa Pandemi CoVID-19,” J. Inform., vol. 15, no. 4, pp. 257–264, 2020, [Online]. Available: https://ejournal.unsrat.ac.id/index.php/informatika/article/view/31685.
[14.] R. R. Zaveri and P. P. M. Chawan, “Google Colab : A White Paper,” IJSRD - Int. J. Sci. Res. Dev., vol. 9, no. 6, pp. 124–125, 2021.
[15.] R. D. Fitriani, H. Yasin, and Tarno, “Penanganan Klasifikasi Kelas Data Tidak Seimbang dengan Random Oversampling Pada Naive Bayes,” J. Gaussian, vol. 10, no. 1, pp. 11–20, 2021.
[16.] M. C. Untoro and J. L. Buliali, “Penanganan Imbalance Class Data Laboratorium Kesehatan dengan Majority Weighted Minority Oversampling Technique,” Regist. J. Ilm. Teknol. Sist. Inf., vol. 4, no. 1, pp. 23–29, 2018, doi: 10.26594/register.v4i1.1184.
[17.] A. Ambarwari, Q. J. Adrian, and Y. Herdiyeni, “Analisis Pengaruh Data Scaling Terhadap Performa Algoritme Machine Learning untuk Identifikasi Tanaman,” Jurnal Resti (Rekayasa Sist. dan Teknol. Informasi), vol. 4, no. 1, pp. 117–122, 2020.
[18.] R. Maharjan, “Employee Churn Prediction Using Logistic Regression and Support Vector Machine,” San Jose State University, 2021.
[19.] Suwarno and R. Kusnadi, “Analisis Perbandingan SVM, XGBoost dan Neutral Network pada Klasifikasi Ujaran Kebencian,” J. RESTI (Rekayasa Sist. dan Teknol. Informasi), vol. 5, no. 5, pp. 896–903, 2021, doi: 10.29207/resti.v5i5.3506.
[20.] M. Guo, Z. Yuan, B. Janson, Y. Peng, Y. Yang, and W. Wang, “Older Pedestrian Traffic Crashes Severity Analysis Based on An Emerging Machine Learning XGBoost,” Sustainability, vol. 13, no. 926, pp. 1–26, 2021, doi: 10.3390/su13020926.
[21.] A. Raza, K. Munir, M. Almutairi, F. Younas, and M. M. S. Fareed, “Predicting Employee Attrition Using Machine Learning Approaches,” Appl. Sci., vol. 12, no. 13, pp. 1–17, 2022, doi: 10.3390/app12136424.
[22.] M. Sokolova and G. Lapalme, “A Systematic Analysis of Performance Measures for Classification Tasks,” Inf. Process. Manag., vol. 45, no. 4, pp. 427–437, 2009, doi: 10.1016/j.ipm.2009.03.002.
