Instituto Politécnico Nacional
Centro de Investigación en Computación
Development of Machine Learning and Deep
Learning Algorithms to Detect Depression in
Students through Digital Phenotyping
THESIS
submitted to obtain the degree of:
MASTER OF SCIENCE IN COMPUTER SCIENCE
presented by:
ABRAHAM LARRAZOLO BARRERA
Advisors:
Dr. Adolfo Guzmán Arenas
Dr. Gilberto Lorenzo Martínez Luna
United Mexican States
Mexico City
2022
Contents
Resumen
Abstract
Acknowledgements
1 Introduction
1.1 Context
1.2 Objectives
1.2.1 General objective
1.2.2 Specific objectives
1.3 Contributions
2 State of the art
2.1 Methods of Machine Learning and Diagnosing Depression
2.1.1 Personal Sensing: Understanding Mental Health Using Ubiquitous Sensors and Machine Learning
2.1.2 Mobile Phone Sensor Correlates of Depressive Symptom Severity in Daily-Life Behavior: An Exploratory Study
2.1.3 Studentlife: Assessing mental health, academic performance, and behavioral trends of college students using smartphones
2.1.4 The relationship between mobile phone location sensor data and depressive symptom severity
2.1.5 Towards Deep Learning Models for Psychological State Prediction using Smartphone Data: Challenges and Opportunities
3 Theoretical framework
3.1 Machine learning models
3.1.1 Linear regression
3.1.2 Linear regression optimized
3.1.3 Artificial Neural Network (ANN)
3.2 Pearson correlation
3.3 Cross validation
3.4 Exam PHQ-9 - Depression Score
3.5 Dataset Studentlife
3.5.1 PHQ-9 Scores
4 Proposed Solution
4.1 Experiment 1
4.2 Experiment 2
4.3 Feature Extraction
4.4 Correlation of features with PHQ9 score
4.5 Models
4.5.1 Score Estimation Model - Linear regression
4.5.2 Score Estimation Model - Multilayer Perceptron MLPs
5 Results
5.1 Performance Models
6 Conclusions
6.1 Future work
List of Figures
2.1 Features depicted in the layer above the inputs. Raw sensor data is transformed into features. Source: Mohr, Zhang, and Schueller (2017).
3.1 Neural Network Architecture. $w^{l}_{jk}$ is the weight from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer, and $a^{l}_{j}$ is the activation of the $j$th neuron in the $l$th layer. The nonlinear function is represented by $\sigma$; applying this function to the output of a linear transformation yields a nonlinear transformation.
3.2 PHQ-9 score distribution at baseline and at the end-of-study follow-up. Adapted from Studentlife by Wang et al. (2018).
4.1 Experiment 1 predicts depression based on the PHQ-9 score using the features extracted from students' mobile phones only during the last 2 weeks of the 10 weeks of Studentlife.
4.2 Experiment 2 predicts depression based on the PHQ-9 score using the features of two blocks, the initial 2 weeks and the last 2 weeks of the Studentlife dataset.
4.3 The architecture of the regression Artificial Neural Network was designed with 4 hidden layers of 10 neurons each and 1 neuron in the last layer. The Rectified Linear Unit (ReLU) activation function is used in all layers. The first layer has one input for each of the behavioral features i = 1, 2, ..., n, and Y is the output neuron that predicts the PHQ-9 depression score.
5.1 Results of Experiment 1. Subplots of the results for each model for the 38 students (the horizontal axis is the Student ID assigned in the Studentlife dataset). Models in descending order: OLS, Lasso, Ridge, Elastic, and ANN.
5.2 Results of Experiment 2. Subplots of the results for each model for the 38 students (the horizontal axis is the Student ID assigned in the Studentlife dataset). Models in descending order: OLS, Lasso, Ridge, Elastic, and ANN.
List of Tables
2.1 Classification of participants with and without depressive symptoms and estimation of their PHQ-9 scores using location features individually and aggregated. Source: Saeb et al. (2015).
2.2 Correlations between automatic sensor data and PHQ-9 depression scale. Source: Wang et al. (2018).
3.1 Level of Depression according to PHQ-9 scores.
3.2 PHQ-9 depression scale interpretation for both exams (pre and post). Adapted from Studentlife by Wang et al. (2018).
4.1 Coefficients of correlation between location features and PHQ-9 scores at baseline. Pearson's coefficients in Larrazolo's study are similar to Saeb et al. (2016). This similarity is important because the coding of features is based on Saeb et al. (2016) but will be used to explore different models and scenarios.
4.2 Coefficients of correlation between location features and PHQ-9 scores at follow-up. Pearson's coefficients in Larrazolo's study are similar to Saeb et al. (2016). This similarity is important because the coding of features is based on Saeb et al. (2016) but will be used to explore different models and scenarios.
5.1 Comparison of RMSE results of linear and nonlinear models using LOOCV as the evaluation method for Experiment 1, based on the research study of Saeb et al. (2015), and Experiment 2, proposed by Larrazolo.
Resumen
Depression is a common illness worldwide. Approximately 280 million people around the world have depression, according to the World Health Organization (2021).
The growing capabilities and number of sensors in personal devices such as phones, smart bracelets, and smartwatches have generated great interest in the health field because of the information contained in the data these devices generate passively. Research that uses digital phenotyping to predict indicators of mental illnesses such as depression and stress has been increasing.
This work presents the analysis and exploration of different machine learning and deep learning models used to predict students' depression levels from data obtained through mobile devices. We used linear models and artificial neural networks for the predictions; depression levels were analyzed and compared based on the PHQ-9 exam. The PHQ-9 is a clinical exam used to detect depression; it consists of 9 questions, and its score ranges from 0 to 27. The features we used for the models were obtained by preprocessing the data from the mobile phone sensors. The classic linear regressor (Ordinary Least Squares) obtained a Root Mean Squared Error (RMSE) of 7.7; the same model obtained an RMSE of 2.8 with Ridge regularization and 2.8 with Lasso. The best model was the neural network architecture, with an RMSE of 2.7.
Abstract
Depression is a common illness throughout the world. Approximately 280 million people around the world have depression, according to the World Health Organization (2021).
The growing capabilities and number of sensors in personal devices such as phones, smart bracelets, and smartwatches have generated significant interest in the health field because of the information in the data these devices passively generate. Research using digital phenotyping to predict indicators of mental illness, such as depression and stress, has increased.
This work presents the analysis and exploration of different machine learning and deep learning models used to predict students' depression levels from data obtained through mobile devices. We used linear models and artificial neural networks for the predictions; depression levels were analyzed and compared based on the PHQ-9 exam. The PHQ-9 is a clinical exam used to detect depression; it consists of 9 questions, and its score ranges from 0 to 27. The features we used for the models were obtained by preprocessing the data from the mobile phone sensors. The classic linear regressor (Ordinary Least Squares) obtained a Root Mean Squared Error (RMSE) of 7.7; the same model obtained an RMSE of 2.8 with Ridge regularization and 2.8 with Lasso. The best model was the neural network architecture, with an RMSE of 2.7.
Acknowledgements
First of all, I thank my family for always supporting me, especially my mother, who has always shown me her unconditional love. My sister and brother are both an inspiration to explore new limits in knowledge. Last but not least, I thank my partner Nancy, who supported me in many ways on the path to this master's degree.
My sincere thanks go to Dr. Adolfo Guzmán and Dr. Gilberto Martínez, who provided me with their invaluable experience and knowledge to complete this program. I express my deepest gratitude for their constant guidance and support throughout this program.
Special thanks to Dr. Mercedes Balcells for the opportunity to work on the Digital Phenotyping project, and to the Institute for Medical Engineering Science (IMES-MIT) for the research stay.
Thanks to the Centro de Investigación en Computación (CIC) at the Instituto Politécnico Nacional (IPN) and its administration for allowing me to study this important program. I also thank the Consejo Nacional de Ciencia y Tecnología (CONACYT) for its financial support, which provided invaluable opportunities for me to pursue my academic goals.
Chapter 1
Introduction
1.1 Context
Depression is a mental disturbance affecting an individual's thinking and mental development. According to the World Health Organization (2021), approximately 280 million people have depression.
Mental illnesses are diagnosed based on symptoms, patient interviews, and self-reports. The use of mobile devices to track and infer behavioral health is growing because devices like smartphones, tablets, and wearables have sensors that could improve mental health monitoring.
Digital phenotyping was defined as 'moment-by-moment quantification of the individual-level human phenotype in situ using data from digital devices' by Torous (2018). Spinazze (2019) defined digital phenotyping as 'the process of inferring individual behavior from digital data generated through human interaction with electronic devices, including both physical hardware and software.'
Ubiquitous sensing technologies like smartphones allow the continuous collection of data about individuals, passively through inbuilt sensors or connected devices and actively via user engagement. Passive data from sensors and applications are transformed into states related to mental health. The growth in the number of smartphones and their sensors has boosted the adoption of these devices for monitoring mental health. In recent years, the number of connected wearable devices worldwide has increased and is expected to exceed 1 billion by 2022, according to Statista (2019).
Machine learning and deep learning methods have been applied to continuous sensor data to predict mental health, allowing the detection of different mental conditions such as depression, anxiety, and stress. These techniques aim to create models that train themselves to perceive complex patterns.
1.2 Objectives
1.2.1 General objective
Machine learning, deep learning, and the information provided by smartphones and wearables are increasingly used to detect mental illnesses such as depression. The objective of this work is to explore and apply machine learning and deep learning models that predict the level of depression in students, as measured by the score of the PHQ-9 clinical test, from data obtained from mobile phone sensors, and to find behavioral characteristics that could be indicative of depression.
1.2.2 Specific objectives
• Identify behavioral features from smartphone sensors with the potential to be used in depression prediction models.
• Explore and apply the linear regression model with its regularizations and artificial neural network models, using the Studentlife dataset by Wang et al. (2018), to predict the level of depression according to the score of the PHQ-9 clinical test.
• Describe the differences between the explored models, identifying their approaches and specific characteristics.
1.3 Contributions
This thesis makes the following contributions:
• Developed features: coding the features from mobile phone sensors that might be indicative of the severity of depression.
• Depression detection models: designing four models for measuring the severity of depression in students using features extracted from mobile phone sensors.
• Time scenarios: exploring different time scenarios to identify changes in student behavior, using the features extracted from mobile phone sensors to detect the severity of depression.
Chapter 2
State of the art
This chapter describes relevant literature on Digital Phenotyping using information from smartphones and wearables, along with its main contributions. Most of these works applied machine learning and deep learning methods to detect levels of depression.
2.1 Methods of Machine Learning and Diagnosing Depression
2.1.1 Personal Sensing: Understanding Mental Health Using Ubiquitous Sensors and Machine Learning
Mohr et al. (2017) describe how sensors in devices such as phones, wearables, and computers are used to create a digital trace through personal sensing, that is, collecting and analyzing data from sensors embedded in the context of daily life to identify human behaviors, thoughts, feelings, and traits. Figure 2.1 provides a layered hierarchical model for translating raw sensor data into markers of behaviors and states related to mental health. Research methods and challenges are also discussed, including privacy and dimensionality problems. Although personal sensing is still in its infancy, it holds great promise as a method for conducting mental health research, as a clinical tool for monitoring at-risk populations, and as the foundation for the next generation of mobile health interventions.
The authors examined the possibility of using smartphone sensor data to detect the presence and severity of mental health disorders, including depression, bipolar disorder, and schizophrenia. For depression specifically, they discussed several works and concluded that GPS features appear to predict depression, although the relationship between depression and subsequent GPS features degrades quickly over time, suggesting that a lack of mobility may be an early warning signal of depression. They also review three commonly used machine-learning analytical methods: supervised learning, unsupervised learning, and semi-supervised learning.
The work of Mohr et al. (2017) provides a layered reference framework that describes the transformation from sensor data to clinical states, based on a collection of different studies. In our work, we used some of the sensors referenced in Figure 2.1, such as location (GPS) and ambient light, to create new location-type features, and we use expanded features such as circadian movement, number of clusters, and total distance. These features are described in detail in Section 4.3, Feature Extraction. Another difference is that our work develops and tests models to predict the PHQ-9 depression score.
Figure 2.1: Features depicted in the layer above the inputs. Raw sensor data is transformed into features. Source: Mohr et al. (2017)
2.1.2 Mobile Phone Sensor Correlates of Depressive Symptom Severity in Daily-Life Behavior: An Exploratory Study
Saeb et al. (2015) explored the detection of daily-life behavioral markers and the identification of depressive symptom severity using mobile phone global positioning system (GPS) data. The study enrolled a total of 40 participants, who used a sensor data acquisition app for 2 weeks. At the beginning of the 2-week period, participants completed a self-reported depression survey (PHQ-9). Behavioral features were developed and extracted from GPS location and phone usage data. Linear regression was used to estimate each participant's PHQ-9 score, and logistic regression to classify participants with and without depressive symptoms, using features from the phone sensor data.
Data preprocessing was used to extract features from both GPS location and phone usage data. One procedure classified each GPS location sample as coming from a stationary state or a transition state. This procedure estimated the movement speed at each location by calculating the time derivative of the location and then applied a threshold speed that defined the boundary between the two states. The threshold proposed for this study was 1 km/h.
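The stationary/transition split described above can be sketched as follows. This is a minimal illustration, not Saeb et al.'s implementation: the function names, the haversine distance, the backward-difference speed estimate, and the treatment of the first sample are all assumptions made for the example; only the 1 km/h threshold comes from the text.

```python
import math

SPEED_THRESHOLD_KMH = 1.0  # threshold reported for the study


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def label_states(samples, threshold_kmh=SPEED_THRESHOLD_KMH):
    """Label each (timestamp_seconds, lat, lon) sample, sorted by time.

    Speed at sample i is a backward difference between samples i-1 and i.
    The first sample reuses the speed of the second one (an assumption).
    """
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(samples, samples[1:]):
        hours = (t1 - t0) / 3600.0
        km = haversine_km(la0, lo0, la1, lo1)
        speeds.append(km / hours if hours > 0 else 0.0)
    if not speeds:
        return ["stationary"] * len(samples)
    speeds = [speeds[0]] + speeds
    return ["transition" if s >= threshold_kmh else "stationary" for s in speeds]


# Two samples at the same place, then a move of about 1.1 km in 10 minutes.
samples = [(0, 19.0, -99.0), (600, 19.0, -99.0), (1200, 19.01, -99.0)]
print(label_states(samples))
```

Only the samples labeled stationary would then be passed on to the clustering step.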
Another procedure clustered the samples in the stationary state to detect the places where participants spent most of their time, using the K-means algorithm. The GPS location samples were partitioned into K clusters such that the distance of the data points to the cluster centers was minimized. Because the number of clusters was unknown, the process was iterative: it started with one cluster and increased the number of clusters until the distance from the farthest point in each cluster to its centroid fell below a threshold. This threshold, the maximum radius of a cluster, was set to 500 meters. The features extracted from GPS characterize the participant's behavior.
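The iterative choice of K can be sketched as below. This is an illustrative sketch under stated assumptions rather than the original code: points are assumed to be already projected to planar coordinates in meters, the farthest-point initialization is chosen only to make the example deterministic, and the names `kmeans` and `cluster_places` are hypothetical. Only the 500 m maximum radius comes from the text.

```python
import math

MAX_RADIUS_M = 500.0  # maximum cluster radius reported in the text


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def centroid(group):
    return (sum(x for x, _ in group) / len(group), sum(y for _, y in group) / len(group))


def assign(points, centers):
    """Assign every point to its nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        groups[idx].append(p)
    return groups


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = assign(points, centers)
        new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def cluster_places(points, max_radius=MAX_RADIUS_M):
    """Increase k from 1 until every point is closer than max_radius to its centroid."""
    if not points:
        return [], []
    for k in range(1, len(points) + 1):
        centers, groups = kmeans(points, k)
        radius = max(dist(p, c) for g, c in zip(groups, centers) for p in g)
        if radius < max_radius:
            break
    return centers, groups


# Three stationary samples near one place and two near another, about 7 km away.
pts = [(0, 0), (100, 0), (0, 100), (5000, 5000), (5100, 5000)]
centers, groups = cluster_places(pts)
print(len(centers), [len(g) for g in groups])
```

The number of centers returned corresponds to the "number of clusters" feature mentioned elsewhere in this chapter.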
The study investigated the correlation of each of these features with the PHQ-9 score obtained from the exam at the beginning of the study. Participants with a PHQ-9 score ≥ 5 were classified as having depressive symptoms, and participants with a PHQ-9 score < 5 as having none.
The models proposed in the study to estimate depression states from the features were a linear regression with elastic-net regularization (score estimation model) and a logistic regression (classification model).
Several features extracted from GPS data were related to the PHQ-9 score. Only 28 of the 40 participants were included in the data analysis, because the others had insufficient information. The correlation between the features and the PHQ-9 score revealed that 6 of the 10 features were significantly correlated. Circadian movement, normalized entropy, and location variance had strong correlations according to Pearson's coefficient.
Table 2.1 shows the results of training the models with each feature individually and with all features combined. The metrics for the classification model are mean sensitivity and mean specificity. Sensitivity (Eq. 2.1), also known as recall, is the ratio of the correct depression predictions by the model to all participants who are depressed (PHQ-9 ≥ 5).
\[ \text{Sensitivity} = \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{2.1} \]
Specificity (Eq. 2.2) is the ratio of the correct non-depression predictions by the model to all participants who are non-depressed (PHQ-9 < 5).
\[ \text{Specificity} = \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}} \tag{2.2} \]
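As a small worked example of Eqs. 2.1 and 2.2, the sketch below computes both metrics from true PHQ-9 scores and binary predictions. The function name and the toy data are hypothetical; the threshold of 5 follows the classes described in the text.

```python
def sensitivity_specificity(true_scores, predicted_labels, threshold=5):
    """true_scores: PHQ-9 scores; predicted_labels: True means 'depressive symptoms'."""
    tp = fn = tn = fp = 0
    for score, predicted in zip(true_scores, predicted_labels):
        actual = score >= threshold  # PHQ-9 >= 5 is the positive class
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and not predicted:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data: three participants with PHQ-9 >= 5 and three with PHQ-9 < 5.
scores = [2, 7, 10, 4, 5, 1]
predictions = [False, True, False, True, True, False]
print(sensitivity_specificity(scores, predictions))
```

In this toy case two of the three depressed participants and two of the three non-depressed participants are classified correctly, so both metrics equal 2/3.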
Saeb's classification used only two classes: PHQ-9 less than 5 (PHQ-9 < 5) and PHQ-9 greater than or equal to 5 (PHQ-9 ≥ 5). Although the PHQ-9 groups the score into 5 categories (see Section 3.4, Exam PHQ-9 - Depression Score, for details), Saeb used a threshold of 5 because scores below it fall in the Minimal category, the lowest level of depression. A regressor is also worth considering because there are more than two classes and because the numerical value given by a regression model is more precise than a class label. The metric for the score estimation model was the normalized root-mean-square deviation (NRMSD), that is, the root-mean-square error (RMSE) between the PHQ-9 scores estimated by the model for the test participants and their true scores, followed by normalization. This metric indicates how close the predicted depression score was to the actual value. The results showed that models trained with features that correlate more strongly with PHQ-9 were better at detecting participants with depressive symptoms than those with none.

                      Classification (PHQ-9 < 5 vs PHQ-9 ≥ 5)                           PHQ-9 score estimation
Training features     % mean accuracy (SD)   % mean sensitivity   % mean specificity   Mean NRMSD (SD)
Entropy               69.7 (3.5)             66.8                 72.7                 0.262 (0.017)
Normalized entropy    86.5 (3.4)             88.4                 84.9                 0.235 (0.016)
Location variance     75.7 (4.6)             80.2                 71.5                 0.229 (0.014)
Home stay             75.9 (4.9)             80.5                 71.7                 0.253 (0.015)
Transition time       41.1 (9.2)             43.4                 38.7                 0.303 (0.020)
Total distance        56.4 (6.6)             69.6                 43.4                 0.343 (0.041)
Circadian movement    78.6 (4.1)             80.1                 77.5                 0.222 (0.014)
Number of clusters    41.5 (8.9)             47.4                 35.5                 0.305 (0.022)
All                   78.8 (6.2)             83.6                 74.5                 0.251 (0.023)

Table 2.1: Classification of participants with and without depressive symptoms and estimation of their PHQ-9 scores using location features individually and aggregated. Source: Saeb et al. (2015)
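The score estimation metric can be illustrated with a short sketch. The text does not state which normalization was applied, so dividing the RMSE by the range of the true scores is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def rmse(true_scores, predicted_scores):
    """Root-mean-square error between true and predicted PHQ-9 scores."""
    n = len(true_scores)
    squared = sum((t - p) ** 2 for t, p in zip(true_scores, predicted_scores))
    return math.sqrt(squared / n)


def nrmsd(true_scores, predicted_scores):
    """RMSE divided by the range of the true scores (assumed normalization)."""
    score_range = max(true_scores) - min(true_scores)
    return rmse(true_scores, predicted_scores) / score_range


true_phq9 = [0, 10, 20]
predicted_phq9 = [2, 10, 18]
print(rmse(true_phq9, predicted_phq9), nrmsd(true_phq9, predicted_phq9))
```

Normalizing makes the error comparable across datasets whose scores span different ranges; with a different normalizer, such as the mean score, the absolute values would change but the ranking of models on a fixed dataset would not.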
Saeb concluded that features extracted from mobile phone sensor data, such as GPS and phone usage, are related to depressive symptom severity.
The work of Saeb et al. (2015) extracted features and developed linear regression models to predict PHQ-9. In our work, we used some of the features described by Saeb, along with others. We tested not only linear models and their regularized variants (Lasso, Ridge, and Elastic Net) but also deep learning, specifically Artificial Neural Networks. Another difference is that the dataset used in our work allows predictions over two blocks of 2 weeks (initial PHQ-9 and final PHQ-9), so we could explore the difference in behavior and PHQ-9 score between these blocks.
2.1.3 Studentlife: Assessing mental health, academic performance, and behavioral trends of college students using smartphones
Wang et al. (2018) measured mental health, stress, and academic performance in college students. Their study, named Studentlife, followed 48 students across 10 weeks using an app that continuously collects data from sensors such as the accelerometer, microphone, light sensor, proximity sensor, and GPS/Bluetooth, as well as application usage. The goal was to assess the impact of workload on stress, sleep, activity, mood, sociability, mental well-being, and academic performance.
The participants, recruited voluntarily, were 18 graduate and 30 undergraduate students from Dartmouth College. Android Nexus 4 smartphones were offered to participants to complete assignments, and many students used their own iPhones or Android phones: 60% were iPhone users and 40% were Android users.
For this research, the team developed the Studentlife smartphone app to infer human behavior automatically and in an energy-efficient manner. The Studentlife app provides a framework for automatic sensing that infers activity (stationary, walking, running, driving, cycling), sleep duration, and sociability. During the 10 weeks, data were collected by the app on phones carried by the students throughout the day. Sensing was automatic and required no interaction from the users. Two PHQ-9 exams were administered during the study, one at the beginning of the research and the other at the end (week 10).
The research studied the correlation between automatic sensing data from smartphones and mental health. The degree of correlation was calculated using Pearson's correlation coefficient (see Table 2.2). The authors found significant correlations between sleep duration, conversation frequency and duration, and PHQ-9 depression scores. Table 2.2 shows a negative correlation between sleep duration and both the pre (r = −0.360, p = 0.025) and post (r = −0.382, p = 0.020) depression tests, which points to the inability to sleep, one of the key signs of clinical depression. Other findings in line with this study are a significant negative correlation between conversation frequency during the day epoch and the pre (r = −0.403, p = 0.010) and post (r = −0.387, p = 0.016) depression tests, and negative correlations of the post depression score with conversation frequency during the evening (r = −0.345, p = 0.034) and conversation duration during the day (r = −0.328, p = 0.044). These results indicate that students with fewer conversational interactions are more likely to be depressed.

Automatic sensing data                          r (Pearson correlation coefficient)   p-value
sleep duration (pre)                            −0.360                                0.025
sleep duration (post)                           −0.382                                0.020
conversation frequency during day (pre)         −0.403                                0.010
conversation frequency during day (post)        −0.387                                0.016
conversation frequency during evening (post)    −0.345                                0.034
conversation duration during day (post)         −0.328                                0.044
number of co-locations (post)                   −0.362                                0.025

Table 2.2: Correlations between automatic sensor data and PHQ-9 depression scale. Source: Wang et al. (2018)
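For reference, a Pearson coefficient such as those reported in Table 2.2 can be computed as in the sketch below. The data are invented for illustration and are not taken from Studentlife, the function name is hypothetical, and the p-value is not computed here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented example: less sleep goes with a higher PHQ-9 score.
sleep_hours = [8, 7, 6, 5, 4]
phq9_scores = [2, 4, 6, 8, 10]
print(pearson_r(sleep_hours, phq9_scores))
```

A value close to −1 means that the feature decreases almost linearly as the PHQ-9 score increases; the coefficients in Table 2.2 are negative but much weaker than in this constructed example.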
The work of Wang et al. (2018) collected the Studentlife dataset, explained the extraction of features from mobile sensors, and reported the correlation of the relevant variables in Table 2.2 with the PHQ-9 score. In our work, we also extracted features and calculated the correlation of each feature with PHQ-9. One difference is that we use the features to train and test the developed models that predict the PHQ-9 score.
2.1.4 The relationship between mobile phone location sensor data and depressive symptom severity
Saeb et al. (2016) replicated and extended their previous work (Saeb et al., 2015) on identifying depressive symptom severity from data collected by phone geographic location sensors. Saeb used the Studentlife dataset by Wang et al. (2014) to demonstrate that GPS features may be an important and reliable predictor of depression severity. A further extension examined the relationships on workdays versus non-workdays. Movement on workdays is likely determined to some degree by social roles and expectations, while movement on non-workdays is likely less determined by external demands and more by the individual's motivational state. The GPS features comprised the same eight features as in Saeb et al. (2015) plus three new ones: speed mean, speed variance, and raw entropy. They evaluated the relationship between each set of features (10 weeks and 2 weeks, weekends and weekdays) and depressive symptom severity as measured by the PHQ-9 score, considering coefficients significant when their associated p-values fell below 0.05. In the correlation analysis of the 10-week location features, location variance, circadian movement, entropy, and number of clusters all had absolute correlation coefficients |r| ≥ 0.4 (Pearson's r) with the follow-up PHQ-9 scores. Overall, GPS features were significantly related to PHQ-9 scores.
The analysis also split the 2-week GPS features into weekdays and weekends. Correlations were stronger for weekend features than for weekday features. Correlations between weekend location features and the follow-up PHQ-9 scores generally remained significant regardless of the time point at which the features were extracted. The same held for three of the six features extracted on weekdays: home stay, entropy, and normalized entropy.
Saeb's study supports the potential of smartphone sensor technology to provide biomarkers of depression in daily life. The work of Saeb et al. (2016) extended the previous research of Saeb et al. (2015), but this time did not develop a linear model to predict the PHQ-9 depression score. Our work extracts the features described by Saeb and differs in that it develops and trains linear and non-linear (ANN) models. We explored linear regression, its regularized versions, and other models such as Artificial Neural Networks.
2.1.5 Towards Deep Learning Models for Psychological State Prediction using Smartphone Data: Challenges and Opportunities
Mikelsons, Smith, Mehrotra, and Musolesi (2017) investigated the effectiveness of neural network models for predicting users' stress levels from the location information collected by smartphones. The mobility patterns of participants were characterized using GPS metrics, and these metrics were used as input to the network. The users' stress states came from the questionnaires in the Studentlife dataset. The data were averaged daily and rescaled into three stress levels: below-median, median, and above-median. Eight spatio-temporal metrics were used as predictors: total distance covered in a day, maximum 2-point displacement in a day, distance standard deviation, number of different areas visited by tiles approximation, total spatial coverage by the convex hull, the difference in the sequence of tiles covered compared to the previous day, and distance entropy. The Artificial Neural Network comprised four fully connected layers of 57, 35, 35, and 3 neurons, respectively. The first three layers use tanh activation and batch normalization, as well as drop-out regularization with rates of 0.35, 0.25, and 0.15. The final layer is a softmax-activation layer with batch normalization and three output nodes corresponding to the three stress classes. Using all the 12 input features, they reported an F1 score of 0.42, a precision of 0.43, and a recall of 0.45.
The research of Mikelsons et al. (2017) used features extracted from GPS and an Artificial Neural Network to classify stress levels. Our experiment instead seeks to predict depression as a numerical value, which gives a finer-grained estimate than a class label. We explored different models (linear and nonlinear); specifically, the multilayer neural network regressor was used to predict the students' actual PHQ-9 scores.
Chapter 3
Theoretical framework
3.1 Machine learning models
This section details the different models used in this work to predict the PHQ-9 score. To provide the appropriate context, some concepts need to be defined before the models are explained:
• feature: an individual measurable property or characteristic of a phenomenon (Bishop, 2006).
• parameter: a configuration variable that is internal to the model and whose value can be estimated from data.
• coefficient: a multiplicative factor, such as one of the constants a_i in a polynomial (Weisstein, Eric W., 2022).
• sampling (statistical sample): the process of selecting a number of cases from all the cases in a particular group or universe.
• target variable: the variable whose values are modeled and predicted from other variables.
3.1.1 Linear regression
In statistics, linear regression is a linear approach for modeling the relationship between
a scalar response and one or more explanatory variables (also known as dependent and
independent variables). In linear regression, the relationships are modeled using linear
predictor functions whose unknown model parameters are estimated from the data. Such
models are called linear models.
Linear regression can be used to fit a predictive model to an observed data set of values.
This model explains variation in the response variable that can be attributed to variation
in the explanatory variables. It means linear regression analyses can be applied to quantify
the strength of the relationship between the response and the explanatory variables.
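As a minimal sketch of the idea above, assuming a single explanatory variable and hypothetical toy data (neither comes from this study), the least-squares intercept and slope can be computed in closed form:

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and variance (the common 1/n factor cancels out).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data that follow y = 1 + 2x exactly.
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_simple_linear_regression(toy_x, toy_y)
print(intercept, slope)
```

With several explanatory variables the same criterion is minimized, but the solution is obtained from a linear system rather than from these two scalar formulas.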
3.1.2 Linear regression optimized
The least squares linear regression approach adjusts the parameters of score estimation.
This method performs well as long as the number of features is low relative to the number
of samples. Otherwise, the model overfits the data. To reduce overfitting, we used
regularization, which prevents the coefficients from becoming too large by adding a
penalty term to the loss function.
Ridge Regression
Ridge Regression (Zou and Hastie (2005)) is a regularized method to reduce overfitting. In
Ridge Regression, the Ordinary Least Squares (OLS) loss function is augmented in such a
way that we not only minimize the sum of squared residuals but also penalize the size of
parameter estimates in order to shrink them towards zero. In Eqs. 3.1, 3.2, and 3.3, y_i
represents the target to predict (in this work, the PHQ-9 score), x′_i the features, β̂ the
coefficients, n the number of elements in the sample, and m the number of coefficients
(features).
Ridge Regression (also known as L2 regularization) introduces a bias controlled by λ; the
penalized loss function is:
\[ L_{Ridge}(\hat{\beta}) = \sum_{i=1}^{n} \left( y_i - x_i' \hat{\beta} \right)^2 + \lambda \sum_{j=1}^{m} \hat{\beta}_j^2 \tag{3.1} \]
So, setting λ to 0 is the same as using OLS, while the larger its value, the more strongly
the size of the coefficients is penalized.
Lasso Regression
Lasso Regression (also known as L1 regularization) (Zou and Hastie (2005)) is conceptually
similar to Ridge Regression. It also adds a penalty for non-zero coefficients, but it
penalizes the sum of their absolute values. For high values of λ, many coefficients
are exactly zero. The penalized loss function is:
\[ L_{Lasso}(\hat{\beta}) = \sum_{i=1}^{n} \left( y_i - x_i' \hat{\beta} \right)^2 + \lambda \sum_{j=1}^{m} \left| \hat{\beta}_j \right| \tag{3.2} \]
Elastic Net Regression
Elastic Net (Zou and Hastie (2005)) combines the penalties of Ridge and Lasso Regression
to get the benefits of both. Elastic Net aims at minimizing the following loss function:
\[ L_{enet}(\hat{\beta}) = \frac{\sum_{i=1}^{n} \left( y_i - x_i' \hat{\beta} \right)^2}{2n} + \lambda \left( \frac{1 - \alpha}{2} \sum_{j=1}^{m} \hat{\beta}_j^2 + \alpha \sum_{j=1}^{m} \left| \hat{\beta}_j \right| \right) \tag{3.3} \]
In Eq. 3.3, the hyperparameter α sets the degree of influence of each penalty, L1 and L2.
Its value lies in the interval [0, 1]. When α = 0 the Ridge penalty is applied, and when
α = 1 the Lasso penalty is applied.
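The three loss functions of Eqs. 3.1, 3.2, and 3.3 can be sketched directly from their definitions. The function names and the toy values below are illustrative assumptions, not the implementation used in this work, and the sketch only evaluates the losses; it does not minimize them:

```python
def squared_error(X, y, beta):
    """Sum of squared residuals between y_i and the linear prediction x_i' beta."""
    return sum(
        (yi - sum(xj * bj for xj, bj in zip(xi, beta))) ** 2
        for xi, yi in zip(X, y)
    )


def ridge_loss(X, y, beta, lam):
    """Eq. 3.1: squared error plus lambda times the sum of squared coefficients."""
    return squared_error(X, y, beta) + lam * sum(b ** 2 for b in beta)


def lasso_loss(X, y, beta, lam):
    """Eq. 3.2: squared error plus lambda times the sum of absolute coefficients."""
    return squared_error(X, y, beta) + lam * sum(abs(b) for b in beta)


def elastic_net_loss(X, y, beta, lam, alpha):
    """Eq. 3.3: scaled squared error plus a mix of the L2 and L1 penalties."""
    n = len(y)
    l2 = sum(b ** 2 for b in beta)
    l1 = sum(abs(b) for b in beta)
    return squared_error(X, y, beta) / (2 * n) + lam * ((1 - alpha) / 2 * l2 + alpha * l1)


# Hypothetical toy values: two samples, two features.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
beta = [2.0, -1.0]
print(ridge_loss(X, y, beta, 0.5))             # 10 + 0.5 * 5
print(lasso_loss(X, y, beta, 0.5))             # 10 + 0.5 * 3
print(elastic_net_loss(X, y, beta, 0.5, 0.5))  # 10 / 4 + 0.5 * (1.25 + 1.5)
```

Setting `lam` to 0 removes every penalty, which matches the remark that λ = 0 recovers OLS.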
3.1.3 Artificial Neural Network (ANN)
Multilayer perceptrons (MLPs) (Goodfellow, Bengio, and Courville (2016)) have the goal
of approximating some function f*. A feedforward network defines a mapping y = f(x; θ)
and learns the value of the parameters θ that result in the best function approximation.
These models are called feedforward because information flows through the function being
evaluated from x, through the intermediate computations used to define f, and finally to
the output y. Feedforward neural networks are typically represented by composing together
many different functions. The model is associated with a directed acyclic graph describing
how the functions are composed together. For example, we might have three functions
f^(1), f^(2), and f^(3) connected in a chain to form f(x) = f^(3)(f^(2)(f^(1)(x))). These
chain structures are the most commonly used structures of neural networks. In this case,
f^(1) is called the first layer of the network, f^(2) is called the second layer, and so
on. These networks are called neural because they were inspired by neuroscience.
Each hidden layer of the network is vector-valued. The dimensionality of these hidden
layers determines the width of the model. Figure 3.1 illustrates the feedforward process
in an Artificial Neural Network with hidden layers.
Regularization
When the network learns from a small dataset, it tends to fit every data point exactly.
The network then memorizes the individual data points and fails to capture the general
trend of the training dataset. Dropout regularization is one technique used to tackle this
overfitting problem in deep learning. During training, some layer outputs are ignored or
dropped at random. This makes the layer look like, and be treated as, a layer with a
different number of nodes and different connectivity to the preceding layer. In practice,
each update during training is carried out with a different view of the configured layer.
Dropout makes the training process noisy, requiring nodes within a layer to take on more
or less responsibility for the inputs on a probabilistic basis.
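A minimal sketch of dropout on one layer's outputs follows. It assumes the common "inverted dropout" variant, in which the surviving outputs are rescaled by 1 / (1 - rate) during training so that no rescaling is needed at prediction time; the text above does not state which variant was used, and the function name is illustrative:

```python
import random


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training.

    Surviving activations are divided by the keep probability (inverted
    dropout, an assumption of this sketch) so their expected value is unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_probability = 1.0 - rate
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(dropout(layer_output, rate=0.5, rng=rng))                  # some entries zeroed
print(dropout(layer_output, rate=0.5, rng=rng, training=False))  # unchanged
```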
[Figure 3.1: diagram of one feedforward layer. The input activations a_1^(0), ..., a_n^(0)
are connected through the weights w^0_{j,i} to the output activations a_1^(1), ..., a_m^(1).]
\[ a_1^{(1)} = \sigma\left( \sum_{i=1}^{n} w_{1,i}^{0} a_i^{(0)} + b_1^{(0)} \right), \qquad a^{(1)} = \sigma\left( W^{(0)} a^{(0)} + b^{(0)} \right) \]
Figure 3.1: Neural Network Architecture. w^l_{jk} is the weight from the k-th neuron in
the (l − 1)-th layer to the j-th neuron in the l-th layer, and a^l_j is the activation of
the j-th neuron in the l-th layer. The nonlinear function is represented by σ; applying
this function to the output of a linear transformation yields a nonlinear transformation.
3.2 Pearson correlation
The Pearson correlation coefficient r is a descriptive statistic that describes the strength
and direction of the linear relationship between two quantitative variables. It is the ratio
between the covariance of two variables and the product of their standard deviations.
Pearson correlation is defined in Eq. 3.4, where cov is the covariance, σ_X is the
standard deviation of X, and σ_Y is the standard deviation of Y. Pearson correlation
normalizes the covariance, so its values always lie between -1 and 1.
\[ \rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{3.4} \]
3.3 Cross validation
Cross-validation is a statistical method of evaluation (Refaeilzadeh, Tang, and Liu (2009)).
It consists of dividing the data into training and test datasets. The training dataset is
used to train a model, and the test dataset is used to evaluate the model with data that it
has not seen. The basic form of cross-validation is k-fold cross-validation. In k-fold
cross-validation, the data is divided into k equally sized folds. Subsequently, k
iterations of training and validation are performed such that within each iteration a
different fold of the data is held out for validation while the remaining k-1 folds are
used for learning.
Leave-one-out cross-validation is k-fold cross-validation where k is equal to N, the number
of records in the dataset. That means that, N separate times, the model is trained on
all the data except for one point and a prediction is made for that point. As before, the
average error is computed and used to evaluate the model. The evaluation given by the
leave-one-out cross-validation error is good, but at first pass it seems very expensive to
compute because the model is trained N times.
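The leave-one-out procedure can be sketched as follows. The `fit` and `predict` callables are hypothetical placeholders for any model, the mean predictor is only a stand-in, and the error is summarized as a root mean square error, which is one possible choice of average error:

```python
def leave_one_out_splits(n_samples):
    """Yield (training indices, held-out index) for each of the N iterations."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out


def loocv_rmse(x, y, fit, predict):
    """Train N times, each time predicting the single held-out record."""
    squared_errors = []
    for train, held_out in leave_one_out_splits(len(x)):
        model = fit([x[i] for i in train], [y[i] for i in train])
        error = predict(model, x[held_out]) - y[held_out]
        squared_errors.append(error ** 2)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


# Stand-in model (an assumption of this sketch): always predict the training mean.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x_new):
    return model


print(loocv_rmse([0.0, 0.0, 0.0], [1.0, 2.0, 3.0], fit_mean, predict_mean))
```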
3.4 Exam PHQ-9 - Depression Score
The Patient Health Questionnaire (PHQ) is a questionnaire that can be entirely
self-administered by the patient. The PHQ assesses 8 diagnoses, divided into threshold
disorders (disorders that correspond to specific diagnoses: major depressive disorder, panic
disorder, other anxiety disorder, and bulimia nervosa) and subthreshold disorders
(disorders whose criteria encompass fewer symptoms than are required for any specific
diagnosis: other depressive disorder, probable alcohol abuse/dependence, and binge eating
disorder).
The PHQ-9 is the 9-item depression module from the full PHQ. As a severity measure, the
PHQ-9 score can range from 0 to 27, since each of the 9 items can be scored from 0 (not
at all) to 3 (nearly every day). The PHQ-9 score is divided into 5 categories according
to the score: 0-4 (Minimal), 5-9 (Mild), 10-14 (Moderate), 15-19 (Moderately
severe), and 20 or greater (Severe); see Table 3.1. These categories were chosen for several
reasons. The first was pragmatic, in that the cut points of 5, 10, 15, and 20 are simple for
clinicians to remember and apply. The second reason was empirical, in that using different
cut points did not noticeably change the associations between increasing PHQ-9 severity
and measures of construct validity, according to Kroenke, Spitzer, and Williams (2001).
Depression Severity    PHQ-9 Score
Minimal                0-4
Mild                   5-9
Moderate               10-14
Moderately severe      15-19
Severe                 20-27
Table 3.1: Level of Depression according to PHQ-9 scores
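The mapping of Table 3.1 can be written as a small lookup. The function name is an illustrative assumption; the cut points are the ones listed in the table:

```python
def phq9_severity(score):
    """Return the Table 3.1 severity label for a PHQ-9 score between 0 and 27."""
    if not 0 <= score <= 27:
        raise ValueError("PHQ-9 scores range from 0 to 27")
    if score <= 4:
        return "Minimal"
    if score <= 9:
        return "Mild"
    if score <= 14:
        return "Moderate"
    if score <= 19:
        return "Moderately severe"
    return "Severe"


for example_score in (3, 7, 12, 18, 24):
    print(example_score, phq9_severity(example_score))
```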
3.5 Dataset Studentlife
Studentlife by Wang et al. (2014) is a study that uses passive and automatic sensing data
from the phones of a class of 48 Dartmouth students over a 10-week term to assess their
mental health (e.g., depression, loneliness, stress), academic performance (grades across
all their classes, term GPA and cumulative GPA) and behavioral trends (e.g., how stress,
sleep, visits to the gym, etc. change in response to college workload – i.e., assignments,
midterms, finals – as the term progresses).
The StudentLife dataset is a large, longitudinal dataset that is rich in information and deep.
Importantly, the dataset is anonymized, protecting the participants’ privacy in the study.
The dataset is from 48 undergraduate and graduate students at Dartmouth over the 10-week
spring term. It includes over 53 GB of continuous data, self-reports, and pre-post surveys;
specifically, it comprises:
• Objective sensing data: sleep (bedtime, duration, wake up); conversation duration,
conversation frequency; physical activity (stationary, walk, run);
• Location-based data: location, co-location, indoor and outdoor mobility;
• Other phone data: light, Bluetooth, audio, Wi-Fi, screen lock/unlock, phone charge,
app usage.
• Self-reports: affect (PAM), stress, behavior, Boston bombing reaction, canceled
classes, class opinion, comment, Dartmouth now, Dimension incident, Dimension
protest, dining halls, events, exercise, Green Key, lab, mood, loneliness, social and
study spaces.
• Pre-post surveys: PHQ9 depression scale, UCLA loneliness scale, positive and negative affect schedule (PANAS), perceived stress scale (PSS), big five personalities,
flourishing scale, Pittsburgh sleep quality index, veterans RAND 12 item health
(VR12)
• Academic performance data: class information, deadlines, grades (grades, term GPA,
cumulative GPA), piazza data
• Dining data: meals data, location, and time
• Seating data: seating position of students in Android programming
• Entry and exit surveys: to be added once anonymized
3.5.1 PHQ-9 Scores
Two PHQ-9 exams were applied in the Studentlife study. The first (pre) was applied at the
start of the project with the participation of 46 students, and the second (post) at the
end with the participation of 38 students. Table 3.2 shows the interpretation of the scale
and the number of students that fall into each category for the pre and post assessments.
The majority of students experienced minimal or minor depression in both measures.
Depression severity          Number of students    Number of students
                             (pre-survey)          (post-survey)
None (0)                     2                     4
Minimal (1-4)                19                    15
Minor (5-9)                  17                    12
Moderate (10-14)             6                     3
Moderately severe (15-19)    1                     2
Severe (20-27)               1                     2
Total number of students     46                    38
Table 3.2: PHQ-9 depression scale interpretation for both exams (pre and post).
Adapted from Studentlife by Wang et al. (2018)
Figure 3.2 shows the distribution of the participants’ PHQ-9 scores for each period.
Figure 3.2: PHQ-9 score distribution at baseline and at the end-of-study follow-up. Adapted
from Studentlife by Wang et al. (2018)
Chapter 4
Proposed Solution
Studentlife had a period of 10 weeks of collected data. In this research, we used 2 blocks,
the initial 2 weeks and the last 2 weeks. These blocks are relevant because only 2 clinical
tests, PHQ-9, were applied, the first at the beginning and the second on the last day of
the project. Therefore, this work will carry out an empirical study with two experiments.
4.1 Experiment 1
This first experiment is similar to the research procedure of Saeb et al. (2015). The
features were coded following that study but extracted from the Studentlife dataset of
Wang et al. (2014). Another difference between Experiment 1 and Saeb et al. (2015) is that
we not only used Elastic Net, as Saeb et al. (2015) did, but also extended the exploration
by adding Ridge, Lasso, and the ANN.
Experiment 1 predicted the PHQ-9 score of the last two weeks using the features extracted
from that period. The features extracted from the last 2 weeks are the input to the
predictor models; see Figure 4.1.
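[Figure 4.1: timeline of the 10 weeks of the study (w0, w2, w4, w6, w8, w10); the PHQ-9
and the post_features block are placed at the last 2 weeks.]
Figure 4.1: Experiment 1 predicts depression based on the PHQ-9 score using the features
extracted from students’ mobile phones only during the last 2 weeks of the 10 weeks of
Studentlife.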
4.2 Experiment 2
One contribution of this work is the second experiment, not proposed before in the state of
the art, which predicts depression using both blocks, the initial 2 weeks and the last
2 weeks. This experiment calculated the difference in behavior between the blocks to know
whether each feature of a participant’s behavior increased or decreased; see Figure 4.2.
This experiment aims to identify behavioral change in the students, based on the
hypothesis that behavioral change is reflected in their PHQ-9 score.
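[Figure 4.2: timeline of the 10 weeks of the study (w0, w2, w4, w6, w8, w10); a PHQ-9 and
the pre_features block are placed at the initial 2 weeks, and a PHQ-9 and the
post_features block at the last 2 weeks; feature = post_features - pre_features.]
Figure 4.2: Experiment 2 predicts depression based on the PHQ-9 score using the features of
two blocks, the initial 2 weeks and the last 2 weeks of the Studentlife dataset.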
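The relation feature = post_features - pre_features of Experiment 2 can be sketched per participant as below. The dictionary layout, the feature names, and the values are hypothetical and only illustrate the subtraction:

```python
def behavior_change(pre_features, post_features):
    """Return post - pre for every feature measured in both 2-week blocks."""
    return {
        name: post_features[name] - pre_features[name]
        for name in pre_features
        if name in post_features
    }


# Hypothetical feature values of one participant.
pre = {"entropy": 0.5, "home_stay": 0.25}
post = {"entropy": 0.75, "home_stay": 0.125}
print(behavior_change(pre, post))  # positive = increased, negative = decreased
```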
4.3 Feature Extraction
We extracted features according to Saeb et al. (2015) from mobile phone sensors to define
the behavioral pattern of each participant. For each experiment, a dataset was generated
to be used in training and fitting the linear and nonlinear models.
• Location Variance
Location variance measures the variability in a participant’s GPS location, calculated
using stationary states.
\[ \text{Location Variance} = \log\left( \sigma_{lat}^2 + \sigma_{long}^2 \right) \tag{4.1} \]
Here σ_lat^2 and σ_long^2 are the variances of the latitude and the longitude. The
logarithm compensates for the skewness in the distribution of location variance across
participants.
• Number of Clusters The number of clusters represents the number of frequent places
found by the K-means algorithm in the preprocessing stage.
• Homestay
Homestay measured the percentage of time a participant spent at home relative to
other location clusters. To obtain this measure, we first needed to know which cluster
represented the participant’s home. The home cluster is identified based on two
heuristics:
1. The home cluster is among the first to the third most visited clusters.
2. The home cluster is the cluster most visited during the time period between 12
a.m. and 6 a.m.
• Entropy
Entropy measures the variability of the time the participant spent at the location
clusters. This feature was developed based on the concept of entropy from information
theory. It was calculated as:
\[ \text{Entropy} = -\sum_{i} p_i \log p_i \tag{4.2} \]
where each i = 1, 2, ..., N represented a location cluster, N denoted the total number
of location clusters, and p_i was the percentage of time the participant spent at
the location cluster i.
High entropy indicated that the participant spent time more uniformly across different location clusters, while lower entropy indicated greater inequality in the time
spent across the clusters. For example, if a participant spent 80% of time at home
and 20% at work, the resulting entropy would be −(0.8 log 0.8 + 0.2 log 0.2) ≈ 0.500,
while if they spent 50% at home and 50% at work, the resulting entropy would be
−(0.5 log 0.5 + 0.5 log 0.5) ≈ 0.693.
• Normalized Entropy
Normalized entropy is the entropy divided by its maximum value, which is
the logarithm of the total number of clusters:
\[ \text{Normalized Entropy} = \text{Entropy} / \log N \tag{4.3} \]
The value of normalized entropy ranges from 0 to 1, where 0 indicates that all location
data points belong to the same cluster, and 1 implies that they are uniformly
distributed across all the clusters. Unlike entropy, normalized entropy is invariant
to the number of clusters and thus depends solely on the distribution of the visited
location clusters.
• Circadian Movement
Circadian movement captures the temporal information of the location data. This
feature measured to what extent the participants’ sequence of locations followed a
24-hour, or circadian, rhythm. For example, if a participant left home for work and
returned home from work around the same time each day, the circadian movement
was high. On the contrary, a participant with a more irregular pattern of moving
between locations had a lower circadian movement.
To calculate circadian movement, we first used least-squares spectral analysis,
also known as the Lomb-Scargle method (see VanderPlas (2018)), to obtain the spectrum
of the GPS location data. Then, we calculated the amount of energy that fell
into the frequency bins within a 24 ± 0.5 hour period in the following way:
\[ E = \sum_{i} \mathrm{psd}(f_i) / (i_1 - i_2) \tag{4} \]
where i = i_1, i_1 + 1, i_1 + 2, ..., i_2, and i_1 and i_2 represent the frequency bins
corresponding to 24.5 and 23.5 hour periods. psd(f_i) denotes the power spectral density
at each frequency bin f_i. E was calculated separately for longitude and latitude, and
the total circadian movement was obtained as:
\[ CM = \log\left( E_{lat} + E_{long} \right) \tag{5} \]
The logarithm was applied to account for the skewness in the distribution.
• Total Distance
Total distance measures the total distance in kilometers taken by a participant. It
was calculated by accumulating the distances between the location samples.
• Speed mean
Speed mean is the mean of the instantaneous speed obtained at each GPS data point. The
instantaneous speed (degrees/sec) was calculated as the change in latitude and longitude
values over time in the following way:
\[ V_i = \sqrt{ \left( \frac{lat_i - lat_{i-1}}{t_i - t_{i-1}} \right)^2 + \left( \frac{long_i - long_{i-1}}{t_i - t_{i-1}} \right)^2 } \tag{4.4} \]
where lat_i, long_i, and t_i are the latitude, longitude, and time at sample i.
• Transition Time
Transition time represented the percentage of time during which a participant was in
a nonstationary state (see Data Preprocessing). This was calculated by dividing the
number of GPS location samples in transition states by the total number of samples.
• Phone Usage Frequency
Phone usage frequency indicated, on average, how many times during a day a participant interacted with their phone.
• Phone Usage Duration
Phone usage duration measured, on average, the total time in seconds that a participant spent each day interacting with their phone.
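Several of the location features listed above follow directly from Eqs. 4.1 to 4.4. The sketch below is a plain-Python illustration under stated assumptions: the function names are hypothetical, the population variance is used (the text does not say which variance estimator was applied), natural logarithms are used as in the worked entropy example, and normalized entropy is defined as 0 when there is a single cluster:

```python
import math


def _variance(values):
    """Population variance (an assumption of this sketch)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def location_variance(lats, longs):
    """Eq. 4.1: logarithm of the summed latitude and longitude variances."""
    return math.log(_variance(lats) + _variance(longs))


def entropy(time_fractions):
    """Eq. 4.2: -sum(p_i * log p_i) over the fractions of time spent per cluster."""
    return -sum(p * math.log(p) for p in time_fractions if p > 0)


def normalized_entropy(time_fractions):
    """Eq. 4.3: entropy divided by log N, with N the number of clusters."""
    n_clusters = len(time_fractions)
    if n_clusters < 2:
        return 0.0  # log(1) = 0; a single cluster is treated as no variability
    return entropy(time_fractions) / math.log(n_clusters)


def mean_speed(lats, longs, times):
    """Mean of the instantaneous speeds of Eq. 4.4 (degrees per second)."""
    speeds = []
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        v_lat = (lats[i] - lats[i - 1]) / dt
        v_long = (longs[i] - longs[i - 1]) / dt
        speeds.append(math.sqrt(v_lat ** 2 + v_long ** 2))
    return sum(speeds) / len(speeds)


print(round(entropy([0.8, 0.2]), 3))   # the 80% / 20% example from the text
print(round(entropy([0.5, 0.5]), 3))   # the 50% / 50% example from the text
print(normalized_entropy([0.5, 0.5]))
```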
4.4 Correlation of features with PHQ9 score
The Pearson correlation was computed between the features and the PHQ-9 score in both the
initial 2 weeks and the last 2 weeks. In addition, to support analyses examining the
directionality of the correlations, the Pearson correlation was calculated from
2-week-long blocks of GPS data. The 2-week period was selected as a block of time because
a diagnosis of depression requires the presence of symptoms for at least two weeks. We
therefore obtained nine features for each 2-week block. Table 4.1 and Table 4.2 show the
Pearson correlation between each feature and the PHQ-9 score.
Baseline (n=46)
Features               Saeb et al. (2016) study    Larrazolo study
Location variance      -0.29 ±0.008                -0.29
Circadian movement     -0.34 ±0.006                -0.30
Speed mean             -0.03 ±0.007                0.04
Speed variance         -0.07 ±0.007                -0.11
Total distance         -0.23 ±0.004                -0.22
Number of clusters     -0.38 ±0.005                -0.31
Entropy                -0.31 ±0.007                -0.31
Normalized entropy     -0.26 ±0.007                -0.28
Home stay              0.22 ±0.008                 0.23
Table 4.1: Coefficients of correlation between location features and PHQ-9 scores at
Baseline. Pearson’s coefficients in Larrazolo’s study are similar to Saeb et al. (2016). This
similarity is important because the coding of features is based on Saeb et al. (2016) but
will be used to explore different models and scenarios.
Follow-up (n=38)
Features               Saeb et al. (2016) study    Larrazolo study
Location variance      -0.43 ±0.007                -0.43
Circadian movement     -0.48 ±0.006                -0.43
Speed mean             -0.06 ±0.005                -0.06
Speed variance         -0.06 ±0.005                -0.09
Total distance         -0.18 ±0.006                -0.17
Number of clusters     -0.44 ±0.004                -0.35
Entropy                -0.46 ±0.005                -0.45
Normalized entropy     -0.44 ±0.005                -0.46
Home stay              0.43 ±0.005                 0.40
Table 4.2: Coefficients of correlation between location features and PHQ-9 scores at
Follow-up. Pearson’s coefficients in Larrazolo’s study are similar to Saeb et al. (2016). This
similarity is important because the coding of features is based on Saeb et al. (2016) but
will be used to explore different models and scenarios.
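The coefficient reported in these tables is the Pearson correlation of Eq. 3.4. A minimal sketch of its computation for one feature against the PHQ-9 scores follows; the function name and the toy numbers are hypothetical and are not data from the study:

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and standard deviations cancel out.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical toy values: a feature that falls as the PHQ-9 score rises.
feature_values = [0.9, 0.7, 0.6, 0.2]
phq9_scores = [2.0, 5.0, 9.0, 14.0]
print(pearson_r(feature_values, phq9_scores))  # negative, as for most features above
```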
The correlation analysis between the features and the PHQ-9 scores revealed that
location variance, circadian movement, number of clusters, entropy, normalized entropy,
and homestay showed the strongest correlations with the PHQ-9 score (absolute Pearson
coefficient |r| ≥ 0.35 at follow-up). These variables are all related to the movement of
the participants in different ways. The correlation between circadian
movement and location variance is interesting and indicates that participants with more
mobility also had more regular movement patterns.
These results indicate that students with more clusters, entropy, and normalized entropy
(more movement between different places) are less likely to be depressed. The results for
homestay suggest that students who spend more time at home are typically more likely
to experience depressive symptoms.
4.5 Models
This work hypothesizes that behavioral changes are related to the participant’s state of
depression and that an initial state is required to know their usual behavior. The
participant’s baseline state, which consists of the PHQ-9 clinical questionnaire applied
at the beginning of the study together with the passive data collected during the first
two weeks, provides prior knowledge of the participant’s behavior.
By comparing the features of the first two weeks and the last two weeks, we can
observe the difference in behavior that accompanies the increase or decrease in depression.
We carry out an empirical multivariate study with two experiments. Experiment 1 consists
of training and predicting the PHQ-9 values of the last period using only the passive data
of the last two weeks. Experiment 2 includes information from the first two weeks and the
last two weeks: the difference between both blocks tells us whether each feature of the
participant’s behavior increased or decreased, and the target is a PHQ-9 score that
consists of the increase or decrease in value between the initial PHQ-9 and the final
PHQ-9. For each experiment, a dataset is generated and used to train and fit several
linear models and a nonlinear model, an artificial neural network.
4.5.1 Score Estimation Model - Linear regression
Using the features extracted from the phone sensor data, we used a linear regression model
to estimate each participant’s PHQ-9 score. The model is defined in Eq. 4.5, where
n is the number of features and F_i is the i-th feature. The coefficients a_0, a_1, ..., a_n
were obtained by minimizing the squared error between the estimated and the true PHQ-9
scores (see Linear regression optimized, section 3.1.2).
\[ \text{Depression Score} = a_0 + a_1 F_1 + a_2 F_2 + \ldots + a_n F_n \tag{4.5} \]
4.5.2 Score Estimation Model - Multilayer Perceptron (MLP)
We propose a regression artificial neural network to predict the PHQ-9 score of each
student using the behavioral features described in section 4.3 (Feature Extraction). The
architecture configuration is specified in Figure 4.3.
[Figure 4.3: fully connected network diagram with an input layer (x1, ..., xm), four
hidden layers (h1, h2, h3, h4), and an output layer with a single neuron Y.]
Figure 4.3: The architecture of the Regression Artificial Neural Network was designed with
4 hidden layers, 10 neurons in each layer, and 1 neuron in the last layer. The Rectified
Linear Unit (ReLU) activation function is used in all layers. The first layer has one
input for each of the behavioral features i = 1, 2, ..., n, and Y is the output neuron that
predicts the PHQ-9 depression score.
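The forward pass of such a network can be sketched as follows. This is an untrained illustration with random weights, not the trained model of this work: the weight initialization, the function names, and the use of 9 input features are assumptions, while the layer sizes and the use of ReLU in every layer follow the caption of Figure 4.3:

```python
import random


def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Linear transformation W a + b of one fully connected layer."""
    return [
        sum(w * a for w, a in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def init_layer(n_inputs, n_outputs, rng):
    """Random weights and zero biases (initialization scheme is an assumption)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_outputs)]
    return weights, [0.0] * n_outputs


def build_network(n_features, rng, hidden_layers=4, width=10):
    """Layer sizes: n_features -> 10 -> 10 -> 10 -> 10 -> 1."""
    sizes = [n_features] + [width] * hidden_layers + [1]
    return [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def predict(network, features):
    """Feed the features forward; ReLU is applied in every layer, as in Figure 4.3."""
    activation = list(features)
    for weights, biases in network:
        activation = relu(dense(activation, weights, biases))
    return activation[0]


rng = random.Random(0)
network = build_network(n_features=9, rng=rng)
print(predict(network, [0.1] * 9))  # meaningless until the weights are trained
```

Training (fitting the weights by minimizing the squared error) is omitted here; the sketch only shows how the layers are chained.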
Chapter 5
Results
The results were divided according to Experiment 1 (See section 4.1) and Experiment 2
(See section 4.2).
5.1 Performance Models
The results of the experiments using the linear and non-linear models are presented in
Table 5.1. Experiment 1, based on the research of Saeb et al. (2015), used only the last
2 weeks. Experiment 2, the proposal of this work, used the behavioral change between 2
blocks, the initial 2 weeks and the last 2 weeks, to predict the PHQ-9 score.
                                   Experiment 1                 Experiment 2
                                   Based on research of         Proposed by Larrazolo
                                   Saeb et al. (2015)
Model                              Root Mean Square Error       Root Mean Square Error
Ridge                              3.7                          2.8
Elasticnet                         3.9                          2.8
Lasso                              4.1                          2.8
Linear Regression (OLS)            5.7                          7.7
Artificial Neural Network (ANN)    6.0                          2.7
Table 5.1: Comparison of the RMSE of linear and nonlinear models, using LOOCV as
the evaluation method, for Experiment 1 based on the research study of Saeb et al. (2015)
and Experiment 2 proposed by Larrazolo.
From the results of Experiment 1, the best score was obtained by the Ridge model with an
RMSE of 3.7, and the worst by the ANN with 6.0. The results of Experiment 1 cannot be
compared with the results of Saeb et al. because their dataset is not publicly available;
Experiment 1 follows their study but is applied to the Studentlife dataset. The results of
Experiment 2, proposed by Larrazolo, are more accurate than those of Experiment 1 for
every model except OLS. The best performing model of Experiment 2 was the Artificial
Neural Network with 2.7 RMSE, and the worst was OLS with 7.7 RMSE. The results
for each participant using the linear and non-linear models are shown in Figure 5.1 for
Experiment 1 and in Figure 5.2 for Experiment 2.
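The Root Mean Square Error used to compare the models can be sketched as below; the function name and the toy values are illustrative and are not results of the experiments:

```python
import math


def rmse(real_scores, predicted_scores):
    """Square root of the mean squared difference between real and predicted scores."""
    squared_errors = [(r - p) ** 2 for r, p in zip(real_scores, predicted_scores)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy values: every prediction is off by 2 PHQ-9 points.
print(rmse([4.0, 9.0, 12.0, 20.0], [6.0, 7.0, 14.0, 18.0]))
```

Because the errors are squared before averaging, a few large errors raise the RMSE more than many small ones.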
[Figure 5.1: five stacked panels, one per model (Ordinary Least Square OLS, Lasso, Ridge,
Elastic, Artificial Neural Network), each plotting the real and the predicted PHQ-9 score
(vertical axis) against the Student ID (horizontal axis).]
Figure 5.1: Results of Experiment 1. Subplots of the results of each model for the 38
students (the horizontal axis is the Student ID assigned in the Studentlife dataset).
Models from top to bottom: OLS, Lasso, Ridge, Elastic, and ANN.
[Figure 5.2: five stacked panels, one per model (Ordinary Least Square OLS, Lasso, Ridge,
Elastic, Artificial Neural Network), each plotting the real and the predicted PHQ-9 score
(vertical axis) against the Student ID (horizontal axis).]
Figure 5.2: Results of Experiment 2. Subplots of the results of each model for the 38
students (the horizontal axis is the Student ID assigned in the Studentlife dataset).
Models from top to bottom: OLS, Lasso, Ridge, Elastic, and ANN.
Chapter 6
Conclusions
The constant use of personal devices like smartphones and wearable devices has made it
possible to track people’s activities, such as the time they spend at home, the places
they commonly visit, and other features, which brings the prediction and diagnosis of
mental health closer.
Different linear and nonlinear models were presented to predict the PHQ-9 score based on
behavioral features. These models were applied in two scenarios, Experiment 1 and
Experiment 2. Experiment 1 was based on the study of Saeb et al. (2015) but applied to
the Studentlife dataset. The results are shown in Table 5.1; the best model was the Ridge
model with 3.7 RMSE, which means a root mean square error of 3.7 points with respect to
the real values of the PHQ-9 score.
Later, Experiment 2 was done as a new proposal to improve the accuracy according to
the observations and exploration of the data. The results are shown in Table 5.1; the best
model was the ANN with 2.7 RMSE. These results are better than those obtained by
Experiment 1. The results of Experiment 2 suggest that predicting depression levels in
a participant requires the context of an initial state and the change in their behavioral
patterns. This could explain why Experiment 2 had better results than Experiment 1.
The behavioral patterns of participants were characterized through features extracted
from mobile phone sensors. We proposed different models to predict the PHQ-9 scores. The
best performing model was the ANN in Experiment 2, with a 2.7 RMSE. The difference
between the linear models with regularization (2.8 RMSE) and the ANN (2.7 RMSE) is
small. The results indicate a typical difference of about 2.7 points from the
participants’ real PHQ-9 scores (on a scale from 0 to 27).
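To put an error of a few points in context, the sketch below maps a PHQ-9 score to the severity bands described by Kroenke et al. (2001): 0–4 minimal, 5–9 mild, 10–14 moderate, 15–19 moderately severe, 20–27 severe. The function name and the choice to round and clamp continuous predictions are assumptions made for illustration; they are not part of the thesis pipeline.

```python
# Severity bands from Kroenke et al. (2001): (inclusive upper bound, label).
PHQ9_BANDS = [
    (4, "minimal"),
    (9, "mild"),
    (14, "moderate"),
    (19, "moderately severe"),
    (27, "severe"),
]


def phq9_severity(score):
    """Return the severity label for a PHQ-9 score.

    Assumption: continuous model outputs are rounded to the nearest
    integer and clamped to the valid range 0..27 before lookup.
    """
    clipped = min(27, max(0, round(score)))
    for upper, label in PHQ9_BANDS:
        if clipped <= upper:
            return label
    raise AssertionError("unreachable: clipped score is always <= 27")


# Hypothetical example: a true score of 8 and a prediction 2.7 points higher.
print(phq9_severity(8))     # true score
print(phq9_severity(10.7))  # predicted score
```

Since each band is about five points wide, an error of roughly 2.7 points can move a prediction into an adjacent band when the true score lies near a boundary, as in the hypothetical example above.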
6.1 Future work
The work done in this thesis identifies important opportunities to explore other
options for detecting depression using digital phenotyping. This work used
the Studentlife dataset, which spans 10 weeks, during which only two PHQ-9 exams were
applied (in the first and last weeks). Exploring sequential models and applying clinical
tests more frequently is essential. Regular clinical tests open possibilities to experiment with other
models and have more robust training. One approach to explore could be time series models
or recurrent neural networks (RNN) that predict depression from more frequently sampled
sequences of behavioral patterns. These models could be personalized or general, but
collecting the information (features and clinical tests) periodically (daily, or two or three
times a week) could be a way to discover specific habits in the daily life of participants that
could be related to depression. Another approach to explore is adding other sensors and features
that could improve the models for predicting depression. Sleep time has been studied as
an important signal of depression, and this feature has been estimated using sensors
like the microphone, the light sensor, and the charging time of the smartphone (Chen et al., 2013).
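The sequential-modeling idea in this paragraph presupposes data arranged as windows of daily features, each paired with a clinical score. The sketch below shows one possible arrangement, under stated assumptions: the function name, the 7-day window, the feature names (hours at home, places visited), and all values are hypothetical, and no such pipeline was implemented in this thesis.

```python
def build_sequences(daily_features, assessments, window_days=7):
    """Pair each assessment with the window of daily features that precedes it.

    daily_features: dict mapping day index -> list of feature values.
    assessments:    dict mapping day index -> PHQ-9 score on that day.
    Returns a list of (sequence, score) tuples, where sequence holds
    window_days consecutive daily feature lists ending on the assessment day.
    Assessments without a complete window of sensor data are skipped.
    """
    samples = []
    for day in sorted(assessments):
        window = range(day - window_days + 1, day + 1)
        if all(d in daily_features for d in window):
            sequence = [daily_features[d] for d in window]
            samples.append((sequence, assessments[day]))
    return samples


# Hypothetical data: 14 days of [hours_at_home, places_visited] per day.
daily = {d: [12.0 + d % 3, 2 + d % 4] for d in range(14)}
# Hypothetical weekly PHQ-9 scores; day 20 has no sensor data and is dropped.
weekly_phq9 = {6: 5, 13: 8, 20: 7}

samples = build_sequences(daily, weekly_phq9)
print(len(samples), "training sequences of length", len(samples[0][0]))
```

Sequences of this shape could then be fed to a time series model or an RNN; with more frequent clinical tests, each participant would contribute more labeled windows.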
An important problem in this study was finding a dataset that collected data from different
sensors over a long period with sufficient participants. Future work could create an integrated
platform to conduct digital phenotyping investigations and collect data from mobile
phones and wearables. Such a platform would provide not only more data but also data
collected according to specific requirements.
References
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Chen, Z., Lin, M., Chen, F., Lane, N. D., Cardone, G., Wang, R., . . . Campbell, A. T.
(2013). Unobtrusive sleep monitoring using smartphones. In 2013 7th international
conference on pervasive computing technologies for healthcare and workshops (pp. 145–152).
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. (http://www.deeplearningbook.org)
Kroenke, K., Spitzer, R., & Williams, J. (2001, September). The PHQ-9: validity of
a brief depression severity measure. Journal of general internal medicine, 16 (9),
606–613. Retrieved from https://europepmc.org/articles/PMC1495268 doi:
10.1046/j.1525-1497.2001.016009606.x
Mikelsons, G., Smith, M., Mehrotra, A., & Musolesi, M. (2017). Towards deep learning
models for psychological state prediction using smartphone data: Challenges and
opportunities. arXiv. Retrieved from https://arxiv.org/abs/1711.06350 doi:
10.48550/ARXIV.1711.06350
Mohr, D. C., Zhang, M., & Schueller, S. M. (2017, March). Personal sensing: Understanding
mental health using ubiquitous sensors and machine learning. Annu Rev Clin Psychol, 13, 23–47.
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. Springer US.
https://doi.org/10.1007/978-0-387-39940-9_565
Saeb, S., Lattie, E. G., Schueller, S. M., Kording, K. P., & Mohr, D. C. (2016). The relationship between
mobile phone location sensor data and depressive symptom severity. PeerJ, 4, e2537.
Retrieved from https://doi.org/10.7717/peerj.2537/ doi: 10.7717/peerj.2537
Saeb, S., Zhang, M., Karr, C. J., Schueller, S. M., Corden, M. E., Kording, K. P., & Mohr,
D. C. (2015, Jul 15). Mobile phone sensor correlates of depressive symptom severity
in daily-life behavior: An exploratory study. J Med Internet Res, 17 (7), e175.
Retrieved from http://www.jmir.org/2015/7/e175/ doi: 10.2196/jmir.4273
Spinazze, P., Rykov, Y., Bottle, A., et al. (2019). Digital phenotyping for assessment and
prediction of mental health outcomes: a scoping review protocol. BMJ Open 2019.
https://bmjopen.bmj.com/content/9/12/e032255.info doi: 10.1136/bmjopen-2019-032255
Statista. (2019). Number of connected wearable devices worldwide from 2016 to 2022.
https://www.statista.com/statistics/487291/global-connected-wearable-devices/. ([Accessed 20-April-2022])
Torous, J., Staples, P., Barnett, I., et al. (2018). Characterizing the clinical relevance of
digital phenotyping data quality with applications to a cohort with schizophrenia.
npj Digital Med. https://doi.org/10.1038/s41746-018-0022-8
VanderPlas, J. T. (2018, May). Understanding the Lomb–Scargle periodogram. The Astrophysical
Journal Supplement Series, 236 (1), 16. Retrieved from
https://doi.org/10.3847%2F1538-4365%2Faab766 doi: 10.3847/1538-4365/aab766
Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., . . . Campbell, A. T. (2014).
Studentlife: Assessing mental health, academic performance and behavioral trends
of college students using smartphones. In Proceedings of the 2014 ACM international
joint conference on pervasive and ubiquitous computing (pp. 3–14). New York, NY,
USA: Association for Computing Machinery. Retrieved from
https://doi.org/10.1145/2632048.2632054 doi: 10.1145/2632048.2632054
Wang, R., Wang, W., daSilva, A., Huckins, J. F., Kelley, W. M., Heatherton, T. F., &
Campbell, A. T. (2018, March). Tracking depression dynamics in college students using
mobile phone and wearable sensing. Proc. ACM Interact. Mob. Wearable Ubiquitous
Technol., 2 (1). Retrieved from https://doi.org/10.1145/3191775 doi: 10.1145/3191775
Weisstein, E. W. (2022). Coefficient. Retrieved from
https://mathworld.wolfram.com/Coefficient.html ([Online; accessed 11-October-2022])
World Health Organization. (2021). Depression.
https://www.who.int/news-room/fact-sheets/detail/depression. ([Accessed 22-April-2022])
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67 (2),
301–320. Retrieved 2022-07-12, from http://www.jstor.org/stable/3647580