Instituto Politécnico Nacional
Centro de Investigación en Computación

Development of Machine Learning and Deep Learning Algorithms to Detect Depression in Students through Digital Phenotyping

T E S I S
que para obtener el título de:
MAESTRÍA EN CIENCIAS DE LA COMPUTACIÓN
presenta:
ABRAHAM LARRAZOLO BARRERA

Tutores: Dr. Adolfo Guzmán Arenas, Dr. Gilberto Lorenzo Martínez Luna

Estados Unidos Mexicanos, Ciudad de México, 2022

Contents

Resumen
Abstract
Acknowledgements
1 Introduction
1.1 Context
1.2 Objectives
1.2.1 General objective
1.2.2 Specific objectives
1.3 Contributions
2 State of the art
2.1 Methods of Machine Learning and Diagnosing Depression
2.1.1 Personal Sensing: Understanding Mental Health Using Ubiquitous Sensors and Machine Learning
2.1.2 Mobile Phone Sensor Correlates of Depressive Symptom Severity in Daily-Life Behavior: An Exploratory Study
2.1.3 Studentlife: Assessing mental health, academic performance, and behavioral trends of college students using smartphones
2.1.4 The relationship between mobile phone location sensor data and depressive symptom severity
2.1.5 Towards Deep Learning Models for Psychological State Prediction using Smartphone Data: Challenges and Opportunities
3 Theoretical framework
3.1 Machine learning models
3.1.1 Linear regression
3.1.2 Linear regression optimized
3.1.3 Artificial Neural Network (ANN)
3.2 Pearson correlation
3.3 Cross validation
3.4 Exam PHQ-9 - Depression Score
3.5 Dataset Studentlife
3.5.1 PHQ-9 Scores
4 Proposed Solution
4.1 Experiment 1
4.2 Experiment 2
4.3 Feature Extraction
4.4 Correlation of features with PHQ9 score
4.5 Models
4.5.1 Score Estimation Model - Linear regression
4.5.2 Score Estimation Model - Multilayer Perceptron (MLPs)
5 Results
5.1 Performance Models
6 Conclusions
6.1 Future work

List of Figures

2.1 Features depicted in the layer above the inputs. Raw sensor data is transformed into features from the sensor data. Source from Mohr, Zhang, and Schueller (2017)
3.1 Neural Network Architecture. The weight w^l_jk connects the k-th neuron in the (l−1)-th layer to the j-th neuron in the l-th layer, and a^l_j is the activation of the j-th neuron in the l-th layer. The nonlinear function is represented by σ; applying this function to the output of a linear transformation yields a nonlinear transformation.
3.2 PHQ-9 score distribution at baseline and at the end-of-study follow-up. Adapted from Studentlife by Wang et al. (2018)
4.1 Experiment 1 predicts depression based on the PHQ-9 score using the features extracted from students' mobile phones only during the last 2 weeks of the 10 weeks of Studentlife.
4.2 Experiment 2 predicts depression based on the PHQ-9 score using the features of two blocks, the initial 2 weeks and the last 2 weeks of the Studentlife dataset.
4.3 The architecture of a Regression Artificial Neural Network designed with 4 hidden layers, 10 neurons in each layer, and 1 neuron in the last layer. The activation function used in all layers is the Rectified Linear Unit (ReLU). The first layer has one input for each of the behavioral features i = 1, 2, ..., n, and Y is the output neuron that predicts the depression PHQ-9 score.
5.1 Results of Experiment 1. Subplots of the results for each model for the 38 students (the horizontal axis is the Student ID assigned in the Studentlife dataset). Models in descending order: OLS, Lasso, Ridge, Elastic, and ANN.
5.2 Results of Experiment 2. Subplots of the results for each model for the 38 students (the horizontal axis is the Student ID assigned in the Studentlife dataset). Models in descending order: OLS, Lasso, Ridge, Elastic, and ANN.

List of Tables

2.1 Classification of participants with and without depressive symptoms and estimation of their PHQ-9 scores using location features individually and aggregated. Source from Saeb et al. (2015)
2.2 Correlations between automatic sensor data and the PHQ-9 depression scale. Source from Wang et al. (2018)
3.1 Level of depression according to PHQ-9 scores
3.2 PHQ-9 depression scale interpretation for both exams (pre and post). Adapted from Studentlife by Wang et al. (2018)
4.1 Coefficients of correlation between location features and PHQ-9 scores at baseline. Pearson's coefficients in Larrazolo's study are similar to Saeb et al. (2016). This similarity is important because the coding of features is based on Saeb et al. (2016) but will be used to explore different models and scenarios.
4.2 Coefficients of correlation between location features and PHQ-9 scores at follow-up. Pearson's coefficients in Larrazolo's study are similar to Saeb et al. (2016). This similarity is important because the coding of features is based on Saeb et al. (2016) but will be used to explore different models and scenarios.
5.1 Comparison of RMSE results of linear and nonlinear models using LOOCV as the evaluation method for Experiment 1, based on the research study of Saeb et al. (2015), and Experiment 2, proposed by Larrazolo.

Resumen

La depresión es una enfermedad frecuente en todo el mundo. Aproximadamente 280 millones de personas alrededor del mundo tienen depresión de acuerdo con la Organización Mundial de la Salud, World Health Organization (2021). Las crecientes capacidades y el mayor número de sensores en los dispositivos personales como el teléfono, las pulseras y los relojes inteligentes han generado gran interés en el área de la salud debido a la información de los datos que dichos dispositivos generan pasivamente. Las investigaciones en las que se usa el fenotipado digital para predecir indicadores de enfermedades mentales como la depresión y el estrés han ido en aumento. En el presente trabajo se muestra el análisis y la exploración de diferentes modelos de aprendizaje máquina y aprendizaje profundo usados para predecir los niveles de depresión de estudiantes a partir de los datos obtenidos mediante dispositivos móviles. Empleamos modelos lineales y redes neuronales artificiales para la predicción de los resultados; los niveles de depresión fueron analizados y comparados con base en el examen PHQ-9. El PHQ-9 es un examen clínico usado para la detección de la depresión; consta de 9 preguntas y su puntaje tiene un rango de 0 a 27. Las características que usamos para los modelos fueron obtenidas mediante un preprocesamiento de los datos de los sensores del teléfono móvil. Los resultados obtenidos para los modelos de regresión lineal fueron de 7.7 (Root Mean Squared Error, RMSE) para el regresor lineal clásico (Ordinary Least Squares); los resultados de este modelo con regularización Ridge fueron de 2.8 RMSE y con Lasso de 2.8 RMSE. El mejor modelo fue la arquitectura de red neuronal, con 2.7 RMSE.

Abstract

Depression is a common illness throughout the world. Approximately 280 million people around the world have depression, according to the World Health Organization (2021). The growing capabilities and number of sensors in personal devices such as phones, smart bands, and smartwatches have generated significant interest in the health field because of the information contained in the data these devices passively generate. Research using digital phenotyping to predict indicators of mental illness, such as depression and stress, has increased. The present work shows the analysis and exploration of different machine learning and deep learning models used to predict the levels of depression of students from data obtained through mobile devices. We used linear models and artificial neural networks to predict the results; depression levels were analyzed and compared based on the PHQ-9 exam. The PHQ-9 is a clinical exam used to detect depression; it consists of 9 questions, and its score ranges from 0 to 27. The features we used for the models were obtained by preprocessing the data from the mobile phone sensors. The results obtained for the linear regression models were a root mean squared error (RMSE) of 7.7 for the classic linear regressor (Ordinary Least Squares); the results of this model with Ridge regularization were 2.8 RMSE and with Lasso 2.8 RMSE. The best model was the neural network architecture, with 2.7 RMSE.

Acknowledgements

First of all, I thank my family for always supporting me. Especially my mother, who has always shown me her unconditional love. My sister and brother are both an inspiration to explore new limits in knowledge.
Last but not least, to my partner Nancy, who supported me in many ways on this path to achieving the master's degree.

My sincere thanks go to Dr. Adolfo Guzmán and Dr. Gilberto Martínez, who provided me with their invaluable experience and knowledge to complete this program. I express my deepest gratitude for their constant guidance and support throughout this program.

A special mention of thanks to Dr. Mercedes Balcells for the opportunity to work on the Digital Phenotyping project, and to the Institute for Medical Engineering and Science (IMES-MIT) for the research stay.

Thanks to the Centro de Investigación en Computación (CIC) at Instituto Politécnico Nacional (IPN) and its administration for allowing me to study this important program. I also thank the Consejo Nacional de Ciencia y Tecnología (CONACYT) for its financial support, which provided invaluable opportunities for me to pursue my academic goals.

Chapter 1
Introduction

1.1 Context

Depression is a mental disturbance affecting an individual's thinking and mental development. According to the World Health Organization (2021), approximately 280 million people have depression. Mental illnesses are diagnosed based on symptoms, patient interviews, and self-reports. The use of mobile devices for sensing to track and infer behavioral health is growing because mobile devices like smartphones, tablets, and wearables have sensors that could improve mental health monitoring.

Digital phenotyping was defined as 'moment-by-moment quantification of the individual-level human phenotype in situ using data from digital devices' by Torous (2018). Spinazze (2019) defined digital phenotyping as 'the process of inferring individual behavior from digital data generated through human interaction with electronic devices, including both physical hardware and software.' Ubiquitous sensing technologies like smartphones allow the continuous collection of data about individuals, passively through built-in sensors or connected devices and actively via user engagement. Passive data from sensors and applications are transformed into states related to mental health.

The increase in the number of smartphones and their sensors has driven the adoption of these devices for monitoring mental health. In recent years, the number of connected wearable devices worldwide has increased and is expected to exceed 1 billion by 2022, according to Statista (2019). Machine learning and deep learning methods have been applied to continuous sensor data to predict mental health status, allowing the detection of different conditions such as depression, anxiety, and stress. These techniques aim to create models that learn to recognize complex patterns.

1.2 Objectives

1.2.1 General objective

Given the growing use of machine learning, deep learning, and the information provided by smartphones and wearables to detect mental illnesses such as depression, the objective of this work is to explore and apply machine learning and deep learning models that predict the level of depression in students according to the score of the PHQ-9 clinical test, using data obtained from mobile phone sensors, and to find behavioral characteristics that could be indicative for the prediction of depression.

1.2.2 Specific objectives

• Identify behavioral features from smartphone sensors with the potential to be used in the depression predictor models.
• Explore and apply the linear regression model with its regularizations, as well as artificial neural network models, using the Studentlife dataset by Wang et al. (2018), to predict the level of depression according to the score of the clinical test PHQ-9.

• Describe differences between the explored models, identifying approaches and specific characteristics.

1.3 Contributions

This thesis made the following contributions:

• Developed features: coding the features that might be indicative for detecting the severity of depression through mobile phone sensors.

• Depression detection models: designing four models for measuring the severity of depression in students using features extracted from mobile phone sensors.

• Exploring different time scenarios to identify the change of behavior in students, using the features extracted from mobile phone sensors to detect the severity of depression.

Chapter 2
State of the art

This section describes relevant literature and its main contributions related to digital phenotyping using information from smartphones and wearables. Most of these works applied machine learning and deep learning methods to detect levels of depression.

2.1 Methods of Machine Learning and Diagnosing Depression

2.1.1 Personal Sensing: Understanding Mental Health Using Ubiquitous Sensors and Machine Learning

Mohr et al. (2017) describe sensors in devices such as phones, wearables, and computers used to create a digital trace of personal sensing: collecting and analyzing data from sensors embedded in the context of daily life to identify human behaviors, thoughts, feelings, and traits. Figure 2.1 provides a layered hierarchical model for translating raw sensor data into markers of behaviors and states related to mental health. Research methods and challenges are also discussed, including privacy and dimensionality problems. Although personal sensing is still in its infancy, it holds great promise as a method for conducting mental health research and as a clinical tool for monitoring at-risk populations and providing the foundation for the next generation of mobile health interventions.

The researchers examined the possibility of using smartphone sensor data to detect the presence and severity of mental health disorders, including depression, bipolar disorder, and schizophrenia. Specifically for depression, they reviewed different works and concluded that GPS features appear to predict depression, although the relationship between depression and subsequent GPS features degrades quickly over time, suggesting that a lack of mobility may be an early warning signal of depression. They review three commonly used machine-learning analytical methods: supervised learning, unsupervised learning, and semi-supervised learning.

Figure 2.1: Features depicted in the layer above the inputs. Raw sensor data is transformed into features from the sensor data. Source from Mohr et al. (2017)

The work of Mohr et al. (2017) is a layered reference framework that describes the transformation from sensors to clinical states, based on a collection of different studies. In our work, we used some sensors like Location (GPS) and Ambient light, referenced in Figure 2.1, to create new location-type features, and we used expanded features like Circadian Movement, Number of clusters, Total distance, etc. These characteristics are described in detail in Section 4.3 (Feature Extraction).
Another difference of our work is the development and testing of models to predict the PHQ-9 depression score.

2.1.2 Mobile Phone Sensor Correlates of Depressive Symptom Severity in Daily-Life Behavior: An Exploratory Study

Saeb et al. (2015) explored the detection of daily-life behavioral markers and the identification of depressive symptom severity using mobile phone global positioning system (GPS) data. The study considered a total of 40 participants using a sensor data acquisition app for 2 weeks. At the beginning of the 2-week period, participants completed a self-reported depression survey (PHQ-9). Behavioral features were developed and extracted from GPS location and phone usage data. Linear regression and logistic regression were used to estimate each participant's PHQ-9 score using features from the phone sensor data.

Data preprocessing was used to extract features from both GPS location and phone usage data. One procedure was to classify whether each GPS location data sample came from a stationary state or a transition state. This procedure estimated the movement speed at each location by calculating its time derivative and then used a threshold speed that defined the boundary between these states; the threshold proposed in this study was 1 km/h. Another procedure was clustering the samples in the stationary state to detect the places where participants spent most of their time. The K-means algorithm was used for this procedure: the GPS location data samples were partitioned into K clusters such that the distance of the data points to the cluster centers was minimized. The process was iterative because the number of clusters was unknown, so it started with one cluster and increased the number of clusters until the distance from the farthest point in each cluster to its centroid fell below a threshold. This threshold, the maximum radius of a cluster, was set to 500 meters.

The features extracted from GPS characterize each participant's behavior. This research investigated the correlation of each feature with the PHQ-9 score obtained from an exam at the beginning of the study. Participants with a PHQ-9 score ≥ 5 were classified as having depressive symptoms, and participants with PHQ-9 < 5 as not having depressive symptoms. The models proposed in the study to estimate depression states from features were a linear regression (score estimation model) with elastic-net regularization and a logistic regression (classification model).

Several features obtained from GPS data were related to the PHQ-9 score. Only 28 of the 40 participants were considered for data analysis due to insufficient information. The correlation between the features and the PHQ-9 score revealed that 6 of the 10 features were significantly correlated; circadian movement, normalized entropy, and location variance had strong correlations according to Pearson's coefficient. Table 2.1 shows the results of training models using individual features and then all of them combined. The metrics for the classification model are mean sensitivity and mean specificity. The sensitivity represented in Eq. 2.1 is also known as recall, which is the ratio of the correct depression predictions by the model to all who are depressed (PHQ-9 ≥ 5).

Sensitivity = Recall = True Positives / (True Positives + False Negatives)    (2.1)

The specificity represented in Eq. 2.2 is the ratio of the correct non-depression predictions by the model to all who are non-depressed (PHQ-9 < 5).
Specificity = True Negatives / (True Negatives + False Positives)    (2.2)

The classification carried out by Saeb used only two classes: PHQ-9 less than 5 (PHQ-9 < 5) and PHQ-9 greater than or equal to 5 (PHQ-9 ≥ 5). Even though the PHQ-9 groups the score into 5 categories (see Section 3.4, Exam PHQ-9 - Depression Score, for detailed information), Saeb used the threshold of 5 since it bounds the Minimal category, which is the lowest level of depression. There are also reasons to use a regressor, because there are more than two possible outcomes and the numerical value given by a regression model is more precise. The metric for the predictor model was the normalized root-mean-square deviation (NRMSD), also known as the normalized root-mean-square error (RMSE), which measures the difference between the PHQ-9 scores estimated by the model for the test participants and their true scores and then applies a normalization. These results indicate how close the predicted value of depression was to the actual value. The results showed that models trained with features that had a stronger correlation with the PHQ-9 were better at detecting participants with depressive symptoms than those with none.

Training features   | Classification (PHQ-9<5 vs PHQ-9≥5): %mean accuracy (SD) | %mean sensitivity | %mean specificity | PHQ-9 score estimation: Mean NRMSD (SD)
Entropy             | 69.7 (3.5) | 66.8 | 72.7 | 0.262 (0.017)
Normalized entropy  | 86.5 (3.4) | 88.4 | 84.9 | 0.235 (0.016)
Location variance   | 75.7 (4.6) | 80.2 | 71.5 | 0.229 (0.014)
Home stay           | 75.9 (4.9) | 80.5 | 71.7 | 0.253 (0.015)
Transition time     | 41.1 (9.2) | 43.4 | 38.7 | 0.303 (0.020)
Total distance      | 56.4 (6.6) | 69.6 | 43.4 | 0.343 (0.041)
Circadian movement  | 78.6 (4.1) | 80.1 | 77.5 | 0.222 (0.014)
Number of clusters  | 41.5 (8.9) | 47.4 | 35.5 | 0.305 (0.022)
All                 | 78.8 (6.2) | 83.6 | 74.5 | 0.251 (0.023)

Table 2.1: Classification of participants with and without depressive symptoms and estimation of their PHQ-9 scores using location features individually and aggregated. Source from Saeb et al. (2015)

Saeb's results concluded that features extracted from mobile phone sensor data, such as GPS and phone usage, are related to depressive symptom severity. The work of Saeb et al. (2015) extracted features and developed linear regression models to predict the PHQ-9. In our work, we used some of the features exposed by Saeb, among others. We tested not only linear models and their optimizations (Lasso, Ridge, and Elastic Net), but also deep learning, specifically Artificial Neural Networks. Another difference is that the dataset used in our work allows us to make predictions using two blocks of 2 weeks (initial PHQ-9 and final PHQ-9), which means that we could explore the difference in behavior and PHQ-9 score between these blocks.

2.1.3 Studentlife: Assessing mental health, academic performance, and behavioral trends of college students using smartphones

Wang et al. (2018) collected data and measured mental health, stress, and academic performance. This study, named Studentlife, followed 48 students across 10 weeks using an app that continuously collects data from sensors such as the accelerometer, microphone, light sensor, proximity sensor, GPS/Bluetooth, and application usage. The goal was to assess the impact of workload on stress, sleep, activity, mood, sociability, mental well-being, and academic performance. The voluntarily recruited participants were 18 graduate and 30 undergraduate students from Dartmouth College.
Android Nexus 4 smartphones were offered to complete the assignments, and many of the students used their own iPhones or Android phones; 60% of users had an iPhone and 40% an Android phone. For this research, the team developed the Studentlife smartphone app to automatically infer human behavior in an energy-efficient manner. The Studentlife app provides a framework for automatic sensing to infer activity (stationary, walking, running, driving, cycling), sleep duration, and sociability. Data collection during the 10 weeks used the app on phones carried by the students throughout the day; sensing was automatic, without any interaction by the users. During the study, 2 PHQ-9 exams were applied, one at the beginning of the research and the other at the end (week 10).

The research studied the correlation between automatic sensing data from smartphones and mental health. The degree of correlation was calculated using Pearson's correlation; see Table 2.2. They found significant correlations between sleep duration, conversation frequency and duration, and PHQ-9 depression.

Automatic sensing data                       | r (Pearson correlation coefficient) | p-value
sleep duration (pre)                         | −0.360 | 0.025
sleep duration (post)                        | −0.382 | 0.020
conversation frequency during day (pre)      | −0.403 | 0.010
conversation frequency during day (post)     | −0.387 | 0.016
conversation frequency during evening (post) | −0.345 | 0.034
conversation duration during day (post)      | −0.328 | 0.044
number of co-locations (post)                | −0.362 | 0.025

Table 2.2: Correlations between automatic sensor data and the PHQ-9 depression scale. Source from Wang et al. (2018)

One inference from the results in Table 2.2 is the negative correlation between sleep duration and both the pre (r = −0.360, p = 0.025) and post (r = −0.382, p = 0.020) depression tests, pointing to the inability to sleep, which is one of the key signs of clinical depression. Other findings in line with this are the significant negative correlations between conversation frequency during the day epoch and the pre (r = −0.403, p = 0.010) and post (r = −0.387, p = 0.016) depression tests, between evening conversation frequency (r = −0.345, p = 0.034) and the depression score, and between conversation duration and the depression score (r = −0.328, p = 0.044). This information indicates that students with fewer conversational interactions are more likely to be depressed.

The work of Wang et al. (2018) collected the Studentlife dataset, explained the extraction of features from mobile sensors, and reported the correlation of the relevant variables shown in Table 2.2 with the PHQ-9 score. In our work, we also extracted features and calculated the correlation of each feature with the PHQ-9. One difference is that we use the features to train and test the developed models to predict the PHQ-9 score.

2.1.4 The relationship between mobile phone location sensor data and depressive symptom severity

Saeb et al. (2016) replicated and extended the previous work of Saeb et al. (2015) on detecting depression from collected phone sensor data, using geographic location sensors to identify depressive symptom severity. Saeb used the Studentlife dataset by Wang et al. (2014) to demonstrate that GPS features may be an important and reliable predictor of depression severity. Another extension examined the relationships between workdays and non-workdays.
Movement on workdays is likely determined to some degree by social roles and expectations, while movement on non-workdays is likely less determined by external demands and more by the individual's motivational state. Features were extracted from the GPS sensor values using the same eight features as Saeb et al. (2015), plus three new ones: speed mean, speed variance, and raw entropy. They evaluated the relationship between each set of features (10 weeks and 2 weeks, weekends and weekdays) and depressive symptom severity as measured by the PHQ-9 score, considering the coefficients significant when their associated p-values fell below 0.05. In the correlation analyses of the 10-week location features, location variance, circadian movement, entropy, and number of clusters all had absolute Pearson correlation coefficients |r| ≥ 0.4 with the follow-up PHQ-9 scores, so GPS features were significantly related to PHQ-9 scores. The analysis also examined 2-week GPS features divided into weekdays and weekends. Correlations of features on weekends were stronger than on weekdays: correlations between weekend location features and the follow-up PHQ-9 scores generally remained significant regardless of the time point at which the features were extracted, and this was also true for three out of the six features extracted on weekdays, including home stay, entropy, and normalized entropy. Saeb's study supports the potential of smartphone sensor technology to provide biomarkers of depression in daily life.

The work of Saeb et al. (2016) was an extension of the previous research of Saeb et al. (2015), but this time it did not use the linear model to predict the PHQ-9 depression score. The differences between Saeb's work and ours are the extraction of the features mentioned by Saeb and the development and training of linear and nonlinear (ANN) models: we explored linear regression and its different regularized versions, as well as other models like Artificial Neural Networks.

2.1.5 Towards Deep Learning Models for Psychological State Prediction using Smartphone Data: Challenges and Opportunities

Mikelsons, Smith, Mehrotra, and Musolesi (2017) investigated the effectiveness of neural network models for predicting users' stress levels using the location information collected by smartphones. The mobility patterns of the participants were characterized using GPS metrics, which were then used as input to the network. They used the users' stress states collected from questionnaires in the Studentlife dataset, applying a technique of daily averaging and rescaling the data into three stress levels: below-median, median, and above-median. Eight spatio-temporal metrics were used as predictors: total distance covered in a day, maximum 2-point displacement in a day, distance standard deviation, number of different areas visited by tile approximation, total spatial coverage by the convex hull, the difference in the sequence of tiles covered compared to the previous day, and distance entropy. The Artificial Neural Network used comprised four fully connected layers of 57, 35, 35, and 3 neurons. The first three layers use tanh activation and batch normalization, as well as dropout regularization with rates of 0.35, 0.25, and 0.15.
The final layer is a softmax-activation layer with batch normalization and three output nodes corresponding to the three stress classes. Using all 12 input features, they reported an F1 score of 0.42, a precision of 0.43, and a recall of 0.45.

The research of Mikelsons et al. (2017) used features extracted from GPS and an Artificial Neural Network to classify stress levels. Our experiment instead seeks to predict depression as a numerical value, which is more precise than a coarse classification. We explored different models (linear and nonlinear); in particular, the multilayer neural network regressor was used to predict the students' actual PHQ-9 score.

Chapter 3
Theoretical framework

3.1 Machine learning models

This section details the different models used in this work to predict the PHQ-9 score. To provide appropriate context, some important concepts are defined before explaining the details of the models:

• feature: an individual measurable property or characteristic of a phenomenon (Bishop (2006)).
• parameter: a configuration variable that is internal to the model and whose value can be estimated from data.
• coefficient: a multiplicative factor, such as one of the constants a_i in a polynomial (Weisstein, Eric W. (2022)).
• sampling (statistical sample): the process of selecting a number of cases from all the cases in a particular group or universe.
• target variable: the variable whose values are modeled and predicted by other variables.

3.1.1 Linear regression

In statistics, linear regression is a linear approach for modeling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. Linear regression can be used to fit a predictive model to an observed data set of values. This model explains the variation in the response variable that can be attributed to variation in the explanatory variables, which means linear regression analyses can be applied to quantify the strength of the relationship between the response and the explanatory variables.

3.1.2 Linear regression optimized

The least squares linear regression approach adjusts the parameters of the score estimation. This method performs well as long as the number of features relative to the number of samples is low; otherwise, the model overfits the data. To minimize the overfitting problem, we used regularization. Regularization of linear regression prevents the coefficients from becoming too large by adding a penalty.

Ridge Regression

Ridge Regression (Zou and Hastie (2005)) is a regularized method to reduce overfitting. In Ridge Regression, the Ordinary Least Squares (OLS) loss function is augmented so that we not only minimize the sum of squared residuals but also penalize the size of the parameter estimates in order to shrink them towards zero. In Eqs. 3.1, 3.2, and 3.3, y_i represents the target to predict (in this work, the PHQ-9 score), x′_i the features, β̂ the coefficients, n the number of elements in the sample, and m the number of coefficients (features).
Ridge Regression (also known as L2 regularization) introduces a bias λ; the penalty function is:

L_Ridge(β̂) = Σ_{i=1..n} (y_i − x′_i β̂)² + λ Σ_{j=1..m} β̂_j²    (3.1)

So, setting λ to 0 is the same as using OLS, while the larger its value, the more strongly the size of the coefficients is penalized.

Lasso Regression

Lasso Regression (also known as L1 regularization) (Zou and Hastie (2005)) is conceptually quite similar to Ridge Regression. It also adds a penalty for non-zero coefficients, but this model penalizes the sum of their absolute values, and for high values of λ many coefficients become exactly zero. It is a regularization method similar to Ridge Regression, except that the penalty function is

L_Lasso(β̂) = Σ_{i=1..n} (y_i − x′_i β̂)² + λ Σ_{j=1..m} |β̂_j|    (3.2)

Elastic Net Regression

Elastic Net (Zou and Hastie (2005)) combines the Ridge and Lasso penalties to get the best of both. Elastic Net aims at minimizing the following loss function:

L_enet(β̂) = (1/(2n)) Σ_{i=1..n} (y_i − x′_i β̂)² + λ [ ((1−α)/2) Σ_{j=1..m} β̂_j² + α Σ_{j=1..m} |β̂_j| ]    (3.3)

In Eq. 3.3, the hyperparameter α is the degree of influence of each penalty, L1 and L2, and its value lies in the interval [0, 1]. When α = 0, Ridge is applied, and when α = 1, Lasso is applied.

3.1.3 Artificial Neural Network (ANN)

Multilayer perceptrons (MLPs) (Goodfellow, Bengio, and Courville (2016)) have the goal of approximating some function f*. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation. These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. Feedforward neural networks are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed. For example, we might have three functions f(1), f(2), and f(3) connected in a chain to form f(x) = f(3)(f(2)(f(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and so on. These models are called neural because they were inspired by neuroscience. The hidden layers of the network are vector-valued, and the dimensionality of these hidden layers determines the width of the model. Figure 3.1 illustrates the feedforward process in an Artificial Neural Network with hidden layers.

Regularization

When the network tries to learn from a small dataset, it tends to fit all the data points exactly: the network memorizes every single data point and fails to capture the general trend of the training dataset. Dropout regularization is one technique used to tackle overfitting problems in deep learning. During training, some layer outputs are ignored or dropped at random. This makes the layer appear, and be treated, as a layer with a different number of nodes and different connectivity to the preceding layer. In practice, each layer update during training is carried out with a different view of the configured layer. Dropout makes the training process noisy, requiring nodes within a layer to take on more or less responsibility for the inputs on a probabilistic basis.
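To make the preceding models concrete, the following is a minimal sketch, not the thesis code, of how the linear regressors and the MLP described above can be instantiated with scikit-learn. The data here is synthetic and only mimics the shapes used later in this work (38 students, 9 behavioral features, PHQ-9 scores from 0 to 27), and the hidden-layer sizes follow the regression ANN of Section 4.5.2; note that scikit-learn's MLPRegressor does not implement the dropout technique described above, which would require a deep learning framework such as Keras.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 38 participants, 9 behavioral
# features, and PHQ-9 scores in [0, 27]. Only the shapes are meaningful.
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 9))
y = rng.integers(0, 28, size=38).astype(float)

# scikit-learn's `alpha` plays the role of lambda in Eqs. 3.1-3.3,
# and `l1_ratio` the role of alpha (the L1/L2 mixing weight) in Eq. 3.3.
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    # Four hidden layers of 10 ReLU neurons and one linear output neuron,
    # as in the regression ANN of Section 4.5.2 (Figure 4.3).
    "ANN": MLPRegressor(hidden_layer_sizes=(10, 10, 10, 10),
                        activation="relu", max_iter=5000, random_state=0),
}

for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)  # put features on a common scale
    pipe.fit(X, y)
    rmse = float(np.sqrt(np.mean((pipe.predict(X) - y) ** 2)))  # training RMSE only
    print(f"{name:10s} training RMSE = {rmse:.2f}")
```

In the experiments, the models are evaluated with leave-one-out cross-validation (Section 3.3) rather than the training error shown here.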
3.2 Pearson correlation

The Pearson correlation coefficient r is a descriptive statistic that describes the strength and direction of the linear relationship between two quantitative variables. It is the ratio between the covariance of two variables and the product of their standard deviations. The mathematical definition of the Pearson correlation is Eq. 3.4, where cov is the covariance, σ_X is the standard deviation of X, and σ_Y is the standard deviation of Y. The Pearson correlation normalizes the measurement of the covariance, so its values always lie between −1 and 1.

ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y)    (3.4)

Figure 3.1: Neural Network Architecture. The weight w^l_jk connects the k-th neuron in the (l−1)-th layer to the j-th neuron in the l-th layer, and a^l_j is the activation of the j-th neuron in the l-th layer; in matrix form, a^(1) = σ(W^(0) a^(0) + b^(0)). The nonlinear function is represented by σ; applying this function to the output of a linear transformation yields a nonlinear transformation.

3.3 Cross validation

Cross-validation is a statistical method of evaluation (Refaeilzadeh, Tang, and Liu (2009)). It consists of dividing the data into training and test datasets. The training dataset is used to train a model, and the test dataset is used to evaluate the model on data that it has not seen. The simplest form of cross-validation is k-fold cross-validation. In k-fold cross-validation, the data is divided into k equally sized folds. Subsequently, k iterations of training and validation are performed such that, within each iteration, a different fold of the data is held out for validation while the remaining k−1 folds are used for learning.

Leave-one-out cross-validation is k-fold cross-validation where k is equal to N, the number of records in the dataset. That means that, N separate times, the model is trained on all the data except for one point and a prediction is made for that point. As before, the average error is computed and used to evaluate the model. The evaluation given by the leave-one-out cross-validation error is good, but at first pass it seems very expensive to compute.

3.4 Exam PHQ-9 - Depression Score

The Patient Health Questionnaire (PHQ) is a questionnaire that can be entirely self-administered by the patient. The PHQ assesses 8 diagnoses, divided into threshold disorders (disorders that correspond to specific diagnoses: major depressive disorder, panic disorder, other anxiety disorder, and bulimia nervosa) and subthreshold disorders (disorders whose criteria encompass fewer symptoms than are required for any specific diagnosis: other depressive disorder, probable alcohol abuse/dependence, and binge eating disorder). The PHQ-9 is the 9-item depression module from the full PHQ. As a severity measure, the PHQ-9 score can range from 0 to 27, since each of the 9 items can be scored from 0 (not at all) to 3 (nearly every day).
The PHQ-9 score is divided into five categories according to the score: 0–4 (Minimal), 5–9 (Mild), 10–14 (Moderate), 15–19 (Moderately severe), and 20 or greater (Severe); see Table 3.1. These categories were chosen for several reasons. The first was pragmatic, in that the cut points of 5, 10, 15, and 20 are simple for clinicians to remember and apply. The second reason was empiric, in that using different cut points did not noticeably change the associations between increasing PHQ-9 severity and measures of construct validity, according to Kroenke, Spitzer, and Williams (2001).

Depression Severity  | PHQ-9 Score
Minimal              | 0-4
Mild                 | 5-9
Moderate             | 10-14
Moderately severe    | 15-19
Severe               | 20-27

Table 3.1: Level of depression according to PHQ-9 scores

3.5 Dataset Studentlife

Studentlife by Wang et al. (2014) is a study that uses passive and automatic sensing data from the phones of a class of 48 Dartmouth students over a 10-week term to assess their mental health (e.g., depression, loneliness, stress), academic performance (grades across all their classes, term GPA, and cumulative GPA), and behavioral trends (e.g., how stress, sleep, visits to the gym, etc. change in response to college workload – i.e., assignments, midterms, finals – as the term progresses). The Studentlife dataset is a large, longitudinal dataset that is rich in information and deep. Importantly, the dataset is anonymized, protecting the privacy of the participants in the study. The dataset is from 48 undergraduate and graduate students at Dartmouth over the 10-week spring term. It includes over 53 GB of continuous data, self-reports, and pre-post surveys; specifically, it comprises:

• Objective sensing data: sleep (bedtime, duration, wake up); conversation duration, conversation frequency; physical activity (stationary, walk, run);
• Location-based data: location, co-location, indoor and outdoor mobility;
• Other phone data: light, Bluetooth, audio, Wi-Fi, screen lock/unlock, phone charge, app usage;
• Self-reports: affect (PAM), stress, behavior, Boston bombing reaction, canceled classes, class opinion, comment, Dartmouth now, Dimension incident, Dimension protest, dining halls, events, exercise, Green Key, lab, mood, loneliness, social and study spaces;
• Pre-post surveys: PHQ-9 depression scale, UCLA loneliness scale, positive and negative affect schedule (PANAS), perceived stress scale (PSS), big five personalities, flourishing scale, Pittsburgh sleep quality index, veterans RAND 12-item health (VR-12);
• Academic performance data: class information, deadlines, grades (grades, term GPA, cumulative GPA), Piazza data;
• Dining data: meal data, location, and time;
• Seating data: seating position of students in Android programming;
• Entry and exit surveys: to be added once anonymized.

3.5.1 PHQ-9 Scores

In the Studentlife study, two PHQ-9 exams were applied. The first (pre) was at the start of the project, with the participation of 46 students, and the other (post) was at the end, with the participation of 38 students (see Table 3.2). Table 3.2 shows the interpretation of the scale and the number of students that fall into each category for the pre and post assessments. The majority of students experienced minimal or minor depression in both the pre and post measures.
Depression severity        | Number of students (pre-survey) | Number of students (post-survey)
None (0)                   | 2  | 4
Minimal (1-4)              | 19 | 15
Minor (5-9)                | 17 | 12
Moderate (10-14)           | 6  | 3
Moderately severe (15-19)  | 1  | 2
Severe (20-27)             | 1  | 2
Total number of students   | 46 | 38

Table 3.2: PHQ-9 depression scale interpretation for both exams (pre and post). Adapted from Studentlife by Wang et al. (2018)

To see the distribution of the participants' scores, Figure 3.2 shows the PHQ-9 participant frequencies for each period.

Figure 3.2: PHQ-9 score distribution at baseline and at the end-of-study follow-up. Adapted from Studentlife by Wang et al. (2018)

Chapter 4
Proposed Solution

Studentlife had a period of 10 weeks of collected data. In this research, we used 2 blocks: the initial 2 weeks and the last 2 weeks. These blocks are relevant because only 2 PHQ-9 clinical tests were applied, the first at the beginning and the second on the last day of the project. Therefore, this work carries out an empirical study with two experiments.

4.1 Experiment 1

This first experiment is similar to the research procedure of Saeb et al. (2015). The extracted features were coded based on the same research study but applied to the Studentlife dataset of Wang et al. (2014). Another difference between this Experiment 1 and Saeb et al. (2015) is that we not only used Elastic Net, as Saeb et al. (2015) did, but also extended the exploration by adding Ridge, Lasso, and the ANN. Experiment 1 predicted the PHQ-9 score of the last two weeks using the features extracted from that period. The features extracted from the last 2 weeks are the input to the predictor models; see Figure 4.1.

Figure 4.1: Experiment 1 predicts depression based on the PHQ-9 score using the features extracted from students' mobile phones only during the last 2 weeks of the 10 weeks of Studentlife.

4.2 Experiment 2

One contribution of this work is the proposal of the second experiment, not proposed before in the state of the art, which predicts depression using both blocks, the initial 2 weeks and the last 2 weeks. This experiment calculated the difference in behavior between the blocks to know whether the behavior of each participant increased or decreased in some features; see Figure 4.2. This experiment aims to identify behavioral change in the students, based on the hypothesis that behavioral change is reflected in their PHQ-9.

Figure 4.2: Experiment 2 predicts depression based on the PHQ-9 score using the features of two blocks, the initial 2 weeks and the last 2 weeks of the Studentlife dataset, where feature = post_features − pre_features.

4.3 Feature Extraction

We extracted features according to Saeb et al. (2015) from mobile phone sensors to define the behavioral pattern of each participant. For each experiment, a dataset was generated to be used in training and fitting the linear and nonlinear models.

• Location Variance
Location variance measures the variability in a participant's GPS location, calculated using the stationary states:

Location Variance = log(σ²_lat + σ²_long)    (4.1)

The logarithm compensates for the skewness in the distribution of the location variance σ (for each measure, latitude and longitude) across participants.

• Number of Clusters
The number of clusters represents the number of frequent places found by the K-means algorithm in the preprocessing stage.

• Homestay
Homestay measures the percentage of time a participant spent at home relative to the other location clusters.
To obtain this measure, we first needed to know which cluster represented the participant's home. The home cluster is identified based on two heuristics:

1. The home cluster is among the first to the third most visited clusters.
2. The home cluster is the cluster most visited during the time period between 12 a.m. and 6 a.m.

• Entropy
Entropy measures the variability of the time the participant spent at the location clusters. This feature was developed based on the concept of entropy from information theory. It was calculated as:

Entropy = − Σ_i p_i log p_i    (4.2)

where each i = 1, 2, ..., N represents a location cluster, N denotes the total number of location clusters, and p_i is the percentage of time the participant spent at location cluster i. High entropy indicates that the participant spent time more uniformly across different location clusters, while lower entropy indicates greater inequality in the time spent across the clusters. For example, if a participant spent 80% of the time at home and 20% at work, the resulting entropy would be −(0.8 log 0.8 + 0.2 log 0.2) ≈ 0.500, while if they spent 50% at home and 50% at work, the resulting entropy would be −(0.5 log 0.5 + 0.5 log 0.5) ≈ 0.693.

• Normalized Entropy
Normalized entropy is the entropy divided by its maximum value, which is the logarithm of the total number of clusters:

Normalized Entropy = Entropy / log N    (4.3)

The value of normalized entropy ranges from 0 to 1, where 0 indicates that all location data points belong to the same cluster, and 1 implies that they are uniformly distributed across all the clusters. Unlike entropy, normalized entropy is invariant to the number of clusters and thus depends solely on the distribution of the visited location clusters.

• Circadian Movement
Circadian movement captures the temporal information of the location data. This feature measures to what extent the participant's sequence of locations followed a 24-hour, or circadian, rhythm. For example, if a participant left home for work and returned home from work around the same time each day, the circadian movement was high. On the contrary, a participant with a more irregular pattern of moving between locations had a lower circadian movement. To calculate circadian movement, we first used least-squares spectral analysis, also known as the Lomb-Scargle method (see VanderPlas (2018)), to obtain the spectrum of the GPS location data. Then, we calculated the amount of energy that fell into the frequency bins within a 24 ± 0.5 hour period in the following way:

E = Σ_i psd(f_i) / (i_1 − i_2)    (4)

where i = i_1, i_1 + 1, i_1 + 2, ..., i_2, with i_1 and i_2 representing the frequency bins corresponding to 24.5- and 23.5-hour periods, and psd(f_i) denoting the power spectral density at each frequency bin f_i. E was calculated separately for latitude and longitude, and the total circadian movement was obtained as:

CM = log(E_lat + E_long)    (5)

The logarithm was applied to account for the skewness in the distribution.

• Total Distance
Total distance measures the total distance in kilometers traveled by a participant. It was calculated by accumulating the distances between the location samples.

• Speed mean
The mean of the instantaneous speed obtained at each GPS data point.
The instantaneous speed (degrees/sec) was calculated as the change in latitude and longitude values over time in the following way:

V_i = sqrt( ((lat_i − lat_{i−1}) / (t_i − t_{i−1}))² + ((long_i − long_{i−1}) / (t_i − t_{i−1}))² )    (4.4)

where lat_i, long_i, and t_i are the latitude, longitude, and time at sample i.

• Transition Time
Transition time represents the percentage of time during which a participant was in a nonstationary state (see the data preprocessing described in Section 2.1.2). It was calculated by dividing the number of GPS location samples in transition states by the total number of samples.

• Phone Usage Frequency
Phone usage frequency indicates, on average, how many times during a day a participant interacted with their phone.

• Phone Usage Duration
Phone usage duration measures, on average, the total time in seconds that a participant spent each day interacting with their phone.

4.4 Correlation of features with PHQ9 score

The Pearson correlation was computed between the features and the PHQ-9 in both the initial 2 weeks and the last 2 weeks. In addition, to support analyses examining the directionality of the correlations, the Pearson correlation was calculated from 2-week-long blocks of GPS data. The 2-week period was selected as a block of time because a diagnosis of depression requires the presence of symptoms for at least two weeks. Therefore, we obtained nine features corresponding to each 2-week block. Table 4.1 and Table 4.2 show the Pearson correlation between each feature and the PHQ-9 score.

Features            | Saeb et al. (2016) study | Larrazolo study
Location variance   | -0.29 ±0.008 | -0.29
Circadian movement  | -0.34 ±0.006 | -0.30
Speed mean          | -0.03 ±0.007 | 0.04
Speed variance      | -0.07 ±0.007 | -0.11
Total distance      | -0.23 ±0.004 | -0.22
Number of clusters  | -0.38 ±0.005 | -0.31
Entropy             | -0.31 ±0.007 | -0.31
Normalized entropy  | -0.26 ±0.007 | -0.28
Home stay           | 0.22 ±0.008  | 0.23

Table 4.1: Coefficients of correlation between location features and PHQ-9 scores at baseline (n=46). Pearson's coefficients in Larrazolo's study are similar to Saeb et al. (2016). This similarity is important because the coding of features is based on Saeb et al. (2016) but will be used to explore different models and scenarios.

Features            | Saeb et al. (2016) study | Larrazolo study
Location variance   | -0.43 ±0.007 | -0.43
Circadian movement  | -0.48 ±0.006 | -0.43
Speed mean          | -0.06 ±0.005 | -0.06
Speed variance      | -0.06 ±0.005 | -0.09
Total distance      | -0.18 ±0.006 | -0.17
Number of clusters  | -0.44 ±0.004 | -0.35
Entropy             | -0.46 ±0.005 | -0.45
Normalized entropy  | -0.44 ±0.005 | -0.46
Home stay           | 0.43 ±0.005  | 0.40

Table 4.2: Coefficients of correlation between location features and PHQ-9 scores at follow-up (n=38). Pearson's coefficients in Larrazolo's study are similar to Saeb et al. (2016). This similarity is important because the coding of features is based on Saeb et al. (2016) but will be used to explore different models and scenarios.

The correlation analysis between the features and the PHQ-9 scores revealed that features like location variance, circadian movement, number of clusters, entropy, normalized entropy, and home stay showed strong correlations (Pearson's coefficient, |r| ≥ 0.35). These variables are related to the movement of the participants in different ways. The correlation between circadian movement and location variance is interesting and indicates that participants with more mobility also had more regular movement patterns.
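For reference, Pearson coefficients of the kind reported in Tables 4.1 and 4.2 can be computed along the following lines. This is a minimal sketch with synthetic data, not the thesis pipeline; the DataFrame layout and column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-in: one row per student, one column per location feature,
# plus the PHQ-9 score for the same 2-week block. Column names are illustrative.
rng = np.random.default_rng(1)
features = ["location_variance", "circadian_movement", "entropy",
            "normalized_entropy", "number_of_clusters", "home_stay"]
df = pd.DataFrame(rng.normal(size=(38, len(features))), columns=features)
df["phq9"] = rng.integers(0, 28, size=38)

# Pearson's r and its p-value for each feature against the PHQ-9 score,
# the statistic reported in Tables 4.1 and 4.2.
for feat in features:
    r, p = pearsonr(df[feat], df["phq9"])
    print(f"{feat:20s} r = {r:+.2f}  p = {p:.3f}")
```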
These correlation results indicate that students with more clusters, higher entropy, and higher normalized entropy (more movement across different places) are less likely to be depressed. The results for home stay suggest that students who spend more time at home are typically more likely to experience depressive symptoms.

4.5 Models

This work hypothesizes that behavioral changes are related to the participant's state of depression and that an initial state is required to know their usual behavior. Using the participant's baseline state, which consists of the PHQ-9 clinical questionnaire applied at the beginning of the study together with the passive data collected during the first two weeks, helps to have prior knowledge of the participant's behavior. By comparing the characteristics of the first two weeks and the last two weeks, we can observe the difference in behavior that accompanies the increase or decrease in depression. We carry out an empirical multivariate study with two experiments: Experiment 1 consists of training and predicting the PHQ-9 values of the last period using only the passive data of the last two weeks; Experiment 2 includes information from the first two weeks and the last two weeks, whose difference tells us whether the behavior increased or decreased in each feature, in addition to generating a PHQ-9 target that consists of the increase or decrease in its value between the initial PHQ-9 and the final PHQ-9. For each experiment, a dataset is generated, which is used to train and fit linear computational models and a nonlinear model, an artificial neural network.

4.5.1 Score Estimation Model - Linear regression

Using the features extracted from the phone sensor data, we used a linear regression model to estimate each participant's PHQ-9 score. The model is defined in Eq. 4.5, where n is the number of features. The coefficients a_0, a_1, ..., a_n were obtained by minimizing the squared error between the estimated and the true PHQ-9 scores (see Linear regression optimized, Section 3.1.2).

Depression Score = a_0 + a_1 F_1 + a_2 F_2 + ... + a_n F_n    (4.5)

4.5.2 Score Estimation Model - Multilayer Perceptron (MLPs)

We proposed the architecture of a regression artificial neural network to predict the PHQ-9 score of each student using the behavioral features described in Section 4.3 (Feature Extraction). The architecture configuration is specified in Figure 4.3.

Figure 4.3: The architecture of a Regression Artificial Neural Network designed with 4 hidden layers, 10 neurons in each layer, and 1 neuron in the last layer. The activation function used in all layers is the Rectified Linear Unit (ReLU). The first layer has one input for each of the behavioral features i = 1, 2, ..., n, and Y is the output neuron that predicts the depression PHQ-9 score.

Chapter 5
Results

The results are divided according to Experiment 1 (see Section 4.1) and Experiment 2 (see Section 4.2).

5.1 Performance Models

The results of the experiments using the linear and nonlinear models are presented in Table 5.1. Experiment 1, based on the research study of Saeb et al. (2015), used only the last 2 weeks.
Chapter 5

Results

The results were divided according to Experiment 1 (see section 4.1) and Experiment 2 (see section 4.2).

5.1 Performance Models

The results of the experiments using the linear and non-linear models are presented in Table 5.1. Experiment 1, based on the research study of Saeb et al. (2015), used only the last 2 weeks. Experiment 2, the proposal of this work, used the behavioral change between two blocks, the initial 2 weeks and the last 2 weeks, to predict the PHQ-9 score.

Model                              Experiment 1                     Experiment 2
                                   Based on Saeb et al. (2015)      Proposed by Larrazolo
                                   Root Mean Square Error           Root Mean Square Error
Ridge                              3.7                              2.8
Elastic Net                        3.9                              2.8
Lasso                              4.1                              2.8
Linear Regression (OLS)            5.7                              7.7
Artificial Neural Network (ANN)    6.0                              2.7

Table 5.1: Comparison of RMSE results of linear and nonlinear models using LOOCV as the evaluation method, for Experiment 1 (based on the research study of Saeb et al. (2015)) and Experiment 2 (proposed by Larrazolo).

In Experiment 1, the best score was obtained by the Ridge model with an RMSE of 3.7, and the worst by the ANN with 6.0. The results of Experiment 1 cannot be compared directly with those of Saeb et al. (2015) because their dataset is not publicly available; Experiment 1 follows Saeb's study but is applied to the Studentlife dataset. The results of Experiment 2, proposed by Larrazolo, are more accurate than those of Experiment 1. The best-performing model in Experiment 2 was the artificial neural network with 2.7 RMSE, and the worst was OLS with 7.7 RMSE. The results for each participant using the linear and non-linear models are shown in Figure 5.1 for Experiment 1, and Figure 5.2 shows the results of Experiment 2.
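The RMSE values in Table 5.1 come from leave-one-out cross-validation (LOOCV, see section 3.3): each participant is held out in turn, the models are fitted on the remaining participants, and the held-out prediction is compared with the real PHQ-9 value. The sketch below shows the general shape of this evaluation loop, assuming scikit-learn and the hypothetical build_models helper sketched in section 4.5; it illustrates the protocol rather than reproducing the exact experimental code. The per-participant pairs of real and predicted values produced by the loop are what Figures 5.1 and 5.2 plot.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loocv_rmse(models, X, y):
    # X: (n_participants, n_features) behavioral features; y: PHQ-9 targets
    # (final scores in Experiment 1, score changes in Experiment 2).
    results = {}
    for name, model in models.items():
        # One prediction per participant, each made by a model trained
        # on all other participants (leave-one-out cross-validation).
        y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
        results[name] = np.sqrt(mean_squared_error(y, y_pred))
    return results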
Figure 5.1: Results of Experiment 1. Subplots of the real and predicted PHQ-9 scores of each model for the 38 students (the horizontal axis is the student ID assigned in the Studentlife dataset). Models from top to bottom: OLS, Lasso, Ridge, Elastic Net, and ANN.

Figure 5.2: Results of Experiment 2. Subplots of the real and predicted PHQ-9 scores of each model for the 38 students (the horizontal axis is the student ID assigned in the Studentlife dataset). Models from top to bottom: OLS, Lasso, Ridge, Elastic Net, and ANN.

Chapter 6

Conclusions

The constant use of personal devices such as smartphones and wearables makes it possible to track people's activities, such as the time they spend at home and the places they commonly visit, and these features bring the prediction and diagnosis of mental health conditions closer.

Different linear and nonlinear models were presented to estimate the PHQ-9 score from behavioral features. These models were applied in two scenarios, Experiment 1 and Experiment 2. Experiment 1 was based on the study of Saeb et al. (2015) but applied to the Studentlife dataset. The results are shown in Table 5.1; the best model was the Ridge model with 3.7 RMSE, which corresponds to an average error of 3.7 points with respect to the real PHQ-9 scores. Experiment 2 was then carried out as a new proposal to improve the accuracy, based on the observations and exploration of the data. The results are shown in Table 5.1; the best model was the ANN with 2.7 RMSE. These results are better than those obtained in Experiment 1.

The results of Experiment 2 support the conclusion that predicting a participant's depression level requires the context of an initial state and the change in their behavioral patterns. This could explain why Experiment 2 had better results than Experiment 1. The characterization of the participants' behavioral patterns was based on features extracted from mobile phone sensors. We proposed different models to predict the PHQ-9 scores. The best-performing model was the ANN in Experiment 2, with 2.7 RMSE; the difference between the regularized linear models (2.8 RMSE) and the ANN (2.7 RMSE) is small. These results indicate an average error of 2.7 points with respect to the participants' real PHQ-9 values (on a scale from 0 to 27).

6.1 Future work

The work done in this thesis identifies important opportunities to explore other approaches to the detection of depression using digital phenotyping. This work used the Studentlife dataset, which covers 10 weeks but includes only 2 PHQ-9 exams (in the first and the last week). Exploring sequential models and applying clinical tests at more frequent intervals is essential; regular clinical tests open possibilities to experiment with other models and to obtain more robust training.

One approach to explore is time series models or recurrent neural networks (RNNs) that predict depression from the sequence of behavioral patterns over time, as sketched below. These models could be personalized or general, but collecting the information (features and clinical tests) periodically (daily, or two to three times a week) could help discover specific habits in the daily life of participants that are related to depression.
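As a purely illustrative sketch of this direction (not something implemented or evaluated in this thesis), the following defines a small LSTM regressor that maps a sequence of periodic feature vectors, one per day or per week, to a PHQ-9 score. It assumes TensorFlow/Keras, and the layer sizes, optimizer, and input shape are hypothetical choices.

import tensorflow as tf

def build_sequence_model(n_steps: int, n_features: int) -> tf.keras.Model:
    # LSTM regressor over a sequence of behavioral feature vectors.
    # n_steps: number of time blocks (e.g., days or weeks);
    # n_features: number of behavioral features per block.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_steps, n_features)),
        tf.keras.layers.LSTM(32),                     # layer size is an assumption
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),                     # predicted PHQ-9 score
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Hypothetical usage: X_seq has shape (participants, n_steps, n_features)
# and y_phq9 holds the corresponding questionnaire scores.
# model = build_sequence_model(n_steps=X_seq.shape[1], n_features=X_seq.shape[2])
# model.fit(X_seq, y_phq9, epochs=200, verbose=0)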
Another approach to explore is adding other sensors and features that could improve the models for predicting depression. Sleep time has been studied as an important signal of depression, and this feature has been estimated using sensors such as the microphone, the light sensor, and the charging time of the smartphone (Chen et al. (2013)).

An important problem in this study was finding a dataset that collects data from different sensors over a long period with a sufficient number of participants. Future work could create an integrated platform to conduct digital phenotyping studies and collect data from mobile phones and wearables. Such a platform would provide not only more data but also data that meets specific requirements.

References

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Chen, Z., Lin, M., Chen, F., Lane, N. D., Cardone, G., Wang, R., . . . Campbell, A. T. (2013). Unobtrusive sleep monitoring using smartphones. In 2013 7th International Conference on Pervasive Computing Technologies for Healthcare and Workshops (pp. 145–152).

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. (http://www.deeplearningbook.org)

Kroenke, K., Spitzer, R., & Williams, J. (2001, September). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613. Retrieved from https://europepmc.org/articles/PMC1495268 doi: 10.1046/j.1525-1497.2001.016009606.x

Mikelsons, G., Smith, M., Mehrotra, A., & Musolesi, M. (2017). Towards deep learning models for psychological state prediction using smartphone data: Challenges and opportunities. arXiv. Retrieved from https://arxiv.org/abs/1711.06350 doi: 10.48550/ARXIV.1711.06350

Mohr, D. C., Zhang, M., & Schueller, S. M. (2017, March). Personal sensing: Understanding mental health using ubiquitous sensors and machine learning. Annu Rev Clin Psychol, 13, 23–47.

Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. Springer US. https://doi.org/10.1007/978-0-387-39940-9_565

Saeb, S., Lattie, E. G., Schueller, S. M., Kording, K. P., & Mohr, D. C. (2016). The relationship between mobile phone location sensor data and depressive symptom severity. PeerJ, 4, e2537. Retrieved from https://doi.org/10.7717/peerj.2537 doi: 10.7717/peerj.2537

Saeb, S., Zhang, M., Karr, C. J., Schueller, S. M., Corden, M. E., Kording, K. P., & Mohr, D. C. (2015, Jul 15). Mobile phone sensor correlates of depressive symptom severity in daily-life behavior: An exploratory study. J Med Internet Res, 17(7), e175. Retrieved from http://www.jmir.org/2015/7/e175/ doi: 10.2196/jmir.4273

Spinazze, P., Rykov, Y., et al. (2019). Digital phenotyping for assessment and prediction of mental health outcomes: a scoping review protocol. BMJ Open, 9(12), e032255. doi: 10.1136/bmjopen-2019-032255

Statista. (2019). Number of connected wearable devices worldwide from 2016 to 2022. https://www.statista.com/statistics/487291/global-connected-wearable-devices/ (Accessed 20 April 2022)

Torous, J., Staples, P., Barnett, I., et al. (2018).
Characterizing the clinical relevance of digital phenotyping data quality with applications to a cohort with schizophrenia. npj Digital Medicine. https://doi.org/10.1038/s41746-018-0022-8

VanderPlas, J. T. (2018, May). Understanding the Lomb–Scargle periodogram. The Astrophysical Journal Supplement Series, 236(1), 16. Retrieved from https://doi.org/10.3847/1538-4365/aab766 doi: 10.3847/1538-4365/aab766

Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., . . . Campbell, A. T. (2014). Studentlife: Assessing mental health, academic performance and behavioral trends of college students using smartphones. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. 3–14). New York, NY, USA: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/2632048.2632054 doi: 10.1145/2632048.2632054

Wang, R., Wang, W., daSilva, A., Huckins, J. F., Kelley, W. M., Heatherton, T. F., & Campbell, A. T. (2018, March). Tracking depression dynamics in college students using mobile phone and wearable sensing. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 2(1). Retrieved from https://doi.org/10.1145/3191775 doi: 10.1145/3191775

Weisstein, Eric W. (2022). Coefficient. Retrieved from https://mathworld.wolfram.com/Coefficient.html (Online; accessed 11 October 2022)

World Health Organization. (2021). Depression. https://www.who.int/news-room/fact-sheets/detail/depression (Accessed 22 April 2022)

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2), 301–320. Retrieved 2022-07-12 from http://www.jstor.org/stable/3647580