Introduction

As the final project for Univ.ai's A1C1 course, our team (Vishnu, Sakthisree, Niegil and Rishabh) set out to use Machine Learning to answer an age-old question: what exactly makes us happy?

The World Happiness Report was recommended as a good starting point for gauging happiness around the world. The data served us well throughout the analysis, but by the end we concluded that the variables in this report alone are probably not sufficient to measure a country's happiness accurately, since "happiness" is highly relative in nature.

There are six measurements taken per country for gauging the World Happiness Index, plus two Dystopia-related values (items 7 and 8 below). They consist of:

  1. GDP per Capita - Gross Domestic Product per capita for the countries

  2. Family - Satisfaction Rank of Family

  3. Life Expectancy - Avg. expected years to live

  4. Freedom - Perception of freedom quantified

  5. Generosity - Numerical value estimated based on the perception of Generosity experienced by poll takers in their country.

  6. Trust/Government Corruption - A quantification of the people's perceived trust in their governments.

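  7. Dystopia Score - Score of Dystopia, a hypothetical country with the world's lowest values on each factor, used as a baseline for comparison.

  8. Dystopia Residual - The Dystopia score plus the part of a country's happiness score that the six factors above do not explain.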
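
As a quick sanity check on how these columns relate to the score: in the preprocessed dataset sampled in Part B, the six factor columns plus dystopia_residual appear to add up to happiness_score, up to rounding. The sketch below checks this for three of the 2015 rows shown there. The values are copied from that sample, and the helper name factor_total is ours, used only for illustration.

```python
# Rows copied from the Happiness Index sample shown in Part B (2015 data).
# Order: gdp_per_capita, family, health, freedom, generosity,
#        government_trust, dystopia_residual
rows = {
    "Afghanistan": ([0.40, 0.58, 0.18, 0.11, 0.31, 0.06, 2.15], 3.79),
    "Albania":     ([1.00, 0.80, 0.73, 0.38, 0.20, 0.04, 1.49], 4.64),
    "Algeria":     ([1.09, 1.15, 0.62, 0.23, 0.07, 0.15, 2.57], 5.87),
}


def factor_total(values):
    """Sum the factor contributions and the dystopia residual (hypothetical helper)."""
    return sum(values)


for country, (values, reported) in rows.items():
    total = factor_total(values)
    # The sample is rounded to two decimals, so small differences are expected.
    print(f"{country}: sum={total:.2f}, reported={reported:.2f}")
```

We have only checked this on the sample rows, so treat it as an observation about this dataset rather than a definition.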

The Happiness Score calculated in the report is actually an average of the responses to the main life evaluation question asked in the Gallup World Poll (GWP), which uses the Cantril Ladder.

The Cantril Ladder asks respondents to picture a ladder whose top step represents the best possible life they can imagine and whose bottom step represents the worst, and then, with that as the benchmark, to score their current life on a scale of 0 to 10.

Credits to:

  1. Univ.Ai Professor Pavlos Protopapas
  2. Kaggle Datasets
  3. Aashita Kesarwani - https://www.kaggle.com/aashita/guide-to-animated-bubble-charts-using-plotly - for demonstrating beautiful ways to plot bubble charts
  4. Jesper Sören Dramsch - https://www.kaggle.com/jesperdramsch/the-reason-we-re-happy - for demonstrating wonderful means of doing data analysis
  5. Jamaç Eren Ay - https://www.kaggle.com/yamaerenay/world-happiness-report-preprocessed - for preparing the preprocessed datasets and making them freely available to all

Problem Statement

Given the data available per country to gauge the Happiness Index, our aim is to:

  1. Part A - Analyze and understand which factors affect the Happiness Index Score of countries
  2. Part B - Analyze and understand the relationship between Terror Attacks and Happiness Index
  3. Part C - Create a Model to predict the Happiness Index of a Country
  4. Part D - See how much Health contributes to the Happiness Index and, with the current pandemic at hand, predict COVID-19 cases in the coming days for countries
  5. Part E - Create a Dashboard for viewing COVID-19 Predictions

Part A

To Analyze and understand which factors affect the Happiness Index Score of countries

Exploratory Data Analysis

Our objective here is to look through the datasets and perform some basic analysis to understand them and gain insights.

A look into Correlation

The Spearman's Rank Correlation Coefficient is used to discover the strength of a link between two sets of data.

  • The Spearman rank correlation coefficient, ρ, considers the ranks of the values for the two variables. ρ will always be a value between -1 and 1.

  • The further away ρ is from zero, the stronger the relationship between the two variables. The sign of ρ corresponds to the direction of the relationship. If it is positive, then as one variable increases, the other tends to increase. If it is negative, then as one variable increases, the other tends to decrease.

  • You use Spearman’s correlation if your data have a non-linear relationship (like an exponential relationship) or you have one or more outliers. However, Spearman’s correlation is only appropriate if the relationship between your variables is monotonic.

                   happiness_score  gdp_per_capita  family  health  freedom  generosity  government_trust  dystopia_residual   year  social_support
happiness_score               1.00            0.80    0.14    0.77     0.54        0.13              0.32               0.23   0.03            0.24
gdp_per_capita                0.80            1.00    0.21    0.78     0.36       -0.01              0.26               0.06  -0.04            0.14
family                        0.14            0.21    1.00   -0.07     0.01        0.23              0.10               0.56  -0.59           -0.86
health                        0.77            0.78   -0.07    1.00     0.40       -0.02              0.18              -0.05   0.07            0.38
freedom                       0.54            0.36    0.01    0.40     1.00        0.33              0.43              -0.00   0.06            0.23
generosity                    0.13           -0.01    0.23   -0.02     0.33        1.00              0.24               0.16  -0.10           -0.18
government_trust              0.32            0.26    0.10    0.18     0.43        0.24              1.00               0.13   0.02           -0.02
dystopia_residual             0.23            0.06    0.56   -0.05    -0.00        0.16              0.13               1.00   0.09           -0.59
year                          0.03           -0.04   -0.59    0.07     0.06       -0.10              0.02               0.09   1.00            0.43
social_support                0.24            0.14   -0.86    0.38     0.23       -0.18             -0.02              -0.59   0.43            1.00

Inference: From the above matrix, GDP per capita (0.80), health (0.77) and freedom (0.54) are the top 3 factors that correlate with the happiness index.
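
The rank-based behaviour described in this section can be sketched on synthetic data. The example below is not run on the happiness dataset: it builds an exponential (monotonic but non-linear) relationship and compares Pearson and Spearman coefficients using pandas' `DataFrame.corr`. The column names `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the happiness dataset): y grows exponentially with x,
# so the relationship is monotonic but clearly non-linear.
x = np.linspace(0.0, 5.0, 50)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

# Spearman is the Pearson correlation computed on the ranks of the values.
spearman_by_hand = df.rank().corr(method="pearson").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")   # below 1: the relationship is not linear
print(f"Spearman: {spearman:.3f}")  # 1.000: the ranks agree perfectly
```

On the real data, the same `corr(method="spearman")` call on the numeric columns would produce a matrix like the one above.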

Univariate Analysis

This type of analysis looks at a single variable at a time. Univariate analysis does not deal with causes or relationships; its main purpose is to describe the data and find patterns that exist within it.

Bivariate Analysis

This type of analysis involves two different variables. It deals with causes and relationships, and is done to find out how the two variables relate to each other.

Inference:

From the above plot, we can infer that there seems to be a:
Linear Relationship: happiness_score v/s gdp_per_capita, happiness_score v/s health, happiness_score v/s freedom
Non-Linear Relationship: happiness_score v/s generosity, happiness_score v/s government_trust

Performing an ANOVA test between the predictors and the response variable to gauge how significantly each one affects the score

Analysis of Variance (ANOVA) is a statistical method used to check whether the means of two or more groups differ significantly from each other. Its hypotheses are:
H0: The means of all groups are equal.
H1: At least one group mean is different.

  • If the distributions overlap or are close, the grand mean will be similar to the individual group means, whereas if the distributions are far apart, the grand mean and the individual means differ by a larger distance.
  • In ANOVA, we compare Between-group variability to Within-group variability through an F-test.
  • If there is no significant difference between the groups, the two variabilities are about equal and ANOVA's F-ratio will be close to 1.
The best predictors of Happiness Index are: 
['gdp_per_capita', 'government_trust', 'health', 'family']

Two of the factors picked out by the ANOVA test, GDP per capita and health, also appeared in our correlation inference. Apart from those, government trust and family also seem to play quite a significant role in determining the happiness score.
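
To make the F-ratio concrete, here is a minimal sketch of the one-way ANOVA calculation described above, written by hand with NumPy. It is a generic illustration on toy groups, not the exact routine we used to rank the predictors. The function name `one_way_f_ratio` and the sample values are hypothetical.

```python
import numpy as np


def one_way_f_ratio(groups):
    """Return the one-way ANOVA F-ratio for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = np.concatenate(groups).mean()

    # Between-group variability: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group variability: spread of the observations around their own mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Toy groups (hypothetical scores, not taken from the dataset).
shifted = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(one_way_f_ratio(shifted))  # 3.0
```

`scipy.stats.f_oneway` computes the same statistic together with a p-value, which is what decides whether H0 is rejected.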

Looking at all countries and their ranks in Happiness Index Score

Inference: Norway is clearly the top-scoring country on the Happiness Index. This is not surprising, since European countries generally have better living conditions.

Happiness with regards to Generosity and Economy

Inference: The bubbles farthest to the right are mostly European countries, which clearly have a higher GDP per capita. Surprisingly, European countries score only average on Generosity (Asian countries have the highest generosity) yet hold most of the top Happiness Score rankings.

Happiness with regards to Health and Economy

Inference: The bubbles farthest to the right are mostly European countries. They clearly have better Health scores as well, since they sit at the top of the chart. The lowest health scores belong mostly to African and Asian countries.

Happiness with regards to Family and Economy

Inference:

The bubbles farthest to the right are mostly European countries, which clearly have better Family ratings. The lowest Family rankings belong to a mixture of mostly African, South American and Asian countries, along with a few European and North American countries.

Happiness with regards to Govt Trust and Economy

Inference:

Most countries rank low on government trust, which suggests that much of the world's population does not necessarily trust its government despite the overarching push for democracy to be adopted. Countries with high government trust include Rwanda and, less surprisingly, Singapore, New Zealand and Finland.

World-wide View of Countries with regards to Generosity

Trend of Happiness Over Time

From the chart we can see that Europe has a high GDP per capita score compared to the other continents. Australian countries contribute the least to global GDP.

Part B

Analyze and understand the relationship between Terror Attacks and Happiness Index

Thoughts/ Motive

One of the things that intrigued us was terrorism across the world. With wars and conflicts happening on a day-to-day basis, we really wanted to understand to what extent terrorism plays a role in the happiness index. For this we combined two datasets - the Happiness Datasets and the World Terrorism dataset from the Global Terrorism Database (GTD).

From the terrorism dataset we took only the count of terror attacks. We left out other information, such as the text describing what happened and the names of the weapons used, since that would delve into NLP. Future work in scope is to use NLP on these fields as well, in order to better gauge the relationship between happiness and terrorism.

Processing the Datasets

Now that we have done EDA on the Happiness Index, what about terror attacks? Clearly the factors mentioned above are not sufficient to explain true happiness, so we decided to see how terror attacks combine with the happiness index and whether there is a correlation between them.

Note: the cells below take time to execute because the dataset is large.

Terrorism Database

iyear country_txt latitude longitude summary attacktype1_txt attacktype1 targtype1_txt targsubtype1 targsubtype1_txt corp1 target1 natlty1 natlty1_txt gname motive guncertain1 guncertain2 guncertain3 individual weaptype1_txt nkill
0 2015.00 Iraq 33.30 44.37 01/03/2015: An explosive device planted on a m... Bombing/Explosion 1 Private Citizens & Property 73.00 Vehicles/Transportation Not Applicable Minibus 95.00 Iraq Unknown None 0.00 nan nan 0.00 Explosives 2.00
1 2015.00 Bosnia-Herzegovina 45.18 15.83 01/01/2015: Assailants stabbed Selvedin Begano... Armed Assault 6 Religious Figures/Institutions 85.00 Religious Figure Unknown Imam: Selvedin Beganovic 28.00 Bosnia-Herzegovina Muslim extremists The specific motive is unknown; however, sourc... 0.00 nan nan 0.00 Melee 0.00
2 2015.00 Iraq 33.30 44.37 01/01/2015: An explosive device planted in a v... Bombing/Explosion 1 Educational Institution 48.00 Teacher/Professor/Instructor University of Baghdad Lecturer 95.00 Iraq Unknown None 0.00 nan nan 0.00 Explosives 1.00
3 2015.00 Sweden 59.86 17.64 01/01/2015: An assailant threw an explosive de... Facility/Infrastructure Attack 3 Religious Figures/Institutions 86.00 Place of Worship Unknown Mosque 198.00 Sweden Unknown The specific motive is unknown; however, sourc... 0.00 nan nan 0.00 Incendiary 0.00
4 2015.00 Libya 32.07 20.15 01/01/2015: Assailants attacked a Haftar milit... Bombing/Explosion 7 Terrorists/Non-State Militia 94.00 Non-State Militia Haftar Militia 204 Camp 113.00 Libya Shura Council of Benghazi Revolutionaries None 0.00 nan nan 0.00 Explosives nan

Happiness Index Database

   country      happiness_score  gdp_per_capita  family  health  freedom  generosity  government_trust  dystopia_residual  continent      Year  social_support
0  Afghanistan             3.79            0.40    0.58    0.18     0.11        0.31              0.06               2.15  Asia           2015             nan
1  Albania                 4.64            1.00    0.80    0.73     0.38        0.20              0.04               1.49  Europe         2015             nan
2  Algeria                 5.87            1.09    1.15    0.62     0.23        0.07              0.15               2.57  Africa         2015             nan
3  Argentina               6.60            1.19    1.44    0.70     0.49        0.11              0.06               2.61  South America  2015             nan
4  Armenia                 4.38            0.90    1.01    0.64     0.20        0.08              0.03               1.52  Asia           2015             nan
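
Given the two tables above, the count-and-merge step might look like the sketch below. It assumes the join keys are `country_txt`/`iyear` on the terrorism side and `country`/`Year` on the happiness side, as the column names suggest. The incident rows are made up for illustration, and the left join with a zero fill is our assumption about how countries without recorded attacks should be treated.

```python
import pandas as pd

# Made-up incident rows (one row per attack); only the join columns are kept.
terror = pd.DataFrame({
    "iyear": [2015.0, 2015.0, 2015.0],
    "country_txt": ["Afghanistan", "Afghanistan", "Algeria"],
})
# Happiness rows copied from the sample above.
happiness = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "Year": [2015, 2015, 2015],
    "happiness_score": [3.79, 4.64, 5.87],
})

# iyear is displayed as a float in the sample, so cast it before joining.
terror["iyear"] = terror["iyear"].astype(int)

# One row per incident -> number of attacks per country and year.
attacks = (
    terror.groupby(["country_txt", "iyear"])
    .size()
    .reset_index(name="n_attacks")
    .rename(columns={"country_txt": "country", "iyear": "Year"})
)

# Left join keeps every happiness row; countries with no recorded attack get 0.
combined = happiness.merge(attacks, on=["country", "Year"], how="left")
combined["n_attacks"] = combined["n_attacks"].fillna(0).astype(int)
print(combined)
```

Country names can differ between the two sources (for example "Bosnia-Herzegovina"), so a real merge would also need a name-harmonising step.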

Exploratory Data Analysis on Combined Dataset with Terrorism

We can see that there are some countries which go through a lot of terrorist attacks

There seems to be a:
Linear Relationship: happiness_score v/s gdp_per_capita, happiness_score v/s health, happiness_score v/s freedom
Non-Linear Relationship: happiness_score v/s generosity, happiness_score v/s government_trust

Inference:

With the data that we have, there doesn't seem to be much correlation between terror attacks and the happiness index. We would need more data to come to a significant conclusion as to how terrorism really affects the happiness index. Another factor that might help us understand the happiness index further is war conditions. Countries like Syria and Palestine are in critical war zones, which would make living conditions poor and hence affect the happiness index.

Part C

To create a Model to Predict Happiness Index

Predicting happiness Index

The MSE value of our model is:  0.25
The R2 score of our model is :  0.819

We used Lasso Regression on polynomial features of degree 6 (Polynomial Lasso Regression) to predict the Happiness Score.

Our MSE value for Lasso Regression is 0.25 and our R2 Score is 0.82 which is pretty satisfactory.

Why did we use Lasso Regression?

  • We understood that Lasso tends to do well if there are a small number of significant parameters and the others are close to zero (i.e. when only a few predictors actually influence the response). This matched our case, where the number of parameters was relatively small, so it seemed like a good approach to take. Ridge works well if there are many large parameters of about the same value (i.e. when most predictors impact the response).
  • Lasso, or Least Absolute Shrinkage and Selection Operator, is quite similar conceptually to ridge regression. It adds a penalty for non-zero coefficients. However, unlike ridge regression which penalizes sum of squared coefficients (the so-called L2 penalty), lasso penalizes the sum of their absolute values (L1 penalty). As a result, for high values of λ, many coefficients are exactly zeroed under lasso.
The MSE value of our model is:  0.25
The R2 score of our model is :  0.82
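
As a rough illustration of the approach, the sketch below fits a degree-6 polynomial Lasso pipeline with scikit-learn on synthetic data and counts how many coefficients the L1 penalty sets to zero. The data, the `alpha` value and the scaling step are assumptions made for this example. They are not the settings behind the scores reported above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: three predictors, only two of which matter.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.5, size=(300, 3))
y = 2.0 + 1.5 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0.0, 0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = make_pipeline(
    PolynomialFeatures(degree=6, include_bias=False),  # degree 6 as in the text
    StandardScaler(),                                  # put all terms on one scale
    Lasso(alpha=0.05, max_iter=50_000),                # alpha is an assumed value
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)

# The L1 penalty sets many polynomial coefficients to exactly zero.
coefs = model.named_steps["lasso"].coef_
n_zero = int(np.sum(coefs == 0))
print(f"MSE={mse:.3f}, R2={r2:.3f}, zeroed {n_zero} of {coefs.size} coefficients")
```

Swapping `Lasso` for `Ridge` in the same pipeline would shrink the coefficients without zeroing them, which is the contrast drawn in the bullets above.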

What did we do in MLP Regressor?

  • We chose multiple hidden layers so that the model can capture non-linear relationships. Stacking layers with non-linear activations adds that capacity, but too many layers may lead to overfitting.
  • After experimenting with multiple combinations of layers and neurons, three hidden layers turned out to be suitable for our model.
  • We also used the default activation function, ReLU, for the hidden layers. ReLU is a common choice for regression problems like this one, and MLPRegressor keeps the output layer linear.

Our MSE value for MLP Regressor is 0.26 and our R2 Score is 0.82 which is pretty much the same as Lasso Regression.
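
A minimal sketch of such a network with scikit-learn's `MLPRegressor` is shown below, again on synthetic data. The layer sizes `(32, 16, 8)` are hypothetical, since the exact neuron counts are not listed in this write-up, and the scaling step is our addition.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a mildly non-linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.5, size=(300, 3))
y = 2.0 + 1.5 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0.0, 0.1, size=300)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(
        hidden_layer_sizes=(32, 16, 8),  # three hidden layers; sizes are hypothetical
        activation="relu",               # scikit-learn's default hidden activation
        max_iter=2000,
        random_state=0,
    ),
)
model.fit(X, y)

fitted = model.named_steps["mlpregressor"]
print(fitted.n_layers_)        # 5: input + three hidden + output
print(fitted.out_activation_)  # identity: the output unit is linear
print(round(model.score(X, y), 3))  # R2 on the training data
```

The training R2 printed here is optimistic by construction. A held-out split, as in the Lasso case, is needed for a fair comparison.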

Predicting Terrorist attacks

We also experimented with the variables from the happiness dataset to see if we could satisfactorily predict the number of terrorist attacks likely to happen.

Of course, the model does not perform well, because many more factors affect the outcome than the ones we have.

Our future work here is to gather more external factors relating to what sparks terrorist attacks and to create a model that allows for better risk handling.

The R2 score of our model is:  0.07916807295022832

Clearly our model is not performing well here.

Part D

To see how much Health contributes to the Happiness Index and, with the current pandemic at hand, to predict COVID-19 cases in the coming days for countries.

Thoughts

From Part A, we realized that Health plays a major role in a country's happiness score. With the current pandemic at hand, we were motivated to look at COVID cases and forecast the upcoming cases. We wanted to compare the COVID data with the happiness index data, but felt it would not give the right results, since the 2020 happiness index data comes from January-February, before the COVID health crisis had really taken hold.

However, out of excitement and interest, we decided to go ahead and build a basic forecasting model on a COVID-19 dataset using fbprophet.

What and Why Prophet?

Prophet is Facebook's open-source library for time series prediction. Prophet decomposes a time series into trend, seasonality and holiday components. It has intuitive hyperparameters which are easy to tune.

Prophet time series = Trend + Seasonality + Holiday + error

Trend models non-periodic changes in the value of the time series. Seasonality covers periodic changes, such as daily, weekly or yearly patterns. The holiday effect captures events that occur on irregular schedules over a day or a period of days. The error term is whatever is not explained by the model.
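
The additive structure can be illustrated without Prophet itself. The sketch below builds a synthetic daily series from the four components named above. All numbers are made up, and this only shows how the pieces add together. It is not how Prophet estimates them.

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(28)  # four weeks of daily observations

trend = 100.0 + 5.0 * days                          # non-periodic growth
seasonality = 10.0 * np.sin(2 * np.pi * days / 7)   # weekly pattern
holiday = np.where(days == 14, -30.0, 0.0)          # one-off dip on a single day
error = rng.normal(0.0, 2.0, size=days.size)        # what the components do not explain

# Prophet time series = Trend + Seasonality + Holiday + error
y = trend + seasonality + holiday + error
print(y[:7].round(1))
```

Fitting a model to `y` means recovering estimates of the first three components. Whatever is left over is the error.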

We believe that the advantages of using Prophet are:

  • Accommodates seasonality with multiple periods
  • Prophet is resilient to missing values
  • Outliers are best handled by simply removing them, since Prophet copes with the resulting gaps
  • Fitting of the model is fast
  • Intuitive hyper parameters which are easy to tune
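
For reference, a basic Prophet workflow looks roughly like the sketch below. It is not the exact code behind our model. The helper names `to_prophet_frame` and `forecast_cases` are hypothetical, the case counts are made up, and the import assumes the current package name `prophet` (older releases were published as `fbprophet`).

```python
import pandas as pd


def to_prophet_frame(dates, cases):
    """Build the two-column frame Prophet expects: 'ds' (dates) and 'y' (values)."""
    return pd.DataFrame({"ds": pd.to_datetime(dates), "y": cases})


def forecast_cases(history, days_ahead=14):
    """Fit Prophet on the history and forecast `days_ahead` further days."""
    # Imported here so the helper above works even without Prophet installed.
    from prophet import Prophet

    model = Prophet()
    model.fit(history)
    future = model.make_future_dataframe(periods=days_ahead)
    forecast = model.predict(future)
    # yhat is the point forecast; the other two columns bound its uncertainty interval.
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Made-up cumulative case counts for ten days (not real COVID-19 data).
history = to_prophet_frame(
    pd.date_range("2020-04-01", periods=10, freq="D"),
    [100, 120, 145, 170, 210, 260, 300, 350, 420, 500],
)
# forecast = forecast_cases(history)  # requires Prophet to be installed
```

With default settings Prophet picks the seasonalities to fit from the span of the history, so a short series like this one mostly yields a trend line.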

Credits to https://towardsdatascience.com/time-series-prediction-using-prophet-in-python-35d65f626236 for information on Prophet.

The performance of the model


Part E

Creating a Dashboard for viewing COVID-19 Predictions

Our very own COVID-19 Forecasting Dashboard

Using the model that we built, we created a COVID-19 Forecasting Dashboard. You can view it in this link:

https://covid-prediction.herokuapp.com/

Our main motivation here was to learn how best to present model outcomes to an audience.

You can see the code in our file under the name: Covid-pred

Conclusions

  • The factors used for calculating the Happiness Index of countries are not holistic and inclusive; there are other factors to consider. GDP per capita itself seems to be a skewed figure, and its limitations are highly likely to bias the happiness score.

  • We did not find much correlation between no. of terror attacks and happiness index of a country. However, we believe we need to consider more factors & influences pertaining to terrorism for us to properly see the relationship.

  • For COVID-19 forecasts, we performed univariate analysis on our historical data, which made us realize that historical data alone might not be sufficient for prediction. It is certainly one of the main predictors, though, and it can be combined with other predictors to create a more powerful model.

Improvements That Can Be Done

Improvement: Figure out another way to calculate Happiness Index of a country which includes more holistic and inclusive factors

Based on our observations, we believe that factors beyond the 6 selected need to be considered in order to score the happiness index accurately. A possible improvement would be to research an alternative way of calculating the index that does not use GDP per capita as a score.

Improvement: To move into using NLP & Decision Trees for analyzing Terrorism Data

Most of the factors in the Terrorism Dataset were text based. Hence, using NLP here will be best for us to understand the influences of the predictor on the response. To improve model prediction, we believe models pertaining to Decision Trees will help.

Improvement: To move into Multivariate Analysis

We forecasted COVID-19 cases using only past data – however, we are aware that historical data alone is not enough to make accurate forecasts. There are many other external factors – our intention was to more or less look at the trend and observe how this trend will move in the future.