About 800,000 people died by suicide every year, which accounts for 1.4% of all deaths worldwide.
The annual U.S. suicide rate increased by 10% between 2014 and 2017 according to the CDC’s National Center for Health Statistics(NCHS).
In 2015, suicide has become the second leading cause of death among 15-34 age group and the third leading cause of death among 10-14 age group in the U.S.
It is reported that in 2016 suicide was the 10th leading cause of death in the U.S., imposing a cost of $69 billion annually.
On the one hand, almost all the risk factors for suicide are preventable. Usually, people are not aware of the risk factors of suicide and situate themselves in the dangerous environment. We believe that a better understanding of potential risk factors is essential for the prevention of suicide.
usa_suicide =
read_xlsx("./data/suicide_age_gender.xlsx") %>%
janitor::clean_names() %>%
filter(is.na(ten_year_age_groups)) %>%
drop_na(gender) %>%
select(year, state, gender, deaths, population) %>%
group_by(year) %>%
summarize(
deaths = sum(deaths),
population = sum(population)
) %>%
mutate(
suicide_rate = deaths * 100000/population
) %>%
select(year,suicide_rate)
global_suicide =
read_csv("./data/master.csv") %>%
janitor::clean_names() %>%
filter(year >= 1999) %>%
group_by(year) %>%
summarize(
deaths = sum(suicides_no),
population = sum(population)
) %>%
mutate(
suicide_global = deaths*100000/population
) %>%
select(year, suicide_global)
suicide_rate =
left_join(usa_suicide, global_suicide, by = "year") %>%
rename( "USA" = "suicide_rate", "Global" = "suicide_global") %>%
pivot_longer(
USA:Global,
names_to = "region",
values_to = "suicide_rate"
)
suicide_rate %>%
ggplot(aes(x = year, y = suicide_rate, color = region)) +
geom_point() + geom_line() +
labs(
title = "1999-2017 Suicide Rate in the U.S. and global",
x = "Year",
y = "Suicides per 100k"
)
On the other hand, the suicide rate in the U.S. increases overtime while it decreases globally. Therefore, we want to focus on the U.S. to explore the risk factors and promote effective preventive measures.
Visualizing the general situation of suicide rate worldwide
Exploring risk factors for suicide in U.S.
Visualizing the distribution of risk factors and suicide in the U.S.
Analyzing correlations of these risk factors with suicide in U.S.
We aim to investigate the risk factors for suicide and suicide rate in the U.S.. We want to visualize the association and data innovatively and interactively. As stated above, our questions are as follows:
What are potential risk factors for suicide? How could we visualize the association between these factors and suicide rate?
How could we use shiny to visualize the distribution of risk factors and suicide rate in the USA?
Could the model be generalized to other regions? How can we predict suicide rate using the constructed model?
Worldwide Suicide Data
USA Suicide Data and the potential risk factors data
The suicide rates of each state from 1999 to 2017 is from Domectis suicide distribution dataset from CDC. This dataset also includes gender and age composition in each state as well as suicide rates among groups.
The per capital GDP of each state in 2007 is from Wiki, presenting a list of U.S states sorted by their gross domestic product (GDP) per capita from 2011-2018.
The average yearly precipitation of each state is from NETSTATE.COM, presenting a list of U.S. states sorted by their average yearly precipitation.
The average temperature of each state is from USA.COM, presenting a list of U.S. areas sorted by their average temperature.
The marijuana data of each state is from samhsa.gov, showing the percentage of U.S. adults that have used cannabis within the past year from 2016 to 2017, by state.
The alcohol consumption data of each state is from NIH; NIAAA, showing apparent per capita alcohol consumption: national, state, and regional trends in 1977-2017. We only select 2007 data.
The education attainment data of each state is from SSTi, containing information on share of population with advanced degree, bachelor’s degree, associate’s or some college, and high school or less.
The education attainment data of each state is from SSTi, containing information on share of population with advanced degree, bachelors’ degree, associate’s or some college, and high school or less.
The gun data of each state is from ThoughtCo containing information on number of guns per capita and total guns registered of each state in 2017.
All the data can be found here.
Worldwide Suicide Data
## global map
world <- read_csv("./data/master.csv")
world_clean <-
world %>%
rename( "rate_standardized" = "suicides/100k pop") %>%
mutate(suicides = population * rate_standardized) %>%
group_by(country,year) %>%
summarise(rate = sum(suicides)/sum(population))
world_clean <-
world_clean %>%
mutate(
'country_iso3c' = countrycode(country,'country.name', 'iso3c')
)
# get Min and Max values of suicide rate
vmax <- world_clean %>% pull(rate) %>% max(na.rm=T)
vmin <- world_clean %>% pull(rate) %>% min(na.rm=T)
## Global total trend (by year)
data_year=global %>%
group_by(year) %>%
summarize(population = sum(population),
suicides = sum(suicides_no),
suicides_per100k = (suicides / population) * 100000)
data_year2=arrange(data_year, desc(suicides_per100k))[1,]
data_year2$suicides_per100k<-round(data_year2$suicides_per100k,2)
## Global gender trend
data_gender=global %>%
group_by(sex,year) %>%
summarize(population = sum(population),
suicides = sum(suicides_no),
suicides_per100k = (suicides / population) * 100000)
## Different age group trend(by year) animation separeted line
data_age=global %>%
mutate(
age=fct_relevel(age,"5-14 years","15-24 years","25-34 years","35-54 years","55-74 years","75+ years")
)%>%
group_by(year, age) %>%
summarize(suicide_per100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000)
## average suicide rate ranking during 1987-2016 in different continet
global_rank=global %>%
filter(year>2010) %>%
group_by(country) %>%
summarize(suicide_per100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000) %>%
arrange(desc(suicide_per100k)) %>%
ungroup() %>%
mutate(
country=as.factor(country),
country=fct_reorder(country,desc(suicide_per100k))
) %>%
head(40) %>%
mutate(highlight_flag = ifelse(country == 'United States', T, F))
We directly downloaded the csv file of world suicide data here. Then we calculated suicide rate per 100k using suicide number and total population. Next, for different purposes, we group_by
different variables such as year, country and gender to get the rates in different groups.
US_2017 Data
## suicide_rate_df
suicide_rate_df =
read_excel("./data/suicide_age_gender.xlsx") %>%
janitor::clean_names() %>%
filter(year == 2017) %>%
filter(is.na(ten_year_age_groups)) %>%
drop_na(gender) %>%
select(state,gender, deaths, population) %>%
group_by(state) %>%
summarize(
suicide_rate_per100000 = sum(deaths)*100000/sum(population)
)
## gender_rate_df
gender_rate_df =
read_excel("./data/suicide_age_gender.xlsx") %>%
janitor::clean_names() %>%
filter(year == 2017) %>%
filter(is.na(ten_year_age_groups)) %>%
drop_na(gender) %>%
select(state,gender,population) %>%
pivot_wider(
names_from = "gender",
values_from = "population"
) %>%
mutate(
female_ratio = Female/(Female+Male),
male_ratio = Male/(Female+Male)
) %>%
select(state, female_ratio, male_ratio)
## gender_suiciderate_df
gender_suiciderate_df =
read_excel("./data/suicide_age_gender.xlsx") %>%
janitor::clean_names() %>%
filter(year == 2017) %>%
filter(is.na(ten_year_age_groups)) %>%
drop_na(gender) %>%
select(state,gender,deaths,population) %>%
mutate(
suicide_rate_per100000 = deaths*100000/population
) %>%
select(state, gender, suicide_rate_per100000) %>%
pivot_wider(
names_from = gender,
values_from = suicide_rate_per100000
) %>%
mutate(F_suicide_rate_per100000 = Female, M_suicide_rate_per100000 = Male) %>%
select(-Female, -Male)
## age
age_df =
read_excel("./data/suicide_age_gender.xlsx") %>%
janitor::clean_names() %>%
filter(year == 2017) %>%
drop_na(ten_year_age_groups) %>%
select(state,gender,ten_year_age_groups, population) %>%
pivot_wider(
names_from = "gender",
values_from = "population"
) %>%
mutate(
Female = replace(Female, is.na(Female), 0),
Male = replace(Male, is.na(Male), 0),
total = Female + Male
) %>%
select(-Female, -Male) %>%
pivot_wider(
names_from = "ten_year_age_groups",
values_from = "total"
) %>%
rename("Y15_24" = "15-24 years", "Y25_34" = "25-34 years","Y35_44" = "35-44 years","Y45_54" = "45-54 years","Y55_64" = "55-64 years","Y65_74" = "65-74 years","Y5_14" = "5-14 years", "Y75_84" = "75-84 years","Y85_larger" = "85+ years") %>%
mutate(
Y5_14 = replace(Y5_14, is.na(Y5_14), 0),
Y15_24 = replace(Y15_24, is.na(Y15_24), 0),
Y25_34 = replace(Y25_34, is.na(Y25_34), 0),
Y35_44 = replace(Y35_44, is.na(Y35_44), 0),
Y45_54 = replace(Y45_54, is.na(Y45_54), 0),
Y55_64 = replace(Y55_64, is.na(Y55_64), 0),
Y65_74 = replace(Y65_74, is.na(Y65_74), 0),
Y75_84 = replace(Y75_84, is.na(Y75_84), 0),
Y85_larger = replace(Y85_larger, is.na(Y85_larger), 0),
total = Y5_14 + Y15_24 + Y25_34+Y35_44+Y45_54+Y55_64+Y65_74 +Y75_84+Y85_larger,
R5_14 = Y5_14/total,
R15_24= Y15_24/total,
R25_34=Y25_34/total,
R35_44=Y35_44/total,
R45_54=Y45_54/total,
R55_64=Y55_64/total,
R65_74=Y65_74/total,
R75_84=Y75_84/total,
R85_larger = Y85_larger/total
) %>%
select(state, contains("R"))
## read and clean alcohol
alcohol = read_excel("./data/alcohol2017.xlsx", sheet = 2, skip = 3) %>%
janitor::clean_names() %>%
remove_missing() %>%
mutate(state = alcohol_consumption_per_capita_from_all_beverages_in_the_u_s_in_2017_by_state_in_gallons_of_ethanol,
alcohol_consumption_per_capita_gallons = x2) %>%
select(state, alcohol_consumption_per_capita_gallons)
## read and clean marijuana
marijuana = read_excel("./data/marijuana2017.xlsx", sheet = 2, skip = 3) %>%
janitor::clean_names() %>%
remove_missing() %>%
mutate(state = percentage_of_u_s_adults_that_have_used_cannabis_within_the_past_year_from_2016_to_2017_by_state,
marijuana_percentage = x2*0.01) %>%
select(state, marijuana_percentage)
left_join(alcohol, marijuana)
## education
edu =
cbind(readxl::read_excel("./xj/Educational Attainment.xlsx",range = "A2:A54"),
readxl::read_excel("./xj/Educational Attainment.xlsx",range = "N2:S54")) %>%
janitor::clean_names() %>%
mutate(year = "2017") %>%
select(state, year, everything(),-less_than_bachelors,-at_least_bachelors)
## gun
url = "https://www.thoughtco.com/gun-owners-percentage-of-state-populations-3325153"
gun_xml = read_html(url)
gun_xml %>%
html_nodes(css = "table")
gun=
(gun_xml %>% html_nodes(css = "table")) %>%
.[[1]] %>%
html_table(header = T) %>%
slice(-1) %>%
as_tibble() %>%
janitor::clean_names() %>%
mutate(year = "2017") %>%
select(state,year,everything(),-rank) %>%
select(-year)
## precipitation
precipitation = read_html("https://www.netstate.com/states/tables/state_precipitation.htm") %>%
html_nodes(css = "table") %>%
.[[1]] %>%
html_table(fill = TRUE) %>%
select(X2, X3) %>%
slice(-1:-4) %>%
slice(-(n() - 3):-n()) %>%
rename(state = X2, precipitation = X3) %>%
separate(precipitation, into = c("precipitation_inches", "precipitation_cm"), sep = " inches [(]") %>%
select(state, precipitation_inches) %>%
mutate(precipitation_inches = as.numeric(precipitation_inches))
## temperature
temp_data = read_html("http://www.usa.com/rank/us--average-temperature--metro-area-rank.htm?yr=9000&dis=&wist=&plow=&phigh=") %>%
html_nodes(css = "table") %>%
.[[2]] %>%
html_table() %>%
slice(-1) %>%
separate(X2, into = c("temperature", "unit"), sep = "°") %>%
separate(X3, into = c("area_state", "population"), sep = " / ") %>%
separate(area_state, into = c("area", "state_ab"), sep = ", ") %>%
select(temperature, state_ab) %>%
mutate(temperature = as.numeric(temperature)) %>%
group_by(state_ab) %>%
summarize(mean_temperature_F = mean(temperature)) %>%
filter(nchar(state_ab) == 2)
state_ab_data = read_excel(path = "./data/state_ab.xlsx")
temperature =
left_join(state_ab_data, temp_data, by = "state_ab") %>%
select(state, mean_temperature_F) %>%
mutate(state = str_replace(state, "\\s", ""))
## GDP
gdp = read.csv("./data/gdp_2017.csv") %>%
mutate(gdp = X2017) %>%
select(-X2017)
## joint
us_2017 =
left_join(suicide_rate_df, gender_rate_df) %>%
left_join(gender_suiciderate_df) %>%
left_join(age_df) %>%
left_join(alcohol) %>%
left_join(marijuana) %>%
left_join(edu) %>%
left_join(gdp) %>%
left_join(gun) %>%
left_join(precipitation) %>%
left_join(temperature) %>%
select(-year) %>%
select(-Y85_larger)
names(us_2017) = c("State",
"Suicide rate", "Female proportion", "Male proportion",
"Female Suicide rate", "Male Suicide rate",
"Age 5-14","Age 15-24", "Age 25-34", "Age 35-44", "Age 45-54", "Age 55-64", "Age 65-74", "Age 75-84", "Age >85",
"Alcohol consumption", "Marijuana usage",
"Advanced degree", "Bachelors degree", "College", "Below college degree",
"GDP", "Guns number per capita(average)", "Guns number registered(average)", "Precipitation", "Temperature"
)
# write_csv(us_2017,"./data/us_2017.csv")
After we got USA Suicide Data and the potential risk factors data, we did some data cleaning process. We collected data by downloading directly or extracting tables using read_html
function from the website. Then, we selected states and risk factor information in 2017 and joined every data frame together to get the final merged dataset.
After collecting data from different sources, we constructed a merged dataset collecting multiple potential risk factors for suicide rate in different states in the U.S., here are the descriptions:
Suicide rate (2017): The number of suicide deaths cases per 100k population.
Female proportion (2017): The proportion of female.
Male proportion (2017): The proportion of male.
Female Suicide rate (2017): The number of female suicide deaths cases per 100k females.
Male Suicide rate (2017): The number of male suicide deaths cases per 100k males.
Age 5-14 (2017): The proportion of people aged between 5-14.
Age 15-24 (2017): The proportion of people aged between 15-24.
Age 25-34 (2017): The proportion of people aged between 25-34.
Age 35-44 (2017): The proportion of people aged between 35-44.
Age 45-54 (2017): The proportion of people aged between 45-54.
Age 55-64 (2017): The proportion of people aged between 55-64.
Age 65-74 (2017): The proportion of people aged between 65-74.
Age 75-84 (2017): The proportion of people aged between 75-84.
Age >85 (2017): The proportion of people aged over 85.
Alcohol consumption (2017): Alcohol consumption per capita in gallons
Marijuana usage (2017): The proportion of people using marijuana.
Advanced degree (2017): The proportion of people with advanced degrees.
Bachelors degree(2017): The proportion of people with Bachelor’s degrees.
College degree (2017): The proportion of people with College degrees.
Below college degree (2017): The proportion of people with below college degrees.
GDP (2017): Gross Domestic Product per capita.
Guns number per capita (2017): Averaged guns number per capita.
Guns number registered (2017): Averaged guns number registered
Precipitation: Averaged Precipitation in inches.
Temperature: Averaged Temperature in F.
Another dataset we used is Domectis suicide distribution dataset from 1999 to 2017. In this dataset, we have information about number of suicide cases in different years, states, gender and age group and the population in each groups. Using this dataset, we can analyze how suicide cases change in each state in time and how suicide cases distributed in different gender and age groups.
Visualizing the world data helps us to understand the general trend of suicide rate among different age, gender and capture the difference between domestic and global suicide rate situation.
Firstly, we drew an animated world map Gif to show the suicide rate change of each country from 1986-2016.
Secondly, the line plot was made to show the annual global suicide rate change. We can tell that before 1995, a dramatic increase happened and after 1995, a steady drop followed.
Thirdly, the global suicide rate change by gender was presented. Generally, the rate in male was much higher than the rate in female.
What’s more, different trends in different age groups showed that 75+ age group occupied the highest position all the time. But all age groups were experiencing a drop from 2000 to 2016.
Finally, the average suicide ranking plot(2010-2016) told us how did America’s suicide rate compare with other countries. In top 40, USA ranked 23rd, indicating a worrying situation.
The detailed analysis can be found here.
To show how the suicide cases distributed in different states, we made a Crude Suicide Rate map allowing users to toggle between different years (1999-2017) and choose gender, demonstrating the changes in the suicide rate over time and visualize regional differences across the U.S. The map demonstrates that there are significant changes over time and gender.
We made a line plot to show the change of total suicide rate by year in US. We can see a stable increase by year and the fact that there are many more cases of suicide in male than in female.
abb = read_excel("./data/state_ab.xlsx") %>% mutate(state = str_replace(state, "\\s", ""))
suicide = read_excel("./data/suicide_age_gender.xlsx") %>% janitor::clean_names() %>%
filter(state != "District of Columbia") %>%
filter(is.na(ten_year_age_groups)) %>%
drop_na(gender) %>%
left_join(abb) %>% select(year, state_ab, state, gender, deaths, population) %>%
mutate(rate = (deaths/population)*100000) %>%
select(year, state_ab, gender, rate)
suicide_all = suicide %>% group_by(year,gender) %>% summarise(rate = sum(rate))
line =
suicide_all %>%
ggplot(aes(x = year, y = rate)) + geom_point(aes(color = gender)) + geom_line(aes(color = gender)) + theme_bw() +
labs(
title = "Suicide rate change over time in US"
)
ggplotly(line)
We also made boxplots to show the distributions of suicides cases and population composition in states.
df = read_csv("./data/us_2017.csv") %>% janitor::clean_names() %>% filter(state != "District of Columbia") %>%
pivot_longer(
2:26,
names_to = "variables",
values_to = "data"
)
# suicide rate
df1 = read_csv("./data/us_2017.csv") %>% janitor::clean_names() %>% filter(state != "District of Columbia" ) %>% select(state,suicide_rate, male_suicide_rate, female_suicide_rate) %>%
pivot_longer(
-state,
names_to = "variables",
values_to = "data"
)
df1 %>%
plot_ly(y = ~data, color = ~variables, type = "box",
colors = "Set2")
# population proportion
df2 = read_csv("./data/us_2017.csv") %>% janitor::clean_names() %>% filter(state != "District of Columbia" ) %>%
select(state, male_proportion, female_proportion,
age_5_14,age_15_24,age_25_34,age_35_44,age_45_54,age_55_64,age_65_74,age_75_84,age_85,advanced_degree,bachelors_degree,college,below_college_degree) %>%
pivot_longer(
-state,
names_to = "variables",
values_to = "data"
) %>%
mutate(variables = factor(variables, levels = c("male_proportion", "female_proportion",
"age_5_14","age_15_24","age_25_34","age_35_44","age_45_54","age_55_64","age_65_74","age_75_84","age_85","advanced_degree","bachelors_degree","college","below_college_degree"))
) %>%
mutate(
types = case_when(
variables %in% c("male_proportion" , "female_proportion") ~ "Gender",
variables %in% c("age_5_14","age_15_24","age_25_34","age_35_44","age_45_54","age_55_64","age_65_74","age_75_84","age_85") ~ "Age",
variables %in% c("advanced_degree","bachelors_degree","college","below_college_degree") ~ "Education"))
boxplot =
df2 %>%
ggplot(aes(x = variables, y = data)) + geom_boxplot(aes(color = types)) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(
x = "population compostion",
y = "proportion",
title = "population compostion"
)
ggplotly(boxplot)
To show the distribution of the risk factors, we constructed the U.S. map in shiny app allowing users to choose the types of variables, visualizing regional differences across the U.S. and a bar plot showing the intuitive number in different states. Users can choose TOP N states they want to explore in both plots.
We are interested in whether there is a linear association between suicide rate and some risk factors. We chose suicide
as the main outcome, gun
, alcohol
, temperature
, precipitation
, marijuana
, education
, GDP
and gender
as candidate covariates, each with 50 observations.
One value of gun
in Wyoming and one value of temperature
in Rhode Island were missing. We replaced them with the mean value of the other 49 observations.
After tidying the merged dataset, we obtained the regression dataset. The description of the dataset is shown below:
Variables | Description |
---|---|
state | State |
suicide | The number of suicide deaths per 100k population |
gun | Averaged guns number per capita |
alcohol | Alcohol consumption per capita (gallons of ethanol) |
temperature | Average temperature (F) |
precipitation | Average precipitation (inches) |
marijuana | The proportion of people using marijuana (%) |
education | Educational attainment - bachelor’s degree or higher (%) |
gdp | GDP per capita (dollars) |
gender | Male (%) |
regression_data = read_csv("./data/us_2017.csv") %>%
janitor::clean_names() %>%
filter(state != "District of Columbia") %>%
mutate(
suicide = suicide_rate,
gun = guns_number_per_capita_average,
alcohol = alcohol_consumption,
marijuana = marijuana_usage * 100,
education = (advanced_degree + bachelors_degree) * 100,
gender = male_proportion * 100
) %>%
select(state, suicide, gun, alcohol, temperature, precipitation, marijuana, education, gdp, gender) %>%
# replace missing value with mean value
mutate(
gun = replace_na(gun, mean(gun, na.rm = TRUE)),
temperature = replace_na(temperature, mean(temperature, na.rm = TRUE)))
First, we used scatter plot to see the relationship between suicide rate and each covariate. The scatter plots with linear regression line are shown below:
fit_gun = lm(suicide ~ gun, data = regression_data)
fit_alcohol = lm(suicide ~ alcohol, data = regression_data)
fit_education = lm(suicide ~ education, data = regression_data)
fit_gdp = lm(suicide ~ gdp, data = regression_data)
fit_gender = lm(suicide ~ gender, data = regression_data)
fit_marijuana = lm(suicide ~ marijuana, data = regression_data)
fit_precipitation = lm(suicide ~ precipitation, data = regression_data)
fit_temperature = lm(suicide ~ temperature, data = regression_data)
gun_ly =
regression_data %>%
plot_ly(
x = ~gun, y = ~suicide, type = "scatter", mode = "markers", color = I("palegreen4"), text = ~state, name = "gun") %>%
add_lines(x = ~gun, y = fitted(fit_gun), line = list(color = I("palegreen4"), width = 2), name = "lm-gun")
alcohol_ly =
regression_data %>%
plot_ly(
x = ~alcohol, y = ~suicide, type = "scatter", mode = "markers", color = I("indianred1"), text = ~state, name = "alcohol") %>%
add_lines(x = ~alcohol, y = fitted(fit_alcohol),line = list(color = I("indianred1"), width = 2), name = "lm-alcohol")
temp_ly =
regression_data %>%
plot_ly(
x = ~temperature, y = ~suicide, type = "scatter", mode = "markers", color = I("brown"), text = ~state, name = "temperature") %>%
add_lines(x = ~temperature, y = fitted(fit_temperature), line = list(color = I("brown"), width = 2), name = "lm-temperature")
precip_ly =
regression_data %>%
plot_ly(
x = ~precipitation, y = ~suicide, type = "scatter", mode = "markers", color = I("orange"), text = ~state, name = "precipitation") %>%
add_lines(x = ~precipitation, y = fitted(fit_precipitation), line = list(color = I("orange"), width = 2), name = "lm-precipitation")
mari_ly =
regression_data %>%
plot_ly(
x = ~marijuana, y = ~suicide, type = "scatter", mode = "markers",color = I("snow4"), text = ~state, name = "marijuana") %>%
add_lines(x = ~marijuana, y = fitted(fit_marijuana), line = list(color = I("snow4"), width = 2), name = "lm-marijuana")
edu_ly =
regression_data %>%
plot_ly(
x = ~education, y = ~suicide, type = "scatter", mode = "markers", color = I("plum"), text = ~state, name = "education") %>%
add_lines(x = ~education, y = fitted(fit_education), line = list(color = I("plum"), width = 2), name = "lm-education")
gdp_ly =
regression_data %>%
plot_ly(
x = ~gdp, y = ~suicide, type = "scatter", mode = "markers", color = I("royalblue1"), text = ~state, name = "gdp") %>%
add_lines(x = ~gdp, y = fitted(fit_gdp), line = list(color = I("royalblue1"), width = 2), name = "lm-gdp")
gender_ly =
regression_data %>%
plot_ly(
x = ~gender, y = ~suicide, type = "scatter", mode = "markers", color = I("purple1"), text = ~state, name = "gender") %>%
add_lines(x = ~gender, y = fitted(fit_gender), line = list(color = I("purple1"), width = 2), name = "lm-gender")
subplot(gun_ly, alcohol_ly, temp_ly, precip_ly, mari_ly, edu_ly, gdp_ly, gender_ly, nrows = 4, shareX = FALSE, shareY = TRUE)
Next, we used automatic procedures (stepwise elimination) to cut down covariates.
fit1 = lm(suicide ~.-state,data = regression_data)
step(fit1)
## Start: AIC=80.99
## suicide ~ (state + gun + alcohol + temperature + precipitation +
## marijuana + education + gdp + gender) - state
##
## Df Sum of Sq RSS AIC
## - alcohol 1 0.040 176.27 78.999
## - precipitation 1 0.206 176.43 79.046
## <none> 176.23 80.987
## - temperature 1 13.963 190.19 82.800
## - education 1 30.539 206.77 86.978
## - marijuana 1 33.321 209.55 87.647
## - gun 1 50.987 227.21 91.693
## - gdp 1 60.008 236.23 93.640
## - gender 1 230.078 406.31 120.754
##
## Step: AIC=79
## suicide ~ gun + temperature + precipitation + marijuana + education +
## gdp + gender
##
## Df Sum of Sq RSS AIC
## - precipitation 1 0.239 176.51 77.067
## <none> 176.27 78.999
## - temperature 1 14.916 191.18 81.060
## - education 1 30.554 206.82 84.992
## - marijuana 1 36.734 213.00 86.464
## - gun 1 54.555 230.82 90.481
## - gdp 1 60.536 236.80 91.760
## - gender 1 230.865 407.13 118.856
##
## Step: AIC=77.07
## suicide ~ gun + temperature + marijuana + education + gdp + gender
##
## Df Sum of Sq RSS AIC
## <none> 176.51 77.067
## - temperature 1 16.667 193.17 79.578
## - education 1 31.270 207.78 83.222
## - marijuana 1 38.487 214.99 84.929
## - gun 1 54.864 231.37 88.600
## - gdp 1 60.935 237.44 89.895
## - gender 1 236.423 412.93 117.563
##
## Call:
## lm(formula = suicide ~ gun + temperature + marijuana + education +
## gdp + gender, data = regression_data)
##
## Coefficients:
## (Intercept) gun temperature marijuana education
## -1.440e+02 1.220e-01 -9.552e-02 2.485e-01 -2.343e-01
## gdp gender
## -1.815e-04 3.559e+00
fit2 = lm(suicide ~ gun + temperature + marijuana + education + gender +gdp, data = regression_data)
fit2 %>%
broom::tidy() %>%
knitr::kable(digits = 4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -143.9870 | 24.3248 | -5.9194 | 0.0000 |
gun | 0.1220 | 0.0334 | 3.6559 | 0.0007 |
temperature | -0.0955 | 0.0474 | -2.0150 | 0.0502 |
marijuana | 0.2485 | 0.0811 | 3.0621 | 0.0038 |
education | -0.2343 | 0.0849 | -2.7601 | 0.0085 |
gender | 3.5589 | 0.4689 | 7.5893 | 0.0000 |
gdp | -0.0002 | 0.0000 | -3.8529 | 0.0004 |
fit2 %>%
broom::glance() %>%
knitr::kable(digits = 4)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.8354 | 0.8124 | 2.026 | 36.3648 | 0 | 7 | -102.4802 | 220.9604 | 236.2566 | 176.5058 | 43 |
par(mfrow = c(2,2))
plot(fit2)
Basically, the residual assumptions are satisfied. However, 26 and 50 may be influential. Then, we tried to fit model without 26 and 50.
fit3 = lm(suicide ~ gun + temperature + marijuana + education + gender +gdp, data = regression_data[-c(26,50),])
par(mfrow = c(2,2))
plot(fit3)
fit3 %>%
broom::tidy() %>%
knitr::kable(digits = 4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -124.8903 | 19.0036 | -6.5719 | 0.0000 |
gun | 0.1277 | 0.0257 | 4.9753 | 0.0000 |
temperature | -0.0659 | 0.0368 | -1.7906 | 0.0807 |
marijuana | 0.2668 | 0.0628 | 4.2472 | 0.0001 |
education | -0.2224 | 0.0667 | -3.3327 | 0.0018 |
gender | 3.1119 | 0.3692 | 8.4287 | 0.0000 |
gdp | -0.0002 | 0.0000 | -4.5129 | 0.0001 |
The coefficients of model changed slightly, except for temperature(over 10%). We decided to retain 26 and 50.
GGally::ggcoef(fit2,aes(x = estimate, y = term, color = term),size = 0.8,
exclude_intercept = T,sort = "ascending") +
labs(title = "95% CI for Model Coefficients")
Results:
The “alcohol” and “precipitation” were removed from the model.
The fitted equation is “suicide = -143.99 + 0.12gun - 0.10temperature + 0.25marijuana - 0.23education + 3.56gender - 0.0002gdp”.
Adjusted R-square is 0.8124, which means these variables can explain a large proportion of variance in the suicide rate.
According to the results, the suicide rate is higher in states where there is a higher gun ownership rate, higher marijuana usage, higher ratio of males to females, lower temperature and lower educational attainment.
Holding other variables constant, for one unit increase in gun ownership(per 1,000), marijuana use in adults(%), or for one unit decrease in proportion of male (%), GDP per capita (dollars), the suicide rate(per 100k population) will increase by 0.13, 0.27, 3.11 respectively, or decrease by 0.22, 0.0002 on average.
Multiple risk factors are responsible for suicide. From our linear regression model, the suicide rate in 50 states in the U.S. is significantly associated with gender, gun ownership rates, average temperature, marijuana use, the degree of education and GDP. However, future suicide rate could not be limited by these variables. While suicide is a global phenomenon, further study needs to be conducted to determine the generalization of our model and hypothesis.
In this project, we do gain several insights about risk factors for suicide rate in the U.S.. Firstly, gender matters. Men are at higher risk of suicide compared to women, which may be explained by the different socioeconomic status owned and the degree of pressure are faced. Other important factors are the marijuana use and the gun ownership rates. This highlights that legislation needs to be improved for restrictions in drugs and guns to prevent suicide. What’s more, the average temperature, degree of education and GDP are all negatively correlated with the suicide rate.
With the visualization of geographic variation of these risk factors in the U.S., we get insights into target states with higher prevalence of risk factors. We hope that our project could help to solve the puzzle for suicide cause and increase the awareness of the public.