---
title: "Statistical inference with the GSS data"
output:
  html_document:
    fig_height: 4
    highlight: pygments
    theme: spacelab
---

## Setup

### Load packages

```{r load-packages, message = FALSE}
library(ggplot2)
library(dplyr)
library(statsr)
```

### Load data

```{r load-data}
load("gss.Rdata")
```

* * *

## Part 1: Data

Since 1972, the General Social Survey (GSS) has collected data covering a diverse range of issues such as marijuana use, crime and punishment, race relations, quality of life, sexual behaviour, national spending priorities, religious views, and working and marital status. Altogether, the GSS is the best source for sociological trends in the United States.

Until 1994 it was conducted almost annually; since 1994 it has been conducted in even-numbered years. The horizontal spread of topics and the vertical extent of the survey give researchers an unprecedented opportunity to study changes in American public opinion over time.

* * *

## Part 2: Research question

### Motivation

In recent years, there has been less interest in scientific research (in the fields of medicine, pharmaceuticals, and psychological science). This lack of interest leads to reduced research funding, but would hopefully also lead to a desire to improve the national education system through public spending.

### Hypothesis

Is there any association between respondents' level of confidence in the American educational system (coneduc) and their confidence in the scientific community (consci)?
Are respondents with less confidence in education or science more likely to support improving the national education system (nateduc)?

H0: Level of confidence in the American educational system and level of confidence in the scientific community are independent.

HA: Level of confidence in the American educational system and level of confidence in the scientific community are dependent.

H0: Level of confidence in the American educational system and desire to fund the American educational system are independent.

HA: Level of confidence in the American educational system and desire to fund the American educational system are dependent.

H0: Level of confidence in the scientific community and desire to fund the American educational system are independent.

HA: Level of confidence in the scientific community and desire to fund the American educational system are dependent.

* * *

## Part 3: Exploratory data analysis

Let us filter out the observations that have NA in any of the four variables below.

```{r}
# Keep only rows where all four variables of interest are present
gss_mod <- gss %>%
  filter(!is.na(coneduc) & !is.na(consci) & !is.na(year) & !is.na(nateduc))
```

This reduces the data set from 57061 to 23174 observations. Listing the remaining years shows that 1972 has been removed.

```{r}
unique(gss_mod$year)
```

We also find that there are comparatively more observations per year prior to 1984.

```{r}
# Number of remaining observations in each survey year
year_counts <- gss_mod %>%
  group_by(year) %>%
  summarise(count = n())
year_counts
```

## Part 4: Inference

1. Independence: we can assume the records are independent, as the GSS dataset is generated from a randomly sampled survey.
2. Sample size: the 23174 records we use for this analysis are indeed less than 10% of the total US population.
3. Degrees of freedom: coneduc and consci each have 3 levels. Because we are comparing two categorical variables, each with more than 2 levels, we use the chi-squared test of independence to test the hypotheses.
4. Expected counts: to perform a chi-squared test (goodness of fit or independence), the expected count for each cell should be greater than or equal to 5.
As shown below, the count in each cell is greater than or equal to 5.

```{r}
# Contingency table: confidence in education vs. confidence in science
gss_educ_sci <- table(gss_mod$coneduc, gss_mod$consci)
gss_educ_sci
```

```{r}
g <- ggplot(data = gss_mod, aes(x = coneduc))
g <- g + geom_bar(aes(fill = consci), position = "dodge")
g + theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(x = "Level of Confidence in Education", y = "Counts")
```

```{r}
plot(table(gss_mod$coneduc, gss_mod$consci),
     main = "Mosaic Plot",
     xlab = "Confidence in Education",
     ylab = "Confidence in Scientific Community")
```

### Chi-squared test of independence

```{r}
chisq.test(gss_educ_sci)
```

### Finding

We reject the null hypothesis in favour of the alternative hypothesis, as the p-value is much lower than the 0.05 significance level.

```{r}
# Joint proportions: confidence in education vs. opinion on education funding
gss_educ_fund <- table(gss_mod$coneduc, gss_mod$nateduc)
gss_educ_fund / nrow(gss_mod)
```

```{r}
g <- ggplot(data = gss_mod, aes(x = coneduc))
g <- g + geom_bar(aes(fill = nateduc), position = "dodge")
g + theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(x = "Level of Confidence in Education", y = "Counts")
```

```{r}
plot(table(gss_mod$coneduc, gss_mod$nateduc),
     main = "Mosaic Plot",
     xlab = "Confidence in Education",
     ylab = "Opinion on Educational Funding")
```

To look further at the desire to fund education, we plot the proportion of respondents in each year who feel that education funding is "Too Little".

```{r}
sci_fund_levels <- gss_mod %>%
  filter(nateduc == "Too Little") %>%
  group_by(year) %>%
  summarise(count = n()) %>%
  # Join on year so each count is divided by the total for the same year
  inner_join(year_counts, by = "year", suffix = c("", "_total")) %>%
  mutate(proportions = count / count_total)

ggplot(data = sci_fund_levels, aes(x = year, y = proportions)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Respondents who feel education funding is 'Too Little'")
```
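
The expected-count condition from Part 4 refers to the counts expected under independence, which `chisq.test()` returns in its `expected` component. The following is a minimal sketch of how that condition, and the two hypothesis pairs involving nateduc, might be checked. It assumes `gss_mod` from Part 3 is in the workspace; `chisq_with_check` is a hypothetical helper name introduced here for illustration and is not part of the original analysis.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# runs a chi-squared test of independence on two categorical vectors and
# reports the smallest expected cell count alongside the test result.
chisq_with_check <- function(x, y) {
  # factor() drops unused levels so empty rows/columns do not enter the table
  tab <- table(factor(x), factor(y))
  test <- chisq.test(tab)
  list(
    min_expected = min(test$expected),  # condition: should be >= 5
    df = unname(test$parameter),        # (rows - 1) * (cols - 1)
    p_value = test$p.value
  )
}

# Apply to the three variable pairs, only if the filtered data is available
if (exists("gss_mod")) {
  print(chisq_with_check(gss_mod$coneduc, gss_mod$consci))
  print(chisq_with_check(gss_mod$coneduc, gss_mod$nateduc))
  print(chisq_with_check(gss_mod$consci, gss_mod$nateduc))
}
```

For each pair, a `min_expected` of at least 5 would support the expected-count condition, and a `p_value` below 0.05 would lead to rejecting the corresponding null hypothesis of independence.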