---
title: 'TASK 2'
output:
  html_document:
    df_print: paged
  pdf_document: default
---

You can assume that the dataset to be used is in the same folder as this file, but you have to load it yourself. Where the text says which functions to use, those functions must be used; other functions that do almost the same thing will not necessarily produce the correct result. When it is essential to use particular functions, they are mentioned explicitly as `function_name`.

## Classification

We begin with classification. For these tasks, use the dataset **vote92.rda**, which comes from a survey around the 1992 US presidential election. There were three candidates: Ross Perot, Bill Clinton, and George H. W. Bush. The dataset comes from

*Alvarez, R. Michael and Jonathan Nagler. 1995. Economics, issues and the Perot candidacy: Voter choice in the 1992 Presidential election. American Journal of Political Science 39: 714-44.*

Documentation of the dataset is included at the end of this task.

We start by loading the dataset. Here you should use `load`. As always, we use `library(tidyverse)`.

```{r vote92}

```

Then make a figure with `ggplot` and `geom_bar` that shows how many votes each of the candidates got among the respondents in this dataset. Call this figure **distribution**.

```{r distribution}
# distribution
```

Now create a dataset without the observations from the respondents who said they voted for the candidate who received the least support among these respondents. Call this dataset **vote2**. You should use functions from `library(dplyr)` to do this. As you know, `library(dplyr)` is part of the `tidyverse` collection that we always use.

```{r vote2}
# vote2
```

Now create a formula for a model that classifies which candidate a respondent votes for as a function of the other variables in the dataset. A formula here means a dependent variable modeled as a function of a set of independent variables. Use `as.formula` to create the formula.
Call this **mod_1**. You should not use the party-identification variables *dem* and *rep*. All other relevant variables must be included in the model.

```{r mod_1}
# mod_1
```

Now estimate **mod_1** as a logistic model. Use `glm`. Call the result of this model **res_1**. Use the dataset that has only two candidates. For your own part, it may be useful to make a summary of the results.

```{r res_1}
# res_1
```

Create an alternative model in which you include interactions between gender and the reported ideological distance between the respondent and each of the two candidates. Call this model **mod_2**. Use `update` to go from **mod_1** to **mod_2**.

```{r mod_2}
# mod_2
```

Estimate **mod_2**. Call the results **res_2**.

```{r res_2}
# res_2
```

Based on these results, is there any basis for saying that ideology plays less of a role for women than for men? Give **ideology_difference** the value 1 if you think there is, and the value 0 if not.

```{r ideology_difference}
# ideology_difference
```

## loocv: Leave One Out Cross Validation

To investigate further whether we have grounds for claiming that ideology plays different roles for men and women in the population, we should use cross-validation. We first estimate loocv (leave-one-out cross-validation).

Start with a standard cost function that weighs all types of errors equally. Call this function **error_class**. This function takes two arguments: the true class and the model predictions.

```{r error_class}
# error_class
```

Then estimate loocv for **res_1** and **res_2**. Give these models the names **class_1** and **class_2**. Based on the unadjusted `delta` from **class_1** and **class_2**, calculate the difference between the models. Call this difference **delta_difference**. You should use `library(boot)`.
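As a hedged illustration of the mechanics (not the answer chunk — the fitted-model and data names below are placeholders), a cost function of this kind and its use with `boot::cv.glm` could look roughly like this:

```r
library(boot)

# Cost function: share of misclassifications, weighting all error types equally.
# r = observed 0/1 outcome, pi = predicted probability from the glm.
error_class <- function(r, pi) mean(abs(r - pi) > 0.5)

# LOOCV for a fitted glm (K defaults to n, i.e. leave-one-out).
# `vote2` and `res_1` are assumed names from the task, not defined here:
# class_1 <- cv.glm(data = vote2, glmfit = res_1, cost = error_class)
# class_1$delta[1]  # unadjusted estimate of the classification error
```

The `delta` component of a `cv.glm` result holds the raw cross-validation estimate first and a bias-adjusted version second, which is why the task specifies the unadjusted `delta`.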
```{r delta_loocv}
# class_1
# class_2
# delta_difference
```

## k-fold cross-validation

As is well known, loocv can adapt to the training data. Therefore, use k-fold cross-validation with 11 folds as an alternative measure of the expected classification error. Call the object that shows the difference between the results from **res_1** and **res_2** **cross_difference**. Use `set.seed` to make the results reproducible. Use seed 8947.

```{r cross_difference}
# cross_difference
```

## Alternatives to `glm`

Now estimate linear discriminant analysis versions of **mod_1** and **mod_2**. You should use `lda` from `library(MASS)`. Collect the results in a list that you create with `map`. Call the list of results **lda_mods**.

```{r lda_mods}
# lda_mods
```

Make a ROC figure in which you compare how the two `glm` models and the two `lda` models perform. You should use `library(pROC)`. Use `roc` to create the curves and `ggroc` to make the figure. Call the figure **roc_fig**.

```{r roc_fig}
# roc_fig
```

## More than 2 options

Because `lda` and `qda` can be used for problems with more than two classes, update **mod_1** and **mod_2** with the distance-to-Perot variable; in **mod_2**, also include the interaction between gender and distance to Perot. Call these models **mod_1p** and **mod_2p**. Estimate `lda` models. Use **vote92**. Call the list of results **mod_p**.

```{r mod_p, warning = FALSE}
# mod_1p
# mod_2p
# mod_p
```

Estimate a quadratic version of these models with `qda`. With `map`, collect the results in a list called **qda_mods**.

```{r qda_mods}
# qda_mods
```

Show to what degree the `qda` version of the model **mod_1p** is able to find the right candidate in the training data. Predict the outcomes and create a 3x3 table that cross-tabulates actual and predicted candidate. Call the table **qda_tab**. Predictions should be in the rows and actual outcomes in the columns.
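As a hedged sketch of how such a confusion table is built (not the answer chunk — `iris` stands in for **vote92**, since the assignment data are not bundled here):

```r
library(MASS)

# Fit a quadratic discriminant model on a three-class stand-in dataset.
qda_fit <- qda(Species ~ ., data = iris)

# predict() on a qda object returns a list; $class holds the predicted labels.
pred <- predict(qda_fit, iris)$class

# Predictions in rows, actual classes in columns, as the task requires:
conf_tab <- table(predicted = pred, actual = iris$Species)
conf_tab
```

The diagonal of such a table counts the correctly classified observations, so `sum(diag(conf_tab)) / sum(conf_tab)` gives the training accuracy.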
```{r qda_tab}
# qda_tab
```

## Bootstrap

Create a function that lets you estimate a bootstrap version of **res_2** with 1000 draws; use **vote2**. Call the result **res_2_boot**.

```{r res_2_boot}
# res_2_boot
```

Finally, make a figure showing the coefficients from **res_2_boot**. Show the means together with the 2.5 and 97.5 percentiles and the 33 and 66 percentiles. Add a dashed horizontal line at 0, so that one can easily see whether an effect is positive, negative, or overlaps 0. The interval between the 2.5 and 97.5 percentiles should have line width 1 in gray tone 75; the interval between the 33 and 66 percentiles should have line width 2 in gray tone 25. Mark the mean effect of each variable with a black dot of size 2. Remove the x- and y-axis names. The figure should be flipped so that the coefficients are stacked vertically with the effects along the x-axis. Call the figure **koeff_fig**.

```{r koeff_fig, warning = FALSE}
# koeff_fig
```

## Documentation of the dataset

**Reports of voting in the 1992 U.S. Presidential election.**

### Description

Survey data containing self-reports of vote choice in the 1992 U.S. Presidential election, with numerous covariates, from the 1992 American National Election Studies.

### Usage

`data(vote92)`

### Format

A data frame with 909 observations on the following 9 variables.

- `vote`: a factor with levels Perot, Clinton, Bush
- `dem`: a numeric vector, 1 if the respondent reports identifying with the Democratic party, 0 otherwise
- `rep`: a numeric vector, 1 if the respondent reports identifying with the Republican party, 0 otherwise
- `female`: a numeric vector, 1 if the respondent is female, 0 otherwise
- `persfinance`: a numeric vector, -1 if the respondent reports that their personal financial situation has gotten worse over the last 12 months, 0 for no change, 1 if better
- `natlecon`: a numeric vector, -1 if the respondent reports that national economic conditions have gotten worse over the last 12 months, 0 for no change, 1 if better
- `clintondis`: a numeric vector, squared difference between the respondent's self-placement on a scale measure of political ideology and the respondent's placement of the Democratic candidate, Bill Clinton
- `bushdis`: a numeric vector, squared ideological distance of the respondent from the Republican candidate, President George H.W. Bush
- `perotdis`: a numeric vector, squared ideological distance of the respondent from the Reform Party candidate, Ross Perot

### Details

These data are unweighted. Refer to the original data source for weights that purport to correct for non-representativeness and non-response.

### Source

Alvarez, R. Michael and Jonathan Nagler. 1995. Economics, issues and the Perot candidacy: Voter choice in the 1992 Presidential election. American Journal of Political Science 39: 714-44.

Miller, Warren E., Donald R. Kinder, Steven J. Rosenstone and the National Election Studies. 1999. National Election Studies, 1992: Pre-/Post-Election Study. Center for Political Studies, University of Michigan: Ann Arbor, Michigan. Inter-University Consortium for Political and Social Research. Study Number 1112. http://dx.doi.org/10.3886/ICPSR01112.

### References

Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Hoboken, New Jersey. Examples 8.7 and 8.8.
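For reference, the bootstrap step described in the task follows a standard `boot::boot` pattern. This is a hedged sketch only: `mtcars` stands in for **vote2**, the model formula is a placeholder, and the number of draws is kept small here (the task itself asks for 1000).

```r
library(boot)

# Statistic function for boot(): refit the glm on the resampled rows
# and return its coefficient vector.
boot_coef <- function(data, indices) {
  fit <- glm(am ~ wt + hp, data = data[indices, ], family = binomial)
  coef(fit)
}

set.seed(8947)  # the seed used elsewhere in the task, for reproducibility
res_boot <- boot(data = mtcars, statistic = boot_coef, R = 200)

# res_boot$t holds one row per draw and one column per coefficient;
# these percentiles are the ingredients for the coefficient figure.
apply(res_boot$t, 2, quantile, probs = c(0.025, 0.33, 0.66, 0.975))
```

Resampling rows with replacement and refitting, as `boot` does here, is the nonparametric bootstrap; the percentile columns map directly onto the intervals the coefficient figure is asked to display.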