STA304 Fall 2020

STA304: Surveys, Sampling and Observational Data

Delivery: Fully online

Online class times: L0101 9:10 a.m. ET (Friday); L0301 2:10 p.m. ET (Friday)

Course webpage: All materials will be posted on Quercus (https://q.utoronto.ca)

Instructor Information

Professor: Samantha-Jo Caetano

How do you pronounce that? Like the English words “kyte”, “a”, “no”. Please call me: Sam (pronouns: she/her)

Office Hours: Wednesdays 11:10 a.m.–1:00 p.m. ET (Toronto time)

Course Information

Statistics is about how we can learn from data. Before data can be analyzed, it must first be collected.

This course will focus on how to appropriately collect data through surveys and samples, as well as some of the pitfalls of collecting data inappropriately. Throughout the course we will look at how to design a survey (including both experimental designs and observational studies) and how to label (and potentially avoid) sources of bias. We will learn about different sampling techniques, including random sampling, stratification, clustering, systematic sampling, and unequal probability selection. Lastly, we will cover how to make inferences based on the data collected from a survey/sample. This inference will include estimating population parameters of interest (means, proportions, variation, correlation, etc.).
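As a rough illustration of three of the sampling techniques named above, here is a minimal sketch. It is not course material: the course itself uses R, this sketch is in Python, and the function names and the example strata are hypothetical choices made only for illustration.

```python
import random


def simple_random_sample(population, n, seed=0):
    """Draw n units without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, n, seed=0):
    """Pick a random start, then take every k-th unit (k = population size // n)."""
    rng = random.Random(seed)
    k = len(population) // n          # sampling interval
    start = rng.randrange(k)          # random start within the first interval
    return [population[start + i * k] for i in range(n)]


def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw a simple random sample of the same size within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}


# Hypothetical sampling frame of 100 unit IDs, split into two made-up strata.
frame = list(range(100))
strata = {"first_year": frame[:40], "upper_year": frame[40:]}

srs = simple_random_sample(frame, 10)
systematic = systematic_sample(frame, 10)
stratified = stratified_sample(strata, 3)
```

The stratified version guarantees that each stratum is represented, whereas a simple random sample may miss a small stratum by chance; that is one of the trade-offs between designs.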

Course Format and Organization

This term, we will be using an asynchronous (i.e. flipped) format for STA304. Each week, lecture videos will be posted; you will watch the videos, read the course slides, and complete a quiz ahead of our Friday class meeting. I will then focus on hands-on, applied, and discussion-based problems in our scheduled class time on Fridays. Hopefully, you will join these sessions live so that you can participate, but a video recording of each session will be posted on Quercus for review and to support members of the class in difficult time zones. In addition to these weekly activities, there will be a discussion board, one test, three problem sets and a final project.

A typical week (Monday to Friday):

- Watch content videos, review course slides

- Participate in Discussion Board and prep for upcoming assessments (Quiz, Problem Sets, etc.)

- Complete quiz (due Fridays at 11:59 p.m. ET)

- Friday class meeting


Glossary of Terminology and Course Tools

Below, you’ll find a list of some of the terms you will encounter in this course and more broadly at the University. You can find a more complete glossary at https://future.utoronto.ca/newly-admitted-students/checklist/glossary-of-terms/

Bb Collaborate (aka Blackboard Collaborate): A video conferencing tool that we will use for synchronous class meetings and office hours. You can access it through Quercus (look for Bb Collaborate in the sidebar). You can join the Course Room at any time to update your profile picture. Instructions and other getting started tips here: https://help.blackboard.com/Collaborate/Ultra/Participant/Get_Started

Quercus: The online teaching and learning system where you will find your course homepages (including course materials, syllabi, announcements, and grades) and other resources.

Piazza: A free Q&A platform which is used in many courses. In this course, you’ll use Piazza to post questions about course content or logistics (see “Getting your questions answered” for more guidance on how to join and use Piazza).

R / RStudio: Free software environment for statistical computing. In this course, you’ll learn to use R to produce visualizations, manipulate data, and conduct analyses. No prior programming experience is assumed.

UTORID: Your username that gives you access to ACORN, Quercus, etc. (e.g. caetano2)

Minimal technical requirements

All students should consult the minimum technical requirements for participation in online learning. If

you are facing financial barriers to obtaining the required technology, please contact your College

Registrar’s Office to obtain information regarding your potential eligibility for a need-based bursary. If

you anticipate having difficulty connecting to University websites (e.g., Quercus), please submit your

question here: https://www.utoronto.ca/covid19-contact.

Throughout this course we will be using R/RStudio to work through applied problems. You are not required to know how to program in R/RStudio at the start of this course, but familiarity will ease the learning curve. Discussions will take place on Piazza (or Quercus).


Learning Objectives and Assessment

By the end of the course, you should be able to:

• Identify and implement different sampling techniques and different study designs and the trade-offs involved in each.

• Design a survey or sample that is appropriately gathering information of interest.

• Carry out a variety of statistical analyses in R to make inference on the data collected from a survey/sample.

• Identify sources of bias within a study and comment on a study’s design, including its weaknesses, strengths, and appropriate analyses.

• Clearly communicate results of statistical analyses to technical and non-technical audiences.

How will your success be measured?

WEIGHT   WHAT             DATES

9%       Weekly quizzes   Due Fridays at 11:59 p.m. ET. Starting: Week of Sept 14 (Friday Sept 18). Ending: Week of Nov 30 (Friday Dec 4). Exceptions: Week of the Test (Fri Nov 20) & Reading Week (Fri Nov 13)

17%      Problem Set 1    Due Thursday October 1

17%      Problem Set 2    Due Thursday October 15

17%      Problem Set 3    Due Monday November 2

10%      Test             Thursday November 19; details TBA on Quercus

30%      Final project    Multiple due dates (Final Project due in Final Assessment Period); details TBA

Quizzes (due Fridays at 11:59 p.m. ET)

There will be a quiz each week, apart from the week of the test and reading week, for a total of 10 quizzes. Quizzes are intended to be completed while looking over the lecture material/videos for that week, so it is recommended that you go through the quiz immediately after viewing the weekly posted lecture video(s). Quizzes will consist of a combination of multiple-choice and fill-in-the-blank questions. Each quiz will cover material presented in that week’s module (videos, slides, etc.). Each quiz will be available on Quercus from the time a module is posted (Monday at 6:00 p.m. ET) until the following Friday at 11:59 p.m. ET; no late submissions will be accepted. The only exception to this is if you miss the quiz for a valid medical reason; see the “Missed Work” section for the course policy.

Discussion Board (ongoing)

We will be using Piazza on Quercus to facilitate discussion. I will post questions and topics on the Quercus discussion board. Your participation and responses will *not* be graded, but I strongly encourage you to use the discussion boards. The TAs and professor will monitor the discussion boards throughout the week, so if you have a general question regarding the course structure or topics, or something interesting that you would like to share with the class, please post it here. Please post all general questions regarding the course on the discussion board.

Please note that emails and the message board are not checked or responded to by the TAs or me after hours or on the weekend.

Missed Work You do not need to reveal your personal or medical information to me. I understand that illness or

personal emergencies can happen from time to time. The following accommodations to assessment

requirements apply in these situations.

Weekly Quizzes & Problem Sets

If a quiz or a problem set is missed for a valid reason, you may ask to be excused from the assessment. Extensions will *not* be given. If approved, the weight of the missed assessment will be shifted to all remaining assessments of the same type (i.e. the weight of a missed quiz will be shifted to the remaining quizzes and the weight of a missed problem set will be shifted to the remaining problem sets). To request to be excused from an assessment, you must report your absence through the ACORN Absence Declaration Tool (https://www.acorn.utoronto.ca/) AND send an email to your instructor at [email protected]. For consideration, your email:

- must be received within 1 week of the missed assessment,

- must include your full name and student number,

- must specify the assessment missed including the date, and

- must include the following two sentences:

1. “I affirm that I am experiencing an illness or personal emergency and I understand that to falsely claim so is an offence under the Code of Behaviour on Academic Matters.”

2. “I understand that the weight of this assessment will be shifted to the remaining assessments of the same type.”

NB: No more than two quizzes and no more than one problem set can be accommodated in this way. Missed quizzes and problem sets beyond this limit will be recorded as 0%.
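To make the weight-shifting arithmetic concrete, here is a minimal sketch. It rests on two assumptions that the syllabus does not spell out: that every assessment of a given type carries equal weight, and that "shifting" the weight amounts to averaging over the assessments that are not excused. The function name `component_grade` is hypothetical, and the sketch is in Python rather than the course's R.

```python
def component_grade(scores, total_weight, max_excused):
    """Return the points earned (out of total_weight) for one assessment type.

    scores: list of percentages (0-100); None marks a missed assessment
            for which an excusal was requested.
    total_weight: weight of the whole component, e.g. 9 for the quizzes.
    max_excused: how many misses may be excused (2 quizzes, 1 problem set).
    """
    excused = 0
    counted = []
    for score in scores:
        if score is None and excused < max_excused:
            excused += 1             # weight shifts to the other assessments
        elif score is None:
            counted.append(0.0)      # beyond the limit: recorded as 0%
        else:
            counted.append(float(score))
    if not counted:
        return 0.0
    return total_weight * sum(counted) / len(counted) / 100


# Ten quizzes worth 9% in total, one of them excused.
quiz_points = component_grade([80, None, 90, 100, 70, 60, 100, 90, 80, 50], 9, 2)

# Three problem sets worth 51% in total (17% each), two missed but only one excusable.
ps_points = component_grade([90, None, None], 51, 1)
```

Under these assumptions, a student with one excused quiz and an 80% average on the other nine earns 7.2 of the 9 quiz points, since 0.80 × 9 = 7.2.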


Tests

If the test is missed for a valid reason, you may ask for an accommodation. If approved, you will be offered an alternate assessment (which may be an oral exam, a cumulative test on all material in the course, or some combination of the two).

To request an accommodation, you must report your absence through the ACORN Absence Declaration Tool (https://www.acorn.utoronto.ca/) AND send an email to your instructor at [email protected]. For consideration, your email:

- must be received within 1 week of the missed assessment,

- must include your full name and student number,

- must specify the assessment missed including the date, and

- must include the following two sentences:

1. “I affirm that I am experiencing an illness or personal emergency and I understand that to falsely claim so is an offence under the Code of Behaviour on Academic Matters.”

2. “I understand that I will need to complete an alternative assessment to gain this grade and that I will be given further instructions in response to this email (if approved).”

Final project

The final paper is a critical piece of assessment. Extensions for valid reasons may be granted for a maximum of three days (i.e. through to the close of the university). The exact extension granted will be at the discretion of the instructor.

To be considered, an extension request MUST be sent to [email protected] by 11:59 a.m. ET (midday) on the business day prior to the due date. Note: For Monday deadlines, this will be 11:59 a.m. ET (midday) the previous Friday.

Where possible, alert me to potential issues as early as you can. This will allow us to work together to find a suitable solution.

Important note

If too much work is missed, even for valid reasons, a make-up test (including all work covered during the entire term) and/or an oral exam may be required to calculate a fair mark, at the discretion of the instructor. Please ensure you and/or your registrar get in touch with me as early as possible if this may be the case for you.


(Tentative) Weekly Assessment Schedule

Week            Quizzes         Other Assessment

Sept 7-11

Sept 14-18      Quiz 1

Sept 21-25      Quiz 2

Sept 28-Oct 2   Quiz 3          Problem Set 1

Oct 5-9         Quiz 4

Oct 12-16       Quiz 5          Problem Set 2

Oct 19-23       Quiz 6

Oct 26-30       Quiz 7

Nov 2-6         Quiz 8          Problem Set 3

Nov 9-13        READING WEEK    READING WEEK

Nov 16-20                       Test

Nov 23-27       Quiz 9

Nov 30-Dec 4    Quiz 10

Dec 7-9

Dec 11-22                       Final Project


Marking Concerns

Any request to have an assessment remarked must be emailed to [email protected] within one week of the grades being posted; your request will be reviewed by the course instructors and head teaching assistant. Your request must include:

- your name and student number,

- a detailed written justification referring to your answer and the relevant course material to be considered; it is not enough to simply say that you believe your answer deserves higher credit, rather you must support your request with specific reference to relevant course materials.

Please note that we reserve the right to review the grading of all questions or parts when you re-submit an assessment for reconsideration (i.e., your grade could go down).

Writing

Communication, and especially writing, is a critical aspect of the statistical workflow. Papers, assignments, short answer questions, etc. should be well-written, well-organized, and easy to follow. They should flow easily from one point to the next. They should have proper sentence structure, spelling, vocabulary, and grammar. Each point should be articulated clearly and completely without being overly verbose. You will be heavily penalised for papers that do not meet these basic requirements.

Papers should demonstrate your understanding of the material you have learnt and your confidence in drawing on the terms, techniques, and issues you have considered. Your work must be thoroughly referenced.

If you have concerns about your ability to do any of this, then please make use of the writing support

provided to students - https://writing.utoronto.ca/. The services are designed to target the needs of both native and non-native speakers and the programs are free. I have used similar services in the past at other universities and always found them very helpful.

Core Texts

There is not strictly one textbook that we will be working through. Instead, we will be calling upon multiple texts to gain insight into surveys and sampling. Here is a list of the main texts that we draw on:

1. Wu, Changbao and Mary E. Thompson, 2020, Sampling Theory and Practice, Springer. This text is currently available for download from the U of T library website: https://onesearch.library.utoronto.ca/

2. Gelman, Andrew, Jennifer Hill and Aki Vehtari, 2020, Regression and Other Stories, Cambridge University Press.

3. Kohavi, Ron, Diane Tang, and Ya Xu, 2020, Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing, Cambridge University Press.

4. McElreath, Richard, 2020, Statistical Rethinking, 2nd Edition, CRC Press.


Getting your questions answered

Piazza forum (available on Quercus)

Use Piazza for:

Question(s) about course content, e.g.

- I don’t understand the difference between stratification and clustering.

- Can we make a causal statement here?

- My code won’t run for the assignment (please include screenshots of your code and the error message!)

Question(s) about course logistics, e.g.

- What is the deadline for the weekly quiz?

- Where do I see the discussion board topic?

Information / resource to share with classmates, e.g.

- I have a link/resource/opportunity to share with my classmates

When using Piazza:

- Posts can be anonymous for your classmates, but instructors and TAs will be able to see your name.

- Before posting a question, search to see if someone else has already asked a similar question (you can edit the question to add yours or post a follow-up at the bottom).

- Try to answer your classmates’ questions. This is a great way to reinforce your own understanding while also helping your classmates! Don’t worry if you aren’t 100% sure of the answer; all answers will be reviewed / endorsed / completed by TAs and instructors!

My contact email: [email protected] (only send emails from your utoronto.ca email address to ensure it doesn’t automatically go to a Junk folder and be sure to include your full name and student number)

Use email for:

Question(s) related to your personal circumstances (i.e. something which is not appropriate to share with the whole class), e.g.

- I would like to request an accommodation for yesterday’s quiz because I was ill (make sure to include all information listed in the “Missed Work” section)

- I would like question 2 on the assignment to be regraded (be sure to include clear justification, as outlined in the “Marking Concerns” section)

When using email:

- This account will be monitored by the course instructor; if you want to reach a specific instructor or TA please email them directly.

- Allow 48 hours for a response during the week (Monday to Friday, ET) and do not expect responses on the weekend.

- If you cannot meet a deadline because you are ill, please refer to the “Missed Work” section in this syllabus and submit all required information to this email account.

- Questions about course content won’t be answered here, but rather redirected to Piazza or office hours.


How to succeed in this course

The course is designed to actively engage you in the course material. We hope you’ll find the statistical reasoning and data science interesting, challenging, and fun! In order for you to get the most from the classroom sessions:

• Always watch/read the weekly content before Friday’s class meeting (there will be a quiz to help you check your knowledge),

• Complete the assignments,

• Keep up-to-date in the course; do not leave working on discussions or your project to the last minute, and

• Ask questions! Post in/watch the course discussion forum on Piazza and attend instructor and/or TA office hours (TA office hours will be posted on Quercus).

Recognized Study Groups

Recognized Study Groups (RSGs) are small study groups of 3 to 6 students from the same course who meet weekly to learn course content in a collaborative environment.

Each group is made up of students from the same course. One student volunteers to be the RSG Leader

and helps organize and plan weekly activities. The RSG Leader is a student who is trained in group

facilitation and effective learning techniques. RSG Leaders are not tutors – they are learning along with

group members.

A student staff member is also assigned to each group to help connect you to academic resources and

support your group’s goals.

While not compulsory for this course, I would highly recommend you get involved with an RSG.

Meet to Complete

Meet to Complete is an online “study with me” space where you can study alongside other students.

Each Meet to Complete is hosted by a student to welcome you and provide support & encouragement,

if needed.

To join Meet to Complete, enroll in the Meet to Complete course on Quercus! Online learning doesn’t

need to be lonely.

Academic Integrity

You are responsible for knowing the content of the University of Toronto’s Code of Behaviour on Academic Matters.

As a general rule, we encourage you to discuss course material with each other and ask others for advice. However, it is not permitted to share answers or to directly share R code or written answers for anything that is to be handed in. For example, “For question 2.1 what R function did you use?” is a fair question when discussing course material with others in the class; “Please show me your R code for question 2.1” is not an appropriate question. If writing or code is discovered to match another student’s submission or an outside source, this will be reported as an academic offence. When asked to hand in code and a problem set or project document, the code you submit must have been used to generate the document. If it was not (i.e., the submitted code does not match the submitted output), this is also considered an academic offence.


I will not tolerate any academic offences. This includes (but is not limited to) plagiarism, cheating, copying R code, communication/extra resources during closed-book assessments, and purchasing labour for assessments (of any kind). Academic offences will be taken very seriously and dealt with accordingly. If you have any questions about what is or is not permitted in this course, please do not hesitate to contact your instructor.

Please consult the University’s site on Academic Integrity: http://academicintegrity.utoronto.ca/. Please also see the definition of plagiarism in section B.I.1.(d) of the University’s Code of Behaviour on Academic Matters: http://www.governingcouncil.utoronto.ca/Assets/Governing+Council+Digital+Assets/Policies/PDF/ppjun011995.pdf. Please read the Code. Please review Cite it Right and, if you require further clarification, consult the site How Not to Plagiarize: http://advice.writing.utoronto.ca/wp-content/uploads/sites/2/how-not-to-plagiarize.pdf.

Note that when an assignment is required to be completed as a team (e.g., project), you may discuss and share answers and code with other members of your team, but not with another team in the class

or anyone outside the course.

Intellectual Property Statement

Course material that has been created by your instructor (i.e. lecture slides, term test questions/solutions and any other course material and resources made available to you on Quercus) is the intellectual property of your instructor and is made available to you for your personal use in this course. Sharing, posting, selling or using this material outside of your personal use in this course is not permitted under any circumstances and is considered an infringement of intellectual property rights.

While recordings of class meetings will be made available to you on the course website, these are intended only for students registered in the course. You are not authorized to copy these materials or distribute them to individuals who are not registered in the course. If you would like to record any

course activities in this course, you MUST ask permission from your instructor in advance. According to

intellectual property laws, not asking permission constitutes stealing.

Accessibility Needs

The University of Toronto is committed to accessibility. If you require accommodations for a disability, or have any accessibility concerns about the course, the classroom, or course materials, please contact Accessibility Services as soon as possible: email [email protected] or visit the website at http://accessibility.utoronto.ca.

If you have an accommodation letter from your accessibility advisor that is relevant to this course,

please do the following:

• Email your letter to [email protected] with “Accommodation letter” as part of the email subject, CC your advisor and let us know anything else you wish us to know/any questions you have. Please do this as soon as possible after you enrol in the course/receive this syllabus.

• Confirm any accommodations for each specific assessment 1 week before the assessment. (I.e., if you receive extra time for timed assessments, confirm this one week prior to the midterm assessment and final assessment, even if we have already discussed this at the beginning of the semester.)

Covid-19

We are in the middle of a pandemic. This term will be a challenging one for you, but also for everyone involved, including your TAs and me, faculty, staff, and of course the other students in your courses.

Nonetheless, I am hoping to make the best of the situation and want to provide you with an opportunity

to get the most out of this course (and overall situation), by providing you with a set of resources and

infrastructure to help your statistical learning and communication. Some degree of flexibility and good

faith is needed from all of us. If you need accommodations, then please be as proactive as possible in

asking for them.

Final Project

Samantha-Jo Caetano

Rough Draft Due Date (2%): Wednesday December 9, 2020 at 11:59pm ET

Peer Review Due Date (3%): Monday December 14, 2020 at 11:59pm ET

Final Report Due Date (25%): Monday December 21, 2020 at 11:59pm ET

This Final Project is to be handed in as a report.

This final project should be completed in an R Markdown file and should be knit to a pdf document. Your submission will have 3 parts: (i) Output/Final Copy of Report; (ii) R Markdown code, .Rmd file; (iii) link to a GitHub repository of your code (this will include your .R scripts for cleaning the data).

Please have all three files available for submission at the due date.

Your Objective

To perform a meaningful statistical analysis on some survey, sample, or observational data.

Note: There is a peer review component to this project.

General Requirements

• As an individual you will select one of the options (a-e), perform the appropriate analysis, and write a report.

• On December 9th you will submit a pdf of a rough draft to be edited by your peers.

• From December 10-14 you will provide feedback on some of your peers’ rough drafts.

• On December 21 you will submit a pdf and Rmd of the final report, as well as a GitHub repo link.

• The final report will be a well written and revised document consisting of the following sections (more details in “Report Details”):

– Title & Authors

– Keywords

– Introduction

– Methodology (Data and Model)

– Results

– Discussion

– References

– Appendix (Optional)

Options

Working individually, please conduct original research that applies statistics to a question involving surveys, sampling or observational data and then write a paper about it. You have various options for topics (pick one):

a. Develop a research question that is of interest to you and obtain or create a relevant dataset. This option involves developing your own research question based on your own interests, background, and expertise. I encourage you to take this option, but please discuss your plans with me. How does one come up with ideas? One way is to be question-driven, where you keep an informal log of small ideas, questions, and puzzles that you have as you’re reading and working. Often, after dwelling on them for a while, you can find some questions of interest. Another way is to be data-driven: try to find some interesting dataset and then work backward. Finally, yet another way is to be methods-driven: let’s say that you happen to understand Gaussian processes, then just apply that expertise to an area. (If you select this option it is recommended that you incorporate some causal inference on observational data into your report.)

b. (Thanks to Jack Bailey for this idea) Build an MRP model based on the CES and a post-stratification dataset that you obtain, to identify how the 2019 Canadian Federal Election would have been different if ‘everyone’ had voted. What do we learn about the importance of turnout based on your model and results?

c. Reproduce a paper. Options include:

- Angelucci, Charles, and Julia Cagé, 2019, ‘Newspapers in times of low advertising revenues’, American Economic Journal: Microeconomics, please see: https://www.openicpsr.org/openicpsr/project/116438/version/V1/view.

- Bailey, Michael A., Daniel J. Hopkins & Todd Rogers, 2016, ‘Unresponsive and Unpersuaded: The Unintended Consequences of a Voter Persuasion Effort’, Political Behavior.

- Clark, Sam, 2019, ‘A General Age-Specific Mortality Model With an Example Indexed by Child Mortality or Both Child and Adult Mortality’, Demography, please see: https://github.com/sinafala/svd-comp.

- Skinner, Ben, 2019, ‘Making the connection: Broadband access and online course enrollment at public open admissions institutions’, Research in Higher Education, please see: https://github.com/btskinner/oa_online_broadband_rep.

- Pons, Vincent, 2018, ‘Will a Five-Minute Discussion Change Your Mind? A Countrywide Experiment on Voter Choice in France’, American Economic Review.

- Valencia Caicedo, Felipe, 2019, ‘The Mission: Human Capital Transmission, Economic Persistence, and Culture in South America’, The Quarterly Journal of Economics, please see: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ML1155.

- If you have a favourite paper then please let me know by the end of Week 12 so that I can check that it’s appropriate.

d. Pretend that you work for Upworthy. Request the Upworthy dataset and then use it to evaluate the result of an A/B test. This request could take a week. Please plan ahead if you choose this option.

e. Critique the following paper: AlShebli, Bedoor, Kinga Makovi & Talal Rahwan, 2020, ‘The association between early career informal mentorship in academic collaborations and junior author performance’, Nature Communications. You should be able to download the data here: https://github.com/bedoor/Mentorship and the paper here: https://www.nature.com/articles/s41467-020-19723-8. For background please see: https://statmodeling.stat.columbia.edu/2020/11/19/are-female-scientists-worse-mentors-this-study-pretends-to-know/ and https://danieleweeks.github.io/Mentorship/#summary.

Process for December 9 11:59pm ET

• As an individual, submit a PDF of your rough draft on Quercus by 11:59pm ET on Wednesday, December 9, 2020.

• At a minimum this must include your title and a fully written Introduction section.

• You will be awarded 2% for completion (out of the total 30% for the Final Project).

• It is recommended that you also include the (partially completed) reference section here, but this is optional.

• You do not need to include your name in the pdf if you prefer to stay anonymous to your peers.

• The point of this is to get feedback on your work (and to make sure you have at least started thinking about this by December 9th) so you are more than welcome to include other sections that you wish to get feedback on.

Disclaimer: There will be no extensions granted for this submission since the following submission is dependent on this date.

Process for December 14 11:59pm ET

• As an individual, on December 10, you will be randomly assigned a handful of rough drafts on which to provide feedback. You have until December 14, 2020 11:59pm ET to provide feedback to your peers.

• If you provide feedback to one peer you will receive 1%; if you provide feedback to two peers you will receive 2%; if you provide feedback to three (or more) peers you will receive the full 3%.

• Providing feedback is obviously subjective, so we have established a set of minimum requirements:

– Your feedback must include at least 5 comments (meaningful/useful bullet points).

– One comment on the appropriateness of the title.

– One comment on the readability of the writing (i.e., address any edits, grammar, typos, etc.)

– One comment on how interesting/compelling the writing and potential analysis are.

– One comment that states whether it is clear which option (a-e) was selected.

– One comment/suggestion on a foreseeable model, weakness, next step, data, reference, etc. (Just give them something useful to work from.)

Disclaimer: There will be no extensions granted for this submission since the following submission is dependent on this date.

Disclaimer: Please remember that you are providing feedback here. All comments should be professional and kind. It is challenging to receive criticism, and arguably more challenging to provide criticism, and even more challenging to give criticism strictly through text. Please remember that your goal here is to help your peers advance their writing/analysis. Any feedback that is inappropriate will receive a 0 on this section.

Process for December 21 11:59pm ET

• As an individual, via Quercus, submit a PDF of your paper. Again, your paper must include a link to the associated GitHub repo.

• Via Quercus you will need to submit the following three items:

– pdf of your final report.

– your .Rmd file.

– a link to a GitHub repository with your materials.

• This submission will be graded based on the rubric posted on Quercus and will be worth 25%.

Final Project – Additional Instructions


General Logistics

Here are the general expectations:

1. Everything is entirely reproducible.

2. Your paper must be written in R Markdown.

3. Your paper must have the following sections: title, date, author, keywords, abstract, introduction, methodology (data and model), results, discussion, appendix (optional, for supporting, but not critical, material), and a reference list.

4. Your paper must be well-written, draw on relevant literature, and show your statistical skills by explaining all statistical concepts that you draw on.

5. The discussion needs to be substantial. For instance, if the paper were 10 pages long then the discussion should be at least 2.5 pages. In the discussion, the paper must include subsections on weaknesses and next steps, but these must be in proportion.

6. The report must provide a link to a GitHub repo that contains everything (apart from raw data that you git ignored if it is not yours to share). The code must be entirely reproducible, documented, and readable. The repo must be well-organised and appropriately use folders and README files.

Report Details

Below are notes about what should be in each main section. I have also included sub-sections that are optional for you to include.

Title & Authors:

- Include an aptly named title for your report.

- Include the authors' names and the date.

Abstract:

(~ 1 paragraph)

- Here you provide a brief summary of the entire report.

- This is generally written as one long paragraph.

Keywords:

(~ 1-2 lines)

- Here you are providing the reader with keywords.

- Usually includes anywhere from 3 to 10 keywords (or “topics”).

- Here is an example (based on the example text in the sections below):

Keywords: Propensity Score, Causal Inference, Observational Study, Lung Cancer, Smoking

Introduction:

(~ 3-4 paragraphs)

- Here you will introduce your problem.

- First Paragraph: In this paragraph you should start off by giving some background/context explaining the global relevance of the problem/data/analysis. For example:

o Statistical analysis is ubiquitous in clinical research. The causal links inferred from clinical studies affect treatment options and screening techniques, and help identify patient risk factors. Observational data is often more feasible, and arguably more reliable, than experimental design data. Thus, having the ability to make causal inference in this setting is key from both an economic and practical perspective. …

- Second Paragraph: In the next paragraph you will get more specific about your particular problem/analysis, so it should be a bit more niche. For example:

o One popular way to make causal inference is through propensity score matching (citation). Propensity score matching was first introduced in 1983 (citation) and has become widely popular in recent years (citation). In this report, I will use propensity score matching to discern if there is a causal link between whether or not a patient smokes and whether or not a patient developed lung cancer. …

- Additional paragraphs will be of a similar nature to paragraph 2, but just introducing other ideas. For instance, maybe I would need to include a paragraph going over smoking and lung cancer (or maybe 2 paragraphs).

- Final Paragraph: In the last paragraph you will let the reader know how you will lay out the rest of the report. For example:

o Two data sets will be used to investigate how propensity score matching could be used to make inference on the causal link between smoking and lung cancer. In the Methodology section (Section 2), I describe the simulation study, the data, and the model that was used to perform the propensity score analysis. Results of the propensity score analysis are provided in the Results section (Section 3), and inferences from this data along with conclusions are presented in the Conclusion section (Section 4).

- General Tip: Try to avoid the first person in this section; using “I” or “my” usually comes across as though you are giving your own personal opinion/motivation. You want this section to come across as scientifically/factually driven, which is usually more compelling to a scientific reader.

Methodology:

(~ 3-4 paragraphs – length will vary depending on selected option)

- This section will vary depending on what option you selected from a, b, c, d, e in the Final Project Instructions document.

Data:

- Here you will describe the data set.

- If it is simulated you should describe how you simulated the data and why you used certain distributions and parameters. For example:

o I simulated $n$ survival times from the exponential distribution with mean $\beta=10$, representing the time from diagnosis until death for each of the $n$ lung cancer patients. The exponential distribution is representative of continuous variables that are positive, making it appropriate to represent survival times. Moreover, the parameter of $\beta=10$ was selected based off previous studies which summarize characteristics regarding lung cancer patients (Include citation).

- If you did not simulate data, and instead are working with real data you should include a “Table 1” which is a Table providing baseline characteristics of the data (usually separated by treatment groups). Your text should go over key components of this table.

- You really want to use this section to set yourself up, so that when you are describing the model and results, it is evident why you selected that model, and why results would include certain variables.
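When describing simulated data, it can help to state the distribution explicitly. As a sketch for the simulated survival-time example above (the symbols $T_i$ and $f$ are notation assumed here for illustration, and the times are assumed to be independent), an exponential distribution with mean $\beta$ can be written as:

$$
T_1, \dots, T_n \sim \mathrm{Exponential}(\beta), \qquad f(t) = \frac{1}{\beta} e^{-t/\beta}, \quad t > 0, \qquad E[T_i] = \beta = 10.
$$

Note that R's `rexp()` is parameterized by the rate rather than the mean, so a mean of $\beta=10$ corresponds to `rate = 1/10`. Stating which parameterization you used makes the simulation easier to reproduce.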

Model:

- Here you will describe the chosen model (e.g., if you decide to perform linear regression you must write out the model and describe the parameters and variables included) and give some justification for why this model was selected.

- This will include some mathematical notation when explicitly stating the model. You should describe the notation used.

- This section really will vary depending on the option selected.
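As a rough illustration of what “writing out the model” might look like, here is a simple linear regression with one predictor. The variables are placeholders assumed for this sketch, not taken from any particular data set:

$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \dots, n
$$

Here $y_i$ is the response for the $i$-th observation, $x_i$ is the predictor, $\beta_0$ is the intercept, $\beta_1$ is the expected change in $y$ for a one-unit increase in $x$, and $\epsilon_i$ is a random error term with variance $\sigma^2$. Your own report would replace $y_i$ and $x_i$ with your actual variables and explain why each is included.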

Results:

(~ 3-4 paragraphs – length will vary depending on selected option)

- Here you should relay any tables and graphs that are a result of some intended statistical analysis. There should be text describing the results of these tables and figures.

- Any additional analysis results should also be described here.

- Be concise in this section. Simply relay the facts (in a digestible way).

- Note about describing the results:

o Do not just write: We calculated $\hat{y}^{PS}$ to be 0.529.

o Do write: We estimate the proportion of voters in favour of voting for <Party Name> to be 0.529. This is based off our post-stratification analysis of the proportion of voters in favour of <Party Name> modelled by a <type of model> model, which accounted for <list of variables in the model>.

o As a minimum, you pretty much only need to include the statement above (in your own words and filled in accordingly) in this section, but you likely will have some other results to relay.
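If your report uses post-stratification, it may also help the reader to see how $\hat{y}^{PS}$ is computed. One common form is sketched below (the symbols $J$, $N_j$ and $\hat{y}_j$ are notation assumed here; use whatever notation you defined in your Model section):

$$
\hat{y}^{PS} = \frac{\sum_{j=1}^{J} N_j \hat{y}_j}{\sum_{j=1}^{J} N_j}
$$

Here $\hat{y}_j$ is the model-based estimate in cell $j$ and $N_j$ is the population size of cell $j$. As a made-up example with two cells, $N_1 = 600$, $\hat{y}_1 = 0.5$, $N_2 = 400$ and $\hat{y}_2 = 0.6$ give $\hat{y}^{PS} = (300 + 240)/1000 = 0.54$.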

Discussion:

Summary: (~ 1 paragraph)

- Summarize what was done earlier.

- The idea is to tie everything together.

Conclusions: (~ 2 paragraphs - length will vary depending on the results/findings)

- Here is where you explain what the results really mean, and identify any relevant findings.

- Make sure to touch on global impacts. For example,

o The propensity score analysis showed that people who smoked were 2.465 (p-value = 0.002) times more likely to develop lung cancer than people who did not smoke. Based off this result it appears as though smoking at least 1 pack of cigarettes per day for a prolonged period of time will raise one’s likelihood of developing lung cancer to approximately 250% of that of a non-smoker (an increase of roughly 150%).
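When translating a ratio into a percentage, be clear about whether you are reporting a percentage of the baseline or a percentage increase. As a quick check using the number from the example above (assuming 2.465 is the ratio of the likelihood for smokers to that for non-smokers):

$$
2.465 \times 100\% = 246.5\% \text{ of the baseline}, \qquad (2.465 - 1) \times 100\% = 146.5\% \text{ increase over the baseline}
$$

Both statements describe the same result, but they are easy to confuse, so state which one you mean.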

Weakness & Next Steps: (~ 2 paragraphs - length will vary depending on the results/findings)

- This sub-section can be split into two, if needed.

- Be careful here, especially if you are simulating data. You can never simulate every single scenario, so you will have some generalizability issues.

- Address weaknesses of the analysis.

- Address future steps of the analysis.

o Hint: a good future step might be to compare with the actual election results and do a post-hoc analysis (or at least a survey) of how to improve estimation in future elections.

References:

- Include a bibliography that credits all ideas that were employed that were not your own.

- Do your best to be as thorough as possible here. It is only right to give credit where credit is due.

- Referencing should be consistent, organized and well formatted.

- This can go beyond the recommended page limit.
