Announcements:
Sunday, February 25, 2024, 4:30 pm: Update on the (apparently probable) strike.
At this stage a strike seems so likely that I am planning on the assumption that it will take place.
Therefore, any forthcoming quizzes, if scheduled while the strike is still unresolved, are cancelled and will be rescheduled.
Our class will meet over Zoom at the usual time at https://yorku.zoom.us/j/93916385398
"Where there is no uncertainty, there cannot be truth" – Richard Feynman
How harmful is smoking?: Cigarette consumption and life expectancy in 189 countries in 2004: Correlation is not causation. This is an example of the ecological fallacy, which is itself an example of a deeper phenomenon: Simpson's Paradox.
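A toy simulation of how this can happen (made-up data, not the smoking data): within every group the relationship between x and y is negative, yet the pooled relationship across groups is positive.

```r
# Toy illustration of the ecological fallacy / Simpson's paradox with simulated data
set.seed(1)
group <- rep(1:3, each = 50)
x <- 3 * group + rnorm(150)
y <- 5 * group - x + rnorm(150)
coef(lm(y ~ x))["x"]                               # pooled slope: positive
sapply(split(data.frame(x, y), group),
       function(d) coef(lm(y ~ x, data = d))["x"]) # within-group slopes: near -1
```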
Classes meet on Mondays, Wednesdays and Fridays from 9:30 to 10:20, in ACW 305 on Mondays and Wednesdays and in ACW 204 on Fridays.
Tutorial: On Zoom (click here), every
Tuesday from 1:30 to 2:30 pm. Instructor: Georges Monette
Teaching Assistant: Chenyi Yu
Email: Messages about course content should be posted
publicly to Piazza. You may post messages and questions as personal
messages to the instructors. If they are of general interest and don't
contain personal information, they will usually be made public for the
benefit of the entire class unless you specifically request that the
message remain private.
Due Thursday, January 11, 9 pm
Due Sunday, January 14, 9 pm
Tools | Global Options ... | Terminal. Then click on the box in the Shell paragraph to the right of "New terminals open with:".
Complete the Doodle Poll for a tutorial hour by this evening:
Quiz on Wednesday:
Announcement: Tutorial on Zoom (click here), every Tuesday from 1:30 to 2:30 pm.
Quiz Today
Topic 2 (continuing): Regression Review: Regression in R
Topic 2 (continuing): Regression Review
Topic 2 (continuing): Regression Review
Topic 2 (continuing): Regression Review
Learning R:
Why R? What about SAS, SPSS, Python, among others?
SAS is a very high quality, intensely engineered, environment for statistical analysis. It is widely used by large corporations. New procedures in SAS are developed and thoroughly tested by a team of 1,000 or more SAS engineers before being released. It currently has more than 300 procedures.
R is an environment for statistical programming and development that has accumulated many somewhat inconsistent layers developed over time by people of varying abilities, many of whom work largely independently of each other. There is no centralized quality testing except to check whether code and documentation run before a new package is added to R's main repository, CRAN. When this page was last updated, CRAN had 20,640 packages.
In addition, a large number of packages under development are available through other repositories, such as GitHub.
The development of SAS began in 1966 and that of R (in the form of its predecessor, S, at Bell Labs) in 1976.
The "design" of R (using "R" to refer to both R and to S) owes a lot to the design of Unix. The idea is to create a toolbox of simple tools that you link together yourself to perform an analysis. Unix (now living on mainly as Linux¹) commands were simple tools linked together by pipes, so that the output of one command is the input of the next. To do anything you need to put the commands together yourself.
The same is true of R. It's extremely flexible, but at the cost of requiring you to know what you want to do and to be able to use its many tools in combination with each other to achieve your goal. Many decisions in R's design were intended to make it easy to use interactively. Often the result is a language that is very quirky for programming.
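A toy illustration of that pipe-and-toolbox style in R, using the native |> pipe (R >= 4.1) and R's built-in mtcars data:

```r
# Each step is a small, generic tool; the output of one is the input of the next.
mtcars |>
  subset(cyl == 4) |>                # keep the 4-cylinder cars
  transform(kpl = mpg * 0.4251) |>   # add a km-per-litre variable
  summary()                          # summarize the result
```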
SAS, in contrast, requires you to select options to run large procedures that purport to do the entire job.
This is an old joke: If someone publishes a journal article about a new statistical method, it might be added to SAS in 5 to 10 years. It won't be added to SPSS until 5 years after there's a textbook written about it, maybe another 10 to 15 years after its appearance in SAS.
It was available in R two years before the article was published, because the new method was developed as a package in R long before publication.
So why become a statistician? So you can have the breadth and depth of understanding that someone needs to apply the latest statistical ideas with the intelligence and discernment to use them effectively.
So expect to have a symbiotic relationship with R. You need R to have access to the tools that implement the latest ideas in statistics. R needs you because it takes people like you to use R effectively.
The role of R in this course is to help us
It's very challenging to find a good way to "learn R". It depends on where you are and where you want to go. Now there's a plethora of online courses. See the blog post: The 5 Most Effective Ways to Learn R
In my opinion, ultimately, the best way is to
Using R is like playing the piano. You can read and learn all the theory you want, but ultimately you learn by playing.
Copy the following scripts as files in RStudio:
Play with them line by line.
Post questions arising from these scripts to the "question" folder on Piazza. We will take up some questions in class and others in tutorials scheduled to deal with questions on R.
Let \(N\) be your team member number for the last assignment. Comment on the statements whose number \(Q\) satisfies \(Q \equiv N \pmod{4}\) in this file of statements related to statistics. Be warned that most of these statements are fallacious to some degree. To keep things interesting, there might be one or more correct statements.
Elliptical thinking may help you get insights into the correctness of some statements.
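For instance, a quick way to list which statement numbers fall to you (the team member number and the count of statements below are made up):

```r
N <- 7                  # hypothetical team member number
Q <- 1:40               # hypothetical statement numbers in the file
Q[Q %% 4 == N %% 4]     # statements whose number is congruent to N modulo 4
```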
Due: Monday, February 12
Do questions 6 and 8 in the R script to play with Multilevel Models: Lab 1.R
Upload your work on each question in a separate Piazza post (post it as a Piazza "Note", not as a Piazza "Question") with the title "A4 6" for the first question and "A4 8" for the second.
Do your work in Rmarkdown scripts (either .R or .Rmd files, it's up to you) and post your script, not the pdf output, to Piazza. The script should work when someone else runs it in the current version of R.
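If you use the .R form, here is a minimal sketch of a script that rmarkdown can render directly (the file name, title and package below are placeholders, not part of the assignment):

```r
#' ---
#' title: "A4 6"
#' author: "Your name"
#' ---
#'
#' Lines starting with #' are rendered as text; ordinary lines run as code chunks.
#' Render with rmarkdown::render("A4_6.R") or the Compile Report button in RStudio.

#+ setup, message = FALSE
library(nlme)        # placeholder: load whatever the question actually needs

#' ## Question 6
fit <- lm(dist ~ speed, data = cars)   # placeholder analysis on a built-in data set
summary(fit)
```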
You can get help from and give help to anyone in the class, but please do so in Piazza and you will get credit for it. Even if you get help, your code should be your own. Don't copy code from each other.
"Voluntary assignment 6":
Class links:
Due: Tuesday, March 6
The purpose of this assignment is to give everyone a chance to work individually with the data for the project in preparation for collaboration with your team.
We will study methods to work with this kind of data over the next few weeks.
The data are contained in an Excel file at http://blackwell.math.yorku.ca/private/MATH4939_data/ . Use the userid and password provided in class. Please do not post the userid and password in any public posting, e.g. on Piazza. You may post them in "Private" posts to your team. The data should be treated as confidential and may be used only for the purposes of this course.
The data consist of longitudinal volume measurements for a number of patients being treated after traumatic brain injuries (TBI), e.g. from car accidents or falls, and similar data from a number of control subjects without brain injuries.
The goal of the study is to identify whether some brain structures are subject to a rate of shrinkage after TBI that is greater than the rate normally associated with aging. It is thought that some structures, notably portions of the hippocampus, may be particularly affected after TBI.
You will be able to work with this data set as we explore ways of working with multilevel and longitudinal data. It's a real data set with all the flaws that are typical of real data sets.
A first task will be to turn the data from a "wide file" with one row per subject to a long file with one row per "occasion". Note that variables with names that end in "_1", "_2", etc. are longitudinal variables, i.e. variables measured on different occasions at different points in time.
A good way to transform variable names if they aren't in the right form is to use substitutions with regular expressions.
A good way to transform the data set from "wide" form to "long" form is to use the "tolong" function in the "spida2" package, but you are welcome to use other methods that can be implemented through a script in R, i.e. not manipulating the data itself. You might find section 9, particularly section 9.4, of the following notes helpful:
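As a hedged sketch of both steps (the column names below are made up, not the actual names in the Excel file), using a regular-expression substitution to clean the names and tidyr::pivot_longer as one possible alternative to spida2::tolong:

```r
library(tidyr)

# Made-up wide data in the same spirit as the real file
wide <- data.frame(
  id      = c("1", "c01"),
  date_1  = c("2004-03-01", "2004-02-10"),
  date_2  = c("2004-09-15", "2004-08-20"),
  hpc.l.1 = c(3.1, 3.4),
  hpc.l.2 = c(3.0, 3.4)
)

# Regular-expression clean-up: force the occasion index into a final "_k" suffix
names(wide) <- sub("\\.([0-9]+)$", "_\\1", names(wide))   # "hpc.l.1" -> "hpc.l_1"

# Wide to long: one row per occasion, splitting each name at the final underscore
long <- pivot_longer(
  wide,
  cols = matches("_[0-9]+$"),
  names_to = c(".value", "occasion"),
  names_pattern = "(.*)_([0-9]+)$"
)
long
```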
Most of the variables are measures of the volume of components of the brain:
I hope that you will be curious to know something about these various parts of the brain and that you will exploit the internet to get some information.
The ids are numerical for patients with brain injuries and have the form "c01", "c02", etc. for control subjects. The variable "date_1" contains the date of injury and "date_2", "date_3", etc., the dates on which the corresponding brain scans were performed.
You might like to have a look at fixing dates in Wrangling Messy Spreadsheets into Useable Data for some ideas on using dates to extract, for example, the elapsed time between two dates as a numerical variable.
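For example (the dates here are made up), elapsed time in years between the injury and a scan:

```r
injury <- as.Date("2004-03-01")    # hypothetical date_1 (date of injury)
scan2  <- as.Date("2004-09-15")    # hypothetical date_2 (second scan)
as.numeric(difftime(scan2, injury, units = "days")) / 365.25   # years since injury
```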
Plot some key variables (VBR, CC_TOT, HPC_L_TOT, HPC_R_TOT) against elapsed time since injury, using appropriate plots to convey some idea of the general patterns in the data. Remember that any changes to the data must be done in R; do not edit the Excel file. Comment on what you see. Make a table (using a command in R, of course) showing how many observations are available from each subject. Create a posting entitled "Assignment 6" on Piazza in which you upload your Rmarkdown .R file and the html file it produces. Make it private to the instructor until the deadline.
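One possible sketch of the plotting and counting steps, assuming a long data frame dd with columns id, years (elapsed time since injury) and hpc_l_tot (these names are assumptions, not the actual ones):

```r
library(lattice)

# One line per subject for one of the volume variables
xyplot(hpc_l_tot ~ years, data = dd, groups = id, type = "b",
       xlab = "Years since injury", ylab = "Left hippocampal volume")

# Number of observations available from each subject
with(dd, table(id))
```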
Project:
Continuation of day 19
Class links:
Asking meaningful questions + dealing with heteroskedasticity: pdf / R scripts
Class links:
Sample questions on causality:
Y ~ X
Y ~ X + Z6
Y ~ X + Z1
Y ~ X + Z1 + Z4
Y ~ X + Z1 + Z3
Y ~ X + Z3 + Z6
Y ~ X + Z1 + Z5
Considering the estimate \(\hat{\beta}_X\) of the coefficient of X in each of these models, order the models, to the extent possible from the information in the DAG, according to the expected standard deviation of \(\hat{\beta}_X\). Briefly state the basis for your ordering.
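The DAG itself is in the class notes and is not reproduced here. As a hedged sketch of how such an ordering could be checked by simulation, assume, purely for illustration, that Z1 is a confounder of X and Y and that Z6 is independent noise:

```r
# Simulate a made-up DAG many times and compare the sampling SD of the estimate
# of the coefficient of X under different adjustment sets.
set.seed(4939)
sim_sd <- function(form, n = 200, reps = 2000) {
  est <- replicate(reps, {
    Z1 <- rnorm(n)
    Z6 <- rnorm(n)                       # independent of everything else
    X  <- 0.8 * Z1 + rnorm(n)            # Z1 -> X
    Y  <- X + 0.8 * Z1 + rnorm(n)        # X -> Y and Z1 -> Y; true effect of X is 1
    d  <- data.frame(Y, X, Z1, Z6)
    coef(lm(form, data = d))["X"]
  })
  sd(est)
}
sim_sd(Y ~ X)        # unadjusted (omits the confounder Z1)
sim_sd(Y ~ X + Z1)   # adjusts for the confounder
sim_sd(Y ~ X + Z6)   # adjusts for an irrelevant variable
```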
Class links:
Tutorial: tentative
¹ R is to S as Linux is to Unix as GNU C is to C as GNU C++ is to C++ … S, Unix, C and C++ were created at Bell Labs in the late 60s and the 70s. R, Linux, GNU C and GNU C++ are public license re-engineerings of the proprietary S, Unix, C and C++, respectively.