# Data analysis essay/report/assignment/paper writing: Identify the response variable and explanatory variables

### a) Exploratory Data Analysis

1. Identify the response variable and explanatory variables.
• For factors, consider boxplots of the response against each factor, to check for skewness and for differences between levels (Week 8 slides).
• For a factor with 2 levels (Sex), also run a two-sample t-test for significance.
• Describe the plots in terms of skewness (mainly) and also your findings (e.g. whether most of the data points fall in level x).
2. Scatterplots
• Linear relationship? How strong is the linear relationship? Other relationship?
• Any unusual points (outlier or leverage point)? Potential influential points?
• Multicollinearity?
3. Correlation matrix
• Relationship between the response and each explanatory variable?
• Relationship among the predictors, i.e. multicollinearity?
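The three EDA steps above can be sketched in base R as follows. This is only an illustration on simulated data: the data frame `dat` and the variables `y`, `age`, `bmi` and `sex` are hypothetical stand-ins, not the assignment's actual dataset.

```r
# Simulated stand-in data; all variable names here are hypothetical.
set.seed(1)
n <- 120
dat <- data.frame(
  age = round(runif(n, 20, 70)),
  bmi = rnorm(n, 26, 4),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
dat$y <- 50 + 0.8 * dat$age + 1.5 * dat$bmi + 5 * (dat$sex == "M") +
  rnorm(n, sd = 8)

# 1. Boxplot of the response by factor level, plus a two-sample t-test
boxplot(y ~ sex, data = dat, main = "Response by sex")
tt <- t.test(y ~ sex, data = dat)  # Welch two-sample t-test by default
print(tt$p.value)

# 2. Scatterplot matrix of the numeric variables
num_vars <- dat[, c("y", "age", "bmi")]
pairs(num_vars)

# 3. Correlation matrix: response vs predictors, and predictors vs each other
cor_mat <- cor(num_vars)
print(round(cor_mat, 2))
```

Large off-diagonal entries between two predictors in `cor_mat` are a first hint of multicollinearity.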

### b) MLR Model

1. Fit the model
• Treat categorical variables as factors; see the code on the Week 8 slides.
• Consider presenting some features of the model, including the coefficients and R^2.
2. Main Residual Plot
• Interpret it the same way as for SLR.
• Check the assumptions (independence and constant variance).
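A minimal sketch of part b), assuming simulated data with hypothetical variables `y`, `age`, `sex` and `group` (the Week 8 slide code is not reproduced here and may differ):

```r
# Simulated stand-in data; variable names are hypothetical.
set.seed(2)
n <- 150
dat <- data.frame(
  age   = runif(n, 20, 70),
  sex   = sample(c("F", "M"), n, replace = TRUE),
  group = sample(c("a", "b", "c"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
dat$y <- 10 + 0.5 * dat$age + 3 * (dat$sex == "M") +
  2 * (dat$group == "b") + rnorm(n, sd = 4)

# Treat categorical variables as factors before fitting
dat$sex   <- factor(dat$sex)
dat$group <- factor(dat$group)

# Fit the multiple linear regression and report coefficients and R^2
fit <- lm(y ~ age + sex + group, data = dat)
s <- summary(fit)
print(coef(s))
print(s$r.squared)

# Main residual plot: residuals against fitted values
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
```

A patternless band of roughly constant width around the dashed line is what the constant-variance assumption predicts.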

### c) MLR Model Transformation

Basically, repeat part b) three times, each time with a different transformation of the response, and discuss each one in terms of its main residual plot.
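One way to organise the three refits is sketched below. The notes do not say which transformations to use, so the choice of log, square root and reciprocal here is an assumption, as are the simulated data and variable names.

```r
# Simulated positive, right-skewed response; names are hypothetical.
set.seed(3)
n <- 150
dat <- data.frame(
  age = runif(n, 20, 70),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
dat$y <- exp(1 + 0.02 * dat$age + 0.3 * (dat$sex == "M") + rnorm(n, sd = 0.3))

# Three candidate response transformations (assumed, not from the notes)
forms <- list(
  log        = log(y) ~ age + sex,
  sqrt       = sqrt(y) ~ age + sex,
  reciprocal = I(1 / y) ~ age + sex
)
fits <- lapply(forms, function(f) lm(f, data = dat))

# One main residual plot per transformation, side by side
old_par <- par(mfrow = c(1, 3))
for (nm in names(fits)) {
  plot(fitted(fits[[nm]]), resid(fits[[nm]]),
       xlab = "Fitted values", ylab = "Residuals", main = nm)
  abline(h = 0, lty = 2)
}
par(old_par)
```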

### d) MLR Model Box-Cox Transformation

Use the boxcox function in R and repeat the steps from parts b) and c), but this time remember to give a short conclusion about the model you choose. (Box-Cox requires a strictly positive response, so maybe add 0.00001 to all y to avoid 0s?)
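A hedged sketch of the Box-Cox step, using `boxcox` from the MASS package on simulated data. The variable names and the lambda grid are assumptions, and the small shift is applied only if the response contains zeros.

```r
library(MASS)  # provides boxcox()

# Simulated positive response; names are hypothetical.
set.seed(4)
n <- 150
dat <- data.frame(age = runif(n, 20, 70))
dat$y <- exp(1 + 0.02 * dat$age + rnorm(n, sd = 0.3))

# Box-Cox needs y > 0; add a tiny constant only if zeros are present
shift <- if (any(dat$y == 0)) 1e-5 else 0
dat$y_pos <- dat$y + shift

fit <- lm(y_pos ~ age, data = dat)

# Profile log-likelihood over a grid of lambda values (also draws the plot)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05))
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# Refit with the chosen power; lambda = 0 corresponds to the log transform
dat$y_bc <- if (abs(lambda_hat) < 1e-8) {
  log(dat$y_pos)
} else {
  (dat$y_pos^lambda_hat - 1) / lambda_hat
}
fit_bc <- lm(y_bc ~ age, data = dat)
plot(fitted(fit_bc), resid(fit_bc),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

In practice a convenient value inside the plotted confidence interval (such as 0, 0.5 or 1) is often preferred to the exact maximiser, because it is easier to interpret.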

### e) Added Variable Plots

1. Edit and reuse the code and the interpretation from Q2 a) in the sample assignment.
2. Keep in mind that if an added variable plot shows a linear structure, this is evidence that the predictor under investigation should indeed be included in the model. If instead the plot looks like a simple random scatter of points, the predictor is probably not adding any further explanation of the response and can be dropped, as long as the other variables are retained. Dropping a different predictor may make the predictor under study suddenly significant; such situations are discussed in the later subsection on multicollinearity.
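An added variable plot can be built by hand from two sets of residuals, which makes the interpretation above concrete. The sketch below uses simulated data with hypothetical predictors `x1`, `x2`, `x3`; the sample assignment's own code may use a packaged function instead.

```r
# Simulated data: x1 and x2 matter, x3 does not. Names are hypothetical.
set.seed(5)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Added variable plot for x1:
# residuals of y on the other predictors vs residuals of x1 on the others
e_y <- resid(lm(y ~ x2 + x3, data = dat))
e_x <- resid(lm(x1 ~ x2 + x3, data = dat))

av_fit <- lm(e_y ~ e_x)
plot(e_x, e_y,
     xlab = "x1 | others", ylab = "y | others",
     main = "Added variable plot for x1")
abline(av_fit)

# The slope of this line equals the x1 coefficient in the full model
av_slope   <- coef(av_fit)[["e_x"]]
full_slope <- coef(lm(y ~ x1 + x2 + x3, data = dat))[["x1"]]
print(c(av_slope = av_slope, full_slope = full_slope))
```

Repeating the same construction for `x3` should give a plot close to a random scatter, since `x3` has no effect in the simulation.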

### f) Pairwise comparison

1. Edit and use the code provided in Lecture 9a (search for the keyword pairwise if you cannot find it) (note: use 0.05/8 in place of 0.05).
2. From my point of view, a difference is only statistically significant if 0 is NOT within its confidence interval.
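The Lecture 9a code is not shown here, so the following is only a sketch of the idea under stated assumptions: 0.05/8 is read as a Bonferroni-style adjusted significance level, and the comparisons are differences between the levels of a hypothetical factor `group` in a model that also contains `age`.

```r
# Simulated data: level b differs from a; level c does not. Names hypothetical.
set.seed(6)
n <- 200
dat <- data.frame(
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  age   = runif(n, 20, 70)
)
dat$y <- 10 + 0.5 * dat$age + 4 * (dat$group == "b") + rnorm(n, sd = 3)

alpha <- 0.05 / 8  # adjusted level, as suggested in the notes

# Differences from the reference level a
fit_a <- lm(y ~ age + group, data = dat)
ci_a <- confint(fit_a, parm = c("groupb", "groupc"), level = 1 - alpha)

# Difference c - b: refit with b as the reference level
dat_b <- dat
dat_b$group <- relevel(dat_b$group, ref = "b")
fit_b <- lm(y ~ age + group, data = dat_b)
ci_b <- confint(fit_b, parm = "groupc", level = 1 - alpha)

ci <- rbind(ci_a, ci_b)
rownames(ci) <- c("b - a", "c - a", "c - b")

# Significant only if 0 lies outside the interval
significant <- ci[, 1] > 0 | ci[, 2] < 0
print(cbind(ci, significant))
```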

### g) Pairwise comparison

Exactly the same as part f).

### h) ANOVA

1. First Nested F-test (see sample Q1d as example)
• Set up the hypothesis test clearly (that is, identify the null and the alternative).
• Write up the F-statistic computation (MSR/MSE) and its value.
• Read off the p-value; reject the null if it is less than 0.05.
2. Repeat for the following two tests.
3. Discuss which variables' coefficients are actually 0.
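A nested F-test compares a reduced model (the null: the extra coefficients are 0) with a full model. The sketch below, on simulated data with hypothetical variable names, runs the test with `anova()` and also reproduces the F-statistic and p-value by hand so the write-up can show the computation.

```r
# Simulated data in which bmi has a true coefficient of 0. Names hypothetical.
set.seed(7)
n <- 150
dat <- data.frame(
  age = runif(n, 20, 70),
  bmi = rnorm(n, 26, 4),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
dat$y <- 5 + 0.4 * dat$age + 2 * (dat$sex == "M") + rnorm(n, sd = 3)

# H0: the bmi coefficient is 0 (reduced model); H1: it is not (full model)
reduced <- lm(y ~ age + sex, data = dat)
full    <- lm(y ~ age + sex + bmi, data = dat)
tab <- anova(reduced, full)
print(tab)

# The same F-statistic by hand:
# (drop in SSE per extra parameter) / (MSE of the full model)
sse_r <- sum(resid(reduced)^2)
sse_f <- sum(resid(full)^2)
q     <- df.residual(reduced) - df.residual(full)
f_stat <- ((sse_r - sse_f) / q) / (sse_f / df.residual(full))
p_val  <- pf(f_stat, q, df.residual(full), lower.tail = FALSE)
print(c(F = f_stat, p = p_val))
```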

### i) Interaction

1. Add an interaction term between age and each of the other predictors in the model, then fit the MLR model.
2. Read the ANOVA table to see whether the interaction terms should be included in the model.
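As a sketch, interactions between age and every other predictor can be written compactly with `age * (...)`. The data and the predictor names `sex` and `bmi` below are hypothetical.

```r
# Simulated data with a genuine age-by-sex interaction. Names hypothetical.
set.seed(8)
n <- 200
dat <- data.frame(
  age = runif(n, 20, 70),
  bmi = rnorm(n, 26, 4),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
dat$y <- 5 + 0.4 * dat$age + 1.0 * dat$bmi +
  0.2 * dat$age * (dat$sex == "M") + rnorm(n, sd = 3)

# Main-effects model vs model with age interacting with each other predictor
fit_main <- lm(y ~ age + sex + bmi, data = dat)
fit_int  <- lm(y ~ age * (sex + bmi), data = dat)  # adds age:sex and age:bmi

# Sequential ANOVA table: one row per term, interactions listed last
print(anova(fit_int))

# Joint test of all interaction terms at once
print(anova(fit_main, fit_int))
```

The rows for `age:sex` and `age:bmi` in the first table are the ones to read when deciding whether each interaction belongs in the model.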

### j) Model Diagnostic

1. Studentised Residual Plot
• For the code and interpretation, see Sample Q1c.
• The rules for interpretation should be similar to those for the main residual vs fitted plot.
2. Normal Q-Q Plot
• For the code and interpretation, see Sample Q1c again.
• The normal Q-Q plot checks the normality assumption (which is not particularly important, really).
• Be generous: say there is no serious problem if the points roughly follow the reference line.
3. Cook's Distance Plot
• For the code and interpretation, see Sample Q1c again; also refer to the interpretation and code for Assignment 1 if necessary.
• Cook's distance identifies potential influential points.
• It shows a problem if only a few points (roughly 1-4) stand out from the rest in the plot. There is no serious problem if a lot of the points stand out.
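The three diagnostic plots can be produced in base R as sketched below. Sample Q1c is not reproduced here, so the simulated data, the variable names, and the 4/n cut-off used to flag Cook's distances (a common rule of thumb, not something the notes prescribe) are all assumptions.

```r
# Simulated data; names are hypothetical.
set.seed(9)
n <- 100
dat <- data.frame(age = runif(n, 20, 70), bmi = rnorm(n, 26, 4))
dat$y <- 5 + 0.4 * dat$age + 1.0 * dat$bmi + rnorm(n, sd = 3)
fit <- lm(y ~ age + bmi, data = dat)

# 1. Studentised residuals against fitted values
r_stud <- rstudent(fit)
plot(fitted(fit), r_stud,
     xlab = "Fitted values", ylab = "Studentised residuals")
abline(h = c(-2, 0, 2), lty = 2)

# 2. Normal Q-Q plot of the studentised residuals
qqnorm(r_stud)
qqline(r_stud)

# 3. Cook's distance for each observation
cd <- cooks.distance(fit)
plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = 4 / n, lty = 2)   # assumed rule-of-thumb cut-off
flagged <- which(cd > 4 / n)
print(flagged)
```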