STAT 501 – Mid-Term Exam 2 – Spring 2015 – Due April 12
Instructions: Use Word to type your answers within this document. Then, submit your answers in the appropriate dropbox in ANGEL by the due date and within 3 hours of downloading the exam. The point distribution is located next to each question.
- (4x2 = 8 points) State which of the following statements is TRUE and which is FALSE. For the statements that are false, explain why they are false.
- Removing an outlier in a regression analysis will result in narrower confidence intervals.
- In a simple linear regression (SLR) model, if a log transformation is performed on X to remedy some non-linearity, the mean value of Y is bound to change.
- In model selection, the highest adjusted R²-value and the smallest S-value criteria always yield the same "best" models.
- Regression models with different responses, but the same predictor X matrix, will have the same leverage values.
- (3+3+4+4+3+3 = 20 points) Open the “Salary Data.” The dataset consists of current salaries (Salary in thousands of dollars) for 63 individuals with information about their years of work experience (YrsExp) and highest degree attained (Degree). Your goal is to fit a regression model to express the dependence of Y (Salary) on X (YrsExp) and Degree.
- Clearly define a set of indicator variables that could be used in a regression model to represent the qualitative variable Degree. [Hint: Think carefully about the number of indicator variables needed given the number of levels of Degree and use “Bachelor” as the reference level.]
- Write a population multiple linear regression equation for predicting the current salary in terms of YrsExp and Degree. Since education level could impact the dependence of Y on X, the model should contain an interaction effect between YrsExp and Degree, together with their main effects. [Hint: Your equation should include Y, X, the indicator variables you defined in part (a), interaction terms, and population regression coefficients (β’s).]
- Conduct a hypothesis test for whether the average annual salary increase per year of experience differs by level of education (i.e., test if the slopes for two or more Degree categories differ). Write out the null and alternative hypotheses, the test statistic, the p-value, and the conclusion. [Minitab v17: Select Salary as the Response, YrsExp as the Continuous predictor, Degree as the categorical predictor, click “Model,” select both YrsExp and Degree together in the Predictors box and click the Add button next to “Interactions through order 2.” Minitab v16: Create interaction terms using Calc > Calculator before fitting the regression model.]
- Write a new population regression equation based on your conclusion to part (c). Fit this model and conduct two separate hypothesis tests for whether the mean salary for a fixed number of years’ experience differs by education level. For each test, write out the null and alternative hypotheses, the test statistic, the p-value, and the conclusion.
- Based on your conclusion to part (d), write three fitted sample regression equations that can be used to predict the current salary for each education level. [Hint: Your equations should include number values, not β’s.]
- Based on one of the equations from part (e), predict the current salary of a PhD degree holder with 10 years of work experience. [Hint: A point estimate is sufficient so there is no need for an interval.]
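As a rough illustration of the indicator coding described in this question, the sketch below builds one design-matrix row containing the intercept, YrsExp, the indicator variables, and their interactions with YrsExp. The level name "Master" and the helper names `indicator_columns` and `design_row` are assumptions made for illustration; the exam text only names "Bachelor" (the reference level) and "PhD".

```python
# Hypothetical level names: only "Bachelor" and "PhD" appear in the exam text.
LEVELS = ("Bachelor", "Master", "PhD")
REFERENCE = "Bachelor"


def indicator_columns(degree):
    """Return one 0/1 indicator for each non-reference level of Degree."""
    if degree not in LEVELS:
        raise ValueError(f"unknown degree level: {degree!r}")
    return [1 if degree == level else 0 for level in LEVELS if level != REFERENCE]


def design_row(yrs_exp, degree):
    """Intercept, YrsExp, indicators, then YrsExp-by-indicator interactions."""
    indicators = indicator_columns(degree)
    interactions = [yrs_exp * d for d in indicators]
    return [1, yrs_exp] + indicators + interactions


# The reference level has all indicators (and interactions) equal to zero.
print(design_row(10, "Bachelor"))
print(design_row(10, "PhD"))
```

With three levels, two indicators are enough: the reference level is the case where both are zero.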
- (4x2 = 8 points) Consider the following four graphs where the vertical axis represents Y and the horizontal axis represents X.
Choose the most appropriate plot for each of the following models (where D1 and D2 represent a set of indicator variables):
- (5+2+5+3+3 = 18 points) The file “Savings Data” contains savings of 33 individuals along with their age. It is apparent that Y = Savings (in $) has a positive association with X = Age (in years). An appropriate regression model relating Savings to Age could be useful for predicting savings based on age. The most straightforward approach would be to fit a simple linear regression (SLR) model for Y vs X, provided that the LINE assumptions are satisfied. [Consult “Worked Examples Using Minitab” in the Online Notes for help with any Minitab procedures.]
- Fit an SLR model for Y vs X and perform a residual plot analysis to determine if the LINE assumptions are satisfied. Include a numerical test when checking for normality (use the Ryan-Joiner test in Minitab). Discuss your findings and include any relevant graphs.
- Based on your conclusion in part (a), determine if any transformations are suggested for X and/or Y. [Hint: You should find that both X and Y need to be transformed.]
- Fit an SLR model for the transformed variable(s) and comment on this model’s validity with supporting statements, numerical tests and/or plots.
- Use Minitab to compute a 95% confidence interval for the mean amount of savings (in $) expected for 40-year-olds based on the fitted model in part (c). [Hint: Remember to take into account the transformations to X and Y.]
- Use Minitab to compute a 95% prediction interval for the amount of savings (in $) predicted for a randomly selected 40-year-old based on the fitted model in part (c). [Hint: Remember to take into account the transformations to X and Y.]
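The hints about accounting for the transformations can be sketched as follows. This is only an illustration and assumes a natural-log transformation on both Age and Savings, which the exam does not specify; the function names and the interval endpoints are hypothetical and are not Minitab output.

```python
import math


def transformed_x(age):
    """Value to enter for the predictor, assuming the model uses ln(Age)."""
    return math.log(age)


def back_transform_interval(lower_log, upper_log):
    """Map interval endpoints for ln(Savings) back to the dollar scale."""
    return math.exp(lower_log), math.exp(upper_log)


# Hypothetical endpoints on the ln(Savings) scale, for illustration only.
x_new = transformed_x(40)
low, high = back_transform_interval(7.0, 8.0)
print(x_new, low, high)
```

Because exp is increasing, the endpoints keep their order after back-transformation. Note that a back-transformed interval for the mean of ln(Y) is usually read as an interval for the median of Y on the original scale.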
- (2+1+3+2 = 8 points) The following Minitab output resulted from a multiple linear regression model fit to response variable, Y, and predictor terms, X1, X2, and X1X2:
Term Coef SE Coef T-Value P-Value
Constant 4.49 1.89 2.37 0.022
X1 0.759 0.374 2.03 0.048
X2 0.965 0.426 2.26 0.028
X1*X2 0.1742 0.0821 2.12 0.039
- Conduct a hypothesis test for whether the interaction term, X1X2, can be dropped from the model. Write out the population model, null and alternative hypotheses, the test statistic, the p-value, and the conclusion.
- Based on your conclusion to part (a), write the fitted sample regression equation.
- State whether the following statements are supported by the Minitab output (simply write “yes” or “no” for each statement).
- X1 and X2 are positively associated.
- Y and X1 are positively associated for fixed values of X2.
- The linear association between Y and X1 increases as X2 increases.
- Use the fitted equation in part (b) to predict Y for an observation with X1 = 6 and X2 = 5. [Hint: A point estimate is sufficient.]
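The arithmetic behind the coefficient table can be sketched in a few lines. The coefficients and standard errors below are copied from the Minitab output above; the function names are hypothetical, and the prediction assumes the interaction term is kept in the fitted equation.

```python
# (coefficient, standard error) pairs copied from the Minitab output above.
COEF = {
    "Constant": (4.49, 1.89),
    "X1": (0.759, 0.374),
    "X2": (0.965, 0.426),
    "X1*X2": (0.1742, 0.0821),
}


def t_value(term):
    """t-statistic for a term: estimate divided by its standard error."""
    estimate, std_err = COEF[term]
    return estimate / std_err


def slope_x1(x2):
    """Estimated change in Y per unit of X1 at a fixed value of X2."""
    return COEF["X1"][0] + COEF["X1*X2"][0] * x2


def predict(x1, x2):
    """Point prediction from the model with the interaction term retained."""
    b0, b1, b2, b12 = (COEF[k][0] for k in ("Constant", "X1", "X2", "X1*X2"))
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2


print(round(t_value("X1*X2"), 2))
print(round(predict(6, 5), 3))
```

Recomputed t-values can differ from the printed ones in the last digit because the table shows rounded coefficients and standard errors.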
- (6x3 = 18 points) The table below was obtained from the Best Subsets regression procedure for the “Infection Risk Data.”
Response is InfctRsk
Predictor columns (X = included in the model): Stay, Cultures, Xrays, Census, Nurses
Vars R-Sq R-Sq(adj) R-Sq(pred) Mallows Cp S Predictors
1 35.5 34.8 30.2 51.2 1.1351 X
1 34.7 34.0 30.7 53.2 1.1428 X
2 53.0 52.0 48.3 14.0 0.97380 X X
2 46.3 45.1 40.9 29.2 1.0415 X X
3 57.0 55.6 51.4 7.1 0.93657 X X X
3 56.0 54.6 49.5 9.4 0.94740 X X X
4 59.3 57.5 53.1 4.0 0.91622 X X X X
4 58.7 56.9 52.0 5.4 0.92323 X X X X
5 59.3 57.1 51.0 6.0 0.92120 X X X X X
- Based on the criteria listed in the table above select what you believe to be the “Best” model and write down its population regression equation. Support your answer.
- Would you consider this model to yield an unbiased predicted response? Support your answer.
- Name a model in the table that may yield a biased predicted response. Support your answer.
- Calculate SSTO using information in the table.
- Use Minitab’s Backward Elimination procedure on this dataset and write down the fitted sample regression equation for the resulting “best” model. Use αr = 0.15 and the Minitab v17 command sequence: Stat > Regression > Regression > Fit regression model > Stepwise (select Backward Elimination for Method). For Minitab v16 use Stat > Regression > Stepwise.
- State any extra useful information provided by the Backward Elimination output that is not available in the Best Subsets table above.
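Two relationships used when reading a Best Subsets table can be sketched as below: comparing Mallows' Cp with p (the number of parameters, intercept included), and recovering SSTO from S and R-Sq through SSE = S²(n − p) and R² = 1 − SSE/SSTO. The Cp values are copied from the table; the function names are hypothetical, and the sample size n is left as an argument because it is not stated in the table.

```python
# (number of predictors, Mallows Cp) pairs copied from the table above.
ROWS = [(1, 51.2), (1, 53.2), (2, 14.0), (2, 29.2), (3, 7.1),
        (3, 9.4), (4, 4.0), (4, 5.4), (5, 6.0)]


def cp_gap(n_predictors, cp):
    """Cp minus p, where p counts the intercept; a large gap hints at bias."""
    p = n_predictors + 1
    return cp - p


def ssto_from_summary(s, r_sq_percent, n, n_predictors):
    """SSTO from S and R-Sq, using SSE = S^2 (n - p) and R^2 = 1 - SSE/SSTO."""
    p = n_predictors + 1
    sse = s ** 2 * (n - p)
    return sse / (1 - r_sq_percent / 100)


for n_predictors, cp in ROWS:
    print(n_predictors, cp, round(cp_gap(n_predictors, cp), 1))
```

A gap near or below zero is usually read as little evidence of bias, while a large positive gap suggests that important predictors are missing. To use `ssto_from_summary`, supply the sample size from the dataset itself.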
- (4+2+2+2+4+3+3 = 20 points) Open the “Profits Data.” The data indicate a positive linear association between interest rates and broker profits. The data are to be used primarily to obtain a regression model and compute confidence/prediction intervals.
- Fit an SLR model for Y = profits and X = interest rate and create a scatterplot of Y vs X with the fitted regression line added.
- For the model in part (a) discuss whether there are any “extreme X values.” [Hint: Use leverages.]
- For the model in part (a) discuss whether there are any “outliers” (unusual Y values). [Hint: Use internally studentized residuals, which Minitab calls standardized residuals.]
- For the model in part (a) discuss whether there are any “influential data points.” [Hint: Use Cook’s distances.]
- You should have identified one outlier in part (c). Repeat your regression analysis after deleting this outlier. Again create a scatterplot of Y vs X with the fitted regression line added.
- Compare the results of your regression analyses and plots obtained from parts (a) and (e).
- In the context of this problem, comment on any detrimental effects if the outlier were not removed.
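The three diagnostics named in the hints (leverages, internally studentized residuals, and Cook's distances) can be computed by hand for an SLR model, as in the minimal sketch below. The function name `slr_diagnostics` and the data are hypothetical, not the Profits Data, and Minitab remains the intended tool for the exam.

```python
import math


def slr_diagnostics(x, y):
    """Leverage, standardized residual and Cook's distance for each point."""
    n = len(x)
    p = 2  # intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx   # leverage depends on x only
        r = e / math.sqrt(mse * (1 - h))      # internally studentized residual
        d = (r ** 2 / p) * h / (1 - h)        # Cook's distance
        rows.append({"leverage": h, "std_resid": r, "cooks_d": d})
    return rows


# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 15.0, 12.1]
for row in slr_diagnostics(x, y):
    print({k: round(v, 3) for k, v in row.items()})
```

The leverages always sum to p (here 2), which gives a quick check on the calculation.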