Dear Customer,
Great questions! Here's how I'd answer them:
(a) How does correlation analysis differ from regression analysis?
Correlation analysis measures the strength and direction of the relationship (if any) between two variables of interest, while regression analysis uses one or more independent variables (the x's) to predict the value of a dependent variable (y) by means of a fitted model.
(b) What does a correlation coefficient reveal?
The correlation coefficient can take values between -1 and 1. The sign signifies whether the relationship is positive or negative and the distance from 0 signifies the strength of the relationship. In particular, the relationship described in this way is a linear relationship.
(c) State the quick rule for a significant correlation and explain its limitations.
This "quick rule" is specific to your class. If you'll tell me the rule, I'll be happy to give you its limitations.
(d) What sums are needed to calculate a correlation coefficient?
You need the sum of the products (xy), the sum of the x's, and the sum of the y's (the last two are for calculating x-bar and y-bar). You also need the sum of the squared x's and the sum of the squared y's, along with the sample size n.
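If it helps to see how those sums fit together, here is a minimal Python sketch of the usual computational formula for r. The function name and the five data points are my own, made up purely for illustration; they are not from your textbook.

```python
import math

def correlation_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson correlation coefficient computed from raw sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical data, chosen only to illustrate the arithmetic.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation_from_sums(
    n=len(xs),
    sum_x=sum(xs),
    sum_y=sum(ys),
    sum_xy=sum(x * y for x, y in zip(xs, ys)),
    sum_x2=sum(x * x for x in xs),
    sum_y2=sum(y * y for y in ys),
)
print(round(r, 4))
```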
(e) What are the two ways of testing a correlation coefficient for significance?
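There is a t-test. The test statistic is t = r/sqrt((1-r^2)/(N-2)), which is compared with a t distribution with N-2 degrees of freedom.
There is also a z-test based on Fisher's transformation. The formula is Z = ln[(1+r)/(1-r)]/2, and Z*sqrt(N-3) is compared with the standard normal distribution.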
There are more than just these two. For example, the quick rule that you mentioned above. There is also an F-test, I believe. You should consult your notes to see which ones are being referred to here.
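Here is a rough Python sketch of the t-test and z-test statistics, in case it helps to see the arithmetic. The function names are my own. For the inputs I borrowed R^2 = 0.202 and n = 35 from problem 12.48 below, using the fact that in a simple regression with a positive slope r is the square root of R^2. The z version multiplies the Fisher-transformed value by sqrt(N-3) so that it can be compared with the standard normal distribution.

```python
import math

def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

def fisher_z_statistic(r, n):
    """Approximately standard normal statistic for H0: rho = 0."""
    z = 0.5 * math.log((1 + r) / (1 - r))  # Fisher's transformation
    return z * math.sqrt(n - 3)

# Values taken from problem 12.48: R^2 = 0.202, n = 35, positive slope.
r_1248 = math.sqrt(0.202)
n_1248 = 35

t_value = t_statistic(r_1248, n_1248)
z_value = fisher_z_statistic(r_1248, n_1248)
print(round(t_value, 3), round(z_value, 3))
```

The t value comes out very close to the slope's t statistic of 2.889 in the 12.48 output, which is expected: in a simple regression, testing the correlation and testing the slope are the same test.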
----
12.48 In the following regression, X = weekly pay, Y = income tax withheld, and n = 35 McDonald's employees.
R2 0.202
Std. Error 6.816
n 35
ANOVA table
Source SS df MS F p-value
Regression 387.6959 1 387.6959 8.35 .0068
Residual 1,533.0614 33 46.4564
Total 1,920.7573 34
Regression output confidence interval
variables coefficients std. error t (df =33) p-value 95% lower 95% upper
Intercept 30.7963 6.4078 4.806 .0000 17.7595 43.8331
Slope 0.0343 0.0119 2.889 .0068 0.0101 0.0584
(a) Write the fitted regression equation.
y-hat=30.7963+0.0343x
(b) State the degrees of freedom for a two-tailed test for zero slope, and use Appendix D to find the critical value at α = .05.
There are 33 degrees of freedom (n - 2 = 35 - 2, as shown in the output).
The critical value is 2.035
(c) What is your conclusion about the slope?
The t statistic (2.889) exceeds the critical value (2.035), so the slope is significantly different from 0.
(d) Interpret the 95 percent confidence limits for the slope.
We are 95% confident that the true slope falls between 0.0101 and 0.0584. Because this interval does not contain 0, it agrees with the conclusion in (c).
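If you want to see where those limits come from, they are the slope estimate plus or minus the critical t value times the slope's standard error. Here is a small Python sketch using the numbers from the output and the critical value from part (b). The function name is my own, and because the inputs are rounded, the result can differ from the printed limits in the last decimal place.

```python
def slope_confidence_interval(slope, std_error, t_critical):
    """Confidence limits: slope +/- t_critical * std_error."""
    margin = t_critical * std_error
    return slope - margin, slope + margin

# Slope and standard error from the regression output; 2.035 from part (b).
lower, upper = slope_confidence_interval(
    slope=0.0343, std_error=0.0119, t_critical=2.035
)
print(f"{lower:.4f} to {upper:.4f}")
```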
(e) Verify that F = t^2 for the slope.
t = 2.889 and t^2 = 8.346321, which matches the F value of 8.35 shown in the output after rounding. Verified.
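You can also check this from the ANOVA table itself. The sketch below (variable names are my own) gets the regression sum of squares as total minus residual, forms the two mean squares, and compares their ratio with t^2. The small gap between the two is just rounding in the printed t value.

```python
# Values from the ANOVA table and regression output for problem 12.48.
ss_residual = 1533.0614
ss_total = 1920.7573
df_regression = 1
df_residual = 33
t_slope = 2.889

# Regression SS is whatever part of the total the residual does not account for.
ss_regression = ss_total - ss_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_statistic = ms_regression / ms_residual
print(round(f_statistic, 3), round(t_slope ** 2, 3))
```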
(f) In your own words, describe the fit of this regression.
While using x as a predictor variable only explains 20.2% of the variability in y, the slope is significantly different from 0, so the model is informative.
-----
12.50 In the following regression, X = total assets ($ billions), Y = total revenue ($ billions), and n = 64 large banks. (a) Write the fitted regression equation. (b) State the degrees of freedom for a two-tailed test for zero slope, and find the critical value at α = .05. (c) What is your conclusion about the slope? (d) Interpret the 95 percent confidence limits for the slope. (e) Verify that F = t^2 for the slope. (f) In your own words, describe the fit of this regression.
R2 0.519
Std. Error 6.977
n 64
ANOVA table
Source SS df MS F p-value
Regression 3,260.0981 1 3,260.0981 66.97 1.90E-11
Residual 3,018.3339 62 48.6828
Total 6,278.4320 63
Regression output confidence interval
variables coefficients std. error t (df =62) p-value 95% lower 95% upper
Intercept 6.5763 1.9254 3.416 .0011 2.7275 10.4252
X1 0.0452 0.0055 8.183 1.90E-11 0.0342 0.0563
(a) Write the fitted regression equation.
y-hat = 6.5763 + 0.0452x
(b) State the degrees of freedom for a two-tailed test for zero slope, and find the critical value at α = .05.
The output shows that there are 62 degrees of freedom.
The critical value is found to be 1.999.
(c) What is your conclusion about the slope?
The t-statistic is 8.183. That greatly exceeds the critical value of 1.999, so the slope is significantly different from 0.
(d) Interpret the 95 percent confidence limits for the slope.
We are 95% confident that the true slope falls between 0.0342 and 0.0563.
(e) Verify that F = t^2 for the slope.
t = 8.183 and t^2 = 66.961489, which matches the F value of 66.97 shown in the output up to rounding. Verified.
(f) In your own words, describe the fit of this regression.
Using X1 as a predictor variable accounts for 51.9% of the variability in y. In addition, the slope of the line is highly significantly different from 0. This model is informative. X1 is a useful predictor of y.
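As a quick sanity check on the fit statistics, both R^2 and the standard error can be recovered from the ANOVA table: R^2 is the regression sum of squares divided by the total, and the standard error is the square root of the residual mean square. A short Python sketch (variable names are my own):

```python
import math

# Values from the ANOVA table for problem 12.50.
ss_regression = 3260.0981
ss_residual = 3018.3339
ss_total = 6278.4320
df_residual = 62

r_squared = ss_regression / ss_total               # share of variation explained
std_error = math.sqrt(ss_residual / df_residual)   # square root of residual MS

print(round(r_squared, 3), round(std_error, 3))
```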
-----
13.30 A researcher used stepwise regression to create regression models to predict BirthRate (births per 1,000) using five predictors: LifeExp (life expectancy in years), InfMort (infant mortality rate), Density (population density per square kilometer), GDPCap (Gross Domestic Product per capita), and Literate (literacy percent). Interpret these results. BirthRates2
Regression Analysis—Stepwise Selection (best model of each size)
153 observations
BirthRate is the dependent variable
p-values for the coefficients
Nvar LifeExp InfMort Density GDPCap Literate s Adj R2 R2
1 .0000 6.318 .722 .724
2 .0000 .0000 5.334 .802 .805
3 .0000 .0242 .0000 5.261 .807 .811
4 .5764 .0000 .0311 .0000 5.273 .806 .812
5 .5937 .0000 .6289 .0440 .0000 5.287 .805 .812
We see that as each variable is added, the R^2 value increases or holds steady. This is always the case: adding variables cannot cause the R^2 value to decrease. However, after the addition of infant mortality, the R^2 value increases by only tiny amounts. Hence, it may be that even though these variables are significant, they do not add much explanatory power to the model. This is perhaps because they are correlated with the other predictor variables (i.e., they contain much of the same information). LifeExp and InfMort are both highly significant until the variables GDPCap and Literate are added.
If I were going to continue analyzing this data, I would remove LifeExp and Density from the model to see how that affects the R^2 value and the p-values.
If I had to choose a model to go with based solely on this output I would choose LifeExp and InfMort to be the predictor variables.
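Since R^2 can never go down when a variable is added, the adjusted R^2 column is the fairer one to compare across model sizes. Here is a small Python sketch of the standard adjustment, 1 - (1 - R^2)(n - 1)/(n - k - 1), applied to the R^2 values in the table with n = 153. The function and variable names are my own, and because the table's R^2 values are rounded to three decimals, the results can differ from the printed Adj R2 column by about .001.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

n_obs = 153

# k: (R2, Adj R2) as printed in the stepwise output.
reported = {
    1: (0.724, 0.722),
    2: (0.805, 0.802),
    3: (0.811, 0.807),
    4: (0.812, 0.806),
    5: (0.812, 0.805),
}

computed = {
    k: adjusted_r_squared(r2, n_obs, k) for k, (r2, _) in reported.items()
}
for k, value in computed.items():
    print(k, round(value, 3), "reported:", reported[k][1])
```

Notice that the adjusted value peaks at three predictors and then slips, which supports the point that the later variables add little.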
-----
13.32 An expert witness in a case of alleged racial discrimination in a state university school of nursing introduced a regression of the determinants of Salary of each professor for each year during an 8-year period (n = 423) with the following results, with dependent variable Salary and predictors Year (year in which the salary was observed), YearHire (year when the individual was hired), Race (1 if individual is black, 0 otherwise), and Rank (1 if individual is an assistant professor, 0 otherwise). Interpret these results.
Variable Coefficient t p
Intercept -3,816,521 -29.4 .000
Year 1,948 29.8 .000
YearHire -826 -5.5 .000
Race -2,093 -4.3 .000
Rank -6,438 -22.3 .000
R2 = 0.811 R2 adj = 0.809 s = 3,318
These results show a highly predictive model. Using these variables as predictors explains 81.1% of the variability in salary. Every one of the variables is highly significant in the presence of the other variables in the model, though Race and YearHire have the smallest t statistics (in absolute value) of the four. It would be interesting to see how the model changes if the Race variable, for instance, were removed.
-----
14.16 (a) Plot the data on U.S. general aviation shipments.

(b) Describe the pattern and discuss possible causes.
The Vietnam War took place during the 1960s and 1970s. Perhaps the planes built during this time were war planes.
(c) Would a fitted trend be helpful? Explain.
No. There does not seem to be a trend here. Definitely, there is not a linear trend. A line would gloss over the important features of the data. Furthermore, applying a regression model to this data would violate the assumptions of the regression model (i.e., normality, independence, equal variance).
(d) Make a similar graph for 1992–2003 only. Would a fitted trend be helpful in making a prediction for 2004?

No. It does not seem that a trendline would be good here either. There is still a "wave" to the data. This may represent the business cycle, the recession of the 1990s, or some other cyclical pattern. Forcing a straight line through this data would not be appropriate because the relationship is more complex.
(e) Fit a trend model of your choice to the 1992–2003 data.
Nonetheless, here is the plot with a regression line applied:

(f) Make a forecast for 2004, using either the fitted trend model or a judgment forecast.
We plug 2004 into the equation and get 182.21*2004-362294=2,854.84. We can see, though, that the data do not follow this straight line near the end of the x-range. For that reason, a visual (judgment) forecast is more appropriate. For 2004, I predict 2100 planes based on the apparent pattern in the data.
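For reference, here is the trend-line arithmetic as a tiny Python sketch. The slope and intercept are the ones from the fitted line quoted above; the function name is my own.

```python
def linear_trend_forecast(year, slope=182.21, intercept=-362294):
    """Forecast from the fitted linear trend: slope * year + intercept."""
    return slope * year + intercept

forecast_2004 = linear_trend_forecast(2004)
print(round(forecast_2004, 2))
```

This reproduces the straight-line figure only. It says nothing about whether a straight line is the right shape for this data, which is why I prefer the judgment forecast here.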
Why is it best to ignore earlier years in this data set?
There were so many drastic changes in the aviation industry over this time period that to force a single regression model to "fit" over such a long time period would be asking too much. It is better, as we have done in this case, to separate the time period into logical pieces that can be analyzed together in a more appropriate fashion.
-----
Let me know if you have questions about this. I'm happy to clarify any details.