
Regression analysis, like most multivariate statistics, allows you to infer that there is a relationship between two or more variables. These relationships are seldom exact because there is variation caused by many variables, not just the variables being studied.

If you say that students who study more make better grades, you are really hypothesizing that there is a positive relationship between one variable, studying, and another variable, grades. You could then complete your inference and test your hypothesis by gathering a sample of (amount studied, grades) data from some students and using regression to see if the relationship in the sample is strong enough to safely infer that there is a relationship in the population. Notice that even if students who study more make better grades, the relationship in the population would not be perfect; the same amount of studying will not result in the same grades for every student (or for one student every time). Some students are taking harder courses, like chemistry or statistics; some are smarter; some study effectively; and some get lucky and find that the professor has asked them exactly what they understood best. For each level of amount studied, there will be a distribution of grades. If there is a relationship between studying and grades, the location of that distribution of grades will change in an orderly manner as you move from lower to higher levels of studying.

Regression analysis is one of the most used and most powerful multivariate statistical techniques, for it allows you to infer the existence and form of a functional relationship in a population. Once you learn how to use regression, you will be able to estimate the parameters — the slope and intercept — of the function that links two or more variables. With that estimated function, you will be able to infer or forecast things like unit costs, interest rates, or sales over a wide range of conditions. Though the simplest regression techniques seem limited in their applications, statisticians have developed a number of variations on regression that greatly expand the usefulness of the technique. In this chapter, the basics will be discussed. Once again, the t-distribution and F-distribution will be used to test hypotheses.

Before starting to learn about regression, go back to algebra and review what a function is. The definition of a function can be formal, like the one in my freshman calculus text: “A function is a set of ordered pairs of numbers (x, y) such that to each value of the first variable (x) there corresponds a unique value of the second variable (y)” (Thomas, 1960). More intuitively, if there is a regular relationship between two variables, there is usually a function that describes the relationship. Functions are written in a number of forms. The most general is y = f(x), which simply says that the value of y depends on the value of x in some regular fashion, though the form of the relationship is not specified. The simplest functional form is the linear function, where:

y = α + βx

α and β are parameters, remaining constant as x and y change. α is the intercept and β is the slope. If the values of α and β are known, you can find the y that goes with any x by putting the x into the equation and solving. There can be functions where one variable depends on the values of two or more other variables, such as y = f(x1, x2), where x1 and x2 together determine the value of y. There can also be non-linear functions, where the value of the dependent variable (y in all of the examples we have used so far) depends on the values of one or more other variables, but the values of the other variables are squared, or taken to some other power or root, or multiplied together, before the value of the dependent variable is determined. Regression allows you to estimate directly the parameters in linear functions only, though there are tricks that allow many non-linear functional forms to be estimated indirectly. Regression also allows you to test to see if there is a functional relationship between the variables, by testing the hypothesis that each of the slopes has a value of zero.
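For example (these particular forms are only illustrations, not drawn from the text), a linear function of two variables might be written [latex]y = \alpha + \beta_1 x_1 + \beta_2 x_2[/latex], while [latex]y = \alpha + \beta x_1 x_2[/latex] is non-linear because the x's are multiplied together. Regression can estimate the parameters of the first form directly; the second would have to be handled by one of the indirect tricks mentioned above.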

First, let us consider the simple case of a two-variable function. You believe that y, the dependent variable, is a linear function of x, the independent variable — y depends on x. Collect a sample of (x, y) pairs, and plot them on a set of x, y axes. The basic idea behind regression is to find the equation of the straight line that comes as close as possible to as many of the points as possible. The parameters of the line drawn through the sample are unbiased estimators of the parameters of the line that would come as close as possible to as many of the points as possible in the population, if the population had been gathered and plotted. In keeping with the convention of using Greek letters for population values and Roman letters for sample values, the line drawn through a population is:

y = α + βx

while the line drawn through a sample is:

y = a + bx

In most cases, even if the whole population had been gathered, the regression line would not go through every point. Most of the phenomena that business researchers deal with are not perfectly deterministic, so no function will perfectly predict or explain every observation.

Imagine that you wanted to study the price of a one-bedroom apartment in Nelson, BC. You decide to estimate the price as a function of its location in relation to downtown. If you collected 12 sample pairs, you would find different apartments located at the same distance from downtown selling for different prices. In other words, there is a distribution of prices for apartments located at each distance from downtown. When you use regression to estimate the parameters of price = f(distance), you are estimating the parameters of the line that connects the mean price at each location. Because the best that can be expected is to predict the mean price for a certain location, researchers often write their regression models with an extra term, the error term, which notes that many of the members of the population of (location, price of apartment) pairs will not have exactly the predicted price because many of the points do not lie directly on the regression line. The error term is usually denoted as ε, or epsilon, and you often see regression equations written:

y = α + βx + ε

Strictly, the distribution of ε at each location must be normal, and the distributions of ε for all the locations must have the same variance (this is known as homoscedasticity to statisticians).

In estimating the unknown parameters of the population regression line, we need a method that minimizes the vertical distances between the yet-to-be-estimated regression line and the observed values in our sample. Each of these distances is called a sample error, though it is more commonly referred to as a residual and denoted by e. In more mathematical form, the residual for each pair of observations on x and y is the difference between the observed y and its predicted value. Obviously, some of these residuals will be positive (above the estimated line) and others will be negative (below the line). If we square each residual, so that positive and negative signs cannot cancel each other out, and then add them over the sample, we can write the following criterion for our minimization problem:

[latex]S = \sum{e^2} = \sum{(y - \hat{y})^2}[/latex]

where ŷ = a + bx is the value of y predicted by the estimated line.

S is the sum of squares of the residuals. By minimizing S over any given set of observations for x and y, we will get the following useful formula:

[latex]b=\frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sum{(x-\bar{x})^2}}[/latex]

After computing the value of b from the above formula using our sample data, along with the means of the two series of data on x and y, one can simply recover the intercept of the estimated line using the following equation:

[latex]a = \bar{y} - b\bar{x}[/latex]

For the sample data, and given the estimated intercept and slope, we can define a residual for each observation as:

[latex]e = y - \hat{y} = y - a - bx[/latex]

Using the estimated values of the intercept and slope, we can draw the estimated line along with all the sample data in a y–x panel. Such graphs are known as scatter diagrams. Consider our analysis of the price of one-bedroom apartments in Nelson, BC. We would collect data for y = price of a one-bedroom apartment, x1 = its distance from downtown, and x2 = the size of the apartment, as shown in Table 8.1.

Table 8.1 Data for Price, Size, and Distance of Apartments in Nelson, BC

y = price of apartments in $1000
x1 = distance of each apartment from downtown in kilometres
x2 = size of the apartment in square feet

y      x1     x2
55     1.5    350
51     3      450
60     1.75   300
75     1      450
55.5   3.1    385
49     1.6    210
65     2.3    380
61.5   2      600
55     4      450
45     5      325
75     0.65   424
65     2      285
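To see how the least-squares formulas above behave with these data, here is a minimal sketch in Python (plain Python, no statistics library; the variable names are ours). It computes b and a directly from the formulas, and the results agree with the Excel estimates reported below (a = 71.84 and b = −5.38):

```python
# Least-squares slope and intercept for price (y, in $1000) on distance (x1, in km),
# using the data from Table 8.1.
y  = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]
x1 = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]

n = len(y)
x_bar = sum(x1) / n
y_bar = sum(y) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x1, y)) / \
    sum((xi - x_bar) ** 2 for xi in x1)
a = y_bar - b * x_bar            # a = y_bar - b * x_bar

print(round(a, 2), round(b, 2))  # about 71.84 and -5.38
```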

The graph [shown in Figure 8.1] is a scatter plot of the prices of the apartments and their distances from downtown, along with a proposed regression line.

Figure 8.1 Scatter Plot of Price, Distance from Downtown, along with a Proposed Regression Line

In order to plot such a scatter diagram, you can use many of the available statistical software packages, including Excel, SAS, and Minitab. In this scatter diagram, a simple regression line with a negative slope has been drawn. The estimated equation for this scatter diagram from Excel is:

predicted price of apartments = 71.84 − 5.38 × distance

where a = 71.84 and b = −5.38. In other words, for every additional kilometre an apartment is located from downtown, its price is estimated to be $5,380 cheaper (5.38 × $1000 = $5,380). One might also be curious about the fitted values from this estimated model. You can simply plug the actual value of x into the estimated line and find the fitted value for the price of each apartment. The residuals for all 12 observations are shown in Figure 8.2.

Figure 8.2

You should also notice that by minimizing the errors, you have not eliminated them; rather, the method of least squares only guarantees the best-fitting estimated regression line for the sample data.

The presence of these remaining errors is a reminder that other factors, not included in our regression model, are responsible for some of the fluctuations in price. By adding such excluded but relevant factors to the model, we would expect the remaining errors to fluctuate less. In determining the price of these apartments, the missing factors may include the age of the apartment, its size, and so on. Because this type of regression model includes only one explanatory factor and assumes a linear relationship, it is known as a simple linear regression model.

Testing your regression: does y really depend on x?

Understanding that there is a distribution of y (apartment price) values at each x (distance) is the key to understanding how regression results from a sample can be used to test the hypothesis that there is (or is not) a relationship between x and y. When you hypothesize that y = f(x), you hypothesize that the slope of the line (β in y = α + βx + ε) is not equal to zero. If β were equal to zero, changes in x would not cause any change in y. Choosing a sample of apartments, and finding each apartment’s price and distance to downtown, gives you a sample of (x, y) pairs. Finding the equation of the line that best fits the sample will give you a sample intercept, a, and a sample slope, b. These sample statistics are unbiased estimators of the population intercept, α, and slope, β. If another sample of the same size is taken, another sample equation could be generated. If many samples are taken, a sampling distribution of sample b’s, the slopes of the sample lines, will be generated. Statisticians know that this sampling distribution of b’s will be normal with a mean equal to β, the population slope. Because the standard deviation of this sampling distribution is seldom known, statisticians developed a method to estimate it from a single sample. With this estimated sb, a t-statistic for each sample can be computed:

[latex]t = \frac{b - \beta}{s_b}[/latex]

where n = sample size

m = number of explanatory (x) variables

b = sample slope

β = population slope

sb = estimated standard deviation of b’s, often called the standard error

These t’s follow the t-distribution in the tables with n − m − 1 df.

Computing sb is tedious, and is almost always left to a computer, especially when there is more than one explanatory variable. The estimate is based on how much the sample points vary from the regression line. If the points in the sample are not very close to the sample regression line, it seems reasonable that the population points are also widely scattered around the population regression line and different samples could easily produce lines with quite varied slopes. Though there are other factors involved, in general when the points in the sample are farther from the regression line, sb is greater. Rather than learn how to compute sb, it is more useful for you to learn how to find it on the regression results that you get from statistical software. It is often called the standard error and there is one for each independent variable. The printout in Figure 8.3 is typical.

Figure 8.3 Typical Statistical Package Output for Linear Simple Regression Model

You will need these standard errors in order to test to see if y depends on x or not. You want to test to see if the slope of the line in the population, β, is equal to zero or not. If the slope equals zero, then changes in x do not result in any change in y. Formally, for each independent variable, you will have a test of the hypotheses:

[latex]H_o: \beta = 0[/latex]

[latex]H_a: \beta \neq 0[/latex]

If the t-score is large (either negative or positive), then the sample b is far from zero (the hypothesized β), and Ha should be accepted. Substitute zero for β into the t-score equation, and if the t-score is small, b is close enough to zero to accept Ho. To find out what t-value separates “close to zero” from “far from zero”, choose an alpha, find the degrees of freedom, and use a t-table from any textbook, or simply use the interactive Excel template from Chapter 3, which is shown again in Figure 8.4.


Figure 8.4 Interactive Excel Template for Determining t-Value from the t-Table – see Appendix 8.

Remember to halve alpha when conducting a two-tail test like this. The degrees of freedom equal n − m − 1, where n is the size of the sample and m is the number of independent x variables. There is a separate hypothesis test for each independent variable. This means you test to see if y is a function of each x separately. You can also test to see if β > 0 (or β < 0) rather than β ≠ 0 by using a one-tail test, or test to see if β equals a particular value by substituting that value for β when computing the sample t-score.
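Although the chapter leaves the computation of sb to statistical software, the following rough sketch shows how the whole test could be carried out in Python (assuming the scipy package is available; neither the package nor these variable names come from the chapter), again using the Table 8.1 apartment data:

```python
from math import sqrt
from scipy import stats

y  = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]   # price in $1000
x1 = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]     # distance in km

n, m = len(y), 1                                 # 12 observations, 1 explanatory variable
x_bar, y_bar = sum(x1) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x1, y)) / \
    sum((xi - x_bar) ** 2 for xi in x1)
a = y_bar - b * x_bar

# Standard error of b, based on how much the sample points vary from the line
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x1, y))
s_b = sqrt((sse / (n - m - 1)) / sum((xi - x_bar) ** 2 for xi in x1))

t = (b - 0) / s_b                                # test Ho: beta = 0
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - m - 1) # two-tail critical value, alpha = .05
print(t, t_crit, abs(t) > t_crit)                # True means the data support Ha
```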

Testing your regression: does this equation really help predict?

To test to see if the regression equation really helps, see how much of the error that would be made using the mean of all of the y’s to predict is eliminated by using the regression equation to predict. By testing to see if the regression helps predict, you are testing to see if there is a functional relationship in the population.

Imagine that you have found the mean price of the apartments in our sample, and for each apartment, you have made the simple prediction that the price of the apartment will be equal to the sample mean, ȳ. This is not a very sophisticated prediction technique, but remember that the sample mean is an unbiased estimator of the population mean, so on average you will be right. For each apartment, you could compute your error by finding the difference between your prediction (the sample mean, ȳ) and the actual price of the apartment.

As an alternative way to predict the price, you can have a computer find the intercept, a, and slope, b, of the sample regression line. Now, you can make another prediction of how much each apartment in the sample may be worth by computing:

[latex]\hat{y} = a + b(\text{distance})[/latex]

Once again, you can find the error made for each apartment by finding the difference between the price predicted using the regression equation, ŷ, and the observed price, y. Finally, find how much using the regression improves your prediction by finding the difference between the price predicted using the mean, ȳ, and the price predicted using regression, ŷ. Notice that the measures of these differences could be positive or negative numbers, but that error or improvement implies a positive distance.

Coefficient of Determination

If you use the sample mean to predict the price of each apartment, your error is (y – ȳ) for each apartment. Squaring each error so that worries about signs are overcome, and then adding the squared errors together, gives you a measure of the total mistake you make if you want to predict y. Your total mistake is Σ(y – ȳ)². The total mistake you make using the regression model would be Σ(y – ŷ)². The difference between the mistakes, a raw measure of how much your prediction has improved, is Σ(ŷ – ȳ)². To make this raw measure of the improvement meaningful, you need to compare it to one of the two measures of the total mistake. This means that there are two measures of “how good” your regression equation is. One compares the improvement to the mistakes still made with regression. The other compares the improvement to the mistakes that would be made if the mean was used to predict. The first is called an F-score because the sampling distribution of these measures follows the F-distribution seen in Chapter 6, “F-test and One-Way ANOVA”. The second is called R², or the coefficient of determination.

All of these mistakes and improvements have names, and talking about them will be easier once you know those names. The total mistake made using the sample mean to predict, Σ(y – ȳ)², is called the sum of squares, total. The total mistake made using the regression, Σ(y – ŷ)², is called the sum of squares, error (residual). The general improvement made by using regression, Σ(ŷ – ȳ)², is called the sum of squares, regression or sum of squares, model. You should be able to see that:

sum of squares, total = sum of squares, regression + sum of squares, error (residual)

[latex]\sum{(y-\bar{y})^2} = \sum{(\hat{y}-\bar{y})^2} + \sum{(y-\hat{y})^2}[/latex]

In other words, the total variation in y can be partitioned into two sources: the explained variation and the unexplained variation. Further, we can rewrite the above equation as:

SST = SSR + SSE

where SST stands for sum of squares due to total variations, SSR measures the sum of squares due to the estimated regression model that is explained by variable x, and SSE measures all the variations due to other factors excluded from the estimated model.

Going back to the idea of goodness of fit, one should be able to easily calculate the percentage of each source of variation with respect to the total variation. In particular, the strength of the estimated regression model can now be measured. Since we are interested in the part of the variation explained by the estimated model, we simply divide both sides of the above equation by SST, and we get:

[latex]1 = \frac{SSR}{SST} + \frac{SSE}{SST}[/latex]

We then solve this equation for the explained proportion, also known as R-square (R²):

[latex]R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}[/latex]

Only in cases where an intercept is included in a simple regression model will the value of R² be bounded between zero and one. The closer R² is to one, the stronger the model is. Alternatively, R² is also found by:

[latex]R^2 = \frac{\sum{(\hat{y}-\bar{y})^2}}{\sum{(y-\bar{y})^2}}[/latex]

This is the ratio of the improvement made using the regression to the mistakes made using the mean. The numerator is the improvement regression makes over using the mean to predict; the denominator is the mistakes (errors) made using the mean. Thus R² simply shows what proportion of the mistakes made using the mean are eliminated by using regression.

In the case of the market for one-bedroom apartments in Nelson, BC, the proportion of the variation in apartment prices explained by the model is estimated to be around 50%. This indicates that only half of the fluctuations in apartment prices around the average price can be explained by the apartments’ distance from downtown. The other 50% is not accounted for (that is, it is unexplained) and is subject to further research. One typical approach is to add more relevant factors to the simple regression model. In this case, the estimated model is referred to as a multiple regression model.
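The sums of squares and R² for the apartment example can be checked with another short sketch (again plain Python, with our own variable names); the roughly 0.50 figure quoted above should emerge:

```python
y  = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]
x1 = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]

n = len(y)
x_bar, y_bar = sum(x1) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x1, y)) / \
    sum((xi - x_bar) ** 2 for xi in x1)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x1]                       # fitted prices

sst = sum((yi - y_bar) ** 2 for yi in y)                # sum of squares, total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # sum of squares, regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # sum of squares, error

print(abs(sst - (ssr + sse)) < 1e-9)   # SST = SSR + SSE
print(ssr / sst)                       # R-square, roughly 0.50 here
```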

While R² is not used to test hypotheses, it has a more intuitive meaning than the F-score. The F-score is the measure usually used in a hypothesis test to see if the regression made a significant improvement over using the mean. It is used because the sampling distribution of F-scores that it follows is printed in the tables at the back of most statistics books, so that it can be used for hypothesis testing. It works no matter how many explanatory variables are used. More formally, consider a population of multivariate observations, (y, x1, x2, …, xm), where there is no linear relationship between y and the x’s, so that y ≠ f(x1, x2, …, xm). If samples of n observations are taken, a regression equation estimated for each sample, and a statistic, F, found for each sample regression, then those F’s will be distributed like those shown in Figure 8.5, the F-table with (m, n − m − 1) df.


Figure 8.5 Interactive Excel Template of an F-Table – see Appendix 8.

The value of F can be calculated as:

[latex]F = \frac{\sum{(\hat{y}-\bar{y})^2}/m}{\sum{(y-\hat{y})^2}/(n-m-1)} = \frac{SSR/m}{SSE/(n-m-1)}[/latex]

where n is the size of the sample, and m is the number of explanatory variables (how many x’s there are in the regression equation).

If Σ(ŷ – ȳ)², the sum of squares regression (the improvement), is large relative to Σ(y – ŷ)², the sum of squares residual (the mistakes still made), then the F-score will be large. In a population where there is no functional relationship between y and the x’s, the regression line will have a slope of zero (it will be flat), and the ŷ’s will be close to ȳ. As a result, very few samples from such populations will have a large sum of squares regression and large F-scores. Because this F-score is distributed like the one in the F-tables, the tables can tell you whether the F-score a sample regression equation produces is large enough to be judged unlikely to occur if y ≠ f(x1, x2, …, xm). The sum of squares regression is divided by the number of explanatory variables to account for the fact that it always increases when more variables are added. You can also look at this as finding the improvement per explanatory variable. The sum of squares residual is divided by a number very close to the number of observations because it always increases if more observations are added. You can also look at this as the approximate mistake per observation.

To test to see if a regression equation was worth estimating, test to see if there seems to be a functional relationship:

[latex]H_o: y \neq f(x_1,x_2,\cdots,x_m)[/latex]

[latex]H_a: y = f(x_1,x_2,\cdots,x_m)[/latex]

This might look like a two-tailed test, since Ha has an equal sign. But, by looking at the equation for the F-score you should be able to see that the data support Ha only if the F-score is large. This is because the data support the existence of a functional relationship if the sum of squares regression is large relative to the sum of squares residual. Since F-tables are usually one-tail tables, choose an α, go to the F-tables for that α and (m, n − m − 1) df, and find the table F. If the computed F is greater than the table F, then the computed F is unlikely to have occurred if Ho is true, and you can safely decide that the data support Ha. There is a functional relationship in the population.
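As a rough sketch of how this test could be run without a printed table (assuming Python with the scipy package, neither of which the chapter itself uses), the F-score and critical value for the simple apartment model could be computed as:

```python
from scipy import stats

y  = [55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65]
x1 = [1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2]

n, m = len(y), 1
x_bar, y_bar = sum(x1) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x1, y)) / \
    sum((xi - x_bar) ** 2 for xi in x1)
a = y_bar - b * x_bar

ssr = sum((a + b * xi - y_bar) ** 2 for xi in x1)              # improvement
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x1, y))   # mistakes still made

F = (ssr / m) / (sse / (n - m - 1))
F_crit = stats.f.ppf(1 - 0.05, dfn=m, dfd=n - m - 1)           # alpha = .05, one tail
print(F, F_crit, F > F_crit)   # a computed F above the table F supports Ha
```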

Now that you have learned all the necessary steps in estimating a simple regression model, you may take some time to re-estimate the Nelson apartment model or any other simple regression model, using the interactive Excel template shown in Figure 8.6. Like all other interactive templates in this textbook, you can change the values in the yellow cells only. The result will be shown automatically within this template. For this template, you can only estimate simple regression models with 30 observations. Use Paste Special > Values when you paste your data from other spreadsheets. The first step is to enter your data under independent and dependent variables. Next, select your alpha level. Check your results in terms of both individual and overall significance. Once the model has passed all these requirements, you can select an appropriate value for the independent variable, which in this example is the distance to downtown, to estimate both the confidence intervals for the average price of such an apartment, and the prediction intervals for the selected distance. Both these intervals are discussed later in this chapter. Remember that by changing any of the values in the yellow areas in this template, all calculations will be updated, including the tests of significance and the values for both confidence and prediction intervals.


Figure 8.6 Interactive Excel Template for Simple Regression – see Appendix 8.

Multiple Regression Analysis

When we add more explanatory variables to our simple regression model to strengthen its ability to explain real-world data, we in fact convert a simple regression model into a multiple regression model. The least squares approach we used in the case of simple regression can still be used for multiple regression analysis.

As per our discussion in the simple regression model section, our low estimated R² indicated that only 50% of the variation in the price of apartments in Nelson, BC, was explained by their distance from downtown. Obviously, there should be more relevant factors that can be added into this model to make it stronger. Let’s add the second explanatory factor to this model. We collected data for the area of each apartment in square feet (i.e., x2). If we go back to Excel and estimate our model including the new added variable, we will see the printout shown in Figure 8.7.

Figure 8.7 Excel Printout

The estimated equation of the regression model is:

predicted price of apartments = 60.041 − 5.393 × distance + 0.03 × area

This is the equation for a plane, the three-dimensional equivalent of a straight line. It is still a linear function because none of the x’s or y is raised to a power or taken to a root, and the x’s are not multiplied together. You can have even more independent variables, and as long as the function is linear, you can estimate the slope, β, for each independent variable.
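For readers who want to reproduce the Excel printout, a minimal sketch (assuming Python with the numpy package, which the chapter itself does not use) estimates the same two-variable model by least squares; the coefficients should come out near the 60.041, −5.393, and 0.03 quoted above:

```python
import numpy as np

y  = np.array([55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65])          # price, $1000
x1 = np.array([1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2])            # distance, km
x2 = np.array([350, 450, 300, 450, 385, 210, 380, 600, 450, 325, 424, 285])  # area, sq ft

# Design matrix with a column of 1s for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

a, b1, b2 = coef
print(a, b1, b2)   # intercept, slope for distance, slope for area
```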

Before using this estimated model for prediction and decision-making purposes, we should test three hypotheses. First, we can use the F-score to test to see if the regression model improves our ability to predict price of apartments. In other words, we test the overall significance of the estimated model. Second and third, we can use the t-scores to test to see if the slopes of distance and area are different from zero. These two t-tests are also known as individual tests of significance.

To conduct the first test, we choose an α = .05. The F-score is the regression or model mean square over the residual or error mean square, so the df for the F-statistic are first the df for the regression model and, second, the df for the error. There are 2 and 9 df for the F-test. According to this F-table, with 2 and 9 df, the critical F-score for α = .05 is 4.26.

The hypotheses are:

H0: price ≠ f(distance, area)

Ha: price = f(distance, area)

Because the F-score from the regression, 6.812, is greater than the critical F-score, 4.26, we decide that the data support Ha and conclude that the model helps us predict the price of apartments. Alternatively, we say there is such a functional relationship in the population.

Now, we move to the individual tests of significance. We can test to see if price depends on distance and on area. There are (n − m − 1) = (12 − 2 − 1) = 9 df. There are two sets of hypotheses, one set for β1, the slope for distance, and one set for β2, the slope for area. For a small town, one may expect that β1, the slope for distance, will be negative, and that β2 will be positive. Therefore, we will use a one-tail test for β1, as well as for β2:

[latex]H_o: \beta_1 \geq 0 \qquad H_a: \beta_1 < 0[/latex]

[latex]H_o: \beta_2 \leq 0 \qquad H_a: \beta_2 > 0[/latex]
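A sketch of how these individual t-scores could be computed (again assuming numpy and scipy, and using the standard least-squares formula for the standard errors; none of this comes from the chapter's Excel printout):

```python
import numpy as np
from scipy import stats

y  = np.array([55, 51, 60, 75, 55.5, 49, 65, 61.5, 55, 45, 75, 65])
x1 = np.array([1.5, 3, 1.75, 1, 3.1, 1.6, 2.3, 2, 4, 5, 0.65, 2])
x2 = np.array([350, 450, 300, 450, 385, 210, 380, 600, 450, 325, 424, 285])

X = np.column_stack([np.ones_like(x1), x1, x2])      # intercept, distance, area
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

n, k = X.shape                                       # k = m + 1 coefficients
resid = y - X @ coef
mse = resid @ resid / (n - k)                        # mean square error, 9 df here
se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))  # standard errors of the coefficients

t_scores = coef / se                                 # one t-score per coefficient
t_crit = stats.t.ppf(1 - 0.05, df=n - k)             # one-tail critical value, alpha = .05
print(t_scores, t_crit)
```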
