Perform a regression analysis

Important notice for users of Office 2003    To continue receiving security updates for Office, make sure you're running Office 2003 Service Pack 3 (SP3). The support for Office 2003 ends April 8, 2014. If you’re running Office 2003 after support ends, to receive all important security updates for Office, you need to upgrade to a later version such as Office 365 or Office 2013. For more information, see Support is ending for Office 2003.

Applies to
Microsoft Office Excel 2003

Book cover


This article is adapted from Microsoft Office Excel 2003 Inside Out by Craig Stinson and Mark Dodge. Visit Microsoft Learning to buy this book.

In this article


Sample files    You can download sample files that relate to excerpts from Microsoft Office Excel 2003 Inside Out from Downloads on Microsoft Office Online. This article uses the files Analysis.xls and Home Price Estimator.xls .

Excel includes several array functions for performing linear regression—LINEST, TREND, FORECAST, SLOPE, and STEYX—and exponential regression—LOGEST and GROWTH. These functions are entered as array formulas and they produce array results. You can use each of these functions with one or several independent variables. The following list provides a definition of the different types of regression:

  • Linear regression produces the slope of a line that best fits a single set of data. Based on a year's worth of sales figures, for example, linear regression can tell you the projected sales for March of the following year by giving you the slope and y-intercept (that is, the point where the line crosses the y-axis) of the line that best fits the sales data. By following the line forward in time, you can estimate future sales, if you can safely assume that growth will remain linear.
  • Exponential regression produces an exponential curve that best fits a set of data that you suspect does not change linearly with time. For example, a series of measurements of population growth will nearly always be better represented by an exponential curve than by a line.
  • Multiple regression is the analysis of more than one set of data, which often produces a more realistic projection. You can perform both linear and exponential multiple regression analyses. For example, suppose you want to project the appropriate price for a house in your area based on square footage, number of bathrooms, lot size, and age. Using a multiple regression formula, you can estimate a price, based on a database of information gathered from existing houses.

Regressing into the future?

The concept of regression might sound strange because the term is normally associated with movement backward, whereas in the world of statistics, regression is often used to predict the future. Simply put, regression is a statistical technique that finds a mathematical expression that best describes a set of data.

Often businesses try to predict the future using sales and percent-of-sales projections based on history. A simple percent-of-sales technique identifies assets and liabilities that vary along with sales, determines the proportion of each, and assigns them percentages. Although using percent-of-sales forecasting is often sufficient for slow or steady short-term growth, the technique loses accuracy as growth accelerates.

Regression analysis uses more sophisticated equations to analyze larger sets of data and translates them into coordinates on a line or curve. In the not-so-distant past, regression analysis was not widely used because of the large volume of calculations involved. Since spreadsheet applications, such as Excel, began offering built-in regression functions, the use of regression analysis has become more widespread.

Calculating linear regression

The equation y = mx + b algebraically describes a straight line for a set of data with one independent variable where x is the independent variable, y is the dependent variable, m represents the slope of the line, and b represents the y-intercept. If a line represents a number of independent variables in a multiple regression analysis to an expected result, the equation of the regression line takes the form y=m1x1+m2x2+...+mnxn+b in which y is the dependent variable, x1 through xn are n independent variables, m1 through mn are the coefficients of each independent variable, and b is a constant.

The LINEST function

The LINEST function uses this more general equation to return the values of m1 through mn and the value of b, given a known set of values for y and a known set of values for each independent variable. This function takes the form LINEST(known_y's, known_x's, const, stats).

The known_y's argument is the set of y-values you already know. This argument can be a single column, a single row, or a rectangular range of cells. If known_y's is a single column, each column in the known_x's argument is considered an independent variable. Similarly, if known_y's is a single row, each row in the known_x's argument is considered an independent variable. If known_y's is a rectangular range, you can use only one independent variable; known_x's in this case should be a range of the same size and shape as known_y's. If you omit the known_x's argument, Excel uses the sequence 1, 2, 3, 4, and so on.

The const and stats arguments are optional. If either is included, it must be a logical constant—either TRUE or FALSE. (You can substitute 1 for TRUE and 0 for FALSE.) The default settings for const and stats are TRUE and FALSE, respectively. If you set const to FALSE, Excel forces b (the last term in the straight-line equation) to be 0. If you set stats to TRUE, the array returned by LINEST includes the following validation statistics.


se1 through sen


Standard error values for each coefficient

Seb Standard error value for the constant b
R2 Coefficient of determination
Sey Standard error value for y
F F statistic
Df Degrees of freedom
Ssreg Regression sum of squares
Ssresid Residual sum of squares

Before creating a formula using LINEST, you must select a range large enough to hold the result array returned by the function. If you omit the stats argument (or set it explicitly to FALSE), the result array encompasses one cell for each of your independent variables and one cell for b. If you include the validation statistics, the result array looks like the following example. After selecting a range to contain the results array, type the function and then press Ctrl+Shift+Enter to enter the function in each cell of the result array.


mn   


mn1   


…   


m2   


m1   


b  

sen sen–1 se2 se1 seb
r2 Sey
F Df
ssreg Ssresid

Note that, with or without validation statistics, the coefficients and standard error values for your independent variables are returned in the opposite order from your input data. For example, if you have four independent variables organized in four columns, LINEST evaluates the leftmost column as x1, but it returns m1 in the fourth column of the result array.

Figure 1 shows a simple example of the use of LINEST with one independent variable.

 Note   The workbook used in this example, Analysis.xls, is included in the sample files download.

The entries in column B of this worksheet represent monthly product demand for a small business. The numbers in column A represent the months in the period. Suppose you want to compute the slope and y-intercept of the regression line that best describes the relationship between the demand and the months. In other words, you want to describe the trend of the data. To do this, select the range F6:G6, type the formula =LINEST(B2:B19, A2:A19), and press Ctrl+Shift+Enter. The resulting number in cell F6 is 20.613, the slope of the regression line; the number in cell G6 is 4002.065, the y-intercept of the line.

The LINEST function computes the slope and y-intercept of a regression line.

Figure 1: The LINEST function computes the slope and y-intercept of a regression line.

 Note   The LINEST and LOGEST functions return only the y-axis coordinates used for calculating lines and curves. The difference between them is that LINEST projects a straight line and LOGEST projects an exponential curve. You must be careful to match the appropriate function to the analysis at hand. The LINEST function might be more appropriate for sales projections, and the LOGEST function might be more appropriate for statistical analyses or population trends.


ShowInside Out     A real (estate) regression application

One often-used regression model is sometimes known as the Competitive Market Analysis (CMA). Realtors use CMAs to arrive at an estimated selling price for a home, based on historical sales data for comparable homes in the area. The following figure shows a sample Excel-based version of this tool, called the Home Price Estimator.

Home Price Estimator Tool

This application uses the LINEST function to analyze the statistics in the Input Data area and generate an array of results based on similar statistics in the What You Want area. The LINEST array is actually located in rows below the visible area of the worksheet, as shown below. The first row of values in the LINEST data array is used by the Estimated Price formula to extrapolate an estimated value.

Home Price Estimator Workbook

Normally in this workbook, row and column headings are hidden, rows 25 through 37 are hidden, worksheet protection is turned on, and cells are locked with entries allowed only in the designated input areas. Real estate tip: As the note on the worksheet mentions, you can use listed home prices to arrive at an estimated price, but actual sale prices are more realistic, if you can get them.

 Note   The workbook used in this example, Home Price Estimator.xls, is included in the sample files download.

The TREND function

LINEST returns a mathematical description of the straight line that best fits known data. TREND finds points that lie along that line and that fall into the unknown category. You can use the numbers returned by TREND to plot a trend line—a straight line that helps make sense of actual data. You can also use TREND to extrapolate, or make intelligent guesses about, future data based on the tendencies exhibited by known data. (Be careful. Although you can use TREND to plot the straight line that best fits the known data, TREND can't tell you if that line is a good predictor of the future. Validation statistics returned by LINEST can help you make that assessment.) The TREND function takes the form =TREND(known_y's, known_x's, new_x's, const).

The first two arguments represent the known values of your dependent and independent variables. As in LINEST, the known_y's argument is a single column, a single row, or a rectangular range. The known_x's argument also follows the pattern described for LINEST. The third and fourth arguments are optional. If you omit new_x's, the TREND function considers new_x's to be identical to known_x's. If you include const, the value of that argument must be TRUE or FALSE (or 1 or 0). If const is TRUE, TREND forces b to be 0.

To calculate the trend-line data points that best fit your known data, simply omit the third and fourth arguments from this function. The results array will be the same size as the known_x's range. In Figure 2, we used TREND to find the value of each point on the regression line that describes the data set from the example in Figure 1. To create these values, we selected the range C2:C19 and entered =TREND(B2:B19, A2:A19) as an array formula using Ctrl+Shift+Enter.

The TREND function creates a data series that can be plotted as a line on a chart.

Figure 2: The TREND function creates a data series that can be plotted as a line on a chart.

To extrapolate from existing data, you must supply a range for new_x's. You can supply as many or as few cells for new_x's as you want. The result array will be the same size as the new_x's range. In Figure 3 we used TREND to calculate demand for the 19th, 20th, and 21st months. To arrive at these values, we entered the numbers 19 through 21 in A21:A23, selected C21:C23, and entered =TREND(B2:B19, A2:A19, A21:A23) as an array formula by pressing Ctrl+Shift+Enter.

TREND can predict the sales figures for months 19, 20, and 21.

Figure 3: TREND can predict the sales figures for months 19, 20, and 21.

The FORECAST function

The FORECAST function is similar to TREND, except that it returns a single point along a line rather than returning an array that defines the line. This function takes the form =FORECAST(x, known_y's, known_x's).

The x argument is the data point for which you want to extrapolate a value. For example, instead of using TREND, we can use the FORECAST function to extrapolate the value in cell C23 in Figure 3 by entering the formula =FORECAST(21, B2:B19, A2:A19) where the x argument refers to the 21st data point on the regression line. You can use this function if you want to calculate any point in the future.

The SLOPE function

The SLOPE function returns the slope of the linear regression line. The slope is defined as the vertical distance divided by the horizontal distance between any two points on the regression line. Its value is the same as the first number in the array returned by the LINEST function. In other words, SLOPE calculates the trajectory of the line used by the FORECAST and TREND functions to calculate the values of data points. The SLOPE function takes the form =SLOPE(known_y's, known_x's).

To find the slope of the regression line that describes the data set from the example shown in Figure 3, we can enter =SLOPE(B2:B19, A2:A19) as an array. This returns a value of 20.613.

The STEYX function

The STEYX function calculates the standard error of a regression, a measure of the amount of error accrued in predicting a y for each given x. This function takes the form =STEYX(known_y's, known_x's). If we apply this function to the worksheet shown in Figure 3, the formula =STEYX(B2:B19, A2:A19) returns a standard error value of 12.96562.

Calculating exponential regression

Unlike linear regression, which plots values along a straight line, exponential regression describes a curve by calculating the array of values needed to plot it. The equation that describes an exponential regression curve is

y = b * m1x1 * m2x2 * … * mnxn

If you have only one independent variable, the equation is

y = b * mx

The LOGEST function

The LOGEST function works like LINEST, except that you use it to analyze data that is nonlinear, and it returns the coordinates of an exponential curve instead of a straight line. LOGEST returns coefficient values for each independent variable plus a value for the constant b. This function takes the form =LOGEST(known_y's, known_x's, const, stats).

LOGEST accepts the same arguments as the LINEST function and returns a result array in the same fashion. If you set the optional stats argument to TRUE, the function also returns validation statistics.

 Note    The LINEST and LOGEST functions return only the y-axis coordinates used for calculating lines and curves. The difference between them is that LINEST projects a straight line and LOGEST projects an exponential curve. You must be careful to match the appropriate function to the analysis at hand. The LINEST function might be more appropriate for sales projections, and the LOGEST function might be more suited to applications, such as statistical analyses or population trends.

The GROWTH function

Where the LOGEST function returns a mathematical description of the exponential regression curve that best fits a set of known data, the GROWTH function finds points that lie along that curve. The GROWTH function works like its linear counterpart, TREND, and takes the form =GROWTH(known_y's, known_x's, new_x's, const).

 
 
Applies to:
Excel 2003