Methods of regression analysis in statistics

Regression analysis

Regression (linear) analysis is a statistical method for studying the influence of one or more independent variables on a dependent variable. The independent variables are also called regressors or predictors, and the dependent variable is called the criterion variable. The terms "dependent" and "independent" reflect only the mathematical dependence between the variables (see spurious correlation), not a cause-and-effect relationship.

Goals of Regression Analysis

  1. Determining the degree to which the variation of the criterion (dependent) variable is determined by the predictors (independent variables)
  2. Predicting the value of a dependent variable using the independent variable(s)
  3. Determining the contribution of individual independent variables to the variation of the dependent variable

Regression analysis cannot be used to determine whether there is a relationship between variables, since the presence of such a relationship is a prerequisite for applying the analysis.

Mathematical Definition of Regression

A regression relationship can be defined rigorously as follows. Let Y, X_1, X_2, ..., X_p be random variables with a given joint probability distribution. If for every set of values X_1 = x_1, ..., X_p = x_p a conditional mathematical expectation is defined,

y(x_1, ..., x_p) = E(Y | X_1 = x_1, ..., X_p = x_p)   (the regression equation in general form),

then the function y(x_1, ..., x_p) is called the regression of Y on X_1, ..., X_p, and its graph is the regression line of Y on X_1, ..., X_p, or the regression equation.

The dependence of Y on X_1, ..., X_p manifests itself in the change of the mean values of Y as X_1, ..., X_p change; for each fixed set of values x_1, ..., x_p, the quantity Y remains a random variable with a certain dispersion.

To clarify how accurately regression analysis estimates the change of Y as X_1, ..., X_p change, the average value of the dispersion of Y over the different sets of values x_1, ..., x_p is used (in effect, this is a measure of the scatter of the dependent variable around the regression line).

Least squares method (calculation of coefficients)

In practice, the regression line is most often sought in the form of a linear function Y = b_0 + b_1 X_1 + b_2 X_2 + ... + b_N X_N (linear regression) that best approximates the desired curve. This is done by the least squares method, which minimizes the sum of squared deviations of the actually observed values Y_k from their estimates Ŷ_k (meaning estimates obtained with a straight line that claims to represent the desired regression relationship):

Σ_{k=1}^{M} (Y_k − Ŷ_k)²  →  min

(M is the sample size). This approach is based on the well-known fact that the sum appearing in the above expression attains its minimum value precisely when Y = y(x_1, ..., x_p).

To solve the regression analysis problem by the least squares method, the concept of the residual function is introduced:

σ(b_0, b_1, ..., b_N) = Σ_{k=1}^{M} ( y_k − (b_0 + b_1 x_{1k} + ... + b_N x_{Nk}) )².

The condition for the minimum of the residual function:

∂σ / ∂b_i = 0,   i = 0, 1, ..., N.

The resulting system is a system of N + 1 linear equations with N + 1 unknowns b_0, b_1, ..., b_N.

If we write the free terms on the left-hand side of the equations as the matrix C and the coefficients of the unknowns on the right-hand side as the matrix A, we obtain the matrix equation C = A·b, which is easily solved by the Gauss method. The resulting matrix b contains the coefficients of the regression line equation.
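For illustration, here is a minimal computational sketch (in Python with NumPy; the data, variable names and noise level are hypothetical assumptions) of forming the normal equations and solving them for the coefficient vector, which is what the matrix equation above amounts to:

```python
import numpy as np

# Hypothetical sample: M = 30 observations with two predictors x1, x2
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=30)
x2 = rng.uniform(0, 5, size=30)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 1, size=30)

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Normal equations: (X^T X) b = X^T y
A = X.T @ X          # matrix of coefficients of the unknowns
c = X.T @ y          # vector of free terms

b = np.linalg.solve(A, c)   # solved by Gaussian elimination internally
print("estimated coefficients b0, b1, b2:", b)
```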

For the estimates to be the best, the preconditions of OLS (the Gauss–Markov conditions) must be satisfied. In the English-language literature such estimates are called BLUE (Best Linear Unbiased Estimators).

Interpretation of Regression Parameters

The parameters b_i are partial correlation coefficients; b_i² is interpreted as the proportion of the variance of Y explained by X_i when the influence of the remaining predictors is held fixed, that is, it measures the individual contribution of X_i to the explanation of Y. In the case of correlated predictors, the problem of indeterminacy of the estimates arises: they become dependent on the order in which the predictors are included in the model. In such cases it is necessary to use correlation analysis and stepwise regression methods.

When speaking of nonlinear regression models, it is important to distinguish whether the nonlinearity is in the independent variables (which, from a formal point of view, is easily reduced to linear regression) or in the estimated parameters (which causes serious computational difficulties). With nonlinearity of the first kind, from a substantive point of view it is important to single out the appearance in the model of terms of the form x_1·x_2 or x_1·x_2·x_3, which indicate the presence of interactions between the features x_1, x_2, etc. (see Multicollinearity).


Links

  • www.kgafk.ru - Lecture on the topic “Regression analysis”
  • www.basegroup.ru - methods for selecting variables in regression models



After correlation analysis has revealed the presence of statistical relationships between variables and assessed their closeness, one usually moves on to a mathematical description of the specific form of dependence using regression analysis. For this purpose, a class of functions relating the resultant indicator y and the arguments x_1, x_2, ..., x_k is selected, the most informative arguments are chosen, estimates of the unknown parameters of the relationship equation are computed, and the properties of the resulting equation are analyzed.

The function f(x_1, x_2, ..., x_k) describing the dependence of the mean value of the resultant characteristic y on given values of the arguments is called the regression function (equation). The term "regression" (from the Latin regressio: retreat, return to something) was introduced by the English psychologist and anthropologist F. Galton and is connected solely with the specifics of one of the first concrete examples in which the concept was used. While processing statistical data in connection with analyzing the heredity of height, Galton found that if fathers deviate from the average height of all fathers by x inches, then their sons deviate from the average height of all sons by less than x inches. The identified tendency was called "regression to the mean". Since then, the term "regression" has been widely used in the statistical literature, although in many cases it does not accurately characterize the concept of statistical dependence.

To describe the regression equation exactly, it is necessary to know the conditional distribution law of the effective indicator y. In statistical practice one usually has to confine oneself to searching for suitable approximations to the unknown true regression function, since the researcher does not have exact knowledge of the conditional probability distribution law of the analyzed resultant indicator y for given values of the argument x.

Let us consider the relationship between the true regression function f(x) = M(y|x), the model (theoretical) regression ŷ, and the regression estimate y*. Let the effective indicator y be related to the argument x by the relation:

y = 2x^1.5 + ε,

where ε is a random variable with a normal distribution law, M(ε) = 0 and D(ε) = σ². The true regression function in this case has the form: f(x) = M(y|x) = 2x^1.5.

Let us assume that we do not know the exact form of the true regression equation and have only nine observations of the two-dimensional random variable related by y_i = 2x_i^1.5 + ε_i, presented in Fig. 1.

Figure 1 - Relative position of the true regression f(x) and the theoretical model regression ŷ

The location of the points in Fig. 1 allows us to confine ourselves to the class of linear dependences of the form ŷ = β_0 + β_1 x. Using the least squares method, we find the estimate of the regression equation y* = b_0 + b_1 x. For comparison, Fig. 1 shows the graphs of the true regression function y = 2x^1.5 and the theoretical approximating regression function ŷ = β_0 + β_1 x.

Since we made a mistake in choosing the class of regression functions, which is quite common in the practice of statistical research, our statistical conclusions and estimates will turn out to be erroneous. No matter how much we increase the number of observations, our sample estimate y* will not come close to the true regression function f(x). Had we chosen the class of regression functions correctly, the inaccuracy in describing f(x) by ŷ could be explained only by the finiteness of the sample.
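The situation can be reproduced with a small simulation sketch (Python/NumPy; the nine x values and the noise level are arbitrary assumptions) in which the data are generated by y = 2x^1.5 + ε but a straight line is deliberately fitted:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 9, 9)                      # nine observation points
y = 2 * x**1.5 + rng.normal(0, 1, size=9)     # y_i = 2 x_i^1.5 + e_i

# Linear fit by least squares, even though the true regression is not linear
b1, b0 = np.polyfit(x, y, deg=1)
print(f"fitted line: y* = {b0:.2f} + {b1:.2f} x")

# The discrepancy with f(x) = 2 x^1.5 does not vanish as the sample grows,
# because the chosen class of functions (straight lines) is wrong.
print("true f(5) =", 2 * 5**1.5, " fitted value at x = 5:", b0 + b1 * 5)
```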

In order to best restore, from the original statistical data, the conditional value of the effective indicator y(x) and the unknown regression function f(x) = M(y/x), the following adequacy criteria (loss functions) are most often used.

Least squares method. According to this criterion, the sum of squared deviations of the observed values of the effective indicator y_i (i = 1, 2, ..., n) from the model values ŷ_i = f(x_i), where x_i is the value of the argument vector in the i-th observation, is minimized: Σ_i (y_i − f(x_i))² → min. The resulting regression is called mean-square regression.

Method of least modules. According to this criterion, the sum of absolute deviations of the observed values of the effective indicator from the model values ŷ_i = f(x_i) is minimized: Σ_i |y_i − f(x_i)| → min. The resulting regression is called mean absolute (median) regression.
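The difference between the two criteria is easy to see in a small sketch (Python with NumPy and SciPy; the data and the outlier are hypothetical) that fits a straight line by numerically minimizing each loss function:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 25)
y = 3 + 0.8 * x + rng.normal(0, 1, size=25)
y[3] += 15  # one large outlier to show the difference between the criteria

# Least squares criterion: sum of squared deviations
ols = minimize(lambda p: np.sum((y - (p[0] + p[1] * x))**2), x0=[0.0, 0.0])

# Least absolute deviations criterion: sum of absolute deviations
lad = minimize(lambda p: np.sum(np.abs(y - (p[0] + p[1] * x))), x0=[0.0, 0.0],
               method="Nelder-Mead")

print("least squares estimate of (b0, b1):", ols.x)
print("least absolute deviations estimate:", lad.x)
```

The mean absolute (median) regression is noticeably less sensitive to the single outlier than the mean-square regression.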

Regression analysis is a method of statistical analysis of the dependence of a random variable y on variables x_j (j = 1, 2, ..., k), which in regression analysis are regarded as non-random variables, regardless of the true distribution law of x_j.

It is usually assumed that the random variable y has a normal distribution law with a conditional mathematical expectation ȳ_x, which is a function of the arguments x_j (j = 1, 2, ..., k), and a constant variance σ² that does not depend on the arguments.

In general, the linear model of regression analysis has the form:

Y = Σ_{j=0}^{k} β_j φ_j(x_1, x_2, ..., x_k) + ε,

where φ_j is some function of its variables x_1, x_2, ..., x_k, and ε is a random variable with zero mathematical expectation and variance σ².

In regression analysis, the type of regression equation is chosen based on the physical nature of the phenomenon being studied and the results of observation.

Estimates of the unknown parameters of the regression equation are usually found using the least squares method. Below we will dwell on this problem in more detail.

Two-dimensional linear regression equation. Let us assume, based on the analysis of the phenomenon under study, that "on average" y is a linear function of x, i.e. there is a regression equation

ȳ = M(y|x) = β_0 + β_1 x,

where M(y|x) is the conditional mathematical expectation of the random variable y for a given x; β_0 and β_1 are unknown parameters of the general population, which must be estimated from the results of sample observations.

Suppose that to estimate the parameters β_0 and β_1, a sample of size n is taken from the two-dimensional general population (x, y), where (x_i, y_i) is the result of the i-th observation (i = 1, 2, ..., n). In this case, the regression analysis model has the form:

y_i = β_0 + β_1 x_i + ε_i,

where ε_i are independent normally distributed random variables with zero mathematical expectation and variance σ², i.e. M(ε_i) = 0;

D(ε_i) = σ² for all i = 1, 2, ..., n.

According to the least squares method, as estimates of the unknown parameters β_0 and β_1 one should take the values of the sample characteristics b_0 and b_1 that minimize the sum of squared deviations of the values of the resultant characteristic y_i from the conditional mathematical expectation ŷ_i:

Σ_i (y_i − ŷ_i)² → min.
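A minimal sketch of these least-squares estimates for the two-dimensional case (Python/NumPy; the sample values are hypothetical) is given below; it uses the standard closed-form solution b_1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², b_0 = ȳ − b_1·x̄:

```python
import numpy as np

def simple_ols(x, y):
    """Least-squares estimates b0, b1 for the model y = b0 + b1*x + e."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Hypothetical sample of size n = 6
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
print(simple_ols(x, y))   # approximately (0.04, 2.00)
```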

We will consider the methodology for determining the influence of marketing characteristics on the profit of an enterprise using the example of seventeen typical enterprises with average sizes and indicators of economic activity.

When solving the problem, the following characteristics were taken into account, identified as the most significant (important) as a result of the questionnaire survey:

* innovation activity of the enterprise;

* planning the range of products produced;

* formation of pricing policy;

* public relations;

* sales system;

* employee incentive system.

Based on a system of comparisons by factors, square adjacency matrices were constructed, in which the values of the relative priorities were calculated for each factor: innovation activity of the enterprise, planning of the product range, formation of pricing policy, advertising, public relations, sales system, employee incentive system.

Estimates of the priorities for the factor "public relations" were obtained as a result of a survey of enterprise specialists. The following notation is used: > (better), ≥ (better or the same), = (the same), ≤ (worse or the same), < (worse).

Next, the problem of a comprehensive assessment of the enterprise’s marketing level was solved. When calculating the indicator, the significance (weight) of the considered partial characteristics was determined and the problem of linear convolution of partial indicators was solved. Data processing was carried out using specially developed programs.

Next, a comprehensive assessment of the enterprise's marketing level is calculated - the marketing coefficient, which is entered in Table 1. In addition, the table includes indicators characterizing the enterprise as a whole. The data in the table will be used to perform regression analysis. The resultant attribute is profit. Along with the marketing coefficient, the following indicators were used as factor characteristics: volume of gross output, cost of fixed assets, number of employees, specialization coefficient.

Table 1 - Initial data for regression analysis


According to the table data and on the basis of factors with the most significant values ​​of correlation coefficients, regression functions of the dependence of profit on factors were constructed.

The regression equation in our case will take the form:

The quantitative influence of the factors discussed above on the amount of profit is indicated by the coefficients of the regression equation. They show how many thousand rubles its value changes when the factor characteristic changes by one unit. As follows from the equation, an increase in the marketing mix coefficient by one unit gives an increase in profit by 1547.7 thousand rubles. This suggests that improving marketing activities has enormous potential for improving the economic performance of enterprises.

When studying marketing effectiveness, the most interesting and most important factor is factor X5 - the marketing coefficient. In accordance with the theory of statistics, the advantage of the existing multiple regression equation is the ability to evaluate the isolated influence of each factor, including the marketing factor.

The results of the regression analysis have a wider application than merely calculating the parameters of the equation. The criterion for classifying enterprises as relatively better or relatively worse is based on the relative indicator of the result, the efficiency coefficient:

Kef_i = Y_fact,i / Y_calc,i,

where Y_fact,i is the actual profit of the i-th enterprise, thousand rubles;

Y_calc,i is the profit of the i-th enterprise obtained by calculation using the regression equation.

In terms of the problem being solved, the value is called the “efficiency coefficient”. The activity of an enterprise can be considered effective in cases where the value of the coefficient is greater than one. This means that the actual profit is greater than the average profit over the sample.
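A short sketch (Python; all profit figures are hypothetical and only illustrate the classification rule) of calculating the efficiency coefficient and classifying enterprises:

```python
# Hypothetical profit figures (thousand rubles): Kef_i = Y_fact_i / Y_calc_i
actual = [1220.0, 980.0, 1510.0, 870.0]       # Y_fact for four enterprises
predicted = [1100.0, 1050.0, 1400.0, 900.0]   # Y_calc from the regression equation

for i, (fact, calc) in enumerate(zip(actual, predicted), start=1):
    kef = fact / calc
    status = "effective" if kef > 1 else "not effective"
    print(f"enterprise {i}: Kef = {kef:.2f} -> {status}")
```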

Actual and calculated profit values are presented in Table 2.

Table 2 - Analysis of the resulting characteristic in the regression model

Analysis of the table shows that in our case, the activities of enterprises 3, 5, 7, 9, 12, 14, 15, 17 for the period under review can be considered successful.

Concept of regression. The dependence between the variables x and y can be described in different ways. In particular, any form of connection can be expressed by a general equation y = f(x), where y is treated as the dependent variable, or function, of another, independent variable x, called the argument. The correspondence between an argument and a function can be specified by a table, a formula, a graph, and so on. A change of the function depending on a change in one or several arguments is called regression. All the means used to describe correlation relationships constitute the content of regression analysis.

To express regression, correlation equations, or regression equations, empirical and theoretically calculated regression series, their graphs, called regression lines, as well as linear and nonlinear regression coefficients are used.

Regression indicators express the correlation relationship bilaterally: they take into account the change in the average values of the characteristic Y as the values x_i of the characteristic X change and, conversely, show the change in the average values of the characteristic X for changed values y_i of the characteristic Y. The exception is time series (dynamics series), which show changes in characteristics over time; the regression of such series is one-sided.

There are many different forms and types of correlation. The task comes down to identifying the form of the connection in each specific case and expressing it with the appropriate correlation equation, which makes it possible to anticipate possible changes in one characteristic Y on the basis of known changes in another characteristic X that is correlated with the first.

12.1 Linear regression

Regression equation. The results of observations carried out on a particular biological object with respect to the correlated characteristics x and y can be represented by points on a plane in a system of rectangular coordinates. The result is a scatter diagram that allows one to judge the form and closeness of the relationship between the varying characteristics. Quite often this relationship looks like a straight line or can be approximated by a straight line.

A linear relationship between the variables x and y is described by a general equation of the form y = a + bx_1 + cx_2 + dx_3 + …, where a, b, c, d, … are parameters of the equation that determine the relationship between the arguments x_1, x_2, x_3, …, x_m and the function y.

In practice, not all possible arguments are taken into account, but only some of them; in the simplest case, only one:

y = a + bx.   (1)

In the linear regression equation (1), a is the free term and the parameter b determines the slope of the regression line relative to the rectangular coordinate axes. In analytical geometry this parameter is called the slope, and in biometrics, the regression coefficient. A visual representation of this parameter and of the position of the regression lines of Y on X and of X on Y in a system of rectangular coordinates is given in Fig. 1.

Fig. 1. Regression lines of Y on X and of X on Y in a system of rectangular coordinates

The regression lines, as shown in Fig. 1, intersect at the point O(x̄, ȳ), corresponding to the arithmetic mean values of the mutually correlated characteristics Y and X. When constructing regression graphs, the values of the independent variable X are plotted along the abscissa axis and the values of the dependent variable, or function, Y along the ordinate axis. The line AB passing through the point O(x̄, ȳ) corresponds to a complete (functional) relationship between the variables Y and X, when the correlation coefficient r = ±1. The stronger the connection between Y and X, the closer the regression lines are to AB; conversely, the weaker the connection between these quantities, the farther the regression lines are from AB. If there is no connection between the characteristics, the regression lines are at right angles to each other and r = 0.

Since regression indicators express the correlation relationship bilaterally, regression equation (1) should be written in two forms:

ŷ = a_yx + b_yx·x   and   x̂ = a_xy + b_xy·y.

The first formula determines the average values of Y when the characteristic X changes by one unit of its measure; the second determines the average values of X when the characteristic Y changes by one unit of its measure.

Regression coefficient. The regression coefficient shows by how much, on average, the value of one characteristic y changes when the measure of another characteristic X, correlated with Y, changes by one unit. This indicator is determined by the formula

b_yx = r_xy · (s_y / s_x).

Here the values of s are multiplied by the size of the class intervals λ if they were found from variation series or correlation tables.

The regression coefficient can be calculated without computing the standard deviations s_y and s_x, using the formula

b_yx = r_xy · sqrt( Σ(y − ȳ)² / Σ(x − x̄)² ).

If the correlation coefficient is unknown, the regression coefficient is determined as follows:

b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)².

Relationship between regression and correlation coefficients. Comparing formulas (11.1) (topic 11) and (12.5), we see that their numerators have the same value, which indicates a connection between these indicators. This relationship is expressed by the equality

r_xy = ±sqrt(b_yx · b_xy).   (6)

Thus, the correlation coefficient is equal to the geometric mean of the coefficients b_yx and b_xy. Formula (6) makes it possible, first, to determine the correlation coefficient r_xy from the known values of the regression coefficients b_yx and b_xy and, second, to check the correctness of the calculation of this correlation indicator r_xy between the varying characteristics X and Y.
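A small numerical check of this relationship (Python/NumPy; the simulated sample is an assumption made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(scale=0.8, size=200)

r = np.corrcoef(x, y)[0, 1]
b_yx = r * y.std() / x.std()     # regression coefficient of y on x
b_xy = r * x.std() / y.std()     # regression coefficient of x on y

# The correlation coefficient equals the geometric mean of the two
# regression coefficients, taken with the common sign of the relationship.
print(r, np.sign(b_yx) * np.sqrt(b_yx * b_xy))
```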

Like the correlation coefficient, the regression coefficient characterizes only a linear relationship and is accompanied by a plus sign for a positive relationship and a minus sign for a negative relationship.

Determination of linear regression parameters. It is known that the sum of squared deviations of the variants x_i from their mean is the smallest value, i.e. Σ(x_i − x̄)² = min. This theorem forms the basis of the least squares method. With respect to linear regression [see formula (1)], the requirement of this theorem is satisfied by a certain system of equations, called the normal equations:

Σy = na + bΣx;

Σxy = aΣx + bΣx².

The joint solution of these equations with respect to the parameters a and b leads to the following results:

b = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²);

a = (Σy − bΣx) / n,

whence a = ȳ − b·x̄ and the empirical regression line is ŷ = a + bx.

Considering the two-way nature of the relationship between the variables Y and X, the formula for determining the parameter a should be written in two forms:

a_yx = ȳ − b_yx·x̄   and   a_xy = x̄ − b_xy·ȳ.   (7)

The parameter b, or regression coefficient, is determined by the following formulas:

b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²   and   b_xy = Σ(x − x̄)(y − ȳ) / Σ(y − ȳ)².

Construction of empirical regression series. When there are a large number of observations, regression analysis begins with the construction of empirical regression series. An empirical regression series is formed by calculating, for the values of one varying characteristic X, the average values of the other characteristic Y correlated with X. In other words, the construction of empirical regression series comes down to finding the group averages ȳ and x̄ from the corresponding values of the characteristics Y and X.

An empirical regression series is a double series of numbers that can be represented by points on a plane, and then, by connecting these points with straight line segments, an empirical regression line can be obtained. Empirical regression series, especially their graphs, called regression lines, give a clear idea of ​​the form and closeness of the correlation between varying characteristics.

Alignment of empirical regression series. Graphs of empirical regression series turn out, as a rule, to be not smooth but broken lines. This is explained by the fact that, along with the main causes that determine the general pattern in the variability of the correlated characteristics, their magnitude is affected by numerous secondary causes that produce random fluctuations of the nodal points of the regression. To identify the main tendency (trend) of the conjugate variation of the correlated characteristics, the broken lines must be replaced by smooth, smoothly running regression lines. The process of replacing broken lines with smooth ones is called alignment of empirical series and of regression lines.

Graphic alignment method. This is the simplest method that does not require computational work. Its essence boils down to the following. The empirical regression series is depicted as a graph in a rectangular coordinate system. Then the midpoints of regression are visually outlined, along which a solid line is drawn using a ruler or pattern. The disadvantage of this method is obvious: it does not exclude the influence of the individual properties of the researcher on the results of alignment of empirical regression lines. Therefore, in cases where higher accuracy is needed when replacing broken regression lines with smooth ones, other methods of aligning empirical series are used.

Moving average method. The essence of this method comes down to the sequential calculation of arithmetic averages from two or three adjacent terms of the empirical series. This method is especially convenient in cases where the empirical series is represented by a large number of terms, so that the loss of two of them - the extreme ones, which is inevitable with this method of alignment, will not noticeably affect its structure.
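A minimal sketch of such smoothing (Python; the empirical series is hypothetical), in which each interior term is replaced by the mean of adjacent terms and the extreme terms are lost, as noted above:

```python
def moving_average(series, window=3):
    """Smooth a series by averaging `window` adjacent terms."""
    out = []
    for i in range(len(series) - window + 1):
        out.append(sum(series[i:i + window]) / window)
    return out

empirical = [3.1, 4.0, 3.6, 5.2, 4.9, 6.1, 5.8, 7.0]
print(moving_average(empirical, window=3))
```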

Least squares method. This method was proposed at the beginning of the 19th century by A. M. Legendre and, independently of him, by C. Gauss. It allows the most accurate alignment of empirical series. This method, as shown above, is based on the proposition that the sum of squared deviations of the variants x_i from their mean is a minimum, i.e. Σ(x_i − x̄)² = min; hence the name of the method, which is used not only in ecology but also in technology. The least squares method is objective and universal; it is used in a wide variety of cases when finding empirical equations for regression series and determining their parameters.

The requirement of the least squares method is that the theoretical points of the regression line ŷ_i must be obtained in such a way that the sum of the squared deviations of the empirical observations y_i from these points is minimal, i.e.

Σ(y_i − ŷ_i)² = min.

By calculating the minimum of this expression in accordance with the principles of mathematical analysis and transforming it in a certain way, one can obtain a system of so-called normal equations, in which the unknown values ​​are the required parameters of the regression equation, and the known coefficients are determined by the empirical values ​​of the characteristics, usually the sums of their values ​​and their cross products.

Multiple linear regression. The relationship between several variables is usually expressed by a multiple regression equation, which can be linear or nonlinear. In its simplest form, multiple regression is expressed as an equation with two independent variables (x, z):

ŷ = a + bx + cz,   (10)

where a is the free term of the equation, and b and c are parameters of the equation. To find the parameters of equation (10) by the least squares method, the following system of normal equations is used:

Σy = na + bΣx + cΣz;

Σxy = aΣx + bΣx² + cΣxz;

Σzy = aΣz + bΣxz + cΣz².
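For illustration, a minimal sketch (Python/NumPy; the simulated data are an assumption) that forms exactly this system of normal equations and solves it for a, b and c:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 40)
z = rng.uniform(0, 10, 40)
y = 5 + 1.5 * x - 0.7 * z + rng.normal(0, 1, 40)

n = len(y)
# Normal equations for y = a + b*x + c*z (the same system as written above)
A = np.array([[n,        x.sum(),      z.sum()],
              [x.sum(),  (x**2).sum(), (x*z).sum()],
              [z.sum(),  (x*z).sum(),  (z**2).sum()]])
rhs = np.array([y.sum(), (x*y).sum(), (z*y).sum()])

a, b, c = np.linalg.solve(A, rhs)
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}")
```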

Dynamic series. Alignment of series. Changes in characteristics over time form so-called time series, or dynamics series. A characteristic feature of such series is that the independent variable X here is always the time factor, and the dependent variable Y is a changing characteristic. Unlike regression series, the relationship between the variables X and Y is one-sided, since the time factor does not depend on the variability of the characteristics. Despite these features, dynamics series can be likened to regression series and processed by the same methods.

Like regression series, empirical dynamics series are affected not only by the main factors but also by numerous secondary (random) factors that obscure the main tendency in the variability of the characteristics, which in the language of statistics is called a trend.

Analysis of time series begins with identifying the shape of the trend. To do this, the time series is depicted as a line graph in a system of rectangular coordinates: time points (years, months and other units of time) are plotted along the abscissa axis, and the values of the dependent variable Y along the ordinate axis. If there is a linear relationship between the variables X and Y (a linear trend), the most appropriate way of aligning the time series by the least squares method is a regression equation in the form of deviations of the terms of the series of the dependent variable Y from the arithmetic mean of the series of the independent variable X:

ŷ_t = ȳ + b(t − t̄),

where b is the linear regression parameter.
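A minimal sketch of such trend alignment (Python/NumPy; the series values are hypothetical):

```python
import numpy as np

y = np.array([10.2, 11.0, 11.9, 12.4, 13.6, 14.1, 15.3, 15.9])  # hypothetical series
t = np.arange(1, len(y) + 1)                                     # time points

# Least-squares slope and aligned (trend) values y_t = y_mean + b*(t - t_mean)
b = np.sum((t - t.mean()) * (y - y.mean())) / np.sum((t - t.mean())**2)
trend = y.mean() + b * (t - t.mean())
print("slope b =", round(b, 3))
print("trend values:", np.round(trend, 2))
```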

Numerical characteristics of dynamics series. The main generalizing numerical characteristics of dynamics series include the geometric mean and the arithmetic mean close to it. They characterize the average rate at which the value of the dependent variable changes over certain periods of time.

An assessment of the variability of the members of a dynamics series is the standard deviation. When choosing regression equations to describe time series, the shape of the trend is taken into account; it can be linear (or reduced to linear) or nonlinear. The correctness of the choice of regression equation is usually judged by the similarity of the empirically observed and calculated values of the dependent variable. A more accurate solution to this problem is the method of analysis of variance of regression (topic 12, paragraph 4).

Correlation of time series. It is often necessary to compare the dynamics of parallel time series related to each other by certain general conditions, for example, to find out the relationship between agricultural production and the growth of livestock numbers over a certain period of time. In such cases, the characteristic of the relationship between variables X and Y is correlation coefficient R xy (in the presence of a linear trend).

It is known that the trend of time series is, as a rule, obscured by fluctuations in the series of the dependent variable Y. This gives rise to a twofold problem: measuring the dependence between compared series, without excluding the trend, and measuring the dependence between neighboring members of the same series, excluding the trend. In the first case, the indicator of the closeness of the connection between the compared time series is correlation coefficient(if the relationship is linear), in the second – autocorrelation coefficient. These indicators have different meanings, although they are calculated using the same formulas (see topic 11).

It is easy to see that the value of the autocorrelation coefficient is affected by the variability of the series members of the dependent variable: the less the series members deviate from the trend, the higher the autocorrelation coefficient, and vice versa.
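A short sketch of computing the lag-one autocorrelation coefficient of a series by the same formula as the ordinary correlation coefficient (Python/NumPy; the series is hypothetical):

```python
import numpy as np

def lag1_autocorrelation(series):
    """Correlation between neighbouring members of the same series (lag 1)."""
    s = np.asarray(series, dtype=float)
    return np.corrcoef(s[:-1], s[1:])[0, 1]

y = [10.2, 11.0, 11.9, 12.4, 13.6, 14.1, 15.3, 15.9]
print(round(lag1_autocorrelation(y), 3))
```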


