Regression analysis is a statistical method for studying the dependence of a random variable on one or more other variables.

Regression analysis is a method of establishing an analytical expression for a stochastic relationship between the studied features. The regression equation shows how y changes, on average, when any of the xi change, and has the form:

y = f(x1, x2, …, xn),
where y is the dependent variable (there is always exactly one);

xi are the independent variables (factors); there may be several of them.

If there is only one independent variable, this is simple regression analysis. If there are several (n ≥ 2), such an analysis is called multivariate.

In the course of regression analysis, two main tasks are solved:

    construction of the regression equation, i.e. finding the type of relationship between the resulting indicator and the independent factors x1, x2, …, xn;

    assessment of the significance of the resulting equation, i.e. determining how well the selected factor features explain the variation of the feature y.

Regression analysis is used mainly for planning, as well as for the development of a regulatory framework.

Unlike correlation analysis, which only answers whether a relationship exists between the analyzed features, regression analysis also gives that relationship a formalized expression. In addition, while correlation analysis studies any relationship between factors, regression analysis studies one-sided dependence, i.e. a relationship showing how a change in the factor features affects the resulting feature.

Regression analysis is one of the most developed methods of mathematical statistics. Strictly speaking, its implementation requires the fulfillment of a number of special conditions (in particular, x1, x2, …, xn and y must be independent, normally distributed random variables with constant variances). In real life, strict compliance with the requirements of regression and correlation analysis is very rare, but both methods are very common in economic research. Dependencies in the economy can be not only direct but also inverse and non-linear. A regression model can be built in the presence of any dependence; however, in multivariate analysis only linear models of the form

y = a + b1·x1 + b2·x2 + … + bn·xn

are used.
The regression equation is constructed, as a rule, by the least squares method, the essence of which is to minimize the sum of squared deviations of the actual values of the resulting feature from its calculated values, i.e.:

S = Σ (yj - ŷj)² → min,

where T is the number of observations (j = 1, …, T);

ŷj = a + b1·x1j + b2·x2j + … + bn·xnj is the calculated value of the resulting feature.

It is recommended to determine the regression coefficients using analytical software packages for a personal computer or a special financial calculator. In the simplest case, the coefficients of the one-factor linear regression equation y = a + bx can be found from the formulas:

b = (n·Σxy - Σx·Σy) / (n·Σx² - (Σx)²),   a = (Σy - b·Σx) / n.
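As an illustration, here is a minimal sketch in Python (numpy only; the function name and variable names are mine, not from the source) that evaluates these formulas:

```python
import numpy as np

def fit_simple_ols(x, y):
    """Coefficients a, b of y = a + b*x from the closed-form OLS formulas."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - (Sum(x))^2)
    b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
    a = (y.sum() - b * x.sum()) / n   # a = (Sum(y) - b*Sum(x)) / n
    return a, b
```

The result can be cross-checked against np.polyfit(x, y, 1), which returns the same pair of coefficients (slope first).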

Cluster analysis

Cluster analysis is one of the methods of multivariate analysis, intended for grouping (clustering) a population whose elements are characterized by many features. The values of each feature serve as the coordinates of each unit of the studied population in the multidimensional feature space: each observation, characterized by the values of several indicators, can be represented as a point in the space of those indicators. The distance between points p and q with k coordinates is defined as:

r(p, q) = √( Σ (pi - qi)² ),  i = 1, …, k.
The main criterion for clustering is that the differences between clusters should be more significant than the differences between observations assigned to the same cluster, i.e. in the multidimensional space the inequality

r1,2 > max(r1, r2)

must hold, where r1,2 is the distance between clusters 1 and 2, and r1, r2 are the characteristic distances between observations within clusters 1 and 2.

Like the regression analysis procedures, the clustering procedure is quite laborious, so it is advisable to perform it on a computer.
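As a hedged sketch of the computational side (Python with numpy; the k-means variant and all names are my illustration, not prescribed by the text), the distance formula and a naive clustering loop might look like this:

```python
import numpy as np

def euclidean(p, q):
    """Distance between points p and q with k coordinates."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(((p - q) ** 2).sum())

def kmeans(points, k, iters=100, seed=0):
    """Naive k-means: observations in one cluster end up closer
    to each other than to observations of other clusters."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, float)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # index of the nearest centroid for every observation
        labels = ((pts[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        # recompute centroids; keep the old one if a cluster went empty
        centers = np.array([pts[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels, centers
```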

The main goal of regression analysis is to determine the analytical form of the relationship in which the change in the resulting feature is due to the influence of one or more factor features, while the set of all other factors that also affect the resulting feature is treated as constant, at average values.
Tasks of regression analysis:
a) Establishing the form of the dependence. Regarding the nature and form of the relationship between phenomena, one distinguishes positive linear and non-linear and negative linear and non-linear regression.
b) Definition of the regression function in the form of a mathematical equation of one type or another and establishing the influence of explanatory variables on the dependent variable.
c) Estimation of unknown values ​​of the dependent variable. Using the regression function, you can reproduce the values ​​of the dependent variable within the interval of given values ​​of the explanatory variables (i.e., solve the interpolation problem) or evaluate the course of the process outside the specified interval (i.e., solve the extrapolation problem). The result is an estimate of the value of the dependent variable.

Paired regression is an equation of the relationship between two variables y and x: y = f(x), where y is the dependent variable (resulting feature) and x is the independent, explanatory variable (factor feature).

There are linear and non-linear regressions.
Linear regression: y = a + bx + ε
Nonlinear regressions are divided into two classes: regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, and regressions that are non-linear with respect to the estimated parameters.
Regressions that are non-linear in the explanatory variables:

  • polynomials of different degrees: y = a + b1·x + b2·x² + … + ε
  • equilateral hyperbola: y = a + b/x + ε

Regressions that are non-linear in the estimated parameters:

  • power: y = a·x^b · ε
  • exponential: y = a·b^x · ε
  • exponential (e-based): y = e^(a + b·x) · ε
The construction of the regression equation reduces to estimating its parameters. To estimate the parameters of regressions that are linear in the parameters, the method of least squares (LSM) is used. LSM yields parameter estimates for which the sum of the squared deviations of the actual values of the resulting feature y from the theoretical values ŷx is minimal, i.e.:

Σ (y - ŷx)² → min.
For linear equations, and for non-linear equations reducible to linear ones, the following system is solved for a and b:

n·a + b·Σx = Σy,
a·Σx + b·Σx² = Σy·x.
You can use the ready-made formulas that follow from this system:

b = (mean(y·x) - mean(y)·mean(x)) / σx²,   a = mean(y) - b·mean(x).
The closeness of the relationship between the studied phenomena is estimated by the linear pair correlation coefficient rxy for linear regression (-1 ≤ rxy ≤ 1):

rxy = b·(σx/σy) = (mean(y·x) - mean(y)·mean(x)) / (σx·σy),
and by the correlation index ρxy for non-linear regression (0 ≤ ρxy ≤ 1):

ρxy = √( 1 - Σ(y - ŷx)² / Σ(y - ȳ)² ).
An assessment of the quality of the constructed model will be given by the coefficient (index) of determination, as well as the average approximation error.
The average approximation error is the average deviation of the calculated values from the actual ones:

Ā = (1/n)·Σ |(y - ŷx)/y| · 100%.
The permissible limit for Ā is no more than 8-10%.
The average elasticity coefficient Ē shows by how many percent, on average, the result y changes from its average value when the factor x changes by 1% from its average value:

Ē = b·(x̄/ȳ).

The task of analysis of variance is to decompose the variance of the dependent variable:

Σ(y - ȳ)² = Σ(ŷx - ȳ)² + Σ(y - ŷx)²,

where Σ(y - ȳ)² is the total sum of squared deviations;
Σ(ŷx - ȳ)² is the sum of squared deviations due to regression ("explained" or "factorial");
Σ(y - ŷx)² is the residual sum of squared deviations.
The share of the variance explained by the regression in the total variance of the resulting feature y is characterized by the coefficient (index) of determination R²:

R² = Σ(ŷx - ȳ)² / Σ(y - ȳ)².
The coefficient of determination is the square of the correlation coefficient or of the correlation index.

The F-test, an evaluation of the quality of the regression equation, consists in testing the hypothesis H0 that the regression equation and the indicator of closeness of the relationship are statistically insignificant. To do this, the actual value Ffact is compared with the critical (tabular) value Ftable of Fisher's F-criterion. Ffact is determined from the ratio of the factorial and residual variances, each calculated per one degree of freedom:

Ffact = ( R²/(1 - R²) ) · (n - m - 1)/m,
where n is the number of population units; m is the number of parameters for variables x.
Ftable is the maximum possible value of the criterion under the influence of random factors for the given degrees of freedom and significance level α. The significance level α is the probability of rejecting the hypothesis given that it is true. Usually α is taken equal to 0.05 or 0.01.
If Ftable < Ffact, then H0, the hypothesis about the random nature of the estimated characteristics, is rejected, and their statistical significance and reliability are recognized. If Ftable > Ffact, then H0 is not rejected, and the regression equation is recognized as statistically insignificant and unreliable.
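A small Python sketch of this decision rule (scipy.stats is used for the critical value; the function name is mine):

```python
from scipy import stats

def f_test(r2, n, m=1, alpha=0.05):
    """Overall significance test for a regression with m factors."""
    f_fact = (r2 / (1.0 - r2)) * (n - m - 1) / m
    f_table = stats.f.ppf(1.0 - alpha, m, n - m - 1)  # critical value
    return f_fact, f_table, f_fact > f_table           # True => reject H0
```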
To assess the statistical significance of the regression and correlation coefficients, Student's t-test is applied and confidence intervals for each indicator are calculated. The hypothesis H0 about the random nature of the indicators, i.e. about their insignificant difference from zero, is put forward. The significance of the regression and correlation coefficients is assessed with Student's t-test by comparing their values with the magnitude of the random error:

ta = a/ma;   tb = b/mb;   tr = rxy/mr.

The random errors of the linear regression parameters and of the correlation coefficient are determined by the formulas:

mb = √( Σ(y - ŷx)² / (n - 2) ) / √( Σ(x - x̄)² );

ma = mb · √( Σx² / n );

mr = √( (1 - rxy²) / (n - 2) ).
Comparing the actual and critical (tabular) values of the t-statistics, tfact and ttable, we accept or reject the hypothesis H0.
The relationship between Fisher's F-test and Student's t-statistic is expressed by the equality

tb² = tr² = F.
If ttable < tfact, then H0 is rejected, i.e. a, b and rxy do not differ from zero by chance but were formed under the influence of the systematically acting factor x. If ttable > tfact, then H0 is not rejected, and the random nature of the formation of a, b or rxy is recognized.
To calculate the confidence intervals, we determine the marginal error Δ for each indicator:

Δa = ttable·ma,   Δb = ttable·mb.

The formulas for calculating the confidence intervals are as follows:

γa = a ± Δa;   γa,min = a - Δa;   γa,max = a + Δa;
γb = b ± Δb;   γb,min = b - Δb;   γb,max = b + Δb.
If zero falls within the boundaries of the confidence interval, i.e. the lower limit is negative and the upper limit is positive, the estimated parameter is taken to be zero, since it cannot simultaneously take both positive and negative values.
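The t-statistics and confidence intervals above can be computed with a short sketch like the following (Python; the names are mine, the formulas are the ones just given):

```python
import numpy as np
from scipy import stats

def coef_tests(x, y, a, b, alpha=0.05):
    """t-statistics and confidence intervals for a and b of y = a + b*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s2 = ((y - (a + b * x)) ** 2).sum() / (n - 2)    # residual variance
    m_b = np.sqrt(s2 / ((x - x.mean()) ** 2).sum())  # random error of b
    m_a = m_b * np.sqrt((x ** 2).sum() / n)          # random error of a
    t_tab = stats.t.ppf(1 - alpha / 2, n - 2)        # critical t value
    return {"t_a": a / m_a, "t_b": b / m_b, "t_table": t_tab,
            "ci_a": (a - t_tab * m_a, a + t_tab * m_a),
            "ci_b": (b - t_tab * m_b, b + t_tab * m_b)}
```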
The forecast value yp is determined by substituting the corresponding (forecast) value xp into the regression equation ŷx = a + b·x. The average standard error of the forecast mŷx is calculated:

mŷx = σres · √( 1 + 1/n + (xp - x̄)² / Σ(x - x̄)² ),

where σres = √( Σ(y - ŷx)² / (n - 2) ),

and the confidence interval of the forecast is built:

γŷx = yp ± Δyp;   γŷx,min = yp - Δyp;   γŷx,max = yp + Δyp,

where Δŷx = ttable·mŷx.
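A sketch of the forecast-interval computation under these same formulas (Python; hypothetical names):

```python
import numpy as np
from scipy import stats

def forecast_interval(x, y, a, b, x_p, alpha=0.05):
    """Point forecast at x_p for y = a + b*x plus its confidence interval."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sigma_res = np.sqrt(((y - (a + b * x)) ** 2).sum() / (n - 2))
    m_y = sigma_res * np.sqrt(1 + 1 / n
                              + (x_p - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum())
    y_p = a + b * x_p
    delta = stats.t.ppf(1 - alpha / 2, n - 2) * m_y   # marginal error
    return y_p, (y_p - delta, y_p + delta)
```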

Solution example

Task number 1. For seven territories of the Ural region in 199X, the values of two features are known.

Table 1

Region   Expenditures on food purchases, % of total expenditures (y)   Average daily wage of one worker, rub. (x)
1        68.8                                                          45.1
2        61.2                                                          59.0
3        59.9                                                          57.2
4        56.7                                                          61.8
5        55.0                                                          58.8
6        54.3                                                          47.2
7        49.3                                                          55.2

Required: 1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power law (previously it is necessary to perform the procedure of linearization of variables by taking the logarithm of both parts);
c) exponential;
d) equilateral hyperbola (here, too, you need to work out how to linearize the model beforehand).
2. Evaluate each model through the average approximation error A and Fisher's F-test.

Solution (Option #1)

To calculate the parameters a and b of the linear regression y = a + b·x (the calculation can be done with a calculator), we solve the system of normal equations for a and b:

n·a + b·Σx = Σy,
a·Σx + b·Σx² = Σy·x.

Based on the initial data, we calculate Σy, Σx, Σyx, Σx², Σy²:
№      y       x       y·x        x²         y²        ŷx      y - ŷx   Ai, %
1      68.8    45.1    3102.88    2034.01    4733.44   61.3     7.5     10.9
2      61.2    59.0    3610.80    3481.00    3745.44   56.5     4.7      7.7
3      59.9    57.2    3426.28    3271.84    3588.01   57.1     2.8      4.7
4      56.7    61.8    3504.06    3819.24    3214.89   55.5     1.2      2.1
5      55.0    58.8    3234.00    3457.44    3025.00   56.5    -1.5      2.7
6      54.3    47.2    2562.96    2227.84    2948.49   60.5    -6.2     11.4
7      49.3    55.2    2721.36    3047.04    2430.49   57.8    -8.5     17.2
Total  405.2   384.3   22162.34   21338.41   23685.76  405.2    0.0     56.7
Mean   57.89   54.90   3166.05    3048.34    3383.68   x        x        8.1
σ       5.74    5.86   x          x          x         x        x        x
σ²     32.92   34.34   x          x          x         x        x        x


b = (mean(y·x) - ȳ·x̄) / σx² = (3166.05 - 57.89·54.90) / 34.34 ≈ -0.35,

a = ȳ - b·x̄ = 57.89 + 0.35·54.90 ≈ 76.88.

Regression equation: ŷ = 76.88 - 0.35·x. With an increase in the average daily wage by 1 rub., the share of expenditures on food purchases falls on average by 0.35 percentage points.
Let us calculate the linear pair correlation coefficient:

rxy = b·(σx/σy) = -0.35·5.86/5.74 ≈ -0.357.

The relationship is moderate and inverse.
Let us determine the coefficient of determination: rxy² = (-0.357)² ≈ 0.127.
The 12.7% variation in the result is explained by the variation in the factor x. Substituting the actual values of x into the regression equation, we determine the theoretical (calculated) values ŷx. Let us find the value of the average approximation error Ā:

Ā = (1/n)·Σ |(y - ŷx)/y| · 100% = 56.7/7 ≈ 8.1%.

On average, the calculated values deviate from the actual ones by 8.1%.
Let us calculate the F-criterion:

Ffact = ( r²/(1 - r²) )·(n - 2) = (0.127/0.873)·5 ≈ 0.73, which is less than Ftable = 6.61 for α = 0.05.

The obtained value indicates the need to accept the hypothesis H0 about the random nature of the revealed dependence and the statistical insignificance of the parameters of the equation and of the indicator of closeness of the relationship.
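The whole chain of calculations in this example can be reproduced with a few lines of Python (numpy only; the data are the seven observations from Table 1, and small discrepancies with the hand-rounded textbook values are expected):

```python
import numpy as np

y = np.array([68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3])
x = np.array([45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2])
n = len(x)

b = ((x * y).mean() - x.mean() * y.mean()) / x.var()   # x.var() is sigma_x^2
a = y.mean() - b * x.mean()
y_hat = a + b * x
r = b * x.std() / y.std()
A = np.mean(np.abs((y - y_hat) / y)) * 100             # approximation error, %
F = r ** 2 / (1 - r ** 2) * (n - 2)                    # F statistic
print(a, b, r, r ** 2, A, F)  # close to 76.88, -0.35, -0.36, 0.127, 8.1, 0.73
```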
1b. The construction of the power model ŷ = a·x^b is preceded by linearization of the variables. In this example, linearization is done by taking the logarithm of both sides of the equation:

lg y = lg a + b·lg x,
Y = C + b·X,

where Y = lg y, X = lg x, C = lg a.

For the calculations, we use the data in Table 1.3.
Table 1.3

№      Y        X        Y·X      Y²       X²       ŷx      y - ŷx   (y - ŷx)²   Ai, %
1      1.8376   1.6542   3.0398   3.3768   2.7364   61.0     7.8      60.8       11.3
2      1.7868   1.7709   3.1642   3.1927   3.1361   56.3     4.9      24.0        8.0
3      1.7774   1.7574   3.1236   3.1592   3.0885   56.8     3.1       9.6        5.2
4      1.7536   1.7910   3.1407   3.0751   3.2077   55.5     1.2       1.4        2.1
5      1.7404   1.7694   3.0795   3.0290   3.1308   56.3    -1.3       1.7        2.4
6      1.7348   1.6739   2.9039   3.0095   2.8019   60.2    -5.9      34.8       10.9
7      1.6928   1.7419   2.9487   2.8656   3.0342   57.4    -8.1      65.6       16.4
Total  12.3234  12.1587  21.4003  21.7078  21.1355  403.5    1.7     197.9       56.3
Mean   1.7605   1.7370   3.0572   3.1011   3.0194   x        x        28.27       8.0
σ      0.0425   0.0484   x        x        x        x        x        x           x
σ²     0.0018   0.0023   x        x        x        x        x        x           x

Calculate C and b:

b = (mean(Y·X) - Ȳ·X̄) / σX² = (3.0572 - 1.7605·1.7370) / 0.0023 ≈ -0.298,

C = Ȳ - b·X̄ = 1.7605 + 0.298·1.7370 ≈ 2.278.

We get the linear equation: Y = 2.278 - 0.298·X.
After potentiating it, we get: ŷ = 10^2.278 · x^(-0.298).
Substituting the actual values of x into this equation, we obtain the theoretical values of the result. From them we calculate the indicator of closeness of the relationship, the correlation index ρxy, and the average approximation error Ā:

ρxy = √( 1 - 197.9/(7·32.92) ) ≈ 0.38,   Ā = 56.3/7 ≈ 8.0%.

The characteristics of the power model indicate that it describes the relationship somewhat better than the linear function.
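Continuing the same script, a sketch of the power-model fit via log-linearization (the arrays x, y and n are those defined in the linear example above):

```python
Y, X = np.log10(y), np.log10(x)                # linearize: lg y = C + b*lg x
b_pow = ((X * Y).mean() - X.mean() * Y.mean()) / X.var()
C = Y.mean() - b_pow * X.mean()
y_hat = 10 ** C * x ** b_pow                   # back-transform: y = 10^C * x^b
rho = np.sqrt(1 - ((y - y_hat) ** 2).sum() / (n * y.var()))
A = np.mean(np.abs((y - y_hat) / y)) * 100     # roughly 0.38 and 8.0%
```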

1c. The construction of the exponential curve ŷ = a·b^x is preceded by linearization of the variables by taking the logarithm of both sides of the equation:

lg y = lg a + x·lg b,
Y = C + B·x,

where Y = lg y, C = lg a, B = lg b. For the calculations, we use the data of the table below.

№      Y        x       Y·x        Y²       x²         ŷx      y - ŷx   (y - ŷx)²   Ai, %
1      1.8376   45.1    82.8758    3.3768   2034.01    60.7     8.1      65.61      11.8
2      1.7868   59.0    105.4212   3.1927   3481.00    56.4     4.8      23.04       7.8
3      1.7774   57.2    101.6673   3.1592   3271.84    56.9     3.0       9.00       5.0
4      1.7536   61.8    108.3725   3.0751   3819.24    55.5     1.2       1.44       2.1
5      1.7404   58.8    102.3355   3.0290   3457.44    56.4    -1.4       1.96       2.5
6      1.7348   47.2    81.8826    3.0095   2227.84    60.0    -5.7      32.49      10.5
7      1.6928   55.2    93.4426    2.8656   3047.04    57.5    -8.2      67.24      16.6
Total  12.3234  384.3   675.9974   21.7078  21338.41   403.4   -1.8     200.78      56.3
Mean   1.7605   54.9    96.5711    3.1011   3048.34    x        x        28.68       8.0
σ      0.0425   5.86    x          x        x          x        x        x           x
σ²     0.0018   34.34   x          x        x          x        x        x           x

The values of the regression parameters A and B are:

B = (mean(Y·x) - Ȳ·x̄) / σx² = (96.5711 - 1.7605·54.9) / 34.34 ≈ -0.0023,

A = Ȳ - B·x̄ = 1.7605 + 0.0023·54.9 ≈ 1.887.

The linear equation obtained is Ŷ = 1.887 - 0.0023·x. Potentiating it, we write it in the usual form:

ŷx = 10^1.887 · 10^(-0.0023·x) = 77.1 · 0.9947^x.
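A sketch of the same linearize-fit-potentiate cycle for the exponential model (again reusing x, y and n from the linear example):

```python
Y = np.log10(y)                                # linearize: lg y = C + B*x
B = ((x * Y).mean() - x.mean() * Y.mean()) / x.var()
C = Y.mean() - B * x.mean()
y_hat = 10 ** C * (10 ** B) ** x               # y = 10^C * (10^B)^x, ~77.1 * 0.9947^x
rho = np.sqrt(1 - ((y - y_hat) ** 2).sum() / (n * y.var()))
```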
We estimate the closeness of the relationship through the correlation index ρxy:

ρxy = √( 1 - Σ(y - ŷx)² / (n·σy²) ) = √( 1 - 200.78/230.44 ) ≈ 0.36.

1d. The equilateral hyperbola ŷ = a + b/x is linearized with the substitution z = 1/x and fitted in the same way. For it, the analogous worksheet gives Σ(y - ŷx)² = 194.90 and Ā ≈ 8.1%, so its correlation index is ρxy = √( 1 - 194.90/230.44 ) ≈ 0.39, the highest of the models considered.

Regression analysis underlies the creation of most econometric models, including cost estimation models. This method can be used to build valuation models if the number of analogues (comparable objects) n and the number of cost factors (comparison elements) k are related as n > (5÷10)·k, i.e. there should be 5-10 times more analogues than cost factors. The same requirement on the ratio of the amount of data to the number of factors applies to other tasks: establishing a relationship between the cost and consumer parameters of an object; justifying the procedure for calculating corrective indices; clarifying price trends; establishing a relationship between wear and changes in influencing factors; obtaining dependencies for calculating cost standards, etc. Fulfilling this requirement is necessary to reduce the probability of working with a data sample that does not satisfy the requirement of normal distribution of the random variables.

The regression relationship reflects only the average trend of the resulting variable, such as cost, from changes in one or more factor variables, such as location, number of rooms, area, floor, etc. This is the difference between a regression relationship and a functional one, in which the value of the resulting variable is strictly defined for a given value of factor variables.

The presence of a regression relationship f between the resulting variable y and the factor variables x1, …, xk (factors) indicates that this relationship is determined not only by the influence of the selected factor variables but also by the influence of variables some of which are generally unknown and others of which cannot be assessed and taken into account:

y = f(x1, …, xk) + ε.

The influence of the unaccounted-for variables is denoted by the second term of this equation, ε, which is called the approximation error.

There are the following types of regression dependencies:

  • paired regression - the relationship between two variables (resulting and factor);
  • multiple regression - the dependence of one resulting variable on two or more factor variables included in the study.

The main task of regression analysis is to quantify the closeness of the relationship between the variables (in paired regression) or between several variables (in multiple regression). The closeness of the relationship is quantified by the correlation coefficient.

The use of regression analysis allows you to establish the pattern of influence of the main factors (hedonic characteristics) on the indicator under study, both in their totality and each of them individually. With the help of regression analysis, as a method of mathematical statistics, it is possible, firstly, to find and describe the form of the analytical dependence of the resulting (desired) variable on the factorial ones and, secondly, to estimate the closeness of this dependence.

By solving the first problem, a mathematical regression model is obtained, with the help of which the desired indicator is then calculated for given factor values. The solution of the second problem makes it possible to establish the reliability of the calculated result.

Thus, regression analysis can be defined as a set of formal (mathematical) procedures designed to measure the closeness, direction and analytical expression of the form of the relationship between the resulting and factor variables, i.e. the output of such an analysis should be a structurally and quantitatively defined statistical model of the form:

ȳ = f(x1, …, xk),

where ȳ is the average value of the resulting variable (the indicator sought, for example, cost, rent, capitalization rate) over its n observations; xi is the value of the i-th factor variable (the i-th cost factor); k is the number of factor variables.

The function f(x1, …, xk) describing the dependence of the resulting variable on the factor variables is called the regression equation (function). The term "regression" (from the Latin regressio, retreat, return to something) is associated with the specifics of one of the particular problems solved at the stage of the method's formation; it no longer reflects the entire essence of the method but continues to be used.

Regression analysis generally includes the following steps:

  • formation of a sample of homogeneous objects and collection of initial information about these objects;
  • selection of the main factors influencing the resulting variable;
  • checking the sample for normality using the χ² or the binomial criterion;
  • acceptance of a hypothesis about the form of the relationship;
  • mathematical data processing;
  • obtaining a regression model;
  • assessment of its statistical indicators;
  • verification calculations using the regression model;
  • analysis of the results.

The specified sequence of operations takes place in the study of both a pair relationship between a factor variable and one resulting variable, and a multiple relationship between the resulting variable and several factor variables.

The use of regression analysis imposes certain requirements on the initial information:

  • the statistical sample of objects should be homogeneous in functional and constructive-technological terms;
  • it should be sufficiently numerous;
  • the cost indicator under study, the resulting variable (price, cost, costs), must be reduced to the same calculation conditions for all objects in the sample;
  • the factor variables must be measured accurately enough;
  • the factor variables must be independent or minimally dependent.

The requirements for homogeneity and completeness of the sample are in conflict: the more strictly the selection of objects is carried out according to their homogeneity, the smaller the sample is, and, conversely, to enlarge the sample, it is necessary to include objects that are not very similar to each other.

After the data are collected for a group of homogeneous objects, they are analyzed to establish the form of the relationship between the resulting and factor variables in the form of a theoretical regression line. The process of finding a theoretical regression line consists in a reasonable choice of an approximating curve and calculation of the coefficients of its equation. The regression line is a smooth curve (in a particular case, a straight line) that describes, using a mathematical function, the general trend of the dependence under study and smoothes irregular, random outliers from the influence of side factors.

To describe paired regression dependencies in valuation tasks, the following functions are used most often:

linear: y = a0 + a1·x + ε;
power: y = a0·x^a1 + ε;
exponential: y = a0·a1^x + ε;
linear-exponential: y = a0 + a1·a2^x + ε.

Here ε is the approximation error caused by the action of unaccounted-for random factors.

In these functions, y is the resulting variable; x is the factor variable (factor); a0, a1, a2 are the parameters of the regression model, the regression coefficients.

The linear exponential model belongs to the class of so-called hybrid models of the form:

where

where xi (i = 1, …, l) are the values of the factors;

bi (i = 0, …, l) are the coefficients of the regression equation.

In this equation, the components A, B and Z correspond to the cost of individual components of the asset being valued, for example the cost of the land plot and the cost of improvements, while the parameter Q is common; it adjusts the value of all components of the asset for a common influence factor, such as location.

Factors appearing in the exponents of the corresponding coefficients are binary variables (0 or 1). Factors at the base of a power are discrete or continuous variables.

Factors whose coefficients enter through multiplication are likewise continuous or discrete variables.

The specification is carried out, as a rule, using an empirical approach and includes two stages:

  • plotting the points of the regression field on a graph;
  • graphical (visual) analysis of the form of a possible approximating curve.

The type of regression curve is not always immediately selectable. To determine it, the points of the regression field are first plotted on the graph according to the initial data. Then a line is visually drawn along the position of the points, trying to find out the qualitative pattern of the connection: uniform growth or uniform decrease, growth (decrease) with an increase (decrease) in the rate of dynamics, a smooth approach to a certain level.

This empirical approach is complemented by logical analysis, starting from already known ideas about the economic and physical nature of the factors under study and their mutual influence.

For example, it is known that the dependences of the resulting variables, economic indicators (prices, rent), on a number of factor variables, price-forming factors (distance from the center of the settlement, area, etc.), are non-linear, and they can be described quite strictly by a power, exponential or quadratic function. But with small ranges of the factors, acceptable results can also be obtained using a linear function.

If it is still impossible to immediately make a confident choice of any one function, then two or three functions are selected, their parameters are calculated, and then, using the appropriate criteria for the tightness of the connection, the function is finally selected.

In regression theory, the process of finding the form of the curve is called specification of the model, and the calculation of its coefficients, calibration of the model.

If it is found that the resulting variable y depends on several factor variables (factors) x1, x2, …, xk, then a multiple regression model is built. Usually three forms of multiple relationship are used:

linear: y = a0 + a1·x1 + a2·x2 + … + ak·xk;
exponential: y = a0·a1^x1·a2^x2·…·ak^xk;
power: y = a0·x1^a1·x2^a2·…·xk^ak;

or combinations thereof.

The power and exponential functions are more universal, since they approximate the non-linear relationships that make up the majority of the dependences studied in valuation. In addition, they can be used both in valuing objects by the statistical modelling method for mass valuation and in the direct comparison method in individual valuation when establishing correction factors.

At the calibration stage, the parameters of the regression model are calculated by the least squares method, the essence of which is that the sum of the squared deviations of the calculated values of the resulting variable ŷi, i.e. those calculated with the selected relationship equation, from the actual values yi should be minimal:

Q = Σ (ŷi - yi)² → min.

The values ŷi and yi are known, so Q is a function only of the coefficients of the equation. To find the minimum of Q, we take its partial derivatives with respect to the coefficients of the equation and equate them to zero:

As a result, we obtain a system of normal equations, the number of which is equal to the number of determined coefficients of the desired regression equation.

Suppose we need to find the coefficients of the linear equation y = a0 + a1·x. The sum of squared deviations is:

Q = Σ (a0 + a1·xi - yi)²,  i = 1, …, n.

We differentiate the function Q with respect to the unknown coefficients a0 and a1 and equate the partial derivatives to zero:

After transformations we get:

a0·n + a1·Σx = Σy,
a0·Σx + a1·Σx² = Σy·x,

where n is the number of original actual values of y (the number of analogues).

The above procedure for calculating the coefficients of the regression equation is also applicable to non-linear dependencies if those dependencies can be linearized, i.e. reduced to linear form by a change of variables. Power and exponential functions acquire a linear form after taking logarithms and making the corresponding change of variables. For example, a power function after taking logarithms takes the form ln y = ln a0 + a1·ln x. After the change of variables Y = ln y, A0 = ln a0, X = ln x we obtain the linear function

Y = A0 + a1·X,

whose coefficients are found as described above.

The least squares method is also used to calculate the coefficients of a multiple regression model. Thus, the system of normal equations for a linear function with two variables x1 and x2, after a series of transformations, looks as follows:

a0·n + a1·Σx1 + a2·Σx2 = Σy,
a0·Σx1 + a1·Σx1² + a2·Σx1·x2 = Σy·x1,
a0·Σx2 + a1·Σx1·x2 + a2·Σx2² = Σy·x2.
Usually this system of equations is solved using linear algebra methods. A multiple exponential function is brought to a linear form by taking logarithms and changing variables in the same way as a paired exponential function.
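In practice, the normal-equation system is handed to a linear-algebra routine; a minimal Python sketch (numpy; the function name is mine):

```python
import numpy as np

def fit_multiple(X, y):
    """Least-squares solution of y = a0 + a1*x1 + ... + ak*xk."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])  # add intercept
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
    return coef  # [a0, a1, ..., ak]
```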

When using hybrid models, multiple regression coefficients are found using numerical procedures of the method of successive approximations.

To make a final choice among several regression equations, each equation must be tested for the closeness of the relationship, which is measured by the correlation coefficient, the variance and the coefficient of variation. The Student and Fisher criteria can also be used for the evaluation. The more closely a curve fits the relationship, the more preferable it is, other things being equal.

If a problem of the class in which the dependence of a cost indicator on cost factors must be established is being solved, the desire to take into account as many influencing factors as possible, and thereby build a more accurate multiple regression model, is understandable. However, two objective limitations hinder the expansion of the number of factors. First, building a multiple regression model requires a much larger sample of objects than building a paired model. It is generally accepted that the number of objects in the sample should exceed the number of factors k by at least 5-10 times. It follows that to build a model with three influencing factors, it is necessary to collect a sample of about 20 objects with different sets of factor values. Second, the factors selected for the model should, in their influence on the value indicator, be sufficiently independent of each other. This is not easy to ensure, since the sample usually combines objects belonging to the same family, in which many factors change regularly from object to object.

The quality of regression models is usually tested using the following statistics.

Standard deviation of the regression equation error (estimation error):

Se = √( Σ (yi - ŷi)² / (n - k - 1) ),

where n is the sample size (the number of analogues);

k is the number of factors (cost factors);

yi - ŷi is the error unexplained by the regression equation (Fig. 3.2);

yi is the actual value of the resulting variable (for example, cost); ŷi is the calculated value of the resulting variable.

This indicator is also called the standard error of estimation (RMS error). In the figure, the dots indicate specific sample values, the ȳ symbol marks the line of sample mean values, and the inclined dash-dotted line is the regression line.


Fig. 3.2. Actual sample values, the sample mean line, and the regression line.

The standard deviation of the estimation error measures the amount of deviation of the actual values of y from the corresponding calculated values ŷi obtained with the regression model. If the sample on which the model is built obeys the normal distribution law, it can be argued that 68% of the real values of y lie in the range ŷ ± Se around the regression line, and 95% in the range ŷ ± 2Se. This indicator is convenient because the units of Se match the units of y. In this regard, it can be used to indicate the accuracy of the result obtained in the valuation process. For example, in a certificate of value one can state that the market value V obtained with the regression model lies with 95% probability in the range from (V - 2Se) to (V + 2Se).

Coefficient of variation of the resulting variable:

var = (Se / ȳ) · 100%,

where ȳ is the mean value of the resulting variable (Fig. 3.2).

In regression analysis, the coefficient of variation var is the standard deviation of the estimation error expressed as a percentage of the mean value of the resulting variable. The coefficient of variation can serve as a criterion of the predictive quality of the obtained regression model: the smaller var, the higher the predictive quality of the model. Using the coefficient of variation is preferable to using Se, since it is a relative indicator. In practical use of this indicator, it can be recommended not to use a model whose coefficient of variation exceeds 33%, since in that case it cannot be said that the sample obeys the normal distribution law.

Determination coefficient (squared multiple correlation coefficient):

R² = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)².

This indicator is used to analyze the overall quality of the obtained regression model. It indicates what percentage of the variation of the resulting variable is due to the influence of all the factor variables included in the model. The determination coefficient always lies in the range from zero to one. The closer its value to one, the better the model describes the original data series. The coefficient of determination can also be represented in another way:

R² = Σ(ŷi - ȳ)² / Σ(yi - ȳ)²,

where Σ(ŷi - ȳ)² is the variation explained by the regression model and Σ(yi - ŷi)² is the error unexplained by the regression model. From an economic point of view, this criterion makes it possible to judge what percentage of the price variation is explained by the regression equation.

An exact acceptance threshold for R² cannot be specified for all cases; both the sample size and a meaningful interpretation of the equation must be taken into account. As a rule, when studying data on objects of the same type obtained at approximately the same time, the value of R² does not exceed 0.6-0.7. If all prediction errors are zero, i.e. the relationship between the resulting and factor variables is functional, then R² = 1.

Adjusted coefficient of determination:

R²adj = 1 - (1 - R²)·(n - 1)/(n - k - 1).

The need for an adjusted coefficient of determination is explained by the fact that as the number of factors k grows, the usual coefficient of determination almost always increases, while the number of degrees of freedom (n - k - 1) decreases. The adjustment always reduces the value of R², since (n - 1) > (n - k - 1). As a result, R²adj may even become negative. This means that R² was close to zero before the adjustment, and that the share of the variance of the variable y explained by the regression equation is very small.

Of two variants of a regression model that differ in the value of the adjusted coefficient of determination but have equally good other quality criteria, the variant with the larger adjusted coefficient of determination is preferable. The coefficient of determination is not adjusted if (n - k)/k > 20.

Fisher ratio:

F = ( Σ(ŷi - ȳ)² / k ) / ( Σ(yi - ŷi)² / (n - k - 1) ).

This criterion is used to assess the significance of the determination coefficient. The residual sum of squares is a measure of the prediction error of the regression against the known cost values yi. Its comparison with the regression sum of squares shows how many times better the regression dependence predicts the result than the mean ȳ does. There is a table of critical values of the Fisher criterion depending on the number of degrees of freedom of the numerator ν1 = k, of the denominator ν2 = n - k - 1, and the significance level α. If the calculated value of the Fisher criterion F exceeds the table value, then the hypothesis of the insignificance of the determination coefficient, i.e. of a discrepancy between the relationships embedded in the regression equation and the really existing ones, is rejected with probability p = 1 - α.
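A sketch of this criterion computed directly from the two sums of squares (Python; the names are mine):

```python
import numpy as np
from scipy import stats

def fisher_ratio(y, y_hat, k, alpha=0.05):
    """F ratio of the regression and residual sums of squares."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    ss_reg = ((y_hat - y.mean()) ** 2).sum()   # explained by the regression
    ss_res = ((y - y_hat) ** 2).sum()          # residual
    F = (ss_reg / k) / (ss_res / (n - k - 1))
    return F, stats.f.ppf(1 - alpha, k, n - k - 1)  # compare with critical value
```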

Average approximation error (average percentage deviation) is calculated as the average relative difference, expressed as a percentage, between the actual and calculated values of the resulting variable:

δ̄ = (1/n)·Σ |(yi - ŷi)/yi| · 100%.

The lower the value of this indicator, the better the predictive quality of the model. When it does not exceed 7%, the model is said to be highly accurate; if δ̄ > 15%, the model's accuracy is said to be unsatisfactory.

Standard error of a regression coefficient:

S_ai = Se · √( ((XᵀX)⁻¹)ii ),

where ((XᵀX)⁻¹)ii is the i-th diagonal element of the matrix (XᵀX)⁻¹; k is the number of factors;

X is the matrix of factor variable values;

Xᵀ is the transposed matrix of factor variable values;

(XᵀX)⁻¹ is the matrix inverse to XᵀX.

The smaller these estimates for each regression coefficient, the more reliable the estimate of the corresponding regression coefficient.
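A sketch of the matrix computation (Python; it assumes a linear model with an intercept, fitted as in the fit_multiple sketch above):

```python
import numpy as np

def coef_std_errors(X, y, coef):
    """Standard errors of regression coefficients via the (X'X)^-1 diagonal."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    n, p = X.shape                       # p = k + 1 parameters
    resid = np.asarray(y, float) - X @ coef
    s2 = (resid ** 2).sum() / (n - p)    # residual variance, n - k - 1 df
    cov = s2 * np.linalg.inv(X.T @ X)    # covariance matrix of the coefficients
    return np.sqrt(np.diag(cov))
```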

Student's test (t-statistic):

ti = ai / S_ai.

This criterion makes it possible to measure the degree of reliability (significance) of the relationship attributable to a given regression coefficient. If the calculated value ti is greater than the table value tα,ν, where ν = n - k - 1 is the number of degrees of freedom, then the hypothesis that this coefficient is statistically insignificant is rejected with probability (100 - α)%. There are special tables of the t-distribution that allow the critical value of the criterion to be determined from a given significance level α and the number of degrees of freedom ν. The most commonly used value of α is 5%.

Multicollinearity, i.e. the effect of mutual relationships between the factor variables, leads to the need to be content with a limited number of them. If this is not taken into account, one can end up with an illogical regression model. To avoid the negative effect of multicollinearity, before building a multiple regression model the pairwise correlation coefficients r_xixj between the selected variables xi and xj are calculated:

r_xixj = ( mean(xi·xj) - mean(xi)·mean(xj) ) / (σxi·σxj),

where mean(xi·xj) is the mean value of the product of the two factor variables;

mean(xi)·mean(xj) is the product of the mean values of the two factor variables;

σxi and σxj are the estimates of the standard deviations (square roots of the variance estimates) of the factor variables xi and xj.

Two variables are considered regression-related (i.e. collinear) if their pairwise correlation coefficient exceeds 0.8 in absolute value. In that case, one of these variables should be excluded from consideration.
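A sketch of this screening step (Python; the 0.8 threshold is from the text, everything else is my naming):

```python
import numpy as np

def collinear_pairs(X, names, threshold=0.8):
    """List factor pairs whose |pairwise correlation| exceeds the threshold."""
    R = np.corrcoef(np.asarray(X, float), rowvar=False)
    k = R.shape[0]
    return [(names[i], names[j], float(R[i, j]))
            for i in range(k) for j in range(i + 1, k)
            if abs(R[i, j]) > threshold]
```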

To expand the possibilities of economic analysis of the obtained regression models, average elasticity coefficients are used, determined by the formula:

Ei = ai · (x̄i / ȳ),

where x̄i is the mean value of the corresponding factor variable;

ȳ is the mean value of the resulting variable; ai is the regression coefficient of the corresponding factor variable.

The elasticity coefficient shows by how many percent the value of the resulting variable changes on average when the factor variable changes by 1%, i.e. how the resulting variable reacts to a change in the factor variable. For example, how the price of a square meter of apartment area reacts to distance from the city center.
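A short sketch of the formula for a fitted linear model (Python; coef[0] is assumed to be the intercept, as returned by the fit_multiple sketch above):

```python
import numpy as np

def elasticities(X, y, coef):
    """Average elasticities E_i = a_i * mean(x_i) / mean(y)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    return coef[1:] * X.mean(axis=0) / y.mean()
```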

Useful from the point of view of analyzing the significance of a particular regression coefficient is the estimate of the partial coefficient of determination, computed from the estimate of the variance of the resulting variable. This coefficient shows what percentage of the variation of the resulting variable is explained by the variation of the i-th factor variable included in the regression equation.

  • Hedonic characteristics are the characteristics of an object that reflect its useful (valuable) properties from the point of view of buyers and sellers.

During their studies, students very often encounter a variety of equations. One of them - the regression equation - is considered in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters. This type of equality is used in statistics and econometrics.

Definition of regression

In mathematics, regression is understood as a certain quantity that describes the dependence of the average value of one data set on the values of another quantity. The regression equation shows, as a function of a particular feature, the average value of another feature. The regression function has the form of a simple equation y = f(x), in which y acts as the dependent variable and x as the independent variable (factor feature). In fact, the regression is expressed as y = f(x).

What are the types of relationships between variables

In general, two opposite types of relationship are distinguished: correlation and regression.

The first is characterized by the equal standing of the variables: it is not known for certain which variable depends on which.

If there is no equality between the variables and the conditions say which variable is explanatory and which is dependent, then we can talk about the presence of a connection of the second type. In order to build a linear regression equation, it will be necessary to find out what type of relationship is observed.

Types of regressions

To date, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m·x + E. The hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. The logarithmically linear equation expresses the relationship using a logarithmic function: ln y = ln c + m·ln x + ln E.

Multiple and non-linear

Two more complex types of regression are multiple and non-linear. The multiple regression equation is expressed by the function y = f(x1, x2, …, xm) + E. In this situation, y is the dependent variable and the x's are the explanatory variables. The variable E is stochastic and includes the influence of the other factors in the equation. The non-linear regression equation is somewhat contradictory: with respect to the indicators included it is not linear, but with respect to the estimated parameters it is linear.

Inverse and Pairwise Regressions

An inverse regression is a kind of function that must be converted to linear form. In the most traditional application programs it has the form y = 1/(c + m·x + E). The pairwise regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

The concept of correlation

This is an indicator demonstrating the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value fluctuates within the interval [-1; +1]. A negative value indicates an inverse relationship, a positive one a direct relationship. If the coefficient equals 0, there is no relationship. The closer the value to 1, the stronger the relationship between the parameters; the closer to 0, the weaker it is.

Methods

Parametric methods of correlation analysis make it possible to estimate the tightness of the relationship. They are used on the basis of distribution estimates to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence, the regression function, and to evaluate the indicators of the chosen relationship formula. The correlation field is used as a method of identifying a relationship. To do this, all available data must be represented graphically: all known data are plotted in a rectangular two-dimensional coordinate system. This is how the correlation field is formed. The values of the describing factor are marked along the abscissa, and the values of the dependent factor along the ordinate. If there is a functional relationship between the parameters, the points line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can talk about the almost complete absence of a connection. If it is between 30% and 70%, then this indicates the presence of links of medium tightness. A 100% indicator is evidence of a functional connection.

A non-linear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation indicator. It speaks to the closeness of the relationship between the presented set of indicators and the feature under study. It can also say something about the nature of the influence of the parameters on the result. The multiple regression equation is evaluated using this indicator.

In order to calculate the multiple correlation index, it is necessary to calculate its index.

Least squares method

This method is a way of estimating regression factors. Its essence lies in minimizing the sum of the squared deviations of the actual values from those given by the function.

A paired linear regression equation can be estimated using such a method. This type of equations is used in case of detection between the indicators of a paired linear relationship.

Equation parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator of the function y when the variable x decreases (increases) by one conventional unit. If the variable x is zero, the function equals the parameter c. If the variable x is never zero, the factor c has no economic meaning. The only influence on the function is the sign in front of the factor c: a minus indicates a slower change of the result compared to the factor, while a plus indicates an accelerated change of the result.

Each parameter that changes the value of the regression equation can be expressed in terms of an equation. For example, the factor c has the form c = y - mx.

Grouped data

There are task conditions in which all information is grouped according to the feature x, while for each group the corresponding average values of the dependent indicator are given. In this case, the average values characterize how the indicator depends on x. Thus, the grouped information helps to find the regression equation and is used as a relationship analysis. However, this method has its drawbacks. Unfortunately, the averages are often subject to external fluctuations. These fluctuations do not reflect the patterns of the relationship; they merely mask its "noise". Averages show the patterns of the relationship much worse than a linear regression equation does. However, they can be used as a basis for finding the equation. By multiplying the size of a particular group by the corresponding average, one obtains the sum of y within the group. Next, sum all the obtained amounts to find the total indicator Σy. It is a little more difficult to make the calculations with the sum indicator Σxy. If the intervals are small, the indicator x can conditionally be taken to be the same for all units within a group. Multiply it by the sum of y to find the sum of the products of x and y. Then all the group sums are added together and the total Σxy is obtained.
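As a hedged numeric sketch of this bookkeeping (Python; the grouped figures below are invented purely for illustration):

```python
import numpy as np

x_vals  = np.array([10.0, 20.0, 30.0])  # group values of the feature x
sizes   = np.array([5, 8, 7])           # number of units in each group
y_means = np.array([3.2, 4.1, 5.0])     # group averages of the dependent y

sum_y  = (sizes * y_means).sum()           # total sum of y over all groups
sum_xy = (sizes * x_vals * y_means).sum()  # approximate total sum of x*y
n      = sizes.sum()
```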

Multiple regression equation: assessing the significance of the relationship

As discussed earlier, multiple regression has a function of the form y = f(x1, x2, …, xm) + E. Most often such an equation is used to solve the problem of supply and demand for a product, of interest income on repurchased shares, or to study the causes and the form of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations, while at the microeconomic level it is used somewhat less often.

The main task of multiple regression is to build a data model containing a huge amount of information in order to further determine what effect each of the factors has individually and in their totality on the indicator to be modeled and its coefficients. The regression equation can take on a variety of values. In this case, two types of functions are usually used to assess the relationship: linear and nonlinear.

A linear function is depicted in the form of the following relationship: y = a0 + a1·x1 + a2·x2 + … + am·xm. Here a1, a2, …, am are considered the coefficients of "pure" regression. They characterize the average change in the parameter y when each corresponding parameter x changes (decreases or increases) by one unit, with the other indicators held at a stable value.

Non-linear equations have, for example, the form of the power function y = a·x1^b1·x2^b2·…·xm^bm. In this case, the indicators b1, b2, …, bm are called elasticity coefficients; they demonstrate by how many percent the result will change when the corresponding indicator x increases (decreases) by 1% while the other factor indicators are stable.

What factors should be considered when building a multiple regression

In order to correctly construct a multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have some understanding of the nature of the relationship between the economic factors and the indicator being modeled. The factors to be included must meet the following criteria:

  • Must be measurable. In order to use a factor describing the quality of an object, in any case, it should be given a quantitative form.
  • There should be no intercorrelation of factors, nor a functional relationship between them. Such situations most often lead to irreversible consequences: the system of normal equations becomes ill-conditioned, which entails unreliability and fuzziness of the estimates.
  • In the case of a huge correlation indicator, there is no way to find out the isolated influence of factors on the final result of the indicator, therefore, the coefficients become uninterpretable.

Construction Methods

There are a huge number of methods and ways to explain how you can choose the factors for the equation. However, all these methods are based on the selection of coefficients using the correlation index. Among them are:

  • Exclusion method.
  • Turn on method.
  • Stepwise regression analysis.

The first method involves sifting out coefficients from the aggregate set. The second involves introducing many additional factors. The third is the elimination of factors that were previously entered into the equation. Each of these methods has a right to exist. They have their pros and cons, but they can all solve the issue of screening out unnecessary indicators in their own way. As a rule, the results obtained by each individual method are quite close.

Methods of multivariate analysis

Such methods for determining factors are based on the consideration of individual combinations of interrelated features. These include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, however, it appeared as a result of the development of the component method. All of them are applied in certain circumstances, under certain conditions and factors.

Modern political science proceeds from the position on the relationship of all phenomena and processes in society. It is impossible to understand events and processes, predict and manage the phenomena of political life without studying the connections and dependencies that exist in the political sphere of society. One of the most common tasks of policy research is to study the relationship between some observable variables. A whole class of statistical methods of analysis, united by the common name "regression analysis" (or, as it is also called, "correlation-regression analysis"), helps to solve this problem. However, if correlation analysis makes it possible to assess the strength of the relationship between two variables, then using regression analysis it is possible to determine the type of this relationship, to predict the dependence of the value of any variable on the value of another variable.

First, let us recall what correlation is. A correlation relationship is the most important special case of a statistical relationship; it consists in the fact that different values of one variable correspond to different average values of another. With a change in the value of the feature x, the average value of the feature y changes in a regular way, while in each individual case the value of the feature y can (with different probabilities) take many different values.

The appearance of the term "correlation" in statistics (and political science draws on the achievements of statistics to solve its problems, which makes statistics a discipline related to political science) is associated with the name of the English biologist and statistician Francis Galton, who in the 19th century proposed the theoretical foundations of correlation-regression analysis. The term "correlation" was known in science before that. In particular, in paleontology, back in the 18th century, it was applied by the French scientist Georges Cuvier. He introduced the so-called law of correlation, with the help of which it was possible to reconstruct the appearance of animals from remains found during excavations.

There is a well-known story associated with the name of this scientist and his law of correlation. On the days of a university holiday, students who decided to play a trick on the famous professor pulled a goat skin with horns and hooves over one of their number. He climbed into the window of Cuvier's bedroom and shouted: "I'll eat you!" The professor woke up, looked at the silhouette and replied: "If you have horns and hooves, then you are a herbivore and cannot eat me. And for ignorance of the law of correlation you will get a failing grade." Then he turned over and fell asleep. A joke is a joke, but in this example we see a special case of the use of multiple correlation-regression analysis. Here the professor, from knowledge of the values of two observed features (the presence of horns and hooves), derived, on the basis of the law of correlation, the average value of a third feature (the class to which this animal belongs: a herbivore). In this case we are not talking about the specific value of this variable (i.e., this animal could take different values on a nominal scale: it could be a goat, a ram, or a bull).

Now let us turn to the term "regression". Strictly speaking, it is not connected with the meaning of the statistical problems solved with this method; an explanation of the term can only be given from the history of the development of methods for studying relationships between features. One of the first examples of such studies was the work of the statisticians F. Galton and K. Pearson, who tried to find a pattern between the heights of fathers and of their children by two observable features (where X is the father's height and Y the children's height). In their study they confirmed the initial hypothesis that, on average, tall fathers have tall children, and the same principle holds for short fathers and children. However, if the scientists had stopped there, their works would never have been mentioned in statistics textbooks. Within the already confirmed hypothesis they found another pattern: they proved that very tall fathers have children who are tall on average but who do not differ much in height from children whose fathers, although above average, do not differ much from average height. The same is true of fathers of very small stature (deviating from the average of the short group): their children, on average, did not differ in height from peers whose fathers were simply short. The function describing this regularity they called a regression function. After this study, all equations describing similar functions and constructed in a similar way began to be called regression equations.

Regression analysis is one of the methods of multivariate statistical data analysis, combining a set of statistical techniques designed to study or model relationships between one dependent and several (or one) independent variables. The dependent variable, by the tradition accepted in statistics, is called the response and is denoted Y. The independent variables are called predictors and are denoted x. In the course of the analysis, some variables will turn out to be weakly related to the response and will eventually be excluded from the analysis. The remaining variables associated with the dependent one may also be called factors.

Regression analysis makes it possible to predict the values of one variable from another variable (for example, the propensity for unconventional political behavior from the level of education) or from several variables. The calculations are performed on a computer; compiling a regression equation that measures the degree of dependence of the studied feature on the factor features requires specialized statistical software. Regression analysis can provide an invaluable service in building predictive models of the development of a political situation, in assessing the causes of social tension, and in conducting theoretical experiments. It is actively used to study the impact on the electoral behavior of citizens of a number of socio-demographic parameters: gender, age, profession, place of residence, nationality, and the level and nature of income.

Regression analysis involves the concepts of independent and dependent variables. An independent variable is one that explains or causes a change in another variable; a dependent variable is one whose values are explained by the influence of the first. For example, in the presidential elections of 2004 the determining factors, i.e. the independent variables, were indicators such as the stabilization of the financial situation of the country's population, the level of popularity of the candidates, and the incumbency factor; the percentage of votes cast for the candidates can be considered the dependent variable. Similarly, in the pair of variables "age of the voter" and "level of electoral activity", the first is independent and the second is dependent.

Regression analysis allows you to solve the following problems (a short sketch after the list illustrates them):

  • 1) establish the very fact of the presence or absence of a statistically significant relationship between y and x;
  • 2) build the best (in the statistical sense) estimates of the regression function;
  • 3) for given values of x, build a prediction for the unknown y;
  • 4) evaluate the specific weight of the influence of each factor x on the response y and, accordingly, exclude insignificant features from the model;
  • 5) by identifying causal relationships between the variables, partially manage the values of y by adjusting the values of the explanatory variables x.
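As a minimal sketch of these tasks, the fragment below assumes Python with the scipy library; the data are invented purely for illustration and are not from any study cited in this text:

```python
# A minimal sketch of tasks 1-4, with invented illustrative data:
# x is years of education, y is a score of interest in political articles.
from scipy import stats

x = [8, 10, 11, 12, 14, 15, 16, 18]   # predictor (education, years)
y = [2, 3, 3, 4, 5, 5, 6, 7]          # response (interest score)

res = stats.linregress(x, y)

# Task 1: is the relationship statistically significant?
print(f"p-value: {res.pvalue:.4f}")          # small p-value -> significant link
# Task 2: least squares estimates of the regression function
print(f"y = {res.intercept:.2f} + {res.slope:.2f} * x")
# Task 3: prediction of the unknown y for a given x
x_new = 13
print(f"predicted y for x = {x_new}: {res.intercept + res.slope * x_new:.2f}")
# Task 4: strength of the factor's influence (correlation coefficient)
print(f"r = {res.rvalue:.3f}")
```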

Regression analysis involves selecting mutually independent variables that affect the value of the indicator under study, determining the form of the regression equation, and estimating its parameters using statistical methods for processing primary sociological data. This type of analysis is based on the idea of the form, direction and closeness (density) of the relationship. Depending on the number of studied features, paired and multiple regression are distinguished. In practice, regression analysis is usually performed together with correlation analysis. The regression equation describes the numerical relationship between quantities, expressed as the tendency of one variable to increase or decrease as another increases or decreases. Linear and non-linear regression are also distinguished; both variants are equally common in the description of political processes.

A scatterplot of the interdependence of interest in political articles (Y) and the education of respondents (X) represents a linear regression (Fig. 30).

Fig. 30.

A scatterplot of the level of electoral activity (Y) against the age of the respondent (X) (a conditional example) represents a non-linear regression (Fig. 31).


Fig. 31.
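Since the figures themselves are not reproduced here, the following sketch (assuming Python with numpy and matplotlib, and invented data) shows the two characteristic shapes: a roughly linear cloud for education vs. interest, and a curvilinear (inverted-U) cloud for age vs. electoral activity:

```python
# A sketch of the two scatterplot shapes from Fig. 30 and Fig. 31; invented data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Fig. 30 analogue: linear dependence (education -> interest)
edu = rng.uniform(8, 18, 100)
interest = 0.4 * edu + rng.normal(0, 1, 100)

# Fig. 31 analogue: non-linear dependence (age -> electoral activity)
age = rng.uniform(18, 80, 100)
activity = -0.03 * (age - 55) ** 2 + 60 + rng.normal(0, 5, 100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(edu, interest)
ax1.set_xlabel("education (X)"); ax1.set_ylabel("interest (Y)")
ax2.scatter(age, activity)
ax2.set_xlabel("age (X)"); ax2.set_ylabel("activity (Y)")
plt.tight_layout(); plt.show()
```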

To describe the relationship between two features (X and Y) in a paired regression model, a linear equation is used:

y = a + bx + e,

where e is the random error of the equation, i.e. the deviation of the observed values from strict "linearity" as the features vary.

To estimate the coefficients a and b, the least squares method is used, which requires that the sum of squared deviations of each point on the scatterplot from the regression line be minimal. The coefficients a and b can be found from the system of normal equations:

Σy = n·a + b·Σx,
Σxy = a·Σx + b·Σx²,

where n is the number of observations. The least squares method gives estimates of the coefficients a and b for which the line passes through the point with coordinates x̄ and ȳ (the means of X and Y), i.e. the relation ȳ = a + b·x̄ holds. The graphical representation of the regression equation is called the theoretical regression line. With a linear dependence, the regression coefficient b is the tangent of the angle of inclination of the theoretical regression line to the x-axis. The sign of the coefficient shows the direction of the relationship: if it is greater than zero, the relationship is direct; if less, inverse.
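A minimal sketch of this computation, assuming Python with numpy and reusing the invented data from the earlier sketch:

```python
# Least squares estimates for y = a + b*x, computed from the normal equations.
import numpy as np

x = np.array([8, 10, 11, 12, 14, 15, 16, 18], dtype=float)
y = np.array([2, 3, 3, 4, 5, 5, 6, 7], dtype=float)
n = len(x)

# Solve the system:  sum(y) = n*a + b*sum(x);  sum(xy) = a*sum(x) + b*sum(x^2)
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum()**2)
a = (y.sum() - b * x.sum()) / n

print(f"y = {a:.3f} + {b:.3f} * x")
# The fitted line passes through the point of means (x-bar, y-bar):
assert abs((a + b * x.mean()) - y.mean()) < 1e-9
```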

The following example from the study "Political Petersburg-2006" (Table 56) shows a linear relationship between citizens' perceptions of the degree of satisfaction with their lives in the present and their expectations of changes in the quality of life in the future. The relationship is direct and linear (the standardized regression coefficient is 0.233, the significance level is 0.000). The regression coefficient is not high in this case, but it exceeds the lower bound of the statistically significant indicator (the lower bound of the square of the statistically significant value of the Pearson coefficient).

Table 56. The impact of citizens' quality of life in the present on their expectations (St. Petersburg, 2006)

* Dependent variable: "How do you think your life will change in the next 2-3 years?"
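For readers who want to reproduce a standardized coefficient of this kind, the sketch below (assuming Python with numpy; the data are invented and are not those of the cited study) z-scores both variables before fitting, so the slope becomes the standardized regression coefficient:

```python
# Standardized regression coefficient (beta): fit on z-scored variables.
# Invented data; NOT the data of the "Political Petersburg-2006" study.
import numpy as np

satisfaction = np.array([1, 2, 2, 3, 3, 4, 4, 5], dtype=float)  # present life
expectation  = np.array([2, 2, 3, 3, 4, 3, 5, 5], dtype=float)  # expected change

zx = (satisfaction - satisfaction.mean()) / satisfaction.std()
zy = (expectation - expectation.mean()) / expectation.std()

beta = (zx * zy).mean()  # slope of zy on zx; equals Pearson r in the paired case
print(f"standardized coefficient: {beta:.3f}")
```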

In political life, the value of the variable under study most often depends on several features simultaneously. For example, the level and nature of political activity are shaped at once by the political regime of the state, political traditions, the peculiarities of political behavior in a given area and in the respondent's social microgroup, as well as his age, education, income level, political orientation, and so on. In this case, the equation of multiple regression must be used, which has the following form:

y = a + b1x1 + b2x2 + … + bnxn + e,

where the coefficient bi is the partial regression coefficient. It shows the contribution of each independent variable to determining the values of the dependent (outcome) variable. If a partial regression coefficient is close to 0, we may conclude that there is no direct relationship between that independent variable and the dependent variable.

The calculation of such a model can be performed on a computer using matrix algebra. Multiple regression makes it possible to reflect the multifactorial nature of social relationships and to clarify the degree of influence of each factor individually, and of all of them together, on the resulting feature.
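A minimal sketch of that matrix computation (assuming Python with numpy; the data and variable names are invented), using the standard least squares solution b = (XᵀX)⁻¹Xᵀy:

```python
# Multiple regression via matrix algebra: solve X b = y in the least squares sense.
import numpy as np

# Invented data: two predictors (e.g. age, education) and a response.
age = np.array([22, 35, 41, 50, 63, 70], dtype=float)
edu = np.array([12, 16, 14, 11, 10, 8], dtype=float)
y   = np.array([40, 55, 58, 52, 60, 57], dtype=float)

# Design matrix with a column of ones for the intercept a.
X = np.column_stack([np.ones_like(age), age, edu])

# Numerically stable equivalent of (X^T X)^{-1} X^T y:
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
print(f"y = {a:.2f} + {b1:.2f}*age + {b2:.2f}*edu")
```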

The coefficient denoted b is called the linear regression coefficient and shows the strength of the relationship between the variation of the factor feature X and the variation of the effective feature Y. This coefficient measures the strength of the relationship in the absolute units of measurement of the features. The closeness of the relationship can, however, also be expressed in terms of the standard deviations of the features (such a coefficient is called the correlation coefficient). Unlike the regression coefficient b, the correlation coefficient r does not depend on the accepted units of measurement and is therefore comparable across any features. Usually the relationship is considered strong if r > 0.7, of medium closeness if 0.5 < r < 0.7, and weak if r < 0.5.

As is well known, the closest relationship is a functional one, in which each individual value of X is uniquely assigned a value of Y. Thus, the closer the correlation coefficient is to 1, the closer the relationship is to a functional one. The significance level for regression analysis should not exceed 0.001.

The correlation coefficient was long considered the main indicator of the closeness of the relationship between features. Later, however, the coefficient of determination became such an indicator. The meaning of this coefficient is as follows: it reflects the share of the total variance of the resulting feature Y that is explained by the variance of the feature X. It is found by simply squaring the correlation coefficient (and so varies from 0 to 1) and, for a linear relationship, reflects the share, from 0 (0%) to 1 (100%), of the values of Y determined by the values of X. It is written as R², and in the regression output tables of the SPSS package it appears without the superscript.
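A small sketch (assuming Python with numpy, and invented data) of the relation between the correlation coefficient and the coefficient of determination:

```python
# Correlation coefficient r and coefficient of determination R^2 = r^2.
import numpy as np

x = np.array([8, 10, 11, 12, 14, 15, 16, 18], dtype=float)
y = np.array([2, 3, 3, 4, 5, 5, 6, 7], dtype=float)

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
r2 = r ** 2                   # share of the variance of y explained by x

print(f"r = {r:.3f}, R^2 = {r2:.3f}")
# Rough interpretation used in the text: r > 0.7 strong, 0.5-0.7 medium, < 0.5 weak.
```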

Let us outline the main problems involved in constructing a multiple regression equation.

  • 1. Selection of the factors included in the regression equation. At this stage, the researcher first compiles a general list of the main causes that, according to theory, determine the phenomenon under study, and then selects the features to enter the equation. The main selection rule: the factors included in the analysis should correlate with one another as little as possible; only then can a quantitative measure of influence be attributed to a particular factor feature (a check of this rule is sketched after the list).
  • 2. Choice of the form of the multiple regression equation (in practice, the linear or linear-logarithmic form is used more often). To use multiple regression, the researcher must first build a hypothetical model of the influence of several independent variables on the resulting one. For the results to be reliable, the model must match the real process: the relationship between the variables must be linear, not a single significant independent variable may be omitted, and no variable unrelated to the process under study may be included in the analysis. In addition, all measurements of the variables must be as accurate as possible.
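A sketch of the factor selection check from item 1 (assuming Python with numpy; the factors and data are invented): compute the pairwise correlations between candidate predictors and flag pairs that correlate too strongly to be included together:

```python
# Check that candidate factors correlate with one another as little as possible.
import numpy as np

# Invented candidate predictors (rows: observations, columns: factors).
factors = {
    "age":    np.array([22, 35, 41, 50, 63, 70], dtype=float),
    "income": np.array([20, 40, 45, 50, 48, 30], dtype=float),
    "edu":    np.array([12, 16, 14, 11, 10, 8], dtype=float),
}

names = list(factors)
data = np.column_stack([factors[n] for n in names])
corr = np.corrcoef(data, rowvar=False)  # correlation matrix of the factors

# Flag strongly intercorrelated pairs (threshold 0.7, as in the text).
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.7:
            print(f"warning: {names[i]} and {names[j]} correlate at {corr[i, j]:.2f}")
```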

From the above description follows a number of conditions for the application of this method, without which it is impossible to proceed to the procedure of multiple regression analysis (MRA). Only compliance with all of the following points allows regression analysis to be carried out correctly.


