Posted: August 5th, 2022

8210 wk 10 discc

Discussion: Estimating Models Using Dummy Variables

You have had plenty of opportunity to interpret coefficients for metric variables in regression models. Using and interpreting categorical variables takes just a little bit of extra practice. In this Discussion, you will have the opportunity to practice how to recode categorical variables so they can be used in a regression model and how to properly interpret the coefficients. Additionally, you will gain some practice in running diagnostics and identifying any potential problems with the model.

To prepare for this Discussion:

· Review Warner’s Chapter 12 and Chapter 2 of the Wagner course text and the media program found in this week’s Learning Resources and consider the use of dummy variables.

· Create a research question using the General Social Survey dataset that can be answered by multiple regression. Using the SPSS software, choose a categorical variable to dummy code as one of your predictor variables.

By Day 3

Estimate a multiple regression model that answers your research question. Post your response to the following:

1. What is your research question?

2. Interpret the coefficients for the model, specifically commenting on the dummy variable.

3. Run diagnostics for the regression model. Does the model meet all of the assumptions? Be sure to comment on which assumptions were not met and the possible implications. Is there any possible remedy for one of the assumption violations?

Be sure to support your Main Post and Response Post with reference to the week’s Learning Resources and other scholarly evidence in APA Style.

By Day 5

Respond to at least one of your colleagues’ posts and provide a constructive comment on their assessment of diagnostics.

1. Were all assumptions tested for?

2. Are there some violations that the model might be robust against? Why or why not?

3. Explain and provide any additional resources (e.g., web links or articles) to help your colleague address diagnostic issues.

Discrete Data

Discrete independent and dependent variables often lead to plots that are difficult to interpret. A simple example of this phenomenon appears in Figure 8.1, the data for which are drawn from the 1989 General Social Survey conducted by the National Opinion Research Center. The independent variable, years of education completed, is coded from 0 to 20. The dependent variable is the number of correct answers to a 10-item vocabulary test; note that this variable is a disguised proportion—literally, the proportion correct × 10.

Figure 8.1. Scatterplot (a) and residual plot (b) for vocabulary score by year of education. The least-squares regression line is shown on the scatterplot.


The scatterplot in Figure 8.1a conveys the general impression that vocabulary increases with education. The plot is difficult to read, however, because most of the 968 data points fall on top of one another. The least-squares regression line, also shown on the plot, has the equation

$$\hat{V} = 1.13 + 0.374E$$

where V and E are, respectively, the vocabulary score and education.

Figure 8.1b plots residuals from the fitted regression against education. The diagonal lines running from upper left to lower right in this plot are typical of residuals for a discrete dependent variable: For any one of the 11 distinct y values, e.g., y = 5, the residual is e = 5 – b0 – b1x = 3.87 – 0.374x, which is a linear function of x. I noted a similar phenomenon in Chapter 6 for the plot of residuals against fitted values when y has a fixed minimum score. The diagonals from lower left to upper right are due to the discreteness of x.

It also appears that the variation of the residuals in Figure 8.1b is lower for the largest and smallest values of education than for intermediate values. This pattern is consistent with the observation that the dependent variable is a disguised proportion: As the average number of correct answers approaches 0 or 10, the potential variation in vocabulary scores decreases. It is possible, however, that at least part of the apparent decrease in residual variation is due to the relative sparseness of data at the extremes of the education scale. Our eye is drawn to the range of residual values, especially because we cannot see most of the data points, and even when variance is constant, the range tends to increase with the amount of data.

These issues are addressed in Figure 8.2, where each data point has been randomly “jittered” both vertically and horizontally: Specifically, a uniform random variable on the interval [-1/2, 1/2] was added to each education and vocabulary score. This approach to plotting discrete data was suggested by Chambers, Cleveland, Kleiner, and Tukey (1983). The plot also shows the fitted regression line for the original data, along with lines tracing the median and first and third quartiles of the distribution of jittered vocabulary scores for each value of education; I excluded education values below six from the median and quartile traces because of the sparseness of data in this region.
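The jittering described above can be sketched in a few lines of Python. This is only an illustration: the function name `jitter` and the small data lists are made up, and the only detail taken from the text is the uniform noise on the interval [-1/2, 1/2].

```python
import random

def jitter(values, half_width=0.5, seed=None):
    """Add independent Uniform(-half_width, half_width) noise to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-half_width, half_width) for v in values]

# Hypothetical illustration data (not the GSS sample).
education = [12, 12, 12, 16, 16, 8]
vocabulary = [6, 6, 5, 8, 8, 4]

# Jitter both coordinates so that coincident points separate on a scatterplot.
jittered_education = jitter(education, seed=1)
jittered_vocabulary = jitter(vocabulary, seed=2)
```

The jittered values are meant for plotting only; any regression line would still be fitted to the original, unjittered scores.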

Several features of Figure 8.2 are worth highlighting: (a) It is clear from the jittered data that the observations are particularly dense at 12 years of education, corresponding to high-school graduation; (b) the median trace is quite close to the linear least-squares regression line; and (c) the quartile traces indicate that the spread of y does not decrease appreciably at high values of education.

A discrete dependent variable violates the assumption that the error in the regression model is normally distributed with constant variance. This problem, like that of a limited dependent variable, is only serious in extreme cases—for example, when there are very few response categories, or where a large proportion of observations is in a small number of categories, conditional on the values of the independent variables.

In contrast, discrete independent variables are perfectly consistent with the regression model, which makes no distributional assumptions about the xs other than uncorrelation with the error. Indeed a discrete x makes possible a straightforward hypothesis test of nonlinearity, sometimes called a test for “lack of fit.” Likewise, it is relatively simple to test for nonconstant error variance across categories of a discrete independent variable (see below).

Figure 8.2. “Jittered” scatterplot for vocabulary score by education. A small random quantity is added to each horizontal and vertical coordinate. The dashed line is the least-squares regression line for the unjittered data. The solid lines are median and quartile traces for the jittered vocabulary scores.


Testing for Nonlinearity

Suppose, for example, that we model education with a set of dummy regressors rather than specify a linear relationship between vocabulary score and education. Although there are 21 conceivable education scores, ranging from 0 through 20, none of the individuals in the sample has 2 years of education, yielding 20 categories and 19 dummy regressors. The model becomes

$$y_i = \alpha + \gamma_1 d_{i1} + \gamma_2 d_{i2} + \cdots + \gamma_{19} d_{i,19} + \varepsilon_i \qquad [8.1]$$


TABLE 8.1 Analysis of Variance for Vocabulary-Test Score, Showing the Incremental F Test for Nonlinearity of the Relationship Between Vocabulary and Education


Contrasting this model with

$$y_i = \alpha + \beta x_i + \varepsilon_i \qquad [8.2]$$

produces a test for nonlinearity, because Equation 8.2, specifying a linear relationship, is a special case of Equation 8.1, which captures any pattern of relationship between E(y) and x. The resulting incremental F test for nonlinearity appears in the analysis-of-variance of Table 8.1. There is, therefore, very strong evidence of a linear relationship between vocabulary and education, but little evidence of nonlinearity.
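As a rough sketch of the incremental F test for a single discrete x, the Python below compares the residual sum of squares from a straight-line fit with the residual sum of squares from the dummy-regressor model. It relies on the fact that, with one dummy regressor per category (less one), the fitted values of the dummy-regressor model are simply the group means of y. The function name `nonlinearity_f` and the example data are hypothetical.

```python
from collections import defaultdict

def nonlinearity_f(x, y):
    """Incremental F statistic: dummy-regressor model versus linear model.

    Returns (F, numerator df, denominator df).
    """
    n = len(x)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    q = len(groups)  # number of distinct x categories

    # General model: fitted values are the within-category means of y.
    rss_general = 0.0
    for ys in groups.values():
        mean_y = sum(ys) / len(ys)
        rss_general += sum((v - mean_y) ** 2 for v in ys)

    # Linear model: ordinary least-squares line y = b0 + b1 * x.
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss_linear = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

    df_num = q - 2   # (q - 1) dummy coefficients versus 1 slope
    df_den = n - q   # residual df of the general model
    f_stat = ((rss_linear - rss_general) / df_num) / (rss_general / df_den)
    return f_stat, df_num, df_den

# Made-up data with clear curvature: group means 1, 4, 9, 16 at x = 1..4.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1.1, 0.9, 4.2, 3.8, 8.9, 9.1, 16.2, 15.8]
f_stat, df_num, df_den = nonlinearity_f(x, y)
```

A large F relative to the F distribution with (df_num, df_den) degrees of freedom would point to nonlinearity; the p-value lookup is left out here to keep the sketch free of third-party libraries.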

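The F test for nonlinearity easily can be extended to a discrete independent variable (say, x1) in a multiple-regression model. Here, we contrast the more general model

$$y_i = \alpha + \gamma_1 d_{i1} + \cdots + \gamma_{q-1} d_{i,q-1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$

with a model specifying a linear effect of x1,

$$y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$

where d1, …, dq-1 are dummy regressors constructed to represent the q categories of x1.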

Testing for Nonconstant Error Variance

A discrete x (or combination of xs) partitions the data into q groups. Let yij denote the jth of ni dependent-variable scores in the ith group. If the error variance is constant, then the within-group variance estimates

$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}{n_i - 1}$$

should be similar. Here, ȳi is the mean in the ith group. Tests that examine the si² directly, such as Bartlett’s (1937) commonly employed test, do not maintain their validity well when the errors are non-normal.

Many alternative tests have been proposed. In a large-scale simulation study, Conover, Johnson, and Johnson (1981) demonstrate that the following simple F test is both robust and powerful: Calculate the values zij = |yij – yi∗| where yi∗ is the median y within the ith group. Then perform a one-way analysis-of-variance of the variable z over the q groups. If the error variance is not constant across the groups, then the group means z̄i will tend to differ, producing a large value of the F test statistic. For the vocabulary data, for example, where education partitions the 968 observations into q = 20 groups, this test gives F(19, 948) = 1.48, p = .08, providing weak evidence of nonconstant spread.
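The test just described can be sketched as follows: compute the absolute deviations from each group median, then form the ordinary one-way analysis-of-variance F ratio on those deviations. The function name `median_deviation_f` and the two small groups are hypothetical, and the sketch stops at the F statistic and its degrees of freedom rather than looking up a p-value.

```python
from statistics import median

def median_deviation_f(groups):
    """One-way ANOVA F statistic on z_ij = |y_ij - median_i|.

    groups: list of lists of y values, one list per category of x.
    Returns (F, between-groups df, within-groups df).
    """
    # Absolute deviations from each group's own median.
    z_groups = []
    for ys in groups:
        med = median(ys)
        z_groups.append([abs(v - med) for v in ys])

    n = sum(len(z) for z in z_groups)
    q = len(z_groups)
    grand_mean = sum(sum(z) for z in z_groups) / n

    ss_between = 0.0
    ss_within = 0.0
    for z in z_groups:
        z_mean = sum(z) / len(z)
        ss_between += len(z) * (z_mean - grand_mean) ** 2
        ss_within += sum((v - z_mean) ** 2 for v in z)

    f_stat = (ss_between / (q - 1)) / (ss_within / (n - q))
    return f_stat, q - 1, n - q

# Made-up example: the second group is twice as spread out as the first.
f_stat, df_between, df_within = median_deviation_f([[1, 2, 3], [0, 2, 4]])
```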

https://go.openathens.net/redirector/waldenu.edu?url=https://methods.sagepub.com/book/regression-diagnostics/n8.xml


Non-Normally Distributed Errors

The assumption of normally distributed errors is almost always arbitrary. Nevertheless, the central-limit theorem assures that under very broad conditions inference based on the least-squares estimators is approximately valid in all but small samples. Why, then, should we be concerned about non-normal errors?

First, although the validity of least-squares estimation is robust—as stated, the levels of tests and confidence intervals are approximately correct in large samples even when the assumption of normality is violated—the method is not robust in efficiency: The least-squares estimator is maximally efficient among unbiased estimators when the errors are normal. For some types of error distributions, however, particularly those with heavy tails, the efficiency of least-squares estimation decreases markedly. In these cases, the least-squares estimator becomes much less efficient than alternatives (e.g., so-called robust estimators, or least-squares augmented by diagnostics). To a substantial extent, heavy-tailed error distributions are problematic because they give rise to outliers, a problem that I addressed in the previous chapter.

A commonly quoted justification of least-squares estimation— called the Gauss-Markov theorem—states that the least-squares coefficients are the most efficient unbiased estimators that are linear functions of the observations yi. This result depends on the assumptions of linearity, constant error variance, and independence, but does not require normality (see, e.g., Fox, 1984, pp. 42–43). Although the restriction to linear estimators produces simple sampling properties, it is not compelling in light of the vulnerability of least squares to heavy-tailed error distributions.

Second, highly skewed error distributions, aside from their propensity to generate outliers in the direction of the skew, compromise the interpretation of the least-squares fit. This fit is, after all, a conditional mean (of y given the xs), and the mean is not a good measure of the center of a highly skewed distribution. Consequently, we may prefer to transform the data to produce a symmetric error distribution.

Finally, a multimodal error distribution suggests the omission of one or more qualitative variables that divide the data naturally into groups. An examination of the distribution of residuals may therefore motivate respecification of the model.

Although there are tests for non-normal errors, I shall describe here instead graphical methods for examining the distribution of the residuals (but see Chapter 9). These methods are more useful for pinpointing the character of a problem and for suggesting solutions.

Normal Quantile-Comparison Plot of Residuals

One such graphical display is the quantile-comparison plot, which permits us to compare visually the cumulative distribution of an independent random sample—here of studentized residuals—to a cumulative reference distribution—the unit-normal distribution. Note that approximations are implied, because the studentized residuals are t distributed and dependent, but generally the distortion is negligible, at least for moderate-sized to large samples.

To construct the quantile-comparison plot:

1. Arrange the studentized residuals in ascending order: t(1), t(2), …, t(n). By convention, the ith studentized residual in this ordering, t(i), has gi = (i – 1/2)/n proportion of the data below it. This convention avoids cumulative proportions of zero and one by (in effect) counting half of each observation below and half above its recorded value. Cumulative proportions of zero and one would be problematic because the normal distribution, to which we wish to compare the distribution of the residuals, never quite reaches cumulative probabilities of zero or one.

2. Find the quantile of the unit-normal distribution that corresponds to a cumulative probability of gi, that is, the value zi from Z ∼ N(0, 1) for which Pr(Z < zi) = gi.

3. Plot the t(i) against the zi.

If the ti were drawn from a unit-normal distribution, then, within the bounds of sampling error, t(i) = zi. Consequently, we expect to find an approximately linear plot with zero intercept and unit slope, a line that can be placed on the plot for comparison. Nonlinearity in the plot, in contrast, is symptomatic of non-normality.

It is sometimes advantageous to adjust the fitted line for the observed center and spread of the residuals. To understand how the adjustment may be accomplished, suppose more generally that a variable X is normally distributed with mean μ and variance σ². Then, for an ordered sample of values, approximately x(i) = μ + σzi, where zi is defined as before. In applications, we need to estimate μ and σ, preferably robustly, because the usual estimators (the sample mean and standard deviation) are markedly affected by extreme values. Generally effective choices are the median of x to estimate μ and (Q3 – Q1)/1.349 to estimate σ, where Q1 and Q3 are, respectively, the first and third quartiles of x: The median and quartiles are not sensitive to outliers. Note that 1.349 is the number of standard deviations separating the quartiles of a normal distribution. Applied to the studentized residuals, we have the fitted line t̂(i) = median(t) + {[Q3(t) – Q1(t)]/1.349} × zi. The normal quantile-comparison plots in this monograph employ the more general procedure.
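The construction steps and the robust comparison line can be sketched with Python's standard library. The function names and the five example residuals are hypothetical. One assumption deserves a flag: quartiles can be defined in several ways, and `statistics.quantiles` with its default method is used here only for convenience, so the slope may differ slightly from one computed under another quartile convention.

```python
from statistics import NormalDist, median, quantiles

def normal_quantile_points(residuals):
    """Return (z_i, t_(i)) pairs for a normal quantile-comparison plot."""
    t_sorted = sorted(residuals)          # step 1: ascending order
    n = len(t_sorted)
    unit_normal = NormalDist()            # N(0, 1)
    points = []
    for i, t in enumerate(t_sorted, start=1):
        g = (i - 0.5) / n                 # cumulative proportion g_i
        z = unit_normal.inv_cdf(g)        # step 2: unit-normal quantile z_i
        points.append((z, t))             # step 3: plot t_(i) against z_i
    return points

def robust_line(residuals):
    """Intercept (median) and slope ((Q3 - Q1) / 1.349) of the comparison line."""
    q1, _, q3 = quantiles(residuals, n=4)  # default quartile method (an assumption)
    return median(residuals), (q3 - q1) / 1.349

# Made-up studentized residuals for illustration.
example = [0.3, -1.2, 2.0, 0.1, -0.4]
points = normal_quantile_points(example)
intercept, slope = robust_line(example)
```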

Several illustrative normal-probability plots for simulated data are shown in Figure 5.1. In parts a and b of the figure, independent samples of size n = 25 and n = 100, respectively, were drawn from a unit-normal distribution. In parts c and d, samples of size n = 100 were drawn from the highly positively skewed χ² distribution with 4 degrees of freedom and the heavy-tailed t distribution with 2 degrees of freedom, respectively. Note how the skew and heavy tails show up as departures from linearity in the normal quantile-comparison plots. Outliers are discernible as unusually large or small values in comparison with corresponding normal quantiles.

Judging departures from normality can be assisted by plotting information about sampling variation. If the studentized residuals were drawn independently from a unit-normal distribution, then

$$\mathrm{SE}(t_{(i)}) = \frac{1}{\phi(z_i)} \sqrt{\frac{g_i (1 - g_i)}{n}}$$

where ϕ(zi) is the probability density (i.e., the “height”) of the unit-normal distribution at Z = zi. Thus, zi ± 2 × SE(t(i)) gives a rough 95% confidence interval around the fitted line t̂(i) = zi in the quantile-comparison plot. If the slope of the fitted line is taken as σ̂ = (Q3 – Q1)/1.349 rather than 1, then the estimated standard error may be multiplied by σ̂. As an alternative to computing standard errors, Atkinson (1985) has suggested a computationally intensive simulation procedure that does not treat the studentized residuals as independent and normally distributed.
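For readers who want to draw the approximate limits, here is a minimal sketch. It assumes the standard error takes the usual large-sample form for an order statistic, SE = sqrt(g(1 – g)/n) / ϕ(z), with g = (i – 1/2)/n; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def quantile_se(i, n):
    """Approximate standard error of the i-th ordered value out of n,
    assuming independent draws from the unit-normal distribution."""
    unit_normal = NormalDist()
    g = (i - 0.5) / n
    z = unit_normal.inv_cdf(g)
    return sqrt(g * (1 - g) / n) / unit_normal.pdf(z)

def rough_limits(i, n):
    """Rough 95% limits z_i +/- 2 * SE around the line with zero intercept and unit slope."""
    z = NormalDist().inv_cdf((i - 0.5) / n)
    se = quantile_se(i, n)
    return z - 2 * se, z + 2 * se

# Middle observation of a sample of five: g = 0.5, z = 0.
se_mid = quantile_se(3, 5)
lower, upper = rough_limits(3, 5)
```

The standard error grows toward the tails, where the density ϕ(z) is small, which is why the limits fan out at both ends of the plot.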


Figure 5.1. Illustrative normal quantile-comparison plots. (a) For a sample of n = 25 from N(0, 1). (b) For a sample of n = 100 from N(0, 1). (c) For a sample of n = 100 from the positively skewed χ² distribution with 4 degrees of freedom. (d) For a sample of n = 100 from the heavy-tailed t distribution with 2 degrees of freedom.


Figure 5.2 shows a normal quantile-comparison plot for the studentized residuals from Duncan’s regression of rated prestige on occupational income and education levels. The plot includes a fitted line with two-standard-error limits. Note that the residual distribution is reasonably well behaved.


Figure 5.2. Normal quantile-comparison plot for the studentized residuals from the regression of occupational prestige on income and education. The plot shows a fitted line, based on the median and quartiles of the ts, and approximate ±2SE limits around the line.


Histograms of Residuals

A strength of the normal quantile-comparison plot is that it retains high resolution in the tails of the distribution, where problems often manifest themselves. A weakness of the display, however, is that it does not convey a good overall sense of the shape of the distribution of the residuals. For example, multiple modes are difficult to discern in a quantile-comparison plot.

Histograms (frequency bar graphs), in contrast, have poor resolution in the tails or wherever data are sparse, but do a good job of conveying general distributional information. The arbitrary class boundaries, arbitrary intervals, and roughness of histograms sometimes produce misleading impressions of the data, however. These problems can partly be addressed by smoothing the histogram (see Silverman, 1986, or Fox, 1990). Generally, I prefer to employ stem-and-leaf displays—a type of histogram (Tukey, 1977) that records the numerical data values directly in the bars of the graph—for small samples (say n < 100), smoothed histograms for moderate-sized samples (say 100 ≤ n ≤ 1,000), and histograms with relatively narrow bars for large samples (say n > 1,000).


Figure 5.3. Stem-and-leaf display of studentized residuals from the regression of occupational prestige on income and education.


A stem-and-leaf display of studentized residuals from the Duncan regression is shown in Figure 5.3. The display reveals nothing of note: There is a single mode, the distribution appears reasonably symmetric, and there are no obvious outliers, although the largest value (3.1) is somewhat separated from the next-largest value (2.0).

Each data value in the stem-and-leaf display is broken into two parts: The leading digits comprise the stem; the first trailing digit forms the leaf; and the remaining trailing digits are discarded, thus truncating rather than rounding the data value. (Truncation makes it simpler to locate values in a list or table.) For studentized residuals, it is usually sensible to make this break at the decimal point. For example, for the residuals shown in Figure 5.3: 0.3039 → 0 |3; 3.1345 → 3 |1; and -0.4981 → -0 |4. Note that each stem digit appears twice, implicitly producing bins of width 0.5. Stems marked with asterisks (e.g., 1∗) take leaves 0–4; stems marked with periods (e.g., 1.) take leaves 5–9. (For more information about stem-and-leaf displays, see, e.g., Velleman and Hoaglin [1981] or Fox [1990].)
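The stem/leaf split with truncation can be illustrated with a small helper. The function name `stem_and_leaf` is hypothetical, and the sketch handles only the break at the decimal point described above; values sitting extremely close to a digit boundary could be affected by floating-point representation.

```python
def stem_and_leaf(value):
    """Split a value at the decimal point: stem = integer part (with sign),
    leaf = first decimal digit, remaining digits truncated (not rounded)."""
    sign = "-" if value < 0 else ""
    magnitude = abs(value)
    stem = int(magnitude)              # leading digits
    leaf = int(magnitude * 10) % 10    # first trailing digit, truncated
    return f"{sign}{stem}", leaf

# The three residuals quoted in the text.
examples = [stem_and_leaf(v) for v in (0.3039, 3.1345, -0.4981)]
```

Keeping the sign on the stem is what lets "-0" and "0" appear as separate stems in the display.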


Figure 5.4. The family of powers and roots. The transformation labeled “p” is actually y’ = (y^p – 1)/p; for p = 0, y’ = log_e y.


SOURCE: Adapted with permission from Figure 4-1 from Hoaglin, Mosteller, and Tukey (eds.). Understanding Robust and Exploratory Data Analysis, © 1983 by John Wiley and Sons, Inc.

Correcting Asymmetry by Transformation

A frequently effective approach to a variety of problems in regression analysis is to transform the data so that they conform more closely to the assumptions of the linear model. In this and later chapters I shall introduce transformations to produce symmetry in the error distribution, to stabilize error variance, and to make the relationship between y and the xs linear.

In each of these cases, we shall employ the family of powers and roots, replacing a variable y (used here generically, because later we shall want to transform xs as well) by y’ = y^p. Typically, p = -2, -1, -1/2, 1/2, 2, or 3, although sometimes other powers and roots are considered. Note that p = 1 represents no transformation. In place of the 0th power, which would be useless because y^0 = 1 regardless of the value of y, we take y’ = log y, usually using base 2 or 10 for the log function. Because logs to different bases differ only by a constant factor, we can select the base for convenience of interpretation. Using the log transformation as a “zeroth power” is reasonable, because the closer p gets to zero, the more y^p looks like the log function (formally, the limit as p → 0 of (y^p – 1)/p is log_e y, where the log to the base e ≈ 2.718 is the so-called “natural” logarithm). Finally, for negative powers, we take y’ = -y^p, preserving the order of the y values, which would otherwise be reversed.
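A short sketch may make the “zeroth power” argument concrete. It uses the scaled form (y^p – 1)/p, with the natural log at p = 0; the function name `power_transform` is hypothetical. A side effect of dividing by p is that the scaled form keeps the original ordering of the y values even for negative powers, so no sign flip is needed in this version.

```python
from math import log

def power_transform(y, p):
    """Scaled power transformation (y**p - 1) / p, with log(y) at p = 0."""
    if y <= 0:
        raise ValueError("y must be positive for power transformations")
    if p == 0:
        return log(y)
    return (y ** p - 1) / p

# As p approaches 0 the transformation approaches the natural log.
near_zero = power_transform(4.0, 1e-8)
exact_log = power_transform(4.0, 0)

# Order is preserved for a negative power: 2 < 4 before and after.
inverse_2 = power_transform(2.0, -1)
inverse_4 = power_transform(4.0, -1)
```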

As we move away from p = 1 in either direction, the transformations get stronger, as illustrated in Figure 5.4. The effect of some of these transformations is shown in Table 5.1a. Transformations “up the ladder” of powers and roots (a term borrowed from Tukey, 1977), that is, toward y^2, serve differentially to spread out large values of y relative to small ones; transformations “down the ladder,” toward log y, have the opposite effect. To correct a positive skew (as in Table 5.1b), it is therefore necessary to move down the ladder; to correct a negative skew (Table 5.1c), which is less common in applications, move up the ladder.

I have implicitly assumed that all data values are positive, a condition that must hold for power transformations to maintain order. In practice, negative values can be eliminated prior to transformation by adding a small constant, sometimes called a “start,” to the data. Likewise, for power transformations to be effective, the ratio of the largest to the smallest data value must be sufficiently large; otherwise the transformation will be too nearly linear. A small ratio can be dealt with by using a negative start.

In the specific context of regression analysis, a skewed error distribution, revealed by examining the distribution of the residuals, can often be corrected by transforming the dependent variable. Although more sophisticated approaches are available (see, e.g., Chapter 9), a good transformation can be located by trial and error.

Dependent variables that are bounded below, and hence that tend to be positively skewed, often respond well to transformations down the ladder of powers. Power transformations usually do not work well, however, when many values stack up against the boundary, a situation termed truncation or censoring (see, e.g., Tobin [1958] for a treatment of “limited” dependent variables in regression). As well, data that are bounded both above and below (such as proportions and percentages) generally require another approach. For example, the logit or “log odds” transformation, given by y’ = log[y/(1 – y)], often works well for proportions.
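The logit transformation is simple enough to show directly. In this sketch the function name `logit` is hypothetical, the natural log is used as the base, and proportions of exactly 0 or 1 are rejected because the log odds are undefined there.

```python
from math import log

def logit(proportion):
    """Log odds: log[y / (1 - y)] for a proportion strictly between 0 and 1."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return log(proportion / (1 - proportion))

# The transformation is symmetric around 0.5 and stretches the extremes.
middle = logit(0.5)
high = logit(0.8)
low = logit(0.2)
```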


TABLE 5.1 Correcting Skews by Power Transformations

https://go.openathens.net/redirector/waldenu.edu?url=https://dx.doi.org/10.4135/9781412985604.n5

Walden University, LLC. (Producer). (2016m). Regression diagnostics and model evaluation [Video file]. Baltimore, MD: Author.

Estimating Models Using Dummy Variables

Name

Institution Name

Course Name

Professor’s Name

Date

The topic for this week’s discussion is the relationship between respondents’ sex and their labor force status. These variables’ frequency distributions exhibited no outliers, according to the preliminary information. A total of 2,536 people were included in the study. The male respondent dummy (M = .4491, SD = .4975) and the female respondent dummy (M = .5509, SD = .4975) were examined alongside scores on labor force status (M = 3.00, SD = 2.355), and the distributions were judged to be normal. The sex of the respondent was used as the predictor in a linear regression to predict labor force status. The regression was significant, F(1, 2534) = 73.585, p < .001, with a modest effect, R2 = .028.

Respondents Sex:

Case Processing Summary (LABOR FORCE STATUS by RESPONDENTS SEX)

                   Valid              Missing            Total
RESPONDENTS SEX    N       Percent    N       Percent    N       Percent
MALE               1139    99.8%      2       0.2%       1141    100.0%
FEMALE             1397    100.0%     0       0.0%       1397    100.0%

Labor Force Statistics:

Descriptive Statistics

                      Mean     Std. Deviation    N
LABOR FORCE STATUS    3.00     2.355             2536
FemaleRespondent      .5509    .49750            2536
MaleRespondents       .4491    .49750            2536

Correlations:

                                            LABOR FORCE STATUS    FemaleRespondent    MaleRespondents
Pearson Correlation    LABOR FORCE STATUS   1.000                 .168                -.168
                       FemaleRespondent     .168                  1.000               -1.000
                       MaleRespondents      -.168                 -1.000              1.000
Sig. (1-tailed)        LABOR FORCE STATUS   .                     .000                .000
                       FemaleRespondent     .000                  .                   .000
                       MaleRespondents      .000                  .000                .
N                      LABOR FORCE STATUS   2536                  2536                2536
                       FemaleRespondent     2536                  2536                2536
                       MaleRespondents      2536                  2536                2536

ANOVAa

Model 1       Sum of Squares    df      Mean Square    F         Sig.
Regression    396.625           1       396.625        73.585    .000b
Residual      13658.356         2534    5.390
Total         14054.981         2535

a. Dependent Variable: LABOR FORCE STATUS

b. Predictors: (Constant), MaleRespondents

Coefficientsa

                   Unstandardized Coefficients    Standardized Coefficients                      Correlations                      Collinearity Statistics
Model 1            B        Std. Error            Beta                         t         Sig.    Zero-order    Partial    Part     Tolerance    VIF
(Constant)         3.354    .062                                               54.002    .000
MaleRespondents    -.795    .093                  -.168                        -8.578    .000    -.168         -.168      -.168    1.000        1.000

a. Dependent Variable: LABOR FORCE STATUS

Excluded Variablesa

                                                                    Collinearity Statistics
Model 1             Beta In    t    Sig.    Partial Correlation    Tolerance    VIF    Minimum Tolerance
FemaleRespondent    .b         .    .       .                      .000         .      .000

a. Dependent Variable: LABOR FORCE STATUS

b. Predictors in the Model: (Constant), MaleRespondents

Collinearity Diagnosticsa

                                                           Variance Proportions
Model    Dimension    Eigenvalue    Condition Index    (Constant)    MaleRespondents
1        1            1.670         1.000              .16           .16
1        2            .330          2.250              .84           .84

a. Dependent Variable: LABOR FORCE STATUS

RE: Discussion – Week 10



A dummy variable is a dichotomy with two categories not including missing values (Wagner, 2020).

 

Multiple regression is an extension of bivariate regression that examines the effect of two or more independent variables on the dependent variable (Frankfort-Nachmias et al., 2020). Dummy variables are useful for including nominal variables in an analysis that requires interval or ratio variables to perform multiple regression.
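To make the recoding concrete, here is a minimal Python sketch of dummy coding a nominal variable. It is not the SPSS procedure used for this post; the function name `dummy_code` and the category labels are hypothetical. One category is left out as the reference, because a dummy for every category would be perfectly collinear with the constant.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator list per category, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical responses for a three-category housing variable.
responses = ["OWN", "RENT", "OTHER", "OWN"]
dummies = dummy_code(responses, reference="OTHER")
```

Each coefficient on a dummy is then read as the difference between that category and the reference category, holding the other predictors constant.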

General Social Survey Dataset’s Mean of Age

In the General Social Survey dataset, the mean age of respondents is 49.01, likely implying there are older respondents in the population, and the median (the “middle score”) is 49.00. A mean and median this close would suggest a normally distributed variable; however, the lowest mode is 53 and the skewness is positive, which means the data are skewed slightly to the right. Thus, the mean indicates the results may not be generalizable across all age groups. I chose to display the data surrounding the respondents’ highest degree and age.

Research Question

Using the General Social Survey (GSS) dataset, a research question was constructed that can be answered with a regression model. The dependent variable is the respondents’ highest year of school completed, and the two independent variables are owning or buying a home and paying rent. The highest year of school completed may influence individuals’ ability to either purchase a home or rent one. Using the GSS data, the following research question and hypotheses were developed:

· Research question: Do people with a higher year of school completed own or are buying a home rather than paying rent? 

· The null hypothesis (Ho): There is no relationship between the highest year of school completed and owning or buying a home and paying rent. 

· The alternative hypothesis (Ha): There is a relationship between the highest year of school completed and owning or buying a home and paying rent. 

Interpretation of the Coefficients of the Model

Table 1 shows the coefficients of owning or buying a home and paying rent for the highest year of school completed. The unstandardized coefficient for own or is buying is 1.960 (Table 1), and the coefficient for pays rent is smaller, at .839 (Table 1). Each coefficient is the difference in predicted years of school completed between that category and the omitted reference category, whose predicted value is the constant, 12.240. The significance level for own or is buying is .001 (Table 1), which is well below the .05 threshold, so for this predictor we can reject the null hypothesis that there is no relationship between the highest year of school completed and owning or buying a home and paying rent (Walden University, 2016). The significance level for pays rent is .168 (Table 1), which is above the .05 threshold. Thus, owning or buying a home is a significant predictor of respondents’ highest year of school completed, while paying rent is not (Walden University, 2016). A Variance Inflation Factor (VIF) above 10 shows serious multicollinearity in the model, meaning the independent variables own/buying and pays rent are highly correlated with each other (Walden University, 2016). Therefore, we can assume that we have not met this assumption, since the VIF is 16.106 (Table 1). I don’t think there is a possible remedy for this assumption violation because violating the assumption of linearity implies that the model fails to capture the systematic pattern of relationship between the dependent and independent variables (Nonlinearity, 1991). Consequently, there is no linear relationship in this model.

Table 1

Coefficientsa

                            Unstandardized Coefficients    Standardized Coefficients                      Collinearity Statistics
Model 1                     B         Std. Error           Beta                         t         Sig.     Tolerance    VIF
(Constant)                  12.240    .596                                              20.554    <.001
dwelown=OWN OR IS BUYING    1.960     .603                 .314                         3.252     .001     .062         16.106
dwelown=PAYS RENT           .839      .608                 .133                         1.381     .168     .062         16.106

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
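As a rough check on how the dummy coefficients in Table 1 turn into predicted values, the sketch below plugs the reported B values into the regression equation. The names `predicted_years` and `vif_from_tolerance` are hypothetical, and the reference category is whichever dwelown category was left out of the model, which the output does not name.

```python
# Unstandardized coefficients as reported in Table 1.
INTERCEPT = 12.240
B_OWN = 1.960
B_RENT = 0.839

def predicted_years(own, rent):
    """Predicted highest year of school completed for 0/1 dummy values."""
    return INTERCEPT + B_OWN * own + B_RENT * rent

def vif_from_tolerance(tolerance):
    """VIF is the reciprocal of tolerance."""
    return 1 / tolerance

reference_group = predicted_years(own=0, rent=0)   # the constant, 12.240
owners = predicted_years(own=1, rent=0)            # about 14.20
renters = predicted_years(own=0, rent=1)           # about 13.08

# Tolerance is printed to three decimals (.062), so this only approximates 16.106.
approx_vif = vif_from_tolerance(0.062)
```

The reference-group and owner predictions line up with the minimum (12.24) and maximum (14.20) predicted values in the residuals statistics, which is what one would expect when the only predictors are dummies for a single categorical variable.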

Analysis of the Model Summary

The Durbin-Watson statistic takes values from 0 to 4.0 and provides information about the independence of errors (Walden University, 2016). The model summary reports a Durbin-Watson value of 1.622, which suggests there is no serious correlation between residuals (Walden University, 2016). Durbin-Watson values below 1.0 and above 3.0 are considered dangerous because they indicate that the model suffers from serious autocorrelation, that is, correlation between the residuals and a lagged version of themselves over various time intervals (Walden University, 2016). The model summary reports an R square of .035, a negligible effect size. A higher effect size means that the research finding has practical significance, while a negligible effect size indicates limited practical applications (Warner, 2012).
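For anyone curious how the Durbin-Watson value is obtained, it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below uses made-up residuals, not the GSS model's, and the function name `durbin_watson` is hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Residuals that flip sign every case push the statistic toward 4;
# residuals that never change push it toward 0.
alternating = durbin_watson([1, -1, 1, -1])
constant = durbin_watson([1, 1, 1, 1])
```

Values near 2 indicate little first-order autocorrelation, which is why 1.622 is read as acceptable here.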

Diagnostics for the Regression Model

The residuals statistics in Table 2 show the Cook’s Distance values, which are considered problematic if they are greater than or equal to 1.0, in which case further diagnostics should be performed for undue influence (Walden University, 2016). Unusual observations can unduly influence the results of the analysis, and their presence may signal that the regression model fails to capture important characteristics of the data (Outlying and Influential Data, 1991). For example, specific outliers on one or more of the variables might exert undue influence on the model and have a significant impact (Walden University, 2016). Table 2 shows values for Cook’s Distance ranging from a minimum of .000 to a maximum of .037 (Table 2), well below 1.0, revealing that we have no undue influence in this model (Walden University, 2016). The negative minimum residual of –14.200 (Table 2) means that the predicted value was too high for that case, and the positive maximum residual of 6.921 (Table 2) means that the predicted value was too low. Thus, the aim of a regression model is to explain variability in the dependent variable by means of one or more independent or control variables.

Table 2

Residuals Statisticsa

                                     Minimum    Maximum    Mean     Std. Deviation    N
Predicted Value                      12.24      14.20      13.76    .569              1669
Std. Predicted Value                 -2.672     .770       .000     1.000             1669
Standard Error of Predicted Value    .093       .596       .110     .061              1669
Adjusted Predicted Value             12.04      14.21      13.76    .569              1669
Residual                             -14.200    6.921      .000     2.976             1669
Std. Residual                        -4.769     2.324      .000     .999              1669
Stud. Residual                       -4.771     2.326      .000     1.000             1669
Deleted Residual                     -14.214    6.933      .000     2.981             1669
Stud. Deleted Residual               -4.803     2.329      .000     1.001             1669
Mahal. Distance                      .612       65.721     1.999    7.879             1669
Cook’s Distance                      .000       .037       .001     .002              1669
Centered Leverage Value              .000       .039       .001     .005              1669

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED

Conclusion

The regression model is used to test simple hypotheses and to analyze associations across two or more independent variables. Visual displays of data can be invaluable because they make it easier to understand the statistical findings. After all, research can be useful when it organizes, summarizes, and communicates information (Frankfort-Nachmias et al., 2020). Therefore, the biggest takeaway from the model of respondents’ highest year of school completed is whether the model meets all the assumptions and whether possible remedies can be found for any violations.

References

Frankfort-Nachmias, C., Leon-Guerrero, A., & Davis, G. (2020). Social statistics for a diverse society (9th ed.). Sage Publications.

Nonlinearity. (1991). In J. Fox (Ed.), Regression Diagnostics. (pp. 54-62). SAGE Publications, Inc.

Outlying and Influential Data. (1991). In J. Fox (Ed.), Regression Diagnostics. (pp. 22-41). SAGE Publications, Inc.

Wagner, III, W. E. (2020). Using IBM® SPSS® statistics for research methods and social science statistics (7th ed.). Sage Publications.

Walden University, LLC. (Producer). (2016). Regression diagnostics and model evaluation [Video file]. Author.

Warner, R. M. (2012). Applied statistics from bivariate through multivariate techniques (2nd ed.). Sage Publications.
