Posted: August 5th, 2022
Discussion: Estimating Models Using Dummy Variables
You have had plenty of opportunity to interpret coefficients for metric variables in regression models. Using and interpreting categorical variables takes just a little bit of extra practice. In this Discussion, you will have the opportunity to practice how to recode categorical variables so they can be used in a regression model and how to properly interpret the coefficients. Additionally, you will gain some practice in running diagnostics and identifying any potential problems with the model.
To prepare for this Discussion:
· Review Warner's Chapter 12 and Chapter 2 of the Wagner course text and the media program found in this week's Learning Resources, and consider the use of dummy variables.
· Create a research question using the General Social Survey dataset that can be answered by multiple regression. Using the SPSS software, choose a categorical variable to dummy code as one of your predictor variables.
By Day 3
Estimate a multiple regression model that answers your research question. Post your response to the following:
1. What is your research question?
2. Interpret the coefficients for the model, specifically commenting on the dummy variable.
3. Run diagnostics for the regression model. Does the model meet all of the assumptions? Be sure to comment on any assumptions that were not met and their possible implications. Is there any possible remedy for one of the assumption violations?
Be sure to support your Main Post and Response Post with reference to the week’s Learning Resources and other scholarly evidence in APA Style.
By Day 5
Respond to at least one of your colleagues’ posts and provide a constructive comment on their assessment of diagnostics.
1. Were all assumptions tested for?
2. Are there some violations that the model might be robust against? Why or why not?
3. Explain and provide any additional resources (e.g., web links, articles) to help your colleague address diagnostic issues.
Discrete Data
Discrete independent and dependent variables often lead to plots that are difficult to interpret. A simple example of this phenomenon appears in Figure 8.1, the data for which are drawn from the 1989 General Social Survey conducted by the National Opinion Research Center. The independent variable, years of education completed, is coded from 0 to 20. The dependent variable is the number of correct answers to a 10-item vocabulary test; note that this variable is a disguised proportion—literally, the proportion correct × 10.
Figure 8.1. Scatterplot (a) and residual plot (b) for vocabulary score by years of education. The least-squares regression line is shown on the scatterplot.
The scatterplot in Figure 8.1a conveys the general impression that vocabulary increases with education. The plot is difficult to read, however, because most of the 968 data points fall on top of one another. The least-squares regression line, also shown on the plot, has the equation

V̂ = 1.13 + 0.374E

where V and E are, respectively, the vocabulary score and education.
Figure 8.1b plots residuals from the fitted regression against education. The diagonal lines running from upper left to lower right in this plot are typical of residuals for a discrete dependent variable: For any one of the 11 distinct y values, e.g., y = 5, the residual is e = 5 – b0 – b1x = 3.87 – 0.374x, which is a linear function of x. I noted a similar phenomenon in Chapter 6 for the plot of residuals against fitted values when y has a fixed minimum score. The diagonals from lower left to upper right are due to the discreteness of x.
It also appears that the variation of the residuals in Figure 8.1b is lower for the largest and smallest values of education than for intermediate values. This pattern is consistent with the observation that the dependent variable is a disguised proportion: As the average number of correct answers approaches 0 or 10, the potential variation in vocabulary scores decreases. It is possible, however, that at least part of the apparent decrease in residual variation is due to the relative sparseness of data at the extremes of the education scale. Our eye is drawn to the range of residual values, especially because we cannot see most of the data points, and even when variance is constant, the range tends to increase with the amount of data.
These issues are addressed in Figure 8.2, where each data point has been randomly "jittered" both vertically and horizontally: Specifically, a uniform random variable on the interval [−1/2, 1/2] was added to each education and vocabulary score. This approach to plotting discrete data was suggested by Chambers, Cleveland, Kleiner, and Tukey (1983). The plot also shows the fitted regression line for the original data, along with lines tracing the median and first and third quartiles of the distribution of jittered vocabulary scores for each value of education; I excluded education values below six from the median and quartile traces because of the sparseness of data in this region.
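The jittering step can be sketched in a few lines of code. This is a minimal illustration, not the authors' procedure; the education and vocabulary values below are hypothetical stand-ins for the GSS variables.

```python
import random

def jitter(values, half_width=0.5, rng=None):
    """Add uniform noise on [-half_width, half_width] to each value so that
    overplotted discrete points become visually distinct."""
    rng = rng or random.Random(0)
    return [v + rng.uniform(-half_width, half_width) for v in values]

# Hypothetical discrete scores standing in for the GSS variables.
education = [12, 12, 12, 16, 8]
vocabulary = [5, 6, 5, 9, 4]
jittered_education = jitter(education)
jittered_vocabulary = jitter(vocabulary)
```

Each jittered value stays within half a unit of its original score, so the display spreads the points without changing which cell of the discrete grid they belong to.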
Several features of Figure 8.2 are worth highlighting: (a) It is clear from the jittered data that the observations are particularly dense at 12 years of education, corresponding to high-school graduation; (b) the median trace is quite close to the linear least-squares regression line; and (c) the quartile traces indicate that the spread of y does not decrease appreciably at high values of education.
A discrete dependent variable violates the assumption that the error in the regression model is normally distributed with constant variance. This problem, like that of a limited dependent variable, is only serious in extreme cases—for example, when there are very few response categories, or where a large proportion of observations is in a small number of categories, conditional on the values of the independent variables.
In contrast, discrete independent variables are perfectly consistent with the regression model, which makes no distributional assumptions about the xs other than uncorrelation with the error. Indeed a discrete x makes possible a straightforward hypothesis test of nonlinearity, sometimes called a test for “lack of fit.” Likewise, it is relatively simple to test for nonconstant error variance across categories of a discrete independent variable (see below).
Figure 8.2. "Jittered" scatterplot for vocabulary score by education. A small random quantity is added to each horizontal and vertical coordinate. The dashed line is the least-squares regression line for the unjittered data. The solid lines are median and quartile traces for the jittered vocabulary scores.
Testing for Nonlinearity
Suppose, for example, that we model education with a set of dummy regressors rather than specify a linear relationship between vocabulary score and education. Although there are 21 conceivable education scores, ranging from 0 through 20, none of the individuals in the sample has 2 years of education, yielding 20 categories and 19 dummy regressors. The model becomes

y = α + γ1d1 + γ2d2 + ⋯ + γ19d19 + ε    (8.1)
TABLE 8.1 Analysis of Variance for Vocabulary-Test Score, Showing the Incremental F Test for Nonlinearity of the Relationship Between Vocabulary and Education
Contrasting this model with

y = α + βx + ε    (8.2)
produces a test for nonlinearity, because Equation 8.2, specifying a linear relationship, is a special case of Equation 8.1, which captures any pattern of relationship between E(y) and x. The resulting incremental F test for nonlinearity appears in the analysis-of-variance table, Table 8.1. There is, therefore, very strong evidence of a linear relationship between vocabulary and education, but little evidence of nonlinearity.
The F test for nonlinearity easily can be extended to a discrete independent variable—say, x1—in a multiple-regression model. Here, we contrast the more general model

y = α + γ1d1 + ⋯ + γq−1dq−1 + β2x2 + ⋯ + βkxk + ε

with a model specifying a linear effect of x1,

y = α + β1x1 + β2x2 + ⋯ + βkxk + ε

where d1, …, dq−1 are dummy regressors constructed to represent the q categories of x1.
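The incremental F test contrasting the dummy-regressor model with the linear model can be sketched directly. For a single discrete predictor, the dummy-variable model's fitted values are simply the group means, so the test reduces to comparing residual sums of squares; this is an illustrative sketch under that simplification, not code from the text.

```python
from collections import defaultdict

def incremental_f_for_nonlinearity(x, y):
    """Lack-of-fit F test for a single discrete predictor: contrast the
    dummy-regressor model (whose fitted values are the group means) with
    the straight-line model, as in Equations 8.1 and 8.2."""
    n = len(x)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    # RSS for the dummy-variable model: squared deviations from group means.
    rss_full = sum(sum((yi - sum(g) / len(g)) ** 2 for yi in g)
                   for g in groups.values())
    # RSS for the linear model: ordinary least squares on x.
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    rss_linear = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    q = len(groups)                  # number of distinct x categories
    df1, df2 = q - 2, n - q          # extra parameters; residual df
    f = ((rss_linear - rss_full) / df1) / (rss_full / df2)
    return f, df1, df2
```

For the vocabulary data, with q = 20 education categories and n = 968, this yields the incremental F on 18 and 948 degrees of freedom reported in Table 8.1.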
Testing for Nonconstant Error Variance
A discrete x (or combination of xs) partitions the data into q groups. Let yij denote the jth of the ni dependent-variable scores in the ith group. If the error variance is constant, then the within-group variance estimates

si2 = Σj (yij − ȳi)2 / (ni − 1)

should be similar. Here, ȳi is the mean in the ith group. Tests that examine the si2 directly, such as Bartlett's (1937) commonly employed test, do not maintain their validity well when the errors are nonnormal.
Many alternative tests have been proposed. In a large-scale simulation study, Conover, Johnson, and Johnson (1981) demonstrate that the following simple F test is both robust and powerful: Calculate the values zij = |yij − ỹi|, where ỹi is the median y within the ith group. Then perform a one-way analysis of variance of the variable z over the q groups. If the error variance is not constant across the groups, then the group means of z will tend to differ, producing a large value of the F test statistic. For the vocabulary data, for example, where education partitions the 968 observations into q = 20 groups, this test gives F19,948 = 1.48, p = .08, providing weak evidence of nonconstant spread.
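The median-based spread test just described can be sketched in a few lines: compute zij = |yij − mediani| and run a one-way ANOVA F on z across the groups. A minimal pure-Python version (an illustration, not the authors' code):

```python
from collections import defaultdict
from statistics import median

def spread_f_test(x, y):
    """Levene-type test for nonconstant spread: z_ij = |y_ij - median_i|,
    followed by a one-way ANOVA F on z across the q groups of x."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    z = [[abs(yi - median(g)) for yi in g] for g in groups.values()]
    n = sum(len(g) for g in z)
    q = len(z)
    grand = sum(sum(g) for g in z) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in z)
    ss_within = sum(sum((zi - sum(g) / len(g)) ** 2 for zi in g) for g in z)
    return (ss_between / (q - 1)) / (ss_within / (n - q)), q - 1, n - q
```

Applied to the vocabulary data, this is the F19,948 = 1.48 test quoted above.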
https://go.openathens.net/redirector/waldenu.edu?url=https://methods.sagepub.com/book/regressiondiagnostics/n8.xml
NonNormally Distributed Errors
The assumption of normally distributed errors is almost always arbitrary. Nevertheless, the central-limit theorem assures that under very broad conditions inference based on the least-squares estimators is approximately valid in all but small samples. Why, then, should we be concerned about nonnormal errors?
First, although the validity of least-squares estimation is robust—as stated, the levels of tests and confidence intervals are approximately correct in large samples even when the assumption of normality is violated—the method is not robust in efficiency: The least-squares estimator is maximally efficient among unbiased estimators when the errors are normal. For some types of error distributions, however, particularly those with heavy tails, the efficiency of least-squares estimation decreases markedly. In these cases, the least-squares estimator becomes much less efficient than alternatives (e.g., so-called robust estimators, or least squares augmented by diagnostics). To a substantial extent, heavy-tailed error distributions are problematic because they give rise to outliers, a problem that I addressed in the previous chapter.
A commonly quoted justification of least-squares estimation—called the Gauss–Markov theorem—states that the least-squares coefficients are the most efficient unbiased estimators that are linear functions of the observations yi. This result depends on the assumptions of linearity, constant error variance, and independence, but does not require normality (see, e.g., Fox, 1984, pp. 42–43). Although the restriction to linear estimators produces simple sampling properties, it is not compelling in light of the vulnerability of least squares to heavy-tailed error distributions.
Second, highly skewed error distributions, aside from their propensity to generate outliers in the direction of the skew, compromise the interpretation of the least-squares fit. This fit is, after all, a conditional mean (of y given the xs), and the mean is not a good measure of the center of a highly skewed distribution. Consequently, we may prefer to transform the data to produce a symmetric error distribution.
Finally, a multimodal error distribution suggests the omission of one or more qualitative variables that divide the data naturally into groups. An examination of the distribution of residuals may therefore motivate respecification of the model.
Although there are tests for nonnormal errors, I shall describe here instead graphical methods for examining the distribution of the residuals (but see Chapter 9). These methods are more useful for pinpointing the character of a problem and for suggesting solutions.
Normal Quantile-Comparison Plot of Residuals
One such graphical display is the quantile-comparison plot, which permits us to compare visually the cumulative distribution of an independent random sample—here of studentized residuals—to a cumulative reference distribution—the unit-normal distribution. Note that approximations are implied, because the studentized residuals are t distributed and dependent, but generally the distortion is negligible, at least for moderate-sized to large samples.
To construct the quantile-comparison plot:
1. Arrange the studentized residuals in ascending order: t(1), t(2), …, t(n). By convention, the ith largest studentized residual, t(i), has gi = (i − 1/2)/n proportion of the data below it. This convention avoids cumulative proportions of zero and one by (in effect) counting half of each observation below and half above its recorded value. Cumulative proportions of zero and one would be problematic because the normal distribution, to which we wish to compare the distribution of the residuals, never quite reaches cumulative probabilities of zero or one.
2. Find the quantile of the unit-normal distribution that corresponds to a cumulative probability of gi—that is, the value zi from Z ∼ N(0, 1) for which Pr(Z < zi) = gi.
3. Plot the t(i) against the zi.
If the ti were drawn from a unit-normal distribution, then, within the bounds of sampling error, t(i) ≈ zi. Consequently, we expect to find an approximately linear plot with zero intercept and unit slope, a line that can be placed on the plot for comparison. Nonlinearity in the plot, in contrast, is symptomatic of nonnormality.
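The three steps above translate directly into code. A minimal sketch using Python's standard library (the function name is mine):

```python
from statistics import NormalDist

def qq_coordinates(residuals):
    """Coordinates for a normal quantile-comparison plot: sort the
    residuals, attach cumulative proportions g_i = (i - 1/2)/n, and pair
    each t_(i) with the unit-normal quantile z_i for which Pr(Z < z_i) = g_i."""
    n = len(residuals)
    t_sorted = sorted(residuals)
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, t_sorted))  # plot each t_(i) against its z_i
```

Points lying near the line t = z are consistent with normally distributed residuals; systematic curvature signals skew or heavy tails.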
It is sometimes advantageous to adjust the fitted line for the observed center and spread of the residuals. To understand how the adjustment may be accomplished, suppose more generally that a variable X is normally distributed with mean μ and variance σ2. Then, for an ordered sample of values, approximately x(i) = μ + σzi, where zi is defined as before. In applications, we need to estimate μ and σ, preferably robustly, because the usual estimators—the sample mean and standard deviation—are markedly affected by extreme values. Generally effective choices are the median of x to estimate μ and (Q3 − Q1)/1.349 to estimate σ, where Q1 and Q3 are, respectively, the first and third quartiles of x: The median and quartiles are not sensitive to outliers. Note that 1.349 is the number of standard deviations separating the quartiles of a normal distribution. Applied to the studentized residuals, we have the fitted line t̂(i) = median(t) + {[Q3(t) − Q1(t)]/1.349} × zi. The normal quantile-comparison plots in this monograph employ the more general procedure.
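The robust center and spread for the adjusted line can be computed as follows. This is an illustrative sketch; `quantiles(..., method="inclusive")` is one of several quartile conventions and may differ slightly from the one used in the monograph.

```python
from statistics import median, quantiles

def robust_line_params(t):
    """Resistant center and spread for the adjusted reference line
    t-hat_(i) = median(t) + [(Q3 - Q1)/1.349] * z_i: the median estimates
    the center, and the interquartile range divided by 1.349 (the quartile
    separation of a normal distribution) estimates the spread."""
    q1, _, q3 = quantiles(t, n=4, method="inclusive")
    return median(t), (q3 - q1) / 1.349
```
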
Several illustrative normal-probability plots for simulated data are shown in Figure 5.1. In parts a and b of the figure, independent samples of size n = 25 and n = 100, respectively, were drawn from a unit-normal distribution. In parts c and d, samples of size n = 100 were drawn from the highly positively skewed χ2 distribution with 4 degrees of freedom and from the heavy-tailed t distribution with 2 degrees of freedom, respectively. Note how the skew and heavy tails show up as departures from linearity in the normal quantile-comparison plots. Outliers are discernible as unusually large or small values in comparison with corresponding normal quantiles.
Judging departures from normality can be assisted by plotting information about sampling variation. If the studentized residuals were drawn independently from a unit-normal distribution, then

SE(t(i)) ≈ [gi(1 − gi)/n]1/2 / ϕ(zi)

where ϕ(zi) is the probability density (i.e., the "height") of the unit-normal distribution at Z = zi. Thus, zi ± 2 × SE(t(i)) gives a rough 95% confidence interval around the fitted line t̂(i) = zi in the quantile-comparison plot. If the slope of the fitted line is taken as σ̂ = (Q3 − Q1)/1.349 rather than 1, then the estimated standard error may be multiplied by σ̂. As an alternative to computing standard errors, Atkinson (1985) has suggested a computationally intensive simulation procedure that does not treat the studentized residuals as independent and normally distributed.
Figure 5.1. Illustrative normal quantile-comparison plots. (a) For a sample of n = 25 from N(0, 1). (b) For a sample of n = 100 from N(0, 1). (c) For a sample of n = 100 from the positively skewed χ2 distribution with 4 degrees of freedom. (d) For a sample of n = 100 from the heavy-tailed t distribution with 2 degrees of freedom.
Figure 5.2 shows a normal quantile-comparison plot for the studentized residuals from Duncan's regression of rated prestige on occupational income and education levels. The plot includes a fitted line with two-standard-error limits. Note that the residual distribution is reasonably well behaved.
Figure 5.2. Normal quantile-comparison plot for the studentized residuals from the regression of occupational prestige on income and education. The plot shows a fitted line, based on the median and quartiles of the ts, and approximate ±2SE limits around the line.
Histograms of Residuals
A strength of the normal quantilecomparison plot is that it retains high resolution in the tails of the distribution, where problems often manifest themselves. A weakness of the display, however, is that it does not convey a good overall sense of the shape of the distribution of the residuals. For example, multiple modes are difficult to discern in a quantilecomparison plot.
Histograms (frequency bar graphs), in contrast, have poor resolution in the tails or wherever data are sparse, but do a good job of conveying general distributional information. The arbitrary class boundaries, arbitrary intervals, and roughness of histograms sometimes produce misleading impressions of the data, however. These problems can partly be addressed by smoothing the histogram (see Silverman, 1986, or Fox, 1990). Generally, I prefer to employ stem-and-leaf displays—a type of histogram (Tukey, 1977) that records the numerical data values directly in the bars of the graph—for small samples (say n < 100), smoothed histograms for moderate-sized samples (say 100 ≤ n ≤ 1,000), and histograms with relatively narrow bars for large samples (say n > 1,000).
Figure 5.3. Stem-and-leaf display of studentized residuals from the regression of occupational prestige on income and education.
A stem-and-leaf display of studentized residuals from the Duncan regression is shown in Figure 5.3. The display reveals nothing of note: There is a single mode, the distribution appears reasonably symmetric, and there are no obvious outliers, although the largest value (3.1) is somewhat separated from the next-largest value (2.0).
Each data value in the stem-and-leaf display is broken into two parts: The leading digits comprise the stem; the first trailing digit forms the leaf; and the remaining trailing digits are discarded, thus truncating rather than rounding the data value. (Truncation makes it simpler to locate values in a list or table.) For studentized residuals, it is usually sensible to make this break at the decimal point. For example, for the residuals shown in Figure 5.3: 0.3039 → 0 | 3; 3.1345 → 3 | 1; and 0.4981 → 0 | 4. Note that each stem digit appears twice, implicitly producing bins of width 0.5. Stems marked with asterisks (e.g., 1∗) take leaves 0–4; stems marked with periods (e.g., 1.) take leaves 5–9. (For more information about stem-and-leaf displays, see, e.g., Velleman and Hoaglin [1981] or Fox [1990].)
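The truncation and split-stem rules can be sketched for nonnegative values (negative residuals would receive mirrored stems, omitted here for brevity; the function is mine, not from the text):

```python
def stem_and_leaf(values):
    """Stem-and-leaf sketch for nonnegative values, with each stem split in
    halves as in Figure 5.3: values are truncated (not rounded) at the
    decimal point; '*' stems take leaves 0-4, '.' stems take leaves 5-9."""
    rows = {}
    for v in sorted(values):
        stem = int(v)            # leading digit(s): the stem
        leaf = int(v * 10) % 10  # first trailing digit: the leaf (truncated)
        half = "*" if leaf <= 4 else "."
        rows.setdefault((stem, half), []).append(str(leaf))
    return [f"{stem}{half} {' '.join(leaves)}"
            for (stem, half), leaves in sorted(rows.items())]
```

For the example values in the text, `stem_and_leaf([0.3039, 3.1345, 0.4981, 0.75])` groups 0.3 and 0.4 on the lower "0∗" stem, 0.7 on the upper "0." stem, and 3.1 on its own "3∗" stem.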
Figure 5.4. The family of powers and roots. The transformation labeled “p” is actually y’ = (yp – 1)/p; for p = 0, y’ = logey.
Click here to downloadicon download
SOURCE: Adapted with permission from Figure 41 from Hoaglin, Mosteller, and Tukey (eds.). Understanding Robust and Exploratory Data Analysis, © 1983 by John Wiley and Sons, Inc.
Correcting Asymmetry by Transformation
A frequently effective approach to a variety of problems in regression analysis is to transform the data so that they conform more closely to the assumptions of the linear model. In this and later chapters I shall introduce transformations to produce symmetry in the error distribution, to stabilize error variance, and to make the relationship between y and the xs linear.
In each of these cases, we shall employ the family of powers and roots, replacing a variable y (used here generically, because later we shall want to transform xs as well) by y′ = yp. Typically, p = −2, −1, −1/2, 1/2, 2, or 3, although sometimes other powers and roots are considered. Note that p = 1 represents no transformation. In place of the 0th power, which would be useless because y0 = 1 regardless of the value of y, we take y′ = log y, usually using base 2 or 10 for the log function. Because logs to different bases differ only by a constant factor, we can select the base for convenience of interpretation. Using the log transformation as a "zeroth power" is reasonable, because the closer p gets to zero, the more yp looks like the log function (formally, limp→0 [(yp − 1)/p] = logey, where the log to the base e ≈ 2.718 is the so-called "natural" logarithm). Finally, for negative powers, we take y′ = −yp, preserving the order of the y values, which would otherwise be reversed.
As we move away from p = 1 in either direction, the transformations get stronger, as illustrated in Figure 5.4. The effect of some of these transformations is shown in Table 5.1a. Transformations “up the ladder” of powers and roots (a term borrowed from Tukey, 1977)—that is, toward y2—serve differentially to spread out large values of y relative to small ones; transformations “down the ladder”—toward log y—have the opposite effect. To correct a positive skew (as in Table 5.1b), it is therefore necessary to move down the ladder; to correct a negative skew (Table 5.1c), which is less common in applications, move up the ladder.
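The ladder of powers can be expressed in the scaled form (yp − 1)/p noted above, which passes smoothly through the log transformation at p = 0 and preserves order for negative p without a separate sign flip. A minimal sketch:

```python
import math

def power_transform(y, p):
    """Family of powers and roots in the scaled form y' = (y**p - 1)/p,
    with log(y) taking the place of the useless 0th power; requires y > 0.
    This form is order-preserving for every p, including negative powers."""
    if p == 0:
        return math.log(y)
    return (y ** p - 1) / p
```

Moving p below 1 pulls in large values relative to small ones (correcting positive skew); moving p above 1 does the opposite.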
I have implicitly assumed that all data values are positive, a condition that must hold for power transformations to maintain order. In practice, negative values can be eliminated prior to transformation by adding a small constant, sometimes called a “start,” to the data. Likewise, for power transformations to be effective, the ratio of the largest to the smallest data value must be sufficiently large; otherwise the transformation will be too nearly linear. A small ratio can be dealt with by using a negative start.
In the specific context of regression analysis, a skewed error distribution, revealed by examining the distribution of the residuals, can often be corrected by transforming the dependent variable. Although more sophisticated approaches are available (see, e.g., Chapter 9), a good transformation can be located by trial and error.
Dependent variables that are bounded below, and hence that tend to be positively skewed, often respond well to transformations down the ladder of powers. Power transformations usually do not work well, however, when many values stack up against the boundary, a situation termed truncation or censoring (see, e.g., Tobin [1958] for a treatment of "limited" dependent variables in regression). As well, data that are bounded both above and below—such as proportions and percentages—generally require another approach. For example, the logit or "log odds" transformation, given by y′ = log[y/(1 − y)], often works well for proportions.
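The logit transformation just mentioned is a one-liner; it requires values strictly between 0 and 1:

```python
import math

def logit(y):
    """Log-odds transformation for data bounded in (0, 1):
    y' = log(y / (1 - y)).  Spreads out values near both boundaries."""
    return math.log(y / (1 - y))
```

The transformation is symmetric about y = 0.5, where it equals zero.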
TABLE 5.1 Correcting Skews by Power Transformations
https://go.openathens.net/redirector/waldenu.edu?url=https://dx.doi.org/10.4135/9781412985604.n5
Walden University, LLC. (Producer). (2016m). Regression diagnostics and model evaluation [Video file]. Baltimore, MD: Author.
Estimating Models Using Dummy Variables
Name
Institution Name
Course Name
Professor's Name
Date
The topic for this week's discussion is the relationship between respondents' sex and labor force status. According to the preliminary information, these variables' frequency distributions exhibited no outliers. A total of 2,536 people were included in the study. The male respondent dummy (M = .4491, SD = .49750) and the female respondent dummy (M = .5509, SD = .4975) were analyzed along with labor force status (M = 3, SD = 2.355). The sex of the respondent was used as a predictor in a linear regression of employment status. The analysis found F(1, 2534) = 73.585, p = .001, and a modest effect, R2 = .028.
Respondent's Sex: Case Processing Summary (LABOR FORCE STATUS by SEX)

         Valid            Missing         Total
Sex      N      Percent   N    Percent    N      Percent
MALE     1139   99.8%     2    0.2%       1141   100.0%
FEMALE   1397   100.0%    0    0.0%       1397   100.0%
Labor Force Statistics: Descriptive Statistics

                     Mean    Std. Deviation   N
LABOR FORCE STATUS   3.00    2.355            2536
FemaleRespondent     .5509   .49750           2536
MaleRespondents      .4491   .49750           2536
Correlations

Pearson Correlation    LABOR FORCE STATUS   FemaleRespondent   MaleRespondents
LABOR FORCE STATUS     1.000                .168               -.168
FemaleRespondent       .168                 1.000              -1.000
MaleRespondents        -.168                -1.000             1.000

Sig. (1-tailed)        LABOR FORCE STATUS   FemaleRespondent   MaleRespondents
LABOR FORCE STATUS     .                    .000               .000
FemaleRespondent       .000                 .                  .000
MaleRespondents        .000                 .000               .

N = 2536 for all cells.
ANOVA(a)

Model        Sum of Squares   df     Mean Square   F        Sig.
Regression   396.625          1      396.625       73.585   .000(b)
Residual     13658.356        2534   5.390
Total        14054.981        2535

a. Dependent Variable: LABOR FORCE STATUS
b. Predictors: (Constant), MaleRespondents
Coefficients(a)

Model             B       Std. Error   Beta    t        Sig.   Zero-order   Partial   Part    Tolerance   VIF
(Constant)        3.354   .062                 54.002   .000
MaleRespondents   -.795   .093         -.168   -8.578   .000   -.168        -.168     -.168   1.000       1.000

a. Dependent Variable: LABOR FORCE STATUS
Excluded Variables(a)

Model                 Beta In   t   Sig.   Partial Correlation   Tolerance   VIF   Minimum Tolerance
1  FemaleRespondent   .(b)      .   .      .                     .000        .     .000

a. Dependent Variable: LABOR FORCE STATUS
b. Predictors in the Model: (Constant), MaleRespondents
Collinearity Diagnostics(a)

                                          Variance Proportions
Model   Dimension   Eigenvalue   Condition Index   (Constant)   MaleRespondents
1       1           1.670        1.000             .16          .16
        2           .330         2.250             .84          .84

a. Dependent Variable: LABOR FORCE STATUS
RE: Discussion – Week 10
A dummy variable is a dichotomy with two categories not including missing values (Wagner, 2020).
Multiple regression extends bivariate regression by examining the effect of two or more independent variables on the dependent variable (Frankfort-Nachmias et al., 2020). Dummy coding makes nominal variables usable in analyses, such as multiple regression models, that require interval- or ratio-level variables.
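In SPSS this recoding is typically done through the Recode dialogs or syntax; as a language-neutral illustration, the same idea can be sketched in Python (the category labels below are hypothetical, not the actual GSS codes):

```python
def dummy_code(values, reference):
    """Recode a nominal variable into 0/1 indicator (dummy) variables,
    omitting the reference category so the dummies are not perfectly
    collinear with the intercept."""
    categories = sorted(set(values))
    categories.remove(reference)
    return {f"is_{c}": [1 if v == c else 0 for v in values]
            for c in categories}

# Hypothetical dwelling-status labels.
dwelling = ["OWNS", "RENTS", "OWNS", "OTHER"]
dummies = dummy_code(dwelling, reference="OTHER")
```

Three categories yield two dummies, with the omitted "OTHER" group serving as the baseline against which each coefficient is interpreted.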
General Social Survey Dataset’s Mean of Age
In the General Social Survey dataset, the mean age of respondents is 49.01 and the median, the "middle score," is 49.00, which at first glance suggests a normal distribution; however, the lowest mode is 53 and the level of skewness is positive, which makes the data skewed slightly to the right. Thus, the mean indicates the results may not be generalizable across all age groups. I chose to display the data on the respondents' highest degree and age.
Research Question
Using the General Social Survey (GSS) dataset, a research question was constructed to be answered with a regression model. The dependent variable is the respondents' highest year of school completed, and the two independent variables used to construct the research question are owning or buying a home and paying rent. The highest year of school completed can influence individuals' ability to either purchase a home or rent one. Using the GSS data, the following research question and hypotheses were developed:
· Research question: Are people with more years of school completed more likely to own or be buying a home rather than paying rent?
· The null hypothesis (Ho): There is no relationship between the highest year of school completed and owning or buying a home and paying rent.
· The alternative hypothesis (Ha): There is a relationship between the highest year of school completed and owning or buying a home and paying rent.
Interpretation of the Coefficients of the Model
Table 1 shows the coefficients of owning or buying a home and paying rent for the highest year of school completed. The unstandardized coefficient for "own or is buying" is 1.960 (Table 1), while the coefficient for "pays rent" is smaller, at .839 (Table 1) units of measure. The significance level for the "own or is buying" predictor is .001 (Table 1), well below the .05 threshold, so we can reject the null hypothesis of no relationship between the highest year of school completed and owning or buying a home (Walden University, 2016). The "pays rent" predictor, however, does not reach significance (p = .168, Table 1). A Variance Inflation Factor (VIF) above 10 indicates serious multicollinearity in the model, meaning the independent variables of own/buying and pays rent have a high level of correlation with each other (Walden University, 2016). Therefore, the assumption is not met, since the VIF is 16.106 (Table 1). Because both dummies come from the same categorical dwelling variable with a very small reference category, one possible remedy is to respecify the coding, for example by choosing a larger reference category; left uncorrected, the model risks failing to capture the systematic pattern of relationship between the dependent and independent variables (Nonlinearity, 1991).
Table 1

Coefficients(a)

Model                      B        Std. Error   Beta   t        Sig.    Tolerance   VIF
(Constant)                 12.240   .596                20.554   <.001
dwelown=OWN OR IS BUYING   1.960    .603         .314   3.252    .001    .062        16.106
dwelown=PAYS RENT          .839     .608         .133   1.381    .168    .062        16.106

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
Analysis of the Model Summary
The Durbin–Watson statistic takes values from 0 to 4.0 and provides information about the independence of errors (Walden University, 2016). The model summary calculates a Durbin–Watson value of 1.622, which shows there is no serious correlation between residuals (Walden University, 2016). Durbin–Watson values below 1.0 and above 3.0 are considered dangerous because the model then suffers from serious autocorrelation, that is, correlation of the residuals with a lagged version of themselves over various time intervals (Walden University, 2016). The model summary calculates an R square of .035, revealing a negligible effect that is only slightly meaningful, since a higher effect size means that a research finding has practical significance, while a negligible effect size indicates limited practical applications (Warner, 2012).
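The Durbin–Watson statistic itself is simple to compute from the ordered residuals; a minimal sketch (the residual values in any example are illustrative, not from this model):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: d = sum((e_t - e_{t-1})**2) / sum(e_t**2).
    Values near 2 indicate little first-order autocorrelation; values near
    0 or 4 indicate strong positive or negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den
```

A run of same-signed residuals pushes d below 2 (positive autocorrelation), while strictly alternating residuals push it toward 4.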
Diagnostics for the Regression Model
The residuals statistics in Table 2 show the Cook's Distance values, which, if greater than or equal to 1.0, are considered problematic, and further diagnostics should then be performed for undue influence (Walden University, 2016). Influential observations matter because their presence may signal that the regression model fails to capture important characteristics of the data (Outlying and Influential Data, 1991). For example, specific outliers on one or more of the variables might exert undue influence on the model and have a significant impact (Walden University, 2016). Table 2 shows values for Cook's Distance ranging from a minimum of .000 to a maximum of .037 (Table 2), well below 1.0, revealing that there is no undue influence in this model (Walden University, 2016). The negative minimum residual of −14.200 (Table 2) means that the predicted value was too high for that case, and the positive maximum residual of 6.921 (Table 2) means that the predicted value was too low. Thus, the aim of a regression model is to explain variability in the dependent variable by means of one or more independent or control variables.
Table 2

Residuals Statistics(a)

                                    Minimum   Maximum   Mean    Std. Deviation   N
Predicted Value                     12.24     14.20     13.76   .569             1669
Std. Predicted Value                -2.672    .770      .000    1.000            1669
Standard Error of Predicted Value   .061      .110      .093                     1669
Adjusted Predicted Value            12.04     14.21                              1669
Residual                            -14.200   6.921     .000    2.976            1669
Std. Residual                       -4.769    2.324     .000    .999             1669
Stud. Residual                      -4.771    2.326     .000    1.000            1669
Deleted Residual                    -14.214   6.933     .000    2.981            1669
Stud. Deleted Residual              -4.803    2.329     .000    1.001            1669
Mahal. Distance                     .612      65.721    1.999   7.879            1669
Cook's Distance                     .000      .037      .001    .002             1669
Centered Leverage Value             .000      .039      .001    .005             1669

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
Conclusion
The regression model is used to test simple hypotheses and to analyze associations, including possible causal ones, between the dependent variable and two or more independent variables. Visual displays of data can be invaluable because they make it easier to understand statistical findings through visual representation. After all, research can be useful when it organizes, summarizes, and communicates information (Frankfort-Nachmias et al., 2020). Therefore, the biggest takeaway from the model of respondents' highest year of school completed is whether the model meets all the assumptions and whether possible remedies can be found for any violations.
References
Frankfort-Nachmias, C., Leon-Guerrero, A., & Davis, G. (2020). Social statistics for a diverse society (9th ed.). Sage Publications.
Nonlinearity. (1991). In J. Fox (Ed.), Regression diagnostics (pp. 54–62). SAGE Publications, Inc.
Outlying and influential data. (1991). In J. Fox (Ed.), Regression diagnostics (pp. 22–41). SAGE Publications, Inc.
Wagner, W. E., III. (2020). Using IBM® SPSS® Statistics for research methods and social science statistics (7th ed.). Sage Publications.
Walden University, LLC. (Producer). (2016). Regression diagnostics and model evaluation [Video file]. Author.
Warner, R. M. (2012). Applied statistics: From bivariate through multivariate techniques (2nd ed.). Sage Publications.