{"id":111701,"date":"2018-07-18T16:00:00","date_gmt":"2018-07-18T21:00:00","guid":{"rendered":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/\/\/how-to-break-regression\/"},"modified":"2024-04-14T04:10:46","modified_gmt":"2024-04-14T09:10:46","slug":"how-to-break-regression","status":"publish","type":"decoded","link":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/2018\/07\/18\/how-to-break-regression\/","title":{"rendered":"How to break regression"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter size-640-wide\"><a rel=\"attachment wp-att-126056\" href=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/\/\/how-to-break-regression\/07-13-2018_featured-png\/\"><img data-dominant-color=\"f1deda\" data-has-transparency=\"false\" style=\"--dominant-color: #f1deda;\" loading=\"lazy\" decoding=\"async\"  srcset=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?resize=480,270 480w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?resize=782,440 782w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?resize=960,540 960w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?resize=1200,675 1200w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?resize=1400,788 1400w\" sizes=\"(max-width: 480px) 480px, (max-width: 782px) 782px, 640px\" height=\"360\" width=\"640\" src=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?w=640\" alt=\"\" class=\"wp-image-126056 not-transparent\" \/><\/a><figcaption class=\"wp-element-caption\">(Photo illustration by Pew Research Center and iStock.com\/Eshma)<\/figcaption><\/figure>\n\n\n\n<p 
class=\"wp-block-paragraph\" id=\"f9cb\">Regression models are a cornerstone of modern social science. They\u2019re at the heart of efforts to estimate causal relationships between variables in a multivariate environment and are the basic building blocks of many machine learning models. Yet social scientists can run into a lot of situations where regression models break.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"ae15\">Famed social psychologist Richard Nisbett&nbsp;<a href=\"https:\/\/www.edge.org\/conversation\/richard_nisbett-the-crusade-against-multiple-regression-analysis\" rel=\"noreferrer noopener\" target=\"_blank\">recently argued<\/a>&nbsp;that regression analysis is so misused and misunderstood that analyses based on multiple regression \u201care often somewhere between meaningless and quite damaging.\u201d (He was mainly talking about cases in which researchers publish correlational results that are covered in the media as causal statements about the world.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"d2e9\">Below, I\u2019ll walk through some of the potential pitfalls you might encounter when you fire up your favorite&nbsp;<a href=\"https:\/\/seanjtaylor.com\/post\/39573264781\/the-statistics-software-signal\" target=\"_blank\" rel=\"noreferrer noopener\">statistical software<\/a>&nbsp;package and run regressions. Specifically, I\u2019ll be using simulation in R as an educational tool to help you better understand the ways in which regressions can break.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"628e\"><strong>Using simulations to unpack regression<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"4b94\">The idea of using R simulations to help understand regression models was inspired by Ben Ogorek\u2019s&nbsp;<a href=\"http:\/\/anythingbutrbitrary.blogspot.com\/2016\/01\/how-to-create-confounders-with.html\" target=\"_blank\" rel=\"noreferrer noopener\">post<\/a>&nbsp;on regression confounders and collider bias. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The great thing about using simulation in this way is that you control the world that generates your data. The code I\u2019ll introduce below represents the true&nbsp;<em>data-generating process<\/em>, since I\u2019m using R\u2019s random number generators to simulate the data. In real life, of course, we only have the data we observe, and we don\u2019t really know how the data-generating process works unless we have a solid theory (like Newtonian physics or evolution) in which the system of relevant variables and causal relationships is well understood. Social science has no real analogue to such theories.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What I\u2019ll do here is create a dataset based on two random standard normal variables by simulating them using the&nbsp;<em>rnorm()<\/em>&nbsp;function, which draws random values from a normal distribution with mean 0 and standard deviation 1, unless you specify otherwise. I\u2019ll create a functional relationship between y and x such that a 1-unit increase in x will be associated with a .4-unit increase in y.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># make the code reproducible by setting a random number seed\nset.seed(100)\n\n# When everything works:\nN &lt;- 1000\nx &lt;- rnorm(N)\ny &lt;- .4 * x + rnorm(N)\nhist(x)\nhist(y)\n\n# Now estimate our model:\nsummary(lm(y ~ x))\n\n## Call:\n## lm(formula = y ~ x)\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -3.0348 -0.7013  0.0085  0.6212  3.1688 \n## Coefficients:\n##             Estimate Std. Error t value Pr(&gt;|t|)    \n## (Intercept) 0.003921   0.031039   0.126    0.899    \n## x           0.413415   0.030129  13.722   &lt;2e-16 ***\n## ---\n## Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n## Residual standard error: 0.9814 on 998 degrees of freedom\n## Multiple R-squared:  0.1587,\tAdjusted R-squared:  0.1579 \n## F-statistic: 188.3 on 1 and 998 DF,  p-value: &lt; 2.2e-16\n\n# Plot it\nlibrary(ggplot2)\nqplot(x, y) +\n  geom_smooth(method='lm') +\n  theme_bw() +\n  ggtitle(\"The Perfect Regression\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Notice that the model recovers the relationship I simulated between x and y quite well: the estimated slope of 0.41 is close to the true value of .4. The plot looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a rel=\"attachment wp-att-126059\" href=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/\/\/how-to-break-regression\/07-13-2018_regression-png\/\"><img data-dominant-color=\"e6e6e7\" data-has-transparency=\"false\" style=\"--dominant-color: #e6e6e7;\" loading=\"lazy\" decoding=\"async\" width=\"468\" height=\"387\"  srcset=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_Regression.png?resize=468,387 468w\" sizes=\"(max-width: 480px) 480px, (max-width: 782px) 782px, 640px\" src=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_Regression.png\" alt=\"\" class=\"wp-image-126059 not-transparent\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">What about omitted variables? Our machinery actually still works if there is another factor causing y, as long as it is&nbsp;<em>uncorrelated<\/em>&nbsp;with x.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"c0e3\"><strong>The dreaded omitted variable bias<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"c87b\">Omitted variable bias (OVB) is much feared, and judging by the top internet search results, not well understood. 
Some top sources say it occurs when \u201c<a href=\"http:\/\/carecon.org.uk\/UWEcourse\/OVbias.pdf\" rel=\"noreferrer noopener\" target=\"_blank\">an important<\/a>\u201d variable is missing or when a variable that \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Omitted-variable_bias\" rel=\"noreferrer noopener\" target=\"_blank\">is correlated<\/a>\u201d with both x and y is missing. I even found a university&nbsp;<a href=\"http:\/\/www3.wabash.edu\/econometrics\/EconometricsBook\/chap18.htm\" rel=\"noreferrer noopener\" target=\"_blank\">econometrics<\/a>&nbsp;course that defined OVB this way.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"d272\">But neither of those definitions is quite right. OVB occurs when a variable that&nbsp;<em>causes&nbsp;<\/em>y is missing from the model (and is correlated with x). Let\u2019s call that variable w. Because w is in play when we consider the causal relationship between x and y, it\u2019s often referred to as \u201cendogenous\u201d or a \u201cconfounding variable.\u201d<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The example below first demonstrates that w, our confounding variable, will bias our results if we fail to include it in our model. The next two examples are essentially a retelling of the&nbsp;<a href=\"http:\/\/anythingbutrbitrary.blogspot.com\/2016\/01\/how-to-create-confounders-with.html\" target=\"_blank\" rel=\"noreferrer noopener\">post I mentioned above<\/a>&nbsp;on collider bias, but emphasizing slightly different points.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>w &lt;- rnorm(N)\nx &lt;- .5 * w + rnorm(N)\ny &lt;- .4 * x + .3 * w + rnorm(N)\n\nm1 &lt;- lm(y ~ x)\nsummary(m1) # Omitted variable bias\n\n## Call:\n## lm(formula = y ~ x)\n\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -3.2190 -0.7025  0.0314  0.7120  3.1158 \n\n## Coefficients:\n##             Estimate Std. Error t value Pr(&gt;|t|)    \n## (Intercept)  0.01126    0.03310    0.34    0.734    \n## x            0.50179    0.03049   16.46   &lt;2e-16 ***\n## ---\n## Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n\n## Residual standard error: 1.046 on 998 degrees of freedom\n## Multiple R-squared:  0.2135,\tAdjusted R-squared:  0.2127 \n## F-statistic: 270.9 on 1 and 998 DF,  p-value: &lt; 2.2e-16<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">There it is: classic omitted variable bias. We only observed x, so the influence of the omitted variable w was attributed to x in our model: the estimated slope is 0.50, well above the true value of .4. If you rerun the regression with w in the model, you no longer get biased estimates.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>m2 &lt;- lm(y ~ x + w)\nsummary(m2) # No omitted variable bias after conditioning on w\n\n## Call:\n## lm(formula = y ~ x + w)\n\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -3.2748 -0.6632 -0.0001  0.6933  2.9664 \n\n## Coefficients:\n##             Estimate Std. Error t value Pr(&gt;|t|)    \n## (Intercept)  0.02841    0.03141   0.905    0.366    \n## x            0.40627    0.03132  12.973   &lt;2e-16 ***\n## w            0.32344    0.03439   9.405   &lt;2e-16 ***\n## ---\n## Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n\n## Residual standard error: 0.9927 on 997 degrees of freedom\n## Multiple R-squared:  0.3024,\tAdjusted R-squared:  0.301 \n## F-statistic: 216.1 on 2 and 997 DF,  p-value: &lt; 2.2e-16<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Note that the residuals from m1, the model that omits w, are correlated with w:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cor(w, m1$residuals)\n## &#091;1] 0.2597859<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Recall that I wrote above that it\u2019s wrong to say that OVB occurs when our omitted variable is correlated with both x and y. And yet w, x and y are all correlated with one another in this first example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>cormatrix &lt;- cor(as.matrix(data.frame(x, y, w)))\nround(cormatrix, 2)\n\n##      x    y    w\n## x 1.00 0.49 0.41\n## y 0.49 1.00 0.43\n## w 0.41 0.43 1.00<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"3629\">So why can\u2019t we just say that OVB occurs when our omitted variable is correlated with both x and y? As the next example will show, correlation isn\u2019t enough \u2014 w needs to&nbsp;<em>cause&nbsp;<\/em>both x and y. We can easily imagine a case in which w causes neither x nor y but we still see this kind of correlation \u2014 when x and y both cause w.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"7c60\">Let\u2019s make this a little more concrete. Suppose we care about the effect of news media consumption (x) on voter turnout (y). One factor that some researchers think may cause both news media consumption and turnout is political interest (w). 
If we only measure media consumption and voter turnout, political interest is likely to confound our estimates.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But another school of thought from social psychology \u2014 along the lines of self-perception theory and&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Cognitive_dissonance\" target=\"_blank\" rel=\"noreferrer noopener\">cognitive dissonance<\/a>&nbsp;\u2014 suggests that the causality could be reversed: Voting behavior might be mostly determined by other factors, and casting a ballot might prompt us to be&nbsp;<em>more<\/em>&nbsp;interested in political developments in the future. Similarly, watching the news might prompt us to become&nbsp;<em>more<\/em>&nbsp;interested in politics. Let\u2019s suppose that second school of thought is right. If so, our simulated data will look like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>media_consumption_x &lt;- rnorm(N)\nvoter_turnout_y &lt;- .1 * media_consumption_x + rnorm(N)\n\n# Political interest increases after consuming media and participating, and,\n# in this hypothetical world, does *not* increase media consumption or participation\npolitical_interest_w &lt;- 1.2 * media_consumption_x + .6 * voter_turnout_y + rnorm(N)\n\ncormat &lt;- cor(as.matrix(data.frame(media_consumption_x, voter_turnout_y, political_interest_w)))\nround(cormat, 2)\n\n##                      media_consumption_x voter_turnout_y political_interest_w\n## media_consumption_x                 1.00            0.11                 0.70\n## voter_turnout_y                     0.11            1.00                 0.46\n## political_interest_w                0.70            0.46                 1.00<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">As you can see, all factors are again correlated with each other. 
But this time, if we&nbsp;<em>only<\/em>&nbsp;include x (media consumption) and y (turnout) in the equation, we get an estimate close to the true value of .1:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>summary(lm(voter_turnout_y ~ media_consumption_x))\n\n## Call:\n## lm(formula = voter_turnout_y ~ media_consumption_x)\n\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -2.8460 -0.6972 -0.0076  0.6702  3.3925 \n\n## Coefficients:\n##                     Estimate Std. Error t value Pr(&gt;|t|)    \n## (Intercept)         -0.01202    0.03217  -0.374 0.708839    \n## media_consumption_x  0.11719    0.03321   3.529 0.000436 ***\n## ---\n## Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n\n## Residual standard error: 1.014 on 998 degrees of freedom\n## Multiple R-squared:  0.01233,\tAdjusted R-squared:  0.01134 \n## F-statistic: 12.46 on 1 and 998 DF,  p-value: 0.0004359<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">What makes defining omitted variable bias based on correlation so dangerous is that if we now include w (political interest), we will get a different kind of bias \u2014 what\u2019s called&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Collider_(epidemiology)\" target=\"_blank\" rel=\"noreferrer noopener\">collider bias<\/a>&nbsp;or&nbsp;<a href=\"https:\/\/www.annualreviews.org\/doi\/10.1146\/annurev-soc-071913-043455\" target=\"_blank\" rel=\"noreferrer noopener\">endogenous selection bias<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>summary(lm(voter_turnout_y ~ media_consumption_x + political_interest_w))\n\n## Call:\n## lm(formula = voter_turnout_y ~ media_consumption_x + political_interest_w)\n\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -2.1569 -0.5981 -0.0129  0.5701  2.8356 \n\n## Coefficients:\n##                       Estimate Std. Error t value Pr(&gt;|t|)    \n## (Intercept)           0.003155   0.027098   0.116    0.907    \n## media_consumption_x  -0.437084   0.039102 -11.178   &lt;2e-16 ***\n## political_interest_w  0.444571   0.021928  20.274   &lt;2e-16 ***\n## ---\n## Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n\n## Residual standard error: 0.854 on 997 degrees of freedom\n## Multiple R-squared:  0.3007,\tAdjusted R-squared:  0.2993 \n## F-statistic: 214.3 on 2 and 997 DF,  p-value: &lt; 2.2e-16<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">With political interest in the model, the estimated effect of media consumption flips from roughly the true value of .1 to about -0.44, even though nothing about the data-generating process has changed.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"cad3\"><strong>Simpson\u2019s paradox<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"910f\">Simpson\u2019s paradox often occurs in social science (and medicine, too) when you pool data instead of conditioning on group membership (i.e., adding group membership as a factor in your regression model).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"e218\">Suppose that, all other things being equal, consuming media causes a slight shift in policy preferences toward the left. But, on average, Republicans consume more news than non-Republicans. 
And we know that generally Republicans have much more right-leaning preferences.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If we just measure media consumption and policy preferences without including Republican identification in the model, we\u2019ll actually estimate that the effect goes in the direction&nbsp;<em>opposite<\/em>&nbsp;of the true causal effect.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>N &lt;- 1000\n\n# Let's say that 40% of people in this population are Republicans\nrepublican &lt;- rbinom(N, 1, .4)\n\n# And they consume more media\nmedia_consumption &lt;- .75 * republican + rnorm(N)\n\n# Consuming more media causes a slight leftward shift in policy\n# preferences, and Republicans have more right-leaning preferences\npolicy_prefs &lt;- -.2 * media_consumption + 2 * republican + rnorm(N)\n\n# for easier plotting later\ndf &lt;- data.frame(media_consumption, policy_prefs, republican)\ndf$republican &lt;- factor(c(\"non-republican\", \"republican\")&#091;df$republican + 1])\n\n# If we don't condition on being Republican, we'll actually estimate\n# that the effect goes in the *opposite* direction\nsummary(lm(policy_prefs ~ media_consumption))\n\n## Call:\n## lm(formula = policy_prefs ~ media_consumption)\n\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -3.6108 -0.9559 -0.0198  0.9257  3.9537 \n\n## Coefficients:\n##                   Estimate Std. Error t value Pr(&gt;|t|)    \n## (Intercept)        0.68923    0.04323   15.94  &lt; 2e-16 ***\n## media_consumption  0.15269    0.03966    3.85 0.000126 ***\n## ---\n## Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n\n## Residual standard error: 1.317 on 998 degrees of freedom\n## Multiple R-squared:  0.01463,\tAdjusted R-squared:  0.01365 \n## F-statistic: 14.82 on 1 and 998 DF,  p-value: 0.0001257\n\n# Naive plot\nqplot(media_consumption, policy_prefs) +\n  geom_smooth(method='lm') +\n  theme_bw() +\n  ggtitle(\"Naive estimate (Simpson's Paradox)\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The estimate (0.15) goes in the opposite direction of the true effect (-.2)! Here\u2019s what the plot looks like:<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-large\"><a rel=\"attachment wp-att-126062\" href=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/\/\/how-to-break-regression\/simpsons-paradox-png\/\"><img data-dominant-color=\"e4e4e5\" data-has-transparency=\"false\" style=\"--dominant-color: #e4e4e5;\" loading=\"lazy\" decoding=\"async\" width=\"468\" height=\"381\"  sizes=\"(max-width: 480px) 480px, (max-width: 782px) 782px, 640px\" srcset=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/simpsons-paradox.png?resize=468,381 468w\" data-id=\"126062\" src=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/simpsons-paradox.png?w=468\" alt=\"\" class=\"wp-image-126062 not-transparent\" \/><\/a><\/figure>\n<\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To resolve this paradox, we need to add a factor in the model that indicates whether or not a respondent is a Republican. Adding that factor lets us estimate a&nbsp;<em>separate<\/em>&nbsp;intercept for Republicans and non-Republicans, so the slope on media consumption is estimated within groups rather than across them. Note that this is&nbsp;<em>not<\/em>&nbsp;like estimating an interaction term, where two explanatory variables are multiplied together. It\u2019s not that the slopes are&nbsp;<em>different<\/em>; the two groups simply start from different baselines.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Condition on being a Republican to get the right estimates\nsummary(lm(policy_prefs ~ media_consumption + republican))\n\n## Call:\n## lm(formula = policy_prefs ~ media_consumption + republican)\n\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -3.5518 -0.6678 -0.0186  0.6562  3.3009 \n\n## Coefficients:\n##                   Estimate Std. Error t value Pr(&gt;|t|)    \n## (Intercept)        0.05335    0.03904   1.366    0.172    \n## media_consumption -0.13615    0.03111  -4.376 1.34e-05 ***\n## republican         1.93049    0.06758  28.565  &lt; 2e-16 ***\n## ---\n## Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n\n## Residual standard error: 0.9774 on 997 degrees of freedom\n## Multiple R-squared:  0.4581,\tAdjusted R-squared:  0.457 \n## F-statistic: 421.4 on 2 and 997 DF,  p-value: &lt; 2.2e-16\n\n# Conditioning on being Republican\nqplot(media_consumption, policy_prefs, data=df, colour = republican) +\n  scale_color_manual(values = c(\"blue\",\"red\")) +\n  geom_smooth(method='lm') +\n  theme_bw() +\n  ggtitle(\"Conditioning on being a Republican (Simpson's Paradox)\")<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s what the plot looks like:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><a rel=\"attachment wp-att-126064\" href=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/\/\/how-to-break-regression\/smp-2-png\/\"><img data-dominant-color=\"ece6f0\" data-has-transparency=\"false\" style=\"--dominant-color: #ece6f0;\" loading=\"lazy\" decoding=\"async\" width=\"468\" height=\"380\"  srcset=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/smp-2.png?resize=468,380 468w\" 
sizes=\"(max-width: 480px) 480px, (max-width: 782px) 782px, 640px\" src=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/smp-2.png\" alt=\"\" class=\"wp-image-126064 not-transparent\" \/><\/a><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"c460\"><strong>Correlated errors<\/strong><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"a73b\">Another cardinal sin \u2014 and one that we should worry a lot about because it often arises from social desirability bias in survey responses \u2014 is the phenomenon of correlated errors. This example is inspired by&nbsp;<a href=\"https:\/\/www.nowpublishers.com\/article\/Details\/QJPS-6005\" rel=\"noreferrer noopener\" target=\"_blank\">Vavreck (2007)<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"364b\">Here, self-reported turnout and media consumption are caused by a combination of social desirability bias and true turnout and true consumption, respectively:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>N &lt;- 1000\n\n# The \"Truth\"\ntrue_media_consumption &lt;- rnorm(N)\ntrue_vote &lt;- .1 * media_consumption + rnorm(N)\n\n# social desirability bias\nsocial_desirability &lt;- rnorm(N)\n# what we actually observe from self-reports:\nself_report_media_consumption &lt;- true_media_consumption + social_desirability\nself_report_vote &lt;- true_vote + social_desirability<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s compare the estimated effect sizes of the self-reported data and the \u201ctrue\u201d data:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Self reports\nsummary(lm(self_report_vote ~ self_report_media_consumption))\n\n## Call:\n## lm(formula = self_report_vote ~ self_report_media_consumption)\n\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -3.9604 -0.7766  0.0142  0.8465  4.1811 \n\n## Coefficients:\n##                               Estimate Std. Error t value Pr(&gt;|t|)    \n## (Intercept)                    0.02020    0.03951   0.511    0.609    \n## self_report_media_consumption  0.54605    0.02716  20.102   &lt;2e-16 ***\n## ---\n## Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n\n## Residual standard error: 1.248 on 998 degrees of freedom\n## Multiple R-squared:  0.2882,\tAdjusted R-squared:  0.2875 \n## F-statistic: 404.1 on 1 and 998 DF,  p-value: &lt; 2.2e-16\n\n# \"Truth\"\nsummary(lm(true_vote ~ true_media_consumption))\n\n## Call:\n## lm(formula = true_vote ~ true_media_consumption)\n\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -3.5814 -0.6677 -0.0077  0.6829  3.4799 \n\n## Coefficients:\n##                        Estimate Std. Error t value Pr(&gt;|t|)\n## (Intercept)             0.01372    0.03217   0.426    0.670\n## true_media_consumption  0.01313    0.03245   0.404    0.686\n\n## Residual standard error: 1.017 on 998 degrees of freedom\n## Multiple R-squared:  0.0001639,\tAdjusted R-squared:  -0.000838 \n## F-statistic: 0.1636 on 1 and 998 DF,  p-value: 0.686<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The estimate from the self-reported data (0.55) is far larger than the estimate from the \u201ctrue\u201d data (0.01), a very dangerous problem. How could we fix this? Well, one way is to actually measure social desirability and include it in the model:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>summary(lm(self_report_vote ~ self_report_media_consumption + social_desirability))\n\n## Call:\n## lm(formula = self_report_vote ~ self_report_media_consumption + \n##     social_desirability)\n\n## Residuals:\n##     Min      1Q  Median      3Q     Max \n## -3.6042 -0.6774 -0.0127  0.6899  3.4470 \n\n## Coefficients:\n##                               Estimate Std. Error t value Pr(&gt;|t|)    \n## (Intercept)                    0.01208    0.03220   0.375    0.708    \n## self_report_media_consumption  0.01220    0.03246   0.376    0.707    \n## social_desirability            1.02245    0.04547  22.487   &lt;2e-16 ***\n## ---\n## Signif. codes:  0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 \u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n\n## Residual standard error: 1.017 on 997 degrees of freedom\n## Multiple R-squared:  0.5277,\tAdjusted R-squared:  0.5268 \n## F-statistic:   557 on 2 and 997 DF,  p-value: &lt; 2.2e-16<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"0f8d\">Note that while most people think of social desirability as a measurement-error problem, it is essentially the same problem as omitted variable bias, as described above.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"5b29\">It\u2019s important to remember that omitted variable bias and correlated errors are just two potential problems with regression analysis. Regression models are also not immune to issues associated with low levels of&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Power_(statistics)\" target=\"_blank\" rel=\"noreferrer noopener\">statistical power<\/a>, the failure to account for the influence of extreme values, and&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Heteroscedasticity\" target=\"_blank\" rel=\"noreferrer noopener\">heteroskedasticity<\/a>, among others. But by simulating the data-generating process, researchers can get a good sense of some of the more common ways in which statistical models might depart from reality.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Regression models are a cornerstone of modern social science. 
Yet social scientists can run into a lot of situations where regression&#8230;<\/p>\n","protected":false},"author":655,"featured_media":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"sub_headline":"","sub_title":"","_prc_public_revisions":[],"_ppp_expiration_hours":0,"_ppp_enabled":false,"ai_generated_summary":"","relatedPosts":[],"datacite_doi":"","datacite_doi_citation":"","_prc_seo_qr_attachment_id":0,"spoken_article_player_enabled":true,"displayBylines":true,"footnotes":"","prc_watchers":[],"_prc_fork_parent":0,"_prc_fork_status":"","_prc_active_fork":0},"categories":[],"bylines":[749],"collection":[],"_post_visibility":[],"decoded-category":[531],"formats":[],"_fund_pool":[],"languages":[],"regions-countries":[],"research-teams":[524],"workflow-status":[],"class_list":["post-111701","decoded","type-decoded","status-publish","hentry","bylines-solomon-messing","decoded-category-coding-how-to","research-teams-decoded"],"label":"Decoded","post_parent":0,"word_count":2419,"canonical_url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/2018\/07\/18\/how-to-break-regression\/","art_direction":{"A1":{"id":126056,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?w=564&h=317&crop=1","width":564,"height":317,"caption":"","chartArt":false},"A2":{"id":126056,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?w=268&h=151&crop=1","width":268,"height":151,"caption":"","chartArt":false},"A3":{"id":126056,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png","url":"https:\/\/al
pha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?w=194&h=110&crop=1","width":194,"height":110,"caption":"","chartArt":false},"A4":{"id":126056,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?w=268&h=151&crop=1","width":268,"height":151,"caption":"","chartArt":false},"XL":{"id":126056,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?w=720&h=405&crop=1","width":720,"height":405,"caption":"","chartArt":false},"social":{"id":126056,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/07.13.2018_featured.png?w=1200&h=628&crop=1","width":1200,"height":628,"caption":"","chartArt":false}},"_embeds":[],"watchers":[],"table_of_contents":[],"datacite_doi":"","prc_seo_data":{"title":"How to break regression","description":"Regression models are a cornerstone of modern social science. Yet social scientists can run into a lot of situations where regression...","og_title":"How to break regression","og_description":"Regression models are a cornerstone of modern social science. 
Yet social scientists can run into a lot of situations where regression...","schema_type":"Article","noindex":false,"canonical_url":"","primary_terms":{"category":42},"custom_schema":[],"og_image":126056,"indexnow_submitted_at":null,"gsc_index_status":null},"prepublish_checks":{},"jetpack_sharing_enabled":true,"relatedPostsOrdered":[],"bylinesOrdered":[{"key":"_xa6i8jf4z","termId":749}],"acknowledgementsOrdered":[],"_links":{"self":[{"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded\/111701","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded"}],"about":[{"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/types\/decoded"}],"author":[{"embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/users\/655"}],"replies":[{"embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/comments?post=111701"}],"version-history":[{"count":2,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded\/111701\/revisions"}],"predecessor-version":[{"id":138557,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded\/111701\/revisions\/138557"}],"wp:attachment":[{"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/media?parent=111701"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/categories?post=111701"},{"taxonomy":"bylines","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/bylines?post=111701"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/collection?post=111701"},{"taxonomy":"_post_visibility","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/_post_visibility?post=111701"
},{"taxonomy":"decoded-category","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded-category?post=111701"},{"taxonomy":"formats","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/formats?post=111701"},{"taxonomy":"_fund_pool","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/_fund_pool?post=111701"},{"taxonomy":"languages","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/languages?post=111701"},{"taxonomy":"regions-countries","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/regions-countries?post=111701"},{"taxonomy":"research-teams","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/research-teams?post=111701"},{"taxonomy":"workflow-status","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/workflow-status?post=111701"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}