{"id":111603,"date":"2019-06-12T15:05:00","date_gmt":"2019-06-12T20:05:00","guid":{"rendered":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/\/\/using-tidy-verse-tools-with-pew-research-center-survey-data-in-r\/"},"modified":"2024-04-14T04:10:37","modified_gmt":"2024-04-14T09:10:37","slug":"using-tidy-verse-tools-with-pew-research-center-survey-data-in-r","status":"publish","type":"decoded","link":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/2019\/06\/12\/using-tidy-verse-tools-with-pew-research-center-survey-data-in-r\/","title":{"rendered":"Using tidyverse tools with Pew Research Center survey data in R"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter size-640-wide\"><a rel=\"attachment wp-att-125958\" href=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/\/\/using-tidy-verse-tools-with-pew-research-center-survey-data-in-r\/06-12-2019_feature-png\/\"><img data-dominant-color=\"191848\" data-has-transparency=\"true\" style=\"--dominant-color: #191848;\" loading=\"lazy\" decoding=\"async\"  srcset=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?resize=480,270 480w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?resize=782,440 782w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?resize=960,540 960w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?resize=1200,675 1200w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?resize=1400,788 1400w\" sizes=\"(max-width: 480px) 480px, (max-width: 782px) 782px, 640px\" height=\"360\" width=\"640\" src=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?w=640\" alt=\"\" 
class=\"wp-image-125958 has-transparency\" \/><\/a><figcaption class=\"wp-element-caption\">(Illustration by Selena Qian\/Pew Research Center)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"92b7\"><em>(Related post:&nbsp;<\/em><a href=\"https:\/\/medium.com\/pew-research-center-decoded\/how-to-analyze-pew-research-center-survey-data-in-r-f326df360713\"><em>How to analyze Pew Research Center survey data in R<\/em><\/a><em>)<\/em><br><br>Last year, I wrote an&nbsp;<a href=\"https:\/\/medium.com\/pew-research-center-decoded\/how-to-analyze-pew-research-center-survey-data-in-r-f326df360713\">introductory blog post<\/a>&nbsp;about how to access and analyze Pew Research Center survey data with R, a free, open-source software for statistical analysis. The post showed how to perform tasks using the&nbsp;<a href=\"https:\/\/cran.r-project.org\/web\/packages\/survey\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">survey<\/a>&nbsp;package.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"9d9d\">The&nbsp;<code>survey<\/code>&nbsp;package is ideal if you\u2019re analyzing survey data and need variance estimates that account for complex sample designs, such as cluster samples or stratification. However, the first step of any data analysis is often exploring the data with simple but powerful summaries, such as crosstabs and plots. This usually requires significant re-coding of variables as well as data cleaning and other manipulation tasks that can be difficult and counterintuitive. Fortunately, something called the \u201c<a href=\"https:\/\/www.tidyverse.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">tidyverse<\/a>\u201d is here to help.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"61d9\">In this post, I\u2019ll show how to use&nbsp;<code>tidyverse<\/code>&nbsp;tools to do exploratory analyses of Pew Research Center survey data. 
(These tools, however, can be used with data from any source.)<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"what-is-the-tidyverse\">What is the &#8220;tidyverse&#8221;?<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2ab9\">R has a large community of users, many of whom make their own code freely available to other R users in the form of&nbsp;<a href=\"https:\/\/cran.r-project.org\/web\/packages\/\" rel=\"noreferrer noopener\" target=\"_blank\">packages<\/a>&nbsp;that implement specialized kinds of analyses or enable other programming tasks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"d204\">R packages are typically hosted on&nbsp;<a href=\"https:\/\/cran.r-project.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">The Comprehensive R Archive Network<\/a> (CRAN), but are also available on other primarily open-source code repositories like&nbsp;<a href=\"https:\/\/github.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub<\/a>,&nbsp;<a href=\"https:\/\/about.gitlab.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">GitLab<\/a>&nbsp;or&nbsp;<a href=\"https:\/\/www.bioconductor.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Bioconductor<\/a>. One of the nice things about R is that anyone can build a package and contribute to the software\u2019s development. But with so many packages (there were 14,366 on CRAN as of June 12, 2019), there are often many different ways to tackle the same challenge. This can lead to inconsistencies that are often confusing, making things harder for new users or those looking to develop their R skills.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"5dfc\">Enter the&nbsp;<code>tidyverse<\/code>, a collection of R packages designed for data science that share a consistent design philosophy and produce code that can be more easily understood and shared. 
The&nbsp;<code>tidyverse<\/code>&nbsp;has a&nbsp;<a href=\"https:\/\/community.rstudio.com\/c\/tidyverse\" target=\"_blank\" rel=\"noreferrer noopener\">growing community<\/a>&nbsp;of users, contributors and developers who aim to help R users access and learn these free, open-source tools. The book <em>R for Data Science<\/em> by Hadley Wickham and Garrett Grolemund walks through many key components of how to use these tools to do data science in R and can be&nbsp;<a href=\"https:\/\/r4ds.had.co.nz\/\" target=\"_blank\" rel=\"noreferrer noopener\">accessed online<\/a>&nbsp;for free.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"packages\">Packages<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2caa\">For this post, we\u2019ll need the&nbsp;<code>tidyverse<\/code>&nbsp;and&nbsp;<a href=\"https:\/\/haven.tidyverse.org\/\" rel=\"noreferrer noopener\" target=\"_blank\">haven<\/a>&nbsp;packages. The&nbsp;<code>haven<\/code>&nbsp;package is the tidyverse\u2019s solution for importing and exporting data from several different formats, including SPSS (the format in which most Center datasets are currently released), SAS and Stata.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"9efd\">The only functions from the&nbsp;<code>haven<\/code>&nbsp;package we&#8217;ll use here are&nbsp;<code>read_sav()<\/code>&nbsp;and&nbsp;<code>as_factor()<\/code>.&nbsp;All other functions referenced in this post come from packages like&nbsp;<code>dplyr<\/code>,&nbsp;<code>forcats<\/code>&nbsp;or&nbsp;<code>ggplot2<\/code>, which are loaded automatically with the&nbsp;<a href=\"https:\/\/www.tidyverse.org\/packages\/\" rel=\"noreferrer noopener\" target=\"_blank\">tidyverse<\/a>&nbsp;package.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"4c61\">If you don\u2019t already have these packages, you\u2019ll need to install them in order to run the code in this post. 
Even if you&nbsp;<em>have<\/em>&nbsp;already installed them, you can update them by running the code below:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><code>install.packages(\"tidyverse\")\ninstall.packages(\"haven\")<\/code><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Next, load the packages with the&nbsp;<code>library()<\/code>&nbsp;function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><code>library(tidyverse)\n#loads all \"core\" tidyverse packages like\n#dplyr, tidyr, forcats, and ggplot2\nlibrary(haven)<\/code><\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"pipes\">Pipes<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"4de6\">One of the most appealing features of the tidyverse is its emphasis on making code easier to read and understand. A key tool for improving the readability of tidyverse code is the&nbsp;<em>pipe<\/em>, a set of three characters that looks like this:&nbsp;<code>%&gt;%<\/code>. Placing a pipe to the right of an object (such as&nbsp;<code>data %&gt;%<\/code>) tells R to take the object on its left and pass it to the function on its right. This allows you to chain multiple commands together without getting lost in a hailstorm of parentheses.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"a590\">Let\u2019s use a pop culture dataset to illustrate this point. The&nbsp;<code>starwars<\/code>&nbsp;dataset that comes with the&nbsp;<code>dplyr<\/code>&nbsp;package (part of the tidyverse) contains information on the characters from the movies, made available through the&nbsp;<a href=\"https:\/\/swapi.co\/\" rel=\"noreferrer noopener\" target=\"_blank\">Star Wars API<\/a>. 
Suppose we want to filter the dataset to all characters for whom a height is known, arrange the dataset in order of decreasing height and then change the height variable from centimeters to inches.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"74e4\">Here is the code to do this without using pipes:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><code>mutate(arrange(filter(starwars, !is.na(height)), desc(height)), height = 0.393701 * height)<\/code><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">And here\u2019s the code with the pipe:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><code>starwars %&gt;%\nfilter(!is.na(height)) %&gt;%\narrange(desc(height)) %&gt;%\nmutate(height = 0.393701 * height)<\/code><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"92dc\">As far as R is concerned, these two pieces of code do exactly the same thing. But using a series of pipes makes it much easier for humans to read and understand exactly what\u2019s going on, and in what order. The code above can be easily read as a series of steps: take the starwars dataset, filter it to cases where height is not missing, arrange the remaining cases by height in descending order, and, finally, change the height variable to inches.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"13f2\">Readability makes it easier to share your code with other people. Even if you\u2019ll never share your code with anyone else, this approach will make it easier for you to understand your own code if you ever need to revisit it weeks, months or even years later when the details are no longer fresh in your memory. (This can also help you or your team avoid&nbsp;<a href=\"https:\/\/medium.com\/pew-research-center-decoded\/avoiding-technical-debt-in-social-science-research-54618194790a\">technical debt<\/a>.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"98d6\">Beyond readability, pipes make your code easier to maintain. 
It\u2019s easy to remove a step, insert a new step or change the order when all of the commands are in a sequential series of pipes, rather than nested.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"loading-the-data-into-r\">Loading the data into R<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"63ed\">Let\u2019s shift gears and return our attention to using&nbsp;<code>tidyverse<\/code>&nbsp;tools to analyze Pew Research Center survey data. You can download the Center\u2019s datasets via&nbsp;<a href=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/download-datasets\/\" rel=\"noreferrer noopener\" target=\"_blank\">this link<\/a>, and you can learn more&nbsp;<a href=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/short-reads\/2018\/03\/09\/how-to-access-pew-research-center-survey-data\/\" rel=\"noreferrer noopener\" target=\"_blank\">here<\/a>&nbsp;about the kind of data we release and how to access it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"db6f\">For the purposes of this backgrounder, we\u2019ll use data from our&nbsp;<a href=\"http:\/\/www.people-press.org\/dataset\/april-2017-political-survey\/\" rel=\"noreferrer noopener\" target=\"_blank\">April 2017 Political survey<\/a>. We\u2019ll walk through how to use&nbsp;<code>tidyverse<\/code>&nbsp;tools to calculate weighted survey estimates (specifically, approval of President Donald Trump) broken out by other variables in our dataset (education, race\/ethnicity and generation).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"5d25\">The first step is to load the dataset into your R environment. Almost all of the datasets available to download from the Center are stored as .sav (SPSS) files. The haven package can import datasets created by SPSS and a number of other programs such as Stata and SAS.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"4865\">There are a few features of SPSS datasets that require special handling when loading data into R. 
SPSS files often contain variables with both character labels and numeric codes that are not necessarily sequential (e.g. for party ID, 1 = Republican, 2 = Democrat, 9 = Refused). SPSS also allows variables to have multiple, user-defined missing values. This sort of variable metadata is helpful, but it doesn\u2019t quite align with any of R\u2019s standard data types.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"7550\">The&nbsp;<code>haven<\/code>&nbsp;package handles these differences by importing variables in a custom format called&nbsp;<code>\"labelled\"<\/code>&nbsp;and leaving the user to decide which of the numeric codes or value labels he or she wants to work with in R.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"b589\">We\u2019ll import the data using haven\u2019s&nbsp;<code>read_sav()<\/code>&nbsp;function. We&#8217;ll set&nbsp;<code>user_na = TRUE<\/code>&nbsp;to ensure that responses such as &#8220;Don&#8217;t know&#8221; or &#8220;Refused&#8221; aren\u2019t automatically converted to missing values. In this case, we also want to convert any&nbsp;<code>labelled<\/code>&nbsp;variables into&nbsp;<a href=\"https:\/\/r4ds.had.co.nz\/factors.html\" target=\"_blank\" rel=\"noreferrer noopener\">factors<\/a>&nbsp;so that we can work with value labels instead of the numeric codes (e.g. &#8220;Republican&#8221; instead of &#8220;1&#8221;). 
We do this by piping (<code>%&gt;%<\/code>) the entire dataset to haven&#8217;s&nbsp;<code>as_factor()<\/code>&nbsp;function, which converts any labelled variables in the dataset to factors.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><code>Apr17 &lt;- read_sav(\"Apr17 public.sav\",\nuser_na = TRUE) %&gt;%\nas_factor()<\/code><\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"4085\">Adding variables with&nbsp;<code><mark style=\"background-color:#ecece3\" class=\"has-inline-color\">mutate<\/mark><\/code><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"b72c\">The first question (<code>q1<\/code>) on the April 2017 survey asked whether or not respondents approved of Trump\u2019s performance as president so far. The next question (<code>q1a<\/code>) asked respondents how strongly they approved or disapproved (&#8220;Very strongly,&#8221; &#8220;Not so strongly,&#8221; &#8220;Don&#8217;t know\/Refused (VOL.)&#8221;). We want to use these two variables to create a new variable that combines&nbsp;<code>q1<\/code>&nbsp;and&nbsp;<code>q1a<\/code>. This sort of re-coding is extremely common when working with survey data, but it can be a surprisingly difficult task when using only base R functions. The&nbsp;<code>dplyr<\/code>&nbsp;package, which is loaded automatically as part of the&nbsp;<code>tidyverse<\/code>&nbsp;package, includes a number of tools that make these sorts of data processing tasks much easier.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"651e\">The most general of these is the&nbsp;<a href=\"https:\/\/dplyr.tidyverse.org\/reference\/mutate.html\" target=\"_blank\" rel=\"noreferrer noopener\">mutate()<\/a>&nbsp;function, which either adds new variables to a data frame or modifies existing variables. We\u2019ll use&nbsp;<code>mutate()<\/code>&nbsp;in the code below to create a variable called trump_approval. 
The actual re-coding is done with the&nbsp;<a href=\"https:\/\/dplyr.tidyverse.org\/reference\/case_when.html\" target=\"_blank\" rel=\"noreferrer noopener\">case_when()<\/a>&nbsp;function, which evaluates a series of conditions and returns the value associated with the first one that applies.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><code>Apr17 &lt;- Apr17 %&gt;%\nmutate(trump_approval = case_when(\n\nq1 == \"Approve\" &amp; q1a == \"Very strongly\" ~ \"Strongly approve\",\nq1 == \"Approve\" &amp; q1a != \"Very strongly\" ~ \"Not strongly approve\",\nq1 == \"Disapprove\" &amp; q1a == \"Very strongly\" ~ \"Strongly disapprove\",\nq1 == \"Disapprove\" &amp; q1a != \"Very strongly\" ~ \"Not strongly disapprove\",\nq1 == \"Don't know\/Refused (VOL.)\" |\nq1a == \"Don't know\/Refused (VOL.)\" ~ \"Refused\"\n) #this parenthesis closes our call to\n#case_when and sends it to\n#fct_relevel with %&gt;%\n%&gt;%\nfct_relevel(\"Strongly approve\",\n\"Not strongly approve\",\n\"Not strongly disapprove\",\n\"Strongly disapprove\",\n\"Refused\"\n) #this parenthesis closes our call to fct_relevel\n) #this parenthesis closes our call to mutate<\/code><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"ad31\">We supply a series of formulas to the&nbsp;<code>case_when()<\/code>&nbsp;function. Each formula is an \u201cif-then\u201d statement, where the left side (everything to the left of&nbsp;<code>~<\/code>) describes a logical condition and the right side provides the value to return if that condition is true. 
As a reminder, you can type ?case_when() to access the help page (this works for any function in R, not just&nbsp;<code>case_when()<\/code>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"9f3e\">The first line of the call to&nbsp;<code>case_when()<\/code>&nbsp;is:&nbsp;<code>q1 == \"Approve\" &amp; q1a == \"Very strongly\" ~ \"Strongly approve\"<\/code>.&nbsp;This can be read to mean that if a respondent said &#8220;Approve&#8221; to&nbsp;<code>q1<\/code>&nbsp;and answered &#8220;Very strongly&#8221; to&nbsp;<code>q1a<\/code>, we\u2019ll code them as &#8220;Strongly approve&#8221; in our new&nbsp;<code>trump_approval<\/code>&nbsp;variable. We use five different clauses to make these new categories and add the trump_approval variable to our dataset.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"fe47\">Because&nbsp;<code>case_when()<\/code>&nbsp;returns character variables when used this way, the last step is to pipe the variable to&nbsp;<a href=\"https:\/\/forcats.tidyverse.org\/reference\/fct_relevel.html\" target=\"_blank\" rel=\"noreferrer noopener\">fct_relevel()<\/a>&nbsp;from the&nbsp;<code>forcats<\/code>&nbsp;package. 
This converts&nbsp;<code>trump_approval<\/code>&nbsp;to a factor and orders the categories from &#8220;Strongly approve&#8221; to &#8220;Refused.&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We can run the&nbsp;<code>table<\/code>&nbsp;command on our new variable to verify that everything worked as intended.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>table(Apr17$trump_approval, Apr17$q1)\n##                          \n##                  Approve      Disapprove              Don't know\/          \n##                                                    Refused (VOL.)\n##Strongly approve          476          0                         0\n##Not strongly approve      130          0                         0\n##Not strongly disapprove     0        143                         0\n##Strongly disapprove         0        676                         0\n##Refused                     0          0                        76<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"7b22\">Having created our new&nbsp;<code>trump_approval<\/code>&nbsp;variable, we&#8217;d like to see how it breaks down according to a few different demographic characteristics: educational attainment (<code>educ2<\/code>&nbsp;in the Apr17 dataset), race\/ethnicity (<code>racethn<\/code>) and generation (<code>gen5<\/code>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"34b8\">But&nbsp;<code>educ2<\/code>&nbsp;has nine categories, some of which don\u2019t have very many respondents. We thus want to collapse them into fewer categories with larger numbers of respondents. We can do this with the&nbsp;<code>forcats<\/code>&nbsp;function&nbsp;<code>fct_collapse()<\/code>. We use&nbsp;<code>mutate()<\/code>&nbsp;again to make new variables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"c507\">Let\u2019s first check the answer options of educ2 to see the categories we need to collapse. 
Since we used&nbsp;<code>as_factor()<\/code>&nbsp;when we read the dataset in,&nbsp;<code>educ2<\/code>&nbsp;is a factor variable. So, we can see the answer options by using the&nbsp;<code>levels()<\/code>&nbsp;function.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>levels(Apr17$educ2)\n\n## &#091;1] \"Less than high school (Grades 1-8 or no formal schooling)\"                                \n## &#091;2] \"High school incomplete (Grades 9-11 or Grade 12 with NO diploma)\"                         \n## &#091;3] \"High school graduate (Grade 12 with diploma or GED certificate)\"                          \n## &#091;4] \"Some college, no degree (includes some community college)\"                                \n## &#091;5] \"Two year associate degree from a college or university\"                                   \n## &#091;6] \"Four year college or university degree\/Bachelor's degree (e.g., BS, BA, AB)\"              \n## &#091;7] \"Some postgraduate or professional schooling, no postgraduate degree\"                      \n## &#091;8] \"Postgraduate or professional degree, including master's, doctorate, medical or law degree\"\n## &#091;9] \"Don't know\/Refused (VOL.)\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now that we know the levels, we can collapse them into fewer categories. The first argument to&nbsp;<code>fct_collapse()<\/code>&nbsp;is the variable whose categories you want to collapse, in our case&nbsp;<code>educ2<\/code>. We assign new categories on the left, like &#8220;High school grad or less&#8221; in the example below. We use the&nbsp;<code>c()<\/code>&nbsp;function to tell&nbsp;<code>fct_collapse()<\/code>&nbsp;which existing categories belong in our new &#8220;High school grad or less&#8221; category. 
Note that you need to use the exact names of these categories.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Apr17 &lt;- Apr17 %&gt;%\nmutate(educ_cat = fct_collapse(educ2,\n\"High school grad or less\" = c(\n\"Less than high school (Grades 1-8 or no formal schooling)\",\n\"High school incomplete (Grades 9-11 or Grade 12 with NO diploma)\",\n\"High school graduate (Grade 12 with diploma or GED certificate)\"\n),\n\"Some college\" = c(\n\"Some college, no degree (includes some community college)\",\n\"Two year associate degree from a college or university\"\n),\n\"College grad+\" = c(\n\"Four year college or university degree\/Bachelor's degree (e.g., BS, BA, AB)\",\n\"Some postgraduate or professional schooling, no postgraduate degree\",\n\"Postgraduate or professional degree, including master's, doctorate, medical or law degree\"\n)\n) #this parenthesis closes our call\n#to fct_collapse\n) #this parenthesis closes our call to mutate<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"getting-weighted-estimates-with-group-by-and-summarise\">Getting weighted estimates with group_by and summarise<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"8bb4\">Now that we have created and re-coded the variables we want to estimate, we can use the&nbsp;<code>dplyr<\/code>&nbsp;functions&nbsp;<a href=\"https:\/\/dplyr.tidyverse.org\/reference\/group_by.html\" rel=\"noreferrer noopener\" target=\"_blank\">group_by()<\/a>&nbsp;and&nbsp;<a href=\"https:\/\/dplyr.tidyverse.org\/reference\/summarise.html\" rel=\"noreferrer noopener\" target=\"_blank\">summarise()<\/a>&nbsp;to produce some weighted summaries of the data. 
We can use&nbsp;<code>group_by()<\/code>&nbsp;to tell R to group the Apr17 dataset by these variables and then&nbsp;<code>summarise()<\/code>&nbsp;to create summary statistics among those groups.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"b2ca\">To make sure our estimates are representative of the population, we need to use the survey weights (variable named&nbsp;<code>weight<\/code>) included in our dataset, as is the case in almost all of the Center&#8217;s datasets. For the total sample, we can calculate weighted percentages by adding up the respondent weights for each category and dividing by the sum of the weights for the whole sample.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"79d0\">The first step is to&nbsp;<code>group_by()<\/code>&nbsp;trump_approval and then use&nbsp;<code>summarise()<\/code>&nbsp;to get the sum of the weights for each group. Then we use&nbsp;<code>mutate()<\/code>&nbsp;to convert these totals into percentages.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"8efb\">So the first step in code is this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><code>trump_approval &lt;- Apr17 %&gt;%\ngroup_by(trump_approval) %&gt;%\nsummarise(weighted_n = sum(weight))<\/code><\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"c053\">We can see that&nbsp;<code>group_by()<\/code>&nbsp;and&nbsp;<code>summarise()<\/code>&nbsp;give us a table with one row for each group and one column for each summary variable that we created. If we look at the&nbsp;<code>trump_approval<\/code>&nbsp;object that we just created, we can see we have a column containing the categories of our&nbsp;<code>trump_approval<\/code>&nbsp;variable (that we passed to&nbsp;<code>group_by()<\/code>) and a column we created called&nbsp;<code>weighted_n<\/code>. 
The&nbsp;<code>weighted_n<\/code>&nbsp;column is the sum of the weights in each category of&nbsp;<code>trump_approval<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"5c54\">But since we want to show proportions, we have another step to do. We\u2019ll use&nbsp;<code>mutate()<\/code>&nbsp;to add a column called&nbsp;<code>weighted_group_size<\/code>&nbsp;that is the sum of&nbsp;<code>weighted_n<\/code>. Then we just divide&nbsp;<code>weighted_n<\/code>&nbsp;by&nbsp;<code>weighted_group_size<\/code>&nbsp;and store that in a column called&nbsp;<code>weighted_estimate<\/code>, which gives us our weighted proportions.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>trump_approval &lt;- Apr17 %&gt;% \n##group by trump_approval to calculate weighted totals\n##by taking the sum of the weights\n  group_by(trump_approval) %&gt;% \n  summarise(weighted_n = sum(weight)) %&gt;% \n##add the weighted_group_size to get the total weighted n and \n##divide weighted_n by weighted_group_size to get the proportions \n  mutate(weighted_group_size = sum(weighted_n), \n         weighted_estimate = weighted_n \/ weighted_group_size\n         )\n\ntrump_approval\n\n## # A tibble: 5 x 4\n##trump_approval weighted_n weighted_group_size weighted_estimate\n##  &lt;fct&gt;          &lt;dbl&gt;               &lt;dbl&gt;             &lt;dbl&gt;\n## 1 Strongly\n## approve         1293.                4319.            0.299 \n## 2 Not strongly\n## approve          408.                4319.            0.0945\n## 3 Not strongly\n## disapprove       458.               4319.             0.106 \n## 4 Strongly\n## disapprove      1884.               4319.            0.436 \n## 5 Refused        275.               4319.            0.0636<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To add a subgroup variable and get a weighted crosstab, we can use the same procedure with one slight addition. 
Instead of going straight from&nbsp;<code>summarise()<\/code>&nbsp;to&nbsp;<code>mutate()<\/code>&nbsp;and adding our group sizes and proportions, we have to tell&nbsp;<code>mutate()<\/code>&nbsp;to calculate the&nbsp;<code>weighted_group_size<\/code>&nbsp;of&nbsp;<code>educ_cat<\/code>. This gives the weighted number of respondents that are in each category of&nbsp;<code>educ_cat<\/code>&nbsp;(&#8220;High school grad or less,&#8221; &#8220;Some college,&#8221; &#8220;College grad+,&#8221; &#8220;Don&#8217;t know\/Refused (VOL.)&#8221;).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>trump_estimates_educ &lt;- Apr17 %&gt;% \n##group by educ and trump approval to get weighted n's per group\n  group_by(educ_cat, trump_approval) %&gt;% \n##calculate the total number of people in each answer and education category using survey weights (weight)\n  summarise(weighted_n = sum(weight)) %&gt;% \n##group by education to calculate education category size\n  group_by(educ_cat) %&gt;%\n##add columns for total group size and the proportion\n  mutate(weighted_group_size = sum(weighted_n),\n         weighted_estimate = weighted_n\/weighted_group_size)\n\ntrump_estimates_educ\n\n## # A tibble: 17 x 5\n## # Groups:   educ_cat &#091;4]\n## educ_cat  trump_approval  weighted_n weighted_group_\u2026 weighted_estima\u2026\n##    &lt;fct&gt;       &lt;fct&gt;                &lt;dbl&gt;            &lt;dbl&gt;            &lt;dbl&gt;\n##  1 High schoo\u2026 Strongly appro\u2026     550.             1710.            0.322 \n##  2 High schoo\u2026 Not strongly a\u2026     207.             1710.            0.121 \n##  3 High schoo\u2026 Not strongly d\u2026     221.             1710.            0.129 \n##  4 High schoo\u2026 Strongly disap\u2026     593.             1710.            0.347 \n##  5 High schoo\u2026 Refused             140.             1710.            0.0817\n##  6 Some colle\u2026 Strongly appro\u2026     404.             1337.            
0.302 \n##  7 Some colle\u2026 Not strongly a\u2026     111.             1337.            0.0833\n##  8 Some colle\u2026 Not strongly d\u2026     128.             1337.            0.0959\n##  9 Some colle\u2026 Strongly disap\u2026     605.             1337.            0.453 \n## 10 Some colle\u2026 Refused              88.1            1337.            0.0659\n## 11 College gr\u2026 Strongly appro\u2026     336.             1258.            0.267 \n## 12 College gr\u2026 Not strongly a\u2026      89.9            1258.            0.0715\n## 13 College gr\u2026 Not strongly d\u2026     109.             1258.            0.0870\n## 14 College gr\u2026 Strongly disap\u2026     676.             1258.            0.537 \n## 15 College gr\u2026 Refused              47.0            1258.            0.0373\n## 16 Don't know\u2026 Strongly appro\u2026       2.97             13.3           0.224 \n## 17 Don't know\u2026 Strongly disap\u2026      10.3              13.3           0.776<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2301\">This may seem like a lot of work to create one weighted crosstab, and there are definitely easier ways to do this if that\u2019s all you need. However, it gives us a great deal of flexibility to create any number of other summary statistics besides weighted percentages. Because the summaries are stored as a&nbsp;<code>data.frame<\/code>, it\u2019s easy to convert them into graphs or use them in other analyses. (Technically the output is a&nbsp;<a href=\"https:\/\/cran.r-project.org\/web\/packages\/tibble\/vignettes\/tibble.html\" rel=\"noreferrer noopener\" target=\"_blank\">tibble<\/a>, tidyverse&#8217;s approach to data frames.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"3927\">Perhaps most usefully, we can use this approach to create summaries for a large number of demographics or outcome variables simultaneously. 
In this example, we\u2019re not only interested in breaking out&nbsp;<code>trump_approval<\/code>&nbsp;by&nbsp;<code>educ_cat<\/code>, but also by the&nbsp;<code>racethn<\/code>&nbsp;and&nbsp;<code>gen5<\/code>&nbsp;variables. With some reshaping of the data using the&nbsp;<code>gather()<\/code>&nbsp;function, we can use the procedure to get all of our subgroup summaries at once.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"cadf\">Rearranging data with&nbsp;<code><strong><mark style=\"background-color:#ecece3\" class=\"has-inline-color\">gather()<\/mark><\/strong><\/code><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"26f4\">The first step to making the process a bit simpler is to take the Apr17 dataset and select down to only the columns we need for our analysis using the&nbsp;<code>select()<\/code>&nbsp;function. The&nbsp;<code>select()<\/code>&nbsp;function is like a Swiss army knife for keeping, rearranging or dropping (using&nbsp;<code>-<\/code>) columns. There are also a number of helper functions like&nbsp;<code>starts_with()<\/code>,&nbsp;<code>ends_with()<\/code>&nbsp;and&nbsp;<code>matches()<\/code>&nbsp;that make it easy to select a number of columns with a certain naming pattern instead of naming each column explicitly. You can also rename variables with&nbsp;<code>select()<\/code>&nbsp;by supplying a new variable name before an&nbsp;<code>=<\/code>&nbsp;and the old variable name.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Apr17 &lt;- Apr17 %&gt;%\nselect(resp_id = psraid,\nweight,\ntrump_approval,\neduc_cat, racethn, gen5)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This line of code uses the&nbsp;<code>select()<\/code>&nbsp;function to limit the Apr17 dataset to only the columns we need for our analysis. Just to illustrate how renaming works, we\u2019ll change the name of the respondent identifier from&nbsp;<code>psraid<\/code>&nbsp;to&nbsp;<code>resp_id<\/code>. 
We\u2019ll retain the survey weight (<code>weight<\/code>), the&nbsp;<code>trump_approval<\/code>&nbsp;variable and the subgroup variables we are interested in (<code>educ_cat<\/code>,&nbsp;<code>racethn<\/code>&nbsp;and&nbsp;<code>gen5<\/code>).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>head(Apr17)\n\n## # A tibble: 6 x 6\n##   resp_id weight trump_approval     educ_cat         racethn     gen5      \n##     &lt;dbl&gt;  &lt;dbl&gt; &lt;fct&gt;              &lt;fct&gt;            &lt;fct&gt;       &lt;fct&gt;     \n## 1  100005  2.94 Strongly disappro\u2026 College grad+  Black, non\u2026 Silent (1\u2026\n## 2  100010  1.32 Not strongly appr\u2026 Some college   Hispanic    Boomer (1\u2026\n## 3  100021  1.24 Strongly disappro\u2026 College grad+  White, non\u2026 Silent (1\u2026\n## 4  100028  4.09 Strongly approve   Some college   White, non\u2026 Boomer (1\u2026\n## 5  100037  1.12 Refused            College grad+  White, non\u2026 Boomer (1\u2026\n## 6  100039  6.68 Strongly disappro\u2026 High school gra\u2026 Black, non\u2026 Boomer (1\u2026<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"692c\">The next step is to rearrange the data in a way that makes it easy to calculate the weighted summary statistics by each demographic group. We\u2019ll do this with the&nbsp;<a href=\"https:\/\/tidyr.tidyverse.org\/reference\/gather.html\" target=\"_blank\" rel=\"noreferrer noopener\">gather()<\/a>&nbsp;function, which reshapes data from \u201cwide\u201d format to \u201clong\u201d \u2014 an extremely useful function when working with data. 
(Recently, the team behind the&nbsp;<code>tidyr<\/code>&nbsp;package introduced functions called&nbsp;<code><a href=\"https:\/\/tidyr.tidyverse.org\/dev\/articles\/pivot.html\" target=\"_blank\" rel=\"noreferrer noopener\">pivot_longer()<\/a><\/code><a href=\"https:\/\/tidyr.tidyverse.org\/dev\/articles\/pivot.html\" target=\"_blank\" rel=\"noreferrer noopener\">&nbsp;and&nbsp;<\/a><code><a href=\"https:\/\/tidyr.tidyverse.org\/dev\/articles\/pivot.html\" target=\"_blank\" rel=\"noreferrer noopener\">pivot_wider()<\/a><\/code>. These functions will not replace&nbsp;<code>gather()<\/code>&nbsp;and&nbsp;<code>spread()<\/code>, but are designed to be more useful and intuitive.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"d7e5\">In wide format, there is one row per person and one column per variable. In long format, there are multiple rows per person: one for each of the demographic variables we want to analyze. The separate demographic columns are replaced by a pair of columns: a \u201ckey\u201d column and a \u201cvalue\u201d column. Here, the \u201ckey\u201d column is called&nbsp;<code>subgroup_variable<\/code>, and it identifies which demographic variable is associated with that row (one of&nbsp;<code>educ_cat<\/code>,&nbsp;<code>racethn<\/code>, or&nbsp;<code>gen5<\/code>). The &#8220;value&#8221; column is called&nbsp;<code>subgroup<\/code>&nbsp;and identifies the specific demographic category to which the person belongs.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Apr17_long &lt;- Apr17 %&gt;%\n  gather(key = subgroup_variable, value = subgroup,\n         educ_cat, racethn, gen5)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">In this code, we\u2019re telling&nbsp;<code>gather()<\/code>&nbsp;to name the key column&nbsp;<code>\"subgroup_variable\"<\/code>&nbsp;and the value column&nbsp;<code>\"subgroup\"<\/code>. The remaining arguments specify the names of the columns that we want to gather into rows. 
After the&nbsp;<code>gather()<\/code>&nbsp;step, we\u2019ve gone from a wide dataset with 1,501 rows to a long dataset with 4,503 rows (1,501 respondents x 3 demographic variables).<\/p>\n\n\n\n<figure class=\"wp-block-image size-640-wide\"><a rel=\"attachment wp-att-125961\" href=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/\/\/using-tidy-verse-tools-with-pew-research-center-survey-data-in-r\/image-10-png\/\"><img data-dominant-color=\"e8e2b8\" data-has-transparency=\"false\" style=\"--dominant-color: #e8e2b8;\" loading=\"lazy\" decoding=\"async\"  srcset=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-10.png?resize=480,582 480w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-10.png?resize=782,949 782w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-10.png?resize=960,1165 960w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-10.png?resize=1200,1456 1200w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-10.png?resize=1281,1554 1281w\" sizes=\"(max-width: 480px) 480px, (max-width: 782px) 782px, 640px\" height=\"776\" width=\"640\" src=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-10.png?w=640\" alt=\"\" class=\"wp-image-125961 not-transparent\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"a4c0\">If you want to read more about&nbsp;<code>gather()<\/code>, I recommend starting with&nbsp;<a href=\"https:\/\/twitter.com\/WeAreRLadies\/status\/1059520693857996800\" rel=\"noreferrer noopener\" target=\"_blank\">this tweet<\/a>&nbsp;by&nbsp;<a href=\"https:\/\/twitter.com\/apreshill\" rel=\"noreferrer noopener\" target=\"_blank\">Alison Hill<\/a>&nbsp;from the&nbsp;<a href=\"https:\/\/twitter.com\/WeAreRLadies\" rel=\"noreferrer noopener\" 
target=\"_blank\">WeAreRLadies<\/a>&nbsp;Twitter account. You can also go to the <a href=\"https:\/\/r4ds.had.co.nz\/tidy-data.html#gathering\" rel=\"noreferrer noopener\" target=\"_blank\">tidy data<\/a>&nbsp;chapter in the previously mentioned&nbsp;<em>R for Data Science<\/em>&nbsp;book.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"b1d3\">Now that we have our data arranged in this format, getting weighted summaries for all three subgroup variables is just a matter of adding another grouping variable. Whereas before we grouped by&nbsp;<code>educ_cat<\/code>, we can now use our newly created&nbsp;<code>subgroup_variable<\/code>&nbsp;and&nbsp;<code>subgroup<\/code>&nbsp;columns instead.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>trump_estimates &lt;- Apr17_long %&gt;%\n#group by subgroup_variable, subgroup, and trump approval\ngroup_by(subgroup_variable, subgroup, trump_approval) %&gt;%\n#calculate the total number of people in each answer and subgroup category using survey weights (weight)\nsummarise(weighted_n = sum(weight)) %&gt;%\n#group by subgroup only to calculate subgroup category size\ngroup_by(subgroup) %&gt;%\n#add columns for total group size and the proportion\nmutate(weighted_group_size = sum(weighted_n),\nweighted_estimate = weighted_n\/weighted_group_size)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Since we are only really interested in the proportions, we\u2019ll remove the <code>weighted_n<\/code> and <code>weighted_group_size<\/code> columns using the <code>select()<\/code> function. 
Including <code>-<\/code> before a column name tells <code>select()<\/code> to drop that column.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>trump_estimates &lt;- trump_estimates %&gt;%\nselect(-weighted_n, -weighted_group_size)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s what we end up with.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>trump_estimates\n## # A tibble: 70 x 4\n## # Groups: subgroup &#091;14]\n## subgroup_variable subgroup trump_approval weighted_estima\u2026\n##     &lt;chr&gt;               &lt;chr&gt;                &lt;fct&gt;              &lt;dbl&gt;\n## 1 educ_cat           College grad+      Strongly approve        0.267\n## 2 educ_cat           College grad+      Not strongly appr\u2026     0.0715\n## 3 educ_cat           College grad+      Not strongly disa\u2026     0.0870\n## 4 educ_cat           College grad+      Strongly disappro\u2026      0.537\n## 5 educ_cat           College grad+          Refused            0.0373\n## 6 educ_cat         Don't know\/Refuse\u2026   Strongly approve       0.0203\n## 7 educ_cat         Don't know\/Refuse\u2026   Strongly disappro\u2026     0.0704\n## 8 educ_cat         High school grad \u2026   Strongly approve        0.322\n## 9 educ_cat         High school grad \u2026   Not strongly appr\u2026      0.121\n## 10 educ_cat        High school grad \u2026   Not strongly disa\u2026      0.129\n## # \u2026 with 60 more rows<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Arranging the data this way makes it easy to create plots or tables of the variables of interest. Below is an example of a simple plot of the estimates we created in this post. We use the&nbsp;<a href=\"https:\/\/dplyr.tidyverse.org\/reference\/filter.html\" target=\"_blank\" rel=\"noreferrer noopener\">filter()<\/a>&nbsp;function to remove the \u201cRefused\u201d category in the&nbsp;<code>trump_approval<\/code>&nbsp;variable and in any of our subgroup variables. 
Then, we pipe that data into the plotting code from&nbsp;<a href=\"https:\/\/ggplot2.tidyverse.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">ggplot2<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>trump_estimates %&gt;%\nfilter(trump_approval != \"Refused\") %&gt;%\nfilter(!(subgroup %in%\nc(\"Don't know\/Refused (VOL.)\", \"DK\/Ref\"))) %&gt;%\nggplot(\naes(\nx = weighted_estimate,\ny = subgroup\n)\n) +\ngeom_point() +\nscale_x_continuous(limits = c(0, .8),\nbreaks = seq(0, .6, by = .2),\nlabels = scales::percent(\nseq(0, .6, by = .2), accuracy = 1)\n) +\nfacet_grid(cols = vars(trump_approval),\nrows = vars(subgroup_variable),\nscales = \"free_y\",\nspace = \"free\"\n) +\ntheme_bw() +\ntheme(axis.title.y = element_blank())<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-640-wide\"><a rel=\"attachment wp-att-125965\" href=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/\/\/using-tidy-verse-tools-with-pew-research-center-survey-data-in-r\/image-11-png\/\"><img data-dominant-color=\"f5f5f5\" data-has-transparency=\"false\" style=\"--dominant-color: #f5f5f5;\" loading=\"lazy\" decoding=\"async\"  srcset=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-11.png?resize=480,339 480w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-11.png?resize=782,552 782w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-11.png?resize=960,678 960w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-11.png?resize=1200,847 1200w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-11.png?resize=1564,1104 1564w, https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-11.png?resize=1600,1129 1600w, 
https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-11.png?resize=2550,1800 2550w\" sizes=\"(max-width: 480px) 480px, (max-width: 782px) 782px, 640px\" height=\"452\" width=\"640\" src=\"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/image-11.png?w=640\" alt=\"\" class=\"wp-image-125965 not-transparent\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Here is all the necessary code used in the post:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#code from start to finish \n\n##install or update packages if necessary\n#install.packages(\"tidyverse\")\n#install.packages(\"haven\") \n\n##load packages in \nlibrary(tidyverse) #loads all \"core\" tidyverse packages like dplyr, tidyr, forcats, and ggplot2\nlibrary(haven) \n\n##read dataset in with value labels (as_factor)\nApr17 &lt;- read_sav(\"Apr17 public.sav\", user_na = TRUE) %&gt;% as_factor()\n\n##create trump_approval variable and relevel it \nApr17 &lt;- Apr17 %&gt;% \n  mutate(trump_approval = case_when(\n    q1 == \"Approve\" &amp; q1a == \"Very strongly\" ~ \"Strongly approve\",\n    q1 == \"Approve\" &amp; q1a != \"Very strongly\" ~ \"Not strongly approve\",\n    q1 == \"Disapprove\" &amp; q1a == \"Very strongly\" ~ \"Strongly disapprove\",\n    q1 == \"Disapprove\" &amp; q1a != \"Very strongly\" ~ \"Not strongly disapprove\",\n    q1 == \"Don't know\/Refused (VOL.)\" | q1a == \"Don't know\/Refused (VOL.)\" ~ \"Refused\"\n  ) #this parenthesis closes our call to case_when\n  %&gt;% #and then sends it to fct_relevel with %&gt;% \n    fct_relevel(\"Strongly approve\",\n                \"Not strongly approve\",\n                \"Not strongly disapprove\",\n                \"Strongly disapprove\",                             \n                \"Refused\"                \n    ) #this parenthesis closes our call to fct_relevel\n  ) #this parenthesis closes our call to mutate\n\n## collapse education variable 
into 3 categories\nApr17 &lt;- Apr17 %&gt;% \n  mutate(educ_cat = fct_collapse(educ2,\n                                 \"High school grad or less\" = c(\n                                   \"Less than high school (Grades 1-8 or no formal schooling)\",\n                                   \"High school incomplete (Grades 9-11 or Grade 12 with NO diploma)\",                         \n                                   \"High school graduate (Grade 12 with diploma or GED certificate)\"\n                                 ),\n                                 \"Some college\" = c(\n                                   \"Some college, no degree (includes some community college)\",                                \n                                   \"Two year associate degree from a college or university\"\n                                 ),\n                                 \"College grad+\" = c(\n                                   \"Four year college or university degree\/Bachelor's degree (e.g., BS, BA, AB)\",              \n                                   \"Some postgraduate or professional schooling, no postgraduate degree\",                      \n                                   \"Postgraduate or professional degree, including master's, doctorate, medical or law degree\"\n                                 )\n  ) #this parenthesis closes our call to fct_collapse\n  ) #this parenthesis closes our call to mutate\n\n##get trump_approval weighted totals\ntrump_approval &lt;- Apr17 %&gt;% \n  group_by(trump_approval) %&gt;% \n  summarise(weighted_n = sum(weight))\n\n##get trump_approval weighted proportions\ntrump_approval &lt;- Apr17 %&gt;% \n  ##group by trump_approval to calculate weighted totals by taking the sum of the weights\n  group_by(trump_approval) %&gt;% \n  summarise(weighted_n = sum(weight)) %&gt;% \n  ##add the weighted_group_size to get the total weighted n and \n  ##divide weighted_n by weighted_group_size to get the proportions \n  
mutate(weighted_group_size = sum(weighted_n), \n         weighted_estimate = weighted_n \/ weighted_group_size\n  )\n\n\n##get trump_approval by education \ntrump_estimates_educ &lt;- Apr17 %&gt;% \n  #group by educ and trump approval to get weighted n's per group\n  group_by(educ_cat, trump_approval) %&gt;% \n  #calculate the total number of people in each answer and education category using survey weights (weight)\n  summarise(weighted_n = sum(weight)) %&gt;% \n  #group by education to calculate education category size\n  group_by(educ_cat) %&gt;%\n  #add columns for total group size and the proportion\n  mutate(weighted_group_size = sum(weighted_n),\n         weighted_estimate = weighted_n\/weighted_group_size)\n\n##select only the columns needed for this analysis\n###rename psraid to resp_id\nApr17 &lt;- Apr17 %&gt;% \n  select(resp_id = psraid, weight, trump_approval, educ_cat, racethn, gen5)\n\n##create Apr17_long with gather\nApr17_long &lt;- Apr17 %&gt;% \n  #gather educ_cat, racethn, gen5 into two columns: \n  ##a key called \"subgroup_variable\" (educ_cat, racethn, gen5)\n  ##and a value called \"subgroup\" \n  gather(key = subgroup_variable, value = subgroup, educ_cat, racethn, gen5) \n\n##get weighted estimates for every subgroup \ntrump_estimates &lt;- Apr17_long %&gt;% \n  #group by subgroup_variable, subgroup, and trump approval to get weighted n of approval\/disapproval for all subgroup cats\n  group_by(subgroup_variable, subgroup, trump_approval) %&gt;% \n  #calculate the total number of people in each answer and subgroup category using survey weights (weight)\n  summarise(weighted_n = sum(weight)) %&gt;% \n  #group by subgroup only to calculate subgroup category size\n  group_by(subgroup) %&gt;%\n  #add columns for total group size and the proportion\n  mutate(weighted_group_size = sum(weighted_n),\n         weighted_estimate = weighted_n\/weighted_group_size)\n\n#only want proportions so select out total categories\ntrump_estimates &lt;- 
trump_estimates %&gt;% \n  select(-weighted_n, -weighted_group_size) \n\n##create plot\ntrump_estimates %&gt;% \n  ##remove \"Refused\" category for Trump Approval\n  filter(trump_approval != \"Refused\") %&gt;% \n  ##remove Refused categories in our subgroup values\n  filter(!(subgroup %in% c(\"Don't know\/Refused (VOL.)\", \"DK\/Ref\"))) %&gt;% \n  ggplot(\n    aes(\n      x = weighted_estimate,\n      y = subgroup\n    )\n  ) +\n  geom_point() + \n  scale_x_continuous(limits = c(0, .8),\n                     breaks = seq(0, .6, by = .2),\n                     labels = scales::percent(seq(0, .6, by = .2), accuracy = 1)\n  ) +\n  facet_grid(cols = vars(trump_approval),\n             rows = vars(subgroup_variable), \n             scales = \"free_y\",\n             space = \"free\"\n  ) +\n  theme_bw() +\n  theme(axis.title.y = element_blank())<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I wrote an introductory blog post about how to access and analyze Pew Research Center survey data with R, a free, open-source software for statistical analysis. 
The post showed how to perform tasks using the survey package.<\/p>\n","protected":false},"author":655,"featured_media":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"sub_headline":"","sub_title":"","_prc_public_revisions":[],"_ppp_expiration_hours":0,"_ppp_enabled":false,"ai_generated_summary":"","relatedPosts":[],"datacite_doi":"","datacite_doi_citation":"","_prc_seo_qr_attachment_id":0,"spoken_article_player_enabled":true,"displayBylines":true,"footnotes":"","prc_watchers":[],"_prc_fork_parent":0,"_prc_fork_status":"","_prc_active_fork":0},"categories":[357],"bylines":[779],"collection":[],"_post_visibility":[],"decoded-category":[531,532],"formats":[],"_fund_pool":[],"languages":[],"regions-countries":[],"research-teams":[524],"workflow-status":[],"class_list":["post-111603","decoded","type-decoded","status-publish","hentry","category-survey-methods","bylines-nick-hatley","decoded-category-coding-how-to","decoded-category-survey-methods","research-teams-decoded"],"label":"Decoded","post_parent":0,"word_count":5051,"canonical_url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/decoded\/2019\/06\/12\/using-tidy-verse-tools-with-pew-research-center-survey-data-in-r\/","art_direction":{"A1":{"id":125958,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?w=564&h=317&crop=1","width":564,"height":317,"caption":"","chartArt":false},"A2":{"id":125958,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?w=268&h=151&crop=1","width":268,"height":151,"caption":"","chartArt":false},"A3":{"id":125958,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp
-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?w=194&h=110&crop=1","width":194,"height":110,"caption":"","chartArt":false},"A4":{"id":125958,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?w=268&h=151&crop=1","width":268,"height":151,"caption":"","chartArt":false},"XL":{"id":125958,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?w=720&h=405&crop=1","width":720,"height":405,"caption":"","chartArt":false},"social":{"id":125958,"rawUrl":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png","url":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-content\/uploads\/sites\/20\/2022\/08\/06.12.2019_feature.png?w=1200&h=628&crop=1","width":1200,"height":628,"caption":"","chartArt":false}},"_embeds":[],"watchers":[],"table_of_contents":[],"datacite_doi":"","prc_seo_data":{"title":"Using tidyverse tools with Pew Research Center survey data in R","description":"I wrote an introductory blog post about how to access and analyze Pew Research Center survey data with R, a free, open-source software for statistical analysis. The post showed how to perform tasks using the survey package.","og_title":"Using tidyverse tools with Pew Research Center survey data in R","og_description":"I wrote an introductory blog post about how to access and analyze Pew Research Center survey data with R, a free, open-source software for statistical analysis. 
The post showed how to perform tasks using the survey package.","schema_type":"Article","noindex":false,"canonical_url":"","primary_terms":{"category":43},"custom_schema":[],"og_image":125958,"indexnow_submitted_at":null,"gsc_index_status":null},"prepublish_checks":{},"jetpack_sharing_enabled":true,"relatedPostsOrdered":[],"bylinesOrdered":[{"key":"_r6x699i4u","termId":779}],"acknowledgementsOrdered":[],"_links":{"self":[{"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded\/111603","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded"}],"about":[{"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/types\/decoded"}],"author":[{"embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/users\/655"}],"replies":[{"embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/comments?post=111603"}],"version-history":[{"count":2,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded\/111603\/revisions"}],"predecessor-version":[{"id":138527,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded\/111603\/revisions\/138527"}],"wp:attachment":[{"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/media?parent=111603"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/categories?post=111603"},{"taxonomy":"bylines","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/bylines?post=111603"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/collection?post=111603"},{"taxonomy":"_post_visibility","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/_post_visibility?post=111603"},{"taxonomy
":"decoded-category","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/decoded-category?post=111603"},{"taxonomy":"formats","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/formats?post=111603"},{"taxonomy":"_fund_pool","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/_fund_pool?post=111603"},{"taxonomy":"languages","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/languages?post=111603"},{"taxonomy":"regions-countries","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/regions-countries?post=111603"},{"taxonomy":"research-teams","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/research-teams?post=111603"},{"taxonomy":"workflow-status","embeddable":true,"href":"https:\/\/alpha.pewresearch.org\/pewresearch-org\/wp-json\/wp\/v2\/workflow-status?post=111603"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}