Weighted survey data with Power BI compared to dplyr, SQL or survey by @ellis2013nz

R-bloggers 2018-04-10

Summary:

A conundrum for Microsoft Power BI I’ve been familiarising myself with Microsoft Power BI, which features prominently in any current discussion on data analysis and dissemination tools for organisations. It’s a good tool with some nice features. I won’t try to do a full review here, but just ruminate on one aspect - setting it up for non-specialists to explore weighted survey data. For this, I want to be able to do appropriately weighted cross-tabs, but I’m not expecting anything that is either more sophisticated or more upstream in the data processing chain. Actually creating the weights, and estimating sampling uncertainty based on them, is something for another tool like R. Judging from discussion threads like this one I’m not the only one who wishes you could just say “apply case weights” in the way that you would with SPSS or a market research cross tab tool. In fact, there are some tutorials out there on elaborate and painful ways of getting around this problem that seem totally surreal to me, being used to the ease with which R or Stata deal with such problems. Caveat on what follows - my total experience with Power BI can be measured in hours rather than days, so please take the below with a grain of salt. I may have missed something important. I was worried that inability to deal with weighted data could be a deal breaker for the purpose I was thinking of, and when I found out that a recent release proudly touted the ability to do a weighted average (in a way that didn’t even help me much) I nearly gave up on it in disgust. Power BI lets the developer write R code and at one point I was considering the successful workflow was to pass everything through to R and send it back, before realising that this made no sense at all - might as well do the whole thing in R if that’s what it takes. However, a few hours of experimentation and trying to get my head around a different way of thinking, and it turns out the solution wasn’t too difficult. It all comes down to understanding the way Power BI differentiates between static columns of data as opposed to measures which are calculated on the fly. Once I’d cracked the problem I made a couple of Power BI reports with weighted microdata from complex surveys to be sure it worked generally. Here’s one that’s been made with public data, the New Zealand International Visitor Survey. It took about 20 minutes to make this, after I’d familiarised myself with the toolkit on another (non-public) dataset. It’s live and interactive, in fact interactive in too many ways to try to describe, so just have a play with it: Disclaimer - I’ve been responsible for that survey in the past, but not for more than a year now. What follows is very much written as a private citizen. Introducing some data This example survey is one I’ve blogged about before. It’s an on-going survey of 5,000 to 10,000 tourists per year on their departure from New Zealand. Sample size, questionnaire and mode have varied over time, but the Ministry of Business, Innovation and Employment publish a backcast set of the microdata that is as comparable across time as is possible. It’s about 24MB to download. For today’s demo, I’m only going to use the simplest part of the data - the vw_IVSSurveyMainHeader table which has one row for each of the 125,000 respondents since 1997. Here’s code to download it, including a couple of convenience R functions that MBIE use to help classify countries into groupings (dated I’m afraid - I can criticise them because I wrote them myself in 2011). I also reduce the dataset to just 8 columns so when I get into Power BI I won’t have to deal with the complexity of the full data: library(readr) library(survey) library(dplyr) source("https://github.com/nz-mbie/mbie-r-package-public/raw/master/pkg/R/CountryManip.R") # for CountryGroup source("https://github.com/nz-mbie/mbie-r-package-public/raw/master/pkg/R/NZTourism.R") # for rename.levels # download survey data from the MBIE (Ministry of Business, Innovation and Employment) website download.file("http://www.mbie.govt.nz/info-services/sectors-industries/tourism/tourism-research-data/ivs/documents-image-library/vw_IVS.zip", mode = "wb", destfile = "vw_IVS.zip") unzip("vw_IVS.zip") ivs % select(Country, Year, PopulationWeight, WeightedSpend, POV, Gender, AgeRange, SurveyResponseID) %__% rename(Spend = WeightedSpend) %__% as_tibble() %__% arrange(Year) # save a copy to use as the data source for Power BI write_delim(ivs_sub, path = "ivs-1997-to-2017.txt", delim = "|")Of these variables: POV stands for “purpose of visit”, a key concept in tourism data analysis that will be familiar to travellers from many countries’ arrival or departure cards. WeightedSpend actually means “outlier-treated spend” PopulationWeight is the survey weight, after all sorts of complex post-stratification including for age, gender, airport, country of residence and purpose of visit. Now, I’m interested in weighted counts of people for various combinations of dimensions (like year and purpose), and also in weighted

Link:

http://feedproxy.google.com/~r/RBloggers/~3/jawTOxSitWQ/

From feeds:

Statistics and Visualization » R-bloggers

Tags:

Authors:

free range statistics - R

Date tagged:

04/10/2018, 21:27

Date published:

04/10/2018, 08:00