Discrepancy Analysis in Excel and R

I recently participated in a discrepancy analysis exercise where lists from similar data sources needed to be compared to see where data did not match. The exercise asked that pivot tables be used for the analysis and summary.

After participating in the exercise I performed some additional research and model building for discrepancy analysis and I am sharing that information here. I am starting by showing a pivot table solution first and then an R-based solution (which I feel is much better for a real world application).

I found a great YouTube video on checking two lists in Excel for discrepancies by excelisfun and I am going to use the data sets and operational questions from that video in this exercise.

The example data sets are as follows:

List 1

Product ID Description Wholesale Price
Prod1 Desc1 1.45
Prod Desc2 1.27
Prod3 Desc 3.62
Prod4 Desc4 2.15
Prod5 Desc5 2.87
Prod12 Desc6 2.53
Prod7 Desc7 2.2
Prod8 Desc8 2.6
Prod9 Desc9 3.68

List 2

Product ID Description Wholesale Price
Prod1 Desc1 1.45
Prod2 Desc2 1.27
Prod3 Desc3 3.62
Prod4 Desc4 1.57
Prod5 Desc5 2.87
Prod6 Desc6 2.53
Prod7 Desc7 2.2
Prod8 Desc8 3.21
Prod9 Desc9 1.81

The Excel workbook containing my pivot table solution and the non-pivot-table solution from excelisfun can be downloaded here.

The objective of the model is to see if we can answer the following 3 questions:

  1. Is the product ID in the other list?
  2. If the product ID is on the other list, does the description match?
  3. If the product ID is on the other list, do the wholesale prices match?

Pivot Table Excel Model

Step 1: Combine the two lists into one list, adding a column providing a distinguishing characteristic (eg. List 1, List 2)

Product ID Description Wholesale Price List
Prod1 Desc1 1.45 List 1
Prod Desc2 1.27 List 1
Prod3 Desc 3.62 List 1
Prod4 Desc4 2.15 List 1
Prod5 Desc5 2.87 List 1
Prod12 Desc6 2.53 List 1
Prod7 Desc7 2.2 List 1
Prod8 Desc8 2.6 List 1
Prod9 Desc9 3.68 List 1
Prod1 Desc1 1.45 List 2
Prod2 Desc2 1.27 List 2
Prod3 Desc3 3.62 List 2
Prod4 Desc4 1.57 List 2
Prod5 Desc5 2.87 List 2
Prod6 Desc6 2.53 List 2
Prod7 Desc7 2.2 List 2
Prod8 Desc8 3.21 List 2
Prod9 Desc9 1.81 List 2

Step 2: Create pivot table with Product ID in the rows, List in the columns, and Product ID and Wholesale Price in the values.

Step 3: Apply conditional formatting to the pivot table to highlight the analysis points

discrepancy_pivot_20150602

The conditional formatting highlights the answers to questions 1 and 3; unfortunately it appears that we cannot answer question 2 (do descriptions match?) using a pivot table model (a formula-based model can check for an exact match between cells containing strings but I believe the pivot table will only indicate if a description exists via a COUNT function)..

We can answer questions 1 and 3 using the pivot table model:

Is the product ID in the other list?

Prod and Prod12 are in List 1 but not in List 2.

Prod2 and Prod6 are in List 2 but not in List 1.

If the product ID is on the other list, do the wholesale prices match?

Wholesale prices do not match for the following Product IDs (where the Product ID exists in both lists): Prod4, Prod8, and Prod9.

Note: If you are interested in the non-pivot-table Excel model that can answer all three questions, watch the YouTube video: Excel Magic Trick 528: Check Two Lists For Discrepancies MATCH and VLOOKUP functions

R Model

While Excel models are nice, what I really want is an automated solution that prints out the discrepancies and tells me what those discrepancies are, allowing me to get to the work of root cause analysis and data correction. The R code here provides my desired solution.

# repmis package is needed to pull data from dropbox
# plyr package is needed to rename columns
require(“repmis”)
require(“plyr”)

# load the CSV data directly from dropbox

list_1 <- source_DropboxData(“list_1.csv”, key=”tv93kl0awjsj12l”, sep=”,”, header=TRUE)

list_2 <- source_DropboxData(“list_2.csv”, key=”3e1uo51ynm8i3ow”, sep=”,”, header=TRUE)

# review the data

list_1

list_2

# column names appear to get changed when pulling from Dropbox; rename columns

list_1 <- rename(list_1, c(“Product ID” = “Product.ID”, “Wholesale Price” = “Wholesale.Price”, stringsAsFactors = FALSE))

list_2 <- rename(list_2, c(“Product ID” = “Product.ID”, “Wholesale Price” = “Wholesale.Price”, stringsAsFactors = FALSE))

# create a new data frame by merging list_1 and list_2 but keeping all columns

x <- merge(list_1, list_2, by=”Product.ID”)

# let’s look at the new data frame

x

# Print out the columns with those that have a mismatch in price or description:

x[x$Description.x != x$Description.y | x$Wholesale.Price.x != x$Wholesale.Price.y, ]

# Output containing the data that would require investigation

r_output_discrepancy_analysis_20150605

The R file is available at https://github.com/augustguadamuz/blog

Big thank you to jeremycg at stackoverflow.com for his assistance in developing the R solution.

Converting JSON to Google Sheet

Converting JSON to Google Sheet
I have worked on a few projects recently where the data pull was in JSON format and not the spreadsheet-ready database pulls that I am used to from my SQL queries against a database such as MySQL, Vertica, Postgres, etc. I am not a programmer and the JSON format is not something that I am used to working with comfortably yet, so I set out to investigate ways to pull JSON formatted data and get it into a working format (in this case a spreadsheet). I am planning to write soon on techniques to pull and parse data in RStudio.

My hope is that this post may help some people get to work with their data quickly without having to wrestle too much with parsing the data into an analysis-friendly format.

What is JSON?
JSON (JavaScript Object Notation) is described by http://json.org/ as a lightweight data-interchangeable format that is easy for humans to read and write and easy for machines to generate. Apparently JSON is an ideal data interchange language as it is programming language independent but uses conventions familiar to programmers of the C-family of languages.

My recent experiences pulling JSON formatted data have been via a pull from an application, such as Keen IO, or by using cURL and a URL provided by a developer to pull the data needed for analysis.

If you want to see an example of JSON formatted data, take a look at the data on Lady Gaga from graph.facebook.com: http://graph.facebook.com/ladygaga

How to pull JSON data into Google Sheets:
You will need to set up Google Sheets with the functionality to pull the data, the steps are as follows:

  1. Open Google Docs and go to Tools > Script editor
  2. Select “Create script for” > “Custom functions in sheets”
  3. Delete the existing code
  4. Paste in the import_json_appsscript.js code (Thank you Chris Keller for making this code! available on your GitHub repository)
    1. Tip: click the “Raw” button at the upper-right on the GitHub repository as this will allow you to select all and copy the code more easily.
  5. Rename the script, I used ImportJSON.gs

Now you should be able to type =ImportJSON() into a cell in a Google Sheet, including the URL/path to pull and format the JSON data.

IMPORTANT NOTE: I noticed that in Google sheets the =ImportJSON() command does not show up in the list of functions that automatically populates when you begin to type in the function, so you will need to type in the entire =ImportJSON() command in the cell (make sure to wrap the URL/path within quotes inside the parenthesis).

Let’s test it out:
Click here to see a clip of my test to generate some JSON data and pull it into Google Sheets.

I want to acknowledge the articles and that I pulled from to write this post:

Hello world!

My name is August Guadamuz. I consider myself to be a person who is constantly learning, specifically in the areas of data analysis, forecasting, operations research/management science (the science of better), and optimization. I am passionate about Six Sigma process improvement as a means for organizations to eliminate waste, run lean, and operate at peak efficiency. I am a student of R, the software environment for statistical computing and graphics.

I am currently working through the book Machine Learning with R and will be posting my work from that project. One of my goals is to publish to R-Bloggers (http://www.r-bloggers.com/). I am hoping to write on forecasting in R as well.

Outside of analytics, I am interested in politics, popular culture, training in muay thai kickboxing, skeet shooting, fishing, hiking, comic books and committing random acts of kindness.

I hope to connect with other people interested in analytics, statistics, and separating the signal from the noise. I hope that if you are reading this, you may be one of those people.