Discrepancy Analysis in Excel and R

I recently participated in a discrepancy analysis exercise where lists from similar data sources needed to be compared to see where data did not match. The exercise asked that pivot tables be used for the analysis and summary.

After participating in the exercise I performed some additional research and model building for discrepancy analysis and I am sharing that information here. I am starting by showing a pivot table solution first and then an R-based solution (which I feel is much better for a real world application).

I found a great YouTube video on checking two lists in Excel for discrepancies by excelisfun and I am going to use the data sets and operational questions from that video in this exercise.

The example data sets are as follows:

List 1

Product ID Description Wholesale Price
Prod1 Desc1 1.45
Prod Desc2 1.27
Prod3 Desc 3.62
Prod4 Desc4 2.15
Prod5 Desc5 2.87
Prod12 Desc6 2.53
Prod7 Desc7 2.2
Prod8 Desc8 2.6
Prod9 Desc9 3.68

List 2

Product ID Description Wholesale Price
Prod1 Desc1 1.45
Prod2 Desc2 1.27
Prod3 Desc3 3.62
Prod4 Desc4 1.57
Prod5 Desc5 2.87
Prod6 Desc6 2.53
Prod7 Desc7 2.2
Prod8 Desc8 3.21
Prod9 Desc9 1.81

The Excel workbook containing my pivot table solution and the non-pivot-table solution from excelisfun can be downloaded here.

The objective of the model is to see if we can answer the following 3 questions:

  1. Is the product ID in the other list?
  2. If the product ID is on the other list, does the description match?
  3. If the product ID is on the other list, do the wholesale prices match?

Pivot Table Excel Model

Step 1: Combine the two lists into one list, adding a column providing a distinguishing characteristic (eg. List 1, List 2)

Product ID Description Wholesale Price List
Prod1 Desc1 1.45 List 1
Prod Desc2 1.27 List 1
Prod3 Desc 3.62 List 1
Prod4 Desc4 2.15 List 1
Prod5 Desc5 2.87 List 1
Prod12 Desc6 2.53 List 1
Prod7 Desc7 2.2 List 1
Prod8 Desc8 2.6 List 1
Prod9 Desc9 3.68 List 1
Prod1 Desc1 1.45 List 2
Prod2 Desc2 1.27 List 2
Prod3 Desc3 3.62 List 2
Prod4 Desc4 1.57 List 2
Prod5 Desc5 2.87 List 2
Prod6 Desc6 2.53 List 2
Prod7 Desc7 2.2 List 2
Prod8 Desc8 3.21 List 2
Prod9 Desc9 1.81 List 2

Step 2: Create pivot table with Product ID in the rows, List in the columns, and Product ID and Wholesale Price in the values.

Step 3: Apply conditional formatting to the pivot table to highlight the analysis points

discrepancy_pivot_20150602

The conditional formatting highlights the answers to questions 1 and 3; unfortunately it appears that we cannot answer question 2 (do descriptions match?) using a pivot table model (a formula-based model can check for an exact match between cells containing strings but I believe the pivot table will only indicate if a description exists via a COUNT function)..

We can answer questions 1 and 3 using the pivot table model:

Is the product ID in the other list?

Prod and Prod12 are in List 1 but not in List 2.

Prod2 and Prod6 are in List 2 but not in List 1.

If the product ID is on the other list, do the wholesale prices match?

Wholesale prices do not match for the following Product IDs (where the Product ID exists in both lists): Prod4, Prod8, and Prod9.

Note: If you are interested in the non-pivot-table Excel model that can answer all three questions, watch the YouTube video: Excel Magic Trick 528: Check Two Lists For Discrepancies MATCH and VLOOKUP functions

R Model

While Excel models are nice, what I really want is an automated solution that prints out the discrepancies and tells me what those discrepancies are, allowing me to get to the work of root cause analysis and data correction. The R code here provides my desired solution.

# repmis package is needed to pull data from dropbox
# plyr package is needed to rename columns
require(“repmis”)
require(“plyr”)

# load the CSV data directly from dropbox

list_1 <- source_DropboxData(“list_1.csv”, key=”tv93kl0awjsj12l”, sep=”,”, header=TRUE)

list_2 <- source_DropboxData(“list_2.csv”, key=”3e1uo51ynm8i3ow”, sep=”,”, header=TRUE)

# review the data

list_1

list_2

# column names appear to get changed when pulling from Dropbox; rename columns

list_1 <- rename(list_1, c(“Product ID” = “Product.ID”, “Wholesale Price” = “Wholesale.Price”, stringsAsFactors = FALSE))

list_2 <- rename(list_2, c(“Product ID” = “Product.ID”, “Wholesale Price” = “Wholesale.Price”, stringsAsFactors = FALSE))

# create a new data frame by merging list_1 and list_2 but keeping all columns

x <- merge(list_1, list_2, by=”Product.ID”)

# let’s look at the new data frame

x

# Print out the columns with those that have a mismatch in price or description:

x[x$Description.x != x$Description.y | x$Wholesale.Price.x != x$Wholesale.Price.y, ]

# Output containing the data that would require investigation

r_output_discrepancy_analysis_20150605

The R file is available at https://github.com/augustguadamuz/blog

Big thank you to jeremycg at stackoverflow.com for his assistance in developing the R solution.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s