I recently participated in a discrepancy analysis exercise where lists from similar data sources needed to be compared to see where data did not match. The exercise asked that pivot tables be used for the analysis and summary.

After participating in the exercise I performed some additional research and model building for discrepancy analysis and I am sharing that information here. I am starting by showing a pivot table solution first and then an R-based solution (which I feel is much better for a real world application).

I found a great YouTube video on checking two lists in Excel for discrepancies by excelisfun and I am going to use the data sets and operational questions from that video in this exercise.

The example data sets are as follows:

List 1

Product ID | Description | Wholesale Price |

Prod1 | Desc1 | 1.45 |

Prod | Desc2 | 1.27 |

Prod3 | Desc | 3.62 |

Prod4 | Desc4 | 2.15 |

Prod5 | Desc5 | 2.87 |

Prod12 | Desc6 | 2.53 |

Prod7 | Desc7 | 2.2 |

Prod8 | Desc8 | 2.6 |

Prod9 | Desc9 | 3.68 |

List 2

Product ID | Description | Wholesale Price |

Prod1 | Desc1 | 1.45 |

Prod2 | Desc2 | 1.27 |

Prod3 | Desc3 | 3.62 |

Prod4 | Desc4 | 1.57 |

Prod5 | Desc5 | 2.87 |

Prod6 | Desc6 | 2.53 |

Prod7 | Desc7 | 2.2 |

Prod8 | Desc8 | 3.21 |

Prod9 | Desc9 | 1.81 |

The Excel workbook containing my pivot table solution and the non-pivot-table solution from excelisfun can be downloaded here.

The objective of the model is to see if we can answer the following 3 questions:

- Is the product ID in the other list?
- If the product ID is on the other list, does the description match?
- If the product ID is on the other list, do the wholesale prices match?

**Pivot Table Excel Model**

**Step 1**: Combine the two lists into one list, adding a column providing a distinguishing characteristic (eg. List 1, List 2)

Product ID | Description | Wholesale Price | List |

Prod1 | Desc1 | 1.45 | List 1 |

Prod | Desc2 | 1.27 | List 1 |

Prod3 | Desc | 3.62 | List 1 |

Prod4 | Desc4 | 2.15 | List 1 |

Prod5 | Desc5 | 2.87 | List 1 |

Prod12 | Desc6 | 2.53 | List 1 |

Prod7 | Desc7 | 2.2 | List 1 |

Prod8 | Desc8 | 2.6 | List 1 |

Prod9 | Desc9 | 3.68 | List 1 |

Prod1 | Desc1 | 1.45 | List 2 |

Prod2 | Desc2 | 1.27 | List 2 |

Prod3 | Desc3 | 3.62 | List 2 |

Prod4 | Desc4 | 1.57 | List 2 |

Prod5 | Desc5 | 2.87 | List 2 |

Prod6 | Desc6 | 2.53 | List 2 |

Prod7 | Desc7 | 2.2 | List 2 |

Prod8 | Desc8 | 3.21 | List 2 |

Prod9 | Desc9 | 1.81 | List 2 |

**Step 2:** Create pivot table with Product ID in the rows, List in the columns, and Product ID and Wholesale Price in the values.

**Step 3**: Apply conditional formatting to the pivot table to highlight the analysis points

The conditional formatting highlights the answers to questions 1 and 3; unfortunately it appears that we cannot answer question 2 (do descriptions match?) using a pivot table model (a formula-based model can check for an exact match between cells containing strings but I believe the pivot table will only indicate if a description exists via a COUNT function)..

We can answer questions 1 and 3 using the pivot table model:

Is the product ID in the other list?

Prod and Prod12 are in List 1 but not in List 2.

Prod2 and Prod6 are in List 2 but not in List 1.

If the product ID is on the other list, do the wholesale prices match?

Wholesale prices do not match for the following Product IDs (where the Product ID exists in both lists): Prod4, Prod8, and Prod9.

**Note:** If you are interested in the non-pivot-table Excel model that can answer all three questions, watch the YouTube video: Excel Magic Trick 528: Check Two Lists For Discrepancies MATCH and VLOOKUP functions

**R Model**

While Excel models are nice, what I really want is an automated solution that prints out the discrepancies and tells me what those discrepancies are, allowing me to get to the work of root cause analysis and data correction. The R code here provides my desired solution.

*# repmis package is needed to pull data from dropbox*

*# plyr package is needed to rename columns*

*require(“repmis”)*

*require(“plyr”)*

*# load the CSV data directly from dropbox*

*list_1 <- source_DropboxData(“list_1.csv”, key=”tv93kl0awjsj12l”, sep=”,”, header=TRUE)*

*list_2 <- source_DropboxData(“list_2.csv”, key=”3e1uo51ynm8i3ow”, sep=”,”, header=TRUE)*

*# review the data*

*list_1*

*list_2*

*# column names appear to get changed when pulling from Dropbox; rename columns*

*list_1 <- rename(list_1, c(“Product ID” = “Product.ID”, “Wholesale Price” = “Wholesale.Price”, stringsAsFactors = FALSE))*

*list_2 <- rename(list_2, c(“Product ID” = “Product.ID”, “Wholesale Price” = “Wholesale.Price”, stringsAsFactors = FALSE))*

*# create a new data frame by merging list_1 and list_2 but keeping all columns *

*x <- merge(list_1, list_2, by=”Product.ID”)*

*# let’s look at the new data frame*

*x*

*# Print out the columns with those that have a mismatch in price or description:*

*x[x$Description.x != x$Description.y | x$Wholesale.Price.x != x$Wholesale.Price.y, ]*

*# Output containing the data that would require investigation*

The R file is available at https://github.com/augustguadamuz/blog

Big thank you to jeremycg at stackoverflow.com for his assistance in developing the R solution.