Comparing data frames
Problem
You want to do compare two or more data frames and find rows that appear in more than one data frame, or rows that appear only in one data frame.
Solution
An example
Suppose you have the following three data frames, and you want to know whether each row from each data frame appears in at least one of the other data frames.
dfA <- data.frame(Subject=c(1,1,2,2), Response=c("X","X","X","X"))
dfA
#> Subject Response
#> 1 1 X
#> 2 1 X
#> 3 2 X
#> 4 2 X
dfB <- data.frame(Subject=c(1,2,3), Response=c("X","Y","X"))
dfB
#> Subject Response
#> 1 1 X
#> 2 2 Y
#> 3 3 X
dfC <- data.frame(Subject=c(1,2,3), Response=c("Z","Y","Z"))
dfC
#> Subject Response
#> 1 1 Z
#> 2 2 Y
#> 3 3 Z
In dfA, the rows containing (1,X) also appear in dfB, but the rows containing (2,X) do not appear in any of the other data frames. Similarly, dfB contains (1,X) which appears in dfA, and (2,Y), which appears in dfC, but (3,X) does not appear in any other data frame.
You might wish to mark the rows which are duplicated in another data frame, or which are unique to each data frame.
Joining the data frames
To proceed, first join the three data frames with a column identifying which source each row came from. It’s called Coder
here because this could be data coded by three different people. In this case, you might wish to find where the coders agreed, or where they disagreed.
dfA$Coder <- "A"
dfB$Coder <- "B"
dfC$Coder <- "C"
df <- rbind(dfA, dfB, dfC) # Stick them together
df <- df[,c("Coder", "Subject", "Response")] # Reorder the columns to look nice
df
#> Coder Subject Response
#> 1 A 1 X
#> 2 A 1 X
#> 3 A 2 X
#> 4 A 2 X
#> 5 B 1 X
#> 6 B 2 Y
#> 7 B 3 X
#> 8 C 1 Z
#> 9 C 2 Y
#> 10 C 3 Z
If your data starts out in this format, then there’s obviously no need to join it together.
Finding duplicated rows
Using the function dupsBetweenGroups
(defined below), we can find which rows are duplicated between different groups:
# Find the rows which have duplicates in a different group.
dupRows <- dupsBetweenGroups(df, "Coder")
# Print it alongside the data frame
cbind(df, dup=dupRows)
#> Coder Subject Response dup
#> 1 A 1 X TRUE
#> 2 A 1 X TRUE
#> 3 A 2 X FALSE
#> 4 A 2 X FALSE
#> 5 B 1 X TRUE
#> 6 B 2 Y TRUE
#> 7 B 3 X FALSE
#> 8 C 1 Z FALSE
#> 9 C 2 Y TRUE
#> 10 C 3 Z FALSE
Note that this does not mark duplicated rows within a group. With Coder=A, there are two rows with Subject=1 and Response=X, but these are not marked as duplicates.
Finding unique rows
It’s also possible to find the rows that are unique within each group:
cbind(df, unique=!dupRows)
#> Coder Subject Response unique
#> 1 A 1 X FALSE
#> 2 A 1 X FALSE
#> 3 A 2 X TRUE
#> 4 A 2 X TRUE
#> 5 B 1 X FALSE
#> 6 B 2 Y FALSE
#> 7 B 3 X TRUE
#> 8 C 1 Z TRUE
#> 9 C 2 Y FALSE
#> 10 C 3 Z TRUE
Splitting apart the data frame
If you wish to split the joined data frame into the three original data frames
# Store the results in df
dfDup <- cbind(df, dup=dupRows)
dfA <- subset(dfDup, Coder=="A", select=-Coder)
dfA
#> Subject Response dup
#> 1 1 X TRUE
#> 2 1 X TRUE
#> 3 2 X FALSE
#> 4 2 X FALSE
dfB <- subset(dfDup, Coder=="B", select=-Coder)
dfB
#> Subject Response dup
#> 5 1 X TRUE
#> 6 2 Y TRUE
#> 7 3 X FALSE
dfC <- subset(dfDup, Coder=="C", select=-Coder)
dfC
#> Subject Response dup
#> 8 1 Z FALSE
#> 9 2 Y TRUE
#> 10 3 Z FALSE
Ignoring columns
It is also possible to ignore one or more columns, by removing that column from the data frame that is passed to the function. The results can be joined to the original complete data frame if desired.
# Ignore the Subject column -- only use Response
dfNoSub <- subset(df, select=-Subject)
dfNoSub
#> Coder Response
#> 1 A X
#> 2 A X
#> 3 A X
#> 4 A X
#> 5 B X
#> 6 B Y
#> 7 B X
#> 8 C Z
#> 9 C Y
#> 10 C Z
# Check for duplicates
dupRows <- dupsBetweenGroups(dfNoSub, "Coder")
# Join the result to the original data frame
cbind(df, dup=dupRows)
#> Coder Subject Response dup
#> 1 A 1 X TRUE
#> 2 A 1 X TRUE
#> 3 A 2 X TRUE
#> 4 A 2 X TRUE
#> 5 B 1 X TRUE
#> 6 B 2 Y TRUE
#> 7 B 3 X TRUE
#> 8 C 1 Z FALSE
#> 9 C 2 Y TRUE
#> 10 C 3 Z FALSE
dupsBetweenGroups function
This is the function that does all the work:
dupsBetweenGroups <- function (df, idcol) {
# df: the data frame
# idcol: the column which identifies the group each row belongs to
# Get the data columns to use for finding matches
datacols <- setdiff(names(df), idcol)
# Sort by idcol, then datacols. Save order so we can undo the sorting later.
sortorder <- do.call(order, df)
df <- df[sortorder,]
# Find duplicates within each id group (first copy not marked)
dupWithin <- duplicated(df)
# With duplicates within each group filtered out, find duplicates between groups.
# Need to scan up and down with duplicated() because first copy is not marked.
dupBetween = rep(NA, nrow(df))
dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols])
dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols], fromLast=TRUE) | dupBetween[!dupWithin]
# ============= Replace NA's with previous non-NA value ==============
# This is why we sorted earlier - it was necessary to do this part efficiently
# Get indexes of non-NA's
goodIdx <- !is.na(dupBetween)
# These are the non-NA values from x only
# Add a leading NA for later use when we index into this vector
goodVals <- c(NA, dupBetween[goodIdx])
# Fill the indices of the output vector with the indices pulled from
# these offsets of goodVals. Add 1 to avoid indexing to zero.
fillIdx <- cumsum(goodIdx)+1
# The original vector, now with gaps filled
dupBetween <- goodVals[fillIdx]
# Undo the original sort
dupBetween[sortorder] <- dupBetween
# Return the vector of which entries are duplicated across groups
return(dupBetween)
}
Notes
To find exact duplicate rows within a single dataframe, see ../Finding and removing duplicate records.