Comparing data frames

Problem
Solution

Problem

You want to do compare two or more data frames and find rows that appear in more than one data frame, or rows that appear only in one data frame.

Solution

An example

Suppose you have the following three data frames, and you want to know whether each row from each data frame appears in at least one of the other data frames.

dfA <- data.frame(Subject=c(1,1,2,2), Response=c("X","X","X","X"))
dfA
#>   Subject Response
#> 1       1        X
#> 2       1        X
#> 3       2        X
#> 4       2        X

dfB <- data.frame(Subject=c(1,2,3), Response=c("X","Y","X"))
dfB
#>   Subject Response
#> 1       1        X
#> 2       2        Y
#> 3       3        X

dfC <- data.frame(Subject=c(1,2,3), Response=c("Z","Y","Z"))
dfC
#>   Subject Response
#> 1       1        Z
#> 2       2        Y
#> 3       3        Z

In dfA, the rows containing (1,X) also appear in dfB, but the rows containing (2,X) do not appear in any of the other data frames. Similarly, dfB contains (1,X) which appears in dfA, and (2,Y), which appears in dfC, but (3,X) does not appear in any other data frame.

You might wish to mark the rows which are duplicated in another data frame, or which are unique to each data frame.

Joining the data frames

To proceed, first join the three data frames with a column identifying which source each row came from. It’s called Coder here because this could be data coded by three different people. In this case, you might wish to find where the coders agreed, or where they disagreed.

dfA$Coder <- "A"
dfB$Coder <- "B"
dfC$Coder <- "C"

df <- rbind(dfA, dfB, dfC)                    # Stick them together
df <- df[,c("Coder", "Subject", "Response")]  # Reorder the columns to look nice
df
#>    Coder Subject Response
#> 1      A       1        X
#> 2      A       1        X
#> 3      A       2        X
#> 4      A       2        X
#> 5      B       1        X
#> 6      B       2        Y
#> 7      B       3        X
#> 8      C       1        Z
#> 9      C       2        Y
#> 10     C       3        Z

If your data starts out in this format, then there’s obviously no need to join it together.

Finding duplicated rows

Using the function dupsBetweenGroups (defined below), we can find which rows are duplicated between different groups:

# Find the rows which have duplicates in a different group.
dupRows <- dupsBetweenGroups(df, "Coder")

# Print it alongside the data frame
cbind(df, dup=dupRows)
#>    Coder Subject Response   dup
#> 1      A       1        X  TRUE
#> 2      A       1        X  TRUE
#> 3      A       2        X FALSE
#> 4      A       2        X FALSE
#> 5      B       1        X  TRUE
#> 6      B       2        Y  TRUE
#> 7      B       3        X FALSE
#> 8      C       1        Z FALSE
#> 9      C       2        Y  TRUE
#> 10     C       3        Z FALSE

Note that this does not mark duplicated rows within a group. With Coder=A, there are two rows with Subject=1 and Response=X, but these are not marked as duplicates.

Finding unique rows

It’s also possible to find the rows that are unique within each group:

cbind(df, unique=!dupRows)
#>    Coder Subject Response unique
#> 1      A       1        X  FALSE
#> 2      A       1        X  FALSE
#> 3      A       2        X   TRUE
#> 4      A       2        X   TRUE
#> 5      B       1        X  FALSE
#> 6      B       2        Y  FALSE
#> 7      B       3        X   TRUE
#> 8      C       1        Z   TRUE
#> 9      C       2        Y  FALSE
#> 10     C       3        Z   TRUE

Splitting apart the data frame

If you wish to split the joined data frame into the three original data frames

# Store the results in df
dfDup <- cbind(df, dup=dupRows)

dfA <- subset(dfDup, Coder=="A", select=-Coder)
dfA
#>   Subject Response   dup
#> 1       1        X  TRUE
#> 2       1        X  TRUE
#> 3       2        X FALSE
#> 4       2        X FALSE

dfB <- subset(dfDup, Coder=="B", select=-Coder)
dfB
#>   Subject Response   dup
#> 5       1        X  TRUE
#> 6       2        Y  TRUE
#> 7       3        X FALSE

dfC <- subset(dfDup, Coder=="C", select=-Coder)
dfC
#>    Subject Response   dup
#> 8        1        Z FALSE
#> 9        2        Y  TRUE
#> 10       3        Z FALSE

Ignoring columns

It is also possible to ignore one or more columns, by removing that column from the data frame that is passed to the function. The results can be joined to the original complete data frame if desired.

# Ignore the Subject column -- only use Response
dfNoSub <- subset(df, select=-Subject)
dfNoSub
#>    Coder Response
#> 1      A        X
#> 2      A        X
#> 3      A        X
#> 4      A        X
#> 5      B        X
#> 6      B        Y
#> 7      B        X
#> 8      C        Z
#> 9      C        Y
#> 10     C        Z

# Check for duplicates
dupRows <- dupsBetweenGroups(dfNoSub, "Coder")

# Join the result to the original data frame
cbind(df, dup=dupRows)
#>    Coder Subject Response   dup
#> 1      A       1        X  TRUE
#> 2      A       1        X  TRUE
#> 3      A       2        X  TRUE
#> 4      A       2        X  TRUE
#> 5      B       1        X  TRUE
#> 6      B       2        Y  TRUE
#> 7      B       3        X  TRUE
#> 8      C       1        Z FALSE
#> 9      C       2        Y  TRUE
#> 10     C       3        Z FALSE

dupsBetweenGroups function

This is the function that does all the work:

dupsBetweenGroups <- function (df, idcol) {
    # df: the data frame
    # idcol: the column which identifies the group each row belongs to

    # Get the data columns to use for finding matches
    datacols <- setdiff(names(df), idcol)

    # Sort by idcol, then datacols. Save order so we can undo the sorting later.
    sortorder <- do.call(order, df)
    df <- df[sortorder,]

    # Find duplicates within each id group (first copy not marked)
    dupWithin <- duplicated(df)

    # With duplicates within each group filtered out, find duplicates between groups. 
    # Need to scan up and down with duplicated() because first copy is not marked.
    dupBetween = rep(NA, nrow(df))
    dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols])
    dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols], fromLast=TRUE) | dupBetween[!dupWithin]

    # ============= Replace NA's with previous non-NA value ==============
    # This is why we sorted earlier - it was necessary to do this part efficiently

    # Get indexes of non-NA's
    goodIdx <- !is.na(dupBetween)

    # These are the non-NA values from x only
    # Add a leading NA for later use when we index into this vector
    goodVals <- c(NA, dupBetween[goodIdx])

    # Fill the indices of the output vector with the indices pulled from
    # these offsets of goodVals. Add 1 to avoid indexing to zero.
    fillIdx <- cumsum(goodIdx)+1

    # The original vector, now with gaps filled
    dupBetween <- goodVals[fillIdx]

    # Undo the original sort
    dupBetween[sortorder] <- dupBetween

    # Return the vector of which entries are duplicated across groups
    return(dupBetween)
}

Notes

To find exact duplicate rows within a single dataframe, see ../Finding and removing duplicate records.