As spreadsheets grow in size, it is very likely that duplicate entries will be created, and depending on the information contained in the sheet, they could pose a problem.
At the very least, they will bloat the sheet.
You could sort the sheet and then manually go through the rows trying to find and remove any duplicate entries, but this would be time consuming and error prone, especially with a large sheet.
It turns out that since this is a very common requirement, Google sheets has a dedicated menu option that takes care of this.
We’ll be taking a look at this approach, as well as a way to do this in a formula, and then we’ll taking a look at a tool that can help analyze and detect duplicates, before we go ahead and remove them.
Remove Duplicates Menu Option
In the sample sheet above, we would like to remove any duplicate entries based on the name fields (first two columns in the sheet).
Looking at the sheet, you will notice that rows 4 and 10 match on both the first_name and last_name columns.
Overall, the sheet has 500 entries, so there could well be more duplicates further down.
The method we are going to use here, will remove all duplicates in the sheet, no matter where they are.
From the Google sheets menu, open Data > Data clean-up and select Remove duplicates.
This will open the Remove duplicates dialog box.
At the top, the dialog box will display some details about the sheet, such as how many rows there and the number of columns detected.
This is important because Google sheets allows you to select which columns will be used to locate duplicate entries.
If your sheet uses a header row (i.e. if you’ve given your columns titles), then select the first checkbox, Data has header row.
When not selected, the Columns to analyze section directly below will refer to the columns by their default letter designations (i.e. Column A, Column B, etc.) rather than the titles.
When there are many columns in a sheet, as in our Example Contacts sheet, it is not always easy which column contains what information.
Hence, with Data has header row checked, the header titles will also be displayed, as shown above.
This will make it easier to select which columns are to be used in finding the duplicates.
By default, Google sheets selects all columns.
However, since in our case we only want to match on first_name and last_name, we need to only select those columns.
Left clicking on the Remove duplicates button will start the ball rolling.
Depending on the size of your sheet, it may take a few seconds to complete, but when the process is finished, a report is displayed with how many duplicate rows were found and removed, and how many unique rows are left.
Another look at the sheet, and now out of rows 4 and 10, only row 4 is still present.
Row 10, which was a duplicate, has been removed.
The same will have happened to any other duplicates found.
While in the above case, we’ve only searched for duplicates based on the first_name and last_name columns, we could have used any other columns (e.g. email), or a combination of any number of columns.
The UNIQUE Function
An instance of where removing duplicates would be required, is when building an emailing list for a newsletter.
You would not want to send the same newsletter multiple times to the same contact.
We are going to build this email listing from the example contacts sheet, and so it would be nice if the method we use, does not alter the original contact list in any way.
The above snapshot shows a list of contact names and their corresponding email addresses taken from our earlier example contacts sheet.
For the purposes of this article and for convenience, the columns have been reordered so that the columns of interest are adjacent, and hence we don’t need to view the sheet in its entire width.
Duplicates exist between rows 4 and 10, and 8 and 17.
Since we will be using our mailing list to send out a newsletter, the fields we are interested in are the first and last names, and the email address.
As stated previously, it would be ideal if whatever method we use to pull in the names and email address, is non pervasive.
We can use the UNIQUE function to achieve all our requirements.
The syntax for the function is as follows:
UNIQUE(range)
UNIQUE returns unique rows in the provided source range, discarding duplicates.
Rows are returned in the order in which they first appear in the source range.
We start by creating a new sheet and then in cell A1, enter the following formula.
You could also do this in the same sheet, but it is cleaner and easier to maintain if it is done in a separate sheet.
=UNIQUE(Sheet1!A1:C502)
The formula assumes the original information is in Sheet1, hence the cell reference Sheet1!A1:C502 which is also the range.
The UNIQUE function will exclude any duplicates for first_name, last_name, and email, and add them into our new sheet.
Note that all three fields must match. If for example, the first and last names are the same, but the email is different, it will not be considered a duplicate.
Looking at the results in the snapshot above, we now see that there are no duplicates between rows 4 and 10, and 8 and 17.
Since the results have been written to a new sheet, we have also not affected the original sheet in anyway, fulfilling our requirements.
Furthermore, if and when the data in the original sheet changes sometime in the future, since we are using a formula in the new sheet, the results will update automatically, which means our emailing list is always current, requires not further intervention, and most importantly, always free of duplicates.
Pivot Tables
From the Google Workspace Learning Center:
Pivot tables help you summarize data, find patterns, and reorganize information.
You can add pivot tables based on suggestions in Google Sheets or create them manually.
After you create a pivot table, you can add and move data, add a filter, drill down to see details about a calculation, group data, and more.
While we won’t be using a pivot table to create an emailing list or to directly remove any duplicates, we can use a pivot table to analyze the information in our original emailing list to determine if and where duplicates exist.
We can then exploit that information to decide on a plan of action.
To create the pivot table, start by selecting the columns in the existing sheet that contain the information that will be used in our search for duplicates.
In our case, there are first_name, last_name, and email.
Then, open the Insert option from the Google sheets menu bar, and select Pivot table.
This will bring up the Create pivot table dialog box.
The Data range field will automatically reference to the columns we selected right before creating the pivot table, and we can leave it as is.
We can also leave the Insert to field as is, which will by default have the New sheet radio button selected.
Now click on the Create button.
Google sheets will create a new sheet in our workbook (assuming the New sheet radio button was selected), with the Pivot table editor in the right hand pane.
We now need to set some rules that will help us find any duplicates.
We’ll start off simple by determining if there are any email address duplicates.
In the Pivot table editor, click on the Add button on the right of the Rows section.
This will open the list of columns we can use in our analysis.
Since for now, we’re only looking for duplicates in the email addresses, we will select email.
This will create an entry in the Rows section in the Pivot table editor, add the column email to the sheet and copy all the email addresses from the original sheet (Sheet1) into this column, and sort them alphabetically.
The sorting order is governed by the Order dropdown in the email entry under Rows in the Pivot table editor.
At this point, there’s nothing that tells us anything of importance as far as duplicates is concerned.
We have one more step to go.
In the Pivot table editor in the Values section, click the Add button and again select email.
This creates an entry in the Values section.
By default, the Summarize by dropdown will have the function COUNTA selected, which is exactly what we want.
It also creates a second column in the sheet, COUNTA of email, which will contain a count of how many times the corresponding email address in the email column appears in the original emailing list.
A quick scan down the second column will reveal that row 34, with the email address art@venere.org, has a value of 2, meaning that it appears twice and therefore a duplicate exists for this email.
We’ve detected a duplicate.
A scan down the entire sheet may reveal more duplciates.
Now, it’s not uncommon for the same email address to be used by more than one person, and so we may want to do some further analysis, this time checking for duplicates based on both the email address and the recipient’s name.
All we need to do is add another couple of entries to the Rows section, one for first_name and one for last_name.
As we did when adding the email column, in the Pivot table editor, click on the Add button on the right of the Rows section and select last_name from the Sort_by dropdown. Repeat the steps one more, this time selecting first_name.
We now have three entries under Rows, namely email, last_name, and first_name.
If a duplicate email exists, but the first and/or last names for that email address differ, then we would get an output similar to rows 34 and 35 in the snapshot above.
This tells us that the email address is present twice, but under different names.
If our mailing list is being used to send out a newsletter, it is now up to us to decide as to whether we want to send the newsletter to the same email address twice but addressing two different people, or just sent it the one time.
If the email and names match exactly between two entries in our mailing list, then we would see something like row 346 above, where there is just the one row with a value of 2 in column D.
This means that the email address, last and first names of the recipient match exactly, and therefore appear twice in our list.
You could play around further with the Rows and Values sections to find the best fit for your particular requirements and then add a filter so that finally, only the duplicates would be listed.
This would save having to scroll down, scanning for a value other than 1 in the COUNTA of email column.
The above snapshot shows setting a filter and selecting the Filter by values option in which we have disabled (Blanks), 0, and 1, leaving only the value of 2 ticked.
This also tells us that there are no entries with more than 2 matches.
The result is now that only duplicates meeting our criteria are displayed.
Conclusion
As we’ve seen, removing duplicate entries with Google sheets is very easy and straightforward, through the Remove duplicates menu option, as well as by using the UNIQUE function in a formula.
Both these methods also allow us to send the results to a new sheet, leaving our original data intact.
But sometimes, we may not want to just blindly remove duplicates without first finding out what will be removed.
To accomplish this, Google sheets gives us the ability, through a handy tool called a pivot table, to help detect duplicates using any number of criteria, and hence more thoroughly analyze the information in the sheet so as to make a more informed decision.