Efficiently Identifying Duplicates using MongoDB

One question that comes up time and again on Stack Overflow and the MongoDB Developer Forums is “how can I find duplicate X and get a list of Y that contains these duplicates?” (for example, “Query to find duplicate users (ip)”).

Using MongoDB’s Aggregation Framework, this can be done easily.

function generate_random_data(size){
    var chars = 'abcdefghijklmnopqrstuvwxyz'.split('');
    var len = chars.length;
    var random_data = [];

    // pick `size` random characters (`| 0` truncates to an integer index)
    while (size--) { random_data.push(chars[Math.random()*len | 0]); }

    return random_data.join('');
}

function setup() {
  db.foo.drop();
  db.foo.insertMany([
    { parent_id: 1, user_id: 1, junk: generate_random_data(512) },
    { parent_id: 1, user_id: 2, junk: generate_random_data(512) },
    { parent_id: 2, user_id: 3, junk: generate_random_data(512) },
    { parent_id: 3, user_id: 4, junk: generate_random_data(512) },
    { parent_id: 4, user_id: 5, junk: generate_random_data(512) },
    { parent_id: 3, user_id: 6, junk: generate_random_data(512) },
    { parent_id: 2, user_id: 7, junk: generate_random_data(512) }
  ]);
}

setup();

Given the above setup, our collection will contain 7 documents. If we wanted to identify how many duplicate parent_id values there are, and what the associated user_id values are, we could do something like the following:

db.foo.aggregate([
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids"} }
]);
/* ** output **
[
  { "parent_id": 3, "user_ids": [ 4, 6 ] },
  { "parent_id": 1, "user_ids": [ 1, 2 ] },
  { "parent_id": 2, "user_ids": [ 3, 7 ] }
]
*/

This will work pretty efficiently for our sample set of 7 documents, but what about millions (or billions)?
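
If you want to reproduce this at a larger scale yourself, something along these lines (a rough sketch reusing generate_random_data; the load_test_data name and the counts are made up for illustration) will bulk-insert as many documents as you like. The explain output below, however, was captured against the original 7 documents:

// Rough sketch: bulk-insert `count` documents in batches of 1,000.
// Capping parent_id at count / 2 guarantees plenty of duplicates.
function load_test_data(count) {
  var batch = [];
  for (var i = 1; i <= count; i++) {
    batch.push({
      parent_id: Math.floor(Math.random() * (count / 2)) + 1,
      user_id: i,
      junk: generate_random_data(512)
    });
    if (batch.length === 1000) { db.foo.insertMany(batch); batch = []; }
  }
  if (batch.length > 0) { db.foo.insertMany(batch); }
}
// load_test_data(1000000);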

Reviewing Performance

By generating Explain Results for the above operation, we can better understand how it performs:

db.foo.explain("executionStats").aggregate([
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids"} }
]);
/* ** output **
"winningPlan": {
  ...
  "inputStage": {
    "stage": "COLLSCAN",
    "direction": "forward"
  }
},
...
"executionStats": {
  "nReturned": 7,
  "totalKeysExamined": 0,
  "totalDocsExamined": 7,
*/

No index was available for use, so a full collection scan was required and every document had to be examined.

According to the documentation, we can improve our pipeline’s performance with indexes and document filters.
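
The “document filters” part is easy to overlook: if we only cared about duplicates within a subset of the data, say parent_id values below 3 (a purely hypothetical filter), placing a $match as the very first stage lets the planner use an index for that filter (once one exists) and reduces how many documents ever reach the $group:

// Hypothetical: restrict the duplicate check to a subset of the data.
// A leading $match can be served by an index and shrinks the $group's input.
db.foo.aggregate([
  { $match: { parent_id: { $lt: 3 } } },
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids" } }
]);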

Adding an Index

We know only 2 fields from our documents are actually used by the pipeline, so let’s create a purpose-built compound index on them and review the explain plan again:

db.foo.createIndex({ parent_id: 1, user_id: 1 });
db.foo.explain("executionStats").aggregate([
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids"} }
]);
/* ** output **
"winningPlan": {
  ...
  "inputStage": {
    "stage": "COLLSCAN",
    "direction": "forward"
  }
},
...
"executionStats": {
  "nReturned": 7,
  "totalKeysExamined": 0,
  "totalDocsExamined": 7,
*/

Even with what appears to be an ideal index, a collection scan is still being performed. What gives? Oh right … even when following the “ESR Guidance” for creating optimal indexes, an unfiltered $group has to scan the entire collection anyway and wouldn’t benefit directly from an index … or would it?

Adding a $sort before the $group

Adding a $sort stage prior to the $group allows the pipeline to take advantage of the index by grouping an already-sorted set.

db.foo.explain("executionStats").aggregate([
  { $sort: { parent_id: 1 } },
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids"} }
]);
/* ** output **
"winningPlan": {
  ...
  "inputStage": {
    "stage": "IXSCAN",
    "keyPattern": {
      "parent_id": 1,
      "user_id": 1
    },
    ...
"executionStats": {
  "nReturned": 7,
  "totalKeysExamined": 7,
  "totalDocsExamined": 0,
*/

The explain plan for the above operation shows not only that an index was used, but also that the entire operation was a covered query: totalDocsExamined is 0, meaning the index alone satisfied the pipeline.
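
If you find yourself doing this for more than one field, the pattern can be wrapped in a small helper. findDuplicates below is not a built-in, just a sketch of one way to parameterize the pipeline; pair it with a matching compound index so the $sort and $group stay covered:

// Sketch of a reusable helper (not a built-in): groups documents that share
// the same value for `field` and collects `collectField` from each group.
function findDuplicates(coll, field, collectField) {
  return coll.aggregate([
    { $sort: { [field]: 1 } },
    { $group: { _id: "$" + field, used: { $sum: 1 }, values: { $push: "$" + collectField } } },
    { $match: { used: { $gt: 1 } } },
    { $project: { _id: 0, [field]: "$_id", values: "$values" } }
  ]).toArray();
}

// Equivalent to the pipeline above:
findDuplicates(db.foo, "parent_id", "user_id");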

Conclusion

Now that we can identify duplicates, further processing can be done with the results as needed. For example, assume we wanted to remove all documents with a duplicate parent_id and keep only the first:

db.foo.aggregate([
  { $sort: { parent_id: 1 } },
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids"} }
]).forEach((d) => db.foo.deleteMany({ user_id: { $in: d.user_ids.slice(1, d.user_ids.length) } }));

The above takes the results of the aggregation pipeline and applies the following deleteMany command to each result: (d) => db.foo.deleteMany({ user_id: { $in: d.user_ids.slice(1, d.user_ids.length) } }).

Note that this could be further optimized for larger delete workloads by instead collecting all duplicate user_id values into a single array and deleting them all at once:

var toDelete = [];
db.foo.aggregate([
  { $sort: { parent_id: 1 } },
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids"} }
]).forEach((d) => toDelete.push(d.user_ids.slice(1, d.user_ids.length)));

db.foo.deleteMany({ user_id: { $in: toDelete.flat() } });
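
If the list of duplicates grows very large, a single huge $in array can become unwieldy; one option (a sketch using only the calls already shown, with an arbitrary chunk size) is to issue the deletes in fixed-size batches:

// Delete in batches of 1,000 ids to keep each $in array a manageable size.
var ids = toDelete.flat();
var chunkSize = 1000;
for (var i = 0; i < ids.length; i += chunkSize) {
  db.foo.deleteMany({ user_id: { $in: ids.slice(i, i + chunkSize) } });
}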

Be very careful whenever batch removing data and test in a lower environment first!

Hopefully you found this helpful. If you did, feel free to drop a comment below ;)

This post is licensed under CC BY 4.0 by the author.