Efficiently Identifying Duplicates using MongoDB
One question that comes up time and again on Stack Overflow or the MongoDB Developer Forums is “how can I find duplicate X and get a list of Y that contains these duplicates” (ex: “Query to find duplicate users (ip)”).
Using MongoDB's Aggregation Framework, this can be done easily. Let's start by generating a small sample collection:
// Build a random lowercase string of the requested length (filler data).
function generate_random_data(size) {
  var chars = 'abcdefghijklmnopqrstuvwxyz'.split('');
  var len = chars.length;
  var random_data = [];
  while (size--) { random_data.push(chars[Math.random() * len | 0]); }
  return random_data.join('');
}

// Reset the collection and seed 7 documents; several share a parent_id.
function setup() {
  db.foo.drop();
  db.foo.insertMany([
    { parent_id: 1, user_id: 1, junk: generate_random_data(512) },
    { parent_id: 1, user_id: 2, junk: generate_random_data(512) },
    { parent_id: 2, user_id: 3, junk: generate_random_data(512) },
    { parent_id: 3, user_id: 4, junk: generate_random_data(512) },
    { parent_id: 4, user_id: 5, junk: generate_random_data(512) },
    { parent_id: 3, user_id: 6, junk: generate_random_data(512) },
    { parent_id: 2, user_id: 7, junk: generate_random_data(512) }
  ]);
}
setup();
Given the above setup our collection will contain 7 documents. If we wanted to identify how many duplicate parent_id values there are, and what the associated user_id values are, we could do something like the following:
db.foo.aggregate([
  // Group by parent_id, counting occurrences and collecting the user_ids.
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  // Keep only groups seen more than once, i.e. the duplicates.
  { $match: { used: { $gt: 1 } } },
  // Reshape the output.
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids" } }
]);
/* ** output **
[
  { "parent_id": 3, "user_ids": [ 4, 6 ] },
  { "parent_id": 1, "user_ids": [ 1, 2 ] },
  { "parent_id": 2, "user_ids": [ 3, 7 ] }
]
*/
This will work pretty efficiently for our sample set of 7 documents, but what about millions (or billions)?
Reviewing Performance
By generating Explain Results for the above operation we can better understand how it performs:
db.foo.explain("executionStats").aggregate([
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids" } }
]);
/* ** output **
"winningPlan": {
  ...
  "inputStage": {
    "stage": "COLLSCAN",
    "direction": "forward"
  }
},
...
"executionStats": {
  "nReturned": 7,
  "totalKeysExamined": 0,
  "totalDocsExamined": 7,
*/
According to the documentation, we can improve a pipeline's performance with indexes and document filters. Here no index was available for use, so a full collection scan was required.
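For instance, if we only needed to check for duplicates within a subset of the data, a leading $match stage acts as a document filter and reduces how many documents reach the $group stage. The sketch below is illustrative only; the { parent_id: { $lte: 3 } } predicate is an assumption, not part of the original example:

// Hypothetical variant: a document filter placed first in the pipeline.
// The predicate below is purely illustrative.
db.foo.aggregate([
  { $match: { parent_id: { $lte: 3 } } },
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids" } }
]);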
Adding an Index
We know only 2 fields from our documents are actually being used by the pipeline, so let's create a purpose-built compound index and review the explain plan again:
db.foo.createIndex({ parent_id: 1, user_id: 1 });
db.foo.explain("executionStats").aggregate([
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids" } }
]);
/* ** output **
"winningPlan": {
  ...
  "inputStage": {
    "stage": "COLLSCAN",
    "direction": "forward"
  }
},
...
"executionStats": {
  "nReturned": 7,
  "totalKeysExamined": 0,
  "totalDocsExamined": 7,
*/
Even with what appears to be an ideal index, a collection scan is still being performed. What gives? Oh right... even when following the "ESR (Equality, Sort, Range) Guidance" for creating optimal indexes, an unfiltered $group must scan the entire collection anyway and wouldn't benefit directly from an index... or would it?
Adding a $sort before the $group
Adding a $sort stage prior to the $group allows the pipeline to take advantage of the index and group an already-sorted stream of documents.
db.foo.explain("executionStats").aggregate([
  { $sort: { parent_id: 1 } },
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids" } }
]);
/* ** output **
"winningPlan": {
  ...
  "inputStage": {
    "stage": "IXSCAN",
    "keyPattern": {
      "parent_id": 1,
      "user_id": 1
    },
...
"executionStats": {
  "nReturned": 7,
  "totalKeysExamined": 7,
  "totalDocsExamined": 0,
*/
The explain plan for the above operation shows not only that an index was used, but also that the entire operation was a covered query: with totalDocsExamined at 0, the results were produced from index keys alone, without fetching a single document.
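As a quick sanity check (a sketch using our sample junk field, which is not part of the index), pulling a non-indexed field back into the pipeline should break coverage and force document fetches:

// Sketch: $push-ing "junk" (not in the compound index) means the query can
// no longer be answered from index keys alone; expect totalDocsExamined > 0
// in the resulting explain output.
db.foo.explain("executionStats").aggregate([
  { $sort: { parent_id: 1 } },
  { $group: { _id: "$parent_id", used: { $sum: 1 }, junk: { $push: "$junk" } } }
]);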
Conclusion
Now that we can identify duplicates, further processing can be done with the results as needed. For example, assume we wanted to remove all documents with a duplicate parent_id and keep only the first:
db.foo.aggregate([
  { $sort: { parent_id: 1 } },
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids" } }
// For each duplicate group, keep the first user_id and delete the rest.
]).forEach((d) => db.foo.deleteMany({ user_id: { $in: d.user_ids.slice(1, d.user_ids.length) } }));
The above takes the results of the aggregation pipeline and applies the following deleteMany command to each: (d) => db.foo.deleteMany({ user_id: { $in: d.user_ids.slice(1, d.user_ids.length) } }).
Note this could be further optimized for larger delete workloads by instead collecting all duplicate user_id values into a single array and deleting them all at once:
var toDelete = [];
db.foo.aggregate([
  { $sort: { parent_id: 1 } },
  { $group: { _id: "$parent_id", used: { $sum: 1 }, user_ids: { $push: "$user_id" } } },
  { $match: { used: { $gt: 1 } } },
  { $project: { _id: 0, parent_id: "$_id", user_ids: "$user_ids" } }
// Collect every duplicate user_id (all but the first per group) ...
]).forEach((d) => toDelete.push(d.user_ids.slice(1, d.user_ids.length)));
// ... then remove them all in a single operation.
db.foo.deleteMany({ user_id: { $in: toDelete.flat() } });
Be very careful whenever batch-removing data, and always test in a lower environment first!
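A minimal sanity check (a sketch; it assumes the toDelete array from above) is to count the matching documents before actually issuing the delete:

// Sketch: preview the scope of the delete before running it.
var count = db.foo.countDocuments({ user_id: { $in: toDelete.flat() } });
print(`deleteMany would remove ${count} documents`);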
Hopefully you found this helpful. If you did, feel free to drop a comment below ;)