I am working with RecordLinkage Library in R.
I have a data frame with id, name, phone, mail
My code looks like this:
ids = data$id
pairs = compare.dedup(data, identity=ids, blockfld=as.list(2,3,4))
The problem is that my ids are not the same in my result output
so if I had this data:
id Name Phone Mail
233 Nathali 2222 nathali@dd.com
435 Nathali 2222
553 Jean 3444 jean@dd.com
In my result output I will have something like
id1 id2
1 2
Instead of
id1 id2
233 435
I want to know if there is a way to keep the ids instead of the index, or someone could explain me the identity parameter.
Thanks
The identity vector tells the getPairs method which of the input records belong to the same entity. It actually holds information that you usually want to gain from record linkage, i.e. you have a couple of records and do not know in advance which of them belong together. However, when you use a training set to calibrate a method or you want to evaluate the accurateness of record linkage methods (the package was mainly written for this purpose), you start with an already deduplicated or linked data set.
In your example, the first two rows (ids 233, 435) obviously mean the same person and the third row a different one. A meaningful identity vector would therefore be:
c(1,1,2)
But it could also be:
c(42,42,128)
Just make sure that the identity vector has identical values exactly at those positions where the corresponding table rows hold matching record (vector index = row index).
About your question on how to display the ids in the result: You can get the full record pairs, including all data fields, with (see the documentation for more details):
getPairs(pairs)
There might be better ways to get hold of the original ids, depending on how you further process the record pairs (e.g. running a classification algorithm). Extend your example if you need more advice on this.
p.s.: I am one of the package authors. I have only very recently become aware that people ask questions about the package on Stack Overflow, so please excuse that a couple of questions have been around unanswered for a long time. I will look for a way to get notified on new questions posted here, but I would also like to mention that people can contact us directly via one of the email addresses listed in the package information.
You have to replace the index column with your identify column.