R RecordLinkage Identity

Posted 2019-02-17 21:51

I am working with the RecordLinkage library in R. I have a data frame with the columns id, name, phone, and mail.

My code looks like this:

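library(RecordLinkage)
ids <- data$id
# block on name, phone, or mail (columns 2, 3, 4); as.list(2,3,4) would keep only the 2
pairs <- compare.dedup(data, identity = ids, blockfld = list(2, 3, 4))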

The problem is that the result does not contain my ids but the row indices. So if I had this data:

id   Name     Phone    Mail
233  Nathali  2222     nathali@dd.com
435  Nathali  2222 
553  Jean     3444     jean@dd.com

In my result output I get something like:

id1 id2
1   2

instead of:

id1 id2
233 435 

I want to know if there is a way to keep the ids instead of the row index, or whether someone could explain the identity parameter to me.

Thanks

Tags: r record linkage
2 answers
做个烂人
#2 · 2019-02-17 22:33

You have to replace the index column with your identity column.

This account has been banned
#3 · 2019-02-17 22:48

The identity vector tells the getPairs method which of the input records belong to the same entity. It actually holds the information that you usually want to gain from record linkage, i.e. you have a couple of records and do not know in advance which of them belong together. However, when you use a training set to calibrate a method, or you want to evaluate the accuracy of record linkage methods (the package was mainly written for this purpose), you start with an already deduplicated or linked data set.

In your example, the first two rows (ids 233, 435) obviously refer to the same person and the third row to a different one. A meaningful identity vector would therefore be:

c(1,1,2)

But it could also be:

c(42,42,128)

Just make sure that the identity vector has identical values exactly at those positions where the corresponding table rows hold matching records (vector index = row index).
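
To illustrate, here is a minimal sketch using the three rows from the question. The data frame is reconstructed from the question (with the missing mail entered as NA), and the sketch assumes that compare.dedup stores the known match status in the is_match column of pairs$pairs:

```r
library(RecordLinkage)

# Example data reconstructed from the question; the missing mail is NA
data <- data.frame(
  id    = c(233, 435, 553),
  Name  = c("Nathali", "Nathali", "Jean"),
  Phone = c("2222", "2222", "3444"),
  Mail  = c("nathali@dd.com", NA, "jean@dd.com"),
  stringsAsFactors = FALSE
)

# Rows 1 and 2 are known to be the same person, row 3 is someone else
identity <- c(1, 1, 2)

# Exclude the id column (column 1) from the comparison; no blocking here,
# so all three possible record pairs are generated
pairs <- compare.dedup(data, exclude = 1, identity = identity)

# is_match is 1 where both records share an identity value, 0 otherwise
pairs$pairs[, c("id1", "id2", "is_match")]
```

Note that id1 and id2 in this output are still row indices; the identity vector only determines the match status, not the labels of the records.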

About your question on how to display the ids in the result: you can get the full record pairs, including all data fields, with the following call (see the documentation for more details):

getPairs(pairs)

There might be better ways to get hold of the original ids, depending on how you further process the record pairs (e.g. running a classification algorithm). Extend your example if you need more advice on this.
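
One possible way to get hold of the original ids, sketched here under the assumption that id1 and id2 in pairs$pairs are row indices into the input data frame (as the output in the question suggests), is to use them to look up the id column. The data frame and the name result are only illustrative:

```r
library(RecordLinkage)

# Example data reconstructed from the question; the missing mail is NA
data <- data.frame(
  id    = c(233, 435, 553),
  Name  = c("Nathali", "Nathali", "Jean"),
  Phone = c("2222", "2222", "3444"),
  Mail  = c("nathali@dd.com", NA, "jean@dd.com"),
  stringsAsFactors = FALSE
)

# Keep the id column out of the comparison and block on name, phone, or mail
pairs <- compare.dedup(data, exclude = 1, blockfld = list(2, 3, 4))

# Translate the row indices of each pair back to the original ids
result <- data.frame(
  id1 = data$id[pairs$pairs$id1],
  id2 = data$id[pairs$pairs$id2]
)
result
```

The same lookup can be applied after classification, since the pair rows keep their id1 and id2 columns.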

p.s.: I am one of the package authors. I have only very recently become aware that people ask questions about the package on Stack Overflow, so please excuse that a couple of questions have been around unanswered for a long time. I will look for a way to get notified on new questions posted here, but I would also like to mention that people can contact us directly via one of the email addresses listed in the package information.
