I have a data table I've loaded from a CSV file. I need to determine which rows are duplicates based on two columns (product_id
and owner_org_id
) in the datatable. Once I've determined that, I can use that information to build my result, which is a datatable containing only the rows that are not unique, and a data table containing only the rows that are unique.
I've looked at other examples on here and the code I've come up with so far does compile and execute, but it seems to think every row in the data is unique. In reality in the test data there's 13 rows and only 6 are unique. So clearly I'm doing something wrong.
EDIT: Thought I should note, rows that have duplicates should ALL be removed, not just the duplicates of that row. eg if there are 4 duplicates, all 4 should be removed not 3, leaving one unique row from the 4.
EDIT2: Alternatively, if I can select all duplicate rows (instead of trying to select unique rows) it is fine with me. Either way can get me to my end result.
The code in the processing method:
MyRowComparer myrc = new MyRowComparer();
var uniquerows = dtCSV.AsEnumerable().Distinct(myrc);
along with the following:
public class MyRowComparer : IEqualityComparer<DataRow>
{
public bool Equals(DataRow x, DataRow y)
{
//return ((string.Compare(x.Field<string>("PRODUCT_ID"), y.Field<string>("PRODUCT_ID"), true)) ==
// (string.Compare(x.Field<string>("OWNER_ORG_ID"), y.Field<string>("OWNER_ORG_ID"), true)));
return
x.ItemArray.Except(new object[] { x[x.Table.Columns["PRODUCT_ID"].ColumnName] }) ==
y.ItemArray.Except(new object[] { y[y.Table.Columns["PRODUCT_ID"].ColumnName] }) &&
x.ItemArray.Except(new object[] { x[x.Table.Columns["OWNER_ORG_ID"].ColumnName] }) ==
y.ItemArray.Except(new object[] { y[y.Table.Columns["OWNER_ORG_ID"].ColumnName] });
}
public int GetHashCode(DataRow obj)
{
int y = int.Parse(obj.Field<string>("PRODUCT_ID"));
int z = int.Parse(obj.Field<string>("OWNER_ORG_ID"));
int c = y ^ z;
return c;
}
}
You could use LINQ-To-DataSet and
Enumerable.Except
/Intersect
:Edit: Obviously i've misunderstood your requirement a little bit. So you only have one
DataTable
and want to get all unique and all duplicate rows, that's even more straight-forward. You can useEnumerable.GroupBy
with an anonymous type containing both fields:Your criterium is off. You are comparing sets of objects that you are not interested (
Except
excludes) in.Instead, be as clear (data type) as possible and keep it simple:
Check for
null
if that is a possibility. Maybe you want to exclude rows that are equal because their IDs are null.Observe the
int?
. This is not a typo. The question mark is required if you are dealing with database values from columns that can beNULL
. The reason is thatNULL
values will be represented by the typeDBNull
in C#. Using theas
operator just gives younull
in this case (instead of anInvalidCastException
. If you are sure, you are dealing withINT NOT NULL
, cast with(int)
.The same is true for strings.
(string)
asserts you are expecting non-null DB values.EDIT1:
Had the type wrong. ItemArray is not a hashtable. Use the row directly.
EDIT2:
Added
string
example, some commentFor a more straight-forward way, check How to select distinct rows in a datatable and store into an array
EDIT3:
Some explanation regarding the casts.
The other link I suggested does the same as your code. I forgot your original intent ;-) I just saw your code and responded to the most obvious error, I saw - sorry
Here is how I would solve the problem
I don't know if this 100% correct, I don't have an IDE at hand. And you'll need to adjust the casts to the appropriate type. See my addition above.