What is the best way to remove duplicates from a d

2019-02-09 12:22发布

I have checked the whole site and googled on the net but was unable to find a simple solution to this problem.

I have a datatable which has about 20 columns and 10K rows. I need to remove the duplicate rows in this datatable based on 4 key columns. Doesn't .Net have a function which does this? The function closest to what I am looking for was datatable.DefaultView.ToTable(true, array of columns to display), But this function does a distinct on all the columns.

It would be great if someone could help me with this.

EDIT: I am sorry for not being clear on this. This datatable is being created by reading a CSV file and not from a DB. So using an SQL query is not an option.

12条回答
一夜七次
2楼-- · 2019-02-09 13:02

I think this must be the best way to remove duplicates from Datatable by using Linq and moreLinq Code:

Linq

RemoveDuplicatesRecords(yourDataTable);


private DataTable RemoveDuplicatesRecords(DataTable dt)
{
    var UniqueRows = dt.AsEnumerable().Distinct(DataRowComparer.Default);
    DataTable dt2 = UniqueRows.CopyToDataTable();
    return dt2;
}

Blog Article : Remove Duplicate rows records from DataTable Asp.net c#


MoreLinq

// Distinctby  column name ID 
var valueDistinctByIdColumn = yourTable.AsEnumerable().DistinctBy(row => new { Id = row["Id"] });
DataTable dtDistinctByIdColumn = valueDistinctByIdColumn.CopyToDataTable();

Note: moreLinq need to add library.

In morelinq you can use function called DistinctBy in which you can specify the property on which you want to find Distinct objects.

Blog article : Using moreLinq DistinctBy method to remove duplicate records

查看更多
Ridiculous、
3楼-- · 2019-02-09 13:02

Use a query instead of functions:

DELETE FROM table1 AS tb1 INNER JOIN 
(SELECT id, COUNT(id) AS cntr FROM table1 GROUP BY id) AS tb2
ON tb1.id = tb2.id WHERE tb2.cntr > 1
查看更多
倾城 Initia
4楼-- · 2019-02-09 13:08

I wasn't keen on using the Linq solution above so I wrote this:

/// <summary>
/// Takes a datatable and a column index, and returns a datatable without duplicates
/// </summary>
/// <param name="dt">The datatable containing duplicate records</param>
/// <param name="ComparisonFieldIndex">The column index containing duplicates</param>
/// <returns>A datatable object without duplicated records</returns>
public DataTable duplicateRemoval(DataTable dt, int ComparisonFieldIndex)
{
    try
    {
        //Build the new datatable that will be returned
        DataTable dtReturn = new DataTable();
        for (int i = 0; i < dt.Columns.Count; i++)
        {
            dtReturn.Columns.Add(dt.Columns[i].ColumnName, System.Type.GetType("System.String"));
        }

        //Loop through each record in the datatable we have been passed
        foreach (DataRow dr in dt.Rows)
        {
            bool Found = false;
            //Loop through each record already present in the datatable being returned
            foreach (DataRow dr2 in dtReturn.Rows)
            {
                bool Identical = true;
                //Compare the column specified to see if it matches an existing record
                if (!(dr2[ComparisonFieldIndex].ToString() == dr[ComparisonFieldIndex].ToString()))
                {
                    Identical = false;
                }
                //If the record found identically matches one we already have, don't add it again
                if (Identical)
                {
                    Found = true;
                    break;
                }
            }
            //If we didn't find a matching record, we'll add this one
            if (!Found)
            {
                DataRow drAdd = dtReturn.NewRow();
                for (int i = 0; i < dtReturn.Columns.Count; i++)
                {
                    drAdd[i] = dr[i];
                }

                dtReturn.Rows.Add(drAdd);
            }
        }
        return dtReturn;
    }
    catch (Exception)
    {
        //Return the original datatable if something failed above
        return dt;
    }
}

Additionally, this works on ALL columns rather than a specific column index:

/// <summary>
/// Takes a datatable and returns a datatable without duplicates
/// </summary>
/// <param name="dt">The datatable containing duplicate records</param>
/// <returns>A datatable object without duplicated records</returns>
public DataTable duplicateRemoval(DataTable dt)
{
    try
    {
        //Build the new datatable that will be returned
        DataTable dtReturn = new DataTable();
        for (int i = 0; i < dt.Columns.Count; i++)
        {
            dtReturn.Columns.Add(dt.Columns[i].ColumnName, System.Type.GetType("System.String"));
        }

        //Loop through each record in the datatable we have been passed
        foreach (DataRow dr in dt.Rows)
        {
            bool Found = false;
            //Loop through each record already present in the datatable being returned
            foreach (DataRow dr2 in dtReturn.Rows)
            {
                bool Identical = true;
                //Compare all columns to see if they match the existing record
                for (int i = 0; i < dt.Columns.Count; i++)
                {
                    if (!(dr2[i].ToString() == dr[i].ToString()))
                    {
                        Identical = false;
                    }
                }
                //If the record found identically matches one we already have, don't add it again
                if (Identical)
                {
                    Found = true;
                    break;
                }
            }
            //If we didn't find a matching record, we'll add this one
            if (!Found)
            {
                DataRow drAdd = dtReturn.NewRow();
                for (int i = 0; i < dtReturn.Columns.Count; i++)
                {
                    drAdd[i] = dr[i];
                }

                dtReturn.Rows.Add(drAdd);
            }
        }
        return dtReturn;
    }
    catch (Exception)
    {
        //Return the original datatable if something failed above
        return dt;
    }
}
查看更多
兄弟一词,经得起流年.
5楼-- · 2019-02-09 13:15

How can I remove duplicate rows?. (Adjust the query there to join on your 4 key columns)

EDIT: with your new information I believe the easiest way would be to implement IEqualityComparer<T> and use Distinct on your data rows. Otherwise if you're working with IEnumerable/IList instead of DataTable/DataRow, it is certainly possible with some LINQ-to-objects kung-fu.

EDIT: example IEqualityComparer

public class MyRowComparer : IEqualityComparer<DataRow>
{

    public bool Equals(DataRow x, DataRow y)
    {
        return (x.Field<int>("ID") == y.Field<int>("ID")) &&
            string.Compare(x.Field<string>("Name"), y.Field<string>("Name"), true) == 0 &&
          ... // extend this to include all your 4 keys...
    }

    public int GetHashCode(DataRow obj)
    {
        return obj.Field<int>("ID").GetHashCode() ^ obj.Field<string>("Name").GetHashCode() etc.
    }
}

You can use it like this:

var uniqueRows = myTable.AsEnumerable().Distinct(MyRowComparer);
查看更多
手持菜刀,她持情操
6楼-- · 2019-02-09 13:16

Try this

Let us consider dtInput is your data table with duplicate records.

I have a new DataTable dtFinal in which I want to filter the duplicate rows.

So my code will be something like below.

DataTable dtFinal = dtInput.DefaultView.ToTable(true, 
                           new string[ColumnCount] {"Col1Name","Col2Name","Col3Name",...,"ColnName"});
查看更多
smile是对你的礼貌
7楼-- · 2019-02-09 13:17

Found this on bytes.com:

You can use the JET 4.0 OLE DB provider with the classes in the System.Data.OleDb namespace to access the comma delimited text file (using a DataSet/DataTable).

Or you could use Microsoft Text Driver for ODBC with the classes in the System.Data.Odbc namespace to access the file using ODBC drivers.

That would allow you to access your data via sql queries, as others proposed.

查看更多
登录 后发表回答