What is the best way to remove duplicates from a d

2019-02-09 12:22发布

I have checked the whole site and googled on the net but was unable to find a simple solution to this problem.

I have a datatable which has about 20 columns and 10K rows. I need to remove the duplicate rows in this datatable based on 4 key columns. Doesn't .Net have a function which does this? The function closest to what I am looking for was datatable.DefaultView.ToTable(true, array of columns to display), But this function does a distinct on all the columns.

It would be great if someone could help me with this.

EDIT: I am sorry for not being clear on this. This datatable is being created by reading a CSV file and not from a DB. So using an SQL query is not an option.

12条回答
等我变得足够好
2楼-- · 2019-02-09 12:55

This is a very simple code which doesnot require linq nor individual columns to do the filter. If all the values of columns in a row are null it will be deleted.


    public DataSet duplicateRemoval(DataSet dSet) 
{
    bool flag;
    int ccount = dSet.Tables[0].Columns.Count;
    string[] colst = new string[ccount];
    int p = 0;

    DataSet dsTemp = new DataSet();
    DataTable Tables = new DataTable();
    dsTemp.Tables.Add(Tables);

    for (int i = 0; i < ccount; i++)
    {
        dsTemp.Tables[0].Columns.Add(dSet.Tables[0].Columns[i].ColumnName, System.Type.GetType("System.String"));
    }

    foreach (System.Data.DataRow row in dSet.Tables[0].Rows)
    {
        flag = false;
        p = 0;
        foreach (System.Data.DataColumn col in dSet.Tables[0].Columns)
        {
            colst[p++] = row[col].ToString();
            if (!string.IsNullOrEmpty(row[col].ToString()))
            {  //Display only if any of the data is present in column
                flag = true;
            }
        }
        if (flag == true)
        {
            DataRow myRow = dsTemp.Tables[0].NewRow();
            //Response.Write("<tr style=\"background:#d2d2d2;\">");
            for (int kk = 0; kk < ccount; kk++)
            {
                myRow[kk] = colst[kk];         

                // Response.Write("<td class=\"table-line\" bgcolor=\"#D2D2D2\">" + colst[kk] + "</td>");
            }
            dsTemp.Tables[0].Rows.Add(myRow);
        }
    } return dsTemp;
}

This can even be used to remove null data from excel sheet.

查看更多
欢心
3楼-- · 2019-02-09 13:00

Liggett78's answer is much better - esp. as mine had an error! Correction as follows...

DELETE TableWithDuplicates
    FROM TableWithDuplicates
        LEFT OUTER JOIN (
            SELECT PK_ID = Min(PK_ID), --Decide your method for deciding which rows to keep
                KeyColumn1,
                KeyColumn2,
                KeyColumn3,
                KeyColumn4
                FROM TableWithDuplicates
                GROUP BY KeyColumn1,
                    KeyColumn2,
                    KeyColumn3,
                    KeyColumn4
            ) AS RowsToKeep
            ON TableWithDuplicates.PK_ID = RowsToKeep.PK_ID
    WHERE RowsToKeep.PK_ID IS NULL
查看更多
Evening l夕情丶
4楼-- · 2019-02-09 13:00

"This datatable is being created by reading a CSV file and not from a DB."

So put a unique constraint on the four columns in the database, and inserts that are duplicates under your design won't go in. Unless it decides to fail instead of continuing when this happens, but this surely is configurable in your CSV import script.

查看更多
做自己的国王
5楼-- · 2019-02-09 13:01

You can use Linq to Datasets. Check this. Something like this:

// Fill the DataSet.
DataSet ds = new DataSet();
ds.Locale = CultureInfo.InvariantCulture;
FillDataSet(ds);

List<DataRow> rows = new List<DataRow>();

DataTable contact = ds.Tables["Contact"];

// Get 100 rows from the Contact table.
IEnumerable<DataRow> query = (from c in contact.AsEnumerable()
                              select c).Take(100);

DataTable contactsTableWith100Rows = query.CopyToDataTable();

// Add 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

// Create duplicate rows by adding the same 100 rows to the list.
foreach (DataRow row in contactsTableWith100Rows.Rows)
    rows.Add(row);

DataTable table =
    System.Data.DataTableExtensions.CopyToDataTable<DataRow>(rows);

// Find the unique contacts in the table.
IEnumerable<DataRow> uniqueContacts =
    table.AsEnumerable().Distinct(DataRowComparer.Default);

Console.WriteLine("Unique contacts:");
foreach (DataRow uniqueContact in uniqueContacts)
{
    Console.WriteLine(uniqueContact.Field<Int32>("ContactID"));
}
查看更多
女痞
6楼-- · 2019-02-09 13:01

If you have access to Linq I think you should be able to use the built in group functionality on the in memory collection and pick out the duplicate rows

Search Google for Linq Group by for examples

查看更多
Anthone
7楼-- · 2019-02-09 13:01

It should be taken into account that Table.AcceptChanges() must be called to complete the deletion. Otherwise deleted row is still present in DataTable with RowState set to Deleted. And Table.Rows.Count is not changed after deletion.

查看更多
登录 后发表回答