How to compare two csv files by 2 columns?

2019-07-27 10:03发布

I have 2 csv files

1.csv

spain;russia;japan
italy;russia;france

2.csv

spain;russia;japan
india;iran;pakistan

I read both files and add data to lists

var lst1= File.ReadAllLines("1.csv").ToList();
var lst2= File.ReadAllLines("2.csv").ToList();

Then I find all unique strings from both lists and add it to result lists

var rezList = lst1.Except(lst2).Union(lst2.Except(lst1)).ToList();

rezlist contains this data

[0] = "italy;russia;france"
[1] = "india;iran;pakistan"

At now I want to compare, make except and union by second and third column in all rows.

1.csv

spain;russia;japan

italy;russia;france

2.csv

spain;russia;japan

india;iran;pakistan

I think I need to split all rows by symbol ';' and make all 3 operations (except, distinct and union) but cannot understand how.

rezlist must contains

india;iran;pakistan

I added class

 class StringLengthEqualityComparer : IEqualityComparer<string>
        {

            public bool Equals(string x, string y)
            {
               ...
            }

            public int GetHashCode(string obj)
            {
               ...
            }
        }



 StringLengthEqualityComparer stringLengthComparer = new StringLengthEqualityComparer();
 var rezList = lst1.Except(lst2,stringLengthComparer ).Union(lst2.Except(lst1,stringLengthComparer),stringLengthComparer).ToList();

标签: c# list compare
4条回答
混吃等死
2楼-- · 2019-07-27 10:28
  class Program
    {
        static void Main(string[] args)
        {
            var lst1 = File.ReadLines(@"D:\test\1.csv").Select(x => new StringWrapper(x)).ToList();
            var lst2 = File.ReadLines(@"D:\test\2.csv").Select(x => new StringWrapper(x));
            var set = new HashSet<StringWrapper>(lst1);
            set.SymmetricExceptWith(lst2);

            foreach (var x in set)
            {
                Console.WriteLine(x.Value);
            }

        }
    }

    struct StringWrapper : IEquatable<StringWrapper>
    {
        public string Value { get; }
        private readonly string _comparand0;
        private readonly string _comparand14;

        public StringWrapper(string value)
        {
            Value = value;
            var split = value.Split(';');
            _comparand0 = split[0];
            _comparand14 = split[14];
        }

        public bool Equals(StringWrapper other)
        {
            return string.Equals(_comparand0, other._comparand0, StringComparison.OrdinalIgnoreCase)
                   && string.Equals(_comparand14, other._comparand14, StringComparison.OrdinalIgnoreCase);
        }

        public override bool Equals(object obj)
        {
            if (ReferenceEquals(null, obj)) return false;
            return obj is StringWrapper && Equals((StringWrapper) obj);
        }

        public override int GetHashCode()
        {
            unchecked
            {
                return ((_comparand0 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand0) : 0)*397)
                       ^ (_comparand14 != null ? StringComparer.OrdinalIgnoreCase.GetHashCode(_comparand14) : 0);
            }
        }
    }
查看更多
仙女界的扛把子
3楼-- · 2019-07-27 10:30

your GetHashCode()-Method in EqualityComparer are buggy. Fixed version:

public int GetHashCode(string obj)
{
  return obj.Split(';')[1].GetHashCode();
}

now the result are correct:

// one result: "india;iran;pakistan"

btw. "StringLengthEqualityComparer"is not a good name ;-)

查看更多
不美不萌又怎样
4楼-- · 2019-07-27 10:34

Your question is not very clear: for instance, is india;iran;pakistan the desired result primarily because russia is at element[1]? Isn't it also included because element [2] pakistan does not match france and japan? Even though thats unclear, I assume the desired result comes from either situation.

Then there is this: find all unique string from both lists which changes the nature dramatically. So, I take it that the desired results are because "iran" appears in column[1] no where else in column[1] in either file and even if it did, that row would still be unique due to "pakistan" in col[2].

Also note that a data sample of 2 leaves room for a fair amount of error.

Trying to do it in one step makes it very confusing. Since eliminating dupes found in 1.CSV is pretty easy, do it first:

// parse "1.CSV"
List<string[]> lst1 = File.ReadAllLines(@"C:\Temp\1.csv").
            Select(line => line.Split(';')).
            ToList();

// parse "2.CSV"
List<string[]> lst2 = File.ReadAllLines(@"C:\Temp\2.csv").
            Select(line => line.Split(';')).
            ToList();

// extracting once speeds things up in the next step
//  and leaves open the possibility of iterating in a method
List<List<string>> tgts = new List<List<string>>();
tgts.Add(lst1.Select(z => z[1]).Distinct().ToList());
tgts.Add(lst1.Select(z => z[2]).Distinct().ToList());

var tmpLst = lst2.Where(x => !tgts[0].Contains(x[1]) ||
                !tgts[1].Contains(x[2])).
                ToList();

That results in the items which are not in 1.CSV (no matching text in Col[1] nor Col[2]). If that is really all you need, you are done.

Getting unique rows within 2.CSV is trickier because you have to actually count the number of times each Col[1] item occurs to see if it is unique; then repeat for Col[2]. This uses GroupBy:

var unique = tmpLst.
    GroupBy(g => g[1], (key, values) =>
            new GroupItem(key, 
                    values.ToArray()[0], 
                    values.Count())
            ).Where(q => q.Count == 1).
    GroupBy(g => g.Data[2], (key, values) => new
            {
                Item = string.Join(";", values.ToArray()[0]),
                Count = values.Count()
            }
            ).Where(q => q.Count == 1).Select(s => s.Item).
        ToList();

The GroupItem class is trivial:

class GroupItem
{
    public string Item { set; get; }      // debug aide
    public string[] Data { set; get; }
    public int Count { set; get; }

    public GroupItem(string n, string[] d, int c)
    {
        Item = n;
        Data =  d;
        Count = c;
    }
    public override string ToString()
    {
        return string.Join(";", Data);
    }
}

It starts with tmpList, gets the rows with a unique element at [1]. It uses a class for storage since at this point we need the array data for further review.

The second GroupBy acts on those results, this time looking at col[2]. Finally, it selects the joined string data.

Results

Using 50,000 random items in File1 (1.3 MB), 15,000 in File2 (390 kb). There were no naturally occurring unique items, so I manually made 8 unique in 2.CSV and copied 2 of them into 1.CSV. The copies in 1.CSV should eliminate 2 if the 8 unique rows in 2.CSV making the expected result 6 unique rows:

enter image description here

NepalX and ItalyX were the repeats in both files and they correctly eliminated each other.

With each step it is scanning and working with less and less data, which seems to make it pretty fast for 65,000 rows / 130,000 data elements.

查看更多
乱世女痞
5楼-- · 2019-07-27 10:50
private void GetUnion(List<string> lst1, List<string> lst2)
{
    List<string> lstUnion = new List<string>();
    foreach (string value in lst1)
    {
        string valueColumn1 = value.Split(';')[0];
        string valueColumn2 = value.Split(';')[1];
        string valueColumn3 = value.Split(';')[2];

        string result = lst2.FirstOrDefault(s => s.Contains(";" + valueColumn2 + ";" + valueColumn3));

        if (result != null)
        {
            if (!lstUnion.Contains(result))
            {
                lstUnion.Add(result);
            }                   
        }
    }
}
查看更多
登录 后发表回答