I have 2 csv files
1.csv
spain;russia;japan
italy;russia;france
2.csv
spain;russia;japan
india;iran;pakistan
I read both files and add data to lists
var lst1= File.ReadAllLines("1.csv").ToList();
var lst2= File.ReadAllLines("2.csv").ToList();
Then I find all unique strings from both lists and add it to result lists
var rezList = lst1.Except(lst2).Union(lst2.Except(lst1)).ToList();
rezlist contains this data
[0] = "italy;russia;france"
[1] = "india;iran;pakistan"
At now I want to compare, make except and union by second and third column in all rows.
1.csv
spain;russia;japan
italy;russia;france
2.csv
spain;russia;japan
india;iran;pakistan
I think I need to split all rows by symbol ';' and make all 3 operations (except, distinct and union) but cannot understand how.
rezlist
must contains
india;iran;pakistan
I added class
class StringLengthEqualityComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
...
}
public int GetHashCode(string obj)
{
...
}
}
StringLengthEqualityComparer stringLengthComparer = new StringLengthEqualityComparer();
var rezList = lst1.Except(lst2,stringLengthComparer ).Union(lst2.Except(lst1,stringLengthComparer),stringLengthComparer).ToList();
your
GetHashCode()
-Method in EqualityComparer are buggy. Fixed version:now the result are correct:
btw.
"StringLengthEqualityComparer"
is not a good name ;-)Your question is not very clear: for instance, is
india;iran;pakistan
the desired result primarily because russia is at element[1]? Isn't it also included because element [2] pakistan does not match france and japan? Even though thats unclear, I assume the desired result comes from either situation.Then there is this:
find all unique string from both lists
which changes the nature dramatically. So, I take it that the desired results are because "iran" appears in column[1] no where else in column[1] in either file and even if it did, that row would still be unique due to "pakistan" in col[2].Also note that a data sample of 2 leaves room for a fair amount of error.
Trying to do it in one step makes it very confusing. Since eliminating dupes found in 1.CSV is pretty easy, do it first:
That results in the items which are not in 1.CSV (no matching text in Col[1] nor Col[2]). If that is really all you need, you are done.
Getting unique rows within 2.CSV is trickier because you have to actually count the number of times each Col[1] item occurs to see if it is unique; then repeat for Col[2]. This uses GroupBy:
The
GroupItem
class is trivial:It starts with
tmpList
, gets the rows with a unique element at [1]. It uses a class for storage since at this point we need the array data for further review.The second
GroupBy
acts on those results, this time looking at col[2]. Finally, it selects the joined string data.Results
Using 50,000 random items in File1 (1.3 MB), 15,000 in File2 (390 kb). There were no naturally occurring unique items, so I manually made 8 unique in 2.CSV and copied 2 of them into 1.CSV. The copies in 1.CSV should eliminate 2 if the 8 unique rows in 2.CSV making the expected result 6 unique rows:
NepalX
andItalyX
were the repeats in both files and they correctly eliminated each other.With each step it is scanning and working with less and less data, which seems to make it pretty fast for 65,000 rows / 130,000 data elements.