Collection removeAll ignoring case?

2020-06-09 07:10发布

问题:

Ok so here is my issue. I have to HashSet's, I use the removeAll method to delete values that exist in one set from the other.

Prior to calling the method, I obviously add the values to the Sets. I call .toUpperCase() on each String before adding because the values are of different cases in both lists. There is no rhyme or reason to the case.

Once I call removeAll, I need to have the original cases back for the values that are left in the Set. Is there an efficient way of doing this without running through the original list and using CompareToIgnoreCase?

Example:

List1:

"BOB"
"Joe"
"john"
"MARK"
"dave"
"Bill"

List2:

"JOE"
"MARK"
"DAVE"

After this, create a separate HashSet for each List using toUpperCase() on Strings. Then call removeAll.

Set1.removeAll(set2);

Set1:
    "BOB"
    "JOHN"
    "BILL"

I need to get the list to look like this again:

"BOB"
"john"
"Bill"

Any ideas would be much appreciated. I know it is poor, there should be a standard for the original list but that is not for me to decide.

回答1:

In my original answer, I unthinkingly suggested using a Comparator, but this causes the TreeSet to violate the equals contract and is a bug waiting to happen:

// Don't do this:
Set<String> setA = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
setA.add("hello");
setA.add("Hello");
System.out.println(setA);

Set<String> setB = new HashSet<String>();
setB.add("HELLO");
// Bad code; violates symmetry requirement
System.out.println(setB.equals(setA) == setA.equals(setB));

It is better to use a dedicated type:

public final class CaselessString {
  private final String string;
  private final String normalized;

  private CaselessString(String string, Locale locale) {
    this.string = string;
    normalized = string.toUpperCase(locale);
  }

  @Override public String toString() { return string; }

  @Override public int hashCode() { return normalized.hashCode(); }

  @Override public boolean equals(Object obj) {
    if (obj instanceof CaselessString) {
      return ((CaselessString) obj).normalized.equals(normalized);
    }
    return false;
  }

  public static CaselessString as(String s, Locale locale) {
    return new CaselessString(s, locale);
  }

  public static CaselessString as(String s) {
    return as(s, Locale.ENGLISH);
  }

  // TODO: probably best to implement CharSequence for convenience
}

This code is less likely to cause bugs:

Set<CaselessString> set1 = new HashSet<CaselessString>();
set1.add(CaselessString.as("Hello"));
set1.add(CaselessString.as("HELLO"));

Set<CaselessString> set2 = new HashSet<CaselessString>();
set2.add(CaselessString.as("hello"));

System.out.println("1: " + set1);
System.out.println("2: " + set2);
System.out.println("equals: " + set1.equals(set2));

This is, unfortunately, more verbose.



回答2:

It could be done by:

  1. Moving the content of your lists into case-insensitive TreeSets,
  2. then removing all common Strings case-insensitively thanks TreeSet#removeAll(Collection<?> c)
  3. and finally relying on the fact that ArrayList#retainAll(Collection<?> c) will iterate over the elements of the list and for each element it will call contains(Object o) on the provided collection to know whether the value should be kept or not and here as the collection is case-insensitive, we will keep only the Strings that match case-insensitively with what we have in the provided TreeSet instance.

The corresponding code:

List<String> list1 = new ArrayList<>(
    Arrays.asList("BOB", "Joe", "john", "MARK", "dave", "Bill")
);

List<String> list2 = Arrays.asList("JOE", "MARK", "DAVE");

// Add all values of list1 in a case insensitive collection
Set<String> set1 = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
set1.addAll(list1);
// Add all values of list2 in a case insensitive collection
Set<String> set2 = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
set2.addAll(list2);
// Remove all common Strings ignoring case
set1.removeAll(set2);
// Keep in list1 only the remaining Strings ignoring case
list1.retainAll(set1);

for (String s : list1) {
    System.out.println(s);
}

Output:

BOB
john
Bill

NB 1: It is important to have the content of the second list into a TreeSet especially if we don't know the size of it because the behavior of TreeSet#removeAll(Collection<?> c) depends on the size of both collections, if the size of the current collection is strictly bigger than the size of the provided collection, then it will call directly remove(Object o) on the current collection to remove each element, in this case the provided collection could be a list. But if it is the opposite, it will call contains(Object o) on the provided collection to know whether a given element should be removed or not so if it is not an case-insensitive collection, we won't get the expected result.

NB 2: The behavior of the method ArrayList#retainAll(Collection<?> c) described above is the same as the behavior of the default implementation of the method retainAll(Collection<?> c) that we can find in AbstractCollection such that this approach will actually work with any collections whose implementation of retainAll(Collection<?> c) has the same behavior.



回答3:

You can use a hashmap and use the capital set as keys that map to the mixed case set.

Keys of hashmaps are unique and you can get a set of them using HashMap.keyset();

to retrieve the original case, it's as simple as HashMap.get("UPPERCASENAME").

And according to the documentation:

Returns a set view of the keys contained in this map. The set is backed by the map, so changes to the map are reflected in the set, and vice-versa. The set supports element removal, which removes the corresponding mapping from this map, via the Iterator.remove, Set.remove, removeAll, retainAll, and clear operations. It does not support the add or addAll operations.

So HashMap.keyset().removeAll will effect the hashmap :)

EDIT: use McDowell's solution. I overlooked the fact that you didn't actually need the letters to be upper case :P



回答4:

This would be an interesting one to solve using google-collections. You could have a constant Predicate like so:

private static final Function<String, String> TO_UPPER = new Function<String, String>() {
    public String apply(String input) {
       return input.toUpperCase();
}

and then what you're after could be done someting like this:

Collection<String> toRemove = Collections2.transform(list2, TO_UPPER);

Set<String> kept = Sets.filter(list1, new Predicate<String>() {
    public boolean apply(String input) {
        return !toRemove.contains(input.toUpperCase());
    }
}

That is:

  • Build an upper-case-only version of the 'to discard' list
  • Apply a filter to the original list, retaining only those items whose uppercased value is not in the upper-case-only list.

Note that the output of Collections2.transform isn't an efficient Set implementation, so if you're dealing with a lot of data and the cost of probing that list will hurt you, you can instead use

Set<String> toRemove = Sets.newHashSet(Collections2.transform(list2, TO_UPPER));

which will restore an efficient lookup, returning the filtering to O(n) instead of O(n^2).



回答5:

as far as i know, hashset's use the object's hashCode-method to distinct them from each other. you should therefore override this method in your object in order to distinct cases.

if you're really using string, you cannot override this method as you cannot extend the String-class.

therefore you need to create your own class containing a string as attribute which you fill with your content. you might want to have a getValue() and setValue(String) method in order to modify the string.

then you can add your own class to the hashmap.

this should solve your problem.

regards