Efficient way to find frequencies of each unique v

2020-04-07 19:52发布

问题:

Given a vector std::vector<double> v, we can find unique elements efficiently by:

std::vector<double> uv(v.begin(), v.end());
std::sort(uv.begin(), uv.end());
std::erase(std::unique(uv.begin, uv.end()), uv.end());

What would the be the nicest way (without loops, with STL or lambdas) to create a vector:

std::vector<double> freq_uv(uv.size());

which would contain frequencies of each distinct element appearing in v (order the same as sorted unique values)?

Note: type can be anything, not just double

回答1:

After you sort, before you erase:

std::vector<int> freq_uv;
freq_uv.push_back(0);
auto prev = uv[0];        // you should ensure !uv.empty() if previous code did not already ensure it.
for (auto const & x : uv)
{
    if (prev != x)
    {
        freq_uv.push_back(0);
        prev = x;
    }
    ++freq_uv.back();
}

Note that, while I generally like to count occurences with a map, as Yakk is doing, in this case I think it is doing a lot of unnecessary work as we already know the vector is sorted.

Another possibility is to use a std::map (not unordered), instead of sorting. This will get your frequencies first. Then, since the map is ordered, you can just create the sorted, unique vector, and the frequency vector directly from the map.

// uv not yet created
std::map<T, int> freq_map;
for (auto const & x : v)
    ++freq_map[x];
std::vector<T> uv;
std::vector<int> freq_uv;
for (auto const & p : freq_map)
{
    uv.push_back(p.first);
    freq_uv.push_back(p.second);
}


回答2:

First, note that == and to a lesser extent < on double is often a poor idea: often you'll have values that logically "should" be equal if the double was infinite precision, but are slightly different.

However, collecting the frequencies is easy:

template<typename T, typename Allocator>
std::unordered_map< T, std::size_t > frequencies( std::vector<T, Allocator> const& src ) {
  std::unordered_map< T, std::size_t > retval;
  for (auto&& x:src)
    ++retval[x];
  return retval;
}

assuming std::hash<T> is defined (which it is for double). If not, there is more boilerplate, so I'll skip it. Note that this does not care if the vector is sorted.

If you want it in the form of std::vector<std::size_t> in sync with your sorted vector, you can just do this:

template<typename T, typename Hash, typename Equality, typename Allocator>
std::vector<std::size_t> collate_frequencies(
  std::vector<T, Allocator> const& order,
  std::unordered_map<T, std::size_t, Hash, Equality> const& frequencies
) {
  std::vector<std::size_t> retval;
  retval.reserve(order.size());
  for( auto&& x : order )
    retval.push_back( frequencies[x] );
  return retval;
}

I took the liberty of making these functions overly generic, so they support more than just doubles.



回答3:

using equal_range:

std::vector<int> results;
for(auto i = begin(v); i != end(v);)
{
    auto r = std::equal_range(i, end(v), *i);
    results.emplace_back( std::distance(r.first, r.second) );
    i = r.second;
}

SSCCE:

#include <vector>
#include <algorithm>
#include <iostream>
#include <iterator>

int main()
{
    std::vector<double> v{1.0, 2.0, 1.0, 2.0, 1.0, 3.0};
    std::sort(begin(v), end(v));

    std::vector<int> results;
    for(auto i = begin(v); i != end(v);)
    {
        auto r = std::equal_range(i, end(v), *i);
        results.emplace_back( std::distance(r.first, r.second) );
        i = r.second;
    }

    for(auto const& e : results) std::cout << e << "; ";
}