Keep track of highest 5 numbers during file input

2019-05-12 16:38发布

问题:

So lets say i have a struct

struct largestOwners
{
    string name;
    double amountOwned;
};

And i am reading it in from a file using ifstream with 300 names and amounts.

How can i go about keeping track of the highest 5 numbers during input? So i dont have to sort after, but rather track it during ifstream input.

My goal is to keep track of the 5 highest amounts during input so i can easily print it out later. And save time/processing rather than to do it in the future

I get i can store this in an array or another struct, but whats a good algorithm for tracking this during ifstream input to the struct?

Lets say the text file looks like this, when im reading it in.

4025025 Tony
66636 John
25 Tom
23693296 Brady
363 Bradley
6200 Tim

Thanks!

回答1:

To keep track of the highest 5 numbers in a stream of incoming numbers, you could use a min-heap of size 5 (C++ STL set can be used as a min-heap).

First fill the min-heap with the first 5 numbers. After that, for each incoming element, compare that with the smallest of the largest 5 numbers that you have (root of the min-heap). If the current number is smaller than that, do nothing, otherwise remove the 5th largest (pop from min-heap) and insert the current number to the min-heap.

Deleting and inserting in the min-heap will take O(log n) time.

For example, consider the following stream of numbers:

1 2 5 6 3 4 0 10 3

The min-heap will have 1 2 3 5 6 initially.

On encountering 4, 1 gets removed and 4 gets inserted. Min heap now looks like this: 2 3 4 5 6

On encountering 0, nothing happens.

On encountering 10, 2 gets removed and 10 gets inserted. Min heap now looks like this: 3 4 5 6 10

On encountering 3, nothing happens.

So your final set of 5 largest elements are contained in the heap (3 4 5 6 10)

You can even tweak this to keep track of the k highest elements in an incoming stream of numbers. Just change the size of the min-heap to k.



回答2:

While reading the file, keep a sorted list of the 5 largest numbers seen (and their owners). Whenever you read a value higher than the lowest of the 5, remove the lowest and insert the new number in your sorted list.

List list can be stored in an array or in any other data structure that has an order and where you can implement a sort and insert. (Or where this is already implemented)

Instead of sorting the list you can also simply go through the 5 entries every time you read a new one (should not be too bad, because 5 entries is a very small number)



回答3:

You can use the standard library function std::nth_element() for this.

It should be fairly easy to implement a comparison function (or overload the comparison operator) for your struct. Then you'd just parse the file into a vector of those and be done with it. The algorithm uses a partial sort, linear time on average.

Here's the example given on the documentation site I've linked below:

#include <iostream>
#include <vector>
#include <algorithm>
#include <functional>

int main()
{
    std::vector<int> v{5, 6, 4, 3, 2, 6, 7, 9, 3};

    std::nth_element(v.begin(), v.begin() + v.size()/2, v.end());
    std::cout << "The median is " << v[v.size()/2] << '\n';

    std::nth_element(v.begin(), v.begin()+1, v.end(), std::greater<int>());
    std::cout << "The second largest element is " << v[1] << '\n';
}

For reference:

  • http://en.cppreference.com/w/cpp/algorithm/nth_element

Out of curiosity, I have implemented some approaches:

#include <algorithm>
#include <functional>
#include <queue>
#include <set>
#include <vector>

std::vector<int> filter_nth_element(std::vector<int> v, int n) {
    auto target = v.begin()+n;
    std::nth_element(v.begin(), target, v.end(), std::greater<int>());
    std::vector<int> result(v.begin(), target);
    return result;
}

std::vector<int> filter_pqueue(std::vector<int> v, int n) {
    std::vector<int> result;
    std::priority_queue<int, std::vector<int>, std::greater<int>> q;
    for (auto i: v) {
        q.push(i);
        if (q.size() > n) {
            q.pop();
        }
    }
    while (!q.empty()) {
        result.push_back(q.top());
        q.pop();
    }
    return result;
}

std::vector<int> filter_set(std::vector<int> v, int n) {
    std::set<int> s;
    for (auto i: v) {
        s.insert(i);
        if (s.size() > n) {
            s.erase(s.begin());
        }
    }
    return std::vector<int>(s.begin(), s.end());
}

std::vector<int> filter_deque(std::vector<int> v, int n) {
    std::deque<int> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::min_element(q.begin(), q.end()));
        }
    }
    return std::vector<int>(q.begin(), q.end());
}

std::vector<int> filter_vector(std::vector<int> v, int n) {
    std::vector<int> q;
    for (auto i: v) {
        q.push_back(i);
        if (q.size() > n) {
            q.erase(std::min_element(q.begin(), q.end()));
        }
    }
    return q;
}

And I have made up some tests:

#include <random>
#include <iostream>
#include <chrono>

std::vector<int> filter_nth_element(std::vector<int> v, int n);
std::vector<int> filter_pqueue(std::vector<int> v, int n);
std::vector<int> filter_set(std::vector<int> v, int n);
std::vector<int> filter_deque(std::vector<int> v, int n);
std::vector<int> filter_vector(std::vector<int> v, int n);

struct stopclock {
    typedef std::chrono::high_resolution_clock high_resolution_clock;
    std::chrono::time_point<high_resolution_clock> start, end;
    stopclock() : start(high_resolution_clock::now()) {}
    ~stopclock() {
        using namespace std::chrono;
        auto elapsed = high_resolution_clock::now() - start;
        auto elapsed_ms = duration_cast<milliseconds>(elapsed);
        std::cout << elapsed_ms.count() << " ";
    }
};

int main() {
    // randomly initialize input array
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dist;
    std::vector<int> v(10000000);
    for (auto &i: v)
        i = dist(gen);
    // run tests
    for (std::vector<int>::size_type x = 5; x <= 100; x+=5) {
        // keep this many values
        std::cout << x << " ";
        {
            stopclock t;
            auto result = filter_nth_element(v, x);
        }
        {
            stopclock t;
            auto result = filter_pqueue(v, x);
        }
        {
            stopclock t;
            auto result = filter_set(v, x);
        }
        {
            stopclock t;
            auto result = filter_deque(v, x);
        }
        {
            stopclock t;
            auto result = filter_vector(v, x);
        }
        std::cout << "\n";
    }
}

And I found it quite interesting to see the relative performance of these approaches (compiled with -O3 - I think I have to think a bit about these results):



回答4:

A binary search tree could be a suitable data structure for this problem. Maybe you can find a suitable Tree class in STL or Boost or so (try to look for that). Otherwise simply use a struct if you insist.

The struct would be like that:

struct tnode {        /* the tree node: */
    char *word;           /* points to the text */
    int count;            /* number of occurrences */
    struct tnode *left;   /* left child */
    struct tnode *right;  /* right child */
};

Taken from The C Programming Language, chapter 6.5 - Self-referential Structures. Just adapt it to your needs.

Though, I think if you want to program in C++ properly, try to create a Tree data structure (class) or try to use an existing one.

Considering that you only have 300 entries, that should do it. In theory when the input data is random it is supposed to be efficient. But that is theory and does not really play a role in your case. I think it is a good solution.



回答5:

You can use sorted buffer of 5 elements and on each step if item is higher than lowest item of the buffer, put item in buffer and evict lowest



回答6:

Use a map of elements

First Create a class

  class Data {
    public:
       std::string name;
       int         number;
};


 typedef std::map< int, Data > DataItems;
 DataItems  largest;

If the size of largest is < 5, then you haven't read five elements.

 if( largest.size() < 5 ) {
     largest[ dt.number] = dt;
 } else {

Otherwise - if it is larger than the smallest of the largest five, then the largest five has changed.

     DataItems::iterator it = largest.begin(); // lowest current item.
     if( it->second.number < dt.number ) { // is this item bigger? - yes
          largest[ dt.number ] = dt;       // add it (largest.size() == 6)
          largest.erase( largest.begin() );// remove smallest item
     }
 }


回答7:

You can use a set to keep track of the highest values. If you want to track non-unique numbers use a multiset instead:

vector<int> nums{10,11,12,1,2,3,4,5,6,7,8,9}; //example data

int k=5;       // number of values to track
set<int> s;    // this set will hold the result

for(auto a: nums)
{
    if(s.size()<k)s.insert(a); 
    else if(a>*s.begin())
    {
       s.erase(s.begin());
       s.insert(a);
    }
}

Of course you will have to provide a custom comparison function for your struct.