sort array of integers lexicographically C++

2019-03-11 05:48发布

I want to sort a large array of integers (say 1 millon elements) lexicographically.

Example:

input [] = { 100, 21 , 22 , 99 , 1  , 927 }
sorted[] = { 1  , 100, 21 , 22 , 927, 99  }

I have done it using the simplest possible method:

  • convert all numbers to strings (very costly because it will take huge memory)
  • use std:sort with strcmp as comparison function
  • convert back the strings to integers

Is there a better method than this?

12条回答
Lonely孤独者°
2楼-- · 2019-03-11 06:03

The task sounds like a natural fit for an MSD variant of Radix Sort with padding ( http://en.wikipedia.org/wiki/Radix_sort ).

Depends on how much code you want to throw at it. The simple code as the others show is O(log n) complexity, while a fully optimized radix sort would be O(kn).

查看更多
趁早两清
3楼-- · 2019-03-11 06:07

A compact solution if all your numbers are nonnegative and they are small enough so that multiplying them by 10 doesn't cause an overflow:

template<class T> bool lex_less(T a, T b) {
  unsigned la = 1, lb = 1;
  for (T t = a; t > 9; t /= 10) ++la;
  for (T t = b; t > 9; t /= 10) ++lb;
  const bool ll = la < lb;
  while (la > lb) { b *= 10; ++lb; }
  while (lb > la) { a *= 10; ++la; }
  return a == b ? ll : a < b;
}

Run it like this:

#include <iostream>
#include <algorithm>
int main(int, char **) {
  unsigned short input[] = { 100, 21 , 22 , 99 , 1  , 927 };
  unsigned input_size = sizeof(input) / sizeof(input[0]);
  std::sort(input, input + input_size, lex_less<unsigned short>);
  for (unsigned i = 0; i < input_size; ++i) {
    std::cout << ' ' << input[i];
  }
  std::cout << std::endl;
  return 0;
}
查看更多
趁早两清
4楼-- · 2019-03-11 06:07

You could try using the % operator to give you access to each individual digit eg 121 % 100 will give you the first digit and check that way but you'll have to find a way to get around the fact they have different sizes.

So find the maximum value in array. I don't know if theres a function for this in built you could try.

int Max (int* pdata,int size)
 {
int temp_max =0 ;

for (int i =0 ; i < size ; i++)
 {
    if (*(pdata+i) > temp_max)
    {
        temp_max = *(pdata+i);

    }
 }
 return temp_max;
 }

This function will return the number of digits in the number

 int Digit_checker(int n)
{
 int num_digits = 1;

while (true)
{
    if ((n % 10) == n)
        return num_digits;
    num_digits++;
    n = n/10;
}
return num_digits;
}

Let number of digits in max equal n. Once you have this open a for loop in the format of for (int i = 1; i < n ; i++)

then you can go through your and use "data[i] % (10^(n-i))" to get access to the first digit then sort that and then on the next iteration you'll get access to the second digit. I Don't know how you'll sort them though.

It wont work for negative numbers and you'll have to get around data[i] % (10^(n-i)) returning itself for numbers with less digits than max

查看更多
家丑人穷心不美
5楼-- · 2019-03-11 06:08

Here is the dumb solution that doesn't use any floating point tricks. It's pretty much the same as the string comparison, but doesn't use a string per say, doesn't also handle negative numbers, to do that add a section at the top...

bool comp(int l, int r)
{
  int lv[10] = {}; // probably possible to get this from numeric_limits
  int rv[10] = {};

  int lc = 10; // ditto
  int rc = 10;
  while (l || r)
  {
    if (l)
    {
      auto t = l / 10;
      lv[--lc] = l - (t * 10);
      l = t;
    }
    if (r)
    {
      auto t = r / 10;
      rv[--rc] = r - (t * 10);
      r = t;
    }
  }
  while (lc < 10 && rc < 10)
  {
    if (lv[lc] == rv[rc])
    {
      lc++;
      rc++;
    }
    else
      return lv[lc] < rv[rc];
  }
  return lc > rc;
}

It's fast, and I'm sure it's possible to make it faster still, but it works and it's dumb enough to understand...

EDIT: I ate to dump lots of code, but here is a comparison of all the solutions so far..

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <random>
#include <vector>
#include <utility>
#include <cmath>
#include <cassert>
#include <chrono>

std::pair<int, int> lexicographic_pair_helper(int p, int maxDigits)
{
  int digits = std::log10(p);
  int l = p*std::pow(10, maxDigits-digits);
  return {l, p};
}

bool l_comp(int l, int r)
{
  int lv[10] = {}; // probably possible to get this from numeric_limits
  int rv[10] = {};

  int lc = 10; // ditto
  int rc = 10;
  while (l || r)
  {
    if (l)
    {
      auto t = l / 10;
      lv[--lc] = l - (t * 10);
      l = t;
    }
    if (r)
    {
      auto t = r / 10;
      rv[--rc] = r - (t * 10);
      r = t;
    }
  }
  while (lc < 10 && rc < 10)
  {
    if (lv[lc] == rv[rc])
    {
      lc++;
      rc++;
    }
    else
      return lv[lc] < rv[rc];
  }
  return lc > rc;
}

int up_10pow(int n) {
  int ans = 1;
  while (ans < n) ans *= 10;
  return ans;
}
bool l_comp2(int v1, int v2) {
  int n1 = up_10pow(v1), n2 = up_10pow(v2);
  while ( v1 != 0 && v2 != 0) {
    if (v1 / n1  < v2 / n2) return true;
    else if (v1 / n1 > v2 / n2) return false;
    v1 /= 10;
    v2 /= 10;
    n1 /= 10;
    n2 /= 10;
  }
  if (v1 == 0 && v2 != 0) return true;
  return false;
}

int main()
{
  std::vector<int> numbers;
  {
    constexpr int number_of_elements = 1E6;
    std::random_device rd;
    std::mt19937 gen( rd() );
    std::uniform_int_distribution<> dist;
    for(int i = 0; i < number_of_elements; ++i) numbers.push_back( dist(gen) );
  }

  std::vector<int> lo(numbers);
  std::vector<int> dyp(numbers);
  std::vector<int> nim(numbers);
  std::vector<int> nb(numbers);

  std::cout << "starting..." << std::endl;

  {

    auto start = std::chrono::high_resolution_clock::now();
    /**
    * Sorts the array lexicographically.
    *
    * The trick is that we have to compare digits left-to-right
    * (considering typical Latin decimal notation) and that each of
    * two numbers to compare may have a different number of digits.
    *
    * This probably isn't very efficient, so I wouldn't do it on
    * "millions" of numbers. But, it works...
    */
    std::sort(
    std::begin(lo),
              std::end(lo),
              [](int lhs, int rhs) -> bool {
                // Returns true if lhs < rhs
                // Returns false otherwise
                const auto BASE      = 10;
                const bool LHS_FIRST = true;
                const bool RHS_FIRST = false;
                const bool EQUAL     = false;


                // There's no point in doing anything at all
                // if both inputs are the same; strict-weak
                // ordering requires that we return `false`
                // in this case.
                if (lhs == rhs) {
                  return EQUAL;
                }

                // Compensate for sign
                if (lhs < 0 && rhs < 0) {
                  // When both are negative, sign on its own yields
                  // no clear ordering between the two arguments.
                  //
                  // Remove the sign and continue as for positive
                  // numbers.
                  lhs *= -1;
                  rhs *= -1;
                }
                else if (lhs < 0) {
                  // When the LHS is negative but the RHS is not,
              // consider the LHS "first" always as we wish to
              // prioritise the leading '-'.
              return LHS_FIRST;
                }
                else if (rhs < 0) {
                  // When the RHS is negative but the LHS is not,
              // consider the RHS "first" always as we wish to
              // prioritise the leading '-'.
              return RHS_FIRST;
                }

                // Counting the number of digits in both the LHS and RHS
                // arguments is *almost* trivial.
                const auto lhs_digits = (
                lhs == 0
                ? 1
                : std::ceil(std::log(lhs+1)/std::log(BASE))
                );

                const auto rhs_digits = (
                rhs == 0
                ? 1
                : std::ceil(std::log(rhs+1)/std::log(BASE))
                );

                // Now we loop through the positions, left-to-right,
              // calculating the digit at these positions for each
              // input, and comparing them numerically. The
              // lexicographic nature of the sorting comes from the
              // fact that we are doing this per-digit comparison
              // rather than considering the input value as a whole.
              const auto max_pos = std::max(lhs_digits, rhs_digits);
              for (auto pos = 0; pos < max_pos; pos++) {
                if (lhs_digits - pos == 0) {
                  // Ran out of digits on the LHS;
                  // prioritise the shorter input
                  return LHS_FIRST;
                }
                else if (rhs_digits - pos == 0) {
                  // Ran out of digits on the RHS;
                  // prioritise the shorter input
                  return RHS_FIRST;
                }
                else {
                  const auto lhs_x = (lhs / static_cast<decltype(BASE)>(std::pow(BASE, lhs_digits - 1 - pos))) % BASE;
                  const auto rhs_x = (rhs / static_cast<decltype(BASE)>(std::pow(BASE, rhs_digits - 1 - pos))) % BASE;

                  if (lhs_x < rhs_x)
                    return LHS_FIRST;
                  else if (rhs_x < lhs_x)
                    return RHS_FIRST;
                }
              }

              // If we reached the end and everything still
              // matches up, then something probably went wrong
              // as I'd have expected to catch this in the tests
              // for equality.
              assert("Unknown case encountered");
              }
              );

    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed = end - start;
    std::cout << "Lightness: " << elapsed.count() << '\n';
  }

  {
    auto start = std::chrono::high_resolution_clock::now();

    auto max = *std::max_element(begin(dyp), end(dyp));
    int maxDigits = std::log10(max);

    std::vector<std::pair<int,int>> temp;
    temp.reserve(dyp.size());
    for(auto const& e : dyp) temp.push_back( lexicographic_pair_helper(e, maxDigits) );

    std::sort(begin(temp), end(temp), [](std::pair<int, int> const& l, std::pair<int, int> const& r)
    { if(l.first < r.first) return true; if(l.first > r.first) return false; return l.second < r.second; });

    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed = end - start;
    std::cout << "Dyp: " << elapsed.count() << '\n';
  }

  {
    auto start = std::chrono::high_resolution_clock::now();
    std::sort (nim.begin(), nim.end(), l_comp);
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed = end - start;
    std::cout << "Nim: " << elapsed.count() << '\n';
  }

//   {
//     auto start = std::chrono::high_resolution_clock::now();
//     std::sort (nb.begin(), nb.end(), l_comp2);
//     auto end = std::chrono::high_resolution_clock::now();
//     auto elapsed = end - start;
//     std::cout << "notbad: " << elapsed.count() << '\n';
//   }

  std::cout << (nim == lo) << std::endl;
  std::cout << (nim == dyp) << std::endl;
  std::cout << (lo == dyp) << std::endl;
//   std::cout << (lo == nb) << std::endl;
}
查看更多
Emotional °昔
6楼-- · 2019-03-11 06:12

Overload the < operator to compare two integers lexicographically. For each integer, find the smallest 10^k, which is not less than the given integer. Than compare the digits one by one.

class CmpIntLex {
int up_10pow(int n) {
  int ans = 1;
  while (ans < n) ans *= 10;
  return ans;
}
public: 
bool operator ()(int v1, int v2) {
   int ceil1 = up_10pow(v1), ceil2 = up_10pow(v2);
   while ( ceil1 != 0 && ceil2 != 0) {
      if (v1 / ceil1  < v2 / ceil2) return true;
      else if (v1 / ceil1 > v2 / ceil2) return false;
      ceil1 /= 10; 
      ceil2 /= 10;
   }
   if (v1 < v2) return true;
   return false;
}
int main() {
vector<int> vi = {12,45,12134,85};
sort(vi.begin(), vi.end(), CmpIntLex());
}
查看更多
Evening l夕情丶
7楼-- · 2019-03-11 06:14

Use std::sort() with a suitable comparison function. This cuts down on the memory requirements.

The comparison function can use n % 10, n / 10 % 10, n / 100 % 10 etc. to access the individual digits (for positive integers; negative integers work a bit differently).

查看更多
登录 后发表回答