Array of structs with dynamic allocation runs very

2019-07-22 12:57发布

问题:

I "glue" together (with a help of other SO users) a small C program that maps string labels to integer labels in a edgelist data structure. For instance, for input file

Mike Andrew
Mike Jane
John Jane

the program outputs

1 2
1 3
4 3

However, I mapped huge edgelist files and unfortunately the program runs very slow in comparison to Python alternative. Below I pasted both programs, in C and Python. I kindly ask for a pointers how to improve speed of the C program.

#include <stdio.h>
#include <stdlib.h>

// Initial number of maximal lines in a file
enum { MAXL = 200};

typedef struct {
    unsigned int first;
    unsigned int second;
} edge;

typedef struct {
    unsigned int hashed;
     char **map;
} hash;


int insertInMap(hash *map, char *entry)
{
  int i =0;
  for (i=0;i<map->hashed;i++)
  {
    if (strcmp(map->map[i],entry) == 0)
    return i+1;
  }
  /* Warning no boundary check is added */
  map->map[map->hashed++] = strdup(entry);   
  return map->hashed;
}

int main() {
  FILE *fp = NULL;
  char node1[30];
  char node2[30];
  int idx = 0;
  int i, n = 0, maxl = MAXL;

  edge *edges;
  hash map;

  edges = malloc(MAXL * sizeof(edge));
  map.map = malloc(MAXL * sizeof(char*));
  map.hashed = 0;

  fp = fopen("./test.txt", "r");

  while (fscanf(fp, "%s %s", &node1, &node2) == 2) {
    if (++n == maxl) { /* if limit reached, realloc lines  */
      void *tmp = realloc (edges, (maxl + 40) * sizeof *edges);
      void *tmp1 = realloc (map.map, (maxl + 80) * sizeof(char*));
      if (!tmp) {     /* validate realloc succeeded */
        fprintf (stderr, "error: realloc - virtual memory exhausted.\n");
        break;      /* on failure, exit with existing data */
      }
      edges = tmp;    /* assign reallocated block to lines */

      map.map = tmp1;
      maxl += 40;     /* update maxl to reflect new size */
    }
    edges[idx].first = insertInMap(&map,node1);
    edges[idx].second = insertInMap(&map,node2);
    idx++;
  }

  fclose(fp);

  for (int i = 0; i < idx; i++) {
    printf("%d -- %d\n", edges[i].first, edges[i].second);
  }


  free(edges);

  return 0;
}

The corresponding Python alternative:

import fileinput

i = 0
cui2int = {}

for line in fileinput.input():    
    (cui1, cui2) = line.split()
    if cui1 in cui2int:
        int1 = cui2int[cui1]
    else:
        i += 1
        cui2int[cui1] = i
        int1 = i

    if cui2 in cui2int:
        int2 = cui2int[cui2]
    else:
        i += 1
        cui2int[cui2] = i
        int2 = i

    print(int1, int2)

EDITED AND ADDED

Below is modified code using the GLib hash implementation. I improved performance, but unfortunately still have troubles with the output which should be

1 2
1 3
4 3

instead of

0 0
0 1
1 1

Could somebody please take a look.

#include <stdio.h>
#include <stdlib.h>
#include <glib.h>
#include <stdint.h>

int main() {
  GHashTable *table;
  table = g_hash_table_new(g_int_hash, g_int_equal);

  FILE *fp = NULL;
  char node1[30];
  char node2[30];

  fp = fopen("./test.txt", "r");
  int i = 0;
  while (fscanf(fp, "%s %s", &node1, &node2) == 2) {
    char *key1 = malloc(sizeof(char)*1024);
    char *key2 = malloc(sizeof(char)*1024);
    uint32_t* value = (uint32_t *)malloc(sizeof(uint32_t));
    key1 = g_strdup(node1);
    key2 = g_strdup(node2);
    *value = i;

    uint32_t *x;
    if (g_hash_table_contains(table, key1)) {
      x = (uint32_t *)g_hash_table_lookup(table, key1);
    } else {
      i++;
      g_hash_table_insert(table, (gpointer)key1, (gpointer)value);
      x = (uint32_t *)value;
    }

    uint32_t *y;
    if (g_hash_table_contains(table, key2)) {
      y = (uint32_t *)g_hash_table_lookup(table, key2);
    } else {
      g_hash_table_insert(table, (gpointer)key2, (gpointer)value);
      y = (uint32_t *)value;
    }
    printf("%d -- %d\n", *x, *y);
  }

  fclose(fp);

  g_hash_table_destroy(table);
  table = NULL;
  return 0;
}

回答1:

Your "hash" in C operates more like a linked list, with linear insertion and lookup. On the other hand, Python's dictionary is industrial-strength with O(1) average insertions and lookups (the in operator). If you're writing a hashmap from scratch in C, there's a good deal of theory you're going to need to put into practice in order to begin approaching Python's implementation in terms of performance.

In my opinion, the best bet is to write your code in C++ if possible and use an unordered_map. This is the best of both worlds: all the work has already been done for you, yet you don't need to make performance compromises.

If you are set on (or stuck with) C, there are quite a few resources on the internet, but I'm hesitant to post any links here since I can't vouch for their quality. It should be an educational endeavor.



回答2:

The two programs are using fundamentally different data structures with different time complexity. The python program is using a dictionary which is a highly tuned hash- table with O(1) amortized performance for lookup and deletion.

So the python program is running with O(number of words) asymptotic complexity.

Now, talking about your C program what you have tried to create is essentially just an array of key-value pairs. Inserting or retrieving the keys takes O(size of the array) here since you can potentially loop through the array till the end to find a match.

If you do some math, it turns out to be O((number of words)2).

C++ has built-in Hash-table implementation named unordered_map, you can use that if you have no trouble switching to C++. Or check out this question on SO to learn writing your own hash table in C. What is a hash table and how do you make it in C?



回答3:

The problem with your code is that despite the name, it isn't a working hash table. You chew through the map with a linear search which is very slow. What you should do instead:

  • Set the hash table size to a fixed size. Avoid any solution based on realloc.
  • Come up with a hashing function to determine table index. There should be plenty of code examples on the net using strings.
  • Implement a method for storing/checking an index. This could be storing in the next table index available, or by implementing "chaining", where each index is a linked list, etc.