I have ~300 text files that contain data on trackers, torrents and peers. Each file is organised like this:
tracker.txt
time torrent
time peer
time peer
...
time torrent
...
I have several files per tracker and much of the information is repeated (same information, different time).
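To give an idea of the shape of the data, reading one file boils down to something like this (a simplified sketch, not my real parser; it just treats anything that looks like a 40-character hex infohash as a torrent line and everything else as a peer line):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Simplified sketch of reading one dump file: each line is a timestamp
// followed by either a torrent's infohash or a peer's ip:port.
static void readDump(String path) throws IOException {
    int torrentLines = 0, peerLines = 0;
    BufferedReader in = new BufferedReader(new FileReader(path));
    try {
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2) continue;       // skip blank/odd lines
            String value = parts[1].trim();
            if (value.matches("[0-9a-fA-F]{40}")) {
                torrentLines++;                   // torrent line: value is the infohash
            } else {
                peerLines++;                      // peer line: value is the peer's ip:port
            }
        }
    } finally {
        in.close();
    }
    System.out.println(path + ": " + torrentLines + " torrent lines, " + peerLines + " peer lines");
}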
I'd like to be able to analyse what I have and report statistics on things like
- How many torrents are at each tracker
- How many trackers are torrents listed on
- How many peers do torrents have
- How many torrents do peers have
The sheer quantity of data is making this hard for me to do. Here's what I've tried.
MySQL
I put everything into a database; one table per entity type and tables to hold the relationships (e.g. this torrent is on this tracker).
Adding the information to the database was slow (and I didn't have 13GB of it when I tried this) but analysing the relationships afterwards was a no-go. Every mildly complex query took over 24 hours to complete (if at all).
An example query would be:
SELECT COUNT(DISTINCT torrent)
FROM TorrentAtPeer, Peer
WHERE TorrentAtPeer.peer = Peer.id
GROUP BY Peer.ip;
I tried bumping up the memory allocations in my my.cnf file, but it didn't seem to help. I used the my-innodb-heavy-4G.cnf settings file.
EDIT: Adding table details
Here's what I was using:
Peer          Torrent                   Tracker
-----------   -----------------------   ------------------
id (bigint)   id (bigint)               id (bigint)
ip* (int)     infohash* (varchar(40))   url (varchar(255))
port (int)

TorrentAtPeer       TorrentAtTracker
-----------------   ----------------
id (bigint)         id (bigint)
torrent* (bigint)   torrent* (bigint)
peer* (bigint)      tracker* (bigint)
time (int)          time (int)
*indexed field. Navicat reports them as being of normal type and Btree method.
id - Always the primary key
There are no foreign keys. I was confident in my ability to only use IDs that corresponded to existing entities, so adding a foreign key check seemed like a needless delay. Is this naive?
Matlab
This seemed like an application that was designed for some heavy lifting but I wasn't able to allocate enough memory to hold all of the data in one go.
I didn't have numerical data, so I was using cell arrays; I moved from these to tries in an effort to reduce the footprint. I couldn't get it to work.
Java
My most successful attempt so far. I found an implementation of Patricia Tries provided by the people at Limewire. Using this I was able to read in the data and count how many unique entities I had:
- 13 trackers
- 1.7mil torrents
- 32mil peers
I'm still finding it too hard to work out the frequencies of the number of torrents at each peer (i.e. how many peers have exactly n torrents). I'm attempting to do so by building tries like this:
// outer key: peer, inner trie keys: the infohashes seen at that peer
Trie<String, Trie<String, Object>> peers = new PatriciaTrie<String, Trie<String, Object>>(...);
String infohash = null;
for (String line : file) {
    if (containsTorrent(line)) {
        infohash = getInfohash(line);    // peer lines that follow belong to this torrent
    }
    else if (containsPeer(line)) {
        String peer = getPeer(line);
        Trie<String, Object> torrents = peers.get(peer);
        if (torrents == null) {
            // first time this peer has been seen
            torrents = new PatriciaTrie<String, Object>(...);
            peers.put(peer, torrents);
        }
        torrents.put(infohash, null);
    }
}
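For the counting step afterwards, this is roughly what I have in mind (just a sketch: it uses plain java.util maps for the histogram and assumes the Limewire trie can be walked like an ordinary Map via entrySet()):

import java.util.Map;
import java.util.TreeMap;

// histogram: (number of torrents at a peer) -> (how many peers have that many)
Map<Integer, Integer> histogram = new TreeMap<Integer, Integer>();

for (Map.Entry<String, Trie<String, Object>> entry : peers.entrySet()) {
    int torrentsAtThisPeer = entry.getValue().size();
    Integer seen = histogram.get(torrentsAtThisPeer);
    histogram.put(torrentsAtThisPeer, seen == null ? 1 : seen + 1);
}

for (Map.Entry<Integer, Integer> bucket : histogram.entrySet()) {
    System.out.println(bucket.getValue() + " peers have " + bucket.getKey() + " torrents");
}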
From what I've been able to do so far, if I can get this peers trie built then I can easily find out how many torrents are at each peer (roughly as sketched above). I ran it all yesterday and when I came back I noticed that the log file wasn't being written to, so I ^Z'd the application and time reported the following:
real 565m41.479s
user 0m0.001s
sys 0m0.019s
This doesn't look right to me; should user and sys be so low? I should mention that I've also increased the JVM's heap size to 7GB (both initial and maximum); without that I rather quickly get an out-of-memory error.
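For reference, that just means launching it with something along these lines (the class name is only a placeholder):

java -Xms7g -Xmx7g AnalyseDumps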
I don't mind waiting for several hours/days but it looks like the thing grinds to a halt after about 10 hours.
I guess my question is: how can I go about analysing this data? Are the things I've tried the right things? Are there things I'm missing? The Java solution seems to be the best so far; is there anything I can do to get it to work?