我在谷歌App Engine数据存储两个数据集。
class First_Set(db.Model):
start_time = db.DateTimeProperty()
end_time = db.DateTimeProperty()
data1 = db.FloatProperty()
...
class Second_Set(db.Model):
start_time = db.DateTimeProperty()
end_time = db.DateTimeProperty()
data2 = db.FloatProperty()
...
(他们有不同的数据,这就是为什么他们在不同的数据集。)
我想找到的数据存储ID的所有重叠START_TIME和END_TIME跨越两个数据集,最理想的情况从一个拉动的结果,并遍历其它的第一批成果。
初始数据集的一大可视化是从这里 (它也有在SQL解决了这个问题):
1 |-----|
2 |-----|
3 |--|
4 |-----|
5 |-----|
6 |---|
7 |---|
8 |---|
9 |-----|
最终的结果我需要的是在调东西(从相同的例子 ):
+----+---------------------+----+---------------------+
| id | start | id | end |
+----+---------------------+----+---------------------+
| 2 | 2008-09-01 15:02:00 | 1 | 2008-09-01 15:04:00 |
| 5 | 2008-09-01 16:19:00 | 4 | 2008-09-01 16:23:00 |
| 8 | 2008-09-01 16:20:00 | 4 | 2008-09-01 16:22:00 |
| 8 | 2008-09-01 16:20:00 | 5 | 2008-09-01 16:22:00 |
| 7 | 2008-09-01 18:18:00 | 9 | 2008-09-01 18:22:00 |
+----+---------------------+----+---------------------+
SQL解决方案在下面的例子进行了说明,但由于缺乏JOIN的数据存储我不能这样做:
SELECT v1.id, v1.start, v2.id, LEAST(v1.end,v2.end) AS end
FROM visits v1
JOIN visits v2 ON v1.id <> v2.id and v1.start >= v2.start and v1.start < v2.end
ORDER BY v1.start;
据我所知,使用的ListProperty()(从一到多版本的,这是相当简单的这个问题 )。
谁能想到一个解决方案,找到重叠的时间(最好在Python)吗?
看看Marzullo算法的时间效率为O(n log n)的。
也有#2很多问题 ,涵盖可用于解决AppEngine上你的问题的重叠区间。
发布我的解决方案没有联接,由于吉文的方向。 应该能够在任意数量的与小修改数据集(至少是理论)找到重叠。
我的Python是不是很大,但低于应给予的想法:
from operator import itemgetter
class Find_Overlaps(webapp2.RequestHandler):
def get(self):
all_dates = []
first_dates = db.GqlQuery("SELECT * FROM First_Set")
for date in first_dates:
row = {'dataset':'First_Set', 'dbkey':date.key(), 'offset':date.start_time, 'type': -1}
all_dates.append(row)
row = {'dataset':'First_Set', 'dbkey':date.key(), 'offset':date.end_time, 'type': 1}
all_dates.append(row)
second_dates = db.GqlQuery("SELECT * FROM Second_Set")
for date in second_dates:
row = {'dataset':'Second_Set', 'dbkey':date.key(), 'offset':date.start_time, 'type': -1}
all_dates.append(row)
row = {'dataset':'Second_Set', 'dbkey':date.key(), 'offset':date.end_time, 'type': 1}
all_dates.append(row)
newlist = sorted(all_dates, key=itemgetter('offset','type'))
number_datasets = 2 #goal is to find overlaps in all sets not only the best overlaps, that's why this is needed
loopcnt = 0
update_bestend = 0
overlaps = []
for row in newlist: #Below is mostly from Marzullo's alghorithm
loopcnt = loopcnt - row['type']#this is to keep track of overall tally
if update_bestend == 1:
if loopcnt == (number_datasets - 1):
bestend = row['offset']
end_set = row['dataset']
end_key = row['dbkey']
overlaps.append({'start':beststart,'start_set':start_set,'start_key':start_key,'end':bestend,'end_set':end_set,'end_key':end_key})
update_bestend = 0
if loopcnt == number_datasets:
beststart = row['offset']
start_set = row['dataset']
start_key = row['dbkey']
update_bestend = 1
for overlap in overlaps: #just to see what the outcome is
self.response.out.write('start: %s, start_set: %s, end: %s, end_set: %s<br>' % (overlap['start'], overlap['start_set'], overlap['end'], overlap['end_set']))
文章来源: Compare many date ranges in google app engine datastore (Many to many, Python)