-->

比较在谷歌App Engine数据存储许多日期范围(多对多,巨蟒)(Compare many dat

2019-09-18 09:28发布

我在谷歌App Engine数据存储两个数据集。

class First_Set(db.Model):
  start_time = db.DateTimeProperty()
  end_time = db.DateTimeProperty()
  data1 = db.FloatProperty()
  ...

class Second_Set(db.Model):
  start_time = db.DateTimeProperty()
  end_time = db.DateTimeProperty()
  data2 = db.FloatProperty()
  ...

(他们有不同的数据,这就是为什么他们在不同的数据集。)

我想找到的数据存储ID的所有重叠START_TIME和END_TIME跨越两个数据集,最理想的情况从一个拉动的结果,并遍历其它的第一批成果。

初始数据集的一大可视化是从这里 (它也有在SQL解决了这个问题):

1     |-----| 
2        |-----| 
3                 |--| 
4                       |-----| 
5                          |-----| 
6                                  |---| 
7                                        |---|  
8                           |---| 
9                                       |-----|

最终的结果我需要的是在调东西(从相同的例子 ):

+----+---------------------+----+---------------------+ 
| id | start               | id | end                 | 
+----+---------------------+----+---------------------+ 
|  2 | 2008-09-01 15:02:00 |  1 | 2008-09-01 15:04:00 | 
|  5 | 2008-09-01 16:19:00 |  4 | 2008-09-01 16:23:00 | 
|  8 | 2008-09-01 16:20:00 |  4 | 2008-09-01 16:22:00 | 
|  8 | 2008-09-01 16:20:00 |  5 | 2008-09-01 16:22:00 | 
|  7 | 2008-09-01 18:18:00 |  9 | 2008-09-01 18:22:00 | 
+----+---------------------+----+---------------------+ 

SQL解决方案在下面的例子进行了说明,但由于缺乏JOIN的数据存储我不能这样做:

SELECT v1.id, v1.start, v2.id, LEAST(v1.end,v2.end) AS end 
FROM visits v1 
JOIN visits v2 ON v1.id <> v2.id and v1.start >= v2.start and v1.start < v2.end  
ORDER BY v1.start;

据我所知,使用的ListProperty()(从一到多版本的,这是相当简单的这个问题 )。

谁能想到一个解决方案,找到重叠的时间(最好在Python)吗?

Answer 1:

看看Marzullo算法的时间效率为O(n log n)的。
也有#2很多问题 ,涵盖可用于解决AppEngine上你的问题的重叠区间。



Answer 2:

发布我的解决方案没有联接,由于吉文的方向。 应该能够在任意数量的与小修改数据集(至少是理论)找到重叠。

我的Python是不是很大,但低于应给予的想法:

from operator import itemgetter

class Find_Overlaps(webapp2.RequestHandler):
    def get(self):
        all_dates = []
        first_dates = db.GqlQuery("SELECT * FROM First_Set")
        for date in first_dates:
            row = {'dataset':'First_Set', 'dbkey':date.key(), 'offset':date.start_time, 'type': -1}
            all_dates.append(row)
            row = {'dataset':'First_Set', 'dbkey':date.key(), 'offset':date.end_time, 'type': 1}
            all_dates.append(row)

        second_dates = db.GqlQuery("SELECT * FROM Second_Set")
        for date in second_dates:
            row = {'dataset':'Second_Set', 'dbkey':date.key(), 'offset':date.start_time, 'type': -1}
            all_dates.append(row)
            row = {'dataset':'Second_Set', 'dbkey':date.key(), 'offset':date.end_time, 'type': 1}
            all_dates.append(row)

        newlist = sorted(all_dates, key=itemgetter('offset','type'))
        number_datasets = 2 #goal is to find overlaps in all sets not only the best overlaps, that's why this is needed
        loopcnt = 0
        update_bestend = 0
        overlaps = []
        for row in newlist: #Below is mostly from Marzullo's alghorithm
            loopcnt = loopcnt - row['type']#this is to keep track of overall tally
            if update_bestend == 1:
                if loopcnt == (number_datasets - 1):
                    bestend = row['offset']
                    end_set = row['dataset']
                    end_key = row['dbkey']
                    overlaps.append({'start':beststart,'start_set':start_set,'start_key':start_key,'end':bestend,'end_set':end_set,'end_key':end_key})
                    update_bestend = 0
            if loopcnt == number_datasets:
                beststart = row['offset']
                start_set = row['dataset']
                start_key = row['dbkey']
                update_bestend = 1

        for overlap in overlaps: #just to see what the outcome is
            self.response.out.write('start: %s, start_set: %s, end: %s, end_set: %s<br>' % (overlap['start'], overlap['start_set'], overlap['end'], overlap['end_set']))


文章来源: Compare many date ranges in google app engine datastore (Many to many, Python)