How do I match similar coordinates using Python?

Posted 2019-04-28 08:41

Background:

I have been given four catalogues of data, the first of which (let's call Cat1) gives the coordinates (in right ascension and declination, RA and Dec) for radio sources in fields 1 and 2, the second catalogue (Cat2) gives the RA and Dec for radio sources and infrared (IR) sources in field 1, the third catalogue (Cat3) gives the RA and Dec for radio and IR sources in field 2, and the fourth catalogue (Cat4) gives the RA and Dec for optical sources in fields 1 and 2.

Cat1 has approximately 2000 sources for field 2, keeping in mind that some of the sources are actually measured multiple times across their extent, e.g. source 1, source 2, source 3a, source 3b, source 3c, source 4... Cat1 has approximately 3000 sources for field 1, again with some sources being in parts. Cat1 is a .dat file, which I opened in TextEdit and converted to .txt for use with np.genfromtxt.

Cat2 has approximately 1700 sources for field 1. Cat3 has approximately 1700 sources for field 2. Cat2 and Cat3 are .csv files, which I am opening in Numbers.

Cat4 has approximately 1200 sources for field 1, and approximately 700 sources for field 2. Cat4 is a .dat file, which I opened in TextEdit and converted to .txt for use with np.genfromtxt.

I have also figured out how to convert Cat1 and Cat4 into .csv files.

Aim:

The goal is to combine these four catalogues into one single catalogue, giving the RA and Dec from Cat2, Cat1 and Cat4 (field 1), then the RA and Dec from Cat3, Cat1 and Cat4 (field 2), where the Cat1 and Cat4 coordinates matched to each Cat2 or Cat3 source are those closest to it, so that they are highly likely to be the same source. The level of overlap will vary, but the scatterplots I have produced show that there is a corresponding Cat1 and Cat4 source for each Cat2 and Cat3 source, to within the size of the plot marker. There are of course many leftover sources in Cat1 and Cat4, which contain many more sources than Cat2 and Cat3 do.

The trick is that because the coordinates do not exactly match, I need to be able to first look at RA and find the best matching value, then look at the corresponding Dec, and check that it is also the best corresponding value.

e.g., For a source in Cat2: RA = 53.13360595, Dec = -28.0530758

Cat1: RA = 53.133496, Dec = -27.553401 or RA = 53.133873, Dec = -28.054600

Here, 53.1336 sits between 53.1334 and 53.1338, but -28.053 is clearly closer to -28.054 than to -27.553, so the second option in Cat1 is the winner.
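That two-step "best RA, then check Dec" decision can be automated by minimising a single combined separation instead of comparing one coordinate at a time. A minimal sketch using the example numbers above (the cos(Dec) factor is an assumption I'm adding so the RA difference spans a comparable angle to the Dec difference):

```python
from math import cos, radians

def sq_separation(ra1, dec1, ra2, dec2):
    # Squared angular separation in degrees^2, small-angle approximation.
    # The RA difference is scaled by cos(Dec) so both axes measure
    # comparable angles on the sky.
    d_ra = (ra1 - ra2) * cos(radians((dec1 + dec2) / 2))
    d_dec = dec1 - dec2
    return d_ra**2 + d_dec**2

# The Cat2 source from the example:
target = (53.13360595, -28.0530758)
# The two Cat1 candidates:
candidates = [(53.133496, -27.553401), (53.133873, -28.054600)]

best = min(candidates,
           key=lambda c: sq_separation(target[0], target[1], c[0], c[1]))
print(best)  # the second candidate wins, as argued above
```

The `min` with a distance key picks the second candidate, because its large Dec agreement outweighs the slightly worse RA agreement.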

Progress:

So far I have matched the first 15 values in Cat2 to values in Cat1 purely by eye (Command+F on as many decimal places as possible, then best judgement), but clearly this is extremely inefficient for all 3400 sources across Cat2 and Cat3. I just wanted to see for myself what sort of accuracy to expect in the matching, and unfortunately some pairs match only to the second or third decimal place, while others match to the fourth or fifth.

In producing my scatterplots, I used the code:

import numpy as np

cat1 = np.genfromtxt('filepath/cat1.txt', delimiter='   ')
RA_cat1 = cat1[:, 0]
Dec_cat1 = cat1[:, 1]

Then simply plotted RA_cat1 against Dec_cat1, and repeated for all my catalogues.

My problem now is that in searching for answers on how to produce a code that will be able to match my coordinates, I have seen a lot of answers that involve converting the arrays to lists, but when attempting to do so using

cat1list = np.array([RA_cat1, Dec_cat1]).tolist()

I end up with a list of the form:

[RA1, RA2, RA3,...,RA3000], [Dec1, Dec2, Dec3,...,Dec3000]

instead of what I assume would be the more helpful:

[RA1, Dec1], [RA2, Dec2], ... , [RA3000, Dec3000].
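For what it's worth, that pairing can be produced directly in NumPy: np.column_stack pairs the i-th RA with the i-th Dec, and .tolist() returns a new list (it does not modify the array in place, so its result must be assigned). A minimal sketch with made-up coordinates:

```python
import numpy as np

# Hypothetical catalogue columns, standing in for the real data.
RA_cat1 = np.array([53.1334, 53.1338, 53.2001])
Dec_cat1 = np.array([-27.5534, -28.0546, -28.1002])

# column_stack pairs up the i-th RA with the i-th Dec:
cat1_pairs = np.column_stack((RA_cat1, Dec_cat1))
cat1list = cat1_pairs.tolist()
print(cat1list[0])  # [53.1334, -27.5534]
```

Equivalently, `np.array([RA_cat1, Dec_cat1]).T.tolist()` transposes the 2×N array into the N×2 shape before converting.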

Furthermore, for similar questions, the most useful answers once the list conversion has succeeded appear to involve dictionaries, but I am unclear how a dictionary would produce the sort of comparison I described above.

Additionally, I should mention that once I have been successful in this task, I have been asked to repeat the process for a considerably larger set of data, I'm not sure how much larger, but I am assuming perhaps tens of thousands of coordinate sets.
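For tens of thousands of coordinate sets, one common approach (an editor's suggestion, not part of the question) is a k-d tree such as scipy.spatial.cKDTree, which finds each source's nearest neighbour without comparing every pair. A sketch with made-up coordinates, assuming SciPy is available; it treats RA/Dec as flat coordinates with RA scaled by cos(Dec), which is reasonable for small fields away from the poles:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical catalogues: columns are (RA, Dec) in degrees.
cat1 = np.array([[53.133496, -27.553401],
                 [53.133873, -28.054600],
                 [53.200100, -28.100200]])
cat2 = np.array([[53.13360595, -28.0530758]])

def scaled(cat):
    # Scale RA by cos(Dec) so both axes measure comparable angles.
    out = cat.copy()
    out[:, 0] *= np.cos(np.radians(cat[:, 1]))
    return out

tree = cKDTree(scaled(cat1))
# For each Cat2 source, distance to and index of its nearest Cat1 source.
dist, idx = tree.query(scaled(cat2), k=1)
print(idx[0], dist[0])  # index 1: the second Cat1 source is nearest
```

A sensible follow-up is to reject matches whose `dist` exceeds some tolerance, so leftover Cat1 sources with no true counterpart are not forced into a pairing.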

1 Answer
Answered 2019-04-28 09:30

For the amount of data you have, you can calculate a distance metric between each pair of points. Something like:

import itertools

def close_enough(p1, p2):
    # You may need to scale the RA difference by cos(Dec).
    return (p1.RA - p2.RA)**2 + (p1.Dec - p2.Dec)**2 < 0.01

candidates = [(p1, p2) for p1, p2 in itertools.combinations(points, 2)
              if close_enough(p1, p2)]
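A self-contained run of this brute-force approach, on three made-up points (two of which are the example pair from the question), might look like:

```python
import itertools
from collections import namedtuple

Point = namedtuple("Point", "RA Dec")

def close_enough(p1, p2):
    # Small-angle approximation; ignores the cos(Dec) scaling of RA.
    return (p1.RA - p2.RA)**2 + (p1.Dec - p2.Dec)**2 < 0.01

points = [Point(53.1334, -27.5534),        # field-1-like stray source
          Point(53.133873, -28.054600),    # Cat1 candidate
          Point(53.13360595, -28.0530758)] # Cat2 source

candidates = [(p1, p2) for p1, p2 in itertools.combinations(points, 2)
              if close_enough(p1, p2)]
print(len(candidates))  # 1: only the Cat1/Cat2 pair is close enough
```

Note this is O(n²) in the number of points, which is why the line-sweep version below is worth the extra code for larger catalogues.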

For a large data set you may want to use a line sweep algorithm to only calculate the metric for points that are in the same neighborhood. Like this:

import operator as op
import sortedcontainers     # handy library on PyPI

from collections import namedtuple
from math import cos, radians
from random import sample, uniform

Observation = namedtuple("Observation", "dec ra other")

Generate some test data:

number_of_observations = 5000
field1 = [Observation(uniform(-35.0, -25.0),     # dec
                      uniform(45.0, 55.0),       # ra
                      uniform(0, 10))            # other data
          for _ in range(number_of_observations)]

# add in near duplicates
number_of_dups = 1000
dups = []
for obs in sample(field1, number_of_dups):
    dDec = uniform(-0.0001, 0.0001)
    dRA  = uniform(-0.0001, 0.0001)
    dups.append(Observation(obs.dec + dDec, obs.ra + dRA, obs.other))

data = field1 + dups

Here's the algorithm:

# Note: dec is first in Observation, so data is sorted by .dec then .ra.
data.sort()

# Parameter that determines the size of a sliding declination window
# and therefore how close two observations need to be to be considered
# observations of the same object.
dec_span = 0.0001

# Result. A list of observation pairs close enough to be considered 
# observations of the same object.
candidates = []

# Sliding declination window.  Within the window, observations are
# ordered by .ra.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('ra'))

# lag_obs is the 'southernmost' observation within the sliding declination window.
observation = iter(data)
lag_obs = next(observation)

# lead_obs is the 'northernmost' observation in the sliding declination window.
for lead_obs in data:

    # Dec of lead_obs represents the leading edge of window.
    window.add(lead_obs)

    # Remove observations further than the trailing edge of window.
    while lead_obs.dec - lag_obs.dec > dec_span:
        window.discard(lag_obs)
        lag_obs = next(observation)

    # Calculate the 'east-west' half-width of the window at the dec of lead_obs.
    ra_span = dec_span / cos(radians(lead_obs.dec))
    east_ra = lead_obs.ra + ra_span
    west_ra = lead_obs.ra - ra_span

    # Check all observations in the sliding window within
    # ra_span of lead_obs.
    for other_obs in window.irange_key(west_ra, east_ra):

        if other_obs != lead_obs:
            # lead_obs is at the top centre of a box 2 * ra_span wide by
            # dec_span tall, and other_obs is in that box. If desired,
            # put additional fine-grained 'closeness' tests here.
            # For example:
            #    average_dec = (other_obs.dec + lead_obs.dec) / 2
            #    delta_dec = other_obs.dec - lead_obs.dec
            #    delta_ra  = (other_obs.ra - lead_obs.ra) * cos(radians(average_dec))
            # e.g. if delta_dec**2 + delta_ra**2 < threshold:
            candidates.append((lead_obs, other_obs))

On my laptop, this finds the close points in under a tenth of a second.
