Background:
I have been given four catalogues of data. The first (call it Cat1) gives the coordinates (in right ascension and declination, RA and Dec) of radio sources in fields 1 and 2; the second (Cat2) gives the RA and Dec of radio and infrared (IR) sources in field 1; the third (Cat3) gives the RA and Dec of radio and IR sources in field 2; and the fourth (Cat4) gives the RA and Dec of optical sources in fields 1 and 2.
Cat1 has approximately 2000 sources for field 2, keeping in mind that some sources are actually measured multiple times across their extent, e.g. source 1, source 2, source 3a, source 3b, source 3c, source 4... Cat1 has approximately 3000 sources for field 1, again with some sources split into parts. Cat1 is a .dat file, which I open in TextEdit and have converted to .txt for use with np.genfromtxt.
Cat2 has approximately 1700 sources for field 1. Cat3 has approximately 1700 sources for field 2. Cat2 and Cat3 are .csv files, which I am opening in Numbers.
Cat4 has approximately 1200 sources for field 1 and approximately 700 sources for field 2. Cat4 is a .dat file, which I open in TextEdit and have converted to .txt for use with np.genfromtxt.
I have also figured out how to convert Cat1 and Cat4 to .csv files.
Aim:
The goal is to combine these four catalogues into one single catalogue that gives the RA and Dec from Cat2 together with the matching RA and Dec from Cat1 and Cat4 (field 1), then the RA and Dec from Cat3 together with the matching RA and Dec from Cat1 and Cat4 (field 2). A match means the Cat1 and Cat4 coordinates that are closest to the Cat2 or Cat3 coordinates, such that they are highly likely to be the same source. The level of overlap will vary, but the scatterplots I have produced show that there is a corresponding Cat1 and Cat4 source for each Cat2 and Cat3 source, to within the size of the plot marker, with of course many leftover sources in Cat1 and Cat4, which contain many more sources than Cat2 and Cat3 do.
The trick is that, because the coordinates do not match exactly, I need to first find the best-matching RA value, then look at the corresponding Dec and check that it is also the best-corresponding value.
e.g., For a source in Cat2: RA = 53.13360595, Dec = -28.0530758
Cat1: RA = 53.133496, Dec = -27.553401 or RA = 53.133873, Dec = -28.054600
Here, the RA 53.1336 is almost equally far from 53.133496 and 53.133873, but the Dec -28.053 is clearly far closer to -28.054 than to -27.553, so the second option in Cat1 is the winner.
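A quick sanity check with plain Euclidean distances in degrees (a hypothetical calculation, ignoring the cos(Dec) compression of RA) makes the same point numerically:

import math

# distance from the Cat2 source to the first Cat1 candidate
d1 = math.hypot(53.13360595 - 53.133496, -28.0530758 - (-27.553401))  # ~0.4997
# distance from the Cat2 source to the second Cat1 candidate
d2 = math.hypot(53.13360595 - 53.133873, -28.0530758 - (-28.054600))  # ~0.0015

The second candidate is some 300 times closer, so a single distance metric captures the "best RA and best Dec" test in one step.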
Progress:
So far I have matched the first 15 values in Cat2 to values in Cat1 purely by eye (Command+F to as many decimal places as possible, then best judgement), but clearly this is extremely inefficient for all 3400 sources across Cat2 and Cat3. I just wanted to see for myself what sort of accuracy to expect in the matching, and unfortunately some sources match only to the second or third decimal place, while others match to the fourth or fifth.
In producing my scatterplots, I used the code:
import numpy as np

cat1 = np.genfromtxt('filepath/cat1.txt', delimiter=' ')  # np.genfromtxt, not np.genfromtext
RA_cat1 = cat1[:, 0]   # first column: RA
Dec_cat1 = cat1[:, 1]  # second column: Dec
I then simply plotted RA_cat1 against Dec_cat1, and repeated this for all of my catalogues.
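For reference, the plotting step is just a scatter call; a minimal sketch, assuming matplotlib:

import matplotlib.pyplot as plt

plt.scatter(RA_cat1, Dec_cat1, s=2)  # small markers for a few thousand points
plt.xlabel('RA (deg)')
plt.ylabel('Dec (deg)')
plt.show()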
My problem now is that, in searching for answers on how to write code that will match my coordinates, I have seen a lot of answers that involve converting the arrays to lists, but when attempting to do so using
cat1list = np.array([RA_cat1, Dec_cat1])
cat1list.tolist()
I end up with a list of the form:
[RA1, RA2, RA3, ..., RA3000], [Dec1, Dec2, Dec3, ..., Dec3000]
instead of what I assume would be the more helpful:
[RA1, Dec1], [RA2, Dec2], ..., [RA3000, Dec3000].
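For what it's worth, that pair-wise layout can be obtained by transposing before converting; note also that tolist() returns the new list rather than modifying anything in place. A minimal sketch:

cat1list = np.array([RA_cat1, Dec_cat1]).T.tolist()
# [[RA1, Dec1], [RA2, Dec2], ..., [RA3000, Dec3000]]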
Furthermore, for similar questions, the most useful answers (once the list conversion has succeeded) appear to involve dictionaries, but I am unclear on how a dictionary would produce the sort of comparisons I described above.
Additionally, I should mention that once I have been successful in this task, I have been asked to repeat the process for a considerably larger set of data; I'm not sure how much larger, but I am assuming perhaps tens of thousands of coordinate sets.
Answer:
For the amount of data you have, you can calculate a distance metric between each pair of points and keep the minimum. Something like:
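The original code block was not preserved here, so the following is a minimal reconstruction of that brute-force idea, with illustrative names and a plain Euclidean metric in degrees:

import numpy as np

def match_brute_force(cat_small, cat_big):
    # For each (RA, Dec) row of cat_small, return the index of the
    # nearest (RA, Dec) row of cat_big by squared Euclidean distance.
    matches = np.empty(len(cat_small), dtype=int)
    for i, (ra, dec) in enumerate(cat_small):
        d2 = (cat_big[:, 0] - ra)**2 + (cat_big[:, 1] - dec)**2
        matches[i] = np.argmin(d2)
    return matches

# e.g. best Cat1 match for every Cat2 source in field 1:
# cat1 = np.column_stack([RA_cat1, Dec_cat1])
# cat2 = np.column_stack([RA_cat2, Dec_cat2])
# best = match_brute_force(cat2, cat1)

At roughly 1700 x 3000 pairs this is only about five million distance evaluations, vectorised over the inner loop, which is quick.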
For a large data set you may want to use a line sweep algorithm to only calculate the metric for points that are in the same neighborhood. Like this:
Generate some test data:
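That snippet is also missing, so here is a hypothetical stand-in with made-up field bounds and scatter:

import numpy as np

rng = np.random.default_rng(0)
n_big, n_small = 3000, 1700
# fake "Cat1": uniform (RA, Dec) positions over a 2 x 2 degree field
big = rng.uniform([52.0, -29.0], [54.0, -27.0], size=(n_big, 2))
# fake "Cat2": a subset of those positions with small coordinate errors added
small = big[rng.choice(n_big, n_small, replace=False)]
small = small + rng.normal(0.0, 1e-4, size=small.shape)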
Here's the algorithm:
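Again a reconstruction rather than the original answer's code: sort the big catalogue by RA once, then for each query source binary-search out only the sources inside an RA strip before computing distances, widening the strip if it comes up empty. It assumes the true match lies within the strip, which holds when the strip is wide compared to the coordinate errors:

import numpy as np

def match_sweep(cat_small, cat_big, window=0.01):
    order = np.argsort(cat_big[:, 0])    # sort the big catalogue by RA once
    big_sorted = cat_big[order]
    ras = big_sorted[:, 0]
    matches = np.empty(len(cat_small), dtype=int)
    for i, (ra, dec) in enumerate(cat_small):
        w = window
        while True:
            lo = np.searchsorted(ras, ra - w, side='left')
            hi = np.searchsorted(ras, ra + w, side='right')
            if hi > lo:          # at least one candidate in the RA strip
                break
            w *= 2               # empty strip: widen it and retry
        strip = big_sorted[lo:hi]
        d2 = (strip[:, 0] - ra)**2 + (strip[:, 1] - dec)**2
        matches[i] = order[lo + np.argmin(d2)]
    return matches               # indices into the original cat_big

# best = match_sweep(small, big)

For the tens-of-thousands follow-up, a k-d tree (e.g. scipy.spatial.cKDTree and its query method) does the same neighbourhood pruning with less code.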
On my laptop, it finds the closest points in less than a tenth of a second.