I have CSV files of about 1200 rows x 3 columns. The number of rows varies from as few as 500 to as many as 5000, but the number of columns stays the same.
I want to create a feature vector from each file that has a consistent number of cells (vector length), so that I can compute the distance between these vectors.
FILE_1
A, B, C
(267.09669678867186, 6.3664069175720197, 1257325.5809999991),
(368.24070923984374, 9.0808353424072301, 49603.662999999884),
(324.21470826328124, 11.489830970764199, 244391.04699999979),
(514.33452027500005, 7.5162401199340803, 56322.424999999988),
(386.19673340976561, 9.4927110671997106, 175958.77100000033),
(240.09965330898439, 10.3463039398193, 457819.8519411764),
(242.17559998691405, 8.4401674270629901, 144891.51100000029),
(314.23066895664061, 7.4405002593994096, 58433.818999999959),
(933.3073596304688, 7.1564397811889604, 41977.960000000014),
(274.04136473476564, 4.8482465744018599, 48782.314891525479),
(584.2639294320312, 7.90128517150879, 49730.705000000096),
(202.13173096835936, 10.559995651245099, 20847.805144088608),
(324.98563963710939, 2.2546300888061501, 43767.774800000007),
(464.35059935390626, 11.573680877685501, 1701597.3915132943),
(776.28339964687495, 8.7755222320556605, 106882.2469999999),
(310.11652952968751, 10.3175926208496, 710341.19162800116),
(331.19962889492189, 10.7578010559082, 224621.80632433048),
(452.31337752387947, 7.3100395202636701, 820707.26700000139),
(430.16615111171876, 10.134071350097701, 18197.691999999963),
(498.24687010585939, 11.0102319717407, 45423.269964585743),
.....,
.....,
500th row
FILE_2
(363.02781861484374, 8.8369808197021502, 72898.479666666608),
(644.20353882968755, 8.6263589859008807, 22776.78799999999),
(259.25105469882811, 9.8575859069824201, 499615.64068339905),
(410.19474608242189, 9.8795070648193395, 316146.18800000293),
(288.12153809726561, 4.7451887130737296, 58615.577999999943),
(376.25868409335936, 10.508985519409199, 196522.12200000012),
(261.11118895351564, 8.5228433609008807, 32721.110000000026),
(319.98896605312501, 3.2100667953491202, 60587.077000000027),
(286.94926268398439, 4.7687568664550799, 47842.133999999867),
(121.00206177890625, 7.9372291564941397, 239813.20531182736),
(308.19895750820314, 6.0029039382934597, 26354.519000000011),
(677.17011839687495, 9.0299625396728498, 10391.757655172449),
(182.1304913216797, 8.0010566711425799, 145583.55700000061),
(187.06341736972655, 9.9460496902465803, 77488.229000000007),
(144.07867615878905, 3.6044106483459499, 104651.56499999999),
(288.92317015468751, 4.3750333786010698, 151872.1949999998),
(228.2089825326172, 4.4475774765014604, 658120.07628214348),
(496.18831055820311, 11.422966003418001, 2371155.6659999997),
(467.30134398281251, 11.0771179199219, 109702.48440899582),
(163.08418089687501, 5.7271881103515598, 38107.106791666629),
.....,
.....,
3400th row
As you can see, there is no row-to-row correspondence between the two files, so if someone asked you to calculate the distance between these two vectors as they stand, it would not be possible.
The aim is to interpolate the rows of both files so that they are consistent across all such files, i.e. when I look up the first row, it should represent the same feature in every file. Now let's look at FILE_1.
The range of values (min, max) for the three columns, considering only the 20 rows shown for the time being, is:
A: 202.13173096835936, 933.3073596304688
B: 2.2546300888061501, 11.573680877685501
C: 18197.691999999963, 1701597.3915132943
I want to place these points into a 3D array whose grid cell size is 0.1 x 0.1 x 0.1 (or, say, 10 x 10 x 10, or any arbitrary cell size). For that to work, the data first has to be normalized (mean normalization, etc.).
The data is 3D and needs to be normalized before it can be interpolated into this 3D array. The result need not be 3D; if it ends up as a flat vector, that will also do.
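To make the normalization step concrete, here is a minimal sketch. It assumes plain min-max scaling (rather than mean normalization) and uses the global per-column ranges given in the note at the end of this post; the names `GLOBAL_MIN`, `GLOBAL_MAX` and `normalize` are made up for illustration.

```python
import numpy as np

# Global per-column ranges (A, B, C), taken from the note in this post.
GLOBAL_MIN = np.array([100.0, 2.0, 10_000.0])
GLOBAL_MAX = np.array([1000.0, 12.0, 2_000_000.0])

def normalize(points):
    """Min-max scale an (n_rows, 3) array into the unit cube [0, 1]^3."""
    points = np.asarray(points, dtype=float)
    scaled = (points - GLOBAL_MIN) / (GLOBAL_MAX - GLOBAL_MIN)
    # Clip in case a file holds a value slightly outside the stated range.
    return np.clip(scaled, 0.0, 1.0)

# First two rows of FILE_1.
rows = [
    (267.09669678867186, 6.3664069175720197, 1257325.5809999991),
    (368.24070923984374, 9.0808353424072301, 49603.662999999884),
]
unit = normalize(rows)
print(unit)
```

Because the same global ranges are used for every file, a given normalized coordinate means the same thing in all of them, which is what the grid step relies on.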
By averaging the points I mean the following: if more than one point falls into the same cell (which will happen if the cell size is big, e.g. 100 x 100 x 100), then we take the average of the x, y, z coordinates of those points as the value of that cell.
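The per-cell averaging could be sketched roughly as below. This assumes the points have already been scaled to the range [0, 1] on each axis, and that empty cells are marked with NaN; both are my assumptions, and `grid_average` and `n_cells` are hypothetical names.

```python
import numpy as np

def grid_average(unit_points, n_cells=10):
    """Average the points falling in each cell of an n x n x n grid.

    unit_points: (n_rows, 3) array with values already scaled to [0, 1].
    Returns an (n_cells, n_cells, n_cells, 3) array; empty cells hold NaN.
    """
    pts = np.asarray(unit_points, dtype=float)
    # Cell index per axis; a value of exactly 1.0 goes into the last cell.
    idx = np.minimum((pts * n_cells).astype(int), n_cells - 1)
    cell = (idx[:, 0], idx[:, 1], idx[:, 2])

    sums = np.zeros((n_cells, n_cells, n_cells, 3))
    counts = np.zeros((n_cells, n_cells, n_cells))
    # np.add.at accumulates correctly when several rows share one cell.
    np.add.at(sums, cell, pts)
    np.add.at(counts, cell, 1)

    grid = np.full_like(sums, np.nan)
    filled = counts > 0
    grid[filled] = sums[filled] / counts[filled][:, None]
    return grid

# Two points share cell (0, 0, 0); the third lands in cell (9, 5, 2).
pts = [[0.05, 0.05, 0.05], [0.07, 0.03, 0.01], [0.95, 0.5, 0.25]]
grid = grid_average(pts, n_cells=10)
vector = grid.reshape(-1)  # fixed length 10 * 10 * 10 * 3 for every file
```

The output shape depends only on `n_cells`, not on the number of rows, so a 500-row file and a 3400-row file produce arrays of the same size.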
These interpolated vectors will all have the same length and the same correspondence, because a given position in one vector represents the same point (cell) as that position in every other vector.
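Once every file has been turned into a grid of the same shape, the distance is just an element-wise comparison. A minimal sketch, assuming empty cells are stored as NaN and are treated as 0 when comparing (that choice is mine, not something settled above); `grid_distance` is a hypothetical name.

```python
import numpy as np

def grid_distance(grid_a, grid_b):
    """Euclidean distance between two equally shaped grids.

    Empty cells (NaN) are replaced by 0 before comparing; this is an
    assumption about how empty cells should be handled.
    """
    a = np.asarray(grid_a, dtype=float)
    b = np.asarray(grid_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("grids must have the same shape")
    a = np.nan_to_num(a, nan=0.0).ravel()
    b = np.nan_to_num(b, nan=0.0).ravel()
    return float(np.linalg.norm(a - b))

# Tiny hand-built 2 x 2 x 2 grids, each cell holding an (x, y, z) average.
g1 = np.full((2, 2, 2, 3), np.nan)
g1[0, 0, 0] = [0.1, 0.2, 0.3]

g2 = np.full((2, 2, 2, 3), np.nan)
g2[0, 0, 0] = [0.1, 0.2, 0.3]
g2[1, 1, 1] = [0.6, 0.8, 0.0]

d = grid_distance(g1, g2)
print(d)
```

Filling empty cells with 0 makes a cell that is occupied in only one file contribute its full coordinate magnitude to the distance; whether that is the desired behaviour is an open question here.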
**NOTE:** The min and max range of each column across all files is A: 100 to 1000, B: 2 to 12, C: 10000 to 2000000.