Merging a large number of csv datasets

2019-08-29 09:03发布

问题:

Here are 2 sample datasets.

PRISM-APPT_1895.csv https://copy.com/SOO2KbCHBX4MRQbn

PRISM-APPT_1896.csv https://copy.com/JDytBqLgDvk6JzUe

I have 100 of these types of data sets that I'm trying to merge into one data frame, export that to csv, and then merge that into another very large dataset.

I need to merge everything by "gridNumber" and "Year", creating a time series dataset.

Originally, I imported all of the annual datasets and then tried to merge them with this :

df <- join_all(list(Year_1895,    Year_1896,    Year_1897,    Year_1898,    Year_1899,    Year_1900,    Year_1901,    Year_1902,   
                    Year_1903,    Year_1904,    Year_1905,    Year_1906,   Year_1907,    Year_1908,    Year_1909,    Year_1910,   
                    Year_1911,    Year_1912,    Year_1913,    Year_1914,   Year_1915,    Year_1916,    Year_1917,    Year_1918,   
                    Year_1919,    Year_1920,    Year_1921,    Year_1922,   Year_1923,    Year_1924,    Year_1925,    Year_1926,   
                    Year_1927,    Year_1928,    Year_1929,    Year_1930,   Year_1931,    Year_1932,    Year_1933,    Year_1934,   
                    Year_1935,    Year_1936,    Year_1937,    Year_1938,   Year_1939,    Year_1940,    Year_1941,    Year_1942,   
                    Year_1943,    Year_1944,    Year_1945,    Year_1946,   Year_1947,    Year_1948,    Year_1949,    Year_1950,   
                    Year_1951,    Year_1952,    Year_1953,    Year_1954,   Year_1955,    Year_1956,    Year_1957,    Year_1958,   
                    Year_1959,    Year_1960,    Year_1961,    Year_1962,   Year_1963,    Year_1964,    Year_1965,    Year_1966,   
                    Year_1967,    Year_1968,    Year_1969,    Year_1970,   Year_1971,    Year_1972,    Year_1973,    Year_1974,   
                    Year_1975,    Year_1976,    Year_1977,    Year_1978,   Year_1979,    Year_1980,    Year_1981,    Year_1982,   
                    Year_1983,    Year_1984,    Year_1985,    Year_1986,   Year_1987,    Year_1988,    Year_1989,    Year_1990,   
                    Year_1991,    Year_1992,    Year_1993,    Year_1994,   Year_1995,    Year_1996,    Year_1997,    Year_1998,   
                    Year_1999,    Year_2000), 
                    by = c("gridNumber","Year"),type="full")

But R keeps crashing because I think the merge is a bit to large for it to handle, so I'm looking for something that would work better. Maybe data.table? Or another option.

Thanks for any help you can provide.

回答1:

Almost nine months later and your question has no answer. I could not find your datasets, however, I will show one way to do the job. It is trivial in awk.

Here is a minimal awk script:

BEGIN {
    for(i=0;i<10;i++) {
        filename = "out" i ".csv";
        while(getline < filename) print $0;
        close(filename);
    }
}

The script is run as awk -f s.awk where s.awk is the above script in a text file.

This script creates ten filenames: out0.csv, out1.csv ... out9.csv. These are the already-existing files with the data. The first file is opened and all records sent to the standard output. The file is then closed and the next filename created and opened. The above script has little to offer over a command line read/redirect. You would typically use awk to process a long list of filenames read from another file; with statements to selectively ignore lines or columns depending on various criteria.