Delete duplicate words in a big text file - Java

Published 2019-09-25 07:31

Question:

I have a text file over 50 GB in size, and I want to delete the duplicate words in it. But I have heard that loading every word from the file into a HashSet needs a lot of RAM. Can you tell me a good way to delete every duplicate word from the text file? The words are separated by whitespace, like this:

word1 word2 word3 ... ... 

Answer 1:

The H2 answer is good, but maybe overkill. All the words in the English language add up to no more than a few MB, so just use a set. You could use this in RAnders00's program.

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {

        // Scanner tokenizes on whitespace, so this reads one word at a time;
        // only the set of distinct words is held in memory
        while (scanner.hasNext()) {
            words.add(scanner.next());
        }
        System.out.println("words size " + words.size());
        // Files.write(Path, Iterable, ...) writes each word on its own line
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

As an estimate of the number of distinct words, I ran this on War and Peace (from Project Gutenberg):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    try {
        // file names are hardcoded for the War and Peace test run
        Set<String> words = Files.lines(Paths.get("war and peace.txt"))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());

        System.out.println("words size " + words.size()); // 22100
        Files.write(Paths.get("out.txt"), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e); // don't silently swallow I/O errors
    }
}

It completed almost instantly. Note that you can't use Files.lines unless your huge source file actually contains line breaks; if it does, the file is processed line by line, so it won't use too much memory.
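If your huge file has no line breaks at all, here is a minimal sketch (not from the original answers) that still streams one word at a time, using Scanner.tokens() from Java 9+; the file names are placeholders:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Scanner;
import java.util.Set;
import java.util.stream.Collectors;

public class StreamWords {
    public static void main(String[] args) throws IOException {
        // Scanner buffers internally, so the whole file is never loaded;
        // only the set of distinct words is held in memory.
        try (Scanner scanner = new Scanner(Paths.get("huge.txt"))) { // placeholder name
            Set<String> words = scanner.tokens().collect(Collectors.toSet());
            Files.write(Paths.get("out.txt"), words, // placeholder name
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE);
        }
    }
}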



Answer 2:

This approach uses a database to buffer the words found.

It also assumes that words are equal regardless of case.

The H2 documentation states that a database on a non-FAT filesystem has a maximum size of 4 TB (using the default page size of 2 KB), which is more than enough for this purpose.

package com.stackoverflow;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.sql.*;
import java.util.Scanner;

public class H2WordReading {

    public static void main(String[] args) {
//        read50Gigs("50gigfile.txt", "cleaned50gigfile.txt");
        read50Gigs("./testSmallFile", "./cleaned");
    }

    public static void read50Gigs(String fileLocation, String newFileLocation) {
        try (Connection connection = DriverManager.getConnection("jdbc:h2:./words");
             FileInputStream fileInputStream = new FileInputStream(fileLocation);
             Scanner scanner = new Scanner(fileInputStream);
             FileOutputStream fileOutputStream = new FileOutputStream(newFileLocation);
             OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream)) {
            connection.createStatement().execute("DROP TABLE IF EXISTS WORDS;");
            connection.createStatement().execute("CREATE TABLE WORDS(WORD VARCHAR NOT NULL);");

            PreparedStatement insertStatement = connection.prepareStatement("INSERT INTO WORDS VALUES (?);");
            PreparedStatement queryStatement = connection.prepareStatement("SELECT * FROM WORDS WHERE UPPER(WORD) = UPPER(?);");

            while (scanner.hasNext()) {
                String nextWord = scanner.next();
                queryStatement.setString(1, nextWord);
                // close each ResultSet so resources don't pile up across 50 GB of words
                try (ResultSet resultSet = queryStatement.executeQuery()) {
                    if (!resultSet.next()) { // word not found, ok
                        outputStreamWriter.write(scanner.hasNext() ? (nextWord + ' ') : nextWord);
                        insertStatement.setString(1, nextWord);
                        insertStatement.execute();
                    } // word found, just don't write anything
                }
            }

        } catch (IOException | SQLException e) {
            throw new RuntimeException(e);
        }
    }
}

You need to have the H2 driver jar on the classpath (the com.h2database:h2 artifact).

Note that I only tested this with a small file of about 10 words. You should try this approach with your 50 gigabyte file and report back any errors.

Please be aware that this approach

  • normalizes all whitespace and newlines down to a single space character
  • always keeps the first occurrence of a word and drops all later occurrences

Without an index, each duplicate check scans the whole WORDS table, so the time this approach takes grows quadratically with the number of words in the file.
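One possible mitigation, not part of the original answer: store an upper-cased copy of each word in an indexed column, so that each duplicate check becomes an index lookup instead of a full table scan. The schema and helper below are assumptions sketched for illustration:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class IndexedWordsSketch {

    // Hypothetical schema: an extra upper-cased column backed by an index
    static void createSchema(Connection connection) throws SQLException {
        connection.createStatement().execute(
                "CREATE TABLE WORDS(WORD VARCHAR NOT NULL, WORD_UPPER VARCHAR NOT NULL);");
        connection.createStatement().execute(
                "CREATE INDEX IDX_WORD_UPPER ON WORDS(WORD_UPPER);");
    }

    // Returns true if the word was newly added, i.e. should be written out
    static boolean addIfAbsent(Connection connection, String word) throws SQLException {
        try (PreparedStatement query = connection.prepareStatement(
                "SELECT 1 FROM WORDS WHERE WORD_UPPER = ?;")) {
            query.setString(1, word.toUpperCase());
            try (ResultSet resultSet = query.executeQuery()) {
                if (resultSet.next()) {
                    return false; // already seen, case-insensitively
                }
            }
        }
        try (PreparedStatement insert = connection.prepareStatement(
                "INSERT INTO WORDS VALUES (?, ?);")) {
            insert.setString(1, word);
            insert.setString(2, word.toUpperCase());
            insert.execute();
            return true;
        }
    }
}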



Answer 3:

import java.util.LinkedHashMap;

// Remove duplicate words from a string (the whole text must fit in memory)
public String removeDupsFromFile(String str) {
    String[] words = str.split(" ");
    // LinkedHashMap preserves insertion order; the count is tracked but unused
    LinkedHashMap<String, Integer> map = new LinkedHashMap<>();

    for (String word : words) {
        if (map.containsKey(word)) {
            map.put(word, map.get(word) + 1);
        } else {
            map.put(word, 1);
        }
    }

    StringBuilder result = new StringBuilder();
    for (String word : map.keySet()) {
        result.append(word).append(' ');
    }
    return result.toString().trim();
}
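This answer operates on an in-memory String, so the entire file must be read first; it will not scale to the 50 GB case. A minimal usage sketch (the file names and the WordDeduper wrapper class are hypothetical):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Demo {
    public static void main(String[] args) throws IOException {
        // Hypothetical usage: only viable when the whole file fits in memory.
        // WordDeduper is an assumed class containing removeDupsFromFile above.
        String text = new String(Files.readAllBytes(Paths.get("small.txt")));
        String deduped = new WordDeduper().removeDupsFromFile(text);
        Files.write(Paths.get("deduped.txt"), deduped.getBytes());
    }
}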