Delete duplicate words in a big text file - Java

Asked 2019-09-25 07:12

I have a text file that is over 50 GB in size, and I want to delete the duplicate words in it. But I have heard that loading every word from the file into a HashSet takes a huge amount of RAM. Can you tell me a good way to delete every duplicate word from the text file? The words are separated by whitespace, like this:

word1 word2 word3 ... ... 

3 Answers
做自己的国王 · 2019-09-25 07:52
// Remove duplicate words from a string holding the file contents.
// Note: this only works if the whole file fits in memory, so it does
// not solve the 50 GB case by itself.
    public String removeDupsFromFile(String str) {
        // Split on any run of whitespace so empty tokens are not produced.
        String[] words = str.split("\\s+");

        // A LinkedHashSet keeps the first occurrence of each word, in order,
        // and silently discards every later duplicate.
        Set<String> uniqueWords = new LinkedHashSet<>(Arrays.asList(words));

        // Rebuild the text with single spaces between the surviving words.
        return String.join(" ", uniqueWords);
    }
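
A minimal usage sketch (hedged: Files.readString needs Java 11+, the file names and the WordDeduper class are placeholders, and the whole file must fit on the heap):

    String text = Files.readString(Path.of("input.txt"));          // hypothetical input path
    String deduped = new WordDeduper().removeDupsFromFile(text);   // WordDeduper is a placeholder class name
    Files.writeString(Path.of("output.txt"), deduped);             // hypothetical output path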
手持菜刀,她持情操 · 2019-09-25 07:58

This approach uses a database to buffer the words found.

It also assumes that words - regardless of case - are equal.

The H2 documentation states that a database on a non-FAT filesystem has a maximum size of 4 TB (using the default page size of 2KB), which is more than enough for this purpose.

package com.stackoverflow;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.sql.*;
import java.util.Scanner;

public class H2WordReading {

    public static void main(String[] args) {
//        read50Gigs("50gigfile.txt", "cleaned50gigfile.txt");
        read50Gigs("./testSmallFile", "./cleaned");
    }

    public static void read50Gigs(String fileLocation, String newFileLocation) {
        try (Connection connection = DriverManager.getConnection("jdbc:h2:./words");
             FileInputStream fileInputStream = new FileInputStream(fileLocation);
             Scanner scanner = new Scanner(fileInputStream);
             FileOutputStream fileOutputStream = new FileOutputStream(newFileLocation);
             OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream)) {
            // Recreate the buffer table on every run so old words don't linger.
            try (Statement statement = connection.createStatement()) {
                statement.execute("DROP TABLE IF EXISTS WORDS;");
                statement.execute("CREATE TABLE WORDS(WORD VARCHAR NOT NULL);");
            }

            try (PreparedStatement insertStatement = connection.prepareStatement("INSERT INTO WORDS VALUES (?);");
                 PreparedStatement queryStatement = connection.prepareStatement("SELECT * FROM WORDS WHERE UPPER(WORD) = UPPER(?);")) {

                while (scanner.hasNext()) {
                    String nextWord = scanner.next();
                    queryStatement.setString(1, nextWord);
                    // Close each ResultSet immediately; otherwise one leaks per word.
                    try (ResultSet resultSet = queryStatement.executeQuery()) {
                        if (!resultSet.next()) {
                            // Word not seen before: write it out and remember it.
                            outputStreamWriter.write(scanner.hasNext() ? (nextWord + ' ') : nextWord);
                            insertStatement.setString(1, nextWord);
                            insertStatement.execute();
                        }
                        // Word already present: write nothing.
                    }
                }
            }

        } catch (IOException | SQLException e) {
            throw new RuntimeException(e);
        }
    }
}

You need to have the H2 driver jar on the classpath.
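
If you build with Maven, a dependency along these lines pulls the driver in (the version is only an example; use a current release):

<dependency>
    <groupId>com.h2database</groupId>
    <artifactId>h2</artifactId>
    <version>1.4.200</version> <!-- example version -->
</dependency>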

Note that I only tested this with a small file of about 10 words. You should try this approach on your 50 gigabyte file and report back any errors.

Please be aware that this approach

  • normalizes all whitespace and newlines down to a single space character
  • always keeps the first occurrence of a word and drops all later occurrences

The time this approach takes scales quadratically with the number of words in the file: without an index, each lookup scans every word stored so far.
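
One way to avoid the full scans, as a hedged sketch rather than tested code: normalize case in Java instead of with UPPER(...) in SQL, so that a plain index on the WORD column can serve each lookup:

// Sketch under the assumption that words are stored uppercased, so the
// query becomes "SELECT * FROM WORDS WHERE WORD = ?" and can use the index.
try (Statement statement = connection.createStatement()) {
    statement.execute("CREATE INDEX IF NOT EXISTS WORDS_IDX ON WORDS(WORD);");
}
// ...then call setString(1, nextWord.toUpperCase()) for both statements,
// while still writing the original nextWord to the output file.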

放荡不羁爱自由 · 2019-09-25 08:00

The H2 answer is good, but maybe overkill. All the words in the English language add up to no more than a few MB. Just use a set. You could use this in RAnders00's program.

// Needs java.util.Set/HashSet and java.nio.file.{Files, Paths, StandardOpenOption}
// in addition to the imports above.
public static void read50Gigs(String fileLocation, String newFileLocation) {
    // Collect every distinct word; the set holds each word only once in memory.
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {

        while (scanner.hasNext()) {
            words.add(scanner.next());
        }
        System.out.println("words size " + words.size());
        // Note: a HashSet does not preserve the original word order.
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
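
If you also want to keep the first-occurrence order of the words, as the H2 answer does, a LinkedHashSet should work as a drop-in replacement:

    Set<String> words = new LinkedHashSet<>(); // keeps insertion order, slightly more memory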

As an estimate of how many distinct words to expect, I ran this over War and Peace (from Project Gutenberg):

public static void read50Gigs(String fileLocation, String newFileLocation) {
    // Paths are hardcoded for this test; the parameters are unused here.
    try {
        // Strip punctuation, split on whitespace, and collect the distinct words.
        Set<String> words = Files.lines(Paths.get("war and peace.txt"))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());

        System.out.println("words size " + words.size()); // 22100
        Files.write(Paths.get("out.txt"), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e); // don't swallow the exception silently
    }
}

It completed in under a second. You can't use Files.lines unless your huge source file actually contains line breaks. With line breaks, it processes the file line by line, so it won't use too much memory.
