Monday, March 2, 2009

Sorting large text files

Sorting a very big text file is a problem in terms of memory resources. Ideally, if all data could load to memory, it was an easy task: simply sort data in memory and write it to a new file. But, there are times we need to sort a really big file and the physical memory is not enough.

I once had to write such a large file sorter, and had a hard time finding something ready on the web. Therefore, I wrote a sorter of my own. The logic behind sorting a very large file is not hard to understand once you get the point. In general, the steps needed are:

Phase 1: Split big file to several small sorted files

  1. Read n lines from file to memory.
  2. Sort lines in memory.
  3. Write sorted lines to a temp file.
  4. Repeat step 1 while reading next n lines to memory and writing sorted lines to a new temporary file.

Phase 2: Merge smaller files to a new sorted big file

  1. Find the smallest line among all files.
  2. Write line to a new file (this file is the new merged file).
  3. Move to the next row in file (the file that had the smallest line).
  4. Repeat step 1 until all lines in all files were read.

Here is the code for such a file sorter:

import java.io.*;
import java.util.*;
public class FileSort
{
  protected String filenameToSort;
  protected String filenameSorted;
  protected Comparator comparator;
  protected int maxCapacity;
  public FileSort(String filenameToSort, String filenameSorted, Comparator comparator, int maxCapacity)
  {
    this.filenameToSort = filenameToSort;
    this.filenameSorted = filenameSorted;
    this.comparator = comparator;
    this.maxCapacity = maxCapacity;
  }
  public void sort() throws IOException
  {
    BufferedReader bufferedReader = new BufferedReader(new FileReader(filenameToSort));
    String line = null;
    int fileIndex = 0;
    BufferedWriter bufferedWriter;
    do
    {
      List<String> lines = new ArrayList<String>(maxCapacity);
      for (int i = 0; i < maxCapacity; i++)
      {
        line = bufferedReader.readLine();
        if (line == null)
        {
          break;
        }
        else
        {
          lines.add(line);
        }
      }
      Collections.sort(lines, comparator);
      bufferedWriter = new BufferedWriter(new FileWriter(filenameToSort + ".tmp" + fileIndex++));
      for (String _line : lines)
      {
        bufferedWriter.write(_line);
        bufferedWriter.newLine();
      }
      bufferedWriter.flush();
      bufferedWriter.close();
    }
    while (line != null);
    bufferedReader.close();
    mergeFiles(fileIndex);
  }
  public void mergeFiles(int numFiles) throws IOException
  {
    BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(filenameSorted));
    List<MergeFile> mergeFiles = new ArrayList<MergeFile>();
    for (int i = 0; i < numFiles; i++)
    {
      mergeFiles.add(new MergeFile(filenameToSort + ".tmp" + i));
    }
    do
    {
      // Find smallest line
      Iterator<MergeFile> iterator = mergeFiles.iterator();
      MergeFile minMergeFile = null;
      while (iterator.hasNext())
      {
        MergeFile mergeFile = iterator.next();
        if (mergeFile.line != null) // File has more lines
        {
          if (minMergeFile == null || comparator.compare(mergeFile.line, minMergeFile.line) < 0)
          {
            minMergeFile = mergeFile;
          }
        }
        else // No more lines in file. No need to iterate it
        {
          mergeFile.removeFile();
          iterator.remove();
        }
      }
      // Write smallest line to file
      if (minMergeFile != null)
      {    
        bufferedWriter.write(minMergeFile.line);
        bufferedWriter.newLine();
        minMergeFile.readLine();
      }
    }
    while (mergeFiles.size() > 0); // As long as there are files to read
    bufferedWriter.flush();
    bufferedWriter.close();
  }
  private final class MergeFile
  {
    BufferedReader bufferedReader;
    String filename;
    String line;
    boolean isReadNextRow = true;
    public MergeFile(String filename) throws IOException
    {
      this.filename = filename;
      bufferedReader = new BufferedReader(new FileReader(filename));
      readLine();
    }
    public void readLine() throws IOException
    {
      isReadNextRow = (line = bufferedReader.readLine()) != null;
    }
    public void removeFile() throws IOException
    {
      bufferedReader.close();
      new File(filename).delete();
    }
  }
}

Note for several things on this file sorter:
  1. It doesn’t limit the number of simultaneous open files. It shouldn't be much of a problem, since the maximum open files limit is pretty high and also can be configured (at least on Linux machines). If such limit exists you can always merge a group of files at a time (the size of the group is the maximum allowed open files). Then merge the new grouped files to a single file.
  2. This file sorter does not define the logic in which the lines are sorted.It should be supplied using a Comparator.
  3. The maximum number of rows to read to memory is also defined passed as a parameter to the sorter.

Suppose you would like to sort several big files to a single huge sorted file. The FileSort class can be easily extended to support sorting multiple big files. It can be done by first merging all files to a single file. Then sorting it using the FileSorter:
public class MultipleFileSort extends FileSort {
    private static final int BUFFER_SIZE = 1024 * 4;
    protected String[] filesToSort;
    public MultipleFileSort(String[] filesToSort, String filenameSorted, Comparator comparator, int maxReadLines) {
        super(filenameSorted + ".tmp", filenameSorted, comparator, maxReadLines);
        this.filesToSort = filesToSort;
    }
    public static void mergeFiles(String[] files, String filenameMerged) throws IOException {
        // Open all files for reading
        InputStream[] inputs = new InputStream[files.length];
        OutputStream os = null;
        try {
            for (int i = 0; i < files.length; i++) {
                inputs[i] = new FileInputStream(files[i]);
            }
            // Open file for writing
            os = new FileOutputStream(filenameMerged);
            byte[] buffer = new byte[BUFFER_SIZE];
            int len;
            for (InputStream is : inputs) {
                while ((len = is.read(buffer)) > 0) {
                    os.write(buffer, 0, len);
                }
            }
        }
        finally {
            for (InputStream is : inputs) {
                FileUtils.close(is);
            }
            FileUtils.close(os);
        }
    }
    public void sort() throws IOException {
        FileUtils.mergeFiles(filesToSort, filenameToSort);
        super.sort();
        new File(filenameToSort).delete();
    }
}

And finally example of using the MultipleFileSort class:

    public static void main(String[] args) {
        try {
            MultipleFileSort multipleFileSort = new MultipleFileSort(
                    new String[]{"in_file1.txt", "in_file2.txt", "in_file3.txt"},
                    "out_sorted.txt", new Comparator<String>() {
                        public int compare(String o1, String o2) {
                            return (o1).compareTo(o2);
                        }
                    }, 1000);
            multipleFileSort.sort();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }

No comments:

Post a Comment