[RFC] File::SplitStream - iterate over files >2GB when large file support unavailable - Perl


  1. [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

    I am implementing a module and seek the community's input about its
    suitability for placement on CPAN. File::SplitStream (I am open to
    better names) is designed to be used when an OS supports large files,
    but the Perl interpreter does not have large file support enabled
    (specifically, Red Hat Linux did this for a while). It uses the Unix
    split command to split the large file into <2GB chunks, then generates an
    iterator to allow the calling routine to transparently read the file
    chunks as if they are still one large file.

    Below is a draft for the documentation for this module. I searched CPAN
    and did not find anything similar. Your input and suggestions regarding
    structure, functionality, and documentation improvements will be greatly
    appreciated!


    NAME
    File::SplitStream - iterate over multiple files as if they were one
    file. Optionally split a large file into smaller files before
    iteration.

    SYNOPSIS
    use File::SplitStream;

    # split a file into parts
    my $filestream = new File::SplitStream;
    $filestream->file('/path/to/inputfile');
    $filestream->lines(19000000);
    $filestream->genFileStream() || die("cannot generate filestream: $!");

    --OR--

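    # or use a group of pre-existing files (files() takes a list reference)
    my @inputfiles = qw(file01.txt file02.txt file03.txt);
    my $filestream = new File::SplitStream;
    $filestream->files(\@inputfiles);
    $filestream->genFileStream() || die("cannot generate filestream: $!");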

    # regardless of how you set things up, you can
    # now iterate over the files as if they're one file
    while ( defined( my $line = $filestream->nextLine()->() ) ) {
        # ...do stuff on each line of all of the files...
    }

    --OR--

    # you can use a function call rather than instantiating an object
    use File::SplitStream qw(genFileStream);

    my $filestream = genFileStream('/path/to/inputfile', 19000000);
    while ( defined( my $line = $filestream->() ) ) {
        # ...do stuff on each line of all of the files...
    }

    DESCRIPTION
    File::SplitStream can be used to split a large text file (or, optionally,
    to use a list of pre-existing files) and iterate over the files as if
    they were a single file. This class is designed to help work with large
    files (>2GB) when large file support is unavailable. Perhaps the
    programmer does not have permissions to recompile the available Perl
    interpreter, or simply does not have the time. Regardless of reason,
    this module can help fill in the gap when large file support is
    unavailable.
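
    Whether a given interpreter falls into this category can be checked
    before reaching for a workaround. The sketch below is not part of
    File::SplitStream; has_large_file_support() is a hypothetical helper
    name, and it relies only on the uselargefiles entry in the core Config
    module (the same value reported by "perl -V:uselargefiles").

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper (not part of File::SplitStream): true when this
# perl binary was built with large file support.
sub has_large_file_support {
    return ( defined $Config{uselargefiles}
          && $Config{uselargefiles} eq 'define' ) ? 1 : 0;
}

print has_large_file_support()
    ? "large file support: yes\n"
    : "large file support: no\n";
```

    If this reports "no", files larger than 2GB are likely to trigger the
    problems this module is meant to work around.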

    In order for File::SplitStream to work properly, the Unix split and cat
    commands should be in your $PATH. The split command is used to split up
    the large file into more manageable chunks, while the cat command is
    used to buffer input of the files. The number of lines in each of your
    split files will depend on how much data is in each line. Shorter
    lines will allow you to put many more lines into a file before it
    crosses the 2GB barrier. Longer lines will require you to decrease the
    lines/file value.
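
    As a rough illustration of that trade-off, a lines/file value can be
    estimated from the average line length. This is a sketch only:
    lines_per_chunk() is a hypothetical helper, not part of the module, and
    it assumes the limit is the largest signed 32-bit offset (2**31 - 1
    bytes) and that a 10% safety margin is enough for your data.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of File::SplitStream): estimate how many
# lines fit in one chunk, given the average line length in bytes
# (including the newline).
sub lines_per_chunk {
    my ( $avg_line_bytes, $limit_bytes ) = @_;
    $limit_bytes = 2**31 - 1 unless defined $limit_bytes;    # assumed 2GB limit
    die "average line length must be positive\n"
        unless $avg_line_bytes > 0;

    # Keep a 10% margin (an assumption), since real lines vary around
    # the average.
    return int( 0.9 * $limit_bytes / $avg_line_bytes );
}

printf "%d\n", lines_per_chunk(100);    # prints 19327352
```

    With lines averaging about 100 bytes, this gives a value close to the
    19000000 used in the SYNOPSIS above.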

    ACCESSOR METHODS
    These accessor methods can be used directly or set by passing them to
    the new() method.

    (set to split a single file apart and iterate over the pieces)
        file    the file to split apart
        lines   maximum number of lines in each file chunk

    (set to use a pre-existing set of files as a single file)
        files   reference to a list of files to iterate over

    OTHER METHODS
    new(%options)
    Use the new() method to create a new File::SplitStream object. You will
    need to do this to use the module in an object-oriented way. You can
    pass options to the new() method to set the file(), lines(), and
    files() values.

    Examples:

    # new File::SplitStream with no options
    my $fss = new File::SplitStream;

    # new File::SplitStream with options to split a single file
    my $fss = new File::SplitStream(FILE => '/path/to/file', LINES => 1000000);

    # new File::SplitStream with option to use a list of pre-existing files
    my $fss = new File::SplitStream(FILES => ['/path/to/file1',
                                              '/path/to/file2',
                                              '/path/to/file3'] );

    init(%options)
    If options are passed to new(), init() is invoked by new() to set the
    appropriate object attributes given the options. Normally init() is
    only invoked by new(), but it can be used to (re)set your
    File::SplitStream object's attributes if you want.

    Example:

    my $fss = new File::SplitStream;
    $fss->init(FILE => '/path/to/file', LINES => 15000000);

    genFileStream($filepath, $number_of_lines)
    The workhorse of File::SplitStream is the genFileStream()
    method/function. It splits the large data file (if necessary) using the
    Unix split command, then generates an iterator function to return each
    line of the split files in order, transparently opening and closing the
    split files as necessary. If you have specified a list of pre-existing
    files, the iterator will open each in the order you gave.

    In an object-oriented context, genFileStream() will take the values of
    the file() and lines() accessors (or the files() accessor in the case
    of pre-existing files) as its parameters. If you explicitly pass
    genFileStream() parameters, these will override the object's
    attributes. In a procedural context, you will have to explicitly pass
    these parameters.

    In object-oriented style, genFileStream() will assign the iterator
    function to the nextLine() accessor and return 1 to the calling
    routine; this way the calling routine does not need yet another
    variable to hold the "filestream." In procedural style, the iterator
    will be returned to the calling routine.

    When the data in all of the files have been exhausted, the iterator
    function will return undef. If there is a problem generating the
    iterator (usually a problem with the split), or a problem is
    encountered while the split files are being read, the program will
    die() with the error being written to STDERR.

    Examples:

    # OO way
    use File::SplitStream;
    my $fss = new File::SplitStream;
    $fss->file('/data/largefile.dat');
    $fss->lines(1000000);
    $fss->genFileStream();
    while ( defined( my $line = $fss->nextLine()->() ) ) {
        # ...process the file...
    }

    # procedural
    use File::SplitStream qw(genFileStream);
    my $stream = genFileStream('/data/largefile.dat',1000000);
    while ( defined( my $line = $stream->() ) ) {
        # ...process the file...
    }
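
    The iterator described above can be pictured as a closure over the list
    of chunk files. The following is a minimal sketch of that idea, not the
    module's actual code: make_line_iterator() is a hypothetical name, and
    it reads each file with a plain open() rather than through cat as the
    module does.

```perl
use strict;
use warnings;

# Hypothetical sketch (not the module's implementation): return a closure
# that yields one line per call across a list of files, opening each file
# only once the previous one is exhausted.
sub make_line_iterator {
    my @files = @_;
    my $fh;    # handle for the file currently being read

    return sub {
        while (1) {
            unless ($fh) {
                my $file = shift @files;
                return unless defined $file;    # all files exhausted
                open $fh, '<', $file
                    or die "cannot open $file: $!";
            }
            my $line = <$fh>;
            return $line if defined $line;

            # End of this file: close it and move on to the next one.
            close $fh;
            undef $fh;
        }
    };
}

# Usage: print every line of the files named on the command line.
my $next = make_line_iterator(@ARGV);
while ( defined( my $line = $next->() ) ) {
    print $line;
}
```

    Testing the result with defined() keeps the loop from stopping early on
    a final line that consists of a bare "0" with no newline.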

    EXPORT
    None by default. You can import genFileStream() into your namespace if
    you wish to use it in procedural style.


  2. Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

    AJ wrote:

    > I am implementing a module and seek the community's input about its
    > suitability for placement on CPAN. File::SplitStream (I am open to
    > better names) is designed to be used when an OS supports large files,
    > but the Perl interpreter does not have large file support enabled
    > (specifically, Red Hat Linux did this for a while). It uses the Unix
    > split command to split the large file into <2GB chunks, then generates an
    > iterator to allow the calling routine to transparently read the file
    > chunks as if they are still one large file.


    This seems rather complex and involves very big temporary files.

    Is there some problem with just doing...

    open my $fh, '-|', 'cat', $huge_file or die "Cannot read $huge_file: $!";


  3. Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

    Brian McCauley wrote:
    > AJ wrote:
    >
    >> I am implementing a module and seek the community's input about its
    >> suitability for placement on CPAN. File::SplitStream (I am open to
    >> better names) is designed to be used when an OS supports large files,
    >> but the Perl interpreter does not have large file support enabled
    >> (specifically, Red Hat Linux did this for a while). It uses the Unix
    >> split command to split the large file into <2GB chunks, then generates
    >> an iterator to allow the calling routine to transparently read the
    >> file chunks as if they are still one large file.

    >
    >
    > This seems rather complex and involves very big temporary files.
    >
    > Is there some problem with just doing...
    >
    > open my $fh, '-|', 'cat', $huge_file or die "Cannot read $huge_file: $!";
    >

    Yes. In my case, it didn't work. I received a 'File too large' error
    after the input pipe passed the 2GB limit. Thus, this solution.
    Obviously, it's not very pretty, but it does work.

  4. Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

    AJ <ajperl@exiledplanet.org> wrote:
    > I am implementing a module and seek the community's input about its
    > suitability for placement on CPAN. File::SplitStream (I am open to
    > better names) is designed to be used when an OS supports large files,
    > but the Perl interpreter does not have large file support enabled
    > (specifically, Red Hat Linux did this for a while). It uses the Unix
    > split command to split the large file into <2GB chunks, then generates an
    > iterator to allow the calling routine to transparently read the file
    > chunks as if they are still one large file.


    I don't understand the need for this. It doesn't appear to implement
    "seek" and "tell", only streaming. It has been a while since I've used
    a small-file perl, but I never knew there was a problem in streaming large
    files in the first place. I thought it was only seek and tell (and
    truncate, and maybe other non-streaming things) which elicited the problem.

    Xho


  5. Re: [RFC] File::SplitStream - iterate over files >2GB when large file support unavailable

    xhoster wrote:
    > AJ <ajperl@exiledplanet.org> wrote:
    >
    >>I am implementing a module and seek the community's input about its
    >>suitability for placement on CPAN. File::SplitStream (I am open to
    >>better names) is designed to be used when an OS supports large files,
    >>but the Perl interpreter does not have large file support enabled
    >>(specifically, Red Hat Linux did this for a while). It uses the Unix
    >>split command to split the large file into <2GB chunks, then generates an
    >>iterator to allow the calling routine to transparently read the file
    >>chunks as if they are still one large file.

    >
    >
    > I don't understand the need for this. It doesn't appear to implement
    > "seek" and "tell", only streaming. It has been a while since I've used
    > a small-file perl, but I never knew there was a problem in streaming large
    > files in the first place. I thought it was only seek and tell (and
    > truncate, and maybe other non-streaming things) which elicited the problem.
    >
    > Xho
    >


    I can tell you that seek() and tell() are not the only things that don't
    work when trying to access a large file without large file support
    enabled. In my original case, merely trying to open the file in
    question (~20GB size) yielded a "File too large" error immediately.
    Rewriting my code to cat the file through a pipe worked until I read
    past the 2GB threshold; at that point, the "File too large" error
    resurfaced. This system is using an older OS whose perl was not
    compiled with large file support enabled and, if given the chance, I
    would have upgraded the Perl (and the OS, for that matter). But for
    several reasons I am unable to do this. A solution similar to this
    module (though not using the same code) seemed to provide the necessary
    workaround. My thought was, if I experienced this problem, others might
    too. It may be messy, since you're having to carve up a file and double
    your required disk space, but it *works*, and in a situation like that,
    *working* may be exactly what you need.

    I should also point out the module does not *have* to split the original
    file up; it can work from a list of files that are already separate for
    whatever reason (autorotated log files come to mind). Sure, you can
    just cat them, but what if their total size is >2GB? Without large file
    support, the perl interpreter will give up after it has read past the
    2GB threshold. This module will prevent that from happening. Again,
    this is a very specific set of circumstances that ideally one would
    avoid. But if you're in such a position, as I was recently, having a
    module to give you a helping hand would be a very good thing.
