Comparing Files - quickly and definitively - CSharp


  1. Comparing Files - quickly and definitively

    I would appreciate some recommendations for programmatically determining if
    files differ.

    I'm writing a utility that backs up files that customers upload to Web
    sites. Rather than mindlessly copying any/all files from each Web site to
    the backup server (and wasting space), I'm looking to copy only files that
    have been modified since the last backup took place. The files include
    anything from PDF to GIF/JPG to XML, text, etc. Max size is currently under
    5MB, but that could be increased later depending on customer demand.

    I understand that I can look to the LastModified date or other file
    properties, but I would prefer something more reliable. By "more reliable" I
    mean this: I have noticed that the time can differ by a couple of seconds
    after copying a file from one server to another. If the logic were to
    compare using those date/times, we would expect "false positives" - files
    that appear to be newer (different) based on Date/Time, but are in fact no
    different. At least this scenario would happen if the logic looked to the
    last backup (on the backup server) and compared against the current file on
    a Web server.

    So I'm thinking that there may be a more reliable way to determine if the
    file content is actually different. While it would be a no-brainer to open
    each file and compare the contents, that could be a rather costly
    operation - given the large number of files to potentially compare, and
    their potential large sizes.

    So I'm looking for a reliable means through which to determine which files
    have, in fact, been changed - and make that determination with fast
    performance.

    Suggestions? Ideas?

    Thanks!

    -S



  2. Re: Comparing Files - quickly and definitively

    Smithers wrote:
    > [...]
    > So I'm looking for a reliable means through which to determine which files
    > have, in fact, been changed - and make that determination with fast
    > performance.


    Depends on your definition of "reliable". Many backup programs use only
    the filename, size, and modified date to determine whether the file has
    changed. Some even just use the archive bit. When they use these
    things, they make sure that they copy not only the file but also the
    file attributes they are checking. So if you are relying on the
    modified date, for example, you'd have to copy the modified date too (I
    know that the Windows Explorer does this when copying the files by hand).

    But since these things aren't tied to the actual file contents, they
    aren't 100% reliable, though they are often "good enough".

    If you really want to know whether the file is different, you have to
    compare it somehow. A common method would be to generate and store an
    MD5 hash on the file, and then generate the same hash for the file that
    is eligible for copying. If the hash is the same, don't copy.

    Of course, you would check the file size first, since that's a quick way
    to know for sure if the files are different.

    There is a theoretical possibility of hash collisions even using that
    technique, so technically speaking it's not 100% reliable. But it's far
    more reliable than looking just at date and file size, and is probably
    good enough for almost any real-world application.

    Pete
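The size-then-hash check Pete describes might look something like the sketch below. It assumes the file length and MD5 hash recorded at the last backup are stored somewhere (a database row, a sidecar file); the `storedLength` and `storedHash` parameters and the class and method names are hypothetical, not something from the thread.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHashCheck
{
    // Returns the MD5 hash of a file's contents as an uppercase hex string.
    public static string ComputeMd5Hex(string path)
    {
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // storedLength and storedHash are assumed to have been recorded
    // when the file was last backed up (hypothetical storage).
    public static bool NeedsBackup(string path, long storedLength, string storedHash)
    {
        // A different size proves the content differs, so skip the hash.
        if (new FileInfo(path).Length != storedLength)
        {
            return true;
        }

        string currentHash = ComputeMd5Hex(path);
        return !string.Equals(currentHash, storedHash, StringComparison.OrdinalIgnoreCase);
    }
}
```

Storing the hash at backup time means only the current file has to be read on each run; the archived copy never needs to be reopened.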

  3. Re: Comparing Files - quickly and definitively

    Thanks Pete - hadn't thought about the hashing alternatives.

    "Good enough" is a criterion I can live with on this. An occasional false
    positive won't be the end of the world. It would simply mean that we archive
    a file unnecessarily. No big deal. I think I'll go with a comparison of the
    date/times after all, do a bunch of testing, and if there are very few false
    positives, then we'll be done. We can go with more involved analyses and
    possibly hashing if we need to tighten things up later.

    -S
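A date/time comparison along these lines could be sketched as follows. The two-second tolerance is an assumption based on the "couple of seconds" drift mentioned in the first post, and the class and method names are made up for illustration. The copy helper also carries the source's modified time over to the backup, as suggested earlier in the thread.

```csharp
using System;
using System.IO;

public static class TimestampCheck
{
    // Assumed tolerance; the thread only says "a couple of seconds".
    public static readonly TimeSpan Tolerance = TimeSpan.FromSeconds(2);

    // Pure comparison on already-read values, so it is easy to test.
    public static bool LooksModified(long sourceLength, DateTime sourceUtc,
                                     long backupLength, DateTime backupUtc)
    {
        if (sourceLength != backupLength)
        {
            return true;
        }

        // Flag the file only when the times differ by more than the tolerance,
        // in either direction.
        return (sourceUtc - backupUtc).Duration() > Tolerance;
    }

    public static bool LooksModified(string sourcePath, string backupPath)
    {
        if (!File.Exists(backupPath))
        {
            return true; // never backed up
        }

        FileInfo source = new FileInfo(sourcePath);
        FileInfo backup = new FileInfo(backupPath);
        return LooksModified(source.Length, source.LastWriteTimeUtc,
                             backup.Length, backup.LastWriteTimeUtc);
    }

    // Copies the file and explicitly stamps the backup with the source's
    // modified time, so the next comparison has a matching value.
    public static void CopyPreservingTime(string sourcePath, string backupPath)
    {
        File.Copy(sourcePath, backupPath, true);
        File.SetLastWriteTimeUtc(backupPath, File.GetLastWriteTimeUtc(sourcePath));
    }
}
```

Comparing in UTC avoids spurious differences when the Web server and the backup server sit in different time zones or cross a daylight-saving change.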





  4. Re: Comparing Files - quickly and definitively

    Smithers wrote:
    > Thanks Pete - hadn't thought about the hashing alternatives.
    >
    > "Good enough" is a criterion I can live with on this. An occasional false
    > positive won't be the end of the world. It would simply mean that we archive
    > a file unnecessarily. No big deal. I think I'll go with a comparison of the
    > date/times after all, do a bunch of testing, and if there are very few false
    > positives, then we'll be done. We can go with more involved analyses and
    > possibly hashing if we need to tighten things up later.


    I think false negatives are probably the bigger problem. And they can
    occur with either approach, though IMHO they are unlikely with either.

    You could have a file with the same name and modification date and time,
    but which isn't actually the one that's been archived. There are ways
    the user can force this situation, but it's even theoretically possible
    simply as an accident.

    Likewise, hashes can collide, so it is possible that, using a hash of the
    file, you'd detect the files as identical even though they are different
    and the file needs archiving.

    Even the date/time false negative is extremely unlikely IMHO, and a hash
    collision is probably (many) orders of magnitude less likely than that. So
    I don't really think they are of great concern. I just want to ensure
    those issues aren't overlooked. It's fine to call it "good enough", but
    one needs to at least be aware of them.

    Pete
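If even a hash collision were unacceptable, the only check with no false negatives is the direct content comparison mentioned in the first post. A minimal sketch, assuming both copies are readable from the machine doing the comparison (the class and method names are hypothetical):

```csharp
using System;
using System.IO;

public static class FileContentCompare
{
    // Returns true only if both files have exactly the same bytes.
    public static bool ContentsEqual(string pathA, string pathB)
    {
        // Different lengths can never be equal; avoids reading either file.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
        {
            return false;
        }

        using (FileStream a = File.OpenRead(pathA))
        using (FileStream b = File.OpenRead(pathB))
        {
            // FileStream buffers internally, so ReadByte does not
            // go to the disk for every single byte.
            int byteA;
            do
            {
                byteA = a.ReadByte();
                int byteB = b.ReadByte();
                if (byteA != byteB)
                {
                    return false;
                }
            } while (byteA != -1); // -1 signals end of stream
        }

        return true;
    }
}
```

The trade-off is the one raised at the start of the thread: both files must be read in full, which across servers means transferring the file anyway.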

  5. Re: Comparing Files - quickly and definitively

    Agreed. Failing to back up a modified file (a false negative) is, at least
    in our case, far worse than backing up a file unnecessarily. Part of our
    strategy for protecting against that is that we back up all files,
    whether modified or not, on a weekly basis - which is acceptable given
    the nature of the data. In fact, we have been doing this on a _daily_ basis
    for the past couple of years now. I'm looking to get away from the _daily_
    full backup and go with weekly full backup, with incremental backups between
    full backups - in order to reduce the amount of space taken up
    unnecessarily. Separately, we always advise customers to maintain their own
    local copies. And YES - I know this has practically nothing to do with us
    meeting our own SLA. But experience shows that that's where customers have
    typically gone anyway - i.e., they get their own backups without even asking
    us for them, even though we had 'em ready to go if necessary.

    -S






