GZipStream and buffering - DOTNET
-
GZipStream and buffering
I wrote a client / server networking application and all communication
between the two is compressed using GZipStream. However, I found a weird
buffering problem. Despite how I had the network stream configured,
buffering would still occur (on a small scale). I pinpointed the problem to
GZipStream. Basically it seems to buffer content and calling Flush has no
effect.
Below is a small sample application that demonstrates this. In two command
prompts, run "App.exe L" to listen and plain "App.exe" to start the client.
You will notice that the GZipStream is buffering. Uncommenting
"//#define DONT_USE_GZIP_STREAM_READ" and "//#define
DONT_USE_GZIP_STREAM_WRITE" makes the application work as expected. I
understand I am not getting any gain from compression here because each write
is just 1 byte. That is beside the point - this is just a demonstration of the
problem (the original app has sizable messages to transmit). Regardless of
size, GZipStream.Flush should flush data to the network stream and it does
not. Also, I know the problem is on the sending side: if I uncomment only
"//#define DONT_USE_GZIP_STREAM_READ", the receiver still reads nothing for
seconds at a time, which implies nothing was sent over the network.
How do I get this demo app to work as expected?
Thanks!
// BUGGY CODE USED FOR DEMONSTRATION ONLY
//#define DONT_USE_GZIP_STREAM_READ
//#define DONT_USE_GZIP_STREAM_WRITE
using System;
using System.IO.Compression;
using System.Net;
using System.Net.Sockets;

class Program
{
    static void Main(String[] args)
    {
        if (args.Length > 0 && args[0] == "L")
        {
            TcpListener listener = new TcpListener(IPAddress.Loopback, 40);
            listener.Start();
            using (TcpClient client = listener.AcceptTcpClient())
            {
                client.NoDelay = true;
                client.ReceiveBufferSize = 1;
#if DONT_USE_GZIP_STREAM_READ
                NetworkStream stream = client.GetStream();
#else
                GZipStream stream = new GZipStream(
                    client.GetStream(), CompressionMode.Decompress, false);
#endif
                while (true)
                {
                    Console.WriteLine("{0}", stream.ReadByte());
                }
            }
        }
        else
        {
            using (TcpClient client = new TcpClient())
            {
                client.Connect(IPAddress.Loopback, 40);
                client.NoDelay = true;
                client.SendBufferSize = 1;
#if DONT_USE_GZIP_STREAM_WRITE
                NetworkStream stream = client.GetStream();
#else
                GZipStream stream = new GZipStream(
                    client.GetStream(), CompressionMode.Compress, false);
#endif
                for (Byte b = 0;; ++b)
                {
                    stream.WriteByte(b);
                    stream.Flush();
                    Console.WriteLine("{0}", b);
                    System.Threading.Thread.Sleep(1000);
                }
            }
        }
    }
}
-
Re: GZipStream and buffering
On Mon, 21 Sep 2009 18:19:01 -0700, Agendum
<Agendumatdiscussionsdotmicrosoft.com> wrote:
> I wrote a client / server networking application and all communication
> between the two is compressed using GZipStream. However, I found a weird
> buffering problem. Despite how I had the network stream configured,
> buffering would still occur (on a small scale). I pinpointed the
> problem to GZipStream. Basically it seems to buffer content and calling
> Flush has no effect.
Not that I think it's such a great idea to:
-- Set NoDelay to true,
-- Set the send and receive buffers to 1 byte in length, or
-- Flush the GZipStream after each write
But, basically the problem here is that you expect there to be no
buffering when it's impossible for there to be no buffering.
The job of GZipStream is to take a stream of bytes and turn it into a
shorter stream of bytes. Since you get fewer bytes on the receiving end,
it should be obvious that for at least some of the bytes you send, you
will not receive a byte on the output of GZipStream.
Likewise, at the receiving end, the job of the class there is to take a
short stream of bytes and turn it back into the longer stream. Thus,
there it should also be obvious that for every byte you actually do
receive on the network, for at least some of them, you will get more than
one byte on the output of the GZipStream.
In other words, GZipStream is doing exactly what it's supposed to, as is
each TcpClient given how you've configured them (however obscenely that
may be).
In general, trying to disable buffering on a network stream is a really
bad idea. But at the very least, it is simply impossible to avoid at
least some buffering within the compression/decompression stages, because
that's a fundamental aspect of how compression works (it's essentially a
corollary to the pigeon-hole principle...you only have so many
"pigeon-holes" on the output of the GZipStream to put the input, which has
more elements than there are "pigeon-holes", so obviously some of the
input elements don't have their own unique output "pigeon-hole").
Pete
-
Re: GZipStream and buffering
The fact I use NoDelay, have a 1 byte buffer, and flush is just a
demonstration that there's no other options other than for the byte to be
transmitted. Also, I don't Flush after each write (each byte!) in the
original app -- it is just a demonstration here.
In any case, I understand what you are saying about the GZipStream.
Basically to apply a reasonable amount of compression GZipStream reads a
minimum amount of bytes. The fact GZipStream has a Flush method is
irrelevant... it just flushes the already-compressed bytes to the stream. I
was incorrectly assuming it would compress any remaining bytes in the stream
and write it out. Apparently there is no method for doing that.
I mentioned the original application sends messages of a sizeable amount and
I am experiencing the same problem. I guess I can conclude from this that:
1) GZipStream compresses bytes on some internally defined byte boundary.
This would explain why just a "minimum number of bytes" is not enough.
2) The only way to invoke the call of "compress any remaining bytes in the
stream" is to actually close the GZipStream itself.
Thanks for your response.
-
Re: GZipStream and buffering
On Mon, 21 Sep 2009 20:06:01 -0700, Agendum
<Agendumatdiscussionsdotmicrosoft.com> wrote:
> The fact I use NoDelay, have a 1 byte buffer, and flush is just a
> demonstration that there's no other options other than for the byte to be
> transmitted.
Obviously, there _are_ other options other than for the byte to be
transmitted. The GZipStream instance can (and does) buffer it.
> Also, I don't Flush after each write (each byte!) in the
> original app -- it is just a demonstration here.
Okay, that's a relief.
> In any case, I understand what you are saying about the GZipStream.
> Basically to apply a reasonable amount of compression GZipStream reads a
> minimum amount of bytes.
It's not really about being "reasonable". It's simply how that particular
compression algorithm works. It builds a dictionary as it goes, and when
certain conditions are fulfilled (e.g. some new sequence of bytes not
already in the dictionary is seen, or a given sequence of bytes seen does
match something in the dictionary, etc.) the compression algorithm emits
bytes on the output end.
Depending on the input, this may in fact result in unreasonable amounts of
compression, or even inflation of the stream. "Reasonable" doesn't come
into play; it's basically a dynamic state machine, and at certain states,
bytes are emitted, hopefully (but not always) in a compressed state as
compared to the input.
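To make the point about inflation concrete, here is a minimal sketch that compresses a tiny buffer and a highly repetitive one and prints the sizes. The class and method names (GZipSizeDemo, Compress) are made up for illustration and are not from the original application. The gzip container alone adds a 10-byte header and an 8-byte trailer, so a 1-byte input necessarily comes out larger than it went in.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
static class GZipSizeDemo
{
    // Compresses a buffer in memory and returns the complete gzip output.
    public static byte[] Compress(byte[] input)
    {
        using (MemoryStream output = new MemoryStream())
        {
            // leaveOpen = true so that disposing the GZipStream finishes the
            // gzip data but leaves the MemoryStream readable afterwards.
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(input, 0, input.Length);
            }
            return output.ToArray();
        }
    }

    public static void Main()
    {
        byte[] single = new byte[] { 42 };
        byte[] repetitive = new byte[10000]; // all zeros, compresses very well

        Console.WriteLine("1 byte in     -> {0} bytes out", Compress(single).Length);
        Console.WriteLine("10000 bytes in -> {0} bytes out", Compress(repetitive).Length);
    }
}
```

The first line should report more bytes out than in, the second far fewer; the exact numbers depend on the runtime's compression implementation.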
> The fact GZipStream has a Flush method is
> irrelevant... it just flushes the already-compressed bytes to the
> stream. I was incorrectly assuming it would compress any remaining
> bytes in the stream and write it out. Apparently there is no method
> for doing that.
Allowing that would be counter-productive from a compression point of
view, and in any case it would prevent the decompression side from working.
> I mentioned the original application sends messages of a sizeable amount
> and I am experiencing the same problem. I guess I can conclude from this
> that:
>
> 1) GZipStream compresses bytes on some internally defined byte boundary.
> This would explain why just a "minimum number of bytes" is not enough.
It's not "some internally defined byte boundary". It has to do with the
progress of the compression algorithm in matching the input to the current
state of its dictionary. The compression algorithm is documented. If you
care how it works, you should read about how it works.
> 2) The only way to invoke the call of "compress any remaining bytes in
> the stream" is to actually close the GZipStream itself.
Yes. That is the only way for that particular compression algorithm to
work.
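A small sketch for checking this on your own runtime: it records how many bytes have reached the underlying stream after Flush and again after the GZipStream has been disposed. The names (FlushVersusDispose, Measure) are hypothetical. What Flush itself pushes out appears to differ between framework versions, so treat the first number as something to observe rather than rely on; the only thing the sketch assumes is that disposing writes whatever is still pending, including the gzip trailer.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
static class FlushVersusDispose
{
    // Returns { bytes in the underlying stream after Flush,
    //           bytes in the underlying stream after Dispose }.
    public static long[] Measure(byte[] payload)
    {
        using (MemoryStream sink = new MemoryStream())
        {
            long afterFlush;
            // leaveOpen = true keeps the MemoryStream usable after the
            // GZipStream has been disposed.
            using (GZipStream gzip = new GZipStream(sink, CompressionMode.Compress, true))
            {
                gzip.Write(payload, 0, payload.Length);
                gzip.Flush();
                afterFlush = sink.Length;
            }
            // Disposing the GZipStream finishes the compressed data.
            return new long[] { afterFlush, sink.Length };
        }
    }

    public static void Main()
    {
        long[] sizes = Measure(new byte[] { 1, 2, 3 });
        Console.WriteLine("After Flush:   {0} bytes", sizes[0]);
        Console.WriteLine("After Dispose: {0} bytes", sizes[1]);
    }
}
```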
Pete
-
Re: GZipStream and buffering
* Peter Duniho wrote, On 22-9-2009 5:20:
>> 2) The only way to invoke the call of "compress any remaining bytes in
>> the stream" is to actually close the GZipStream itself.
>
> Yes. That is the only way for that particular compression algorithm to
> work.
In my opinion, the best way to make this work is to compress the data
first, and then send the compressed data over the wire as if it were a
message.
For that you have two options:
1) Create the message beforehand by writing it to a MemoryStream, then
stream the contents of that over the network. The problem with this
approach is that it requires more memory.
2) Create the GZipStream with the second constructor (see
http://msdn.microsoft.com/en-us/library/27ck2z1y.aspx), and specify
true for the third parameter (leaveOpen). This allows you to close the
GZipStream, forcing it to send its contents along to the network while
leaving the connection open. This means that you would have to create a
new GZipStream to send a new message over the wire. The problem with this
approach is that the receiving end needs to know when the end of one
zipped object has been received, so that it in turn can also create a
new GZipStream to decompress the message on the other end. Meaning
you'll have to add some protocol handling on both ends.
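As a rough sketch of what that could look like, the code below compresses each message into a MemoryStream and sends it with a 4-byte length prefix, so the receiver knows where one compressed message ends and can decompress it with a fresh GZipStream. The length-prefix format and all names (CompressedFraming, WriteMessage, ReadMessage, ReadExactly) are assumptions made for illustration, not something from the original application. Main uses a MemoryStream as a stand-in for the NetworkStream.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical framing helper, for illustration only.
static class CompressedFraming
{
    // Compresses one message and writes it as: 4-byte big-endian length + gzip data.
    public static void WriteMessage(Stream transport, byte[] payload)
    {
        byte[] compressed;
        using (MemoryStream buffer = new MemoryStream())
        {
            // Disposing the GZipStream finishes the gzip data; leaveOpen = true
            // keeps the MemoryStream readable afterwards.
            using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress, true))
            {
                gzip.Write(payload, 0, payload.Length);
            }
            compressed = buffer.ToArray();
        }

        int n = compressed.Length;
        byte[] prefix = { (byte)(n >> 24), (byte)(n >> 16), (byte)(n >> 8), (byte)n };
        transport.Write(prefix, 0, prefix.Length);
        transport.Write(compressed, 0, n);
        transport.Flush();
    }

    // Reads one length-prefixed message and returns the decompressed payload.
    public static byte[] ReadMessage(Stream transport)
    {
        byte[] prefix = ReadExactly(transport, 4);
        int n = (prefix[0] << 24) | (prefix[1] << 16) | (prefix[2] << 8) | prefix[3];
        if (n < 0) throw new InvalidDataException("Negative message length.");

        byte[] compressed = ReadExactly(transport, n);
        using (GZipStream gzip = new GZipStream(
            new MemoryStream(compressed), CompressionMode.Decompress))
        using (MemoryStream result = new MemoryStream())
        {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.Read(chunk, 0, chunk.Length)) > 0)
            {
                result.Write(chunk, 0, read);
            }
            return result.ToArray();
        }
    }

    // Stream.Read may return fewer bytes than requested, so loop until done.
    static byte[] ReadExactly(Stream source, int count)
    {
        byte[] data = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = source.Read(data, offset, count - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
        return data;
    }

    public static void Main()
    {
        using (MemoryStream transport = new MemoryStream())
        {
            WriteMessage(transport, new byte[] { 1, 2, 3 });
            WriteMessage(transport, new byte[] { 4, 5 });

            transport.Position = 0; // rewind to play the receiving side
            Console.WriteLine("First message:  {0} bytes", ReadMessage(transport).Length);
            Console.WriteLine("Second message: {0} bytes", ReadMessage(transport).Length);
        }
    }
}
```

Buffering each message in memory is the cost mentioned under option 1; in exchange the receiver never has to guess where a compressed message ends.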
I don't know exactly what you're trying to do here, but I wonder whether
it wouldn't be a better idea to use WCF or some other existing
communication stack to solve your problem.
--
Jesse Houwing
jesse.houwing at sogeti.nl