marshal vs pickle - Python

  1. marshal vs pickle

    The documentation for marshal makes it clear that there are no
    guarantees about being able to correctly deserialize marshalled data
    structures across Python releases. It also implies that marshal is not
    a general "persistence" module. On the other hand, the documentation
    seems to imply that marshalled objects act more or less like pickled
    objects.

    Can anyone elaborate more on the difference between marshal and
    pickle? In what conditions would using marshal be unsafe? If one can
    guarantee that the marshalled objects would be created and read by the
    same version of Python, is that enough?

    --
    Evan Klitzke <evan@yelp.com>

  2. Re: marshal vs pickle

    Evan Klitzke wrote:
    > Can anyone elaborate more on the difference between marshal and
    > pickle? In what conditions would using marshal be unsafe? If one
    > can guarantee that the marshalled objects would be created and
    > read by the same version of Python, is that enough?


    Just use pickle. From the docs:

    | The marshal module exists mainly to support reading and writing
    | the "pseudo-compiled" code for Python modules of .pyc files.
    | Therefore, the Python maintainers reserve the right to modify the
    | marshal format in backward incompatible ways should the need
    | arise. If you're serializing and de-serializing Python objects,
    | use the pickle module instead.

    Regards,


    Björn

    --
    BOFH excuse #421:

    Domain controller not responding


  3. Re: marshal vs pickle

    On Oct 31, 3:31 am, "Evan Klitzke" <e...@yelp.com> wrote:
    > Can anyone elaborate more on the difference between marshal and
    > pickle? In what conditions would using marshal be unsafe? If one can
    > guarantee that the marshalled objects would be created and read by the
    > same version of Python, is that enough?


    Yes, I think that's enough. I like to use marshal a lot because
    it's the absolute fastest way to store and load data to/from
    Python. Furthermore, because marshal is "stupid", the programmer
    has complete control. A lot of the overhead that makes pickles
    generally much slower than marshal comes from the cleverness by
    which pickle recognizes shared objects and all that junk. When I
    serialize, I generally don't need that because I know what I'm
    doing.
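
    To make the "shared objects" point concrete, here is a minimal sketch
    (not from the original post) of what pickle's bookkeeping buys: an
    object referenced twice is still a single object after a round trip,
    and a list that contains itself survives too. The variable names are
    illustrative only.

```python
import pickle

# One list referenced from two slots of an outer list.
shared = [1, 2, 3]
outer = [shared, shared]

restored = pickle.loads(pickle.dumps(outer, protocol=2))

# pickle's memo table preserves the sharing: both slots are the same object.
assert restored[0] == [1, 2, 3]
assert restored[0] is restored[1]

# A list that contains itself also round-trips.
loop = []
loop.append(loop)
restored_loop = pickle.loads(pickle.dumps(loop, protocol=2))
assert restored_loop[0] is restored_loop
```

    If the data is known to be a plain tree with no sharing, this
    bookkeeping is overhead that buys nothing, which is the trade-off
    described above.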

    For example both gadfly SQL

    http://gadfly.sourceforge.net

    and nucular full text/fielded search

    http://nucular.sourceforge.net

    use marshal as the underlying serializer. Using cPickle
    would probably make serialization more than 2x slower.
    This is one of the 2 or 3 key tricks which make these
    packages as fast as they are.

    -- Aaron Watters

    ===
    http://www.xfeedme.com/nucular/gut.p...TEXT=halloween


  4. Re: marshal vs pickle

    On Oct 31, 6:45 am, Aaron Watters <aaron.watt...@gmail.com> wrote:
    > I like to use
    > marshal a lot because it's the absolutely fastest
    > way to store and load data to/from Python. Furthermore
    > because marshal is "stupid" the programmer has complete
    > control. A lot of the overhead that makes
    > pickles generally much slower than
    > marshal comes from the cleverness by which pickle
    > recognizes shared objects and all that junk. When I
    > serialize,


    I believe this FUD is somewhat out-of-date. Marshalling
    became smarter about repeated and shared objects. The
    pickle module (using protocol 2) has a similar implementation
    to marshal and both use the same tricks, but pickle is
    much more flexible in the range of objects it can handle
    (e.g. sets became marshalable only recently, while deques
    can be pickled but not marshalled).
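
    The deque case can be checked directly. The following is a small
    sketch, assuming a current Python 3 interpreter (where cPickle has
    been folded into pickle); `marshal_ok` is just an illustrative flag.

```python
import collections
import marshal
import pickle

queue = collections.deque([1, 2, 3])

# pickle handles the deque and restores the same type.
restored = pickle.loads(pickle.dumps(queue, protocol=2))
assert restored == queue
assert isinstance(restored, collections.deque)

# marshal refuses types outside its small supported set.
try:
    marshal.dumps(queue)
    marshal_ok = True
except ValueError as exc:
    marshal_ok = False
    print("marshal failed:", exc)
```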

    Users are almost always better off using pickle, which is
    version independent, fast, and can handle many more types
    of objects than marshal.

    Also FWIW, in most applications of pickling/marshalling,
    the storage or transmission times dominate computation
    time. I've gotten nice speed-ups by zipping the pickle
    before storing, transmitting, or sharing (RPC apps
    for example).
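
    One way to read "zipping the pickle" is compressing the pickled
    bytes with zlib before they hit the disk or the wire. This is a
    sketch under that assumption, with made-up sample records, not the
    code from any particular application.

```python
import pickle
import zlib

# Hypothetical sample data: repetitive records compress well.
records = [{"id": i, "name": "item-%d" % i} for i in range(1000)]

raw = pickle.dumps(records, protocol=2)
packed = zlib.compress(raw)

# Reverse the two steps to get the objects back.
restored = pickle.loads(zlib.decompress(packed))
assert restored == records

print(len(raw), "bytes pickled,", len(packed), "bytes compressed")
```

    Whether the compression pays off depends on how the CPU cost
    compares with the storage or network cost in a given setup.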


    Raymond


  5. Re: marshal vs pickle

    On Oct 31, 1:37 pm, Raymond Hettinger <pyt...@rcn.com> wrote:
    > On Oct 31, 6:45 am, Aaron Watters <aaron.watt...@gmail.com> wrote:
    >
    > > I like to use
    > > marshal a lot because it's the absolutely fastest
    > > way to store and load data to/from Python....

    >
    > I believe this FUD is somewhat out-of-date. Marshalling
    > became smarter about repeated and shared objects. The
    > pickle module (using mode 2) has a similar implementation
    > to marshal


    Raymond: happy days! We are both right!
    I just ran some tests from the test suite for
    http://nucular.sourceforge.net with marshalling
    and pickling switched in and out, and to my
    surprise I didn't find too much difference
    on the "load" end (marshalling 10% faster),
    but for "bigLtreeTest.py" I found that
    the build ("dump") process was about 1/3
    slower with cPickle (protocol 2, Python 2.4). For
    the more complex tests (mondial and gutenberg)
    I found that the speed-up from using marshal was
    in the 1-2% range (and sometimes inverted,
    I think because of processor load on a shared
    hosting machine).

    I'm pretty sure things were much worse for cPickle
    many moons ago. Nice to see that some things
    get better. It makes sense that the
    "dump" side would be slower because that's
    where you need to remember all the objects
    in case you see them again...
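
    A comparison along these lines can be reproduced with timeit. The
    sketch below is not the nucular test suite: the sample data and the
    `time_call` helper are invented for illustration, and on Python 3
    the C-accelerated pickle stands in for cPickle. The numbers will
    vary by machine and Python version.

```python
import marshal
import pickle
import timeit

# Hypothetical workload: a flat list of small tuples.
data = [(i, str(i), float(i)) for i in range(10000)]

def time_call(fn, *args, repeat=5):
    """Return the best wall-clock time of `repeat` single calls to fn(*args)."""
    return min(timeit.repeat(lambda: fn(*args), number=1, repeat=repeat))

marshal_blob = marshal.dumps(data)
pickle_blob = pickle.dumps(data, protocol=2)

timings = {
    "marshal dump": time_call(marshal.dumps, data),
    "pickle dump": time_call(pickle.dumps, data, 2),
    "marshal load": time_call(marshal.loads, marshal_blob),
    "pickle load": time_call(pickle.loads, pickle_blob),
}

for label, seconds in timings.items():
    print("%-13s %.2f ms" % (label, seconds * 1000))
```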

    Anyway, since it's easy and makes sense, I think
    the next version of nucular will have a
    switchable option between marshal and cPickle
    for persistent storage.
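
    A switchable serializer could look something like the sketch below.
    This is only a guess at the shape of such an option, not nucular's
    actual code: `SERIALIZERS` and `make_codec` are hypothetical names,
    and plain pickle is used where Python 2 code would import cPickle.

```python
import marshal
import pickle

# Hypothetical registry: name -> (dump function, load function).
SERIALIZERS = {
    "marshal": (marshal.dumps, marshal.loads),
    "pickle": (lambda obj: pickle.dumps(obj, protocol=2), pickle.loads),
}

def make_codec(name):
    """Return the (dumps, loads) pair registered under `name`."""
    try:
        return SERIALIZERS[name]
    except KeyError:
        raise ValueError("unknown serializer: %r" % (name,))

# Both back ends round-trip a simple index entry.
index_entry = {"word": "halloween", "docs": [3, 17, 42]}
for name in ("marshal", "pickle"):
    dumps, loads = make_codec(name)
    assert loads(dumps(index_entry)) == index_entry
```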

    Thanks! -- Aaron Watters

    ===
    The pursuit of hypothetical performance
    improvements is the root of all evil.
    -- Bill Tutt
    http://www.xfeedme.com/nucular/pydis...?FREETEXT=tutt


  6. Re: marshal vs pickle

    On Oct 31, 12:27 pm, Aaron Watters <aaron.watt...@gmail.com> wrote:
    > Anyway since it's easy and makes sense I think
    > the next version of nucular will have a
    > switchable option between marshal and cPickle
    > for persistent storage.


    Makes more sense to use cPickle and be done with it.

    FWIW, I've updated the docs to be absolutely clear on the subject:

    '''
    This is not a general "persistence" module. For general persistence and
    transfer of Python objects through RPC calls, see the modules
    :mod:`pickle` and :mod:`shelve`. The :mod:`marshal` module exists mainly
    to support reading and writing the "pseudo-compiled" code for Python
    modules of :file:`.pyc` files. Therefore, the Python maintainers reserve
    the right to modify the marshal format in backward incompatible ways
    should the need arise. If you're serializing and de-serializing Python
    objects, use the :mod:`pickle` module instead -- the performance is
    comparable, version independence is guaranteed, and pickle supports a
    substantially wider range of objects than marshal.

    .. warning::

       The :mod:`marshal` module is not intended to be secure against
       erroneous or maliciously constructed data. Never unmarshal data
       received from an untrusted or unauthenticated source.

    Not all Python object types are supported; in general, only objects
    whose value is independent from a particular invocation of Python can be
    written and read by this module. The following types are supported:
    ``None``, integers, long integers, floating point numbers, strings,
    Unicode objects, tuples, lists, dictionaries, and code objects, where it
    should be understood that tuples, lists and dictionaries are only
    supported as long as the values contained therein are themselves
    supported; and recursive lists and dictionaries should not be written
    (they will cause infinite loops).

    .. warning::

       Some unsupported types such as subclasses of builtins will appear to
       marshal and unmarshal correctly, but in fact, their type will change
       and the additional subclass functionality and instance attributes
       will be lost.

    .. warning::

       On machines where C's ``long int`` type has more than 32 bits (such
       as the DEC Alpha), it is possible to create plain Python integers
       that are longer than 32 bits. If such an integer is marshaled and
       read back in on a machine where C's ``long int`` type has only 32
       bits, a Python long integer object is returned instead. While of a
       different type, the numeric value is the same. (This behavior is new
       in Python 2.2. In earlier versions, all but the least-significant 32
       bits of the value were lost, and a warning message was printed.)
    '''


  7. Re: marshal vs pickle

    On Wed, 31 Oct 2007 19:10:48 -0300, Raymond Hettinger <python@rcn.com>
    wrote:

    > FWIW, I've updated the docs to be absolutely clear on the subject:


    While you are at it, the list of supported types should be updated too:

    > The following types are supported: ``None``, integers, long
    > integers, floating point numbers, strings, Unicode objects, tuples,
    > lists, dictionaries, and code objects,


    boolean, complex, set and frozenset are missing.
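
    A quick round trip shows these four types going through marshal with
    their type intact. This is a minimal check, assuming a current
    Python 3 interpreter; the sample values are arbitrary.

```python
import marshal

samples = [True, False, 1 + 2j, {1, 2, 3}, frozenset("abc")]

results = []
for value in samples:
    restored = marshal.loads(marshal.dumps(value))
    # Both the value and the exact type should survive the round trip.
    assert restored == value
    assert type(restored) is type(value)
    results.append(restored)
```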

    --
    Gabriel Genellina


  8. Re: marshal vs pickle

    Raymond Hettinger <python@rcn.com> writes:
    > ''' This is not a general "persistence" module. For general
    > persistence and transfer of Python objects through RPC calls, see
    > the modules :mod:`pickle` and :mod:`shelve`.


    That advice should be removed since Python currently does not have a
    general persistence or transfer module in its stdlib. There's been an
    open bug/RFE about it for something like 5 years. The issue is that
    any sensible general purpose RPC mechanism MUST make reasonable
    security assertions that nothing bad happens if you deserialize
    untrusted data. The pickle module doesn't make such guarantees and in
    fact its documentation explicitly warns against unpickling untrusted
    data. Therefore pickle should not be used as a general RPC
    mechanism.
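
    The reason behind that warning can be shown with a deliberately
    harmless sketch. `Demo` is a made-up class; its `__reduce__` method
    tells pickle which callable to invoke at load time, and whoever
    builds the pickle chooses that callable. Here it is only `sorted`.

```python
import pickle

class Demo:
    def __reduce__(self):
        # At load time pickle will call sorted([3, 1, 2]) and use the
        # return value as the "reconstructed" object.
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(Demo())

# Loading runs the callable named inside the payload.
result = pickle.loads(payload)
assert result == [1, 2, 3]
assert not isinstance(result, Demo)
```

    A hostile payload could name a far less innocent callable, which is
    why data from an untrusted source should never be unpickled.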

  9. Re: marshal vs pickle

    On Nov 1, 12:04 am, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
    > Raymond Hettinger <pyt...@rcn.com> writes:
    > > ''' This is not a general "persistence" module. For general
    > > persistence and transfer of Python objects through RPC calls, see
    > > the modules :mod:`pickle` and :mod:`shelve`.

    >
    > That advice should be removed since Python currently does not have a
    > general persistence or transfer module in its stdlib. There's been an
    > open bug/RFE about it for something like 5 years. The issue is that
    > any sensible general purpose RPC mechanism MUST make reasonable
    > security assertions that nothing bad happens if you deserialize
    > untrusted data. The pickle module doesn't make such guarantees and in
    > fact its documentation explicitly warns against unpickling untrusted
    > data. Therefore pickle should not be used as a general RPC
    > mechanism.


    This is absolutely correct. Marshal is more secure than pickle
    because marshal *cannot* execute code automatically whereas pickle
    does. The assertion that marshal is less secure than pickle is
    absurd.

    This is exactly why the gadfly server mode uses marshal and not
    pickle.

    -- Aaron Watters

    ===
    why do you hang out with that sadist?
    beats me! -- kliban


  10. Re: marshal vs pickle

    On Oct 31, 6:10 pm, Raymond Hettinger <pyt...@rcn.com> wrote:
    > On Oct 31, 12:27 pm, Aaron Watters <aaron.watt...@gmail.com> wrote:
    >
    > Makes more sense to use cPickle and be done with it.
    >
    > FWIW, I've updated the docs to be absolutely clear on the subject:
    >
    > '''
    > This is not a general "persistence" module. For general persistence
    > and...


    Alright already. Here is the patched file you want:

    http://nucular.sourceforge.net/kisstree_pickle.py

    This will make all your nucular indices portable across python
    versions and machine architectures. I'll add this to the
    next release with a bunch of other stuff too.

    By the way there is another module that uses marshal for
    strictly temporary storage in http://nucular.sourceforge.net
    -- but if I change that one the build time for nucular indices
    fully DOUBLES!! That's too much pain for me. Sorry.

    Also, it's always been a mystery to me why Python can't
    keep the marshal module backwards compatible and portable.
    You folks seem like pretty smart programmers to me. If
    you need help, let me know. It's a damn shame Python doesn't
    have a serialization module with the safety, speed, and
    simplicity of marshal and also the portability of pickle.
    I guess I have to live with it.
    -- Aaron Watters

    ===
    Wow, do you play basketball?
    No, do you play miniature golf?
    -- seen in Newsweek years ago

