Serialization/Deserialization (C# 3.5 but probably affects all .NET) by Daem0hn


Status: Closed as Fixed
Type: Suggestion
ID: 334117
Opened: 3/23/2008 9:46:07 PM
Access Restriction: Public
Workarounds: 1

Description

I was using C/C++ on Linux to do data processing and fuzzy statistical analysis, and from this analysis to make deductions or estimations about future behaviour.

The dataset is approximately 350 million data points, all connected in an interrelated fashion.

Due to the size of the data, it was necessary to use a significant amount of file I/O, which, as I am sure you are aware, is a very time-consuming operation.

My solution was to precompute as much as possible, creating a much larger dataset with a significant amount of redundant data, stored in a fashion that could be serialized while maintaining pseudo-links between datasets. C/C++ did not handle the serialization well, and the approach was becoming increasingly complex, so I ported to my favourite OO language, C#.

It worked great. Easy to program and easy to update. The first problem I encountered was memory. The Linux-based C++ algorithm I was using was designed to use 3.1 GB of memory (the largest amount accessible on a 32-bit OS that still leaves enough for the OS and system, on Linux anyway). My C# program would fail at between 1.6 GB and 2 GB of memory usage. I believe this is a restriction placed upon it by Windows. So I had to adjust my algorithm. Not too much of a problem, seeing as file I/O is such a breeze in .NET, as is binary serialization.

The next problem, and the reason for this suggestion is as follows.

struct TheStruct { short a; byte b; .... } // I was using a class, but the array took up too much room (2.5 GB by my calculation - a bit of an annoyance seeing as I had 4 GB free) and hence an OutOfMemoryException occurred.

TheStruct[][] TheArray = new TheStruct[490000][];

etc etc etc
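Roughly, the rows were then allocated and filled one at a time - something like the sketch below (the inner length here is just a placeholder, not the real row size):

for (int i = 0; i < TheArray.Length; i++)
{
    TheArray[i] = new TheStruct[1000];   // placeholder length; each row is a separate object on the managed heap
    // ... fill TheArray[i] from the precomputed data ...
}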

That array took 5 hours to compute (massive amounts of file I/O) and is stored in a 600 MB data structure in memory. Coming from a C++ background, I assumed this would be stored as a contiguous array of short, byte, short, byte, short, byte (as per the struct), with a little bit of OO overhead, but that this overhead would be included within the 600 MB total memory usage of the program.

I then binary-serialized this array so I didn't have to do the calculations again.
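Roughly, that step was just a BinaryFormatter call over a FileStream - something like this sketch (TheStruct has to be marked [Serializable] for this to work):

using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static void SaveArray(TheStruct[][] data, string path)
{
    // Writes the whole jagged array out as one object graph.
    using (FileStream fs = File.Create(path))
    {
        new BinaryFormatter().Serialize(fs, data);
    }
}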

It binary-serialized to a 1190 MB file. Okay, strange - twice the size of the data structure in memory. But perhaps it contained some (read: a lot of) data structure overhead plus the serialized file signatures that protect the data outside the program (which is a great idea btw, good work on implementing it).

The next issue came with the deserialization: an OutOfMemoryException. The deserialized file was taking up 1.8+ GB at the point of failure. Considering the initial program usage was 600 MB in total, including all program overheads, how can the same data loaded back into the program use over 3x the amount of memory?
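In code, the failing step looked roughly like this sketch:

static TheStruct[][] LoadArray(string path)
{
    using (FileStream fs = File.OpenRead(path))
    {
        // Rebuilds the full object graph in memory before returning.
        return (TheStruct[][])new BinaryFormatter().Deserialize(fs);
    }
}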

My suggestion would be to complement the ease of use of serialization in C# with comparable efficiency: a 600 MB data structure should take up the same amount of RAM whether it was deserialized or calculated on the fly.


Other than that, great language!
Details
Posted by Brian Grunkemeyer - MSFT on 4/11/2008 at 5:38 PM
Thanks for the feedback. There is a minor drawback with serialization (discussed at the very end), but I'd like to talk about how you're storing your data & the OutOfMemoryException that you saw. Unfortunately, because your data set is so large, I'm not sure there's a silver bullet here. What you are doing is on or over the edge of what is supportable on a 32 bit OS, and on a machine with a small amount of memory, will likely hurt your performance because you're spending most of your time paging. Hopefully I can explain why, and I have a suggestion for writing your app a little differently.

First, let's talk about address space limitations. As you may know, address space is the total amount of virtual memory available to a process, and it is limited by the size of a pointer. On 32 bit machines, the pointer size is 32 bits, which gives you 2^32 or 4 GB of total addressable memory (regardless of the physical memory on your box - the wonder of virtual memory is that your address space & physical memory are independent from each other). On Windows, address space is broken into a user section & a kernel section. The user part of the address space is where all of your heap allocations & stack space for each thread are allocated from, as well as where all your libraries get mapped. The kernel space is reserved for kernel-level data structures for your process such as page tables, sockets, GDI device contexts, and other arcane implementation details. So this limits you to 2 GB minus the size of your libraries, 1 MB per thread for the stack, and any other heap allocations that might happen in your program (ie, space for the CLR's JIT to compile code or represent your types in memory).
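As a quick sanity check of that arithmetic, a couple of lines like the following (my own illustration) show the pointer size and total addressable bytes for the current process:

// 32-bit process: 4-byte pointers, so 2^32 = 4,294,967,296 addressable bytes (~4 GB),
// of which Windows normally keeps 2 GB for the kernel and leaves ~2 GB for user mode.
Console.WriteLine("Pointer size: {0} bits", IntPtr.Size * 8);
Console.WriteLine("Addressable bytes: {0}", Math.Pow(2, IntPtr.Size * 8));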

So how does this affect your program? In terms of CLR behavior, the garbage collector asks the OS for consecutive address space in segments, which may default to 128 MB at a time. Note that this is reserving entries in the process's address space, not actually allocating pages (aka committing) memory. In practice, I've generally only been able to allocate somewhere between 1.2 GB and 1.5 GB. I've allocated 1.8 GB, but only exceptionally rarely. I think this is proportional to the amount of other work that happens in your process, plus some other factor that I don't yet understand. So, in terms of design points, I would try to never use more than ~1.2 GB at once. (Besides, if your app ever runs on a machine with only 1 GB of memory, you're going to spend a large proportion of your time swapping pages to disk anyways because other processes will use memory too. Paging kills your performance.) You can allocate lots of smaller units of data and then throw them away, but your app is likely to fall over if you need more than this amount of memory.

Windows does have a /3GB switch that will tell Windows to give you 3 GB in the user section of your address space. You have to edit (I think) c:\boot.ini and add /3GB to the boot options for Windows - you might want to search the web for more details before setting this. However, this has some drawbacks. First, it provides less address space for the kernel, so certain things may not work nearly as well. If you enable this on a server handling many incoming network connections, you may run out of memory faster. And secondly, if you have poorly written programs that accidentally compare pointers as signed numbers instead of unsigned, then you might run into all kinds of problems.
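For illustration only, the edited line in boot.ini ends up looking something like this (the exact disk/partition string and OS name vary from machine to machine):

[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows Server 2003" /fastdetect /3GB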

A slightly better option is to switch to a 64 bit machine. Then, you can reserve enough consecutive address space for your data. A 64 bit pointer gives you 2^64 bytes of address space - on the order of 16 exabytes. You'll never have that much physical memory, but at least you have the address space.

However, you may be best off storing your data in smaller, non-consecutive chunks. Even if you could get a solution using the /3GB switch, it may not work if your dataset increases in size, and you may be hurting your performance by allocating so much memory. Additionally, allocating very large amounts of consecutive memory is problematic as well - if you allocate an array that requires 1 GB of space, it's very difficult to find enough _consecutive_ address space for that 1 GB of memory. Our perf architect recommends that you switch your data structure to an array of arrays. The idea is you break your data into smaller, fixed-size chunks (like 1 MB long arrays), then make an array of those. If your data is sparsely accessed (ie, you only need a fraction of your entire data set at runtime), then you may only need to fill in a small percentage of your entire data set. Alternately, if you really must access all of the data set, you can implement a least-recently-used cache aging algorithm & tune your code to only keep ~400 rows (ie, 400 MB of data) loaded at once. And building an indexer that works on an array of arrays would be relatively simple. That would be something like this:

row = i / ROW_SIZE;
column = i % ROW_SIZE;
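Fleshing that out a little, a wrapper over the array of arrays might look something like the sketch below. The names and the row size are illustrative only - tune ROW_SIZE so each row is roughly 1 MB for your element size:

// Sketch: a flat index over an array of arrays, filling rows in lazily.
class ChunkedArray
{
    // Roughly 1 MB of records per row, assuming ~4 bytes per element once padded.
    const int ROW_SIZE = 262144;
    private readonly TheStruct[][] rows;

    public ChunkedArray(long totalElements)
    {
        int rowCount = (int)((totalElements + ROW_SIZE - 1) / ROW_SIZE);
        rows = new TheStruct[rowCount][];            // rows start out null
    }

    public TheStruct this[long i]
    {
        get
        {
            int row = (int)(i / ROW_SIZE);
            int column = (int)(i % ROW_SIZE);
            if (rows[row] == null)
                rows[row] = LoadRow(row);            // only touch disk for the slice you need
            return rows[row][column];
        }
    }

    private TheStruct[] LoadRow(int row)
    {
        // Placeholder: read ROW_SIZE records for this row from the data file.
        return new TheStruct[ROW_SIZE];
    }
}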

You may need to tweak your data files a little. Ideally your data files would be a fixed number of bytes for each object, but if you're using the binary formatter, I wouldn't bet on that. Perhaps you could slice your dataset into one file with X elements in each section, then keep track of offsets to the start of each section for fast random access?
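As a sketch of that idea - assuming for simplicity that TheStruct holds only the short and the byte shown in the description, and that its fields are public - each record is a fixed 3 bytes and you can seek straight to any index:

static TheStruct ReadRecord(string path, long index)
{
    const int RECORD_SIZE = 3;                            // 2 bytes (short) + 1 byte (byte)
    using (FileStream fs = File.OpenRead(path))
    using (BinaryReader reader = new BinaryReader(fs))
    {
        fs.Seek(index * RECORD_SIZE, SeekOrigin.Begin);   // jump straight to the record
        return new TheStruct { a = reader.ReadInt16(), b = reader.ReadByte() };
    }
}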

We think this will give you good performance, and will help you keep your working set under control regardless of how your data set grows over time.

BTW, in terms of the binary formatter, it does have a ~500 byte overhead per type in an object graph. So if you serialize out 100 items individually in their own object graphs (by calling BinaryFormatter.Serialize on each one), you'll have an unexpected 500 * 100 extra bytes in your file. If you serialize all of them out in one array, you'll probably only see ~500-1000 bytes of overhead in total. Serialization is great for truly arbitrary object graphs, but if you have well-structured data & aren't worried about versioning issues (where you might add a new field to your data structure), you may be best off using BinaryReader & BinaryWriter yourself to reduce the size of the file on disk. Alternately, serializing many values out in an array should amortize the fixed overhead of a type-specific header. The serialization guys have thought about separating out the type-specific info from the serialized blob, but they haven't provided that functionality yet - it adds more complexity to the users of serialization.
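The write side of that approach might look like the sketch below (same simplifying assumptions about TheStruct as the reader sketch above): each record costs exactly 3 bytes on disk, with no per-graph headers.

static void WriteRecords(string path, TheStruct[][] data)
{
    using (FileStream fs = File.Create(path))
    using (BinaryWriter writer = new BinaryWriter(fs))
    {
        foreach (TheStruct[] row in data)
        {
            if (row == null) continue;               // skip rows that were never filled in
            foreach (TheStruct s in row)
            {
                writer.Write(s.a);                   // 2 bytes
                writer.Write(s.b);                   // 1 byte
            }
        }
    }
}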

Lastly, if you search hard enough, you'll find that we have a MemoryFailPoint type in the .NET Framework to allow an app to predict whether a memory allocation might fail. I wouldn't use that here - we designed that more for server applications that needed to throttle their performance when processing requests from a queue of incoming work items. I'd prefer limiting some part of your application to a fixed amount of memory, so that you have more predictable & repeatable results when running your application.
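For completeness, the usage pattern is roughly the sketch below - but, as I said, I wouldn't reach for it in this scenario:

using System.Runtime;                                // MemoryFailPoint lives here

try
{
    // Ask up front whether ~600 MB is likely to be available before starting the big unit of work.
    using (MemoryFailPoint mfp = new MemoryFailPoint(600))
    {
        // do the memory-hungry work here
    }
}
catch (InsufficientMemoryException)
{
    // back off, queue the work for later, or degrade gracefully
}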

I hope this was useful,
Brian Grunkemeyer
CLR Base Class Library team