Hi, it was suggested that we should publish this question here. We have been working with Amit and Maxim on HDInsight, but this question is more related to storage. Apologies if there is a more suitable Connect section for storage, but I was unable to find one.
Hi Amit and Maxim, I could do with some guidance on optimising access to VHDs. I will explain a bit about where we are at present.
After the Pig/Hadoop process finishes we are left with some Avro files on blob storage. These files are downloaded and read by a process that creates a set of files in our own format, which the “database” runs off.
So for example we download networkdata_20130101_00000.avro and the process creates:
networkdata_20130101_00000.bin – structures representing the records from the Avro file, stored in a format which can be read into memory (or blitted directly if the data structure allows it)
networkdata_20130101_00000.bin.mmfidx – an index mapping the record number to an offset within the .bin file
and a series of files such as
networkdata_20130101_00000.bin.advertiserid.avl – a binary search tree index keyed on certain properties (in this case the advertiserid), also mapping to an offset within the .bin file.
We have these files for the network data and also have the reference lookup data.
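To make the layout concrete, here is roughly how a lookup resolves through the .mmfidx file (an illustrative sketch only — the names, the fixed 8-byte-offset index layout, and the fixed record length are assumptions, not our exact code):

```csharp
using System.IO.MemoryMappedFiles;

static class RecordReader
{
    // Hypothetical lookup: the .mmfidx file is assumed to hold one 8-byte
    // offset per record number; the offset points into the .bin file.
    public static byte[] ReadRecord(MemoryMappedFile idx, MemoryMappedFile bin,
                                    long recordNumber, int recordLength)
    {
        long offset;
        using (var idxView = idx.CreateViewAccessor(recordNumber * sizeof(long), sizeof(long)))
            offset = idxView.ReadInt64(0);

        // Read the record bytes from the .bin file at that offset.
        var buffer = new byte[recordLength];
        using (var binView = bin.CreateViewAccessor(offset, recordLength))
            binView.ReadArray(0, buffer, 0, recordLength);
        return buffer;
    }
}
```

The .avl files work the same way at the final step: the tree search produces an offset into the .bin file instead of a record number.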
After the files are created they are added to four “master” VHDs: one for the network data .bin files, one for the network data index files (.mmfidx, .avl, etc.), and equivalents for the reference data. These VHDs are snapshotted when the file uploads have completed.
The test dataset is in the 100s of GB range.
Currently we deliver the data service via two Extra Large web roles, each of which has its own four VHDs cloned from the latest snapshot of the masters via cloudblob.CopyFromBlob. When the VHDs are mounted, each is assigned its own local resource to use as a disk cache.
These roles access the individual files as memory-mapped files/memory-mapped views.
The roles can be updated by hitting a URL which instructs the nodes to update their own VHDs by re-cloning the latest snapshots of the masters.
To warm the disk caches, I scan through all new files on the VHD (by comparing with a tracking file stored in the role’s local temp directory) and read each whole file in 4 MB chunks (using a memory-mapped stream), which I hope transfers all of the “remote” VHD data into the local disk cache. However, I have noticed that even after the warm has completed there is sometimes a lot of network IO fetching regions of the VHDs, which really slows down the service.
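The warming pass is essentially the following (a simplified sketch — the real code also consults and updates the tracking file, which is omitted here):

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

static class CacheWarmer
{
    // Stream an entire file through a memory-mapped view in 4 MB chunks,
    // in the hope that the reads pull the remote VHD pages into the
    // role's local disk cache. The data itself is discarded.
    public static void WarmFile(string path)
    {
        const int ChunkSize = 4 * 1024 * 1024; // 4 MB
        var buffer = new byte[ChunkSize];

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var stream = mmf.CreateViewStream())
        {
            while (stream.Read(buffer, 0, ChunkSize) > 0)
            {
                // No-op: the read alone should populate the cache.
            }
        }
    }
}
```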
Can you shed any light on why the role’s local disk cache seems to be invalidated even though the files are immutable? Perhaps the disk cache doesn’t work the way I think it does?
Since there is this extra network activity, I would like to optimise the actual reads from the VHDs. Can you suggest what range we should look at for the buffer size when creating the FileStream underlying the MemoryMappedFiles (from memory we are using this overload: http://msdn.microsoft.com/en-us/library/7db28s3c.aspx)? E.g. the PageBlob pages are 512 bytes, but I assume we want a much bigger buffer in order to reduce the number of actual requests and the overall latency. At the moment I think we are using 32 KB. Is there any guidance available?
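For clarity, this is roughly how we open the files today (a sketch from memory — the variable names are illustrative, and bufferSize is the parameter we are asking about):

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

static class FileOpener
{
    // bufferSize of the underlying FileStream is what we would like guidance
    // on: currently 32 KB, versus the 512-byte PageBlob page size.
    public static MemoryMappedFile OpenReadOnly(string path, int bufferSize = 32 * 1024)
    {
        var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                FileShare.Read, bufferSize);

        // The MemoryMappedFile takes ownership of the stream (leaveOpen: false).
        return MemoryMappedFile.CreateFromFile(fs, null, 0,
            MemoryMappedFileAccess.Read, HandleInheritability.None, false);
    }
}
```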