Monday, April 28, 2008
HDF Group's Hierarchical Data Format (HDF5) Library
I've been working with The HDF Group's HDF5 (Hierarchical Data Format) library for a little while now. It is a mechanism for managing self-describing data collections, no matter how large or complicated. From their website, here are a few features:
- A versatile data model that can represent very complex data objects and a wide variety of metadata.
- A completely portable file format with no limit on the number or size of data objects in the collection.
- A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
- A rich set of integrated performance features that allow for access time and storage space optimizations.
- Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
I'm using the HDF5 library in a stock market research and trading platform I'm developing in C++. The library is used to store Bars, Quotes, Trades, and MarketDepth. Each of these data types uses ptime from the Boost DateTime library for time referencing.
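As a rough illustration, here is what such a record might look like. The field names and layout are my own sketch, not the platform's actual definitions:

```cpp
#include <boost/date_time/posix_time/posix_time.hpp>

// Hypothetical bar record: one time-stamped price/volume observation.
struct Bar {
    boost::posix_time::ptime time;   // time reference via Boost DateTime
    double open;
    double close;
    int    volume;
};
```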
I've been able to use C++'s container and iterator concepts to write a read/write container with appropriate custom random-access iterator capabilities. This allows me to use STL (Standard Template Library) algorithms such as upper_bound, lower_bound, and equal_range to quickly search for selected sub-ranges of the various data types.
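For example, assuming the Bar struct sketched above and a container sorted by time, the range selection looks something like this (a plain std::vector stands in for the custom HDF5-backed container):

```cpp
#include <algorithm>
#include <utility>
#include <vector>
#include <boost/date_time/posix_time/posix_time.hpp>

using boost::posix_time::ptime;

// Comparator that lets the STL search algorithms compare a Bar
// against a bare ptime in either argument order.
struct BarTimeLess {
    bool operator()(const Bar& b, const ptime& t) const { return b.time < t; }
    bool operator()(const ptime& t, const Bar& b) const { return t < b.time; }
};

// Given bars sorted by time, return the half-open sub-range covering
// times in [begin, end]; the same idea works on any random-access iterator.
std::pair<std::vector<Bar>::const_iterator, std::vector<Bar>::const_iterator>
BarsBetween(const std::vector<Bar>& bars, const ptime& begin, const ptime& end) {
    return std::make_pair(
        std::lower_bound(bars.begin(), bars.end(), begin, BarTimeLess()),
        std::upper_bound(bars.begin(), bars.end(), end, BarTimeLess()));
}
```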
From a version perspective, I started out with the relatively new 1.8.0 rc5 release and have recently upgraded to the 1.9.3 release. The more recent 1.9.4 release appears to have link problems. The web pages only show downloads for 1.8.0, but with a little extra digging there is an HDF5 snapshot server available.
Building the HDF5 library on Windows is not too difficult. The hardest part is finding the build documentation, which is located in the /release_docs directory of the extraction. I used tar on my Cygwin install to expand the HDF5 distribution file, but recent versions of WinZip or 7-Zip should also be able to handle it on a Windows machine. Building the 1.9.3 version was easier than building 1.8.0 rc5, where I ran into several missing-file issues.
One key point is to download both zlib and szlib and put them in directories where the build can find them, otherwise the HDF5 library won't build. Two environment variables are required:
- HDF5_EXT_SZIP=szlibdll.lib
- HDF5_EXT_ZLIB=zlib1.lib
To start the build process, run the copy_hdf.bat file. Then, in Visual Studio, open the windows/proj/all/all.sln file, select the debug library build configuration, and build the solution. After the build, run installhdf5lib.bat and you'll find the libraries and includes under hdf5lib/debug and its siblings. I copy the .dlls into my project's debug directory and use tools->options->c++ general->include files to point to the include file directory.
In order to use the library, one has to be aware of dataspaces (the rank and size of data structures), compound types (e.g., a bar is composed of time, open, close, and volume), datasets (the data as stored on the drive), and properties (descriptors for tuning how data is stored).
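Here is a sketch of how a compound type for the bar record might be declared through the HDF5 C++ API. Storing ptime as a 64-bit integer count is my assumption about a reasonable mapping; it is not something HDF5 provides for you:

```cpp
#include "H5Cpp.h"

// On-disk record layout. The ptime field is assumed to be converted
// to a 64-bit tick count (e.g. microseconds since an epoch) before
// writing; HDF5 does not do that conversion itself.
struct BarRecord {
    long long time;
    double    open;
    double    close;
    int       volume;
};

// Build a compound datatype whose member offsets mirror BarRecord.
H5::CompType MakeBarType() {
    H5::CompType t(sizeof(BarRecord));
    t.insertMember("time",   HOFFSET(BarRecord, time),   H5::PredType::NATIVE_LLONG);
    t.insertMember("open",   HOFFSET(BarRecord, open),   H5::PredType::NATIVE_DOUBLE);
    t.insertMember("close",  HOFFSET(BarRecord, close),  H5::PredType::NATIVE_DOUBLE);
    t.insertMember("volume", HOFFSET(BarRecord, volume), H5::PredType::NATIVE_INT);
    return t;
}
```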
I've been able to write a vector of Bar objects out to a dataset by being particularly careful in describing the in-memory datatype versus the on-drive datatype. HDF5 then takes care of handling the various offsets of the base values (time, double, int) as they are written from the class to the drive and back again. This self-describing dataset allows an HDF5 datafile to be created on a little-endian machine and then read on a big-endian machine with no problems.
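A minimal write might look like the following, reusing BarRecord and MakeBarType() from the sketch above; the file and dataset names are placeholders:

```cpp
#include <vector>
#include "H5Cpp.h"

// Write a vector of BarRecord out as a one-dimensional dataset.
void WriteBars(const std::vector<BarRecord>& bars) {
    if (bars.empty()) return;

    H5::H5File file("market.h5", H5F_ACC_TRUNC);   // placeholder file name

    hsize_t dims[1] = { bars.size() };
    H5::DataSpace space(1, dims);                  // rank 1, one entry per record

    H5::CompType type = MakeBarType();
    H5::DataSet dataset = file.createDataSet("bars", type, space);

    // The same compound type doubles as the in-memory description here;
    // HDF5 handles any memory-to-file layout conversion.
    dataset.write(&bars[0], type);
}
```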
Another interesting capability of the HDF5 library is in how the data is stored. As mentioned before, compression can be enabled with zlib (szlib has some limitations; it would not work with my chunked data). Further gains come from what HDF5 calls the 'shuffle' filter. I've been using data records which are identical in length, and when you look at a series of such records, you'll find that a number of byte positions are identical: they could be all zeros, or some other repeated value if the data falls within a narrow range across a series of records. Shuffling groups these columns of bytes together, which serves as a convenient first-order level of compression before the more generic zlib flavor of compression is applied. Large datasets can use minimal storage when these two concepts are combined. I haven't done heavy testing, but I think I've seen a 50% reduction in space usage when I turned these on. With chunk-size tuning (a chunk being a specific number of records in a block), I could probably reduce storage requirements further, though there will be access-time considerations to handle as well.
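Assuming the HDF5 C++ API as in the earlier sketches, enabling chunked storage, the shuffle filter, and zlib compression comes down to a dataset creation property list. The chunk size and compression level below are illustrative guesses, not tuned values:

```cpp
#include "H5Cpp.h"

// Dataset creation properties: chunked layout, the shuffle filter,
// then zlib ("deflate") compression applied per chunk.
H5::DSetCreatPropList MakeBarProps() {
    H5::DSetCreatPropList plist;
    hsize_t chunk[1] = { 1024 };   // records per chunk -- a tuning guess
    plist.setChunk(1, chunk);
    plist.setShuffle();            // line up the identical byte columns
    plist.setDeflate(5);           // zlib compression level (1-9)
    return plist;
}
// Pass the result as the fourth argument to createDataSet() to apply it.
```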
It has taken some time to understand the concepts and subtleties of the HDF5 library, but now that I have, coupling it with C++ class and metaprogramming capabilities and suitable abstractions makes it possible to build quite powerful data analytics.
As one more highlight, there is a Java program available called HDFView which can be used to view any HDF5 datafile. It shows just how well the self-describing concept works, and it is also useful as a debugging aid when creating data descriptions and datasets.