Designing binary file formats
A magic number is usually a sequence of 2 to 4 bytes that more-or-less uniquely identifies a binary file format. Some attempt must be made to avoid values that are likely to occur naturally, so using a string of ASCII text characters is a poor choice if the file could be mixed with text documents.
As storage capacities have increased, short magic numbers have been replaced with slightly longer strings. Identification bytes are most useful on systems that don't have strong typing, such as a UNIX filesystem. On a Macintosh HFS filesystem it's hard to divorce a file from its file and creator types, but under Windows you can change the type of a file by renaming it. They remain useful on all systems as a sanity check: If the file type information in the filename is lost, perhaps during a network data transfer, you can make a highly educated guess at the file contents.
Some would argue that they're unnecessary for "internal use" data in a closed system, but during development they work like an "assert" statement, immediately notifying you if you try to load the wrong kind of file or a file that was severely corrupted. The all-time coolest identification string goes to the PNG graphics file format, whose 8-byte signature looks like this: 89 50 4E 47 0D 0A 1A 0A. The first byte has the high bit set, so transfers that strip bit 7 damage it; the next three spell "PNG" and make it obvious to human eyes that this is a PNG file.
The second-to-last byte is a Ctrl-Z, used on some systems as an end-of-file marker in text files. Not only does it detect improper text handling, it also stops you from getting a screen full of garbage if you "type" the file under MS-DOS.
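As a sketch of how a loader might verify such a signature (the eight signature bytes are from the PNG specification; the function and the way it is called here are purely illustrative):

    PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # 89 50 4E 47 0D 0A 1A 0A

    def looks_like_png(path):
        """Return True if the file starts with the 8-byte PNG signature."""
        with open(path, "rb") as f:
            return f.read(8) == PNG_SIGNATURE

If the file has been mangled in transit (bit 7 stripped, line endings translated), the comparison fails immediately instead of producing garbage later.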
ASCII file formats can benefit from ID strings as well, because applications that read them can tell immediately whether they have the right kind of file, and when receiving a data stream over a network, the ID string identifies the nature of the incoming data.

A header checksum can immediately follow the magic number, applied to everything that follows the checksum and precedes the start of data (as identified by an "offset to data" field).
It lets you know with a high degree of certainty that what you are reading from the file's header matches what was written. Many developers view internal checksums as unnecessary, and they have a valid point. However, there are still valid reasons for checking your headers. For one, storage is less stable than you might think. It is not unheard of for data to be damaged as it is written, usually by a flaky SCSI or IDE cable, with errors so rare that they aren't noticeable in many types of files.
You may not notice that your text file has become a "tixt" file or that your vacation picture has a few extra spots, but a 32-bit CRC is rarely fooled by such errors.
Having bytes altered or shuffled around can cause all sorts of interesting problems. With a header checksum, you can immediately detect any damage to the header. If you trust the code creating the file, you can assume the header is valid, and reduce the amount of error checking your code has to perform.
Storing the CRC at the end of the header instead allows it to cover the entire header, including the magic number (though testing that again is redundant); more importantly, it allows the header to be written as a stream over a network. You can compute the CRC as you write the header bytes out, and don't have to seek back to fill it in. Generally speaking, file headers are small and don't require this type of treatment, but it's something to keep in mind.
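A minimal sketch of that streaming approach (the magic number, field layout and the choice of CRC-32 are illustrative assumptions, not a description of any real format):

    import struct
    import zlib

    def write_header(out, version, data_offset):
        """Write a toy header: magic, version, data offset, then a trailing
        CRC-32 computed over everything written before it."""
        crc = 0
        for chunk in (b"MYFM",                            # made-up magic number
                      struct.pack("<HI", version, data_offset)):
            out.write(chunk)
            crc = zlib.crc32(chunk, crc)                  # update the CRC as we stream
        out.write(struct.pack("<I", crc & 0xFFFFFFFF))    # the CRC goes last

Because the CRC is emitted last, the header can be generated as a stream; a reader recomputes the CRC over the same bytes and compares it with the stored value.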
The version number should be the most obviously necessary field. Applications and file formats evolve over time, and it's important to be able to determine whether the contents of a file are readable. The serial approach uses a single value, often stored in one byte, starting at 0 or 1 and incrementing with each change. The application handles what it recognizes as the "current" version, as well as some set of prior versions, but rejects anything newer than it understands.
The alternative is a major/minor scheme. The major version works like the "serial" version: anything older can usually be handled, anything newer is rejected. The minor version starts at 0 for each major version and goes up when new fields are added; older fields are left untouched, and are filled out even when obsolete. This approach is useful for compatibility in both directions: if the file's minor version is lower than what the application expects, the application still knows how to parse the file; if the file's minor version is greater, the application knows that all of the fields it is aware of are present, and it can simply ignore the newer ones.
If a major file redesign is necessary, updating the major number prevents older applications from trying to parse newer files. File version numbers should not be tied to application version numbers, and do not need to be hairy multi-part strings; one or two steadily increasing values are sufficient.
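The policy is simple enough to sketch in a few lines (the version numbers and the mention of a legacy path are hypothetical):

    SUPPORTED_MAJOR = 2   # version this reader was written for (assumed)
    SUPPORTED_MINOR = 3

    def can_read(file_major, file_minor):
        """Reject unknown major versions; minor version differences are tolerated."""
        if file_major > SUPPORTED_MAJOR:
            return False          # redesigned format we cannot parse
        if file_major < SUPPORTED_MAJOR:
            return True           # older major, handled by a legacy code path
        # Same major: if file_minor > SUPPORTED_MINOR, newer optional fields
        # exist in the file, but everything we know about is still present.
        return True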
It is possible to avoid explicit version numbering if you provide for it in other ways; for example, the PNG format uses named "chunks" that a reader can skip when it does not recognize them.

A hex dump is the traditional way to inspect binary data by eye. In a hex dump, each byte (8 bits) is represented as a two-digit hexadecimal number. Hex dumps are commonly organized into rows of 8 or 16 bytes, sometimes separated by whitespace. Although the name implies base-16 output, some hex dumping software may have options for base-8 (octal) or base-10 (decimal) output.
Some common names for this program function are hexdump, od, xxd and simply dump or even D. The default output of the Unix hexdump program groups bytes into two-byte words, which makes for an ambiguous form of hex dump, as the byte order within each word may be uncertain.
Such hex dumps are useful only in the context of a well-known byte order, or when values are intentionally given in their full form and may occupy a variable number of bytes. When the explicit byte sequence is required, for example for a hex dump of machine code or ROM contents, a byte-by-byte representation is favoured, commonly organized in rows with an optional divider between 8-byte groups; on a modern little-endian x86 machine, the default word-oriented Unix display of the same bytes shows the two bytes of each word swapped.
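A byte-by-byte dump of the kind just described is easy to produce; this sketch (not the source of any of the tools named above) prints an offset, 16 bytes per row with a divider between 8-byte groups, and a printable-ASCII column:

    def hex_dump(data, width=16):
        """Print a byte-by-byte hex dump with offsets and an ASCII column."""
        for offset in range(0, len(data), width):
            row = data[offset:offset + width]
            groups = [row[i:i + 8] for i in range(0, len(row), 8)]
            hex_part = "  ".join(" ".join(f"{b:02x}" for b in g) for g in groups)
            text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
            print(f"{offset:08x}  {hex_part:<{width * 3 + 1}} |{text}|")

    hex_dump(b"\x89PNG\r\n\x1a\n" + b"Hello, binary world!")

Because each byte is printed on its own, the output is unambiguous regardless of the byte order of the machine doing the dumping.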
A related design issue is functional dependency between fields. One option is to make the format polymorphic, so that the whole definition varies and functional dependencies do not arise. An example is seen in image file formats: here colours are heavily quantized, which imposes a kind of complex functional dependency between the RGB values.
It is not wise, then, to store the RGB values for each pixel, but rather to put them in a palette and store indices into it. When extending to true-colour images, the original dependency is broken: we must discard the palette, and the format probably becomes polymorphic, soon complicating the software.
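A toy sketch of the palette idea (not tied to any particular image format): collect the distinct colours once, then store one small index per pixel.

    def palettize(pixels):
        """Turn a list of (r, g, b) tuples into a palette plus per-pixel indices.
        Worthwhile only while the number of distinct colours stays small."""
        palette, index_of, indices = [], {}, []
        for rgb in pixels:
            if rgb not in index_of:
                index_of[rgb] = len(palette)
                palette.append(rgb)
            indices.append(index_of[rgb])
        return palette, indices

    palette, indices = palettize([(255, 0, 0), (0, 0, 255), (255, 0, 0)])
    # palette == [(255, 0, 0), (0, 0, 255)], indices == [0, 1, 0]

The moment true-colour data arrives, the palette no longer pays for itself, which is exactly the dependency break described above.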
Had we thought it through more carefully in the beginning, we might have used another abstraction and avoided the polymorphism. TIFF almost achieves this, albeit at a considerable price in complexity.

Like programs, file formats come in versions and need constant maintenance. Since most coders do not have the time or resources to maintain an extensive file format (say, a database encoding), they define the basic format once and leave extensions to be done in a decentralized fashion.
This is not always bad, but care should be taken with interoperability. If users are permitted to assign new field values, guidelines for doing so should be given, and the fields should be large enough to avoid collisions.
If a file format gains enough popularity and a sufficient user base, one should think about documenting it a bit more formally and possibly handing it over to a body capable of maintaining it properly. This is what has happened with most international standards, and it is usually good for a format. It is only proper, then, that HTML has returned to the standards community.
SGML revisions come roughly at a pace of one per decade—I would guess it was well designed and documented in the first place. In an actual design, the tradeoffs outlined in chapter two need to be resolved. This and related aspects of file formats are brought to the fore here in the third part.
Although this chapter is a bit more application oriented than the other ones, it still contains no code. This is because the only nontrivial code involves shuffling around particular data structures, such as trees, and so was consciously left out.
Still, a lot remains to be said. The behavior of the underlying file system is one of the determining factors in storage performance: no matter how well optimized a particular file format is, the file system is the ultimate bottleneck.
The functions of file formats and file system metadata can also overlap somewhat. This is why one must think rather carefully about what can and cannot be expected of the underlying storage architecture. The usual division of labor is that the file system provides naming and the program provides the structure inside each file; conventions vary from platform to platform, of course.
The implementation of a file system is pivotal in how well it responds to differing usage patterns. Most file systems are based on lists of aggregated disk sectors (clusters). Sometimes the clusters are indexed by trees; sectors can also be aggregated into variable-length runs and then put under a search tree. If files are composed of linked lists of clusters, then either the lists must be kept in main memory all the time or cached efficiently, or random access will be too expensive.
In tree-based file systems, random access is usually very efficient, but serial access may not be as efficient as in the linked-list approach. Again, caching of tree nodes greatly helps the situation. The tree paradigm also introduces some uncertainty in file continuation: files may be sparse, with holes that occupy no disk space, so writing nonzero data to previously zero sectors can suddenly cause the disk to become full.
This is the first instance in which one should be paranoid about what the file system does underneath. The implications for file format design are obvious: first, serial access is always the fastest method in any properly designed file system.
Critical parts of a file format should be serially readable, and the same goes for writing. Second, starting from the beginning of the file is good: most file systems handle this case far more predictably than, for instance, reading from the end of the file. Engineering tradeoffs may nevertheless push some initialization data to the end of the file. Grouping commonly used data in one location improves efficiency, since the data may then be kept in the file system block cache.
As access is block based, doing small discontinuous reads is not sensible. On the other hand, doing long continuous reads of data which is then immediately discarded is even worse. This is why proper skipping and indexing are important in big files—in current multiprocessing environments, messing up the block cache with unnecessary data is a mortal sin. The same goes for writing: many small scattered writes should be avoided, and writing should always employ buffering.
Sometimes even doing this inline can be justified. One should rarely store sequences of zeroes: not only is it inefficient, it can also lead to unexpected behavior on UNIX-based systems. As always in software engineering, modularity and code reuse are advisable in storage-related programming. Since we often have to deal with more than one file format, isolating format-dependent code into separate modules makes it much easier to support multiple formats and reuse the libraries in another program.
Because storage is rarely the number one concern in our programs, we usually want to implement it simply and throw as few lines of code at it as possible. This means that reuse of data types and structures is something to be pursued—reuse of structure leads to reuse of code. Small modules have other advantages as well: they are easier to optimize, and optimization is further eased by the fact that using the same structures and code multiple times tends to generate repeatable access patterns which can then be reliably modelled.
Unified error handling is easier if interfaces are compact. Third parties like to have compact libraries with easy, intuitive interfaces. Documentation is easier if the same data structures recur over the documents. Manual checking of the resulting files is viable only if just a few complex data types are used. Packing file access code in neat modules and reusing both data structures and access code is enormously useful. Doing things this way will probably make for a cleaner format, as well—one does not feel so tempted to use irregular structures when compact code is an issue.
Serial access has traditionally been the norm; some people might say the situation is now reversed and all access should be considered random. This is a valid point, but a weak one. For one, random access is always less efficient than serial, especially when the operating system prefetches contiguous blocks of data and when the underlying medium is serial in nature, like CDs and tape. Second, serial access tends to lead to less fragmentation, since files are necessarily written in a single, swift, linear pass.
Third, random access fares especially badly when only frugal resources are available. If the environment lacks extensive buffering and multitasking facilities, seek and rotational latency, interrupts and task switches to driver code, clogged asynchronous command queues (like the ones used with bus-mastering SCSI cards), drive-level cache thrashing and the like become major problems. In particular, time is simply wasted on head movement if only one program is running at a time, since the intervening disk sectors crossed during a seek will not be utilized.
Serial access also makes better use of the bookkeeping mechanisms of certain file systems (e.g. HPFS), whose accounting uses sector runs and heavy caching of allocation structures.
Furthermore, serial access is so much more deterministic overall that even hard realtime guarantees can be achieved. This means truly dependable multimedia and realtime applications are difficult to build on top of random access file formats.
Summa summarum, one should always strive for serial access, even if it can be achieved only partially. On the other hand, not all things can be expected to translate smoothly into serial constructs. One excellent example is updatable structures, discussed below.
To get some perspective on the problems of serial access and updates, one should check out the design of the UDF file system. From a format perspective, there are three main ways to speed up file access.
The first is to reduce the number of blocks fetched, which amounts to minimizing the retrieval of useless data, i.e. data that is read in but never actually used.
The second is to access data blindly, i.e. without having to parse everything that precedes it. This comes down to pruning structures which must be parsed in order to be accessed: pointers, polymorphism, variable-length fields and discontinuous streams. The third is to anticipate which data will be needed next, so that it can be fetched ahead of time.
All these require taking advantage of expected usage patterns. Since disk latency, not CPU time, dominates, the implication is that we can burn both cycles and bytes quite liberally to achieve speed. This way, data compression, complex storage models, indexing and redundant representations can be justified.
To avoid moving around something which is not of immediate use to us, logical grouping should show up in the encoding as well. This way it is likely that once a disk block is retrieved, it will be used in its entirety. Dimensionality and locality give good guidelines. For instance, storing an editable picture by scanline would never be as good as using a tiled organization. This is because most editing operations tend to obey the principle of locality—pixels close to each other are affected together.
Since pictures are 2D, locality works in terms of distance on the plane, not 1D scanlines. This is precisely the reason we use BSPs, octrees, bounding boxes etc. Distances do not directly translate to locality, of course—if our algorithms do not work in terms of dimensions or distances, some other organization can give better performance. For example, when displaying an unpacked animation, we would store by scanline nevertheless.

If we wish to be able to fetch or store files without explicitly parsing each and every byte, we must use structures which allow blind access.
This means fixed-length fields and the absence of options. To optimise our use of the file system interface, which is invariably block oriented, we need to align our data blocks. Usually we do not know the exact block size of the device our data will be stored on, but a sufficiently large power of two is a good guess. Often individual fields are aligned too, but on smaller boundaries.
The rationale is that certain architectures have trouble handling fields that are not aligned on a word or double-word boundary. In my opinion, concerns such as these are seldom important in transfer file formats—since the main bottleneck in storage is disk latency, twiddling nonaligned fields is a minor concern. This is especially so since most storage-related code is written in high-level languages which usually automate the handling of nonaligned fields.
On the other hand, alignment costs little compared to the enormous disk sizes of today—RISC and Motorola advocates would probably bury anyone with a differing opinion.
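If alignment is wanted, padding each section out to the next block boundary costs only a few lines; a sketch (the 4096-byte block size is merely a guessed power of two, as suggested above):

    BLOCK_SIZE = 4096   # assumed block size: a sufficiently large power of two

    def pad_to_block(out):
        """Pad the output file with zero bytes up to the next block boundary."""
        remainder = out.tell() % BLOCK_SIZE
        if remainder:
            out.write(b"\x00" * (BLOCK_SIZE - remainder))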
As the major problem in storage subsystem performance is disk seek latency, not transfer rate (nowadays, at least), we see that if we can anticipate which data are needed next, we can pipeline and gain performance.
File systems frequently use heuristics to achieve such an effect—they commonly prefetch the next block in line on the assumption that it will soon be needed. This is good, but not enough: the file system cannot know the application's actual access pattern. Hence it would be nice if we could control the prefetch logic ourselves.
Unfortunately many common storage APIs do not offer the programmer any such control. If we know which data will be needed next, we can ask for it ahead of time. On the down side, such explicit prefetching necessitates holding the data in application memory until it is needed. It also implies some rather intricate parallelism and all that comes with it.
The only way to hide the latency completely is to already have the data in memory when we need it. Depending on access patterns there are many ways to accomplish this. Prefetching and pipelining were already touched upon in the last chapter; caching is what is left, then. Not all programs benefit from caching: a program that passes over its data once and never returns to it gains nothing, and in this case caching does not help. On the other hand, most larger programs use—or could be made to use—the same data over and over.
Examples are found in databases, text editors, configuration management, graphics and 3D applications, file compression and archiving, and more. In this case, keeping often-used data in core helps a lot. Optimal caching requires knowledge the operating system (or the application itself, for that matter) does not have; thus less must often suffice. Combined with some simple prefetching, a less than optimal cache usually does well enough, but only on average. There are situations where a single program may cause significant thrashing and degrade the performance of the whole system.
The negative side is that programs become more complex if caching is done in application space, and we end up using a lot of memory, too. In the extreme, we may force the operating system to page out. This is the primary benefit of OS-side caching versus more customized schemes: the system can adapt to low-memory conditions in a way the application cannot.
In addition to that, a global caching scheme yields a significant benefit if multiple programs share access to a file. Program state is another thing, closely related to caching. In the latter case, applications become more memory intensive and often require more careful handling of error conditions and special cases—keeping program state in sync with the external files takes some effort.
In the former case they depend on the cache of the operating system, generate extraneous disk accesses (metadata lookups and file updates) and generally come to require random access. The problem is, if the program does not keep track of where different data are in a file, it will have to seek for what it needs. It seems statefulness is purely a programming tradeoff, but caching and intended program state often dictate how external data is encoded.
If both low-memory operation and efficient random access are required, there will usually be some redundancy in the file format. Some format designs betray an inconsistency here, expecting the application to be stateful but neglecting to support easy buildup of that state.
This is bad, since such an approach does not scale. Text based formats are a pathological example—ever tried to build an efficient editor for text files of arbitrary size?
Since text editors deal with lines, one needs state to tell where the lines are; text files have no indices, so one must read the entire file in order to move around in the text at all. Using such an encoding as a basis for another format is dubious. As a worst-case scenario, some formats have even used line numbers as pointers—extremely bad, since now one needs to browse through the entire file to number the lines before proceeding to read the data.
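The state a text editor needs is essentially a line index, and building one takes a single serial pass over the file; a sketch (the helper names are illustrative):

    def build_line_index(path):
        """One pass over a text file, recording the byte offset of each line."""
        offsets = [0]
        with open(path, "rb") as f:
            for line in f:
                offsets.append(offsets[-1] + len(line))
        return offsets[:-1]          # offsets[i] is where line i starts

    def read_line(path, offsets, lineno):
        """Random access to a single line via the precomputed index."""
        with open(path, "rb") as f:
            f.seek(offsets[lineno])
            return f.readline()

A format that stored such an index alongside the text would spare every reader this initial pass.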
Most people would agree small files are a good thing. Although storage space is ample today, there is always a pressure to make our data smaller.
There are many ways to accomplish this, and all of them reduce redundancy in the resulting file, sometimes by sacrificing insignificant data in the process; this might be called perceptual redundancy. Fields which assume default values may be made optional, and variable-length fields employed.
At the expense of byte alignment, some fields may be packed more densely, and redundant encodings can be collapsed into a more efficient form. Sometimes the data structures one uses contain hidden redundancy.
For instance, one can code binary tree structures extremely efficiently by sorting and storing branch lengths only. This is the technique pkzip uses to store its static Huffman trees. Similarly effective constructs exist for many nontrivial data structures, such as graphs. In general, tailor made structures cut back on storage requirements.
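The pkzip trick works because a canonical Huffman code is fully determined by the code length assigned to each symbol, so only the lengths need to be written out. A sketch of the reconstruction (this mirrors the general idea used in DEFLATE, not pkzip's actual code):

    def canonical_codes(lengths):
        """Rebuild canonical Huffman codes from per-symbol code lengths alone.
        Codes are assigned in order of (length, symbol value)."""
        codes, code, prev_len = {}, 0, 0
        for sym, length in sorted(enumerate(lengths), key=lambda p: (p[1], p[0])):
            if length == 0:
                continue                     # unused symbol
            code <<= (length - prev_len)     # step up to the next code length
            codes[sym] = format(code, f"0{length}b")
            code += 1
            prev_len = length
        return codes

    print(canonical_codes([2, 2, 3, 3, 3, 3]))
    # {0: '00', 1: '01', 2: '100', 3: '101', 4: '110', 5: '111'}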
Such tailor-made structures help especially when complex storage organizations are used. Examples are found in scientific data formats, since they are designed to hold highly structured data; sparse arrays are a prime example. All the measures mentioned thus far remove what might be called external redundancy, redundancy in the way the data is represented. Nothing is said about the redundancy of the data itself, though.
Substituting binary fields for numeric character strings is good, but does not suffice if most of our numeric data consists of zeroes. Compression is the tool of choice for reducing redundancy of this kind. If perfect reconstruction can be sacrificed, lossy methods can be employed. These most often come in the guise of vector quantization, either in some suitable transform domain (frequencies, as in the case of the DCT) or not (as with the difference signal in CELP).
In some cases straight adaptive quantization does the trick. Even better performance results if perceptual models are used to tune the quantization process (say, MP3).
The combinations and ramifications are endless. When choosing a compression scheme, many factors come into play. One must think about the computational burden (the higher the compression ratio, the more data each outgoing byte depends on), memory requirements (good compression equals tons of model state), and intellectual property issues.
It is also important to note that compression is generally much slower than decompression; MPEG encoders are a notorious example. With such heavy methods, both ends of the pipe tend to slow down. Compression really does not rhyme with indexing, random access, small memory footprints or easy modification, and quirky compression methods are one excellent reason to abandon a file format altogether. Anything that reduces redundancy will also reduce fault tolerance, and this applies to compression as well.
Checksums, error correction, synchronization flags and the like often need to be added afterwards to ensure that the bits survive intact. Often the best way to proceed is to use a simple, reliable compression method, do the compression blockwise to limit error propagation and memory requirements, and rely as little as possible on stuffed fields and pointers. Serialising highly structured data as simple streams is a mistake.
Few sentient coders would even think of encoding a database as a simple stream of fixed-size cells without any extra structure or indices. The key to processing large amounts of data efficiently is to ignore most of it. To do this, we need metadata: pointers, indices, directory trees, maps, keywords and so on. There are many ways to index data. The simplest is to order the records according to some fixed key field. This is seldom a good alternative if field lengths vary or our data are large, but if we know the number of sort items beforehand, it is still worth consideration.
Gathering a separate directory, i.e. a list of keys and pointers to the associated records, is another possibility. Such lists are best suited for serial processing and highly irregular sort keys or, on the other hand, for flatly distributed fixed-length keys.
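A sketch of the separate-directory approach (the record layout, integer keys and the decision to append the directory at the end of the file are all illustrative assumptions): remember each record's offset while writing, then emit a table that lets a reader fetch any record without scanning the rest.

    import struct

    def write_with_directory(out, records):
        """Write (key, payload) records serially, then append a directory of
        (key, offset, length) entries plus a pointer to the directory itself."""
        directory = []
        for key, payload in records:
            directory.append((key, out.tell(), len(payload)))
            out.write(payload)
        dir_offset = out.tell()
        out.write(struct.pack("<I", len(directory)))
        for key, offset, length in directory:
            out.write(struct.pack("<III", key, offset, length))
        out.write(struct.pack("<I", dir_offset))   # trailer: where the directory starts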