Wednesday, June 12, 2013

ZFS File format

Over the years I've reverse engineered quite a few file formats, but I've never really sat down and picked apart why a format was designed the way it was. With that said, I wanted to show the ZFS archive file format and highlight some of the peculiarities I saw. Perhaps you guys can answer some of my questions.

For some context, Z-engine was created around 1995 and was used on Macintosh, MS-DOS, and Windows 95.

Format
The main file header is defined as:
struct ZfsHeader {
    uint32 magic;
    uint32 unknown1;
    uint32 maxNameLength;
    uint32 filesPerBlock;
    uint32 fileCount;
    byte xorKey[4];
    uint32 fileSectionOffset;
};
  • magic and unknown1 are self explanatory
  • maxNameLength refers to the length of the block that stores a file's name. Any extra spaces are null.
  • The archive is split into 'pages' or 'blocks'. Each 'page' contains, at max, filesPerBlock files
  • fileCount is the total number of files the archive contains
  • xorKey is the XOR cipher used for encryption of the files
  • fileSectionOffset is the offset of the main data section, aka fileLength - mainHeaderLength
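As a rough sketch of how the header above could be read from a byte buffer: the code below decodes the seven fields one at a time instead of casting, then applies the XOR key to a block of data. Several things here are assumptions and not confirmed above: the integers are little-endian, the header is packed to 28 bytes with no padding, and the four key bytes repeat cyclically starting at the first byte of the data. The function names (zfs_read_u32_le, zfs_parse_header, zfs_xor_in_place) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ZFS_HEADER_SIZE = 28 }; /* seven 4-byte fields, assuming no padding */

typedef struct {
    uint32_t magic;
    uint32_t unknown1;
    uint32_t maxNameLength;
    uint32_t filesPerBlock;
    uint32_t fileCount;
    uint8_t  xorKey[4];
    uint32_t fileSectionOffset;
} ZfsHeader;

/* Assumption: integers are stored little-endian. */
uint32_t zfs_read_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the main header. Returns 0 on success, -1 if buf is too short. */
int zfs_parse_header(const uint8_t *buf, size_t len, ZfsHeader *out) {
    assert(buf != NULL && out != NULL);
    if (len < ZFS_HEADER_SIZE) return -1;
    out->magic             = zfs_read_u32_le(buf + 0);
    out->unknown1          = zfs_read_u32_le(buf + 4);
    out->maxNameLength     = zfs_read_u32_le(buf + 8);
    out->filesPerBlock     = zfs_read_u32_le(buf + 12);
    out->fileCount         = zfs_read_u32_le(buf + 16);
    memcpy(out->xorKey, buf + 20, 4);
    out->fileSectionOffset = zfs_read_u32_le(buf + 24);
    return 0;
}

/* Assumption: the key repeats every four bytes from the first data byte.
   XOR is its own inverse, so the same call encrypts and decrypts. */
void zfs_xor_in_place(uint8_t *data, size_t len, const uint8_t key[4]) {
    for (size_t i = 0; i < len; i++)
        data[i] ^= key[i % 4];
}
```

Reading field by field avoids depending on the compiler's struct padding or the host's byte order, at the cost of a few more lines than a straight cast.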

The file entry header is defined as:
struct ZfsEntryHeader {
    char name[16];
    uint32 offset;
    uint32 id;
    uint32 size;
    uint32 time;
    uint32 unknown;
};
  • name is the file name right-padded with null characters
  • offset is the offset to the actual file data
  • id is the numeric id of the file. The ids increment from 0 to fileCount
  • size is the length of the file
  • unknown is self explanatory
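Here is a minimal sketch of decoding one entry header when the name block length comes from maxNameLength instead of being hard-coded to 16. It assumes little-endian integers and that the five uint32 fields follow the name block directly with no padding. ZFS_NAME_CAP is a hypothetical upper bound picked for this sketch, and the function names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ZFS_NAME_CAP = 64 }; /* hypothetical limit for this sketch only */

typedef struct {
    char     name[ZFS_NAME_CAP + 1]; /* always null-terminated here */
    uint32_t offset;
    uint32_t id;
    uint32_t size;
    uint32_t time;
    uint32_t unknown;
} ZfsEntry;

/* Assumption: integers are stored little-endian. */
uint32_t zfs_read_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* On-disk size of one entry header: name block plus five uint32 fields. */
size_t zfs_entry_size(uint32_t maxNameLength) {
    return (size_t)maxNameLength + 5 * 4;
}

/* Returns 0 on success, -1 if the name block is larger than the cap
   or the buffer is too short to hold a full entry header. */
int zfs_parse_entry(const uint8_t *buf, size_t len,
                    uint32_t maxNameLength, ZfsEntry *out) {
    assert(buf != NULL && out != NULL);
    if (maxNameLength > ZFS_NAME_CAP) return -1;
    if (len < zfs_entry_size(maxNameLength)) return -1;

    /* The name may fill the whole block, so add our own terminator. */
    memcpy(out->name, buf, maxNameLength);
    out->name[maxNameLength] = '\0';

    const uint8_t *p = buf + maxNameLength;
    out->offset  = zfs_read_u32_le(p + 0);
    out->id      = zfs_read_u32_le(p + 4);
    out->size    = zfs_read_u32_le(p + 8);
    out->time    = zfs_read_u32_le(p + 12);
    out->unknown = zfs_read_u32_le(p + 16);
    return 0;
}
```

With maxNameLength equal to 16, zfs_entry_size gives 36 bytes, which matches the struct as listed.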

Therefore, the entire file structure is as follows:
[Main Header]
 
[uint32 offsetToPage2]
[Page 1 File Entry Headers]
[Page 1 File Data]
 
[uint32 offsetToPage3]
[Page 2 File Entry Headers]
[Page 2 File Data]
 
etc.
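The layout above could be walked roughly like this. The sketch takes the whole archive in memory and records the byte offset of every entry header, following the next-page offset at the start of each page. The assumptions are mine: integers are little-endian, the next-page offset is absolute from the start of the file, every page except possibly the last holds exactly filesPerBlock entry headers, and pages appear in increasing file order (used here as a guard against looping forever on a corrupt file). zfs_collect_entries is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: integers are stored little-endian. */
uint32_t zfs_read_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Writes the file offset of each entry header into outOffsets.
   firstPage is where page 1 starts (right after the main header).
   Returns the number of entries found, or -1 on a malformed archive. */
long zfs_collect_entries(const uint8_t *file, size_t len, size_t firstPage,
                         uint32_t maxNameLength, uint32_t filesPerBlock,
                         uint32_t fileCount, size_t *outOffsets, size_t cap) {
    assert(file != NULL && outOffsets != NULL);
    const size_t entrySize = (size_t)maxNameLength + 20; /* name + 5 uint32 */
    size_t page = firstPage;
    uint32_t seen = 0;

    if (fileCount > cap) return -1;
    if (filesPerBlock == 0 && fileCount > 0) return -1;

    while (seen < fileCount) {
        /* Each page starts with a uint32 offset to the next page. */
        if (page > len || len - page < 4) return -1;
        uint32_t next = zfs_read_u32_le(file + page);

        uint32_t remaining = fileCount - seen;
        uint32_t inPage = remaining < filesPerBlock ? remaining : filesPerBlock;

        /* Entry headers follow the next-page offset back to back. */
        size_t entries = page + 4;
        if ((len - entries) / entrySize < inPage) return -1;
        for (uint32_t i = 0; i < inPage; i++)
            outOffsets[seen + i] = entries + (size_t)i * entrySize;
        seen += inPage;

        if (seen < fileCount) {
            if (next <= page) return -1; /* assumed: pages only move forward */
            page = next;
        }
    }
    return (long)seen;
}
```

Stopping on fileCount means the sketch never has to guess what the last page stores in its next-page slot.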


Questions and Observations

maxNameLength
Why have a fixed-size name block vs. null-terminated or [size][string]? Was that just the popular thing to do back then so the entire header could be cast directly to a struct?

filesPerBlock
What is the benefit to pagination? The only explanation I can see atm is that it was some artifact of their asset compiler max memory. Maybe I'm missing something since I've never programmed for that type of hardware.

fileSectionOffset
I've seen things like this a lot in my reverse engineering; they give the offset to a section that's literally just after the header. Even if they were doing straight casting instead of incremental reading, a simple sizeof(mainHeader) would give them the offset to the next section. Again, if I'm missing something, please let me know.


Well that's it for now,
-RichieSams

4 comments:

  1. I'm quite glad you're progressing well :)

    Some answers:
    - maxNameLength: it seems to be a design choice, I suppose. I can't figure out a sane reason why they've done it this way, but other engines have done it like that as well. It won't make any major difference, so just go along with it :)
    - filesPerBlock: it probably exists so that the file data can be loaded in memory in blocks. Some languages like COBOL did a similar trick, but with raw byte sizes, i.e. in COBOL data files, you would see etc
    - fileSectionOffset: they probably didn't think too much about this one, and left it in. Just ignore it altogether :)

    So the only useful fields from the global header seem to be:
    - magic, to check if the file is indeed a ZFS file
    - maxNameLength, to load the file names properly
    - fileCount
    - xorKey

    The rest seems redundant...

  2. Things like the fileSectionOffset field are done for future compatibility of the file format. For example you may want to create a new version of the file format that has a bigger header struct or some data between the header and the file section. Using the fileSectionOffset, older software will be able to better read the new file format. It may not be able to use the newly added features of the new file format, but it will be able to find the files. The change to the file format may be minor enough that they would want to be able to maintain a level of compatibility with older software.

    In your case you can ignore fileSectionOffset since you're only covering ZGI and ZN. I assume they use the same header format.

    -km3k

  3. - maxNameLength makes sense if you know which file number you want, as it makes it easy to just scan ahead without having to parse every string. But then again, I don't know of any proper use case where the file NUMBER is known...
    - fileSectionOffset is like the data offset in BMPs I guess, mostly there in case they decide to change the header size at some point or add an additional block of data in between.

    I guess with the fixed sizes, they could read fileCount*headerSize, and then cast directly to an array of the struct.
