#ZIP file decompression from first principles in C#

TL;DR: here is the GitHub repo: https://github.com/infiniteloopltd/Zip

First off, if you’re just looking to unzip a zip file, please stop reading and look at System.IO.Compression instead. However, if you want to write some C# code to repair a damaged zip file, or to find a performant way to decompress one file out of a larger zip file, then this approach may be useful.

So, from Wikipedia, you can get the local file header format for a zip file, which repeats for every zip entry (compressed file):

Offset  Bytes  Description
0       4      Local file header signature = 0x04034b50 (PK♥♦ or “PK\3\4”)
4       2      Version needed to extract (minimum)
6       2      General purpose bit flag
8       2      Compression method; e.g. none = 0, DEFLATE = 8 (or “\0x08\0x00”)
10      2      File last modification time
12      2      File last modification date
14      4      CRC-32 of uncompressed data
18      4      Compressed size (or 0xffffffff for ZIP64)
22      4      Uncompressed size (or 0xffffffff for ZIP64)
26      2      File name length (n)
28      2      Extra field length (m)
30      n      File name
30+n    m      Extra field
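
Every multi-byte value in this table is stored little-endian, which is why the bytes 50 4B 03 04 ("PK\3\4") read back as 0x04034b50. BitConverter uses the byte order of the host CPU, so it only gives the right answer on little-endian machines. As a minimal sketch, here is a signature check that is explicit about byte order. The class and method names are my own for illustration and are not part of the repo.

```csharp
using System;
using System.Buffers.Binary;

public static class ZipSignature
{
    // Local file header signature, as listed in the table above.
    public const uint LocalFileHeader = 0x04034b50;

    // Returns true when the four bytes at 'offset' are 50 4B 03 04 ("PK\3\4").
    public static bool IsLocalFileHeader(byte[] data, int offset)
    {
        // Reject offsets that do not leave room for a 4-byte signature.
        if (offset < 0 || offset > data.Length - 4) return false;

        // Read as little-endian regardless of the host CPU's byte order.
        uint value = BinaryPrimitives.ReadUInt32LittleEndian(data.AsSpan(offset, 4));
        return value == LocalFileHeader;
    }
}
```

Calling ZipSignature.IsLocalFileHeader(bytes, 0) on a buffer that starts with 50 4B 03 04 returns true. An offset too close to the end of the buffer returns false rather than throwing.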

I only wanted a few fields out of these, so I wrote code to extract them as follows:

// Requires: using System; using System.Linq; using System.Text;
// Offsets below are relative to the start of the entry and match the table above.
var header = BitConverter.ToInt32(zipData.Skip(offset).Take(4).ToArray());
if (header != 0x04034b50)
{
	IsValid = false;
	return; // Zip header invalid
}
GeneralPurposeBitFlag = BitConverter.ToInt16(zipData.Skip(offset + 6).Take(2).ToArray());
var compressionMethod = BitConverter.ToInt16(zipData.Skip(offset + 8).Take(2).ToArray());
CompressionMethod = (CompressionMethodEnum) compressionMethod;
CompressedDataSize = BitConverter.ToInt32(zipData.Skip(offset + 18).Take(4).ToArray());
UncompressedDataSize = BitConverter.ToInt32(zipData.Skip(offset + 22).Take(4).ToArray());
CRC = BitConverter.ToInt32(zipData.Skip(offset + 14).Take(4).ToArray());
// The two length fields are unsigned 16-bit values, so read them as UInt16.
var fileNameLength = BitConverter.ToUInt16(zipData.Skip(offset + 26).Take(2).ToArray());
FileName = Encoding.UTF8.GetString(zipData.Skip(offset + 30).Take(fileNameLength).ToArray());
var extraFieldLength = BitConverter.ToUInt16(zipData.Skip(offset + 28).Take(2).ToArray());
ExtraField = zipData.Skip(offset + 30 + fileNameLength).Take(extraFieldLength).ToArray();
// The entry's data starts right after the fixed 30-byte header, the file name and the extra field.
var dataStartIndex = offset + 30 + fileNameLength + extraFieldLength;
var bCompressed = zipData.Skip(dataStartIndex).Take(CompressedDataSize).ToArray();
Decompressed = CompressionMethod == CompressionMethodEnum.None ? bCompressed : Deflate(bCompressed);
// The next entry's header begins immediately after this entry's data.
NextOffset = dataStartIndex + CompressedDataSize;

This rather dense piece of code extracts the relevant data from the zip entry header. It also determines whether the zip entry is compressed or stored as-is, because with a very small file, compression can actually increase the file size.
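
The general purpose bit flag read at offset 6 is worth a closer look before trusting the size fields. In the ZIP format, bit 0 marks an encrypted entry, bit 3 means the CRC and sizes are written in a data descriptor after the data (and are zero in the local header), and bit 11 means the file name is UTF-8. The following is a rough sketch of how those bits could be checked. The enum and helper names are hypothetical and not taken from the repo.

```csharp
using System;

[Flags]
public enum GeneralPurposeFlags : ushort
{
    None = 0,
    Encrypted = 1 << 0,          // bit 0: entry data is encrypted
    HasDataDescriptor = 1 << 3,  // bit 3: CRC and sizes follow the data, header values are zero
    Utf8Names = 1 << 11          // bit 11: file name and comment are UTF-8
}

public static class ZipFlags
{
    // True when the local header alone is enough to locate and decode the entry data,
    // i.e. the entry is neither encrypted nor using a trailing data descriptor.
    public static bool CanWalkSequentially(ushort rawFlags)
    {
        var flags = (GeneralPurposeFlags)rawFlags;
        var blockers = GeneralPurposeFlags.Encrypted | GeneralPurposeFlags.HasDataDescriptor;
        return (flags & blockers) == 0;
    }

    // True when the file name bytes should be decoded as UTF-8.
    public static bool UsesUtf8Names(ushort rawFlags)
    {
        return ((GeneralPurposeFlags)rawFlags & GeneralPurposeFlags.Utf8Names) != 0;
    }
}
```

If CanWalkSequentially returns false, the compressed size in the local header cannot be relied on, so the offset arithmetic shown above would not find the next entry.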

public enum CompressionMethodEnum
{
	None = 0,
	Deflate = 8
}

This is the enum I used, 0 for no compression, and 8 for deflate.

Now, if the zip entry is actually compressed, then you really have to resort to some code in .NET to decompress it:

// Requires: using System.IO; using System.IO.Compression;
// Despite its name, this method inflates (decompresses) a raw DEFLATE stream.
private static byte[] Deflate(byte[] rawData)
{
	using (var memCompress = new MemoryStream(rawData))
	using (var csStream = new DeflateStream(memCompress, CompressionMode.Decompress))
	using (var msDecompress = new MemoryStream())
	{
		csStream.CopyTo(msDecompress);
		return msDecompress.ToArray();
	}
}
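
A zip entry with method 8 holds a raw DEFLATE stream with no extra wrapper, which is the format DeflateStream reads and writes. As a small sanity check under that assumption, this sketch compresses a buffer and inflates it again. The class and method names are illustrative only.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class RawDeflateDemo
{
    // Produces a raw DEFLATE stream, the same kind of payload a method-8 zip entry carries.
    public static byte[] Compress(byte[] plain)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionLevel.Optimal))
            {
                deflate.Write(plain, 0, plain.Length);
            } // Disposing the DeflateStream writes the final block.

            // MemoryStream.ToArray still works after the stream has been closed.
            return output.ToArray();
        }
    }

    // Inflates a raw DEFLATE stream back into the original bytes.
    public static byte[] Decompress(byte[] raw)
    {
        using (var input = new MemoryStream(raw))
        using (var inflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            inflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Round-tripping a byte array through Compress and then Decompress should give back the original bytes. Comparing the length of the Compress output with the input length for a very short input is also an easy way to see the small-file overhead mentioned earlier.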

I would really love it if someone could implement the DEFLATE decompression itself from first principles too, but the process is very complicated, and it just fried my head trying to understand it.

So, with the header parsing and decompression in place, here is the loop I used to extract every file in the archive:

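// Requires: using System; using System.IO; using System.Text;
static void Main(string[] args)
{
	var file = "hello3.zip";
	var bFile = File.ReadAllBytes(file);
	var nextOffset = 0;
	// Stop when fewer than 30 bytes remain, since a full fixed-size header no longer fits.
	while (nextOffset + 30 <= bFile.Length)
	{
		var entry = new ZipEntry(bFile, nextOffset);
		// The central directory after the last entry has a different signature, which ends the loop.
		if (!entry.IsValid) break;
		// Assumes the entries are text files; binary content will not print meaningfully.
		var content = Encoding.UTF8.GetString(entry.Decompressed);
		Console.WriteLine(entry.FileName);
		Console.WriteLine(content);
		nextOffset = entry.NextOffset;
	}
}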
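
The loop above stops at the first offset that does not hold a valid header. For a damaged archive, one option is to scan forward for the next "PK\3\4" signature and resume from there. This is only a sketch of that idea, with a hypothetical helper name. The same four bytes can also occur by chance inside compressed data, so a hit is a candidate that still needs to be validated.

```csharp
using System;

public static class ZipScanner
{
    // Returns the index of the next 50 4B 03 04 signature at or after 'start',
    // or -1 when no further signature exists in the buffer.
    public static int FindNextLocalHeader(byte[] data, int start)
    {
        for (int i = Math.Max(start, 0); i + 4 <= data.Length; i++)
        {
            if (data[i] == 0x50 && data[i + 1] == 0x4B &&
                data[i + 2] == 0x03 && data[i + 3] == 0x04)
            {
                return i;
            }
        }
        return -1;
    }
}
```

When an entry turns out to be invalid, calling ZipScanner.FindNextLocalHeader(bFile, nextOffset + 1) gives the next place worth trying, and a result of -1 means there is nothing more to recover.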

So, you could perhaps use this code to try to repair a corrupt zip file, or to optimize the extraction so that you extract only certain data from a large zip, or whatever.
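
For the repair use case, the CRC-32 stored at offset 14 gives a way to tell whether an extracted entry survived intact. Unlike DEFLATE, CRC-32 is simple enough to write from first principles. Below is a minimal table-driven sketch using the reflected polynomial 0xEDB88320, which is the CRC-32 variant ZIP uses. The class name is my own and this is not code from the repo.

```csharp
using System;

public static class Crc32
{
    // Lookup table with one precomputed value per possible byte.
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint n = 0; n < 256; n++)
        {
            uint c = n;
            // Process the byte one bit at a time with the reflected polynomial.
            for (int k = 0; k < 8; k++)
            {
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            }
            table[n] = c;
        }
        return table;
    }

    // Computes the CRC-32 of the uncompressed data.
    public static uint Compute(byte[] data)
    {
        uint crc = 0xFFFFFFFFu; // initial value
        foreach (byte b in data)
        {
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        }
        return crc ^ 0xFFFFFFFFu; // final inversion
    }
}
```

The standard check value for this CRC is 0xCBF43926 for the ASCII string "123456789". To compare against an entry, the stored value needs to be treated as unsigned. If the CRC property is a signed int, as the parsing code suggests, something like Crc32.Compute(entry.Decompressed) == unchecked((uint)entry.CRC) would do it.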
