Archive

Archive for May, 2021

#ZIP file decompression from first principles in C#

TL;DR; here is the Github repo: https://github.com/infiniteloopltd/Zip

First off, if you’re just looking to unzip a zip file, please stop reading, and look at System.IO.Compression instead, however, if you want to write some code in C# to repair a damaged Zip file, or to find a performant way to decompress one file out of a larger zip file, then perhaps this approach may be useful.

So, from Wikipedia, you can get the header format for a Zip file; which repeats for every zip entry (compressed file)

OffsetBytesDescription[31]
04Local file header signature = 0x04034b50 (PK♥♦ or “PK\3\4”)
42Version needed to extract (minimum)
62General purpose bit flag
82Compression method; e.g. none = 0, DEFLATE = 8 (or “\0x08\0x00”)
102File last modification time
122File last modification date
144CRC-32 of uncompressed data
184Compressed size (or 0xffffffff for ZIP64)
224Uncompressed size (or 0xffffffff for ZIP64)
262File name length (n)
282Extra field length (m)
30nFile name
30+nmExtra field

I only wanted a few fields out of these, so I wrote code to extract them as follows;

eader = BitConverter.ToInt32(zipData.Skip(offset).Take(4).ToArray());
if (header != 0x04034b50)
{
	IsValid = false;
	return; // Zip header invalid
}
GeneralPurposeBitFlag = BitConverter.ToInt16(zipData.Skip(offset + 6).Take(2).ToArray());
var compressionMethod = BitConverter.ToInt16(zipData.Skip(offset + 8).Take(2).ToArray());
CompressionMethod = (CompressionMethodEnum) compressionMethod;
CompressedDataSize = BitConverter.ToInt32(zipData.Skip(offset + 18).Take(4).ToArray());
UncompressedDataSize = BitConverter.ToInt32(zipData.Skip(offset + 22).Take(4).ToArray());
CRC = BitConverter.ToInt32(zipData.Skip(offset + 14).Take(4).ToArray());
var fileNameLength = BitConverter.ToInt16(zipData.Skip(offset + 26).Take(2).ToArray());
FileName = Encoding.UTF8.GetString(zipData.Skip(offset + 30).Take(fileNameLength).ToArray());
var extraFieldLength = BitConverter.ToInt16(zipData.Skip(offset + 28).Take(2).ToArray());
ExtraField = zipData.Skip(offset + 30 + fileNameLength).Take(extraFieldLength).ToArray();
var dataStartIndex = offset + 30 + fileNameLength + extraFieldLength;
var bCompressed = zipData.Skip(dataStartIndex).Take(CompressedDataSize).ToArray();
Decompressed = CompressionMethod == CompressionMethodEnum.None ? bCompressed : Deflate(bCompressed);
NextOffset = dataStartIndex + CompressedDataSize;

This rather dense piece of code extracts relevant data from the zip entry header. It also determines if the zip entry is compressed, or left as-is, because with a very small file, then compression can actually increase the file size.

public enum CompressionMethodEnum
{
            None = 0,
            Deflate = 8
}

This is the enum I used, 0 for no compression, and 8 for deflate.

Now, if the zip entry is actually compressed, then you really have to revert to some code in .NET to decompress it:

private static byte[] Deflate(byte[] rawData)
{ 
	var memCompress = new MemoryStream(rawData);
	Stream csStream = new DeflateStream(memCompress, CompressionMode.Decompress);
	var msDecompress = new MemoryStream();
	csStream.CopyTo(msDecompress);
	var bDecompressed = msDecompress.ToArray();
	return bDecompressed;
}

I would really love if someone could implement this from first principles also, but the process is very very complicated, and it just fried my head trying to understand it.

So, with this in place, here is the loop I used to extract every file in the archive;

static void Main(string[] args)
{
	var file = "hello3.zip";
	var bFile = File.ReadAllBytes(file);
	var nextOffset = 0;
	do
	{
		var entry = new ZipEntry(bFile, nextOffset);
		if (!entry.IsValid) break;
		var content = Encoding.UTF8.GetString(entry.Decompressed);
		Console.WriteLine(entry.FileName);
		Console.WriteLine(content);
		nextOffset = entry.NextOffset;
	} while (true);
}

So, you could perhaps use this code to try to repair a corrupt zip file, or maybe optimize the extraction, so you extract on certain data from a large zip – or whatever.

Categories: Uncategorized

High performance extraction of unstructured text from a #PDF in C#

There are a myriad of tools that allow the extraction of text from a PDF, and this is code is not meant as a replacement for them, it was a specific case where I was looking to extract text from a PDF as fast as possible without worrying about the structure of the document. I.e. to very quickly answer the question “on what pages does the text “X” appear?”

In my specific case, performance was of paramount importance, knowing the layout of the page was unimportant.

The Github repo is here: https://github.com/infiniteloopltd/PdfToTextCSharp

And the performance was 10x faster than iText, parsing a 270 page PDF in 0.735 seconds.

It’s also a very interesting look at how one could go about creating a PDF reader from first principles, so without further ado, let’s take a look at a PDF, when opened in a text editor:

%PDF-1.7 
%âãÏÓ 
7 0 obj
<<
/Contents [ 8 0 R  ] 
/Parent 5 0 R 
/Resources 6 0 R 
/Type /Page
>>
endobj
6 0 obj
<<
/Font <<
/ttf0 11 0 R 
/ttf1 17 0 R 
>>
/ProcSet 21 0 R 
>>
endobj
8 0 obj
<<
/Filter [ /FlateDecode ]
/Length 1492
>>
stream
..... BINARY DATA ...
endstream

What is interesting here, is that the page data is encoded in the “BINARY DATA” which is enclosed between the stream and endstream markers

This binary data can be decompressed using the Deflate method. There are other compression schemes used in PDF, and they can even be chained, but that goes beyond the scope of this tool.

Here is the code to uncompress deflated binary data;

private static string Decompress(byte[] input)
{
            var cutInput = new byte[input.Length - 2];
            Array.Copy(input, 2, cutInput, 0, cutInput.Length);
            var stream = new MemoryStream();
            using (var compressStream = new MemoryStream(cutInput))
            using (var deflateStream = new DeflateStream(compressStream, CompressionMode.Decompress))
                deflateStream.CopyTo(stream);
            return Encoding.Default.GetString(stream.ToArray());
}

So, I read through the PDF document, looking for markers of “stream” and “endstream”, and when found, I would snip out the binary data, deflate it to reveal this text;

/DeviceRGB cs
/DeviceRGB CS
q
1 0 0 1 0 792 cm
18 -18 901.05 -756  re W n
1.5 w
0 0 0 SC
32.05 -271.6 m
685.25 -271.6 l
S
32.05 -235.6 m
685.25 -235.6 l
S
1 w
32.05 -723.9 m
685.25 -723.9 l
S
BT
1 0 0 1 636 -743.7 Tm
0 0 0 sc
0 Tr
/ttf0 9 Tf
(270)Tj
-510 0.1 Td
0 Tr
(08-Apr-2021)Tj

Most of this text is relating to the layout and appearance of the page, and is once again, beyond the scope of the tool. I wanted to extract the text which is represented like (…)Tj , which I extracted using a regex as follows;

 const string strContentRegex = @"\((?<Content>.*?)\)Tj";
 UnstructuredContent = Regex.Matches(RawContent, strContentRegex)
                .Select(m => m.Groups["Content"].Value)
                .ToList();

Once this was done, I could then write a Find function, that could find which pages a given string of text appeared;

public List<FastPDFPage> Find(string text)
{
            return Pages
                .Where(p => p.UnstructuredContent
                    .Any(c => string.Equals(c, text, StringComparison.OrdinalIgnoreCase)))
                .ToList();
}

And, in performance tests, this consistently performed at 0.735 seconds to scan 270 pages, much faster than iText, and a order of magnitude faster than PDF Miner for Python.

Categories: Uncategorized

Search #Skype users via an #API, by name or email address

Skype has been around since 2003 and has over 660 million registered users. This is an API wrapper that allows access to the user search facility of the consumer version of skype (i.e. not Skype4business), via a simple interface that allows searching via email address or name

To learn more – go to https://rapidapi.com/infiniteloop/api/skype-graph

This API takes a single parameter, which can be either a name “Robert Smith”, or an email address “robert.smith@hotmail.com

The response will be in JSON, following a format similar to this;

{
  "requestId": "689067",
  "results": [
    {
      "nodeProfileData": {
        "skypeId": "bobbys88881",
        "name": "Rob Smith",
        "avatarUrl": "https://api.skype.com/users/bobbys88881/profile/avatar",
        "country": "Ireland",
        "countryCode": "ie",
        "contactType": "Skype4Consumer",
        "avatarImageUrl": "https://avatar.skype.com/v1/avatars/bobbys88881/public"
      }
    }
  ]
}	

Above is a sample search for “robert.smith@hotmail.com”

Categories: Uncategorized

Vehicle licence plate #API available in #CostaRica

Costa Rica is a central American country with a population of just over 5 million, and half of all households in Costa Rica own at least one vehicle. The number of cars in Costa Rica has more than doubled since 2006 to an automotive park of 1,794,658 vehicles registered up to Feb. 2018. 1,166,042 (65%) corresponds to automotive vehicles; 589,037 motorcycles (33%), 20,918 Micro Buses (1%) and 9,661 Buses (0.5%). The average age of a Costa Rican car is 16 years with 2003 models.

As of today, we have launched an API that allows users search for a vehicle registered in Costa Rica using it’s license plate (like BFH467 in the photo above), via the website http://www.placa.co.cr/

Car registration plates in Costa Rica use the /CheckCostaRica API endpoint and return the following information:

  • Make / Model
  • Age
  • Wheelplan
  • Engine size
  • Fuel
  • Representative image

Sample Registration Number: 

706854

Sample Json:

{

  “Description”: “DAIHATSU TERIOS”,

  “CarMake”: {

    “CurrentTextValue”: “DAIHATSU”

  },

  “CarModel”: {

    “CurrentTextValue”: “TERIOS”

  },

  “MakeDescription”: {

    “CurrentTextValue”: “DAIHATSU”

  },

  “ModelDescription”: {

    “CurrentTextValue”: “TERIOS”

  },

  “EngineSize”: {

    “CurrentTextValue”: “1500”

  },

  “RegistrationYear”: “2016”,

  “Body”: “TODO TERRENO 4 PUERTAS”,

  “Transmission”: “MANUAL”,

  “Fuel”: “GASOLINA”,

  “Cabin”: “CABINA SENCILLA”,

  “WheelPlan”: “4X4”,

  “ImageUrl”: “http://www.placa.co.cr/image.aspx/@REFJSEFUU1UgVEVSSU9T&#8221;

}

Categories: Uncategorized

Upgrading #Cordova apps from #UIWebView to #WKWebView

Cordova hybrid iOS apps are based on the idea of wrapping a web application in a UIWebView in order to serve it as an app resembling a native iOS app. Apple has informed all iOS developers that UIWebView will be deprecated, and that new, or updated apps must use WKWebView instead. The idea that UIWebView could bypass fundamental restrictions that Web users are used to, is certainly problematic, and it’s not surprising that Apple has taken this move.

However, if you’ve got a Cordova based app, and need to make an upgrade, then perhaps you may need to take cognisance of these new limitations, and the upgrade path.

So, first off, to do the basic upgrade, you need to run the following command

cordova plugins add cordova-plugin-wkwebview-engine

Then Make a modification to config.xml as follows;

<platform name="ios">
    <preference name="WKWebViewOnly" value="true" />

    <feature name="CDVWKWebViewEngine">
        <param name="ios-package" value="CDVWKWebViewEngine" />
    </feature>

    <preference name="CordovaWebViewEngine" value="CDVWKWebViewEngine" />
</platform>

Then rebuild the app using

cordova build ios

That’s the basic upgrade path, but let’s dive deeper. One of the abilities of UIWebView was that you could make cross-origin requests without any restrictions. This is no longer possible in WKWebView, so you need to make sure that your AJAX requests are CORS compliant.

Assuming the API you are connecting to is not CORS enabled, and you have no direct control over it to make it CORS enabled, you will need to use a CORS proxy. One of the easiest ways to do this is via an API gateway using AWS.

Log in to AWS, and go to API Gateway. Press “Create API”, then “Build” beside HTTP API, then Add Integration -> HTTP. Set the URL endpoint to the root domain of the destination API that you are trying to call. Give the API a name and press “Next” Set Resource path to $default and press next. Press Next again, then Create.

Once created, click on the API, and press CORS on the left hand side. Press Configure. In Access-Control-Allow-Origin type * and press add. Do the same for Access-Control-Allow-Headers and Access-Control-Expose-Headers. For Access-Control-Allow-Methods select each method in turn, and press Add. Finally, Press Save.

You will now have an CORS-Enabled API endpoint, like https://xxxxxxx.execute-api.yyyyyyyyy.amazonaws.com that you can use to call the target API endpoint.

Then update your app source code, to change the domain to point at the AWS domain rather than the target API.

Before you compile, there is one last step; you have to add the following plugin:

cordova plugin add https://github.com/TheMattRay/cordova-plugin-wkwebviewxhrfix

What this does, is to set the origin header correctly within Cordova WKWebView, such that CORS will see it as a non-null origin, and will actually work.

Now, test your app, and hopefully you can now re-submit updates without apple complaining.

Categories: Uncategorized