Using PhantomJS Cloud to scrape Windows Phone Apps+Games Store

I wanted to download the description, screenshots, and icons for apps in the Windows Phone store. This seemed easy at first, until I discovered that simply using WebClient to download the HTML from the web page returned an error: “Your request appears to be from an automated process.” I had to think of a more ‘natural’ way to extract the data from the page, and that’s when I remembered PhantomJS Cloud, a hosted service for running PhantomJS, the headless browser.

I’ve left my API key out of the code below; you’ll need to sign up for a key yourself and put it in place of YOUR_KEY_HERE.

 

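// Needs using directives for System, System.IO, System.Linq, System.Net and System.Text.RegularExpressions.
// The store page to scrape, and the PhantomJS Cloud request that renders it.
var strWpUrl = "http://www.windowsphone.com/en-gb/store/app/barcelona_metro/36c88182-9a6d-e011-81d2-78e7d1fa76f8";
var strUrl = "http://api.phantomjscloud.com/single/browser/v1/YOUR_KEY_HERE/?targetUrl=" + strWpUrl;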
// Patterns for the pieces of the rendered page we want.
const string strScreenshotRegex = @"/images/(?<ScreenshotGuid>[\w-]+)\?imageType=ws_screenshot_large";
const string strIconGuidRegex = @"/images/(?<IconGuid>[\w-]+)\?imageType=ws_icon_large";
// Note: the "…" below was probably "..." (three wildcard characters) before the blog software converted it.
const string strDescriptionRegex = @"itemprop…description.{27}(?<Description>.*)./pre";
const string strAppNameRegex = @"app/(?<Name>\w+)";
// The app name from the URL doubles as the output folder name.
var strAppName = Regex.Match(strUrl, strAppNameRegex).Groups["Name"].Value;
var strPath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + @"\WP\" + strAppName;
if (!Directory.Exists(strPath))
{
    Directory.CreateDirectory(strPath);
    Directory.CreateDirectory(strPath + @"\Screenshots");
}

// Fetch the rendered HTML through PhantomJS Cloud.
var wc = new WebClient();
var strHtml = wc.DownloadString(strUrl);
Console.WriteLine("Downloaded WP url:" + strAppName);
// Save the description as a text file named after the app.
var strDescription = Regex.Match(strHtml, strDescriptionRegex).Groups["Description"].Value;
File.WriteAllText(strPath + @"\" + strAppName + ".txt", strDescription);
// Download the icon.
var strIconGuid = Regex.Match(strHtml, strIconGuidRegex).Groups["IconGuid"].Value;
var strIconUrl = @"http://cdn.marketplaceimages.windowsphone.com/v8/images/" + strIconGuid;
wc.DownloadFile(strIconUrl, strPath + @"\Icon.png");
// Download every screenshot referenced on the page.
var strScreenshotGuids = Regex.Matches(strHtml, strScreenshotRegex).Cast<Match>().Select(m => m.Groups["ScreenshotGuid"].Value);
foreach (var strScreenshotGuid in strScreenshotGuids)
{
    var strScreenshotUrl = @"http://cdn.marketplaceimages.windowsphone.com/v8/images/" + strScreenshotGuid;
    wc.DownloadFile(strScreenshotUrl, strPath + @"\Screenshots\" + strScreenshotGuid + ".png");
}

 

This downloads the HTML via PhantomJS Cloud, runs some regexes to pull out the app’s description, downloads the icon and screenshots (in their original size), and saves everything to disk in a folder called WP on the desktop.
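To see what the two image regexes capture without calling the service, here is a minimal sketch that runs them against a small HTML fragment. The fragment and its GUID-like values are invented for illustration and will not match the real store markup exactly; the patterns are the ones used in the code, and the snippet assumes a compiler that accepts top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical HTML fragment, made up for this example; the real page differs.
const string sampleHtml =
    "<img src=\"/v8/images/aaaa-1111?imageType=ws_icon_large\" />" +
    "<img src=\"/v8/images/bbbb-2222?imageType=ws_screenshot_large\" />" +
    "<img src=\"/v8/images/cccc-3333?imageType=ws_screenshot_large\" />";

// The same patterns as in the post.
const string iconPattern = @"/images/(?<IconGuid>[\w-]+)\?imageType=ws_icon_large";
const string screenshotPattern = @"/images/(?<ScreenshotGuid>[\w-]+)\?imageType=ws_screenshot_large";

// Regex.Match returns the first hit; if nothing matches, the group value is an empty string.
string iconGuid = Regex.Match(sampleHtml, iconPattern).Groups["IconGuid"].Value;

// Regex.Matches returns every hit, so each screenshot yields one captured value.
string[] screenshotGuids = Regex.Matches(sampleHtml, screenshotPattern)
    .Cast<Match>()
    .Select(m => m.Groups["ScreenshotGuid"].Value)
    .ToArray();

Console.WriteLine("Icon: " + iconGuid);
foreach (var guid in screenshotGuids)
{
    Console.WriteLine("Screenshot: " + guid);
}
```

Because a failed match gives an empty string rather than an exception, it is worth checking that the icon value is non-empty before building a download URL from it.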
