Using PhantomJS Cloud to scrape Windows Phone Apps+Games Store
I wanted to download the description, screenshots, and icons of apps in the Windows Phone store. This seemed quite easy initially, until I discovered that simply using WebClient to download the HTML from the webpage fired back an error: “Your request appears to be from an automated process.” So I had to think of a more ‘natural’ way to extract the data from the page. That’s when I remembered PhantomJS Cloud, a hosted service for running PhantomJS, the headless browser software.
I’ve left my API key out of the code below; you’ll need to sign up for a key yourself:
// Requires: using System; System.IO; System.Linq; System.Net; System.Text.RegularExpressions;
var strWpUrl = "http://www.windowsphone.com/en-gb/store/app/barcelona_metro/36c88182-9a6d-e011-81d2-78e7d1fa76f8";
var strUrl =
    "http://api.phantomjscloud.com/single/browser/v1/YOUR_KEY_HERE/?targetUrl=" +
    strWpUrl;
const string strScreenshotRegex = @"/images/(?<ScreenshotGuid>[\w-]+)\?imageType=ws_screenshot_large";
const string strIconGuidRegex = @"/images/(?<IconGuid>[\w-]+)\?imageType=ws_icon_large";
const string strDescriptionRegex = @"itemprop...description.{27}(?<Description>.*)./pre";
const string strAppNameRegex = @"app/(?<Name>\w+)";
var strAppName = Regex.Match(strUrl, strAppNameRegex).Groups["Name"].Value;
var strPath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + @"\WP\" + strAppName;
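if (!Directory.Exists(strPath))
{
    Directory.CreateDirectory(strPath);
    Directory.CreateDirectory(strPath + @"\Screenshots");
}
// WebClient (System.Net) is used for both the HTML and the image downloads.
var wc = new WebClient();
var strHtml = wc.DownloadString(strUrl);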
Console.WriteLine("Downloaded WP url: " + strAppName);
var strDescription = Regex.Match(strHtml, strDescriptionRegex).Groups["Description"].Value;
var fs = new FileStream(strPath + @"\" + strAppName + ".txt", FileMode.Create);
var sw = new StreamWriter(fs);
sw.Write(strDescription);
sw.Flush();
sw.Close();
fs.Close();
var strIconGuid = Regex.Match(strHtml, strIconGuidRegex).Groups["IconGuid"].Value;
var strIconUrl = @"http://cdn.marketplaceimages.windowsphone.com/v8/images/" + strIconGuid;
wc.DownloadFile(strIconUrl, strPath + @"\Icon.png");
var strScreenshotGuids = Regex.Matches(strHtml, strScreenshotRegex).Cast<Match>().Select(m => m.Groups["ScreenshotGuid"].Value);
foreach (var strScreenshotGuid in strScreenshotGuids)
{
    var strScreenshotUrl = @"http://cdn.marketplaceimages.windowsphone.com/v8/images/" + strScreenshotGuid;
    wc.DownloadFile(strScreenshotUrl, strPath + @"\Screenshots\" + strScreenshotGuid + ".png");
}
This downloads the HTML via PhantomJS Cloud, runs some regexes to extract the app’s description, downloads the icon and screenshots (in their original size), and saves everything to disk in a folder called WP on the desktop.
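The GUID extraction step can be tried in isolation, without an API key or a network call. The following is only a rough sketch: the class name StoreScrapeSketch and the method ExtractImageGuids are made up for illustration, and the call to Distinct() is my own addition, on the assumption that a page may reference the same image more than once. The pattern itself is the same one used for the screenshot and icon regexes above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; these names are illustrative only.
public static class StoreScrapeSketch
{
    // Returns the distinct image GUIDs found in the HTML for one imageType,
    // e.g. "ws_screenshot_large" or "ws_icon_large".
    public static List<string> ExtractImageGuids(string html, string imageType)
    {
        // Same shape as the regexes in the listing: /images/<guid>?imageType=<type>
        var pattern = @"/images/(?<Guid>[\w-]+)\?imageType=" + Regex.Escape(imageType);

        return Regex.Matches(html, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["Guid"].Value)
                    .Distinct() // assumption: skip repeated references to the same image
                    .ToList();
    }
}
```

Calling StoreScrapeSketch.ExtractImageGuids(strHtml, "ws_screenshot_large") would then give the list of screenshot GUIDs to feed into the download loop, and passing "ws_icon_large" would give the icon GUID.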