Friday, December 11, 2009

Screen-scraping an ASP.NET application in PHP

I recently was asked to assist a friend in developing a PHP script to automate authenticating to an ASP.NET web application and scraping a small piece of data from a secured page that changes periodically. The cURL library in PHP works well for this task, but there are a few important aspects of an ASP.NET application to recognize:

  • ASP.NET applications that use forms-based authentication typically process login input and return an encrypted authentication token in the form of a cookie to the browser. Capturing this cookie properly is necessary to automate the process.

  • ASP.NET also employees a mechanism known as ViewState that must be respected. The mechanism is provided for developers that need to maintain the illusion of a stateful environment across multiple postbacks to a page. This allows, for example, a user to enter a value in a textbox, click a submit button, and have that value remembered in the textbox as the page is refreshed. This kind of functionality in other platforms is typically very complicated to build, but ASP.NET is engineered to support it natively.

    It does so by passing a value to the browser with each participating page as a special hidden <input> tag with the name "__VIEWSTATE". The encrypted value is automatically included then as a POST parameter when the page is submitted back to the server. The ASP.NET application server then processes the encrypted value to restore user interface control values and properties as it reconstructs the "new" page it sends back to the browser (with an updated __VIEWSTATE tag).

    So a PHP script that automates posting to any ASP.NET form must respect this process and properly capture - and resubmit - the appropriate __VIEWSTATE value.

  • ASP.NET has additional mechanisms to validate page posts, supplying additional encrypted information in a hidden <input> named "__EVENTVALIDATION". If this value is present, the PHP script must capture it as well.

To ensure __VIEWSTATE and __EVENTVALIDATION are respected when submitting a form typically requires two calls to an ASP.NET page. The first is a simple GET, the results of which may be parsed with regular expressions to pull the appropriate hidden <input> values. The second is the POST which submits desired <input> values as well as the previously retrieved __VIEWSTATE and __EVENTVALIDATION. Performing these two calls on a login form, and submitting appropriate account information successfully returns the authentication cookie which is then submitted for each subsequent request for a secured page.

The following script uses the cURL library to perform such a login and accesses a secured page. The variables at the top of the script may be modified to match the desired site and account information.

<?php
/************************************************
* ASP.NET web site scraping script;
* Developed by MishaInTheCloud.com
* Copyright 2009 MishaInTheCloud.com. All rights reserved.
* The use of this script is governed by the CodeProject Open License
* See the following link for full details on use and restrictions.
* http://www.codeproject.com/info/cpol10.aspx
*
* The above copyright notice must be included in any reproductions of this script.
************************************************/

/************************************************
* values used throughout the script
************************************************/
// urls to call - the login page and the secured page
$urlLogin = "http://www.server.com/blahblah/login.aspx";
$urlSecuredPage = "http://www.server.com/blahblah/securedPage.aspx";

// POST names and values to support login
$nameUsername='txtusername'; // the name of the username textbox on the login form
$namePassword='txtpassword'; // the name of the password textbox on the login form
$nameLoginBtn='btnlogin'; // the name of the login button (submit) on the login form
$valUsername ='myUsername'; // the value to submit for the username
$valPassword ='myPassword'; // the value to submit for the password
$valLoginBtn ='Login'; // the text value of the login button itself

// the path to a file we can read/write; this will
// store cookies we need for accessing secured pages
$cookies = 'someReadableWritableFileLocation\cookie.txt';

// regular expressions to parse out the special ASP.NET
// values for __VIEWSTATE and __EVENTVALIDATION
$regexViewstate = '/__VIEWSTATE\" value=\"(.*)\"/i';
$regexEventVal = '/__EVENTVALIDATION\" value=\"(.*)\"/i';


/************************************************
* utility function: regexExtract
* use the given regular expression to extract
* a value from the given text; $regs will
* be set to an array of all group values
* (assuming a match) and the nthValue item
* from the array is returned as a string
************************************************/
function regexExtract($text, $regex, $regs, $nthValue)
{
if (preg_match($regex, $text, $regs)) {
$result = $regs[$nthValue];
}
else {
$result = "";
}
return $result;
}



/************************************************
* initialize a curl handle; we'll use this
* handle throughout the script
************************************************/
$ch = curl_init();


/************************************************
* first, issue a GET call to the ASP.NET login
* page. This is necessary to retrieve the
* __VIEWSTATE and __EVENTVALIDATION values
* that the server issues
************************************************/
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
$data=curl_exec($ch);

// from the returned html, parse out the __VIEWSTATE and
// __EVENTVALIDATION values
$viewstate = regexExtract($data,$regexViewstate,$regs,1);
$eventval = regexExtract($data, $regexEventVal,$regs,1);


/************************************************
* now issue a second call to the Login page;
* this time, it will be a POST; we'll send back
* as post data the __VIEWSTATE and __EVENTVALIDATION
* values the server previously sent us, as well as the
* username/password. We'll also set up a cookie
* jar to retrieve the authentication cookie that
* the server will generate and send us upon login.
************************************************/
$postData = '__VIEWSTATE='.rawurlencode($viewstate)
.'&__EVENTVALIDATION='.rawurlencode($eventval)
.'&'.$nameUsername.'='.$valUsername
.'&'.$namePassword.'='.$valPassword
.'&'.$nameLoginBtn.'='.$valLoginBtn
;

curl_setOpt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);

$data = curl_exec($ch);


/************************************************
* with the authentication cookie in the jar,
* we'll now issue a GET to the secured page;
* we set curl's COOKIEFILE option to the same
* file we used for the jar before to ensure the
* authentication cookie is sent back to the
* server
************************************************/
curl_setOpt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);

$data = curl_exec($ch);

// at this point the secured page may be parsed for
// values, or additional POSTS made to submit parameters
// and retrieve data. For this sample, we'll just
// echo the results.
echo $data;



/************************************************
* that's it! Close the curl handle
************************************************/
curl_close($ch);


?>