I recently was asked to assist a friend in developing a PHP script to automate authenticating to an ASP.NET web application and scraping a small piece of data from a secured page that changes periodically. The cURL library in PHP works well for this task, but there are a few important aspects of an ASP.NET application to recognize:
- ASP.NET applications that use forms-based authentication typically process login input and return an encrypted authentication token in the form of a cookie to the browser. Capturing this cookie properly is necessary to automate the process.
- ASP.NET also employs a mechanism known as ViewState that must be respected. The mechanism is provided for developers who need to maintain the illusion of a stateful environment across multiple postbacks to a page. This allows, for example, a user to enter a value in a textbox, click a submit button, and have that value remembered in the textbox as the page is refreshed. This kind of functionality in other platforms is typically very complicated to build, but ASP.NET is engineered to support it natively.
It does so by passing a value to the browser with each participating page as a special hidden <input> tag with the name "__VIEWSTATE". The encoded (and, depending on configuration, encrypted) value is then automatically included as a POST parameter when the page is submitted back to the server. The ASP.NET application server then processes the value to restore user interface control values and properties as it reconstructs the "new" page it sends back to the browser (with an updated __VIEWSTATE tag).
So a PHP script that automates posting to any ASP.NET form must respect this process and properly capture, and resubmit, the appropriate __VIEWSTATE value.
- ASP.NET has additional mechanisms to validate page posts, supplying additional encoded information in a hidden <input> named "__EVENTVALIDATION". If this value is present, the PHP script must capture it as well.
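As an illustration of what "capturing" these hidden values involves, here is a minimal sketch that pulls every hidden <input> out of a page with PHP's DOMDocument instead of regular expressions. The function name extractHiddenFields and the sample markup are assumptions made up for this example; a real ASP.NET page may include more hidden fields than the two shown.

```php
<?php
// Hypothetical helper (not part of the original script): collect every
// hidden <input> on a page as a name => value map.
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so silence parser warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $isHidden = strtolower($input->getAttribute('type')) === 'hidden';
        $name = $input->getAttribute('name');
        if ($isHidden && $name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Made-up sample markup shaped like an ASP.NET login form.
$sampleHtml = '<html><body><form method="post" action="login.aspx">'
    . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTEx" />'
    . '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgKx" />'
    . '<input type="text" name="txtusername" />'
    . '</form></body></html>';

$hidden = extractHiddenFields($sampleHtml);
print_r($hidden);
```

Because the helper returns whatever hidden fields exist, a page that omits __EVENTVALIDATION simply yields a map without that key.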
Ensuring that __VIEWSTATE and __EVENTVALIDATION are respected when submitting a form typically requires two calls to an ASP.NET page. The first is a simple GET, the results of which may be parsed with regular expressions to pull the appropriate hidden <input> values. The second is the POST, which submits the desired <input> values as well as the previously retrieved __VIEWSTATE and __EVENTVALIDATION. Performing these two calls on a login form with appropriate account information returns the authentication cookie, which is then submitted with each subsequent request for a secured page.
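One way to confirm that the login actually produced an authentication cookie is to inspect the cookie jar file that cURL writes, which uses the tab-separated Netscape cookie format. The sketch below is a rough check under some assumptions: the helper name cookieJarHasCookie is made up, the sample jar contents are invented, and ".ASPXAUTH" is only the default forms-authentication cookie name, so a given site may use a different one. Note also that libcurl typically writes the jar file only when the handle is closed.

```php
<?php
// Hypothetical helper: report whether a Netscape-format cookie jar
// contains a cookie with the given name.
function cookieJarHasCookie(string $jarContents, string $cookieName): bool
{
    foreach (preg_split('/\r\n|\r|\n/', $jarContents) as $line) {
        // curl marks HttpOnly cookies with a "#HttpOnly_" prefix; strip it
        // so the line is not mistaken for a comment.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue; // blank line or comment
        }
        // Fields: domain, subdomain flag, path, secure, expiry, name, value
        $parts = explode("\t", $line);
        if (count($parts) >= 7 && $parts[5] === $cookieName) {
            return true;
        }
    }
    return false;
}

// Invented jar contents for illustration only.
$sampleJar = "# Netscape HTTP Cookie File\n"
    . "#HttpOnly_www.server.com\tFALSE\t/\tFALSE\t0\t.ASPXAUTH\tABCDEF0123\n"
    . "www.server.com\tFALSE\t/\tFALSE\t0\tASP.NET_SessionId\txyz\n";

var_dump(cookieJarHasCookie($sampleJar, '.ASPXAUTH'));
var_dump(cookieJarHasCookie($sampleJar, 'missing'));
```

In a real run the first argument would come from file_get_contents() on the cookie jar path, read after the cURL handle has been closed.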
The following script uses the cURL library to perform such a login and accesses a secured page. The variables at the top of the script may be modified to match the desired site and account information.
<?php
/************************************************
* ASP.NET web site scraping script;
* Developed by MishaInTheCloud.com
* Copyright 2009 MishaInTheCloud.com. All rights reserved.
* The use of this script is governed by the CodeProject Open License
* See the following link for full details on use and restrictions.
* http://www.codeproject.com/info/cpol10.aspx
*
* The above copyright notice must be included in any reproductions of this script.
************************************************/
/************************************************
* values used throughout the script
************************************************/
// urls to call - the login page and the secured page
$urlLogin = "http://www.server.com/blahblah/login.aspx";
$urlSecuredPage = "http://www.server.com/blahblah/securedPage.aspx";
// POST names and values to support login
$nameUsername='txtusername'; // the name of the username textbox on the login form
$namePassword='txtpassword'; // the name of the password textbox on the login form
$nameLoginBtn='btnlogin'; // the name of the login button (submit) on the login form
$valUsername ='myUsername'; // the value to submit for the username
$valPassword ='myPassword'; // the value to submit for the password
$valLoginBtn ='Login'; // the text value of the login button itself
// the path to a file we can read/write; this will
// store cookies we need for accessing secured pages
$cookieFile = 'someReadableWritableFileLocation\cookie.txt';
// regular expressions to parse out the special ASP.NET
// values for __VIEWSTATE and __EVENTVALIDATION
$regexViewstate = '/__VIEWSTATE\" value=\"([^\"]*)\"/i';
$regexEventVal = '/__EVENTVALIDATION\" value=\"([^\"]*)\"/i';
/************************************************
* utility function: regexExtract
* use the given regular expression to extract
* a value from the given text; $regs will
* be set to an array of all group values
* (assuming a match) and the nthValue item
* from the array is returned as a string
************************************************/
function regexExtract($text, $regex, &$regs, $nthValue)
{
    // $regs is passed by reference so the caller receives the match groups
    if (preg_match($regex, $text, $regs)) {
        $result = $regs[$nthValue];
    }
    else {
        $result = "";
    }
    return $result;
}
/************************************************
* initialize a curl handle; we'll use this
* handle throughout the script
************************************************/
$ch = curl_init();
/************************************************
* first, issue a GET call to the ASP.NET login
* page. This is necessary to retrieve the
* __VIEWSTATE and __EVENTVALIDATION values
* that the server issues
************************************************/
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // skips TLS certificate verification; suitable for testing only
$data=curl_exec($ch);
// from the returned html, parse out the __VIEWSTATE and
// __EVENTVALIDATION values
$viewstate = regexExtract($data,$regexViewstate,$regs,1);
$eventval = regexExtract($data, $regexEventVal,$regs,1);
/************************************************
* now issue a second call to the Login page;
* this time, it will be a POST; we'll send back
* as post data the __VIEWSTATE and __EVENTVALIDATION
* values the server previously sent us, as well as the
* username/password. We'll also set up a cookie
* jar to retrieve the authentication cookie that
* the server will generate and send us upon login.
************************************************/
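// url-encode every name and value so characters such as & or + survive the POST
$postData = '__VIEWSTATE='.rawurlencode($viewstate)
    .'&__EVENTVALIDATION='.rawurlencode($eventval)
    .'&'.rawurlencode($nameUsername).'='.rawurlencode($valUsername)
    .'&'.rawurlencode($namePassword).'='.rawurlencode($valPassword)
    .'&'.rawurlencode($nameLoginBtn).'='.rawurlencode($valLoginBtn)
    ;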
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/************************************************
* with the authentication cookie in the jar,
* we'll now issue a GET to the secured page;
* we set curl's COOKIEFILE option to the same
* file we used for the jar before to ensure the
* authentication cookie is sent back to the
* server
************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
// at this point the secured page may be parsed for
// values, or additional POSTS made to submit parameters
// and retrieve data. For this sample, we'll just
// echo the results.
echo $data;
/************************************************
* that's it! Close the curl handle
************************************************/
curl_close($ch);
?>
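As an alternative to concatenating the POST string by hand, the body can be assembled from arrays with PHP's built-in http_build_query, which URL-encodes every name and value. This is only a sketch: the helper name buildLoginPostBody is hypothetical, and the field names and values are placeholders in the spirit of the variables at the top of the script.

```php
<?php
// Hypothetical helper: merge the server-issued hidden fields with the
// login inputs and URL-encode the result as a POST body.
function buildLoginPostBody(array $hiddenFields, array $loginFields): string
{
    // array_replace keeps string keys intact; login fields win on a name clash.
    $fields = array_replace($hiddenFields, $loginFields);
    return http_build_query($fields, '', '&', PHP_QUERY_RFC3986);
}

// Placeholder values for illustration only.
$hiddenFields = ['__VIEWSTATE' => '/wEP+a=='];
$loginFields = [
    'txtusername' => 'myUsername',
    'txtpassword' => 'p&ss word',
    'btnlogin'    => 'Login',
];

$postBody = buildLoginPostBody($hiddenFields, $loginFields);
echo $postBody, "\n";
```

The resulting string can be passed to CURLOPT_POSTFIELDS. Since only the hidden fields that were actually found get included, a page without __EVENTVALIDATION needs no special handling.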
Thanks for your wonderful script.
It really did the trick for me.
I'm facing a problem where I have some data which is divided between some pages.
Do you know how I can get the last page using cURL?
It uses javascript.
@Embedded - you might want to be more specific? The cURL library will certainly work whether or not javascript is on the page, right?
Could you please post a code snippet that mimics the curl followlocation option?
I have problems using the FOLLOWLOCATION option on my web hosting.
Thanks
@Embedded - it's set like any other option -
curl_setOpt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
If you're having problems with it, I would imagine it would have something to do with the headers being set.
I'm having problems with it, so I tried to redirect manually by using the header function.
The problem is that I need to keep my PHP script running and to be able to parse the redirected URL data.
How can I do that?
@Embedded - I'm not sure I can help you with that.
Awesome post, saved me a boat load of time. Thank you thank you thank you....
This comment has been removed by a blog administrator.
@dandoen - I went to the site you listed and didn't see a login page. But that said, I think all the advice I can give you is to double-check the names of the form fields and their values. That's the portion of my script that you would need to customize specifically for your use.
Misha, there's indeed no login page, but that's why I had put it in quotes :)
And thanks for the advice, I had indeed missed a form variable.
As a side note, I would appreciate it if you could delete the url I mentioned in my previous comment.
@dandoen - I couldn't remove just the url, so I removed your comment.
Thanks a lot .. was very useful .. concise and to the point
:)
Awesome - you saved me a whole day of programming.
I also learned a lot about asp.net scraping! Thanks!!
Extremely helpful. Thank you so much for sharing this code! :)
surprised no one else has mentioned yet, but I had to change $cookies to $cookieFile and I had to declare $regs with $regs=array(). Still trying to get it to work for me.
I have the same problem andr3w321
Not working for me as well. Had the same problems.
What if the __EVENTVALIDATION variable isn't present in the initial load? I'm getting an error as a result:
Validation of viewstate MAC failed. If this application is hosted by a Web Farm or cluster, ensure that configuration specifies the same validationKey and validation algorithm. AutoGenerate cannot be used in a cluster.
Has anyone experienced this?