I recently was asked to assist a friend in developing a PHP script to automate authenticating to an ASP.NET web application and scraping a small piece of data from a secured page that changes periodically. The cURL library in PHP works well for this task, but there are a few important aspects of an ASP.NET application to recognize:
- ASP.NET applications that use forms-based authentication typically process login input and return an encrypted authentication token in the form of a cookie to the browser. Capturing this cookie properly is necessary to automate the process.
- ASP.NET also employs a mechanism known as ViewState that must be respected. The mechanism is provided for developers who need to maintain the illusion of a stateful environment across multiple postbacks to a page. This allows, for example, a user to enter a value in a textbox, click a submit button, and have that value remembered in the textbox as the page is refreshed. This kind of functionality in other platforms is typically very complicated to build, but ASP.NET is engineered to support it natively.
It does so by passing a value to the browser with each participating page as a special hidden <input> tag with the name "__VIEWSTATE". This encoded (and optionally encrypted) value is then automatically included as a POST parameter when the page is submitted back to the server. The ASP.NET application server processes the value to restore user interface control values and properties as it reconstructs the "new" page it sends back to the browser (with an updated __VIEWSTATE tag).
So a PHP script that automates posting to any ASP.NET form must respect this process and properly capture, and resubmit, the appropriate __VIEWSTATE value.
- ASP.NET has additional mechanisms to validate page posts, supplying additional encoded information in a hidden <input> named "__EVENTVALIDATION". If this value is present, the PHP script must capture it as well.
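As a minimal sketch of the capture step described above, the helper below collects every hidden <input> from a page's HTML instead of naming __VIEWSTATE and __EVENTVALIDATION individually. The function name extractHiddenFields and the sample HTML are hypothetical, introduced only for illustration. The pattern matching is deliberately simple and assumes double-quoted value attributes.

```php
<?php
// Hypothetical helper (not part of the original script): collect every
// hidden <input> on a page so the values can be posted back unchanged.
function extractHiddenFields(string $html): array
{
    $fields = [];
    // Find each <input ...> tag, then read its attributes one at a time
    // so that attribute order inside the tag does not matter.
    preg_match_all('/<input\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        if (!preg_match('/\btype\s*=\s*["\']hidden["\']/i', $tag)) {
            continue; // skip textboxes, buttons, and so on
        }
        if (!preg_match('/\bname\s*=\s*["\']([^"\']*)["\']/i', $tag, $name)) {
            continue; // a hidden input without a name is never posted
        }
        $value = preg_match('/\bvalue\s*=\s*"([^"]*)"/i', $tag, $m) ? $m[1] : '';
        $fields[$name[1]] = html_entity_decode($value, ENT_QUOTES);
    }
    return $fields;
}

// Made-up sample markup in the shape ASP.NET typically emits.
$sampleHtml = '<form>'
    . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTEx" />'
    . '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgKc" />'
    . '<input type="text" name="txtusername" />'
    . '</form>';

$fields = extractHiddenFields($sampleHtml);
print_r($fields);
```

Collecting all hidden fields this way also covers pages that omit __EVENTVALIDATION or add other hidden inputs, since only the fields actually present end up in the array.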
Ensuring that __VIEWSTATE and __EVENTVALIDATION are respected when submitting a form typically requires two calls to an ASP.NET page. The first is a simple GET, the results of which may be parsed with regular expressions to pull the appropriate hidden <input> values. The second is the POST, which submits the desired <input> values as well as the previously retrieved __VIEWSTATE and __EVENTVALIDATION. Performing these two calls on a login form with the appropriate account information returns the authentication cookie, which is then sent with each subsequent request for a secured page.
The following script uses the cURL library to perform such a login and accesses a secured page. The variables at the top of the script may be modified to match the desired site and account information.
<?php
/************************************************
* ASP.NET web site scraping script;
* Developed by MishaInTheCloud.com
* Copyright 2009 MishaInTheCloud.com. All rights reserved.
* The use of this script is governed by the CodeProject Open License
* See the following link for full details on use and restrictions.
* http://www.codeproject.com/info/cpol10.aspx
*
* The above copyright notice must be included in any reproductions of this script.
************************************************/
/************************************************
* values used throughout the script
************************************************/
// urls to call - the login page and the secured page
$urlLogin = "http://www.server.com/blahblah/login.aspx";
$urlSecuredPage = "http://www.server.com/blahblah/securedPage.aspx";
// POST names and values to support login
$nameUsername='txtusername'; // the name of the username textbox on the login form
$namePassword='txtpassword'; // the name of the password textbox on the login form
$nameLoginBtn='btnlogin'; // the name of the login button (submit) on the login form
$valUsername ='myUsername'; // the value to submit for the username
$valPassword ='myPassword'; // the value to submit for the password
$valLoginBtn ='Login'; // the text value of the login button itself
// the path to a file we can read/write; this will
// store cookies we need for accessing secured pages
$cookieFile = 'someReadableWritableFileLocation\cookie.txt';
// regular expressions to parse out the special ASP.NET
// values for __VIEWSTATE and __EVENTVALIDATION
$regexViewstate = '/__VIEWSTATE" value="([^"]*)"/i';
$regexEventVal = '/__EVENTVALIDATION" value="([^"]*)"/i';
/************************************************
* utility function: regexExtract
* use the given regular expression to extract
* a value from the given text; $regs will
* be set to an array of all group values
* (assuming a match) and the nthValue item
* from the array is returned as a string
************************************************/
function regexExtract($text, $regex, &$regs, $nthValue)
{
    // $regs is taken by reference so the caller receives the match groups
    if (preg_match($regex, $text, $regs)) {
        $result = $regs[$nthValue];
    }
    else {
        $result = "";
    }
    return $result;
}
/************************************************
* initialize a curl handle; we'll use this
* handle throughout the script
************************************************/
$ch = curl_init();
/************************************************
* first, issue a GET call to the ASP.NET login
* page. This is necessary to retrieve the
* __VIEWSTATE and __EVENTVALIDATION values
* that the server issues
************************************************/
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
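// NOTE: disabling peer verification skips the SSL certificate check;
// leave it enabled whenever the server's certificate can be validated
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);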
$data=curl_exec($ch);
// from the returned html, parse out the __VIEWSTATE and
// __EVENTVALIDATION values
$viewstate = regexExtract($data,$regexViewstate,$regs,1);
$eventval = regexExtract($data, $regexEventVal,$regs,1);
/************************************************
* now issue a second call to the Login page;
* this time, it will be a POST; we'll send back
* as post data the __VIEWSTATE and __EVENTVALIDATION
* values the server previously sent us, as well as the
* username/password. We'll also set up a cookie
* jar to retrieve the authentication cookie that
* the server will generate and send us upon login.
************************************************/
$postData = '__VIEWSTATE='.rawurlencode($viewstate)
.'&__EVENTVALIDATION='.rawurlencode($eventval)
.'&'.rawurlencode($nameUsername).'='.rawurlencode($valUsername)
.'&'.rawurlencode($namePassword).'='.rawurlencode($valPassword)
.'&'.rawurlencode($nameLoginBtn).'='.rawurlencode($valLoginBtn)
;
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/************************************************
* with the authentication cookie in the jar,
* we'll now issue a GET to the secured page;
* we set curl's COOKIEFILE option to the same
* file we used for the jar before to ensure the
* authentication cookie is sent back to the
* server
************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
// at this point the secured page may be parsed for
// values, or additional POSTS made to submit parameters
// and retrieve data. For this sample, we'll just
// echo the results.
echo $data;
/************************************************
* that's it! Close the curl handle
************************************************/
curl_close($ch);
?>
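The POST body in the script above is assembled by hand. As an alternative sketch, the same string can be built with http_build_query, which percent-encodes every name and value in one step. The values below are placeholders, and leaving out __EVENTVALIDATION when the page did not supply one is an assumption to test against the target site, not a documented rule.

```php
<?php
// Placeholder values standing in for what the initial GET returned.
$viewstate = '/wEPDwUKLTEx+abc=';
$eventval  = '';   // empty when the page has no __EVENTVALIDATION field

$fields = [
    '__VIEWSTATE' => $viewstate,
    'txtusername' => 'myUsername',
    'txtpassword' => 'p&ss=word',   // contains characters that must be encoded
    'btnlogin'    => 'Login',
];

// Assumption: only send __EVENTVALIDATION when the page supplied one.
if ($eventval !== '') {
    $fields['__EVENTVALIDATION'] = $eventval;
}

// PHP_QUERY_RFC3986 encodes names and values the way rawurlencode() does.
$postData = http_build_query($fields, '', '&', PHP_QUERY_RFC3986);
echo $postData, "\n";
```

The resulting string can be passed to CURLOPT_POSTFIELDS exactly as $postData is in the script. Passing the array itself to CURLOPT_POSTFIELDS would make cURL send the request as multipart/form-data instead, so the string form is kept here.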
Thanks for your wonderful script.
It really did the trick for me.
I'm facing a problem where I have some data that is divided across several pages.
Do you know how I can get the last page using cURL?
It uses javascript.
@Embedded - you might want to be more specific? The cURL library will certainly work whether or not javascript is on the page, right?
Could you please post a code snippet that mimics the curl followlocation option?
I have problems using the FOLLOWLOCATION option on my web hosting.
Thanks
@Embedded - it's set like any other option -
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
If you're having problems with it, I would imagine it would have something to do with the headers being set.
I'm having a problem with it, so I tried to redirect manually by using the header function.
The problem is that I need to keep my PHP script running and to be able to parse the redirected URL data.
How can I do that?
@Embedded - I'm not sure I can help you with that.
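For readers with the same question, here is a rough sketch of following redirects by hand when CURLOPT_FOLLOWLOCATION is unavailable. Both function names are hypothetical. The sketch assumes the server sends an absolute URL in its Location header, and it does not switch a POST back to a GET after a redirect the way a browser would, so the handle's method should be set before calling it.

```php
<?php
// Hypothetical helper: pull the Location header out of a raw header block.
function findRedirectLocation(string $rawHeaders): ?string
{
    if (preg_match('/^Location:\s*(\S+)/mi', $rawHeaders, $m)) {
        return trim($m[1]);
    }
    return null;
}

// Hypothetical helper: fetch $url on an existing cURL handle, following
// up to $maxRedirects redirects manually, and return the final body.
function fetchFollowingRedirects($ch, string $url, int $maxRedirects = 5): string
{
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_HEADER, true);   // include headers in the result
    $body = '';
    for ($i = 0; $i <= $maxRedirects; $i++) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $response   = (string) curl_exec($ch);
        $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
        $status     = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $headers    = substr($response, 0, $headerSize);
        $body       = substr($response, $headerSize);
        $next       = findRedirectLocation($headers);
        if ($status < 300 || $status >= 400 || $next === null) {
            break;      // not a redirect: $body holds the final page
        }
        $url = $next;   // assumes an absolute URL in the Location header
    }
    return $body;
}
```

Because the script keeps running and receives the final body as a string, the redirected page can be parsed in the same way as any other response, which PHP's header function cannot offer since it only redirects the visitor's browser.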
Awesome post, saved me a boat load of time. Thank you thank you thank you....
This comment has been removed by a blog administrator.
@dandoen - I went to the site you listed and didn't see a login page. But that said, I think all the advice I can give you is to double-check the names of the form fields and their values. That's the portion of my script that you would need to customize specifically for your use.
Misha, there's indeed no login page, but that's why I had put it in quotes :)
And thanks for the advice, I had indeed missed a form variable.
As a side note, I would appreciate it if you could delete the url I mentioned in my previous comment.
@dandoen - I couldn't remove just the url, so I removed your comment.
Thanks a lot .. was very useful .. concise and to the point
:)
Awesome - you saved me a whole day of programming.
I also learned a lot about asp.net scraping! Thanks!!
Extremely helpful. Thank you so much for sharing this code! :)
Surprised no one else has mentioned it yet, but I had to change $cookies to $cookieFile and I had to declare $regs with $regs=array(). Still trying to get it to work for me.
I have the same problem andr3w321
Not working for me as well. Had the same problems.
What if the __EVENTVALIDATION variable isn't present in the initial load? I'm getting an error as a result:
Validation of viewstate MAC failed. If this application is hosted by a Web Farm or cluster, ensure that configuration specifies the same validationKey and validation algorithm. AutoGenerate cannot be used in a cluster.
Has anyone experienced this?
Very nice script.... I needed to add an HTTP User Agent to get it to work. For example
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:11.0) Gecko/20100101 Firefox/11.0' );
Another idea is to urlencode parameters in the script at lines 97 to 101 when parameter names and values contain '&' for example.
Works for me, thank you!
Hi, thanks for the useful script ... but it doesn't work for me :( I think this is related to the fact that it is on SSL ... but I'm not sure ... ?!?
I get this :
Server Error in '/SSL/Appl' Application.
Object reference not set to an instance of an object.
it doesn't get __VIEWSTATE !
curl_setopt($ch, CURLOPT_URL, 'https://www.euro-cat24.ch/SSL/Appl/login.aspx');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
$data=curl_exec($ch);
echo $data;
Any idea ?
MAN! I wish I'd run into this a year ago. Excellent use of cURL commands. CnP'd some of your sequences and ran. I must have had like 4 Ah-HA moments after I saw the stdout. I'm going to work with the preg_match() function more. I must have at least 5 string match functions written. By all appearances, preg_match() might make them obsolete. Oh well. Should trim down my code a little ;)
Thanks Misha!
Just wanted to say THANK YOU a lot!!! You truly deserve this: http://onemillionthankyou.com/
I found a small problem that I could not understand at first, and it got me stuck.
I also had to add the following to set the SSL version and get some sort of error back.
curl_setopt($ch, CURLOPT_SSLVERSION,3);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_STDERR, $handle);
I am still stuck on login as I get 302 page not found when trying to login.
Expect: 100-continue
< HTTP/1.1 100 Continue
< HTTP/1.1 302 Found
< Date: Wed, 2012 14:05:02 GMT
< Server: Microsoft-IIS/6.0
< P3P: CP="CAO OUR"
< SRV: 30
< X-Powered-By: ASP.NET
< X-AspNet-Version: 4.0
< Location: http://example.com/Error/ErrorPage.aspx?aspxerrorpath=/default.aspx
< Cache-Control: private
< Content-Type: text/html; charset=utf-8
< Content-Length: 193
* HTTP error before end of send, stop sending
Any help would be great
Misha please contact me for paid assistance on scraping an ASP.Net page. Thanks.
Hi..
I read your blog and wrote code according to it, but it's not working in my case. So please help.
I am working on scraping data from an aspx page that has pagination based on javascript:__doPostBack().
Thanks
i am using this code:
$url = "http://www.riogrande.com/Category/Findings-and-Finished-Jewelry/132/Bails-and-Enhancers/472";
$file=file_get_contents($url);
preg_match("#.*?#mis", $file, $arr_viewstate);
$viewstate = urlencode($arr_viewstate[1]);
$eventvalidation = urlencode($arr_viewstate[2]);
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7'", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 1120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_POST => true,
CURLOPT_VERBOSE => true,
CURLOPT_POSTFIELDS => '__EVENTTARGET='.urlencode('ctl00$ContentPlaceHolderBody$SearchPageNavigationTop$rptPager$ctl01').'&__EVENTARGUMENT='.urlencode('').'&__VIEWSTATE='.$viewstate.'&__EVENTVALIDATION='.$eventvalidation.'&__LASTFOCUS='.urlencode(''));
$ch = curl_init($url);
curl_setopt_array($ch,$options);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
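A link that calls javascript:__doPostBack(target, argument) works by filling the hidden __EVENTTARGET and __EVENTARGUMENT fields and submitting the form, so a script can imitate the click by posting those two fields along with the page's other hidden values. The sketch below only builds the POST body. The helper name buildPostBackBody and the sample values are hypothetical, and a real request would most likely also need the cookies from the initial GET.

```php
<?php
// Hypothetical helper: build the POST body that a
// __doPostBack($eventTarget, $eventArgument) link would submit.
function buildPostBackBody(array $hiddenFields, string $eventTarget, string $eventArgument = ''): string
{
    $fields = $hiddenFields;                    // __VIEWSTATE, __EVENTVALIDATION, ...
    $fields['__EVENTTARGET']   = $eventTarget;  // control that "raised" the postback
    $fields['__EVENTARGUMENT'] = $eventArgument;
    return http_build_query($fields, '', '&', PHP_QUERY_RFC3986);
}

// Placeholder hidden values; in practice these come from the page's HTML.
$hidden = [
    '__VIEWSTATE'       => '/wEPDwUKLTEx',
    '__EVENTVALIDATION' => '/wEWAgKc',
];

$body = buildPostBackBody(
    $hidden,
    'ctl00$ContentPlaceHolderBody$SearchPageNavigationTop$rptPager$ctl01'
);
echo $body, "\n";
```

The string can then be handed to CURLOPT_POSTFIELDS. Each page of results returns fresh hidden values, so they need to be re-extracted before requesting the next page.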
I'm working on a library to make doPostBack form actions easier. Let me know what you think.
Good code, thank you very much!
Thank you so so much, this is exactly what i needed!
Hi!
This script helped me a lot!
Thanks!!
I have a small problem with another ASP page.
I need to submit a POST form to the login page, but to get to the login page, I have to go through a previous aspx page.
Even when I type the login url, it redirects me to the previous page.
Can you or anyone help me?
Rafael
Hi, I want to scrape the search results from http://www.bcad.org/ClientDB/PropertySearch.aspx?cid=1. I did not need the login step like you had created, so I wrote the code below, but it shows a blank page. Could anybody help me out?
============================
I love you so much. This saved me literally dozens of hours.