Friday, December 11, 2009

Screen-scraping an ASP.NET application in PHP

I was recently asked to help a friend develop a PHP script that authenticates against an ASP.NET web application and scrapes a small piece of periodically changing data from a secured page. PHP's cURL library works well for this task, but a few aspects of ASP.NET applications need to be understood first:

  • ASP.NET applications that use forms-based authentication typically process login input and return an encrypted authentication token in the form of a cookie to the browser. Capturing this cookie properly is necessary to automate the process.

  • ASP.NET also employs a mechanism known as ViewState that must be respected. The mechanism is provided for developers who need to maintain the illusion of a stateful environment across multiple postbacks to a page. This allows, for example, a user to enter a value in a textbox, click a submit button, and have that value remembered in the textbox as the page is refreshed. This kind of functionality is typically complicated to build on other platforms, but ASP.NET is engineered to support it natively.

    It does so by sending the browser, with each participating page, a special hidden <input> tag named "__VIEWSTATE". The encoded (and, depending on configuration, encrypted) value is then automatically included as a POST parameter when the page is submitted back to the server. The ASP.NET application server processes that value to restore user interface control values and properties as it reconstructs the "new" page it sends back to the browser (with an updated __VIEWSTATE tag).

    So a PHP script that automates posting to any ASP.NET form must respect this process and properly capture - and resubmit - the appropriate __VIEWSTATE value.

  • ASP.NET has additional mechanisms to validate page posts, supplying additional encrypted information in a hidden <input> named "__EVENTVALIDATION". If this value is present, the PHP script must capture it as well.
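
As a rough illustration of the hidden fields described above, the sketch below collects every hidden <input> from a page with PHP's DOM extension instead of regular expressions. The helper name extractHiddenFields and the sample markup are assumptions made up for this example; they are not part of the script that follows.

```php
<?php
// Hypothetical helper (not part of the original script): collect every
// hidden <input> on a page into a name => value array.
function extractHiddenFields($html)
{
    $fields = array();
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so keep libxml parse warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('input') as $input) {
        $isHidden = strtolower($input->getAttribute('type')) === 'hidden';
        $name = $input->getAttribute('name');
        if ($isHidden && $name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Sample markup shaped like typical ASP.NET output (the values are made up).
$sampleHtml = '<html><body><form method="post" action="login.aspx">'
    . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTEx" />'
    . '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgKd" />'
    . '<input type="text" name="txtusername" />'
    . '</form></body></html>';

$hidden = extractHiddenFields($sampleHtml);
echo $hidden['__VIEWSTATE'], "\n";
echo $hidden['__EVENTVALIDATION'], "\n";
```

Collecting all hidden fields at once also picks up any extra tokens a particular site adds to its forms, which would otherwise each need their own regular expression.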

Respecting __VIEWSTATE and __EVENTVALIDATION when submitting a form typically requires two calls to an ASP.NET page. The first is a simple GET, whose response can be parsed with regular expressions to pull out the appropriate hidden <input> values. The second is the POST, which submits the desired <input> values along with the previously retrieved __VIEWSTATE and __EVENTVALIDATION. Performing these two calls against a login form with valid account information returns the authentication cookie, which must then be sent with each subsequent request for a secured page.
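
To make the second call more concrete, here is a minimal sketch of how the POST body might be assembled once the hidden values are in hand. The helper name buildPostBody and all sample values are hypothetical. It relies on PHP's http_build_query with PHP_QUERY_RFC3986, which percent-encodes both names and values in the same way rawurlencode does.

```php
<?php
// Hypothetical helper: merge the hidden ASP.NET fields with the form input
// and encode everything as an application/x-www-form-urlencoded body.
function buildPostBody(array $hiddenFields, array $formFields)
{
    // With string keys, later arrays win, so form input overrides any
    // hidden field that happens to share a name.
    $all = array_merge($hiddenFields, $formFields);

    // PHP_QUERY_RFC3986 encodes spaces as %20, matching rawurlencode().
    return http_build_query($all, '', '&', PHP_QUERY_RFC3986);
}

// Made-up values standing in for what the first GET would return.
$hiddenFields = array(
    '__VIEWSTATE'       => '/wEPDw+abc=',
    '__EVENTVALIDATION' => '/wEWAg==',
);

// Made-up login input; note the "&" and the space in the password.
$formFields = array(
    'txtusername' => 'me@example.com',
    'txtpassword' => 'p&ss word',
    'btnlogin'    => 'Login',
);

$body = buildPostBody($hiddenFields, $formFields);
echo $body, "\n";
```

The resulting string can be handed to CURLOPT_POSTFIELDS as is. Base64 values such as __VIEWSTATE often contain "+", "/" and "=", so encoding them is not optional.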

The following script uses the cURL library to perform such a login and accesses a secured page. The variables at the top of the script may be modified to match the desired site and account information.

<?php
/************************************************
* ASP.NET web site scraping script;
* Developed by MishaInTheCloud.com
* Copyright 2009 MishaInTheCloud.com. All rights reserved.
* The use of this script is governed by the CodeProject Open License
* See the following link for full details on use and restrictions.
* http://www.codeproject.com/info/cpol10.aspx
*
* The above copyright notice must be included in any reproductions of this script.
************************************************/

/************************************************
* values used throughout the script
************************************************/
// urls to call - the login page and the secured page
$urlLogin = "http://www.server.com/blahblah/login.aspx";
$urlSecuredPage = "http://www.server.com/blahblah/securedPage.aspx";

// POST names and values to support login
$nameUsername='txtusername'; // the name of the username textbox on the login form
$namePassword='txtpassword'; // the name of the password textbox on the login form
$nameLoginBtn='btnlogin'; // the name of the login button (submit) on the login form
$valUsername ='myUsername'; // the value to submit for the username
$valPassword ='myPassword'; // the value to submit for the password
$valLoginBtn ='Login'; // the text value of the login button itself

// the path to a file we can read/write; this will
// store cookies we need for accessing secured pages
// (named $cookieFile to match the curl_setopt calls further down)
$cookieFile = 'someReadableWritableFileLocation\cookie.txt';

// regular expressions to parse out the special ASP.NET
// values for __VIEWSTATE and __EVENTVALIDATION
$regexViewstate = '/__VIEWSTATE" value="([^"]*)"/i';
$regexEventVal = '/__EVENTVALIDATION" value="([^"]*)"/i';


/************************************************
* utility function: regexExtract
* use the given regular expression to extract
* a value from the given text; $regs will
* be set to an array of all group values
* (assuming a match) and the nthValue item
* from the array is returned as a string
************************************************/
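function regexExtract($text, $regex, &$regs, $nthValue)
{
    // $regs is taken by reference so the caller receives the matched
    // groups and does not have to declare the variable beforehand
    if (preg_match($regex, $text, $regs)) {
        $result = $regs[$nthValue];
    }
    else {
        $result = "";
    }
    return $result;
}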



/************************************************
* initialize a curl handle; we'll use this
* handle throughout the script
************************************************/
$ch = curl_init();


/************************************************
* first, issue a GET call to the ASP.NET login
* page. This is necessary to retrieve the
* __VIEWSTATE and __EVENTVALIDATION values
* that the server issues
************************************************/
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
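// NOTE: this turns off SSL certificate verification, which is convenient
// for testing but unsafe against a production HTTPS site
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);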
$data=curl_exec($ch);

// from the returned html, parse out the __VIEWSTATE and
// __EVENTVALIDATION values
$viewstate = regexExtract($data,$regexViewstate,$regs,1);
$eventval = regexExtract($data, $regexEventVal,$regs,1);


/************************************************
* now issue a second call to the Login page;
* this time, it will be a POST; we'll send back
* as post data the __VIEWSTATE and __EVENTVALIDATION
* values the server previously sent us, as well as the
* username/password. We'll also set up a cookie
* jar to retrieve the authentication cookie that
* the server will generate and send us upon login.
************************************************/
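// every name and value is url-encoded so that characters such as
// '&', '=' or '$' in field names or credentials cannot break the body
$postData = '__VIEWSTATE='.rawurlencode($viewstate)
    .'&__EVENTVALIDATION='.rawurlencode($eventval)
    .'&'.rawurlencode($nameUsername).'='.rawurlencode($valUsername)
    .'&'.rawurlencode($namePassword).'='.rawurlencode($valPassword)
    .'&'.rawurlencode($nameLoginBtn).'='.rawurlencode($valLoginBtn)
    ;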

curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);

$data = curl_exec($ch);


/************************************************
* with the authentication cookie in the jar,
* we'll now issue a GET to the secured page;
* we set curl's COOKIEFILE option to the same
* file we used for the jar before to ensure the
* authentication cookie is sent back to the
* server
************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);

$data = curl_exec($ch);

// at this point the secured page may be parsed for
// values, or additional POSTS made to submit parameters
// and retrieve data. For this sample, we'll just
// echo the results.
echo $data;



/************************************************
* that's it! Close the curl handle
************************************************/
curl_close($ch);


?>
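
The script relies on CURLOPT_FOLLOWLOCATION, which some hosting setups refuse to enable (older PHP builds with open_basedir or safe_mode set, for example). As a hedged sketch of one workaround, redirects can be followed by hand: ask cURL to include the response headers, read the Location header, and request that URL next. Both function names below are hypothetical, and the loop assumes the server sends absolute URLs in Location.

```php
<?php
// Hypothetical helper: return the Location header value from a raw
// HTTP header block, or null when there is none.
function findRedirectLocation($rawHeaders)
{
    if (preg_match('/^Location:[ \t]*(\S+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical stand-in for CURLOPT_FOLLOWLOCATION: fetch $url on the given
// handle and follow up to $maxRedirects Location headers manually.
function fetchFollowingRedirects($ch, $url, $maxRedirects = 5)
{
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_HEADER, true); // headers arrive with the body

    $body = '';
    for ($i = 0; $i <= $maxRedirects; $i++) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $response = curl_exec($ch);
        if ($response === false) {
            return false;
        }

        // Split the raw response into its header block and its body.
        $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
        $headers = substr($response, 0, $headerSize);
        $body = substr($response, $headerSize);

        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $next = findRedirectLocation($headers);
        if ($status < 300 || $status >= 400 || $next === null) {
            break; // not a redirect, so this is the final page
        }

        $url = $next; // assumption: Location holds an absolute URL
        curl_setopt($ch, CURLOPT_HTTPGET, true); // redirects are fetched with GET
    }

    curl_setopt($ch, CURLOPT_HEADER, false);
    return $body;
}

// The header parser can be tried without any network access:
$sampleHeaders = "HTTP/1.1 302 Found\r\n"
    . "Location: http://www.server.com/blahblah/securedPage.aspx\r\n"
    . "Content-Length: 0\r\n\r\n";
echo findRedirectLocation($sampleHeaders), "\n";
```

Because the same handle and cookie jar are reused for every hop, an authentication cookie set during a redirect should still end up in the jar.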

41 comments:

  1. Thanks for your wonderful script.
    It really did the trick for me.

    I'm facing a problem where my data is divided across several pages.

    Do you know how I can get the last page using cURL?
    It uses javascript.

  2. @Embedded - you might want to be more specific? The cURL library will certainly work whether or not javascript is on the page, right?

  3. Could you please post a code snippet that mimics the curl followlocation option?

    I have problems using the FOLLOWLOCATION option on my web hosting.

    Thanks

  4. @Embedded - it's set like any other option -

    curl_setOpt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

    If you're having problems with it, I would imagine it would have something to do with the headers being set.

  5. I'm having a problem with it, so I tried to redirect manually by using the header function.

    The problem is that I need to keep my PHP script running and to be able to parse the redirected URL data.

    How can I do that?

  6. @Embedded - I'm not sure I can help you with that.

  7. Awesome post, saved me a boat load of time. Thank you thank you thank you....

  8. This comment has been removed by a blog administrator.

  9. @dandoen - I went to the site you listed and didn't see a login page. But that said, I think all the advice I can give you is to double-check the names of the form fields and their values. That's the portion of my script that you would need to customize specifically for your use.

  10. Misha, there's indeed no login page, but that's why I had put it in quotes :)

    And thanks for the advice, I had indeed missed a form variable.

    As a side note, I would appreciate it if you could delete the URL I mentioned in my previous comment.

  11. @dandoen - I couldn't remove just the url, so I removed your comment.

  12. Thanks a lot .. was very useful .. concise and to the point

    :)

  13. Awesome - you saved me a whole day of programming.

    I also learned a lot about asp.net scraping! Thanks!!

  14. Extremely helpful. Thank you so much for sharing this code! :)

  15. surprised no one else has mentioned yet, but I had to change $cookies to $cookieFile and I had to declare $regs with $regs=array(). Still trying to get it to work for me.

  16. I have the same problem andr3w321

  17. Not working for me as well. Had the same problems.

  18. What if the __EVENTVALIDATION variable isn't present in the initial load? I'm getting an error as a result:

    Validation of viewstate MAC failed. If this application is hosted by a Web Farm or cluster, ensure that configuration specifies the same validationKey and validation algorithm. AutoGenerate cannot be used in a cluster.

    Has anyone experienced this?

  19. Very nice script. I needed to add an HTTP User-Agent to get it to work. For example:

    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:11.0) Gecko/20100101 Firefox/11.0' );

  20. Another idea is to urlencode parameters in the script at lines 97 to 101 when parameter names and values contains '&' for example.

    Works for me, thank you!

  21. Hi, thanks for the useful script, but it doesn't work for me :( I think this is related to the fact that the site uses SSL, but I'm not sure.

    I get this :
    Server Error in '/SSL/Appl' Application.
    Object reference not set to an instance of an object.

    it doesn't get __VIEWSTATE !

    curl_setopt($ch, CURLOPT_URL, 'https://www.euro-cat24.ch/SSL/Appl/login.aspx');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    $data=curl_exec($ch);
    echo $data;

    Any idea ?

  22. MAN! I wish I'd run into this a year ago. Excellent use of cURL commands. CnP'd some of your sequences and ran. I must have had like 4 Ah-HA moments after I saw the stdout. I'm going to work with the preg_match() function more. I must have at least 5 string match functions written. By all appearances, preg_match() might make them obsolete. Oh well. Should trim down my code a little ;)

    Thanks Misha!

  23. Just wanted to say THANK YOU a lot!!! You truly deserve this: http://onemillionthankyou.com/

  24. At first I was stuck on a small problem I could not understand.

    I also had to add the following to set the SSL version and get some sort of error back.

    curl_setopt($ch, CURLOPT_SSLVERSION,3);
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_STDERR, $handle);

    I am still stuck on login as I get 302 page not found when trying to login.

    Expect: 100-continue

    < HTTP/1.1 100 Continue
    < HTTP/1.1 302 Found
    < Date: Wed, 2012 14:05:02 GMT
    < Server: Microsoft-IIS/6.0
    < P3P: CP="CAO OUR"
    < SRV: 30
    < X-Powered-By: ASP.NET
    < X-AspNet-Version: 4.0
    < Location: http://example.com/Error/ErrorPage.aspx?aspxerrorpath=/default.aspx
    < Cache-Control: private
    < Content-Type: text/html; charset=utf-8
    < Content-Length: 193
    * HTTP error before end of send, stop sending

    Any help would be great

  25. Misha please contact me for paid assistance on scraping an ASP.Net page. Thanks.

  26. Hi..

    I read your blog and wrote code according to this, but it's not working in my case. So please help.

    I am working on scraping data from an aspx page that has pagination based on javascript:__doPostBack().



    Thanks

  27. i am using this code:

    $url = "http://www.riogrande.com/Category/Findings-and-Finished-Jewelry/132/Bails-and-Enhancers/472";
    $file=file_get_contents($url);
    preg_match("#.*?#mis", $file, $arr_viewstate);
    $viewstate = urlencode($arr_viewstate[1]);
    $eventvalidation = urlencode($arr_viewstate[2]);

    $options = array(
    CURLOPT_RETURNTRANSFER => true, // return web page
    CURLOPT_HEADER => false, // don't return headers
    CURLOPT_ENCODING => "", // handle all encodings
    CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7'", // who am i
    CURLOPT_AUTOREFERER => true, // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
    CURLOPT_TIMEOUT => 1120, // timeout on response
    CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
    CURLOPT_POST => true,
    CURLOPT_VERBOSE => true,
    CURLOPT_POSTFIELDS => '__EVENTTARGET='.urlencode('ctl00$ContentPlaceHolderBody$SearchPageNavigationTop$rptPager$ctl01').'&__EVENTARGUMENT='.urlencode('').'&__VIEWSTATE='.$viewstate.'&__EVENTVALIDATION='.$eventvalidation.'&__LASTFOCUS='.urlencode(''));
    $ch = curl_init($url);
    curl_setopt_array($ch,$options);
    $result = curl_exec($ch);
    curl_close($ch);

    echo $result;

  28. I'm working on a library to make doPostBack form actions easier. Let me know what you think.

  29. Good code, thanks you very much!

  30. Thank you so so much, this is exactly what i needed!

  31. Hi!
    This script helped me a lot!
    Thanks!!
    I have a small problem with another ASP page.
    I need to submit a POST form to the login page, but to get to the login page, I have to go through a previous aspx page.
    Even when I type the login URL, it redirects me to the previous page.
    Can you or anyone help me?
    Rafael

  32. Hi, I want to scrape the data that http://www.bcad.org/ClientDB/PropertySearch.aspx?cid=1 returns when a search is run. I did not need the login you created, so I wrote the code below, but it shows a blank page. Could anybody help me out?

    ============================

  33. I love you so much. This saved me literally dozens of hours.

  34. I had to change $cookies to $cookieFile and I had to declare $regs with $regs=array() (added as last line before the utility function).

    in my case, i did not need the GET on the second page so i commented out that section.

    then after
    // at this point the secured page may be parsed for
    // values, or additional POSTS made to submit parameters
    // and retrieve data. For this sample, we'll just
    // echo the results.

    I added (note, i am using $postdata array instead of the original method):

    $postdata=array("key1"=>"value1" /* etc.*/);

    // from the returned html, parse out the __VIEWSTATE and
    // __EVENTVALIDATION values
    $viewstate = regexExtract($data,$regexViewstate,$regs,1);
    $postdata["__VIEWSTATE"] = $viewstate;

    $eventval = regexExtract($data, $regexEventVal,$regs,1);
    $postdata["__EVENTVALIDATION"] = $eventval;

    curl_setOpt($ch, CURLOPT_POST, TRUE);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query( $postdata));
    curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile); /* 20130825 this originally had a different cookie variable name*/

    $data = curl_exec($ch);

    echo $data;

  35. Thank you for this well written description and for including the code. You just saved my team a LOT of time, as well as helping make sense of why what we were doing previously didn't work.

  36. Not working for me.
    I am trying to log in to a page that also has a "ctl00$ContentPlaceHolder1$hfrandomtoken" variable passed in the POST data.

    Any idea? Please help me out, I am stuck on this.

  37. I am still getting the login page when calling the secured page.
    I passed the extra ctl00$ContentPlaceHolder1$hfrandomtoken key/value pair in the POST field params.
    Any idea? Please help me.

  38. I want to log in to a page with a captcha.
    How do I do it?

    Replies
    1. Hi there. The short answer is - you don't. The purpose of a captcha is to prevent this kind of automation by ensuring a real human is entering login information. You could try to write some code that would scan and interpret a captcha image, but perhaps instead consider contacting the site owners and seeing if they support other means for providing data feeds from their site.

  39. Thanks Misha, it really works for me. You saved me a lot of time.

  40. Awsome.. thanks you very much!
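
Several of the comments above ask about pages whose paging links call javascript:__doPostBack(). That function sets two more hidden fields, __EVENTTARGET and __EVENTARGUMENT, and then submits the form, so one possible approach is to send those two values along with __VIEWSTATE and __EVENTVALIDATION in an ordinary POST. The sketch below only shows how the values might be pulled from a link and encoded. The helper name parseDoPostBack, the control name, and the hidden values are all made up for illustration.

```php
<?php
// Hypothetical helper: pull the two arguments out of an href such as
// javascript:__doPostBack('ctl00$Pager$ctl01','')
function parseDoPostBack($href)
{
    $pattern = '/__doPostBack\(\s*\'([^\']*)\'\s*,\s*\'([^\']*)\'\s*\)/';
    if (preg_match($pattern, $href, $m)) {
        return array(
            '__EVENTTARGET'   => $m[1],
            '__EVENTARGUMENT' => $m[2],
        );
    }
    return null; // not a __doPostBack link
}

// A made-up paging link of the kind ASP.NET pager controls emit.
$href = 'javascript:__doPostBack(\'ctl00$Pager$ctl01\',\'\')';
$event = parseDoPostBack($href);

// Made-up hidden values standing in for those parsed from the current page.
$hiddenFields = array(
    '__VIEWSTATE'       => '/wEPDwUKLTEx',
    '__EVENTVALIDATION' => '/wEWAgKd',
);

// Encode everything for CURLOPT_POSTFIELDS; the "$" in the control name
// becomes %24 so it survives the trip intact.
$pagePostBody = http_build_query(
    array_merge($hiddenFields, $event),
    '',
    '&',
    PHP_QUERY_RFC3986
);
echo $pagePostBody, "\n";
```

Each response carries fresh __VIEWSTATE and __EVENTVALIDATION values, so they would need to be parsed again from every page before requesting the next one.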

