• Easy web scraping with PHP

    Feb 17 2008

    Web scraping is a technique of web development where you load a web page and "scrape" the data off the page to be used elsewhere. It's not pretty, but sometimes scraping is the only way to access data or content from a web site that doesn't provide RSS or an open API.

    I'm not going to discuss the legal aspects of scraping in depth; in some situations it may be considered copyright infringement. However, there are also perfectly legal reasons to scrape, such as when you have permission.

    To make things really easy, we're going to let the power of regular expressions do all the work for us. If you're not familiar with regular expressions, you may want to google for a tutorial. Here is the documentation for PHP regular expression syntax.

    First, we start off by loading the HTML using file_get_contents. Next, we use preg_match_all with a regular expression to turn the data on the page into a PHP array.

    This example will demonstrate scraping this web site's blog page to extract the most recent blog posts. This is just for demo purposes - of course, the RSS feed is much better suited for this.

    // get the HTML
    $html = file_get_contents("http://www.thefutureoftheweb.com/blog/");
    

    Here is what the HTML looks like for the blog posts:

    <ul id="main">
        <li>
            <h1><a href="[link]">[title]</a></h1>
            <span class="date">[date]</span>
            <div class="section">
                [content]
            </div>
        </li>
    </ul>
    

    So we will use a regular expression that looks for all the li elements and captures the content using parentheses at the appropriate places (link, title, date and content).

    preg_match_all(
        '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
        $html,
        $posts, // will contain the blog posts
        PREG_SET_ORDER // formats data into an array of posts
    );
    
    foreach ($posts as $post) {
        $link = $post[1];
        $title = $post[2];
        $date = $post[3];
        $content = $post[4];
    
        // do something with data
    }
    

    There's a lot going on inside that regular expression, but there are really only a few "tricks" that are used. Anytime I want to say "skip over whatever is between" I use .*?. And any time I want to say "match whatever is in here" I use (.*?). And lastly, the s at the end tells PHP to allow the dot . to match newlines. That's about all there is to it.

    The regular expression will only match blog posts, because they are the only <li> elements that contain an <h1>, <span class="date"> and <div class="section">.
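
    To see the pattern in action without fetching a live page, here is a minimal sketch that runs the same regular expression against a small inline sample. The sample markup, links and titles are made up for illustration; they only assume the structure shown above.

    ```php
    <?php
    // Hypothetical sample markup that mirrors the blog structure shown above.
    $html = '<ul id="main">
        <li>
            <h1><a href="/blog/1">First post</a></h1>
            <span class="date">Feb 17 2008</span>
            <div class="section">Hello world</div>
        </li>
        <li>
            <h1><a href="/blog/2">Second post</a></h1>
            <span class="date">Feb 18 2008</span>
            <div class="section">More text</div>
        </li>
    </ul>';

    // Same pattern as in the post: (.*?) captures, .*? skips, s lets . match newlines.
    $pattern = '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s';

    // preg_match_all returns the number of full matches it found.
    $count = preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

    foreach ($posts as $post) {
        // $post[0] is the whole match; $post[1]..$post[4] are the captured groups.
        echo $post[2], ' (', $post[3], ') -> ', $post[1], "\n";
    }
    ```

    Running the pattern on a saved sample like this is a cheap way to notice when a change in the page structure has broken it.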

    Web scraping is highly unreliable: if the HTML structure were to change, this code would break instantly. However, it's often quite easy to write this code, and it usually produces a perfectly usable hack solution.

  • Comments

    1. Perry at 5:36pm on March 17, 2008

    This is a perfect tutorial for scraping, thanks, it's a big help!

    2. Alan at 11:23pm on May 20, 2008

    Nice article! I was originally planning to write a small scraper for my web app in PHP or RoR, but then I came across Feedity ( http://feedity.com ) which made things a lot easier. Feedity generates custom RSS feeds from webpages, and now I just consume the resulting RSS feed in my application. Simple and straight! Check it out sometime!

    3. Tristan at 5:01pm on May 29, 2008

    Hi, thanks a lot for the information; it has really helped with scraping information from a video site. For future reference, other users will want to fix this:

    $link = $post[1]

    swap with

    $link = $post[1];

    Lastly, I was going to ask for your help: if I wanted to use this to get the body content but also needed an additional piece of information, such as the number of results, how would this be achieved?

    Thanks

    4. Jesse Skinner at 7:44am on May 31, 2008

    @Tristan - thanks for the correction! I changed it in the post.

    To answer your question, once preg_match_all has filled the $posts array you can just use count($posts) to see how many posts were found.

    5. Tristan at 10:20am on May 31, 2008

    Thanks Jesse, glad to be of help. That count() call will be helpful, but what I mean is that I need to scrape a different item. Say you were scraping Google: what I need is the number at the top, "Results 1 - 10 of about search of SOMEVALUE". How would I obtain that? Could I just run a preg_match before scraping the main content?

    Thanks Again
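
    One way this could look, as a rough sketch: a single preg_match pulls the one-off value out of the page before the main preg_match_all runs. The sample HTML, the "of about" wording and the variable names below are assumptions made up for illustration, not the markup of any real search page.

    ```php
    <?php
    // Hypothetical page fragment; a real page will almost certainly differ.
    $html = '<div id="stats">Results 1 - 10 of about 1,230</div><ul><li>...</li></ul>';

    $total = null;
    // preg_match stops at the first match, which suits a value that appears once.
    if (preg_match('/of about ([\d,]+)/', $html, $m)) {
        // Strip the thousands separators before converting to an integer.
        $total = (int) str_replace(',', '', $m[1]);
    }
    ```

    The same $html string can then be passed to preg_match_all for the repeated items.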

    6. Tom at 9:35pm on June 27, 2008

    Good post.
    Shouldn't the preg_match_all statement use a backslash escape for each forward slash in the pattern? So it would read:

    preg_match_all(
        '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
        $html,
        $posts, // will contain the blog posts
        PREG_SET_ORDER // formats data into an array of posts
    );

    Thanks

    7. Jesse Skinner at 6:05am on June 28, 2008

    @Tom - Yep, you're right. I've fixed the post code. Thanks!

    8. Sam at 6:04am on July 9, 2008

    Hi,

    Nice and neat article. However, this would only work if the HTML is predictable. I'm trying to scrape the content of ANY website or blog and I find it very difficult. Currently I'm relying only on RSS feeds, but not everyone provides one.

    Have you tried scraping more websites?

    Thanks

    9. Yuriy at 12:11am on October 2, 2008

    Thanks, man! You saved me so much time. This is just a perfect tutorial on web scraping.

    10. Ankit at 3:30am on October 20, 2008

    Hi,

    First of all, thanks for an awesome tutorial. I was trying to tweak this to make a movie showtime listing engine based on Google's results. here's the code i used.

    <html><body>

    <?-php

    &s=$_GET['s];
    &s1=&_GET['s1'];

    echo "<p><i>Search for $s</i></p>";

        $s=urlencode($s);
        $s1=urlencode($s1);


    $html = file_get_contents("http://www.google.com/movies?q=".s."&btnG=Search+Movies&hl=en&near=".s1."");
    preg_match_all('/<table cellpadding=3>.(.*?)</td class=k>/s', $string, $matches),
    print $matches[1];
        $html,
        PREG_SET_ORDER
        print $matches[1];
    );




    }
    else
    {
    ?>

    <form name="form1" id="form1" method="get" action="">
      <div align="center">
        <p>
          <input name="s" type="text" id="s" size="50" />
          <input name="s1" type="text" id="s1" size="50" />
          <input type="submit" name="Submit" value="Search" />
        </p>
      </div>
    </form>

    <p>
      <?php
    }
    ?>
    </p>

    </body></html>

    As you stated, I found that the movie show timings are nested between <table cellpadding=3> and <td class=k> tags. hence we could exploit this for the engine. But the above doesn't seem to work. Could you please help?

    11. Jesse Skinner at 1:55pm on October 20, 2008

    @Ankit - You need to escape the forward slash in the regular expression with a backslash:

    '/<table cellpadding=3>.(.*?)</td class=k>/s'

    instead of:

    '/<table cellpadding=3>.(.*?)</td class=k>/s'

    12. Jesse Skinner at 1:57pm on October 20, 2008

    Oops I guess the backslash got lost which is probably why yours did in your comment.

    let's try this:

    '/<table cellpadding=3>.(.*?)<\/td class=k>/s'

    instead of:

    '/<table cellpadding=3>.(.*?)</td class=k>/s'

    13. Ankit at 8:47am on October 21, 2008

    Hi Jesse,

    Thanks for the prompt reply. Well I changed the code, but it still doesn't work. Would you mind reviewing it, saving the entire code as a php file and running it in a browser? I think this is a good tool, with good applicability. But it's not working, and that's bugging me :)

    Regards,
    Ankit

    14. gav at 8:10pm on November 29, 2008

    Hi, is there any way I can scrape a background image? For example:

    <td width="46" height="49" background="images/0.gif" align="center" valign="top" nowrap>

    On the site it's sometimes images/1.gif or 2 or 5, so can I get this information?

    15. hassan at 8:56am on March 14, 2009

    Thanks for posting such understandable scraping code.
    I have a problem: I want to scrape the links of my site, which is hosted on a local server at "http://192.168.1.100/$sitepreview/marketingmanager.com/". I run the following code:

    <?php
    $siteurl="http://192.168.1.100/$sitepreview/marketingmanager.com/";
    $html=file_get_contents($siteurl);
    ?>
    <ul id="main">
        <li>
            <h1><a href="[link]">[title]</a></h1>
            <span class="date">[date]</span>
            <div class="section">[content]
            </div>
        </li>
    </ul>
    <?php
    preg_match_all('/<a href="(.*?)">(.*?)</a>/s',$html,$posts,PREG_SET_ORDER);
    echo $count_post=count($post);
    foreach ($posts as $post)
    {
        echo $link = $post[1];
        $title = $post[2];
        $date = $post[3];
        $content = $post[4];

        // do something with data
    }
    ?>
    I am getting this error on the web page:
    Warning: file_get_contents(https://192.168.1.100//marketingmanager.com/): failed to open stream: Invalid argument in C:\Inetpub\vhosts\marketingmanager.com\httpdocs\test_urls.php on line 3
    and the count of $post is 0. Could you look into my problem?

    16. Jesse Skinner at 4:35am on March 16, 2009

    @hassan - the problem is the dollar sign $ in your URL string, which PHP thinks is a variable. You can wrap the URL in single quotes to avoid this, ie.:

    $siteurl='http://192.168.1.100/$sitepreview/marketingmanager.com/';
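
    The difference between the two quoting styles can be checked with a short sketch. The value assigned to $sitepreview is made up for the demo; in the code above the variable was never set, so PHP substituted an empty string and produced the double slash seen in the warning.

    ```php
    <?php
    $sitepreview = 'preview';   // hypothetical value, only for this demo

    // Single quotes: the text is taken literally, $sitepreview stays in the URL.
    $single = 'http://192.168.1.100/$sitepreview/marketingmanager.com/';

    // Double quotes: PHP replaces $sitepreview with the variable's value.
    $double = "http://192.168.1.100/$sitepreview/marketingmanager.com/";

    // Double quotes with an escaped dollar sign behave like single quotes here.
    $escaped = "http://192.168.1.100/\$sitepreview/marketingmanager.com/";
    ```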

    17. vince at 1:11pm on March 17, 2009

    Hi,
    Great tutorial, what if the code I am scraping gets bland at one point in the scrape and it becomes hard to decipher one html tag from another?  Please see example below, I am trying to capture the percentage data, but not sure how to ignore the first 4 tds and  zero in on the correct td:

    <tr>
    <td>03/11/09</td>

    <td>3509</td>
    <td>7-13-2</td>
    <td>1-1-0</td>
    <td>8-14-2</td>
    <td>36.36%</td>
    <td>92</td>

    </tr>

    Thanks.

    18. vince at 1:13pm on March 17, 2009

    p.s. I don't control the web pages I am scraping.

    19. zlot at 11:58am on April 16, 2009

    thanx for the tutorial, now i can finally scrap my competitors website ^___^

    20. NomikOS at 1:10am on April 28, 2009

    @Vince: Try:

    <tr>.*?<td>([^<]+)%<\/td>.*?<\/tr>

    This extracts only the percentage data; in your example: 36.36

    @Jesse: Finally I can participate in your blog.
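
    As a minimal sketch, the pattern above can be wrapped in / delimiters with the s modifier and tried against rows shaped like Vince's. The second row is invented so that there is more than one match.

    ```php
    <?php
    // Two sample rows; the first follows Vince's example, the second is made up.
    $html = "<tr>\n<td>03/11/09</td>\n<td>3509</td>\n<td>7-13-2</td>\n<td>36.36%</td>\n<td>92</td>\n</tr>\n"
          . "<tr>\n<td>03/12/09</td>\n<td>3510</td>\n<td>8-14-2</td>\n<td>50.00%</td>\n<td>95</td>\n</tr>";

    // [^<]+ cannot cross a tag, so only a cell whose text ends in % can match.
    preg_match_all('/<tr>.*?<td>([^<]+)%<\/td>.*?<\/tr>/s', $html, $matches);

    // Without PREG_SET_ORDER, $matches[1] lists the first group of every match.
    $percentages = $matches[1];
    ```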

    21. leo at 12:32pm on July 16, 2009

    Hi, I have the following script:

    --------------- START CODE ---------------
    <?php
    function hyperlinkextract($s1,$s2,$s){
      $myarray=array();
      $s1=strtolower($s1);
      $s2=strtolower($s2);
      $L1=strlen($s1);
      $L2=strlen($s2);
      $scheck=strtolower($s);

      do{
      $pos1 = strpos($scheck,$s1);
      if($pos1!==false){
        $pos2 = strpos(substr($scheck,$pos1+$L1),$s2);
        if($pos2!==false){
          $myarray[]=substr($s,$pos1+$L1,$pos2);
          $s=substr($s,$pos1+$L1+$pos2+$L2);
          $scheck=strtolower($s);
          }
            }
      } while (($pos1!==false)and($pos2!==false));
    return $myarray;
    }

    $content = file_get_contents('./sample.htm');
    $myarray = hyperlinkextract("href=\"","\"",$content);

    // Process all the links
    foreach($myarray as $key => $val) {
    echo "<br />".$val."\n";
    }
    ?>

    --------------- END CODE ---------------

    It's working well and captures all links on a given page, but I'm trying, without success, to filter the results to get only links from a specific id or class.

    Also I would like to get links from the current page on "$content" variable... so it should work like "$content = file_get_contents('this.href');" .

    Thanks in advance !
    LEO

    22. Jenni at 11:36am on September 1, 2009

    I often need to scrape our own web pages for legal reasons - review of text version by legal dept. I use biterscripting ( http://www.biterscripting.com ) for that. Take a look at a sample script they have posted at their site at http://www.biterscripting.com/SS_WebPageToText.html.

    That script extracts plain text from a web page. Similarly, script SS_WebPageToCSV extracts a table from a web page, such as stock table.

    Jenni

    23. Joe at 5:45pm on September 3, 2009

    I used regexes in my early days of web scraping, but found they can be fragile. Try a good library instead, like LWP for Perl.

    24. paul at 7:34am on September 12, 2009

    I need help.. I want to add a MORTGAGE RATES for my loan site.. how can i do that? I want to post a rate without the name of the link i copied..would that be possible?

    25. Gerson Jaber at 3:36pm on September 17, 2009

    You can use PHP's DOM extension to do this; see this article:
    http://www.developertutorials.com/tutorials/php/scraping-links-with-php-8-01-05/page7.html
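
    As a rough sketch of the DOM approach (not taken from the linked article), PHP's DOMDocument and DOMXPath classes can pull the same fields out of markup shaped like the blog example in the post. The sample HTML and the array keys are made up for illustration, and the DOM extension is assumed to be enabled.

    ```php
    <?php
    // Hypothetical sample markup following the structure from the post.
    $html = '<ul id="main"><li><h1><a href="/blog/1">First post</a></h1>'
          . '<span class="date">Feb 17 2008</span>'
          . '<div class="section">Hello world</div></li></ul>';

    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $posts = [];
    foreach ($xpath->query('//ul[@id="main"]/li') as $li) {
        // The second argument restricts each query to the current <li>.
        $a    = $xpath->query('.//h1/a', $li)->item(0);
        $date = $xpath->query('.//span[@class="date"]', $li)->item(0);
        $posts[] = [
            'link'  => $a->getAttribute('href'),
            'title' => trim($a->textContent),
            'date'  => trim($date->textContent),
        ];
    }
    ```

    Unlike the regular expression, these queries do not depend on attribute order or whitespace inside the tags, though they still break if the element structure itself changes.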

    26. Gerschel at 9:37pm on September 26, 2009

    Okay, I would like to scrape a large list of links:
    <a href="">""</a>

    Just before and after the links there is a header tag:
    <h3>""</h3>

    I would like to scrape every one into their own variable, not have them all go onto $post[1].

    Basically, my goal is to go through 300 different pages, where all the pages in the directory are named "page1.html", "page2.html", and so on.

    I was able to come up with this so far:

    <? if (!isset($_POST['sub'])) {
    $page_number = 1 ;
    }
    if (isset($_POST['sub'])) {
    $page_number = $_POST['page_numeral'];
    }
    ?>
    <?
    $html = file_get_contents("http://example.com/directory/page$page_number.html");
    ?>

    And further down have a form where I can select the page number as a try and work it.

    <form id="form1" name="form1" method="post" action="">
      <label>page number
      <input type="text" name="page_numeral" id="page_numeral" />
      </label>
      <p>
        <label>
        <input type="submit" name="sub" id="sub" value="Submit" />
        </label>
      </p>
    </form>

    Now that I can control the pages of links, I would like to get one link, go into it, find the <blockquote> that follows a level three header tag that has the same name of the link. Go back to main page. Start on to the next link and do the same thing. Back to main page. Once all links are completed in main page, go to page2.html
    As I do this, I will be saving the <blockquote> into a database. The only problem that I am having is that page1.html may have a slightly different amount of links than page50.html or whatever. 

    Is there a shorthand to say something like:
    '<h3>(.*?)<\/h3><a href=(.*?)>(.*?)<\/a> next link by adding a variable that increments until <h3>.*?<\/h3>

    $link+incrementing variable = $post[incrementing variable]

    27. Gerschel at 12:13am on September 27, 2009

    Okay, there are 880 links, I don't want to write the link part 880 times. Any ideas. By the way, I got it to get the <blockquote> automatically in one swoop. Here's my code, p.s. I am new to all of this, I started sometime earlier this month:

    <? if (!isset($_POST['sub'])) {
    $page_number = 1 ;
    }
    if (isset($_POST['sub'])) {
    $page_number = $_POST['page_numeral'];
    }
    ?>
      <?php
    // get the HTML
    $html = file_get_contents("http://exampleurl.com/directory/page$page_number.html");

    preg_match_all(
        '/<blockquote>.*?<a href="(.*?)>(.*?)<\/a>/s',
        $html,
        $posts, // will contain the blog posts
        PREG_SET_ORDER // formats data into an array of posts
    );

    foreach ($posts as $post) {
        $link = $post[1];
        $a =$post[2];

    echo $link; echo $a;  // do something with data
                      $html1 = file_get_contents("http://exampleurl.com/directory/$a.html");

    preg_match_all(
        '/<blockquote>(.*?)<\/blockquote>/s',
        $html1,
        $posts1, // will contain the blog posts
        PREG_SET_ORDER // formats data into an array of posts
    );

      $quote = $posts1[0][0];

    echo $quote;  // do something with data
    }
    ?>
    <form id="form1" name="form1" method="post" action="">
      <label>page number
      <input type="text" name="page_numeral" id="page_numeral" />
      </label>
      <p>
        <label>
        <input type="submit" name="sub" id="sub" value="Submit" />
        </label>
      </p>
    </form>

    28. Laura Grant at 5:08pm on October 29, 2009

    I need 10 values scraped from this portion of HTML (name, address, etc):

    <div id="leftnav">
      <h1>Charity Rating</h1>
      </div> 
      <div id="sideads">
      <div class="rating"> 
     
        <p><strong>NAME</strong><br />

    ADDRESS<br />

    Memphis,&nbsp;TN&nbsp;38105<br />
    tel: (800) 805-5856<br /> fax: (901) 578-2805<br />
                <a href="javascript:openBrWindow('print=1')">EIN</a>: 351044
    </p>


    <p><a href="mailto:donors@donorexample.org">Contact Email</a><br /> <a href="http://www.donorexample.org" target="_blank" onclick="javascript: pageTracker._trackPageview('/outgoing/5234.htm');">Visit Web Site</a></p>


      </div>

    <div>

    And here is the PHP code I am trying to use... but it's not working. I don't understand the backslash escape issue and that might be the problem?

    $arr = array(10003,10029);

    foreach($arr as $value){
    // get the HTML
    $web = 'http://www.example.org/orgid='.$value;
    echo $web."<br/>";

    $html = file_get_contents($web);

    preg_match_all(


        '/<div id="leftnav"><h1>Charity Rating</h1>.*?<p><strong>(.*?)</strong><br />(.*?)<br />(.*?),&nbsp;(.*?)&nbsp;(.*?)<br />(.*?)<br />(.*?)<br />.*?">EIN</a>: (.*?)</p>.*?<p><a href="(.*?)".*?<a href="(.*?)"\',
        $html,
        $posts,
        PREG_SET_ORDER
    );

    foreach ($posts as $post) {
        $name = $post[1];
        $address = $post[2];
        $city = $post[3];
        $state = $post[4];
        $zip = $post[5];
        $tel = $post[6];
        $fax = $post[7];
        $ein = $post[8];
        $email = $post[9];
        $link = $post[10];
                }

    // Create date stamp
    $dateStamp = strftime("%D %T", time());

    echo $name."|".$address."|".$city."|".$state."|".$zip."|".$tel."|".$fax."|".$ein."|".$email."|".$link."|".$dateStamp."<br/>";

    }

    THANKS in advance for your help... this is really cool script and will really speed up my research :)

    29. Jesse Skinner at 11:07pm on October 29, 2009

    @Laura - yes, it's an escaping issue. Regular expressions start and end with the / slash, like:

    /hello/

    so whenever you need to put in a /, like </strong>, you need to escape it with a \ like:

    /<\/strong>/

    so just go through your regular expression, add a \ before all the /s, and make sure it ends with a / too.

    30. Jesse Skinner at 11:11pm on October 29, 2009

    @Laura - actually you may want to make sure it ends with /s - that 's' means that the dot '.' matches line breaks, and HTML is full of line breaks.
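
    Both points can be checked with a small sketch. The sample string is a shortened, made-up version of Laura's markup. It also shows an alternative that PHP allows: picking a different delimiter such as # so that the slashes need no escaping at all.

    ```php
    <?php
    // Shortened sample with a line break between the two fields.
    $html = "<p><strong>NAME</strong><br />\nADDRESS<br /></p>";

    // Option 1: keep / as the delimiter and escape every / inside the pattern.
    preg_match('/<strong>(.*?)<\/strong><br \/>(.*?)<br \/>/s', $html, $a);

    // Option 2: use # as the delimiter, so / is an ordinary character.
    preg_match('#<strong>(.*?)</strong><br />(.*?)<br />#s', $html, $b);

    // Without the s modifier the dot cannot cross the line break, so nothing matches.
    $withoutS = preg_match('#<strong>(.*?)</strong><br />(.*?)<br />#', $html, $c);

    $name    = $a[1];
    $address = trim($a[2]);   // trim() removes the captured line break
    ```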

    31. Laura Grant at 11:22pm on October 29, 2009

    Thanks Jesse! This has been so helpful -- I was able to debug the code with the '\' escapes and I am getting my output - yay!

    But I have another related question -- if I want just part of a url, like the last four characters, and I am using the other part as an identifying tag and it has '/'s, e.g. http://www.example.com/1345, how do I block those '/'s?
    Cheers!

    32. Jesse Skinner at 12:23am on October 30, 2009

    @Laura - in that case it might look like:

    /http:\/\/www.example.com\/(.{4})/

    The regular expression parser will ignore the \ characters, they will just let it know that the regular expression isn't over yet.
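
    When the fixed part of the pattern is a literal string such as a URL, preg_quote can add the backslashes for you. The sketch below assumes the example URL from Laura's question. Note that preg_quote also escapes the dots, which the hand-written pattern above leaves as "any character".

    ```php
    <?php
    $base = 'http://www.example.com/';

    // The second argument tells preg_quote to escape the / delimiter as well.
    $pattern = '/' . preg_quote($base, '/') . '(.{4})/';

    preg_match($pattern, 'http://www.example.com/1345', $m);
    $id = $m[1];

    // An unescaped dot also matches other characters, so a look-alike host slips through.
    $loose  = preg_match('/http:\/\/www.example.com\/(.{4})/', 'http://wwwXexample.com/1345');
    $strict = preg_match($pattern, 'http://wwwXexample.com/1345');
    ```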

    33. digital at 10:34pm on November 5, 2009

    Hi, I really like your tutorial. I have also found a script which search nth results of google search


    <?php

    $query = urlencode("adobe dreamweaver");

    preg_match_all('/<a title=".*?" href=(.*?)>/', file_get_contents("http://www.google.com/ie?q=" . urlencode($query) . "&num=100&start=1"), $matches);

    print implode("<br>", $matches[1]);

    ?>

    It returns the url form the searches, but i want that it also return the description of those urls.

    34. Sumit at 4:57am on December 8, 2009

    Hi,
    This is an excellent article, probably the simplest one to explain every bit of web scraping. I am trying to use the same approach with the following HTML:
    <table width="100%" border="0" cellspacing="0" cellpadding="3">
    <tr>
    <td style="padding-bottom: 0px; line-height: 20px; padding-top: 6px;" valign="top" width="1%">
    <input type="checkbox" name="job" value="7644383" /><input type="hidden" id="7644383" value="0"></td>
    <td style="padding-bottom: 0px; line-height: 20px;"><a href="http://a.com/details/7644383.html" target="_blank" id="link7644383"  style="text-decoration: underline;"  >Java</a>,<span class="small txt_grey">18th Nov 2009</span><br>Infinity Services<br><div style="line-height: normal;"><span class="txt_green">Hyderabad, 2-4 years, 2.50-3.50 lacs:</span> Total of 2 to 3 years experience with the Java language, object oriented programming, and related concepts such as refactoring1 year experience with SQL and database based programmingFamiliarity with UNIX & Junit.</div><a href="javascript:findSimilar(7644383)" class="txt_blue1">Similar Jobs</a>&nbsp;&nbsp;-&nbsp;&nbsp;<a href="http://a.com/searchresult.html" class="txt_blue1">All Jobs by this Recruiter</a></td>
    </tr><tr>
    <td style="padding-bottom: 0px; line-height: 20px; padding-top: 6px;" valign="top">&nbsp;</td>
    <td style="padding-bottom: 0px; line-height: 20px;">&nbsp;</td></tr>
    <tr>
    <td style="padding-bottom: 0px; line-height: 20px; padding-top: 6px;" valign="top" width="1%">
    <input type="checkbox" name="job" value="7466305" /><input type="hidden" id="7466305" value="0"></td>
    <td style="padding-bottom: 0px; line-height: 20px;"><a href="http://a.com/details/7466305.html" target="_blank" id="link7466305"  style="text-decoration: underline;"  >Java Specialist</a>,<span class="small txt_grey">18th Nov 2009</span><br>Magna Infotech Pvt Ltd<br>
    <div style="line-height: normal;"><span class="txt_green">Chennai, 4-7 years:</span> Java Developer with strong technical developer with focus and expertise in the Java based tools and technologies.  The individual must be proficient in Java development and unit testing</div><a href="javascript:findSimilar(7466305)" class="txt_blue1">Similar Jobs</a>&nbsp;&nbsp;-&nbsp;&nbsp;<a href="http://a.com/searchresult.html" class="txt_blue1">All Jobs by this Recruiter</a>
    </td></tr></table>
    using

    <?php
    $html = file_get_contents("data.html");
    preg_match_all(
        '/<td style="padding-bottom: 0px; line-height: 20px;">(<a href="(.*?.)" .*?.)<\/a>.*?<span class="small txt_grey">(.*?).<\/span><br>.*?<\/span><br>(.*?).<br>.*?<div style="line-height: normal;"><span class="txt_green">(.*?).</span>.*?</span>(.*?).</div>/s',
        $html,
        $posts, // will contain the blog posts
        PREG_SET_ORDER // formats data into an array of posts
    );

    foreach ($posts as $post) {
        $link = $post[1];
        $title = $post[2];
        $date = $post[3];
        $content = $post[4];
    $loc = $post[5];
    $desc= $post[6];
    echo $link."<br>". $title."<br>".$date."<br>".$content."<br>".$loc."<br>".$desc;
        // do something with data
    }
    ?>
    I am getting Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 'p'. I got the result upto <\/span><br>(.*?).<br> but when I add other tags I am getting the warning.
    I am also getting only one record instead of two. Why so?
    Can you please check this and let me know where I am doing wrong?
    Regards

    35. NomikOS at 5:36am on December 8, 2009

    >> "[function.preg-match-all]: Unknown modifier 'p'. "
    A: escape these slashes too: </span>.*?</span>(.*?).</div>',

    Besides:

    1.- (.*?.) and (.*?). are very odd expressions. The second dot seems to be redundant.

    2.- .*? is a lazy (non-greedy) expression, while .* on its own is greedy. Study the difference between the two.

    3.- (<a href="(.*?.)" .*?.) nests one capture group inside another. It will give you something like $post[n] for the outer parentheses and $post[n+1] for the inner parentheses.

    In summary, you need more practice with regular expressions.

    -------------

    Do you want scrape info on each table row?

    36. Sumit at 5:49am on December 8, 2009

    Hi,
    Thanks a lot.
    Can u please provide me scrape info on each table row?
    It will be very helpful.
    I am not good in php, still learning.
    With best regards
    Sumit

    37. NomikOS at 6:14am on December 8, 2009

    If the layout doesn't change, this will work:

    <td style="padding\-bottom\: 0px; line\-height\: 20px;"><a href="(.*?)" target="_blank" id="link\d+"  style="text\-decoration\: underline;"  >(.*?)<\/a>,<span class="small txt_grey">(.*?)<\/span><br>(.*?)<br>\s*<div style="line\-height\: normal;"><span class="txt_green">(.*?)\:<\/span>(.*?)<\/div>

    use var_dump($posts) to check;

    bye.-

    38. Sumit at 6:26am on December 8, 2009

    Hi,
    Thanks a lot. Great Work!!!!!!!!.
    I want to learn this. Where I can learn preg_match_all in detail?
    Regards

    39. NomikOS at 6:38am on December 8, 2009

    Sumit, PHP has some of the best documentation online.
    http://www.php.net/docs.php

    40. Sumit at 4:53am on December 9, 2009

    Hi,
    I am getting no result when I am using
    preg_match_all(
        '/<td style="padding\-bottom\: 0px; line\-height\: 20px;"><a href="(.*?)" target="_blank" id="link\d+"  style="text\-decoration\: underline;" >(.*?)<\/a>,<span class="small txt_grey">/s',
        $html,
        $posts, // will contain the blog posts
        PREG_SET_ORDER // formats data into an array of posts
    );

    foreach ($posts as $post) {
        //$link = $post[1];
        $title = $post[2];
        $date = $post[3];
       
    echo $title."<br>".$date."<br>";
       

    }
    What is wrong?
    Regards

    41. Cwjones at 9:22am on January 12, 2010

    Im looking to scrape a page in my directory rather than writing it out all again.
    I'm using php's $_GET from the URL but scraping doesn't seem to want to do the leg work.
    If i process
    $url = 'http://localhost/~test/result.php?School=$School';
    $page = file_get_contents($url);
    I get nothing, although if i process,
    $url = 'http://localhost/~test/result.php?School=FullSchoolName';
    $page = file_get_contents($url);
    I get a response.
    I'm using the $_GET but like I say, there's no response. Any ideas?

    42. Jesse Skinner at 10:54am on January 12, 2010

    @Cwjones - PHP $variables aren't parsed between single quotes. Try this:

    $url = "http://localhost/~test/result.php?School=$School";

    or this:

    $url = 'http://localhost/~test/result.php?School='.$School;

    43. Cwjones at 11:09am on January 12, 2010

    Thanks for the quick response,

    Seems the double quotes don't want to work and this is my fault for not disclosing the full info but there are more than one $variable eg.

    $url = "http://localhost/~test/result.php?School=$School&Ward=$Ward&Term=$Term";

    So I now cant get my head around the php?School='.$School --- with extras...

    Thanks again

    44. Jesse Skinner at 11:19am on January 12, 2010

    @Cwjones - the period concatenates strings together. You can just do this:

    $url = "http://localhost/~test/result.php?School=".$School."&Ward=".$Ward."&Term=".$Term;

    Try using 'echo' to print out the URL for debugging so you can see what's actually going on.
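
    Concatenation works as long as the values contain no spaces or & characters. As a hedged alternative, http_build_query assembles and URL-encodes the query string in one step. The parameter values below are invented for the demo.

    ```php
    <?php
    // Hypothetical values; in the real script they would come from $_GET.
    $params = [
        'School' => 'Oak Hill & Co',
        'Ward'   => 'North',
        'Term'   => 2010,
    ];

    // The separator is passed explicitly so the result does not depend on php.ini.
    $query = http_build_query($params, '', '&');
    $url   = 'http://localhost/~test/result.php?' . $query;

    echo $url, "\n";   // print the URL for debugging, as suggested above
    ```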

    45. svnlabs at 6:07am on January 27, 2010

    Great Idea!!

    Really scraping is great tool for web developers...

    Why we not utilize it for productive work?

    Thanks
    SV

    46. NomikOS at 7:54am on January 27, 2010

    For better performance, use cURL: http://curl.haxx.se/
    Among other things, it handles HTTP headers, SSL, cookies, proxies, etc.

    47. mike at 2:55pm on February 3, 2010

    I am trying to extract link from html content
    eg.

    <a href="/contents/text/logic">Value</a>
    <a href="/contents/something/logic">Value2</a>

    I am trying to pattern match and extract the Value based on the known path.
    Each link will have a different value depending on the path like "/contents/text/logic"

    Which reg. ex pattern will help me do that

    48. Noxier at 5:52am on February 19, 2010

    Hey Jesse, i am glad and thanks for your simple and meaningful web scrapping tutorial.

    i am newbie here, i tried your tutorial to scrap a web content from Wordpress based blog, i get some trouble for web which have contents like this.

    <ul id="main">
    <li id="comment1"><a href="http://some.url">Links1</a></li>
    <li id="comment2"><a href="http://some.url">Links2</a></li>
    <li id="comment3"><a href="http://some.url">Links3</a></li>
    </ul>

    in this case, every li tag have a different ID name. How to scrape it? and how the regular expression used here?

    thanks for your answer :smile:

    49. Runtest at 9:31pm on February 28, 2010

    First I would love to thank you for the super simple tutorial.
    I could use a little help though. Nothing is echoing back from this script.

    Did I mess up the syntax?

    $pGet = file_get_contents("http://fedcoelectronics.com/detail.tpl?SKU=P250C-10ALX&_fid=35");

    preg_match_all('/<TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
    <TR bgcolor="#adcfe0"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
    <TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
    <TR bgcolor="#adcfe0"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
    <TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
    <TR bgcolor="#adcfe0"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
    <TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
    <TR bgcolor="#adcfe0"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
    <TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>.*?
    <TR bgcolor="#e6f2f8"><TD align=right width="120">.*?<\/TD><TD width="166">(.*?)<\/TD><\/TR>/s',
    $pGet,
    $pInfo,
    PREG_SET_ORDER
    );

    foreach($pInfo as $pInfo) {
    $partNumber = $pInfo[1];

    echo $partNumber;
    }

    50. JEsteban at 1:49am on March 26, 2010

    I tried the file_get_contents function on a website that I'd like to collect data from.  It's a drill-down type database application and I would like to get all of its data into a database so that I can make it searchable with query tools.

    The problem is that file_get_contents is not a browser, so the Ajax functions which load most of the data on the site don't get executed because there is no browser loading the page.  Any ideas?

    51. Jesse Skinner at 11:37am on March 26, 2010

    @JEsteban - you can try using Firebug or Fiddler to see what URLs are being called via Ajax, and then use file_get_contents or cURL to call those URLs and get the data you need.

    52. JEsteban at 1:23pm on March 26, 2010

    Oh that's a good idea.  I didn't know that would work. Thanks I'll try that.

    53. Goha at 10:16am on March 27, 2010

    nice tip... thanks..

    54. I3L1nd at 2:05pm on April 1, 2010

    Wow,

    This really came in handy because I have to update show dates for a clubs website.

    Now I can just pull the show dates from the Myspace.


    Thanks a lot.

    55. Forbes at 12:47pm on April 6, 2010

    awesome tutorial! I am finally able to get my head around data scraping.

    Just a note on your site code... your <div class="tags"> aren't being closed. Ran across this while tweaking the scraped data being spit out from your site.

    56. Andrei at 1:01pm on April 6, 2010

    THANKS A LOT DUDE.
    A VIRTUAL BEER FROM ME TO YOU!

    57. TechRedNeck at 4:50pm on April 22, 2010

    I've been using a scraping software called mozenda which allows you to add in custom code.  Does anyone know if this will work with them?  It's http://www.mozenda.com if anyone thinks they can find it in their support section.  I looked but I'm a dip when it comes to finding things.  Thanks :)

    58. trendzvijay at 9:07pm on April 29, 2010

    Hi, I have an error with the following code. When I try to get the count of the array, it shows zero. Can you help me?

    $h1count = preg_match_all('/<div id="nutritions"><table class="blk_brd" width="270" cellpadding="0" cellspacing="1">
    <tbody><tr><td colspan="3" class="PadLft" height="20"><span style="font\-size\: 20px;">
    <b>Nutrition Facts<\/b><\/span><\/td><\/tr>
    <tr><td colspan="3" class="PadLft" height="15">Serving Size 1 cup <\/td><\/tr>
    <tr><td colspan="3" class="blk" height="1"><img src="(.*?)" width="1" height="8"><\/td>
    <\/tr><tr><td colspan="3" class="PadLft" height="15"><b>Amount Per 1 Serving<\/b><\/td><\/tr>
    <tr><td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Calories<\/b> 120 <\/div><\/td><\/tr>
    <tr><td colspan="3" class="brdtp" align="right"><b>% Daily Value * <\/b><\/td><\/tr>
    <tr><td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Total Fat <\/b>1.0g<\/div>
    <div class="divrht"><b>2<\/b>%<\/div><\/td><\/tr>
    <tr>
            <td width="9%">&nbsp;<\/td>
            <td colspan="2" class="brdtp"><div class="divlft">Saturated Fat 0.0g<\/div>
              <div class="divrht"><b>0<\/b>%<\/div><\/td>
          <\/tr>
          <tr>
            <td width="9%">&nbsp;<\/td>
            <td colspan="2" class="brdtp"><div class="divlft">Trans Fat 0.0g<\/div>
              <div class="divrht"><\/div><\/td>
          <\/tr>
          <tr>
            <td>&nbsp;<\/td>
            <td colspan="2" class="brdtp"><div class="divlft">Polyunsaturated Fat 0.0g<\/div>
              <div class="divrht"><\/div><\/td>
          <\/tr>
          <tr>
            <td>&nbsp;<\/td>
            <td colspan="2" class="brdtp"><div class="divlft">MonoUnsaturated Fat  0.0g<\/div>
              <div class="divrht"><\/div><\/td>
          <\/tr>
          <tr>
            <td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Cholesterol&nbsp;<\/b>0.0mg<\/div>
              <div class="divrht"><b>0<\/b>%<\/div><\/td>
          <\/tr>
          <tr>
            <td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Sodium <\/b> 540.0mg<\/div>
              <div class="divrht"><b>23<\/b>%<\/div><\/td>
          <\/tr>
          <tr>
            <td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Total Carbohydrates <\/b>12.0g<\/div>
              <div class="divrht"><b>4<\/b>%<\/div><\/td>
          <\/tr>
          <tr>
            <td>&nbsp;<\/td>
            <td colspan="2" class="brdtp"><div class="divlft">Dietary Fiber 6.0g <\/div>
              <div class="divrht"><b>24<\/b>%<\/div><\/td>
          <\/tr>
          <tr>
            <td colspan="3" class="brdtp"><div class="divlft PadLft"><b>Protein <\/b>26.0 g<\/div>
              <div class="divrht"><b>52<\/b>%<\/div><\/td>
          <\/tr>
          <tr>
            <td colspan="3" class="blk" height="1"><img src="(.*?)" width="1" height="8"><\/td>
          <\/tr>
          <tr>
            <td colspan="3"><table width="100%" border="0" cellpadding="0" cellspacing="0">
              <\/table><\/td>
          <\/tr>
          <tr>
            <td colspan="3" class="PadLft brdtp">*  Based on a<u> 2,000 calorie diet<\/u>.<\/td>
          <\/tr>
        <\/tbody>
      <\/table>
    <\/div>/s',$file,$patterns);
    echo $h1count ;

    thank you

    59. NomikOS at 9:28pm on April 29, 2010

    Are you crazy? I've never seen anything like that.

    Look: first isolate the table. I wrote this function for that:

    # Takes only the first occurrence in $string (very important!)
    # Returns an empty string if either delimiter is not found
    function getUnit($string, $start, $end)
    {
        if (($pos = stripos($string, $start)) === false)
            return '';

        $str = substr($string, $pos);
        $str_two = substr($str, strlen($start));

        if (($second_pos = stripos($str_two, $end)) === false)
            return '';

        $str_three = substr($str_two, 0, $second_pos);
        return trim($str_three);
    }

    Call it like this:

    $unit = getUnit($fileToParse, '<div id="nutritions">', '</div>');

    and then run preg_match_all over $unit:

    if ( preg_match_all('/<img src="(.*?)" width="1" height="8">/si', $unit, $src, PREG_SET_ORDER))
    {
      var_dump($src);
    }

    Between quotes, this pattern is more appropriate than (.*?): ([^"]*)
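
    A small illustration of why the negated class is the safer choice between quotes. The sample markup is made up; the point is only how the two patterns behave when an earlier tag does not fit the rest of the pattern.

    ```php
    <?php
    // Two <img> tags; only the second one has width="1".
    $html = '<img src="a.gif" height="2"><img src="b.gif" width="1">';

    // Lazy dot: starts at the first tag and keeps expanding until the rest of
    // the pattern fits, so the capture spills across the first tag's quote.
    preg_match('/<img src="(.*?)" width="1">/s', $html, $lazy);

    // Negated class: cannot cross a double quote, so the first tag is rejected
    // and only the second tag matches.
    preg_match('/<img src="([^"]*)" width="1">/', $html, $strict);

    echo $lazy[1], "\n";   // a.gif" height="2"><img src="b.gif
    echo $strict[1], "\n"; // b.gif
    ```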

    ..V; ^^

    60. NomikOS at 9:39pm on April 29, 2010

    Correction:
    $unit = getUnit($fileToParse, '<div id="nutritions">', '</tbody>');

    The delimiters must be unique (or at least you must be sure they delimit the block you're interested in within $fileToParse). Here the id= attribute ensures that.

    61. trendzvijay at 9:47pm on April 29, 2010

    Hi NomikOS,

    Thank you for your quick reply. I'm a newbie to PHP. After the var_dump($src); step, how can we retrieve the data? Please give brief details; it will be helpful to most people like me.

    62. NomikOS at 9:56pm on April 29, 2010

    Sure, but you must go to php.net and study preg_match_all. In this case we use PREG_SET_ORDER, so you must do this:

    foreach ($src as $aux)
    {
        $this_array_got_you_want[] = $aux[1];
    }
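
    To make the array layout concrete, here is a short sketch comparing PREG_SET_ORDER with the default PREG_PATTERN_ORDER. The sample HTML and the variable names are made up for illustration.

    ```php
    <?php
    $html = '<a href="/one">First</a> <a href="/two">Second</a>';
    $pattern = '/<a href="([^"]*)">([^<]*)<\/a>/';

    // PREG_SET_ORDER: one entry per match; [0] is the whole match,
    // [1] and [2] are the first and second pair of parentheses.
    preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);
    $links = [];
    foreach ($sets as $set) {
        $links[] = ['url' => $set[1], 'text' => $set[2]];
    }

    // PREG_PATTERN_ORDER (the default): one entry per group instead,
    // so $groups[1] holds every URL and $groups[2] every link text.
    preg_match_all($pattern, $html, $groups, PREG_PATTERN_ORDER);

    print_r($links);
    print_r($groups[1]);
    ```

    The index always counts opening parentheses from left to right, so a pattern with a single group only ever fills index 1.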

    63. trendzvijay at 10:07pm on April 29, 2010

    Thanks a lot, NomikOS! Now it's working well. Thank you very much, dude.

    64. NomikOS at 10:22pm on April 29, 2010

    Unbelievable! I didn't expect to give you a complete solution, just a hint. You are in luck; I'm happy for you.

    Some useful regular expressions are:

    <?php
    ([^"]*) // match everything up to the next "
    ([^>]*) // match everything up to the next >
    \\$(\d+\.*\d*) // match prices (it never hurts)
    ?>
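
    A short sketch of the attribute and price patterns in use. Note that it tightens the price pattern slightly to \d+(?:\.\d+)? (digits with one optional decimal part), which is a substitution rather than the exact expression above; the sample markup is invented.

    ```php
    <?php
    $html = '<td class="price">Now only $19.99</td><td class=price>$5</td>';

    // [^>]* skips over whatever attributes the tag carries;
    // \$ is a literal dollar sign, followed by digits and an optional decimal part.
    preg_match_all('/<td[^>]*>[^<$]*\$(\d+(?:\.\d+)?)/', $html, $m);

    print_r($m[1]); // the captured amounts, without the dollar sign
    ```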

    65. trendzvijay at 10:18am on May 1, 2010

    Hi NomikOS,

    I got the following error while scraping content from a blog (website). The message appeared after I had collected 347 items (around the 38th page):

    [file_get_contents]: failed to open stream

    My code for this scraping job is $file = file_get_contents($url);

    Is there any other way to get a complete solution?

    Thank you

    66. NomikOS at 2:43pm on May 1, 2010

    To avoid breaking the program flow, use @:

    $file = @file_get_contents($url);
    if ($file) {}

    and carry on...

    To scrape seriously you must use cURL, but learning how is your task. Search for a ready-to-use class.

    I.-
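
    A minimal sketch of the cURL route, assuming the failures are transient (for example the remote site refusing a burst of requests). The function names, timeouts, retry count and URL are all illustrative choices, not a ready-made class.

    ```php
    <?php
    // Fetch a URL with cURL; returns the body, or null on any failure.
    function curlFetch(string $url): ?string
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
            CURLOPT_FOLLOWLOCATION => true, // follow redirects
            CURLOPT_CONNECTTIMEOUT => 10,
            CURLOPT_TIMEOUT        => 30,
        ]);
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

        return ($body === false || $status >= 400) ? null : $body;
    }

    // Try a fetch several times before giving up, pausing between attempts.
    // $fetch is any callable that takes a URL and returns a string or null.
    function fetchWithRetry(string $url, callable $fetch, int $attempts = 3, int $pauseSeconds = 2): ?string
    {
        for ($i = 1; $i <= $attempts; $i++) {
            $body = $fetch($url);
            if ($body !== null) {
                return $body;
            }
            if ($i < $attempts && $pauseSeconds > 0) {
                sleep($pauseSeconds);
            }
        }
        return null;
    }

    // Usage (hypothetical URL):
    // $file = fetchWithRetry('http://www.example.com/page/38', 'curlFetch');
    // if ($file !== null) { /* parse it */ }
    ```

    Passing the fetcher in as a callable keeps the retry logic independent of cURL, so the same loop would work with file_get_contents wrapped in a small function.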

    67. trendzvijay at 4:10pm on May 1, 2010

    Thank you. Now I'm learning how to use cURL to scrape. Thanks for your support.

    68. Frederick Aristotle at 6:36pm on May 5, 2010

    Is there a reason why I'm not getting any results when I code it using the following example?

    <div style="margin: 10px 0px 0px 0px; padding: 5px; width: 500px; border: 1px solid #000000;">

    <?php
    // get the HTML
    $html = file_get_contents("http://www.dailydealcafe.com/index.php");

    preg_match_all(
        '/<div class="main-product-image"><img src="(.*?)" alt="(.*?)" title="(.*?)" border="0" height="(.*?)" width="(.*?)"><\/div>
    /s',
        $html,
        $posts, // will contain the blog posts
        PREG_SET_ORDER // formats data into an array of posts
    );

    foreach ($posts as $post) {
        $link = $post[1];
        $title = $post[2];
        $date = $post[3];
        $content = $post[4];
        $content2 = $post[5];

        // do something with data
        echo $link . '<br/>' . $title . '<br/>' . $date . '<br/>' . $content;
    }


    ?>
    </div>

    69. Jeff Nelson at 3:34pm on May 12, 2010

    Great post and comments.

    Can you comment on the consequences of the adoption of HTML 5 for the business of site scraping, as done by Yodlee and others for financial information? Does the use of RIAs make scraping more difficult?

    thanks/JN

    70. NomikOS at 3:40pm on May 12, 2010

    Please provide a suitable link. Thanks...

    71. prakash at 6:57am on May 24, 2010

    Hi friend, I want to scrape this data:
    <DIV CLASS="contenttext">
    Many USMS clubs have their own web sites with local information, workout times, club events, and other useful information. Please stop by and visit one of our club sites!
    <P><A HREF="edit_club_link.php?add=1">Add USMS Club Link</A>
    <FORM ACTION="/links/usmsclubs.php" METHOD="POST">

    <SELECT NAME="a">
    <OPTION VALUE="">-All-
    <OPTION VALUE="AL">Alabama
    <OPTION VALUE="AK">Alaska
    <OPTION VALUE="AZ">Arizona
    <OPTION VALUE="AR">Arkansas
    <OPTION VALUE="CA">California
    <OPTION VALUE="CO">Colorado
    <OPTION VALUE="CT">Connecticut
    <OPTION VALUE="DE">Delaware
    <OPTION VALUE="DC">District Of Columbia
    <OPTION VALUE="FL">Florida
    <OPTION VALUE="GA">Georgia
    <OPTION VALUE="HI">Hawaii
    <OPTION VALUE="ID">Idaho
    <OPTION VALUE="IL">Illinois
    <OPTION VALUE="IN">Indiana

    <OPTION VALUE="IA">Iowa
    <OPTION VALUE="KS">Kansas
    <OPTION VALUE="KY">Kentucky
    <OPTION VALUE="LA">Louisiana
    <OPTION VALUE="ME">Maine
    <OPTION VALUE="MD">Maryland
    <OPTION VALUE="MA">Massachusetts
    <OPTION VALUE="MI">Michigan
    <OPTION VALUE="MN">Minnesota
    <OPTION VALUE="MS">Mississippi
    <OPTION VALUE="MO">Missouri
    <OPTION VALUE="MT">Montana
    <OPTION VALUE="NE">Nebraska
    <OPTION VALUE="NV">Nevada
    <OPTION VALUE="NH">New Hampshire
    <OPTION VALUE="NJ">New Jersey
    <OPTION VALUE="NM">New Mexico

    <OPTION VALUE="NY">New York
    <OPTION VALUE="NC">North Carolina
    <OPTION VALUE="ND">North Dakota
    <OPTION VALUE="OH">Ohio
    <OPTION VALUE="OK">Oklahoma
    <OPTION VALUE="OR">Oregon
    <OPTION VALUE="PA">Pennsylvania
    <OPTION VALUE="RI">Rhode Island
    <OPTION VALUE="SC">South Carolina
    <OPTION VALUE="SD">South Dakota
    <OPTION VALUE="TN">Tennessee
    <OPTION VALUE="TX">Texas
    <OPTION VALUE="UT">Utah
    <OPTION VALUE="VT">Vermont
    <OPTION VALUE="VA">Virginia
    <OPTION VALUE="WA">Washington
    <OPTION VALUE="WV">West Virginia

    <OPTION VALUE="WI">Wisconsin
    <OPTION VALUE="WY">Wyoming
    </SELECT>
    <INPUT TYPE="submit" VALUE="Go">
    </FORM>
    <P>

    </DL><DL><DT><B>Alabama</B>
    <DD><A HREF="http://www.ag.auburn.edu/~cbailey/masters.html" TARGET="_new"> Auburn Masters Swimming</A>  (Auburn)
      <SMALL>[</SMALL>  <A HREF="edit_club_link.php?a=982"><SMALL>Modify</SMALL></A>  <SMALL>]</SMALL></DD>

    <DD><A HREF="http://wng1.home.att.net/cams/" TARGET="_new"> CAMS</A>  (Montgomery)
      <SMALL>[</SMALL>  <A HREF="edit_club_link.php?a=983"><SMALL>Modify</SMALL></A>  <SMALL>]</SMALL></DD>
    <DD><A HREF="http://www.ctaswim.com" TARGET="_new"> Crimson Tide Aquatics</A>  (Tuscaloosa)
      <SMALL>[</SMALL>  <A HREF="edit_club_link.php?a=1279"><SMALL>Modify</SMALL></A>  <SMALL>]</SMALL></DD>

    <DD><A HREF="http://www.teamunify.com/SubTabGeneric.jsp?team=csfcast&_stabid_=5389" TARGET="_new"> FAST Masters - Fort Collins Area Swim Team</A>  (Fort Collins)
      <SMALL>[</SMALL>  <A HREF="edit_club_link.php?a=1417"><SMALL>Modify</SMALL></A>  <SMALL>]</SMALL></DD>
    <DD><A HREF="http://www.swimhsa.org/masters" TARGET="_new"> Huntsville Swim Association</A>  (Huntsville)
      <SMALL>[</SMALL>  <A HREF="edit_club_link.php?a=985"><SMALL>Modify</SMALL></A>  <SMALL>]</SMALL></DD>

    <DD><A HREF="http://www.magiccitymasters.org/" TARGET="_new"> Magic City Masters Swim Team</A>  (Birmingham)
      <SMALL>[</SMALL>  <A HREF="edit_club_link.php?a=984"><SMALL>Modify</SMALL></A>  <SMALL>]</SMALL></DD>
    <DD><A HREF="http://www.shoalssharks.com" TARGET="_new"> Shoals Sharks Masters Swimming</A>  (Florence)
      <SMALL>[</SMALL>  <A HREF="edit_club_link.php?a=1249"><SMALL>Modify</SMALL></A>  <SMALL>]</SMALL></DD>

    <DD><A HREF="http://www.mybswim.org/mastersswimming.htm" TARGET="_new"> YMCA Barracudas</A>  (Montgomery)
      <SMALL>[</SMALL>  <A HREF="edit_club_link.php?a=987"><SMALL>Modify</SMALL></A>  <SMALL>]</SMALL></DD>
    </DL></DIV>


    How can I do this? Please help me.

    72. Jose at 11:09am on June 10, 2010

    That is great information! Nowhere else did I find such an easy and effective explanation. Proven: PHP CAN do the job! Thanks!

    73. Shrikant at 8:57am on July 23, 2010

    Hello sir,
    I am new to the CakePHP framework, and I am currently facing a problem.
    I am working on scraping in CakePHP, and the site I am scraping is built on the .NET platform.
    How can I log in to that .NET site from PHP code? In other words, how do I POST a username and password to that site and get the response back on my own PHP site? After that I will scrape data from there and store it in a database.

    Reference site: www.plentyoffish.com (a .NET site)
    For example, I would like to log in to a Gmail account from my PHP code and scrape it, but how is that possible?

    Please help me.

    Thanks in advance

    74. NomikOS at 9:46am on July 23, 2010

    That is easy, use cURL and a professional. http://www.rentacoder.com/RentACoder/DotNet/SoftwareCoders/ShowBioInfo.aspx?lngAuthorId=7064234

    75. theMaab at 5:21am on September 2, 2010

    What would the preg_match_all string look like to loop over the TERMs and DEFINITIONs on this page? http://www.cancer.gov/drugdictionary/?expand=%23

    Thanks in advance. I'm horrible with regex. :(

    76. Steve at 11:33pm on September 18, 2010

    nice tutorial. I've been working on a similar project (www.quickscrape.com) and found that some web hosts require you to use curl instead.

    77. Praveen at 2:02am on December 16, 2010

    Dear All,

    Could you please help me? How can I scrape the Latest News slideshow data from "http://www.indiatimes.com/" onto our site with PHP?

    Also, when I click on a scraped feed URL, the main site should open on my next page in an iframe, because I want to show our site's header portion,
    like http://www.samachar.com does.

    Regards.
    Praveen

    78. Ashneil at 7:41pm on December 17, 2010

    Thanks for this tutorial. I really needed it. You have a really nice theme on your website.

    79. lioness at 10:27am on December 25, 2010

    I need help with scraping.

    A user fills in a form on my site (site1), and the request is made to another site (site2), which sends the results directly to the user's browser along with a lot of irrelevant information.

    I want to grab the results before the user sees them, so that I display only the relevant part.

    <form name="example" action="http://www.site2.com/index.php?option=com_content&task=view&id=49&Itemid=10" method="post" onsubmit="return validate_form()" target="_blank";> ......

    80. Raj Keshwani at 4:40am on March 10, 2011

    Great example this is!!!

    81. Freddy at 2:07am on March 22, 2011

    There is this new web scraping tool called Helium Scraper at http://www.heliumscraper.com also.

    82. Lisa Waters at 1:19pm on March 28, 2011

    I need to extract data from multiple URLs and have it inserted into a MySQL database. I am a newbie, so I have no idea what I am doing. I need some information from the body of the page and some from the URL parameters. This is what I have so far:

    <?php
    $arr = array(10003,10029);

    foreach($arr as $value){
    // get the HTML
    $web = 'http://www.doe.mass.edu/mcas/search/question.aspx?mcasyear=2010&QuestionSetID=1&grade=8&subjectcode=MTH&questionnumber=40'.$value;
    echo $web."<br/>";

    $html = file_get_contents($web);

    preg_match_all(

    '/<span class="nav em">(.*?)<br />(.*?).*?<\/span>/s',

        $html,
        $posts,
        PREG_SET_ORDER
    );

    foreach ($posts as $post) {
        $reportingcategory = $post[1];
        $standard = $post[2];
     
                }

    // Create date stamp
    $dateStamp = strftime("%D %T", time());

    echo $name."|".$reportingcategory."|".$standard."<br/>";

    }
    ?>
    <?php

    $url = "http://www.mysite.com/search/question.aspx?mcasyear=".$year."&QuestionSetID=".$QuestionSetID."&grade=".$grade."&subjectcode=".$QuestionType."&questionnumber=".$QuestionNumber;


    echo $name."|".$reportingcategory."|".$standard."|".$grade."<br/>";
    ?>
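
    For the URL-parameter half of that task, PHP's parse_url and parse_str can split a query string into an array. This is only a sketch: the URL shape is taken from the code above, and the queryParams helper is a made-up name.

    ```php
    <?php
    // Return the query-string parameters of a URL as an associative array.
    function queryParams(string $url): array
    {
        $query = parse_url($url, PHP_URL_QUERY); // null when there is no query
        $params = [];
        if (is_string($query)) {
            parse_str($query, $params);
        }
        return $params;
    }

    $url = 'http://www.doe.mass.edu/mcas/search/question.aspx'
         . '?mcasyear=2010&QuestionSetID=1&grade=8&subjectcode=MTH&questionnumber=4010003';

    $p = queryParams($url);
    echo $p['mcasyear'], '|', $p['grade'], '|', $p['subjectcode'], "\n"; // 2010|8|MTH
    ```

    The values come back as strings, so cast them before storing them in numeric database columns.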

    83. Gerry_castlow at 7:26pm on March 30, 2011

    haha idiot! regex loses against tidy xpath.

    84. djam at 3:01pm on April 9, 2011

    Hi..

    This is the html code:

    <div style="margin: 0px; z-index: 1000;" class="my-entry">
    <ul>
    <li><strong><a href="link.html">Sepucuk Surat Buat Presiden...</a></strong></li>
    <li><strong><a href="link.html">Momentum, Rahasia Sukses</a></strong></li>
    <li><strong><a href="link.html">Etika Bisnis Negeri Matahari Terbit</a></strong></li>
    <li><strong><a href="link.html">Kiat Menjaga Motivasi Untuk Berolah Raga</a></strong></li>
    </ul>
    </div>


    This is my code:

    $url="http://www.theurl.com";
    $text = file_get_contents($url);

    preg_match_all(
        '/<div  style="margin\: 0px; z-index\: 1000;" class="my\-entry"><ul><li>.*?<strong><a href="(.*?)">(.*?)<\/a>.*?<\/strong>.*?<\/li>.*?<\/ul>.*?<\/div>/s',$text,$posts,PREG_SET_ORDER);

    foreach ($posts as $post) {
        $link = $post[1];
        $title = $post[2];
        $date = $post[3];
        $content = $post[4];

    echo $title;
    echo $link;
    echo $date;
    echo $content;

    }

    But I got no result..
    Please help...

    85. Elena Gallegos at 1:51pm on May 6, 2011

    Hi, and if I wanted to do it in Java, how would I do it? Thanks.

    86. NomikOS at 2:50pm on May 6, 2011

    Elena, the scraping method shown in this post uses regular expressions. Regular expressions are hard to learn and use. Java has a package for this: java.util.regex (http://www.regular-expressions.info/java.html).

    There are other ways to scrape too, for example with XPath, which are a bit simpler. This page lists two Java-based ones: http://www.manageability.org/blog/stuff/screen-scraping-tools-written-in-java

    Either way, there is no easy solution if you want to do a professional job.

    I hope this helps. Don't forget to visit my blog (http://nomikos.info), okay?

    NomikOS.-
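
    For anyone staying in PHP, the XPath route is available through the built-in DOMDocument and DOMXPath classes. A minimal sketch, assuming a list of linked titles inside a div; the markup and the class name are sample data, not taken from a real site.

    ```php
    <?php
    // Sample markup with the same shape as a list of linked titles.
    $html = '<div class="my-entry"><ul>'
          . '<li><strong><a href="one.html">First title</a></strong></li>'
          . '<li><strong><a href="two.html">Second title</a></strong></li>'
          . '</ul></div>';

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    // Every <a> inside an <li> inside the div whose class is "my-entry".
    foreach ($xpath->query('//div[@class="my-entry"]//li//a') as $a) {
        $links[] = ['href' => $a->getAttribute('href'), 'title' => trim($a->textContent)];
    }

    print_r($links);
    ```

    Unlike a regular expression, the query does not care about attribute order, extra whitespace, or line breaks between the tags.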

    87. Sudip Rooj at 8:43am on May 27, 2011

    This code is really helpful....
    Great job.

    88. ras at 11:18am on May 30, 2011

    Many thanks for this scraping tutorial, you saved my day and my time;)

    89. Sudip Rooj at 3:03am on June 7, 2011

    The preg_match function is not working with any regular expression on the clickindia dot com site. Please give some suggestions.

    90. tony at 11:09am on June 27, 2011

    Please help... I have 600 files and I'm stuck. This is a sample from one file. Can you give me some sample code?
    I need to extract the markets, items and prices.
    I think the img is the first key for finding the IDs, and then I can search by ID.

    <div id="centerData" class="dm">
    <table class="fixw" cellspacing="0" cellpadding="0" border="0" xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <tbody>
    <tr class="h1 rh1">
    <td align="center" width="32">
    <a onclick="return false" href="#">
    <img id="cpnBtn_981#38120821" border="0" align="absmiddle" onclick="clickOpenClose('981#38120821',4115,'',1,7,'',981,4,'',38120821,3,'',1,8,'',38120821,3,'',0,0);" src="mainpage_data/iconOpen.gif">
    </a>
    </td>
    <td>
    <a onclick="javaScript: gPC(100000,'',1,7,'',981,4,'',38120821,3); return false;" href="#">MARKET 1</a>
    </td>
    </tr>
    </tbody>
    </table>
    <div id="cpnDiv_981#38120821" xmlns:fo="http://www.w3.org/1999/XSL/Format" style="display:inline">
      <table cellspacing="0" cellpadding="0" width="565" style="border-bottom: 1px solid rgb(211, 211, 211);">
          <tbody>
              <tr>
                <td style="height: 1px;"></td>
                    </tr>
                <tr class="rcpn">
                  <td class="dcpnl ex clba cbb" onclick="javascript:number('pt=N#o=21/20#f=38120821#fp=194761477#so=0#c=1#');">ITEM 1</td>
                      <td class="dcpnr ex1 clab cbb" onclick="javascript: number('pt=N#o=21/20#f=38120821#fp=194761477#so=0#c=1#');">PRICE1</td>
      <td class="dcpnl ex clba cbb" onclick="javascript: number('pt=N#o=3/4#f=38120821#fp=194761478#so=0#c=1#');">ITEM 2</td>
    <td class="dcpnr ex1 clab cbr cbb" onclick="javascript: number('pt=N#o=3/4#f=38120821#fp=194761478#so=0#c=1#');">PRICE2</td>
    </tr>
    </tbody>
    </table>
    </div>
    <table class="fixw" cellspacing="0" cellpadding="0" border="0" xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <tbody>
    <tr>
    <td class="w" width="565" height="1px" colspan="1"></td>
    </tr>
    </tbody>
    </table>
    <table class="fixw" cellspacing="0" cellpadding="0" border="0" xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <tbody>
    <tr class="h1 rh1">
    <td align="center" width="32">
    <a onclick="return false" href="#">
    <img id="cpnBtn_10202#38120821" border="0" align="absmiddle" onclick="clickOpenClose('10202#38120821',4115,'',1,7,'',10202,4,'',38120821,3,'',1,8,'',38120821,3,'',0,0);" src="mainpage_data/iconOpen.gif">
    </a>
    </td>
    <td>
    <a onclick="javaScript: gPC(100000,'',1,7,'',10202,4,'',38120821,3); return false;" href="#">MARKET 2</a>
    </td>
    </tr>
    </tbody>
    </table>
    <div id="cpnDiv_10202#38120821" xmlns:fo="http://www.w3.org/1999/XSL/Format" style="display: inline;">
    <table class="o4 no_b_tlrb" cellspacing="0" cellpadding="0" border="0" width="565">
    <tbody>
    <tr class="H1">
    <td width="565" height="1" colspan="5"></td>
    </tr>
    </tbody>
    </table>
    <table cellspacing="0" cellpadding="0" border="0" width="565" style="border-bottom: 1px solid rgb(211, 211, 211);">
    <tbody>
    <tr>
    <td style="height: 1px;"></td>
    </tr>
    <tr class="rcpn">
    <td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=1/16#f=38120821#fp=194832550#so=0#c=1#');">ITEM 1</td>
    <td class="dcpnr ex1 clab" onclick="javascript: number('pt=N#o=1/16#f=38120821#fp=194832550#so=0#c=1#');">PRICE 1</td>
    <td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=9/1#f=38120821#fp=194832551#so=0#c=1#');">ITEM 2</td>
    <td class="dcpnr ex1 clab cbr" onclick="javascript: number('pt=N#o=9/1#f=38120821#fp=194832551#so=0#c=1#');">PRICE 2</td>
    </tr>
    <tr class="rcpn">
    <td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=3/10#f=38120821#fp=194832553#so=0#c=1#');">ITEM 3</td>
    <td class="dcpnr ex1 clab" onclick="javascript: number('pt=N#o=3/10#f=38120821#fp=194832553#so=0#c=1#');">PRICE 3</td>
    <td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=12/5#f=38120821#fp=194832554#so=0#c=1#');">ITEM 4</td>
    <td class="dcpnr ex1 clab cbr" onclick="javascript: number('pt=N#o=12/5#f=38120821#fp=194832554#so=0#c=1#');">PRICE 4</td>
    </tr>
    <tr class="rcpn">
    <td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=5/2#f=38120821#fp=194832556#so=0#c=1#');">ITEM 4</td>
    <td class="dcpnr ex1 clab" onclick="javascript: number('pt=N#o=5/2#f=38120821#fp=194832556#so=0#c=1#');">PRICE 4</td>
    <td class="dcpnl ex clba" onclick="javascript: number('pt=N#o=2/7#f=38120821#fp=194832557#so=0#c=1#');">ITEM 5</td>
    <td class="dcpnr ex1 clab cbr" onclick="javascript: number('pt=N#o=2/7#f=38120821#fp=194832557#so=0#c=1#');">PRICE 5</td>
    </tr>
    </tbody>
    </table>
    </div>

    91. Jax at 7:20pm on June 27, 2011

    thanks for the tutorial, but I always get 0 results from this:

    '/<td style="font\-weight\: bold;" class="rightNum">(.*?)<\/td><td style="padding\-left\: 40px;"><a href="(.*?)">(.*?)<\/a><\/td><td style="white\-space\: nowrap;">(.*?)<\/td><td class="centNum"><img src="(.*?)" onmouseover="setTipText\(\'(.*?)\'\);" class="staticTip"><\/td><td style="font\-weight\: bold; color\: rgb\(103, 135, 5\);" class="rightNum">(.*?)<\/td><td style="font\-weight\: bold; color\: rgb\(154, 20, 1\);" class="rightNum">(.*?)<\/td><td style="font\-weight\: bold;" class="rightNum">(.*?)<\/td><\/tr>/'

    Any idea why? Does anyone see an error?

    92. kaushal sinha at 7:01am on June 30, 2011

    Excellent overview, it pointed me out something I didn’t realize before. I should encourage for your wonderful work. I am hoping the same best work from you in the future as well. Thank you for sharing this information with us.

    93. Stanley at 12:16pm on July 1, 2011

    Excellent - glad I found your article. I just scraped my local yellow pages (not ALL of it) to help our company do a bit of targeted marketing.

    Much easier than I thought thanks to your simple explanation of how it works.

    Many thanks.

    94. Stanley at 12:45pm on July 1, 2011

    @Jax Don't forget to end it with /s' instead of /' - that's the mistake I just made.

    Can't vouch for the rest of your code though - depends on the HTML you're using it on. I'm no expert - do you really need to escape all those hyphens and colons? Maybe.
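
    To see why the trailing s matters: without it, the dot in (.*?) does not match a newline, so a pattern that has to span several lines of HTML finds nothing. A small self-contained illustration with made-up markup, which also touches on the escaping question:

    ```php
    <?php
    $html = "<div class=\"section\">\n  line one\n  line two\n</div>";

    $withoutS = preg_match('/<div class="section">(.*?)<\/div>/', $html, $a);
    $withS    = preg_match('/<div class="section">(.*?)<\/div>/s', $html, $b);

    var_dump($withoutS); // int(0): "." stops at the first newline
    var_dump($withS);    // int(1)
    echo trim($b[1]), "\n";

    // Hyphens and colons are ordinary characters outside a character class,
    // so escaping them is harmless but not required.
    $plain = preg_match('/font-weight: bold;/', 'style="font-weight: bold;"');
    var_dump($plain); // int(1)
    ```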

    P.S. Just a point of interest for anyone else - when I was using this to grab a few names and addresses it worked fine until there was a piece of information (e.g. a url or an email address) missing from the records I was scraping, so the script naturally jumped ahead to find the next record where there was a "mailto", for instance.

    Instead of trying to find a way to add an IF clause or two in my script, to test if there was a URL or an email address, I simplified my script to grab all of the HTML for this section of the records, then did a bit of weeding by using str_ireplace to get rid of the bits of HTML I didn't want and add a few "|" delimiters to re-separate the URL and email values.
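
    A variation on that idea, sketched under assumptions: grab each record's HTML first, then run a second small regex inside the record instead of weeding with str_ireplace. The rec class, the record layout and the addresses are invented sample data.

    ```php
    <?php
    $html = '<div class="rec"><b>Ann</b> <a href="mailto:ann@example.com">mail</a></div>'
          . '<div class="rec"><b>Bob</b></div>'
          . '<div class="rec"><b>Cy</b> <a href="mailto:cy@example.com">mail</a></div>';

    // Match one record at a time, then look for the optional field inside it,
    // so a missing e-mail can never pull in the next record's address.
    preg_match_all('/<div class="rec">(.*?)<\/div>/s', $html, $records, PREG_SET_ORDER);

    $rows = [];
    foreach ($records as $record) {
        $body = $record[1];
        preg_match('/<b>([^<]*)<\/b>/', $body, $name);
        $hasMail = preg_match('/href="mailto:([^"]*)"/', $body, $mail);
        $rows[] = ($name[1] ?? '') . '|' . ($hasMail ? $mail[1] : '');
    }

    print_r($rows);
    ```

    Each row keeps the same number of "|" separated columns whether or not the e-mail is present, which is what a spreadsheet import needs.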

    Worked a treat. I also added a "page=" querystring to my page to make it quicker to load up the next page of records - similar to how Gerschel did it in Comment 27, but I just loaded the page, copied and pasted the records into my spreadsheet, then typed the next page number in my address bar and hit return, then repeated the process. I grabbed about 16 pages of records in just a few minutes using this method.

    95. Stanley at 3:55am on July 2, 2011

    I improved my method of paging through the records by adding a loop. I first checked to see how many pages of records there were and added a $totpages variable, like so:

    $totpages = 18;

    for ( $page_number = 1; $page_number <= $totpages; $page_number += 1) {

    // get the HTML
    $html = file_get_contents("http://www.example.com/listing.php?categoryid=123&page=".$page_number);

    // then all the other stuff as  per this tutorial, then an additional curly bracket right at the end to close the loop...

    }

    Works great. Cheers Jesse - you've unleashed a monster, lol!

    I suppose I "could" make a list of all the category IDs and numbers of pages and just loop through the whole lot in one go... hmmm....
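
    That "whole lot in one go" idea could be sketched as a nested loop that builds every URL up front. The category IDs, page counts and the listingUrls helper are hypothetical; only the categoryid and page parameter names come from the loop above.

    ```php
    <?php
    // Build every listing URL for a set of categories.
    // $pagesPerCategory maps category ID => number of pages (values are made up).
    function listingUrls(string $base, array $pagesPerCategory): array
    {
        $urls = [];
        foreach ($pagesPerCategory as $categoryId => $totalPages) {
            for ($page = 1; $page <= $totalPages; $page++) {
                $urls[] = $base . '?' . http_build_query([
                    'categoryid' => $categoryId,
                    'page'       => $page,
                ]);
            }
        }
        return $urls;
    }

    $urls = listingUrls('http://www.example.com/listing.php', [123 => 2, 456 => 1]);
    print_r($urls);

    // Fetching would then be one loop, with a pause so the site is not hammered:
    // foreach ($urls as $url) { $html = file_get_contents($url); /* parse */ sleep(1); }
    ```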

    96. Rashed at 3:17pm on September 8, 2011

    $html = file_get_contents("http://www.footbo.com/Teams/Real_Madrid");

    preg_match_all(

        '/<div class="bottom rounded6">(.*?)<\/div>/s',

        $html,
        $posts, // will contain the blog posts
        PREG_SET_ORDER // formats data into an array of posts
    );


    foreach ($posts as $post) {
        $link = $post[1];
        $title = $post[2];
        $date = $post[3];
        $content = $post[4];

     
    }

    It's a great article, Jesse Skinner, many thanks for it. The regular expression is clear to me, but I can't understand the array loop. Please explain why you use $link = $post[1];, $title = $post[2];, $date = $post[3]; and $content = $post[4];. Please let me know and correct my code.

    97. Juned Ahmad at 9:18am on September 16, 2011

    This tutorial is very helpful for scraping. I am very thankful to you.
    Thank you.

    98. toto at 2:37am on October 28, 2011

    I am a new PHP developer. Thank you for sharing how to scrape a site with PHP. This tutorial is very helpful for me.

    99. Jonny at 7:13am on October 28, 2011

    Thanks this came in useful and I have linked back to you.

    100. Gwen at 4:05pm on November 9, 2011

    I can't make this script work. Can anyone please tell me what's wrong? I don't get any error; it's just not working.


    <?

    $html = file_get_contents("http://www.yellowpages.com/fort-lauderdale-fl/acupunture");

    preg_match_all(
        '/
    <div class="listing_content">.*?
    <h3 .*?>
    <a .*?>(.*?)<\/a>
    <\/h3>
    <span class="listing-address adr">
    <span class="street-address">(.*?)<\/span>
    <span class="city-state">
    <span class="locality">(.*?)<\/span>,
    <span class="region">(.*?)<\/span>
    <span class="postal-code">(.*?)<\/span>
    <\/span>
    <\/span>
    <span class="business-phone phone">(.*?)<\/span>.*?
    <li><a href="(.*?)">/s',
        $html,
        $posts,
        PREG_SET_ORDER
    );


    $listing=array();

    foreach ($posts as $post) {

    $listing['title'][] = $post[1];

    $listing['street'][] = $post[2];

    $listing['city'][] = $post[3];

    $listing['state'][] = $post[4];

    $listing['zip'][] = $post[5];

    $listing['phone'][] = $post[6];

    $listing['website'][] = $post[7];

        // do something with data

    echo  $post[4];
    }


    print_r($listing)


    ?>

    101. Jesse Skinner at 4:21pm on November 9, 2011

    @Gwen - you probably need to make the regular expression one line. Try using .* to capture the whitespace in between, with the /s ending as I describe in the article.
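
    A sketch of the one-line idea on a fragment shaped like that listing markup. It uses \s* between the tags rather than .*, which is a slightly stricter substitution (whitespace only); the sample HTML is made up.

    ```php
    <?php
    // Markup spread over several lines, as it usually is in a real page.
    $html = "<span class=\"city-state\">\n"
          . "    <span class=\"locality\">Fort Lauderdale</span>,\n"
          . "    <span class=\"region\">FL</span>\n"
          . "</span>";

    // \s* absorbs the line breaks and indentation between the tags, so the
    // pattern itself can stay on a single line.
    $pattern = '/<span class="locality">(.*?)<\/span>,\s*<span class="region">(.*?)<\/span>/s';

    preg_match($pattern, $html, $m);
    echo $m[1], ', ', $m[2], "\n"; // Fort Lauderdale, FL
    ```

    Line breaks typed inside the pattern string are matched literally, which is why a pattern laid out to mirror the page often finds nothing.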

    102. Gwen at 12:07pm on November 10, 2011

    Thanks for the reply. I replaced a few tags with .*? and that did it!!!. Thank you Jesse. This tut rocks!  : )

    103. Farhan at 12:25pm on November 21, 2011

    Please help. How can I get data from the IMDb coming-soon movies page?

    104. Piyush at 2:22am on January 7, 2012

    Hi,
    I want to scrape all the text:
    <tr>
    <td class="product-specs" colspan="2">
    <h1>
    <span style="font-size: small">
    <span style="font-family: Arial">
    Dell Optiplex 745 Tower Computer<br />
    </span></span></h1>
    <p>Intel Core 2 Duo 2.4 GHz <br />

    2 GB&nbsp;RAM <br />
    80 GB&nbsp;HDD <br />
    DVDRW<br />
    Windows XP&nbsp;Professional <br />
    Keyboard <br />
    and Mouse.</p>
    <p><em>Factory refurbished desktop computer</em></p>

    <h1><span style="color: rgb(255, 0, 0);"><strong><span style="font-size: small;"><span style="font-family: Arial;">3-year advance replacement warranty (no charge for parts, labor, and shipping)</span></span></strong></span></h1></td>
                        </tr>

    105. roisun at 7:47pm on January 25, 2012

    I really want to scrape some sites, but I do not understand at all what to do.

    I am new to the internet and to this PHP code.

    106. Alastair at 12:04am on February 11, 2012

    Two libraries that I recommend for scraping:
    PHP Simple HTML DOM Parser (simplehtmldom.sourceforge.net)
    Magpie RSS (magpierss.sourceforge.net)

    107. Selo Bania at 6:08pm on March 20, 2012

    I have a really interesting question. In my stat counter I've found many Russian
    sites linking to my website (http://selo-banya.com). It looks like some kind of web scraping.
    Can someone here give me some help with this, please? I found a script on another site
    (zdravstvo.rs) that I have nothing to do with, and my domain name is included in it. If you
    go to the footer menu of that website, you can see the information under "Odakle nam dolaze".

    Here is the script - a
    href="http://www.google.bg/url?sa=t&rct=j&q=link%3Ahttp%3A%2F%2Fselo-banya.com%20zvezdaput.net&source=web&cd=3&ved=0CDoQFjAC&url=http%3A%2F%2Fwww.zdravstvo.rs%2Fbaza%2Findex.php%3Fsrch%3D%26kategorija%3D41%26grad%3DBeograd&ei=itxoT_t8heW1BtyzteYH&usg=AFQjCNHGF-LTs4xiUYZI14DS7eAx36PfAw" rel="nofollow">www.google.bg

    Also, over the last 2 months my website's ranking has dropped: its Alexa rank was
    1 600 000 and has now fallen to 4 675 000. Please, I need some help.

    108. Marcus at 5:31am on March 26, 2012

    @Alastair - Thanks, that DOM parser was exactly the kind of scraping tool I was looking for.

    @Jesse - Thanks for making me find this :)

    109. Marin at 10:39am on March 26, 2012

    Good article, I use a similar approach in my own freeware PHP web scraper: http://code.google.com/p/universal-web-scraper/

    110. Joe at 9:28am on March 28, 2012

    Hi, I am trying to use a web scraping script to search for telephone numbers on websites.  Does anyone have a script that works?

    111. Martin at 4:21am on April 10, 2012

    Hi,

    I want to scrape a website (not mine) but I was wondering if they could trace me while using a script like in this example. Is it possible to trace someone who is scraping your website with "file_get_contents"?

    Thanks!

    112. nimo_Q at 2:19am on May 21, 2012

    Wonderfully done! Thank you.

    113. Lane at 3:40pm on May 26, 2012

    This is an older post so a lot has changed over the past few years.  Most people frown upon regular expressions for regular scraping needs and file_get_contents() works in some cases but not in others.  If you are writing new web scraping code, I recommend looking at using the excellent Ultimate Web Scraper Toolkit:

    http://barebonescms.com/documentation/ultimate_web_scraper_toolkit/

    It comes with everything someone needs to get started with modern web scraping.

    @Martin - It is possible to trace someone who is scraping a site if the site's administrators are paying attention to their logs (or if software is monitoring for unusual activity from an IP address).  Basically, your IP address may get banned by the administrator and, if what you are doing is illegal, legal action might be taken, although I have yet to hear of anyone getting sued over it.  Banning the IP address is a pretty effective measure.
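
    As a rough illustration of scraping without regular expressions, here is a minimal sketch that reads the same post structure the article describes (ul#main, h1 link, span.date, div.section) with PHP's bundled DOMDocument and DOMXPath classes. It is not taken from the toolkit linked above. The sample markup, its link and title values, and the function name extract_posts are assumptions made up for the example.

    ```php
    <?php
    // Hypothetical sample in the shape of the article's markup. A real page
    // would be downloaded first, for example with file_get_contents.
    $page = '<ul id="main">
        <li>
            <h1><a href="/blog/1/first-post">First post</a></h1>
            <span class="date">Feb 17 2008</span>
            <div class="section"><p>Hello <em>world</em></p></div>
        </li>
    </ul>';

    // Returns one array per post with the keys link, title, date and content.
    function extract_posts(string $html): array
    {
        // Keep parser warnings about sloppy markup out of the output.
        $previous = libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($doc);
        $posts = [];
        foreach ($xpath->query('//ul[@id="main"]/li') as $li) {
            // Each query is relative to the current li element.
            $link = $xpath->query('./h1/a', $li)->item(0);
            $date = $xpath->query('./span[@class="date"]', $li)->item(0);
            $body = $xpath->query('./div[@class="section"]', $li)->item(0);
            if ($link === null || $date === null || $body === null) {
                continue; // skip list items that do not look like posts
            }
            $posts[] = [
                'link'    => $link->getAttribute('href'),
                'title'   => trim($link->textContent),
                'date'    => trim($date->textContent),
                'content' => trim($body->textContent),
            ];
        }
        return $posts;
    }

    print_r(extract_posts($page));
    ```

    Compared with one long pattern, each field gets its own short query, so a small change in the page layout usually breaks one line instead of the whole match.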

    Commenting is now closed. Come find me on Twitter.