  • Easy web scraping with PHP

    Feb 17 2008

    Web scraping is a web development technique where you load a web page and "scrape" the data off it to use elsewhere. It's not pretty, but sometimes scraping is the only way to get at data or content from a web site that doesn't provide RSS or an open API.

    I'm not going to discuss the legal aspects of scraping, since in some situations it can be considered copyright infringement. There are also perfectly legal reasons to scrape, though, such as when you have the site owner's permission.

    To make things really easy, we're going to let the power of regular expressions do all the work for us. If you're not familiar with regular expressions, you may want to google for a tutorial. Here is the documentation for PHP regular expression syntax.

    First, we start off by loading the HTML using file_get_contents. Next, we use preg_match_all with a regular expression to turn the data on the page into a PHP array.

    This example will demonstrate scraping this web site's blog page to extract the most recent blog posts. This is just for demo purposes - of course, the RSS feed is much better suited for this.

    // get the HTML
    $html = file_get_contents("http://www.thefutureoftheweb.com/blog/");
    

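    In real use you'd probably want a little error handling here. file_get_contents returns false when the request fails (and fetching a URL this way requires the allow_url_fopen setting to be enabled), so a slightly more defensive version of the snippet above might look like this:

    // get the HTML, bailing out if the request fails
    $html = file_get_contents("http://www.thefutureoftheweb.com/blog/");
    if ($html === false) {
        die("Couldn't fetch the page.");
    }
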
    Here is what the HTML looks like for the blog posts:

    <ul id="main">
        <li>
            <h1><a href="[link]">[title]</a></h1>
            <span class="date">[date]</span>
            <div class="section">
                [content]
            </div>
        </li>
    </ul>
    

    So we will use a regular expression that matches each li element and captures the content at the appropriate places (link, title, date & content) using parentheses.

    preg_match_all(
        '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
        $html,
        $posts, // will contain the blog posts
        PREG_SET_ORDER // formats data into an array of posts
    );
    
    foreach ($posts as $post) {
        $link = $post[1];
        $title = $post[2];
        $date = $post[3];
        $content = $post[4];
    
        // do something with data
    }
    
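    For example, a minimal sketch of the "do something" step might just print a plain-text summary of each post (strip_tags and the 80-character cutoff here are just one way to tidy up the captured HTML):

    // print a short plain-text summary of each scraped post
    foreach ($posts as $post) {
        list(, $link, $title, $date, $content) = $post;
        echo "$title ($date) - $link\n";
        echo substr(strip_tags($content), 0, 80), "...\n\n";
    }
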

    There's a lot going on inside that regular expression, but there are really only a few "tricks" that are used. Anytime I want to say "skip over whatever is between" I use .*?. And any time I want to say "match whatever is in here" I use (.*?). And lastly, the s at the end tells PHP to allow the dot . to match newlines. That's about all there is to it.
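
    A quick way to see what that s modifier changes (this snippet is only an illustration, not part of the scraper):

    // without "s" the dot won't cross the newline, so there is no match
    var_dump(preg_match('/<li>(.*?)<\/li>/',  "<li>a\nb</li>", $m)); // int(0)
    // with "s" the dot matches the newline too, and the whole element matches
    var_dump(preg_match('/<li>(.*?)<\/li>/s', "<li>a\nb</li>", $m)); // int(1)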

    The regular expression will only match blog posts, because they are the only <li> elements that contain an <h1>, <span class="date"> and <div class="section">.

    Web scraping is highly unreliable - if the HTML structure were to change, this code would break instantly. However, this kind of code is often quick to write, and it usually produces a perfectly usable hack solution.
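
    Since the usual failure mode is the regex silently matching nothing, one cheap safeguard is to treat an empty result as a sign that the markup has changed (the error_log call is just a placeholder for whatever alerting you prefer):

    // an empty result is the clearest signal that the scraper broke
    if (count($posts) === 0) {
        error_log("Scraper found no posts - has the HTML changed?");
    }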

  • See all the articles

    Feb 12 2008

    I've just added a new page where you can see a listing of all the articles I've written (this article is my 181st). This might be an easier way to see older articles than going page by page or month by month. Check it out: All Articles

  • IBM: Where and when to use Ajax

    Feb 6 2008

    My second IBM developerWorks article is now online: Where and when to use Ajax in your applications.

    It's not a very technical article, so you can read it even if you've never programmed before. I talk about the benefits of using Ajax, and point out some problem areas that need special attention so that Ajax doesn't end up ruining your web site. It's essentially a summary of my Unobtrusive Ajax book.

    The article was fun to write and I hope you enjoy reading it!

  • Update a Dev Site Automatically with Subversion

    Jan 19 2008

    If you're using Subversion during development (and you really should be using some kind of version control system), you can wire it up so that your development site will be updated automatically every time you commit a file. And it's easy!

    Well, it's really easy if your Subversion server and your development web server are the same machine. If they're not, it's still possible, but outside the scope of this article. You'll also want to be familiar with the command line, shell scripting and Subversion before attempting this stuff.

    The first thing is to make sure your development server is a Subversion working copy, or in other words, that you can go into the dev site folder and run "svn update" to update the site. So if you've been using "svn export" or something painful like FTP, you may need to replace the dev site with a folder created using "svn checkout".

    Okay, once you can update the dev site with Subversion, all you need to do is edit or create a file called "post-commit" inside the "hooks" folder of the Subversion repository. If you look in that folder, you'll probably see a bunch of example files like "post-commit.tmpl" showing the kinds of things you can do. Create the post-commit file by copying the example ("cp post-commit.tmpl post-commit"), then edit it - and make sure it's executable ("chmod +x post-commit"), or Subversion will quietly skip it.

    Inside that file, there will be some example code like:

    /usr/lib/subversion/hook-scripts/commit-email.pl "$REPOS" "$REV" commit-watchers@example.org
    

    You'll want to remove or comment out this line and stick in your own scripting. You can put any commands in here that you want to run after each commit. For example, to update your dev site, you might have something like this:

    cd /var/www/path/to/website
    # redirect stderr into the log too, so error messages actually show up there
    svn update >> /path/to/logfile 2>&1
    

    That's it!

    If you run into problems and you used the logfile like in the example, you can have a look in there and see if there are any error messages. I often have problems with permissions, so you may need to loosen the permissions on the dev folder (e.g. chmod -R 770 *).

    This works really well when more than one person is working on a set of files. Instead of 7000 files like "file.html.backup_jesse_19-01-2008" you can just commit and see the changes instantly. It might seem annoying to have to commit files every time you make a change, but it's no more work than uploading files over FTP every time.
