Creating an RSS robot that returns all stories on the Digg frontpage

This tutorial expands on the basic RSS robot from the tutorial Creating a basic RSS robot that reads from Digg. In that tutorial we made an RSS robot that returns the top Digg story. We are now going to change that robot so that it returns all the stories from the Digg frontpage, not just the first one.

It is assumed that you have read the previous tutorial and that you are familiar with the basics. It is also assumed that you have already downloaded the robot development environment RoboMaker and registered as a user on openkapow. All images in the tutorial can be viewed at full size: click on an image to open it in a new window. The robot built in this tutorial is also available for download.

Part 1 - Create the robot with a For Each loop

Start by making a new RSS robot that starts from the URL "www.digg.com" and does not use any input values. This creates a two-step robot with the steps "Load Page" and "Return Item". Now we need to figure out how to return all the stories from the Digg frontpage. The way to do this is to extract the title, URL and description from each story, put that data into an RSSItem output object and return that output object. That means that we will return as many output objects as there are stories on the Digg frontpage.

One way to do this is, of course, to first add steps to get the data from the first story, then steps for the second story, and so on. This is not a good approach: we cannot be sure how many stories there will be, the robot would become very big for something so simple, and it would take a lot of work to develop and maintain. Instead we are going to use a loop. In each iteration of the loop we extract and return data from one Digg story, and then the loop moves on to the next iteration.
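For readers who think in code, the loop idea can be sketched outside RoboMaker. The Python below is only an illustration, not what RoboMaker generates: the markup in PAGE is a hypothetical, heavily simplified stand-in for the Digg frontpage, and the function name extract_items and the use of a p tag for the description are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real Digg page is far larger.
PAGE = """
<div id="contents">
  <div class="news-summary">
    <h3><a href="http://example.com/a">Story A</a></h3>
    <p>Description A</p>
  </div>
  <div class="news-summary">
    <h3><a href="http://example.com/b">Story B</a></h3>
    <p>Description B</p>
  </div>
</div>
"""

def extract_items(page):
    """Return one item per story, however many stories the page holds."""
    root = ET.fromstring(page)
    items = []
    # One iteration per story container, instead of one set of steps per story.
    for story in root.findall("div[@class='news-summary']"):
        link = story.find(".//a")
        items.append({
            "title": link.text,
            "url": link.get("href"),
            "description": story.find("p").text,
        })
    return items

items = extract_items(PAGE)
```

The same function handles two stories or twenty, which is the point of using a loop rather than a fixed sequence of extraction steps.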

How do we then set up a loop in RoboMaker? The first thing we need to do is to figure out what HTML tag to loop through. In this case that is the same as identifying how each story is defined in the HTML. Start by clicking on the title of the first story in the RoboMaker browser view.

Right, we have now highlighted the link tag of the first story. But this is not the type of tag we want to loop through: we want to loop through each story, not each link. So we need to move out in the HTML until we find the tag that defines one story. There are several ways of doing this. You can use the arrow icons to move outwards and inwards in the HTML tags. Click the outward arrow once to move out to the h3 tag surrounding the a tag, click it again to reach the surrounding div tag, and click once more to get to the div with the class "news-summary" (you can see the class in the HTML source view). If we go out one more step we find a div that contains all the stories, so clearly we have gone too far. Let's go back to the div with the class "news-summary".

Instead of using the arrows to move outwards in the HTML, we could have clicked directly in the browser view, clicked in the HTML path view or clicked in the DOM view. These 4 views are all connected, and what is highlighted in one is automatically highlighted in the others. This makes it easy to move around the HTML and to quickly see what structure a page has.

So now we suspect that each Digg story is contained in a div with the class "news-summary". Checking a couple of other stories confirms that this is probably the case. Now, we need to set up a loop that loops through each of those divs. Right click in the HTML path (or in the DOM view, or in the HTML source view...) on the highlighted div, choose Loops and then "For Each Tag".

A new step called "For Each Tag" has now been added to our robot. Time to check whether the loop does what we want. This is best done by stepping through the iterations to check that you are iterating over the right tags and that the loop works as intended. Use the up and down arrows in the icon bar to go to the next and the previous iteration in the loop (note that you need to have highlighted the step before or after the For Each Tag step to be able to do this). You can also enter the number of the iteration in the little text box between the next/previous iteration arrows and hit Enter to go directly to an iteration. While you iterate, a blue box shows which tag in the HTML is the current tag in the loop.

When validating our loop, we notice that it also loops through 2 divs before the first story and 1 div after the last story that we are not at all interested in. So we need to change the automatically generated For Each Tag step configuration to handle this. Before we do this, let us take a look at the For Each Tag step's Tag Finders. There is only one Tag Finder defined under the Tag Finders tab, and it specifies that the loop should be done within a div with the id "contents". If we check in the browser view we see that the div this Tag Finder indicates does indeed contain all the stories on the page. This means that the robot will work fine wherever on the page this div is, so unless Digg does a really radical redesign and renames or removes this div, our robot will work well. Minor redesigns are no problem and we can sleep well at night.

Let's take a look at the For Each Tag step's action configuration. Here it is configured so that the loop covers all "div" tags directly within the tag defined in the Tag Finder, which is precisely what we want. There are also settings to define at what tag number the loop should start and stop. To get our loop to ignore the first 2 divs and the last div that we are not interested in, let's change these settings so that the "First Tag Number" is 2 from first, and "Last Tag Number" is 1 from last. Try going to the first and last iterations now and you will see that the For Each Tag loop covers exactly the data we want to loop through.
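The effect of these two settings can be pictured as trimming a list. The sketch below assumes that "2 from first" and "1 from last" behave as offsets from the two ends of the list of child divs; that reading, the function name select_loop_tags and the sample list are assumptions for illustration, not RoboMaker internals.

```python
def select_loop_tags(children, first_from_start=2, last_from_end=1):
    """Keep only the children between the two offsets (assumed semantics)."""
    start = first_from_start
    # Compute the stop index explicitly so that an offset of 0 keeps the last child.
    stop = len(children) - last_from_end
    return children[start:stop]

# Hypothetical direct children of the "contents" div, in page order.
divs = ["header", "ad", "story 1", "story 2", "story 3", "footer"]
selected = select_loop_tags(divs)
```

Because both offsets are counted from the ends, the selection keeps working when the number of stories changes, as long as the number of unwanted divs before and after the stories stays the same.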

Time to add what we want the robot to do in each iteration of the loop. Go to the first iteration and click the Return Item step. Now we simply right-click on the title within the blue box that surrounds the first Digg story and choose Extract Title.

Our robot will now have an "Extract Title" step. That step's Tag Finder defines that the Extract Title step should act upon an "a" tag within something called "Current Tag 1", so what is this "Current Tag 1" thing? If you take a close look at the blue square indicating the current iteration in the browser view, you will see a small "1" in the top left corner. This is Current Tag 1. For each iteration Current Tag 1 points to a new tag, so referring to the current tag in the Extract Title step's Tag Finder basically says "use the first a tag within the iteration, no matter what iteration I am on". Very practical. There are also many other uses of current tags outside loops, but they are not covered in this tutorial.
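The difference between looking for a tag relative to the current tag and looking for it on the whole page can be shown in a few lines of Python. This is a rough analogy only: the markup is hypothetical and simplified, and the variable names are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup with two stories.
PAGE = """
<div id="contents">
  <div class="news-summary"><h3><a href="http://example.com/a">Story A</a></h3></div>
  <div class="news-summary"><h3><a href="http://example.com/b">Story B</a></h3></div>
</div>
"""

root = ET.fromstring(PAGE)
stories = root.findall("div[@class='news-summary']")

# Relative finder: "first a tag within the current tag".
relative_titles = [current_tag.find(".//a").text for current_tag in stories]

# Absolute finder: "first a tag on the page", ignoring the current tag.
absolute_titles = [root.find(".//a").text for _ in stories]
```

Here relative_titles holds a different title for each story, while absolute_titles repeats the first story's title in every iteration. Anchoring the finder to the current tag is what makes the same step work in every iteration.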

 In the output object RSSItem, the "title" attribute is now populated. If we now go to the next iteration of the loop we will see that value change to reflect the value in the current iteration.

Now let's add an "Extract URL" and an "Extract Description" step to the robot. Once again, right-click on the link or description inside the current iteration and choose "Extract URL" or "Extract Description" to create these steps. Make sure that these steps are added to the robot between the For Each Tag and the Return Item steps, since we want them to be executed once per iteration.

It is important to note what actually happens when you run this robot. The digg.com page is loaded, then the loop iterates through all the stories. In each iteration the title, URL and description are extracted from within the current iteration tag and put into the RSSItem object. At the end of the iteration this populated RSSItem is returned, i.e., added to the robot's output RSS feed. Then the robot moves on to the next iteration. Moving to the next iteration means that the robot goes back to the For Each Tag step, moves the Current Tag, and then runs the extract steps and the return step again. It also resets the state of all objects to the state they had the last time it was at the For Each Tag step. So if RSSItem gets the title "Build cool robots at openkapow" in the Extract Title step of the first iteration, RSSItem.title will be empty again when iteration 2 starts, since that is the state the RSSItem object had when the iteration started.
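The state reset described above can be mimicked in Python by giving every iteration its own copy of the object as it looked on arrival at the loop. This is a minimal sketch of that behavior, not RoboMaker's actual implementation: the url and description attribute names, the run_loop function and the sample data are assumptions.

```python
import copy
from dataclasses import dataclass

@dataclass
class RSSItem:
    title: str = ""
    url: str = ""          # attribute name assumed for this sketch
    description: str = ""  # attribute name assumed for this sketch

def run_loop(stories):
    entry_state = RSSItem()   # object state on arrival at the loop step
    feed = []
    titles_at_start = []
    for story in stories:
        # Each iteration starts from a copy of the loop-entry state.
        item = copy.copy(entry_state)
        titles_at_start.append(item.title)
        item.title = story["title"]
        item.url = story["url"]
        item.description = story["description"]
        feed.append(item)     # the "Return Item" step
    return feed, titles_at_start

stories = [
    {"title": "Build cool robots at openkapow", "url": "http://example.com/1", "description": "First"},
    {"title": "Second story", "url": "http://example.com/2", "description": "Second"},
]
feed, titles_at_start = run_loop(stories)
```

The list titles_at_start records the title seen at the beginning of each iteration. It stays empty every time, which matches the description above: values extracted in one iteration do not leak into the next.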

The robot is now done, or is it? Let's test it and see if it works as it should.

Part 2 - Test the robot

 We want to confirm that this robot works as it should and that it will return the correct data. Open the RoboDebugger and click the "Run" icon to test this.

The debugger will now execute the whole robot, loop through all iterations and show what the robot returns. Check that the data returned corresponds to the stories on the Digg frontpage. If your robot returns different titles and different URLs but the same description for every story, it is likely that you added the Extract Description step outside the current iteration tag. If so, simply delete the Extract Description step and add it again correctly (you can change the settings on the existing step as well, but it is easier to just delete and add).
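If you have the returned items available outside the debugger, the "same description everywhere" symptom is easy to check for automatically. The helper below is a hypothetical sketch; the function name suspicious_fields and the sample items are made up for illustration.

```python
def suspicious_fields(items):
    """Return the names of fields whose value is identical in every item."""
    if len(items) < 2:
        return []  # nothing to compare against
    return [
        field for field in items[0]
        if len({item[field] for item in items}) == 1
    ]

# Sample output where the description was extracted outside the current tag.
items = [
    {"title": "Story A", "url": "http://example.com/a", "description": "Description A"},
    {"title": "Story B", "url": "http://example.com/b", "description": "Description A"},
]
flagged = suspicious_fields(items)
```

A field that shows up in flagged is a hint that the corresponding extract step is not tied to the current iteration tag.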

Nice, it works! Time to publish it and brag to all our friends about our cool RSS feed. This tutorial does not cover the details, since publishing a robot was covered in great detail in the tutorial Creating a basic RSS robot that reads from Digg; please refer to that tutorial if you are not familiar with how to do this. The published robot is available on openkapow.com.

Summary

We have created a robot that returns all the stories from the Digg frontpage in an RSS feed, again with a few mouse clicks and without writing any code at all. Digg of course has such an RSS feed already, but now you have the skills to build an RSS feed from any data, even from pages that have no RSS feed. Try it out by finding one of your favourite pages that does not have an RSS feed (or at least not the RSS feed you want), then build your own RSS robot and publish it on openkapow.com.

Our Digg robot would be much more useful if it returned more than just the stories on the frontpage, for example by looping through all the pages of Digg. This is exactly what we are going to do in the next tutorial.

Published Thursday, November 23, 2006 3:53 PM by Andreas

Comments

 

Tutorials said:

In this tutorial we will see how to improve the performance and the robustness of an RSS robot. To do this we need to go into some of the more advanced and powerful features of RoboMaker.

November 27, 2006 2:23 AM
 

Tutorials said:

In this tutorial we will see how to create and test a REST robot that searches Google and returns the first 20 results.

November 27, 2006 3:32 AM
 

gush said:

Mmmhh... I think it's not working...

July 20, 2008 12:04 PM
Copyright 2006, 2007 KapowTech.com All Rights Reserved