Creating an RSS robot that pages through Digg

This tutorial expands on the RSS robot from the tutorial Creating an RSS robot that returns all stories on the Digg frontpage. In that tutorial we made an RSS robot that returns the stories from the Digg frontpage. We are now going to change that robot so that it returns the stories from several pages at Digg, not just the first one.

It is assumed that you have read the previous tutorial and that you are familiar with the basics. It is also assumed that you have already downloaded the robot development environment RoboMaker and registered as a user on openkapow.com. All images in the tutorial can be viewed at full size: click on the image you are interested in and it opens in a new window. The robot built in this tutorial is downloadable here.

Part 1 - Create an RSS robot that returns all stories from the Digg frontpage

This is exactly what was done in the tutorial Creating an RSS robot that returns all stories on the Digg frontpage, so please refer back to that tutorial. There we built a 6-step robot that looped through the stories on the frontpage of Digg and returned the title, URL and description of each story. The tutorial covered the For Each loop and how the iterations in that loop work.

This robot has a clear limitation: it only returns the stories on the first page of Digg. If we want the stories from pages 2, 3, 4 and so on, our robot simply cannot handle that. This is the functionality we will add to the robot in this tutorial. There is no need to change the parts of the robot that we already have, since they do exactly what we want on each individual page. We just need to add another loop that goes through the pages at Digg.

Part 2 - Add a Repeat-Next loop

The For Each Tag loop we used to loop through the stories on the Digg frontpage does not really fit our needs when it comes to looping through pages. Going to a new iteration in the page loop is not just a matter of jumping to a new set of HTML tags (or, to put it more technically, moving the current tag). Instead we need to click on the "Next page" link on Digg and load the next page before going to the next iteration.

This is more than the For Each type of loop can do, so let us use the other kind of loop that RoboMaker has, the "Repeat-Next" loop. That type of loop is made up of two different steps - the "Repeat" step and the "Next" step. The Repeat step starts the loop and the Next step moves the loop to the next iteration. Time to add the start of the loop, i.e. the Repeat step. This step should be placed after Load Page but before For Each, since we want to include all of the For Each loop inside our Repeat-Next loop. Insert the Repeat step by right-clicking on the Load Page step and choosing "Insert Step After".

This inserts an empty step after the Load Page step. We now need to specify what type of step this should be. Right click on the empty step and choose "Configure Step".

In the "Step Configuration" window that opens, choose Loops -> Repeat in the "Select an Action" dropdown. This turns the new step into a Repeat step. Another way of doing this is to click on the empty step and then select an action directly in the step configuration on the right of the RoboMaker window.

To complete the Repeat-Next loop we need a "Next" step as well as a "Repeat" step, and to add this step we start by adding a new branch to the robot. Click on the Repeat step so it is selected and then click on the little icon for "Add branch for selected step".

In our robot we will now have 2 branches - the top branch with the For Each loop and the bottom branch with one empty step. Let's configure the new empty step as a "Next" step in the same way as we configured the previous empty step as a "Repeat" step.

If we now execute this robot it will do the "Load Page" and "Repeat" steps and then execute the "For Each" branch until that loop does not have any more iterations. Then the robot will execute the "Next" step, which moves the Repeat-Next loop to the next iteration. This iteration will once again do the "For Each" loop, and so on. The robot would return the stories from the Digg frontpage over and over again, since it does not actually move to the next page of Digg. So we have actually created a never-ending loop. This is of course not what we want, so we need to add more functionality to the Next branch. For more information on branches and the order of execution please refer to the RoboMaker documentation.
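The order of execution described above can be sketched in Python. This is only an illustration of the control flow as this tutorial describes it, not RoboMaker code: the `FRONT_PAGE` list and the `run_robot` function are made-up stand-ins, and the output is capped so the example finishes.

```python
from itertools import islice

# Hypothetical stand-in for the loaded Digg frontpage: three story titles.
FRONT_PAGE = ["Story A", "Story B", "Story C"]


def run_robot():
    """Yield stories in the order the two-branch robot from Part 2 would."""
    page = FRONT_PAGE              # "Load Page" step
    while True:                    # "Repeat" step: start of each iteration
        # Branch 1: the "For Each" loop over the stories on the current page.
        for story in page:
            yield story
        # Branch 2: the "Next" step. Nothing has changed `page` before it,
        # so the following iteration sees the same frontpage again.
        continue


# Cap the output at 7 stories; without the cap the loop never finishes.
first_seven = list(islice(run_robot(), 7))
print(first_seven)
```

The printed list repeats the same three stories in a cycle, which is the never-ending loop described above.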

Part 3 - Add functionality to the Next branch

In the robot's Repeat-Next loop we need to go to Digg's next page before the "Next" step in order to actually page through all the Digg stories on all the pages. This is simply done by adding a "Click" step before the "Next" step. The "Click" step will click on the "Next page" link at the bottom of the Digg page. To add it, select the "Next" step and right click on the "Next>>" link in the browser view.

If we execute the robot now we should be able to return all the stories on all pages of Digg. We can test this in RoboDebugger, but to avoid waiting a long time while the robot goes through the 2000+ pages we are going to use a breakpoint in the debugger. Start by opening RoboDebugger, then right click on the "Next" step and choose "Toggle breakpoint". Then you can run the robot in the debugger. Each time it comes to the "Next" step the debugger stops, and if you want to execute the next iteration of the Repeat-Next loop you click on the run icon once again. This should move the robot from page to page in Digg and return all stories on those pages. Right now the robot has a serious problem though. Take a minute to test the robot in RoboDebugger and see if you can find it.

Any luck with the testing? The problem is that the first time the "Click Next" step is executed it does just what it is intended to do and clicks the link that loads the next page, but the second time that step is executed it actually clicks the link to the previous page. So the robot will just go back and forth between the first and second page. Clearly we need to fix this, and to do so we need to refine the Tag Finders of the "Click Next" step so that it does not just click on a link with the class "nextprev" (both the next and the previous links have the same class). This is done by adding a pattern that the tag has to follow in order to be found by the Tag Finder. This pattern, like all other patterns in RoboMaker, is actually a regular expression. If you are not familiar with regular expressions do not worry, there is plenty of help in RoboMaker (in the "Symbol" drop down under the Tag Pattern input field for example) or in the documentation. The pattern we use here is ".*>Next.*", which tests that the text in the link tag starts with the text "Next".
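To see why the pattern separates the two links, here is a small Python sketch. The two HTML snippets are hypothetical stand-ins for Digg's paging links (the real markup may differ), and the use of `re.fullmatch` assumes that a tag pattern has to match the whole tag, which is why the pattern begins and ends with ".*".

```python
import re

# Hypothetical markup for the two paging links; both share class "nextprev".
PREVIOUS_LINK = '<a href="/page1" class="nextprev">&#171; Previous</a>'
NEXT_LINK = '<a href="/page3" class="nextprev">Next &#187;</a>'

# The tag pattern from the tutorial.
TAG_PATTERN = re.compile(r".*>Next.*")


def has_paging_class(tag_html: str) -> bool:
    """Selecting by class alone cannot tell the two links apart."""
    return 'class="nextprev"' in tag_html


def is_next_link(tag_html: str) -> bool:
    """True only when the whole tag matches the tutorial's pattern."""
    return TAG_PATTERN.fullmatch(tag_html) is not None


for name, tag in (("previous", PREVIOUS_LINK), ("next", NEXT_LINK)):
    print(name, has_paging_class(tag), is_next_link(tag))
```

Both links pass the class check, but only the link whose text begins right after the ">" with "Next" matches the pattern.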

Right now we have a breakpoint in our robot to avoid a long testing run. To create a robot that does not need this breakpoint (breakpoints only work in the debugger and not when a robot is published on openkapow.com), let us add a test to the robot that stops it from iterating through more than 3 pages of stories. This test needs to be between the "Click Next" and the "Next" steps, since if we have reached page 4 in the "Click Next" step we do not want to execute the "Next" step. If we look at the Digg page in the RoboMaker browser view we see that the current page is highlighted in the list of pages at the bottom of the page, and a quick look in the HTML source view shows that the current page is actually in a span tag with the class "current". This suits us perfectly. Let's add a "Test Tag" step to test this tag and its content. Select the "Next" step and then right click on the current page in the browser view and choose "Test Tag".

To get the new "Test Tag" step to work as we would like, we need to change some of the configuration of that step, so click on it to select it and then take a look at the Action tab. There we need to add a pattern against which the tag found by the step's Tag Finders is tested. In this case we add "4" and make sure that the Action is set to "Stop if Pattern Matches Found Tag". This means that if the current page is 4 the "Test Tag" step will stop, and if the "Test Tag" step stops there are no other steps left in the robot to execute, so the whole robot stops.
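The complete robot, with the "Click Next" and "Test Tag" steps in the Next branch, behaves roughly like the following Python sketch. It is a simulation under stated assumptions rather than anything RoboMaker produces: the `PAGES` dictionary is a made-up stand-in for Digg, and the stop pattern is compared against the current page number as text.

```python
import re

# Hypothetical stand-in for Digg: page number -> story titles on that page.
PAGES = {n: [f"Page {n} story {i}" for i in (1, 2)] for n in range(1, 7)}

# Pattern configured on the "Test Tag" step in the tutorial.
STOP_PATTERN = "4"


def run_robot() -> list[str]:
    """Collect stories page by page until the stop pattern matches."""
    current = 1                          # "Load Page": start on the frontpage
    stories = []
    while True:                          # "Repeat" step
        # Branch 1: "For Each" over the stories on the current page.
        stories.extend(PAGES[current])
        # Branch 2: "Click Next", then "Test Tag", then "Next".
        if current + 1 not in PAGES:     # no "Next" link left to click
            break
        current += 1                     # "Click Next" loads the next page
        # "Test Tag": stop if the pattern matches the current page number.
        if re.fullmatch(STOP_PATTERN, str(current)):
            break
        # "Next": fall through to the following iteration of the loop.
    return stories


collected = run_robot()
print(len(collected), collected[-1])
```

With two stories per simulated page, the run collects six stories and the last one comes from page 3, because the test stops the robot as soon as page 4 has been loaded.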

Test the robot in the debugger again. First test with the breakpoint, so that if something does not work you do not end up with a never-ending (or at least very long running) loop. Then remove the breakpoint by using "Toggle breakpoint" and test the whole robot in the debugger. When you are confident that the robot works as it should, it is time to once again publish a robot on openkapow.com. This tutorial will not cover the details of this, since the publishing of a robot was covered in great detail in the tutorial Creating a basic RSS robot that reads from Digg. Please refer to that tutorial if you are not familiar with how to do this. Since the robot is published at openkapow.com it is available here.

Summary

We have created a robot that returns all the stories from the first 3 pages of Digg in an RSS feed. While doing this we have learned how to use the Repeat-Next kind of loop and also how to use branches and conditions in a robot.

Wouldn't it be cool if the robot, instead of just returning the stories from the first few Digg pages, returned stories related to a specific search term such as "ipod nano"? This is exactly what we are going to do in the next tutorial!

Published Thursday, November 23, 2006 4:23 PM by Andreas

Comments

 

Tutorials said:

In this tutorial we will see how to improve the performance and the robustness of an RSS robot. To do this we need to go into some of the more advanced and powerful features of RoboMaker.

November 27, 2006 2:23 AM
 

Tutorials said:

In this tutorial we will see how to create and test a REST robot that searches Google and returns the first 20 results.

November 27, 2006 3:32 AM
 

Sensi_web said:

This tutorial isn't up to date anymore!!! Please have a look and rebuild.

April 5, 2007 3:59 PM
 

Sano said:

This tutorial isn't working. Please update it so it will work

July 18, 2007 5:34 AM
 

Tutorials said:

In this tutorial an RSS robot that uses a For Each loop to return all stories from the Digg frontpage will be built.

October 26, 2007 3:26 PM
 

aaron said:

This tutorial isn't working. Please update it so it will work.

March 16, 2008 1:21 PM
Copyright 2006, 2007 KapowTech.com All Rights Reserved