

Improve Performance and Robustness of a robot

In this tutorial we will see how to improve the performance and the robustness of an RSS robot. To do this we need to go into some of the more advanced and powerful features of RoboMaker.

This tutorial builds on the other RSS tutorials about how to build a basic RSS robot, how to use For Each and Repeat-Next loops, and how to use input values in the RSS robot. It is assumed that you have read and understood these tutorials, that you have already downloaded the robot development environment RoboMaker, and that you have registered as a user on openkapow. The robot built in this tutorial is also available for download.

Part 1 - Improving Robustness

A robust robot is a robot that will keep functioning even if the web site it interacts with changes. It is of course impossible to make a robot that can handle every thinkable and unthinkable scenario, but a few easy changes can make a robot much better at handling minor changes such as layout changes or added content. Sites rarely undergo a major overhaul of their design and structure, and such overhauls are definitely no more common than breaking changes in other kinds of integration, such as APIs or XML files.

As an example of improving the robustness of a robot we continue to work on the robot from the tutorial Creating an RSS robot with an input value to search Digg. If you haven't read that tutorial, now would be a good time. This robot searches Digg based on an input value to the robot. It returns an RSS feed based on the search results.


Most steps in this robot have been added by right clicking in the browser view in RoboMaker and then choosing the type of step that should interact with the highlighted HTML tag. This is a very quick and convenient way of adding steps, and in the vast majority of cases the configuration of the steps (as created by RoboMaker) works perfectly. Sometimes, though, it is good to put a bit more thought and care into the step configuration. Let us start by taking a closer look at the Tag Finder configuration of the "Test Tag" step.

The Tag Finder tries to find a div with the attribute name class and the value notice anywhere on the page. Using the attribute name class and the attribute value notice is very good for robustness since this points the step to exactly the correct div tag. What is not perfect for robustness in this step is the Tag Path .*.div.div.div. This Tag Path points to any div tag contained inside two other div tags. So what if Digg changes its structure and the div we are looking for is no longer contained within two other divs? Then the step will not find what it is looking for, even if we can see the div in a browser. If we instead change the Tag Path manually to .*.div we immediately improve the robustness of this step, because now the step is really looking for a div with the class notice anywhere on the page. Do not forget to test your robot in the RoboDebugger after each change, otherwise you run the risk of "improving" your robot so much that it actually does not work.
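To make the difference between the two Tag Paths concrete, here is a minimal Python sketch. It is only a model, not how RoboMaker implements Tag Finders: it assumes that a Tag Path is compared with the chain of tag names from the page root down to the candidate tag, and that a leading .* stands for any number of enclosing tags. The function name and the two example chains are made up for illustration.

```python
import re


def tag_path_matches(tag_path, ancestor_chain):
    """Return True if the dotted ancestor chain fits the Tag Path.

    Assumed model (not RoboMaker's real matcher): a leading ".*." means
    "any number of enclosing tags", and the remaining names must be the
    last tags in the chain.
    """
    if tag_path.startswith(".*."):
        prefix = r"(?:\w+\.)*"
        names = tag_path[3:].split(".")
    else:
        prefix = ""
        names = tag_path.split(".")
    pattern = prefix + r"\.".join(re.escape(name) for name in names)
    return re.fullmatch(pattern, ancestor_chain) is not None


# Hypothetical positions of the "notice" div before and after a redesign.
old_layout = "html.body.div.div.div"
new_layout = "html.body.div"

for path in (".*.div.div.div", ".*.div"):
    print(path,
          tag_path_matches(path, old_layout),
          tag_path_matches(path, new_layout))
```

Under these assumptions the longer Tag Path only fits the old layout, while .*.div fits both, which is the robustness gain described above.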

Our robot is quite robust already thanks to the For Each loop we are using. This loop creates a Current Tag and moves that Current Tag in each iteration of the loop. All the Extract steps inside the loop have their Tag Finders defined in relation to this Current Tag. If we take a look at the Tag Finder of the "Extract Title" step, for example, we see that instead of trying to find something anywhere on the page it tries to find something inside Current Tag 1, which in this case is defined by the For Each step.

Using Current Tags is a very powerful way of making your robots more robust, and if you are interacting with big, complicated web pages it is also a very good way to get more control over which steps interact with which tags on the page. The For Each loops use Current Tags, but if you want to set a Current Tag outside a loop, simply use the step "Set Current Tag". Once a Current Tag is set, all Tag Finders in subsequent steps can be set in relation to this Current Tag. If you define several Current Tags, subsequent Tag Finders can even be defined in relation to all your Current Tags, for example as between Current Tag 1 and Current Tag 2. Let's use the Set Current Tag step to improve the robustness of the "Click Next" step. Currently the Tag Finder of the Click step is looking for an a tag with the class nextprev that fits the pattern .*>Next.*. The problem is that the a tag has to be within three div tags.

To improve this Tag Finder, and to see how to use the Set Current Tag step, let's add a Set Current Tag step in front of the "Click Next" step. This is done by making the "Click Next" step active, identifying the div tag that contains all the paging links on the search result page, and then right clicking on that div and choosing "Other" and "Set as Current Tag".

Now we have added a "Set Current Tag" step between "Test Tag" and "Click Next". The Tag Path in the Tag Finder of the new step is set to ".*.div.div.div"; change this to ".*.div" to improve the robustness of the step.

Time to change the Tag Finder in the "Click Next" step so that it uses the new Current Tag. You can either do this manually by editing the values in the existing Tag Finder, or you can make the "Click Next" step active, right click in the browser view on the tag you want to use and choose "Use only this Tag" to replace the current Tag Finder. Beware that if you do it the latter way you will have to re-enter the needed Tag Pattern ".*>Next.*" in the Tag Finder. Instead of doing that, let's do it manually. Open the Tag Finder of the "Click Next" step and change Find Where to "In Current Tag". Set "in this tag" to "Current Tag 1" and change the Tag Path to ".*.a".

The "Click Next" step is now much more robust since it does not matter where on the page the div with all the page links and the next/previous links is; the step will still find it and click on the correct link. Using a "Set Current Tag" step in this situation is a bit of an overkill. A much simpler way of doing things would be to just change the Tag Finder of the "Click Next" step (without adding any Current Tag step at all) to find anywhere on the page with the Tag Path ".*.a". But if we did that you would not have seen the usefulness and power of the "Set Current Tag" step.
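The idea of first fixing a Current Tag and then searching only inside it can be sketched outside RoboMaker as well. The Python example below is an analogy, not RoboMaker code: the markup is a made-up, simplified stand-in for the Digg result page, and the class name "pages" for the paging div is an assumption (only the class nextprev and the text "Next" come from the tutorial).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is more complex.
PAGE = """
<html><body>
  <div class="main">
    <a class="nextprev" href="/about">Next steps for the site</a>
    <div class="pages">
      <a class="nextprev" href="/search?page=1">Previous</a>
      <a href="/search?page=2">2</a>
      <a class="nextprev" href="/search?page=3">Next</a>
    </div>
  </div>
</body></html>
"""


def find_next_link(root):
    """Return the href of the Next link inside the paging div, or None."""
    # Counterpart of "Set Current Tag": find the paging div anywhere
    # on the page, regardless of how deeply it is nested.
    current_tag = root.find(".//div[@class='pages']")
    if current_tag is None:
        return None
    # Counterpart of "Click Next" with Find Where set to "In Current Tag":
    # only a tags below the current tag are considered.
    for link in current_tag.iter("a"):
        text = link.text or ""
        if link.get("class") == "nextprev" and re.match(r"Next", text):
            return link.get("href")
    return None


root = ET.fromstring(PAGE.strip())
print(find_next_link(root))
```

Because the search is limited to the paging div, the similarly named link outside it is never considered, and the result does not depend on how many div tags wrap the paging div.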


Part 2 - Improving Performance

Performance is of course very important in robots that interact directly with a user; we do not want the user to sit and wait for many seconds while our robot does its thing. This makes performance less important for RSS robots than for REST and Clipping robots, especially since RSS robots are run at a preset frequency and the result is cached in between those executions. Nevertheless, performance is important for RSS robots too, and if nothing else, we can use our current Digg RSS robot as an example of how to improve the performance of all types of robots.

Openkapow robots interact with web sites. Thus the performance of every robot depends on things like the response time of the web site it is interacting with, the available bandwidth and many other factors outside our control when we are building robots. As we will soon see, there are nevertheless many things within our power in RoboMaker to improve performance. The main gain in performance comes from avoiding needless page loads. A page load is either a "Load Page" step, a "Click" step or any other step that loads a page from a web site. Steps that are purely internal to RoboMaker (extracts, conditions etc.) usually take a few milliseconds to execute, while a page load might take many seconds depending on the web site in question (among many other things).

In the Digg RSS robot we are working on now there are not really any unnecessary page loads. If we really needed to improve the performance we could replace the first few steps - which load Digg.com, fill in the search term and click on the search button - with one page load that goes directly to "http://digg.com/search?s=kapow" (if we are searching for the term "kapow"). This cuts down the execution time, but it immediately makes the robot less robust since we require this quite specific URL to work.
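If the search term comes from the robot's input value, the direct URL has to be assembled from it. How this is configured in RoboMaker is not covered here; the Python sketch below only illustrates the principle, including the URL encoding that becomes necessary as soon as the term contains spaces or special characters. The function name is made up, and the parameter name s is known only from the single example URL above.

```python
from urllib.parse import urlencode


def digg_search_url(search_term):
    """Build the direct search URL for a given search term."""
    # urlencode escapes spaces and special characters in the term.
    return "http://digg.com/search?" + urlencode({"s": search_term})


print(digg_search_url("kapow"))       # http://digg.com/search?s=kapow
print(digg_search_url("web robots"))  # http://digg.com/search?s=web+robots
```

This also shows where the loss of robustness comes from: the robot now depends on both the path /search and the parameter name staying the same.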

Let us see what we can do to improve the robot's performance without the hassle of changing the page loads around; right now we are not that desperate to cut down the execution time. Before we get started, it would be good to have a very rough idea of how long the robot takes to execute. Then we will at least be able to tell whether the changes we make are in fact improving the execution time. To get this rough execution time we open up the RoboDebugger and run the robot with the input search text "robot".

When RoboDebugger is done we see in the lower status bar of the debugger that the total execution time was 40.7 seconds. Keep in mind that this is when the robot ran in the debugger, which is very much slower than running the robot when it is published to openkapow.com! All we need this value for right now is as a rough baseline to compare later execution times to, so we can tell whether our improvements actually are improvements. If we run the robot again we will get another value (in my case 38.9 seconds), so 40.7 seconds is not set in stone. It depends on such things as how busy the machine it is running on is, the available bandwidth, whether Digg already has the search cached, how busy Digg is, and so on.

We now have a value to compare our progress to, so let us go back to RoboMaker and open the Robot Configuration (available in the File menu). Then we click on the "Configure..." button to configure the default options of the robot. These are the options used by each step that loads a page (i.e. "Load Page" and "Click"). For each of those steps we can define whether it should use the default options (which we are about to configure) or its own options. To change this, go to the action tab of a Load Page step, where you can define whether the step should use the default options or not.

For now we just assume that all steps will use the default options. If this proves not to be a good idea when we are testing the robot, we can change this and configure the options of some steps individually if necessary. But let's start with the default options. There are 5 tabs in the options window - All Loading, Page Loading, JavaScript Execution, JavaScript Event Handlers and Logging. In this tutorial we are just going to touch on some of the functionality in the first 3 tabs, but it is well worth reading the RoboMaker help and the manual to get to know the other options available here.

In the "All Loading" tab there is one setting in particular that is interesting to us right now, the "Enable Cookies" one. Our robot is not really using cookies for anything, it is not logging in to any system and there is no need to save any session data in a cookie. But is the robot downloading any cookies anyway? If we load the Digg.com page in the robot (simply click on the "Enter Value" step) and check the "Cookies" tab, we see that loading Digg.com involves 3 different cookies.

One of the cookies tracks the Digg session and two of them are probably used by some advertising network to track the ads they display on Digg.com. None of these are of any interest to us, so let's uncheck "Enable Cookies" in the default options. Now no cookies will be downloaded when we run our robot, and that should already have saved some execution time. A quick test in RoboDebugger confirms that the robot is still working as it should and that the execution time is now down to 37.4 seconds. This is almost too small a change to mean anything (maybe Digg simply responded quicker this time), but at least nothing broke.

If we move on to the "Page Loading" tab in the default configuration window we find another very basic and very interesting option, the "Load Frames" option. The default is that all frames are loaded, but is this really what is needed in this case?

In RoboMaker each frame is loaded into its own separate subtab under the "Windows" tab. Here it is clear that as soon as we go to Digg.com two frames are loaded: one that contains all the content we need to interact with, and one that contains a banner. Since the robot does not need this banner we are probably safe not loading any frames; we just want to load the main window. So let's uncheck "Load Frames" and test the robot again in RoboDebugger.

Sidenote: to see a site using many, many frames, simply make a robot that loads CNN.com. There you've got a perfect example where it really pays off not to load all those unnecessary frames.

This time the total execution time of the robot in RoboDebugger is down to 24.7 seconds, quite a significant improvement. What we have done by not loading the banner is to cut the number of page loads in half. For each page load (Load Page, Click Submit and the multiple Click Next steps) the robot is now only loading the main window, and never the banner frame.

It is well worth noting that the "Images to Load" option in the "Page Loading" tab is by default set to "None". This means that when the robot is executed none of the images on the page are downloaded to the server. This alone is a huge improvement in performance, and one that we get for free. This option alone means that it is usually quicker to load a page using a robot than in a normal browser.

The "Page Loading" tab in the options window contains another very powerful mechanism when it comes to improving performance - Page Changes. When a robot is executed, the content of the pages it loads is moved into the server's memory so that the robot can interact with this HTML easily and quickly. The more content that is moved into memory, the more content the robot has to interact with and the slower the robot becomes. Page Changes are a way to reduce the amount of HTML that is loaded into memory and thereby improve performance. When we are interacting with Digg in our current robot we are not interested in the sidebar that lists all topics, lets users log in etc. Still, this sidebar is loaded into memory on each page load. If we use Page Changes to remove this sidebar from each page load we might save some execution time. The sidebar is of course still returned by Digg, but it is removed before the HTML Digg returns is loaded into memory, so for the robot it is just as if this sidebar never existed.

Page Changes are very powerful and there are many ways to configure them. For now we keep it simple and use the same Page Changes for all pages in the robot. To add Page Changes choose "Same for All Pages" in the Page Change dropdown.

One of the most common page change converters is "Replace Pattern", so let us add one of those by clicking the "+" and then configuring the new converter as "Text Formatting" and "Replace Pattern".

Double click on the new "Replace Pattern" converter to configure it. In a "Replace Pattern" converter we can define what patterns to find (using regular expressions) and what to replace those patterns with. To know what patterns we are looking for we need to look at the page HTML in RoboMaker. There we find that the div with the id "sidebar" contains a lot of things in the page that we are not interested in. Furthermore we see that there is nothing in the HTML after that div that we are interested in, so let us remove that whole div and everything after it in one "Replace Pattern" page change. The pattern is "(<div id="sidebar".*)" and we replace it with an empty string.

All Page Changes need to be carefully tested in RoboDebugger; if the pattern is a bit wrong we might remove parts of the page that the robot needs to do what it is supposed to do. Also keep in mind that any Page Change risks making the robot less robust. In this case we are very vulnerable if Digg moves the sidebar div to the top of its HTML. Testing the robot in RoboDebugger once again proves that our changes have not broken the robot. The execution time is now 25.4 seconds, slightly more than before, but not enough of a change to really mean anything. Page Changes are very useful when a robot interacts with a big page. In this case the page is already quite small, so the change in execution time is not great (of course there is also some overhead to executing the Page Changes that needs to be considered).
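Both the effect of the pattern and its fragility can be tried out on a small scale. The sketch below uses Python's re module rather than RoboMaker's own regular expression engine, so details may differ; in particular, Python needs the DOTALL flag for .* to run across line breaks to the end of the page. The two HTML snippets are made-up, heavily shortened stand-ins for a Digg page.

```python
import re

# Hypothetical page with the sidebar after the content we need.
PAGE = """<html><body>
<div id="contents"><a class="nextprev" href="?page=2">Next</a></div>
<div id="sidebar"><ul><li>Topics</li><li>Login</li></ul></div>
<div id="footer">About</div>
</body></html>"""

# The same page after an imagined redesign: sidebar moved to the top.
MOVED = """<html><body>
<div id="sidebar"><ul><li>Topics</li><li>Login</li></ul></div>
<div id="contents"><a class="nextprev" href="?page=2">Next</a></div>
</body></html>"""

# The pattern from the tutorial: the sidebar div and everything after it.
SIDEBAR_AND_REST = re.compile(r'(<div id="sidebar".*)', re.DOTALL)


def strip_sidebar(html):
    """Replace the sidebar and all following HTML with an empty string."""
    return SIDEBAR_AND_REST.sub("", html)


print("Next" in strip_sidebar(PAGE))   # True: the paging link survives
print("Next" in strip_sidebar(MOVED))  # False: the content was removed too
```

The second result is the vulnerability mentioned above: once the sidebar precedes the content, the same pattern removes everything the robot needs.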

If we move on to the next tab in the default options window we get to "JavaScript Execution", where there are many ways to configure how JavaScript will be executed in the sites that the robot uses. In this simple Digg RSS robot we do not really use any JavaScript, unless the links we click on have some onClick JavaScript event handlers that do something important. Let's simply uncheck "Execute JavaScript" in this tab and test the robot in the debugger. If it still works as we want, there is no reason to let our robot spend any extra time executing needless JavaScript. Another bonus we get by not executing JavaScript is that no external JavaScript documents will be imported when executing the robot. In the header tag of the Digg HTML we see that a JavaScript file called "utils.js" is imported, and if we do not import this file the robot should be a bit faster.
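Before switching JavaScript off it helps to know what a page would load and which tags carry event handlers. In RoboMaker this is done by inspecting the page HTML; the Python sketch below shows the same inspection done programmatically with the standard library's HTMLParser. The class name and the sample markup are made up for illustration; only the file name utils.js comes from the tutorial.

```python
from html.parser import HTMLParser


class ScriptUsageCollector(HTMLParser):
    """Collect external script files and inline on* event handlers."""

    def __init__(self):
        super().__init__()
        self.sources = []    # values of <script src="...">
        self.handlers = []   # (tag, attribute) pairs such as ("a", "onclick")

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script" and attributes.get("src"):
            self.sources.append(attributes["src"])
        # HTMLParser lowercases attribute names, so onClick becomes onclick.
        for name in attributes:
            if name.startswith("on"):
                self.handlers.append((tag, name))


# Hypothetical, shortened page.
SAMPLE = """<html><head>
<script type="text/javascript" src="utils.js"></script>
</head><body>
<a class="nextprev" href="/search?page=2" onClick="track()">Next</a>
</body></html>"""

collector = ScriptUsageCollector()
collector.feed(SAMPLE)
print(collector.sources)   # ['utils.js']
print(collector.handlers)  # [('a', 'onclick')]
```

An empty handler list for the tags the robot clicks on suggests that disabling JavaScript is safe, but the test run in the debugger remains the real check.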

Testing the robot in the debugger now once again shows that the robot works fine, and this time the execution time is down to 22.5 seconds. This is a little more than half of what we started with (40.7 seconds), a reduction of roughly 45%, which is quite remarkable since we have not made any changes to the robot steps, only to the overall robot configuration.
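The measurements from this part of the tutorial can be put side by side with a few lines of Python. The times are the RoboDebugger values quoted above; the labels are shorthand for the setting changed at each stage, and every setting stays in effect for the later runs. Since single runs vary by a second or two, small differences should not be over-interpreted.

```python
# Execution times in seconds as reported by RoboDebugger in this tutorial.
timings = [
    ("baseline", 40.7),
    ("cookies disabled", 37.4),
    ("frames not loaded", 24.7),
    ("sidebar removed by Page Change", 25.4),
    ("JavaScript disabled", 22.5),
]


def reduction_percent(baseline, current):
    """Return how much shorter the current run is, in percent of baseline."""
    return (baseline - current) / baseline * 100


baseline = timings[0][1]
for label, seconds in timings[1:]:
    saved = reduction_percent(baseline, seconds)
    print(f"{label}: {seconds:.1f} s ({saved:.1f}% below baseline)")
```

The last line reports 44.7% for the final configuration, and the largest single step is clearly the one where frames are no longer loaded.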

The performance of the robot can be improved further by changing the step configuration and even by removing some steps. Some examples of what can be done are:
  • Replace the first 3 steps with one simple Load Page step that goes directly to the search result page of Digg.
  • Remove the top "Find Tag" branch and instead add the error handling "Ignore and Skip Branch" to several steps. After all, it is not the best for performance to always test for something (i.e. "No results found") that will only be there in exceptional cases.
  • Do not use the step "Set Current Tag" to define the Tag Finder for the "Click Next" step; just change the Tag Path to ".*.a" in the Click step instead (as discussed earlier in this tutorial).
To look at this robot in detail and to test it out, you can download it here.

Summary

In this tutorial we have improved both the robustness and the performance of an RSS robot that interacts with Digg.com. We have talked about how to improve Tag Finders, how to use Current Tags and how to use the powerful options in the Robot Configuration. Most of the topics in this tutorial work the same for REST and Clipping robots as for RSS robots, and understanding how to make a robot more robust and how to improve performance also gives a better understanding of how openkapow and RoboMaker work. There are many options and configurations for each robot and for each step, and any given problem can usually be solved in a number of different ways. After this tutorial you should have a very good understanding of the potential of openkapow and you should be more than ready to build your own, quite complex robots with good robustness and performance.
Published Friday, November 24, 2006 2:16 PM by Andreas

Comments

 

 

Sano said:

The site www.digg.com and cnn.com do not use frames any more. or atleast not that i can see. so that is out of date. plz update

July 19, 2007 1:37 AM