
Pod-scraping

  •  02-28-2007, 9:53 PM

    I am trying to scrape some data from a podcast feed (such as the title or the link to the mp3 file).
    Intuitively, this should be very easy, because the structure of podcast XML files is very simple.

    So I used the "Load Page" action with the URL http://radiofrance-podcast.net/podcast/rss_13305.xml
    I expected to see this structure in the tree view:

    - channel
      + title
      + link
      +...
      - item
         + title
         + link
         + enclosure
         + guid

    In this case, getting the mp3 file is very easy: just identify the tag with the path .*.item.guid
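
    For comparison, here is a minimal sketch of reading that structure directly as XML, outside RoboMaker, using Python's standard xml.etree.ElementTree module. The sample feed, the example.com URLs and the extract_items helper are all made up for illustration; the real Radio France feed may contain different or additional elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed that mirrors the structure described in the post.
# It is NOT the content of the real feed; the URLs are placeholders.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example podcast</title>
    <link>http://example.com/</link>
    <item>
      <title>Episode 1</title>
      <link>http://example.com/episode1</link>
      <enclosure url="http://example.com/episode1.mp3" type="audio/mpeg" length="0"/>
      <guid>http://example.com/episode1.mp3</guid>
    </item>
  </channel>
</rss>"""


def extract_items(feed_xml):
    """Return one dict per <item>, with its title, guid and enclosure URL."""
    root = ET.fromstring(feed_xml)
    results = []
    # iter("item") finds every <item> at any depth, so the lookup does not
    # depend on what sits between the root element and the items.
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        results.append({
            "title": item.findtext("title"),
            "guid": item.findtext("guid"),
            "enclosure_url": enclosure.get("url") if enclosure is not None else None,
        })
    return results


if __name__ == "__main__":
    for entry in extract_items(SAMPLE_FEED):
        print(entry)
```

    Because the items are located by element name rather than by their position in a rendered page, this kind of lookup keeps working if other elements are added to the channel.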

    However, I was very surprised to see that RoboMaker adds this comment to the XML code:
    Kapow RoboSuite: This is an HTML representation of an XML document

    In the tree view, the structure is much more complicated than expected, and each tag is
    surrounded by span tags such as <span class="start-tag">

    If I use the tag path automatically generated by RoboMaker, the robot is not robust: if the structure of
    the podcast changes slightly, it can no longer find the right tag. Making it more robust requires a lot
    of effort, mainly because this is an HTML representation of the XML.


    So what am I doing wrong? Is there a way to load the page in its native XML format? Something like a
    "Load XML" action instead of the "Load Page" action?
