#19 Implement a set of guarantees about the crawler output

open
nobody
1
2008-06-06
2007-11-04
No

While working on issue '[ 1800656 ] Let all crawlers use rootElementOf', it turned out that my understanding of that task was too broad, so I am creating a new ticket.

At first glance I wanted to implement a test that would guarantee that the DataObjects extracted by the crawlers form a valid containment tree. This was in line with the ideas I expressed in the NIE specification:

http://www.semanticdesktop.org/ontologies/2007/01/19/nie/

I developed a ModelTester for this purpose. It has been uploaded to the Nepomuk SVN repository and has become part of nrlvalidator.jar. See:

http://dev.nepomuk.semanticdesktop.org/repos/trunk/sandbox/org.semanticdesktop.nepomuk.nrl.validator/src/main/java/org/semanticdesktop/nepomuk/nrl/validator/testers/DataObjectTreeModelTester.java

The simple assumption that everything is a tree turned out to be difficult to implement. After complaints from Christiaan I relaxed the constraint from

"All DataObjects are part of a tree with the root element at the top"

to

"All DataObjects are part of a tree with the root element at the top, but the root element may be connected to one higher-level element"

... to allow for the fact that a root element may hold references to resources that were not crawled. With this assumption, I succeeded in implementing this check for the FileSystemCrawler, IcalCrawler and ThunderbirdCrawler.
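The relaxed constraint can be sketched in plain Java as follows. This is only an illustration, not the code of DataObjectTreeModelTester: the class name, the method name and the representation of the containment relation as a `Map` from each object URI to its container URI are all assumptions made for this example. Using a map also means each object has at most one container, which is a simplification.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Aperture or the nrlvalidator.
public class ContainmentTreeCheck {

    /**
     * @param crawled URIs of all DataObjects produced by the crawler
     * @param partOf  maps an object URI to the URI of its container (assumed representation)
     * @param root    URI of the root element
     * @return human-readable violations; empty if the output forms a valid tree
     */
    public static List<String> findViolations(Set<String> crawled,
                                              Map<String, String> partOf,
                                              String root) {
        List<String> violations = new ArrayList<>();

        // Relaxed rule: the root may point to one higher-level element,
        // but that element must lie outside the crawled set.
        String rootParent = partOf.get(root);
        if (rootParent != null && crawled.contains(rootParent)) {
            violations.add(root + ": root element is contained in a crawled object");
        }

        for (String uri : crawled) {
            if (uri.equals(root)) {
                continue;
            }
            // Walk up the containment chain until the root is reached,
            // the chain ends, or an object is visited twice (a cycle).
            Set<String> seen = new HashSet<>();
            String current = uri;
            while (current != null && !current.equals(root) && seen.add(current)) {
                current = partOf.get(current);
            }
            if (current == null) {
                violations.add(uri + ": containment chain does not reach the root");
            } else if (!current.equals(root)) {
                violations.add(uri + ": containment cycle");
            }
        }
        return violations;
    }
}
```

A crawler whose output passes this check has every object reachable from the root, which is the property the trivial root-element check cannot give.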

For all other crawlers, I resorted to a trivial check that there are any root elements at all, without any assumptions about whether the root elements are in any sane relation to the other data objects. This is done with the RootElementModelTester. Clearly, a guarantee that some root elements exist is not enough to do anything useful with them.
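To show why the presence check is so weak, here is a minimal sketch of what such a check amounts to. It is not the RootElementModelTester source; the class name and the two plain sets used as input are assumptions made for this example.

```java
import java.util.Set;

// Hypothetical helper, not part of Aperture or the nrlvalidator.
public class RootElementPresenceCheck {

    /**
     * Returns true if at least one crawled object is marked as a root element.
     * Nothing is checked about how the remaining objects relate to that root.
     */
    public static boolean hasRootElement(Set<String> crawled, Set<String> rootElements) {
        for (String uri : rootElements) {
            if (crawled.contains(uri)) {
                return true;
            }
        }
        return false;
    }
}
```

An output consisting of one root element and any number of objects that are not connected to it passes this check, so a consumer cannot rely on reaching all objects from the root.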

So the task is:
1. What constraints can be imposed on the output of the other crawlers? They should be more rigorous than the one implemented in the RootElementModelTester.

2. How can these constraints be expressed with a model tester?

3. Add that model tester to the unit tests of those crawlers, and to the getAdditionalModelTester method in the appropriate Example...Crawler class in the examples folder.

4. Tweak the actual crawlers until the unit tests pass and the command line example stops reporting validation errors.
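Steps 2 to 4 could take roughly the following shape. This is a sketch under stated assumptions: `OutputCheck` is a hypothetical stand-in for a model tester and does not reproduce the nrlvalidator interface, the containment relation is again assumed to be a `Map` from object URI to container URI, and the "no self-containment" rule is a purely illustrative constraint, not a proposed answer to question 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical harness, not part of Aperture or the nrlvalidator.
public class CrawlerOutputChecks {

    /** Hypothetical stand-in for a model tester: returns one message per violation. */
    public interface OutputCheck {
        List<String> check(Map<String, String> partOf);
    }

    /** Runs every check and collects all violations, so one test run reports everything. */
    public static List<String> runAll(Map<String, String> partOf, List<OutputCheck> checks) {
        List<String> errors = new ArrayList<>();
        for (OutputCheck check : checks) {
            errors.addAll(check.check(partOf));
        }
        return errors;
    }

    /** Illustrative constraint only: no object may name itself as its container. */
    public static final OutputCheck NO_SELF_CONTAINMENT = partOf -> {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, String> entry : partOf.entrySet()) {
            if (entry.getKey().equals(entry.getValue())) {
                errors.add(entry.getKey() + ": object contains itself");
            }
        }
        return errors;
    };
}
```

A crawler unit test would then assert that `runAll` returns an empty list for that crawler's output, and the crawler would be adjusted until it does.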

Discussion

  • Antoni Mylka - 2008-06-06
    • labels: --> Research Questions
    • priority: 5 --> 1
     
  • Antoni Mylka - 2008-06-06

    Logged In: YES
    user_id=1613065
    Originator: YES

    I marked it as a research question with a low priority.

     
