Venexus DotNetNuke Blog - And then there was Search...Part II
DotNetNuke Articles, Code Snippets, Errors, and News
 
 Saturday, November 25, 2006
I know it has been over a week since the last post. Sorry to leave you hanging, but sometimes there are just not enough hours in a day. Anyway, without further ado, here is part two…SEAMUS.

At some point earlier this year, DNN Find became a different mission. We decided to build a full-blown search engine for DotNetNuke: not one that would just index a single DNN site, but one that would index all portals in a DNN installation AND information from external sites. And how would external site indexing best be handled? …Via RSS feed aggregation, of course.

Seamus is the first of the two modules that make up the Venexus Search Engine. SEAMUS = Search Engine Aggregation Module Utilizing Syndication. On a side note, there is also an obscure Pink Floyd song by the same name on the Meddle album, about an old hound dog, that not many know. Our hound dog “fetches” data and stores it in a table that has MS SQL Server full-text indexing enabled. But before I go into the specifics, I think it is important to know about the framework.

We started with traditional DotNetNuke module development…until EntitySpaces was released. I’m an old ASP/VB developer and personally, it took me a bit to get my head wrapped around how ES worked, but once I figured it out, I was hooked. ES saves the day by automagically generating all the CRUD (create, read, update, delete). While very similar to the logic of a BusinessController and InfoObject, ES uses Collections and Entities. But, where I found ES the most useful is the Dynamic Queries you can write directly into the business logic.

For example, in Seamus we need to check the domain to see if it matches one we are already indexing:

    ' Look up the domain of sURL among the domains we are already indexing.
    Dim colDomains As New VenexusDomainCollection
    ' Select only the two columns we need.
    colDomains.Query.Select(colDomains.Query.DomainName, colDomains.Query.DomainID)
    colDomains.Query.Where(colDomains.Query.DomainName.Equal(GetDomainName(sURL)))
    colDomains.Query.Load()
    If colDomains.Count > 0 Then
        'a bunch of removed logic goes here..
    End If

With the colDomains.Query.Select, we are only returning the data we need rather than all columns. With the colDomains.Query.Where, I eliminated the need to:

  1. Write a stored proc just to retrieve by DomainName
  2. Iterate through the entire table, every row of all domains, just to find the one I am looking for.

I won’t even go into the performance gain of not having to loop through those rows of all columns, nor the time (even though it would be simple) to write a stored proc to pass in DomainName and have it return the DomainID.

Here is an example of adding a record to Seamus for a new feed:

    Dim entFeed As New VenexusSeamus
    entFeed.AddNew()
    entFeed.Url = txtURL.Text
    entFeed.Title = txtTitle.Text
    entFeed.Account = txtAccount.Text
    entFeed.Password = txtPassword.Text
    entFeed.CacheTime = txtCacheTime.Text
    entFeed.FeedTimeOut = txtTimeOut.Text
    entFeed.DateAdded = Now()
    entFeed.DateUpdated = "1/1/1901"
    ' The checkbox state maps directly to the flag.
    entFeed.IsActive = chkActive.Checked
    entFeed.Save()

Easy enough, eh?

And here is an update of a feed for Seamus:

    Dim entFeed As New VenexusSeamus
    ' Load the existing row before changing it.
    entFeed.LoadByPrimaryKey(hidRSSID.Value)
    entFeed.Url = txtURL.Text
    entFeed.Title = txtTitle.Text
    entFeed.Account = txtAccount.Text
    entFeed.Password = txtPassword.Text
    entFeed.CacheTime = txtCacheTime.Text
    entFeed.FeedTimeOut = txtTimeOut.Text
    entFeed.DateAdded = Now()
    entFeed.DateUpdated = "1/1/1901"
    ' The checkbox state maps directly to the flag.
    entFeed.IsActive = chkActive.Checked
    entFeed.Save()

And a delete example:

    Dim entFeed As New VenexusSeamus
    entFeed.LoadByPrimaryKey(hidRSSID.Value)
    entFeed.MarkAsDeleted()
    entFeed.Save()

Yeah, it’s that easy. Makes you want to fire up your IDE, eh?

Sure, I have used DAL Builder Pro, which was a huge time saver, but EntitySpaces made me never want to develop any other way. Plus, last I checked, DAL Builder Pro was still only for DNN 3 development. The ease of generating the DAL, and the ability to easily REgenerate it if the database schema changes, make ES the tool of choice for all of our module development. I cannot even begin to count the hours I have previously spent hand coding changes in a DAL due to spec changes. Oh how I wish I had all those hours back!

With the new DNN admin grid templates, it is just ridiculous how much code is generated before you have to write the first line. The new template will generate an editable grid of the table(s), with sorting, paging, and search. If you are interested in .Net development (this is not just a DNN tool; it works for all .Net 2.0 development in either C# or VB.Net), you must check it out.

NOTE: Just so you know, we do not have any affiliation or partnership with EntitySpaces; we just think their tool rocks.

So, even though we had much of the initial Seamus development completed, we scrapped it and started development with ES. This will make future modifications and additions so much easier, saving time in the long run.

With that said, here is how Seamus works…

After you install Seamus, you can go into the module settings:

So in this example, the display for Seamus should show the top 10 items last indexed, each linking to the actual item through its Title and through the “…More” link. A feed icon will also be displayed that links to an RSS feed for the top 10 items.

Here is an example of the display:

Now while the above example does not show any local items (tabs or modules from this site), it does have items indexed from other sites. All of these items came from RSS feeds that were aggregated. As a module editor, you have the ability to manage external feeds (or local feeds if so desired, but we will go into more detail about how Seamus works shortly). If there were local items, they would be visible only if you have the proper permissions. Seamus checks permissions on any local site at the module and tab level for both the display and the RSS feed.

Here is an example of the feeds we are currently indexing on the Venexus Search Engine:

Here is the interface for adding new feeds:

Now we will get into how Seamus works…

First off, on the first load of Seamus, a dump of data from all modules supporting the IPortable interface (currently limited to DNN Core modules) is performed to ensure that there is data in the index. And every X hours (determined in module settings), the index is checked for new, updated, and deleted pages/modules.

Secondly, any feeds that have been added to Seamus are aggregated 5 at a time, ordered by last updated. While the user is sitting on the page, 5 more feeds are aggregated via AJAX every 30 seconds. Because aggregation is driven by user interaction rather than by a scheduled task like the core DNN Search, it decreases the load on the server.
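To make the "5 at a time, by last updated" idea a bit more concrete, here is a minimal sketch of how a batch could be picked. It is not the actual Seamus code: the FeedInfo class, the NextBatch function, and the example.com URLs are made up for illustration, and I am assuming that "last updated" means oldest first, so feeds that have never been aggregated come up before the rest.

```vbnet
Imports System
Imports System.Collections.Generic

Module FeedBatchSketch

    ' Hypothetical stand-in for a feed row; NOT the generated VenexusSeamus entity.
    Public Class FeedInfo
        Public Url As String
        Public DateUpdated As DateTime

        Public Sub New(ByVal url As String, ByVal dateUpdated As DateTime)
            Me.Url = url
            Me.DateUpdated = dateUpdated
        End Sub
    End Class

    ' Orders feeds oldest-first by DateUpdated (an assumption, see above).
    Private Function CompareByDateUpdated(ByVal a As FeedInfo, ByVal b As FeedInfo) As Integer
        Return DateTime.Compare(a.DateUpdated, b.DateUpdated)
    End Function

    ' Returns up to batchSize feeds, least recently updated first.
    Public Function NextBatch(ByVal feeds As List(Of FeedInfo), ByVal batchSize As Integer) As List(Of FeedInfo)
        ' Sort a copy so the caller's list is left untouched.
        Dim sorted As New List(Of FeedInfo)(feeds)
        sorted.Sort(AddressOf CompareByDateUpdated)
        Return sorted.GetRange(0, Math.Min(batchSize, sorted.Count))
    End Function

    Sub Main()
        Dim feeds As New List(Of FeedInfo)
        feeds.Add(New FeedInfo("http://example.com/a.rss", New DateTime(2006, 11, 20)))
        feeds.Add(New FeedInfo("http://example.com/b.rss", New DateTime(1901, 1, 1)))
        feeds.Add(New FeedInfo("http://example.com/c.rss", New DateTime(2006, 11, 24)))

        ' Prints b.rss first (never aggregated), then a.rss.
        For Each feed As FeedInfo In NextBatch(feeds, 2)
            Console.WriteLine(feed.Url)
        Next
    End Sub

End Module
```

In a real module the sort and the row limit would more likely be pushed down into the query itself, so that only the rows in the batch ever leave the database.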

In order to save bandwidth, and to not tick off the owners of the websites you are aggregating data from, Seamus has what I call “smart caching”. Each time a feed is requested, if the information in the feed has not been updated, Seamus will increase the cache time. If the feed has been updated, Seamus will decrease the cache time, so it requests the same feed sooner than it had previously. Over time, based on how often a feed is updated on “average”, Seamus learns when to check again for updates, all while obeying TTLs.
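For the curious, the general shape of that kind of adaptive caching can be sketched in a few lines. This is only an illustration under my own assumptions, not the actual Seamus logic: the NextCacheTime function, the doubling and halving steps, and the 15 minute and 1440 minute bounds are all hypothetical. The one outside fact it leans on is that the RSS 2.0 ttl element is expressed in minutes.

```vbnet
Imports System

Module SmartCacheSketch

    ' Hypothetical bounds, in minutes. These are illustration values only.
    Private Const MinCacheMinutes As Integer = 15
    Private Const MaxCacheMinutes As Integer = 1440

    ' Returns the cache time (in minutes) to wait before the next request.
    Public Function NextCacheTime(ByVal currentMinutes As Integer, _
                                  ByVal feedChanged As Boolean, _
                                  ByVal ttlMinutes As Integer) As Integer
        Dim nextMinutes As Integer
        If feedChanged Then
            ' The feed had new content: check again sooner.
            nextMinutes = currentMinutes \ 2
        Else
            ' Nothing new: back off and wait longer.
            nextMinutes = currentMinutes * 2
        End If

        ' Keep the result inside the assumed bounds.
        nextMinutes = Math.Max(MinCacheMinutes, Math.Min(MaxCacheMinutes, nextMinutes))

        ' Never poll more often than the feed's own ttl allows.
        Return Math.Max(nextMinutes, ttlMinutes)
    End Function

    Sub Main()
        Dim cacheMinutes As Integer = 60
        Dim ttlMinutes As Integer = 30

        cacheMinutes = NextCacheTime(cacheMinutes, False, ttlMinutes)
        Console.WriteLine(cacheMinutes) ' 120: unchanged feed, wait longer

        cacheMinutes = NextCacheTime(cacheMinutes, True, ttlMinutes)
        Console.WriteLine(cacheMinutes) ' 60: changed feed, check sooner

        cacheMinutes = NextCacheTime(cacheMinutes, True, ttlMinutes)
        Console.WriteLine(cacheMinutes) ' 30

        cacheMinutes = NextCacheTime(cacheMinutes, True, ttlMinutes)
        Console.WriteLine(cacheMinutes) ' 30: held at the feed's ttl
    End Sub

End Module
```

Doubling and halving is just the simplest rule that settles near a feed's real update rhythm; a running average of the gaps between updates would be another reasonable way to get the same effect.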

Seamus will also index the current page/tab it is sitting on. Now you may be asking why you would index a page that displays items that have already been indexed. Well, Seamus can be set up to not display the top X items and/or the RSS feed. Here is an example:

With the above Seamus settings, and with the module set to display on all pages without a container (or with an “invisible” container), a page is indexed whenever a user lands on it. You can index your entire site by letting the users "crawl" the website. Also, when a page is updated, the index will be updated. Here is the module settings example:

So, not only does Seamus index all portals in the DNN installation, by doing a dump of all modules that support the IPortable interface and by indexing individual pages based on user interaction, it will also aggregate and index data from other sites. This gives you the ability to create a full-blown search engine for your niche. For example, let's say you have a website about racing. You could have your entire DNN site indexed, along with aggregation of more racing data from the following sites:

http://www.sportsline.com/partners/feeds/rss/auto_news

http://rss.news.yahoo.com/imgrss/events/sp/042103autoformula

http://rss.cnn.com/rss/si_motorsports.rss

Not only are you able to display a list of the last items indexed in order to keep a page from becoming stagnant, you can also provide an RSS feed for your users, giving them a reason to return to your site. I will save a Seamus and SEO discussion for another time, but here is an example site for a legal search engine.

Speaking of time, I am once again out of it. Part III will be a discussion of the second module, the search form module. Stay tuned...

 

Saturday, November 25, 2006 5:04:44 AM (US Eastern Standard Time, UTC-05:00)
Copyright © 2008 Venexus, Inc. All rights reserved.