Venexus DotNetNuke Blog - And then there was Search...Part I
DotNetNuke Articles, Code Snippets, Errors, and News
 
 Wednesday, November 15, 2006

I know a lot of people have been waiting on this, and it has literally been over 5 years in the making, but it is now time to tell the story of how the Venexus Search Engine came to be…

 

Bots, Crawlers, and Spiders, Oh My!

Once upon a time, long, long ago, well over 5 years ago anyway, but that’s like ancient history in terms of the web, I wrote a little script to rip down free fonts off of a font directory website, who shall remain nameless since they are still around today. FontGrabber.vbs crawled their entire website saving zip files of free font packages. If I remember correctly, it pulled down almost 5000 font packages in a few hours.  What a time saver! And my crawler addiction began to set in…

 

MediaGrabber

The next crawler I wrote extracted data from an online database of live music recordings. I dumped about 10 to 12 thousand records into a custom media database. My crawling habit had now increased to an hour or 2 a week, spent perfecting the use of HTTP GETs using XMLHTTP and making modifications to scrape other data from the site based on URL parameters.

 

Many variations of MediaGrabber were developed over the years for aggregating data. Some of the variations include:

  • PhotoGrabber - For consuming one of the stock photography buffet sites. An interesting note: the one we crawled, which will also remain nameless, started limiting the number of photo requests per day the following month. I wonder if that had anything to do with what we were doing...hehe.
  • FDAUpdater - For pulling down pharmaceutical data from the FDA to be used on a pharmacy website. Enough said about that one.
  • CategoryDump - For pulling category names from Yahoo and DMOZ.
  • And others...

 

Madhatter

Madhatter was my first bot. It was a VBScript that sat in a Direct Connect P2P Server application. Madhatter started as a trigger bot. A user would type a message into the chat, and if it contained keywords or phrases that matched a list of keywords and responses, the bot would automatically reply with a random response from the list associated with that keyword. Over time, I added around 1000 different responses to about 400 keywords. Madhatter then received search capabilities. You could type +search <band> or +search <date> and it would return a top 100 list of matching media records from a database of about 20000 records, with a link pointing to the website with the information. I then gave the Operators the ability to have Madhatter speak on their behalf. So in addition to Madhatter automatically responding, an operator could make new responses to user messages via Madhatter. This worked so well, and I guess to some degree could be considered my first AI application, that many DC newbies really thought it was a live person responding to their messages, even when Madhatter was running solo. I even set up the bot so that if a user tried to send Madhatter a private message, it would display in the Operators chat. This led to untold hours of entertainment watching people talk to a rude, trash talking bot that would kick them off the hub if they responded in a derogatory manner. Just thinking about it again makes me want to write a DNN Bot, maybe not one as feisty as Madhatter. Or maybe “bot” interactive search, anyone?
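For readers who have never seen a trigger bot, here is a minimal sketch of the keyword-to-random-response idea in Python. It is not the original VBScript; the TRIGGERS table and the reply function are made up for illustration, and the real bot's matching rules may well have been different.

```python
import random

# Hypothetical trigger table: keyword -> list of canned responses.
TRIGGERS = {
    "hello": ["Hey there.", "Oh, it's you again."],
    "tape": ["Try +search <band> to look for recordings."],
}

def reply(message, triggers=TRIGGERS, rng=random):
    """Return a random response for the first keyword found in message.

    Matching is a simple case-insensitive substring test (an assumption).
    Returns None when no keyword matches, so the bot stays quiet.
    """
    text = message.lower()
    for keyword, responses in triggers.items():
        if keyword in text:
            return rng.choice(responses)
    return None

# Example: a message with no known keyword gets no reply.
silent = reply("nothing to see")
```

Keeping the responses in a plain lookup table is what makes it cheap to grow the bot to hundreds of keywords without touching the matching code.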

 

 

Tiny IntRAnet Crawler

I started working for Semiconductor Research Corporation in August 2001 as their Web Administrator/Developer. At that time they only had a website and a forums website. The forums website was using a product called SiteScope, which was written in TCL, but we will not even go there for fear of recurring nightmares. The SRC main site was not built using a Content Management System, rather a Staging to Dev push of content. I think it was sometime in early 2002 that I began writing my first true crawler that would consume all items in a domain.

 

The need was simple…with the amount of content we had on the site, there were bound to be broken links, missing images, orphaned files, and God forbid, 500 server errors. We needed something that would crawl the site and search for any issues, compare the file system, and generate a report for the Content Management Team. I was still using the XMLHTTP component for grabbing the data until I found ASPTear. ASPTear proved to be faster and was the HTTP component of choice until I found NSoftware. NSoft’s HTTP component was far superior to any of the others for speed, and it had many more methods/objects that could be utilized.
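The reporting step can be sketched in a few lines of Python. This is only an illustration of the idea: build_report and its input shapes are hypothetical, and it assumes the crawl has already produced a map of site-relative paths to HTTP status codes plus a listing of the files on disk.

```python
def build_report(crawl_results, files_on_disk):
    """Summarize site problems for the content team.

    crawl_results: dict mapping site-relative path -> HTTP status code.
    files_on_disk: iterable of site-relative paths found on the file system.
    """
    broken = sorted(p for p, s in crawl_results.items() if s == 404)
    server_errors = sorted(p for p, s in crawl_results.items() if s >= 500)
    # A file the crawler never reached is not linked from anywhere: an orphan.
    orphaned = sorted(set(files_on_disk) - set(crawl_results))
    return {"broken": broken, "server_errors": server_errors, "orphaned": orphaned}

results = {"/index.html": 200, "/old.html": 404, "/app.asp": 500}
files = ["/index.html", "/app.asp", "/unused.gif"]
report = build_report(results, files)
```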

 

SRC had a pretty big main site and we began developing 2 other websites to fall under the SRC umbrella. This led to TIC 2.0, which crawled all 3 domains, and would (and probably still does) generate a report of any issues. With TIC now crawling more than one website and doing it dynamically (it could jump from one domain to another with the FIFO [First In First Out] URL queue), the need came to check the first link offsite. Why? In case the link had moved (301 or 302), or was generating a 404. We have no control over what some site may do to their content, but we sure wanted to know if our users were going to get an error if it was broken. TIC would find those problem links and let the CM Team know they needed to remove the link, or change the URL to the new redirect. Now comes TIC 3.0...
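Before moving on, here is a rough Python sketch of a FIFO-driven crawl with the first-link-offsite rule. It is not TIC's code: the crawl function, the caller-supplied fetch function, and the FAKE_WEB data are all assumptions for illustration, and a tiny in-memory "web" stands in for real HTTP requests.

```python
from collections import deque
from urllib.parse import urlparse

def crawl(seeds, home_domains, fetch):
    """Breadth-first crawl using a FIFO queue.

    fetch(url) is a caller-supplied function returning (status, links).
    Pages on home_domains are fetched and their links are queued.
    Offsite URLs are fetched once to record their status, but their
    links are never followed (the first-link-offsite rule).
    """
    queue = deque(seeds)
    seen = set(seeds)
    report = {}
    while queue:
        url = queue.popleft()              # first in, first out
        status, links = fetch(url)
        report[url] = status
        if urlparse(url).hostname not in home_domains:
            continue                       # offsite: status check only
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return report

# Hypothetical stand-in for the network: url -> (status, outgoing links).
FAKE_WEB = {
    "http://a.example/": (200, ["http://a.example/x", "http://b.example/"]),
    "http://a.example/x": (200, ["http://other.example/moved"]),
    "http://b.example/": (200, []),
    "http://other.example/moved": (301, ["http://other.example/deeper"]),
}

def fake_fetch(url):
    return FAKE_WEB.get(url, (404, []))

report = crawl(["http://a.example/"], {"a.example", "b.example"}, fake_fetch)
problems = {u: s for u, s in report.items() if s != 200}
# problems == {"http://other.example/moved": 301}
```

The offsite page is checked, so the redirect shows up in the report, but the crawl never wanders on to its links.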

 

Tiny IntERnet Crawler

One night I was goofing around with TIC and decided to turn off the function that performs the domain or first link offsite check and just let it run…and run. And it did, all night long. When I got up the next morning, it had crawled almost 30,000 pages and had built a queue of over 100,000. Now I was hooked. How could I get more data and faster? Since TIC was a script and utilized a central database for the URL queue, instead of an in-memory stack, I was able to run multiple instances of the crawler. 10 instances of TIC 3.0 crawling brought my little home router to its knees. In fact, it choked and rolled over tits up. In three hours, over 110,000 pages were crawled, over 500,000 URLs were queued, and over a gig of data was sucked down. Whoa…this was getting fun.
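To show why a database-backed queue lets several crawler instances share the work, here is a small Python sketch using SQLite. The table layout, the state values, and the function names are all assumptions made for this example; TIC's actual schema is not described in this post.

```python
import sqlite3

def open_queue(path=":memory:"):
    """Open (and create if needed) a hypothetical shared URL queue."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS url_queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT UNIQUE NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'pending')"
    )
    db.commit()
    return db

def enqueue(db, url):
    """Add a URL once; duplicates are ignored by the UNIQUE constraint."""
    db.execute("INSERT OR IGNORE INTO url_queue (url) VALUES (?)", (url,))
    db.commit()

def claim_next(db, worker):
    """Claim the oldest pending URL for this worker, or return None."""
    while True:
        row = db.execute(
            "SELECT id, url FROM url_queue"
            " WHERE state = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = db.execute(
            "UPDATE url_queue SET state = ?"
            " WHERE id = ? AND state = 'pending'",
            ("claimed:" + worker, row[0]),
        )
        db.commit()
        if cur.rowcount == 1:      # nobody else claimed it in between
            return row[1]

db = open_queue()
for u in ["http://a.example/", "http://b.example/", "http://a.example/"]:
    enqueue(db, u)
first = claim_next(db, "tic-1")    # "http://a.example/"
second = claim_next(db, "tic-2")   # "http://b.example/"
```

Because the claim is a conditional UPDATE, two instances that read the same row cannot both win it; the loser simply loops and takes the next pending URL.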

 

Over the next year or so I really was tweaking TIC quite a bit. I’d let it run for weeks at a time. I quickly realized I was going to run into a big problem…disk space. The database was getting bloated and slowing down dramatically after it had indexed over 1 million pages and had over 5 million more queued. While those numbers are a drop in the bucket when compared to the 800 pound gorillas of search, it is still a lot of data for such a small operation. And TIC would crawl anything, all file types. So I started cutting back what TIC looked for…all the way down to just XML. TIC, as of the last version in use, now looks just for XML files anywhere on the Internet. Of course I added tweaks to check domain importance or linking page importance based on keywords, and altered the queueing process so that TIC would not get stuck on a crappy domain. But that is a discussion for another time.

 

Tiny XML Spider

So with TIC crawling the web looking for XML files, TXS was developed to crawl and index the XML files TIC found. TXS runs continuously, iterating through all “approved” RSS feeds (about 2,500 of over 100,000). For each feed it parses through the articles and stores anything new to the database. If the feed has been updated, TXS will return to it sooner. Feeds that have not been updated are revisited after a longer wait. I call this “smart caching”, which will be discussed in the features of Seamus later on. TXS has aggregated over 1.7 million articles from only 2,500 news feeds. Not bad considering how much other data we have to collect from feeds that have not been approved. We have been stuffing the aggregated data into a combination of DNN websites for SEO reasons.
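The revisit logic can be illustrated with a short Python sketch. The halving/doubling rule and the floor and ceiling values below are assumptions picked for the example, not the actual numbers TXS uses.

```python
MIN_INTERVAL = 15 * 60        # assumed floor: 15 minutes, in seconds
MAX_INTERVAL = 24 * 60 * 60   # assumed ceiling: 1 day, in seconds

def next_interval(current, had_new_items):
    """Return the wait (in seconds) before the next visit to a feed.

    A feed that had new articles is revisited sooner (wait is halved);
    a feed with nothing new is revisited later (wait is doubled).
    The result is clamped to [MIN_INTERVAL, MAX_INTERVAL].
    """
    proposed = current // 2 if had_new_items else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))

busy = next_interval(3600, had_new_items=True)     # 1800
quiet = next_interval(3600, had_new_items=False)   # 7200
```

The clamp keeps a very busy feed from being hammered and keeps a dormant feed from dropping off the schedule entirely.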

 

DNNFind

DNNFind = DotNetNuke Fulltext INDexing. At some point about 2 years ago, and with TXS bringing in the data, we decided to build a DNN module that would perform a SQL Server fulltext index query against the aggregated data and return the results. While this is not a bot, crawler, or spider, it is a fundamental step of searching the data, which we will get into when discussing the search module of VSE.

 

DNN Spider

I started developing a standalone VB.Net application for crawling DotNetNuke websites. This was my first multi-threaded application. While similar to TIC, this application would allow 1 to many threads to be used to handle the crawling. What we found is that we can use the application for stress testing DotNetNuke websites by throwing a few hundred or thousand requests at it. And we can use multiple applications running on different servers to really pound away at a box. However, this got me thinking about distributing the load of crawling against the users of the website, which is why we are using AJAX to request more data from Seamus. More on that later on as well.

 

Okay, so you made it this far and you are probably asking why I have not even started to describe what the Venexus Search Engine does. Well, I think it is important to understand the background of the application and how it came to be. It’s not like we just came up with some flimsy half-brain ideas about how a search engine should be done, but rather years of trial and error. And, I want everyone to realize that our product is not going to disappear, but get stronger as we add more functionality from all of the code we have written over the years. With that said, here are the details...

 

Sorry, I am out of time and you will have to wait for Part II of this post.

 

In the meantime, if you want to see Venexus Search Engine in action, go to search.venexus.com. To read more about VSE, go here.

REQUIREMENTS FOR VENEXUS SEARCH ENGINE

  • DotNetNuke 4.3.5
  • SQL Server supporting Full-Text Indexing
  • .Net full trust for EntitySpaces and Reflection usage

If you would like to test our release candidate, please reply in a comment to this post and I will send you the PAs.

 

Wednesday, November 15, 2006 11:12:18 AM (US Eastern Standard Time, UTC-05:00)
Copyright © 2010 Venexus, Inc. All rights reserved.