DNN FIND Spider version 0.2 was built this morning after last night's crawl test completed successfully.

DNN FIND Spider (DNNFS) is currently a standalone VB.Net application that spiders the web. We are using the spider to crawl DotNetNuke sites for inclusion in the DNN FIND index. Currently the DNN FIND index is made up of data collected from RSS feeds and by DNNFS. If you have a feed or site that you would like to have aggregated or crawled, please let us know: support at DNN FIND dot com.
Current Features:
- Written in VB.Net
- Stores data in MS SQL Server
- Multi-threaded spider
- Obeys robots.txt
- Obeys meta robots tags
- Stores HTML and parsed (HTML removed) data
- Site-specific crawls
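To make the robots.txt, meta robots, and "parsed (HTML removed)" items above concrete, here is a minimal sketch. It is written in Python for brevity and is not the DNNFS VB.Net code; the names `allowed_by_robots`, `PageParser`, and `parse_page` are made up for illustration, and a real crawler would fetch robots.txt over HTTP rather than take it as a string.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class PageParser(HTMLParser):
    """Collects meta robots directives and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.text_parts = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "robots":
                content = attr.get("content") or ""
                self.directives.update(
                    part.strip().lower()
                    for part in content.split(",")
                    if part.strip()
                )

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html: str):
    """Return (meta robots directives, text with the HTML removed)."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.directives, " ".join(parser.text_parts)
```

With this in place, a spider would skip a URL when `allowed_by_robots` returns False, skip indexing a page whose directives contain `noindex`, and skip following its links when they contain `nofollow`.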
Direction:
So many crawlers, so little time! The goal is to merge all of the other VBScript and Visual Basic (pre-.NET) crawlers/spiders we have built into a single mega .NET spider. Porting the features and functionality of our other crawlers, like TIC and TXS, is a high priority. The first objective, however, is to make DNNFS a "global" crawler rather than just a site-specific one.
Currently DNNFS crawls a single domain at a time, which is great for adding TIC features like HTTP header status reports for non-200 responses, giving a detailed view of problems on the website. In the hands of a DNN site operator, this could be used to spot issues on the site that would otherwise require digging through web logs to find.
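As a rough illustration of what a non-200 status report could look like, here is a small Python sketch (not TIC or DNNFS code). It assumes the crawl has already produced `(url, status_code)` pairs; the function names `non_200_report` and `format_report` are hypothetical.

```python
from collections import defaultdict
from http import HTTPStatus


def non_200_report(results):
    """Group crawled URLs by status code, leaving out 200 responses."""
    report = defaultdict(list)
    for url, status in results:
        if status != 200:
            report[status].append(url)
    return dict(sorted(report.items()))


def format_report(report):
    """Render the grouped report as plain text, one status code per section."""
    lines = []
    for status, urls in report.items():
        try:
            phrase = HTTPStatus(status).phrase
        except ValueError:
            phrase = "Unknown"  # non-standard status code
        lines.append(f"{status} {phrase}: {len(urls)} URL(s)")
        lines.extend(f"  {url}" for url in urls)
    return "\n".join(lines)
```

A site operator would read the output top to bottom: each status code is listed once, followed by the URLs that returned it.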
The work to be done next, however, involves porting code to use a central database queue. Doing so will allow multiple spiders to be set up on multiple servers, each talking back to the central queue for the next URL to process. The queue can also be ordered by URL importance instead of relying on First In, First Out queueing. With TIC, a global crawler, we have domain AND URL importance affecting the queue. While a crawl is happening, an algorithm computes the "importance" of a URL based upon the importance of the domain and the keywords found within the document. The more documents found on a site with related keywords, the higher the domain importance, which results in a higher URL importance during crawl time. However, this comes at a performance cost and may best be added as a backend process that updates the queue continuously.
Most DNN operators would not need such features, nor do they have sites with hundreds of thousands of pages to crawl, like some of the sites we host. However, some of them might have large sites and/or be interested in contributing their crawl data, which raises the question of developing a thin client for DNN operators.
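The importance-ordered queue described above can be sketched in a few lines of Python. This is an in-memory stand-in for the central database queue, not the TIC algorithm: the scoring rule (domain importance plus keyword hits) and the names `ImportanceQueue`, `record_document`, `push`, and `pop` are assumptions made for illustration.

```python
import heapq
import itertools
from collections import defaultdict
from urllib.parse import urlsplit


class ImportanceQueue:
    """URL queue ordered by importance instead of First In, First Out."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]
        self.domain_importance = defaultdict(int)
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO among equal scores
        self._seen = set()

    def keyword_hits(self, text):
        lowered = text.lower()
        return sum(lowered.count(k) for k in self.keywords)

    def record_document(self, url, text):
        """Raise a domain's importance when a crawled document matches keywords."""
        hits = self.keyword_hits(text)
        if hits:
            self.domain_importance[urlsplit(url).netloc.lower()] += 1
        return hits

    def push(self, url, keyword_hits=0):
        """Queue a URL once, scored as domain importance plus keyword hits."""
        if url in self._seen:
            return
        self._seen.add(url)
        domain = urlsplit(url).netloc.lower()
        importance = self.domain_importance[domain] + keyword_hits
        # heapq is a min-heap, so store the negated score to pop the highest first.
        heapq.heappush(self._heap, (-importance, next(self._counter), url))

    def pop(self):
        neg_importance, _, url = heapq.heappop(self._heap)
        return url, -neg_importance

    def __len__(self):
        return len(self._heap)
```

Note that a URL's score is fixed when it is pushed. Re-scoring URLs already in the queue as domain importance grows is the expensive part, which is why it is a candidate for a separate backend process.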
Definite Features to be included in future releases:
- XML document support for crawl updates (similar to how Google sitemaps work)
- RSS support for site update notifications
- Non-200 status header response reporting
- Global crawls
- Central queue processing
- Multiple crawler support
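For the XML crawl-update item, one possible shape is a sitemap-style file that the spider reads to decide which pages to revisit. The Python sketch below assumes the file follows the sitemaps.org 0.9 schema with `YYYY-MM-DD` (or full W3C datetime) `lastmod` values; the format DNNFS will actually accept has not been decided, and `urls_changed_since` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_changed_since(sitemap_xml: str, last_crawl: date):
    """Return the URLs whose lastmod is after last_crawl, in document order."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        # A missing lastmod cannot rule out a change, so the URL is recrawled.
        # Only the leading YYYY-MM-DD part of lastmod is compared.
        if not lastmod or date.fromisoformat(lastmod[:10]) > last_crawl:
            changed.append(loc)
    return changed
```

The returned URLs would then be pushed onto the crawl queue, so unchanged pages are not fetched again.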
Ideas we are kicking around:
- A distributed thin client for DNN operators
- DNNFS as a website stress tester? Multiplying the number of threads would be an easy way to turn it into one.
- Web services for remote queries
If anyone has any features, functionalities, or ideas that can be added to DNNFS that would help the DNN community, please let us know: support at DNN FIND dot com.