We released the Pro version of our DNN search engine module today.
Here is the breakdown of the feature comparison:
Venexus Search Engine Version Matrix
| Features | Standard Version | Pro Version |
|---|---|---|
| Seamus Features | | |
| Maximum # of Pages | 500 | Unlimited |
| Install on commercial site | No | Yes |
| Scheduled Index Updates | Yes | Yes |
| Announcements Module Support | Yes | Yes |
| Contacts Module Support | Yes | Yes |
| Events Module Support | Yes | Yes |
| FAQ Module Support | Yes | Yes |
| Links Module Support | Yes | Yes |
| Text/HTML Module Support | Yes | Yes |
| Allows users to add feeds | No | Yes |
| Custom User Agent | No | Yes |
| Obeys Robots.txt | Yes | Yes |
| TTL Support | Yes | Yes |
| Feed Aggregation Using AJAX | Yes | Yes |
| Display Top X Latest Items | Yes | Yes |
| XSLT Support | Yes | Yes |
| Latest Items RSS Feed Generation | Yes | Yes |
| Portal Specific Feed | Yes | Yes |
| Enclosure/Podcast Support | No | Yes |
| Pinging Service | No | Yes |
| Search Features | | |
| Search Skin Object | Yes | Yes |
| Use Image or Text for Search button | Yes | Yes |
| + and - (AND and OR) Support | Yes | Yes |
| Quoted Search Support | Yes | Yes |
| Keyword Highlighting | Yes | Yes |
| Obeys DNN Security | Yes | Yes |
| Support | | |
| Issue Tracker | Yes | Yes |
| Email | No | Yes |
| Phone | No | 1 Call |
| Price | Free | $199 Per Year |
I will be discussing the features of the Pro version in a later post. Stay tuned...
We have released the new version of the Venexus Search Engine. VSE Standard Version 1.1.0 has several bug fixes and shows some of the new features of the Pro version.
New standard features and bug fixes:
- VenexusSeamus - Changed TransformXSL to not create a temporary XML file
- VenexusSeamus - Modified Response.Charset
- VenexusSeamus - New Delete Tabs routine for removing deleted and expired tabs
- VenexusSeamus - Ability to reload default XSLT file
- VenexusSeamus - Shows total number of aggregated items
- VenexusSeamus - Gridview pagination
- VenexusSeamus - Link from Grid to show aggregation errors
- VenexusSeamus - Guid attribute added
- VenexusSeamus - application/rss+xml support
- VenexusSeamus - Automatic creation of fulltext index during installation (works for SQL Server Express too!)
- VenexusSearch - Support for DNN 4.4.1 and "search" URL parameter
- VenexusSearch - Non-authenticated postback issue resolved
- VenexusSearch - Limits URL length for display
- VenexusSearch - Quoted query support
If you have any issues with installation, configuration, or bugs, please post them in our issue tracker.
Here is a video tutorial on setting up SQL Server 2005 Express and Full-Text Indexing. It breaks down the steps for installation of SQL Server Express with Advanced Services. This is a great video that shows a lot more than just setting up full-text indexing. It also shows some basic queries.
The key points during installation: when you get to the Registration Information screen, uncheck "Hide advanced configuration options" before clicking Next. On the next screen, expand Database Services and, for Full-Text Search, select the "entire feature will be installed on local hard drive" option. After a few more steps, you must uncheck User Instances Enabled. For those who already have Full-Text Search installed, but did not uncheck that option, you can use the following SQL:
sp_configure 'user instances enabled', '0'
RECONFIGURE
If you are using SQL Server Management Studio Express, you can go into the database properties and, under Files, make sure that enable full-text indexing is checked. Or, run the following SQL:
sp_fulltext_database 'enable'
Now for creating the catalog and index. The example below is for our search engine module:
Create fulltext catalog VenexusSearchCatalog
Create Unique Index PKVenexusSearchEngine On Venexus_BrainDump(IndexID)
Create fulltext index On Venexus_BrainDump (IndexURL, IndexTitle, IndexWashedContent) Key Index PKVenexusSearchEngine On VenexusSearchCatalog With Change_Tracking Auto
The first beta testers of the Venexus Search Engine were the guys from True Lawyers. They created a new portal in their DNN installation for Search.TrueLawyers.com, a legal search engine. Their instance of VSE has aggregated over 216,000 legal articles, news items, and related site pages as of this morning. You can test this site and see that the speed of VSE is still great considering the amount of data it has already indexed. When using the site, each time a page is loaded that has Seamus on it (and Seamus CAN be hidden on the page), 5 feeds are aggregated. Any new items in the feeds are added to the index immediately. While the user sits on the page, AJAX is used to pull more feeds. Plus, since the site is new and does not have much traffic...yet, they use an RSS reader to call the Seamus RSS feed, which grabs more data every 10 to 15 minutes. As you can imagine, their index is growing FAST! You can see the latest items Seamus has aggregated by visiting the True Lawyers Legal News Room.
So, not only does VSE work as a site search engine and multiple portal search engine, it also works as a full blown search engine, aggregating items from your DNN installation as well as from other sites that provide RSS feeds. One of the features we are working on for the 1.1 Pro version is the ability to index any website, regardless of whether it has an RSS feed. You can now build powerful niche websites that provide your users with lots of relevant information. Plus, with the RSS feed Seamus generates, you can set it to display items for only your website, allowing you to submit the link to many feed directories, which helps with search engine optimization. The 1.1 Pro version will ping many blog directories, treating your entire website like a blog and greatly increasing traffic to it. And we all know that the other search engines are just eating up blog content, increasing the page rank of those sites over many traditional websites without feeds. Ready to try it out? You can download the release candidate here.
Stay tuned for more...
UPDATE 2/7/2007:
I just checked the total items indexed for this site again and it is now showing over 246,000 items. So in 2 weeks, an extra 30k+ items were indexed.
I know many of you have been patiently waiting for the release of the Venexus Search Engine. We have had several beta testers try out previous release candidates, and have several new tweaks in this release.
Seamus Additions:
- Web.configless (No changes to web.config needed. Beta testers should remove EntitySpaces web.config entries when installing this version)
- Object Qualifier support (Thank you Barry White for testing this)
- Index current tab (Seamus will index the tab it is on. You can make Seamus invisible on the page by showing 0 items in the feed and unchecking Show Feed. Add it to all pages on the site and Seamus will index each page and update the index when the page is updated)
- Edit Feed Display (Only show feeds selected in the Edit Feeds section. By default, all are shown. Selecting feeds filters the news display to only items indexed from the selected feeds)
Search Additions:
- Web.configless
- Object Qualifier support
- "query" URL parameter (You can now use your existing default DNN search results page. Simply drop the module on the search results page and remove the default search results module. Utilize the DNN Search textbox in your skin with the power of Full-Text Indexing).
- Form post fix (A fix was added that allows you to simply hit the enter key after adding your query, rather than forcing you to click on the button)
- Allow user selected web or site search (allow your users to select whether their search is against the current portal or for all search results in the database)
- URL Trim (Used to trim the URL display in the search results. Long URLs would stretch out the skin)
- Search Query (Saves user queries and the number of "hits" for that query. This will be used in the pro version for "Top Searches" and "Latest Searches".)
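To picture what the URL Trim option does, here is a minimal Python sketch. It is not the module's actual code: the function name `trim_url`, the 60-character default, and the trailing "..." are all assumptions made for illustration.

```python
def trim_url(url: str, max_length: int = 60) -> str:
    """Shorten a URL for display only; the real link target is left untouched.

    max_length and the "..." suffix are illustrative assumptions.
    """
    if max_length < 4:
        raise ValueError("max_length must leave room for the ellipsis")
    if len(url) <= max_length:
        return url
    # Keep the start of the URL (domain first) and mark the cut with "...".
    return url[: max_length - 3] + "..."
```

Only the text shown in the results list would be shortened this way, so a long URL no longer stretches the skin while the underlying link keeps pointing at the full address.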
Here is some information about our DotNetNuke search engine module. You can test it on our site here. We also have the latest version loaded on our DNN search site for finding DotNetNuke related pages and sites.
As for the official release, we are waiting on FlatBurger to fix an issue with their code protection that causes the module to generate an error after activation. We have been told that this may be fixed by Friday...we will see. In the meantime, please send us your thoughts on this release candidate. If you find any bugs, please post them to our issue tracker. If you have any suggestions for new features, please post them in the issue tracker or in the support forums.
REQUIREMENTS FOR VENEXUS SEARCH ENGINE
- DotNetNuke 4.3.5 or Higher (Yes it works with the DNN 4.4 release)
- SQL Server supporting Full-Text Indexing
- .Net full trust for EntitySpaces and Reflection usage
Now for the files....
Before installing this, you MUST read the instructions. You CANNOT just install both modules and expect it to work. You MUST configure fulltext indexing manually to get this to work. You will find instructions on performing this action in the Search Instructions and Configuration.
You can download both modules here. The file is also attached as an enclosure.
Please post your links here in a comment to show everyone how you are using the Venexus Search Engine.
UPDATED: Link to module downloads has been updated.
In case you were wondering from my last post, here is how to get a list of all modules by Portal:
SELECT DISTINCT ModuleDefinitions.ModuleDefID, ModuleDefinitions.FriendlyName, Modules.PortalID FROM ModuleDefinitions INNER JOIN Modules ON ModuleDefinitions.ModuleDefID = Modules.ModuleDefID ORDER BY Modules.PortalID
Here is how to limit the list to a specific portal in the installation:
SELECT DISTINCT ModuleDefinitions.ModuleDefID, ModuleDefinitions.FriendlyName, Modules.PortalID FROM ModuleDefinitions INNER JOIN Modules ON ModuleDefinitions.ModuleDefID = Modules.ModuleDefID WHERE (Modules.PortalID = 0)
We needed to check which modules were NOT being used on a DNN site. Why? It's a multi-portal DNN 3.1 website with quite a few 3rd party modules on it and we wanted to know which modules were not in use so we could remove them from the DNN installation before performing an upgrade. No need to add extra complications or search for module updates for modules that were not being used. And if they are not being used, why leave them on there adding to the bloat? Anyway, here is the SQL:
SELECT ModuleDefinitions.ModuleDefID, ModuleDefinitions.FriendlyName, ModuleDefinitions.DesktopModuleID FROM ModuleDefinitions LEFT OUTER JOIN Modules ON ModuleDefinitions.ModuleDefID = Modules.ModuleDefID WHERE (Modules.ModuleDefID IS NULL)
I never got around to making an announcement that DNN 4.3.6 was released a couple of weeks ago....now DNN 4.3.7 is out.
I checked the security announcements and only saw 2 issues that were already fixed with DNN 4.3.6:
I am not sure if this is the 4.4 release Shaun Walker mentioned in his blog yesterday, or this is in fact the 4.3.7 release. The change log in the bug tracker does not appear to have any related info on 4.3.7. However, the roadmap for DNN 4.4 release shows many items checked in.
(4.4.0) Performance Release

| Component | Type | Issue | Summary | Status |
|---|---|---|---|---|
| Admin / Host Functions | Bug | DNN-4492 | ModuleTitle in multi definition modules | Checked-In |
| Admin / Host Functions | Bug | DNN-3868 | Page Head tags are not properly processed | Checked-In |
| Admin / Host Functions | Bug | DNN-4011 | Action Menu with Module Specific Permissions is not displayed | Checked-In |
| Admin / Host Functions | Bug | DNN-4476 | Cannot use icon from module directory in action buttons | Checked-In |
| Admin / Host Functions | Enhancement | DNN-4503 | Improve Delete Portal Functionality | Checked-In |
| Admin / Host Functions | New Feature | DNN-4496 | Add User Quota | Checked-In |
| Admin / Host Functions | New Feature | DNN-4502 | Improve Portal Management | Checked-In |
| Admin / Host Functions | New Feature | DNN-4504 | Add a new Delete Expired Portals action | Checked-In |
| Admin / Host Functions | New Feature | DNN-4495 | Add Page Quotas | Checked-In |
| Localization / ML | Bug | DNN-4273 | Collation issue with Event Log | Checked-In |
| Localization / ML | Bug | DNN-4506 | Pop-up calendar localized date format bug | Checked-In |
| Localization / ML | Bug | DNN-4483 | Localized images break when using the "ShowMissingKeys" app setting. | Checked-In |
| Localization / ML | Bug | DNN-4560 | Popup calendar | Checked-In |
| Localization / ML | New Feature | DNN-4520 | Force a specific language for first visitors | Checked-In |
| Performance | Bug | DNN-4086 | Performance: Reduce Database Calls | Checked-In |
| Performance | Bug | DNN-4088 | Performance: CBO and Reflection | Checked-In |
| Performance | Bug | DNN-4090 | Performance: ClientAPICaps.config caching | Checked-In |
| Performance | Bug | DNN-4092 | Performance: XmlSerializer | Checked-In |
| Performance | Bug | DNN-4087 | Performance: TabCache | Checked-In |
| Performance | Bug | DNN-4091 | Performance: Menu providers | Checked-In |
| Performance | Bug | DNN-4093 | Performance: XPathDocument vs XmlDocument | Checked-In |
| Performance | Enhancement | DNN-537 | Improve Startup performance | Checked-In |
| Performance | Enhancement | DNN-662 | Implement HTTP compression | Checked-In |
Since 4.4 is called a "Performance" release, and the website says 4.3.7 is a stabilization release, I take it that they are indeed different. I would just like to know what was changed in 4.3.7 if it is indeed a stabilization update.
I know it has been over a week since the last post. Sorry to leave you hanging, but sometimes there are just not enough hours in a day. Anyway, without further ado, here is part two… SEAMUS.
At some point earlier this year, DNN Find became a different mission. We decided to build a full blown search engine for DotNetNuke. Not one that would just index a single DNN site, but one that would allow you to index all portals in a DNN installation AND information from external sites. And how would external site indexing best be handled? …via RSS feed aggregation of course.
Seamus is the first of the two modules that make up the Venexus Search Engine. SEAMUS = Search Engine Aggregation Module Utilizing Syndication. On a side note, there is also an obscure Pink Floyd song from the Meddle album, which not many know, about an old hound dog by the same name. Our hound dog “fetches” data and stores it in a table that has MS SQL Server full-text indexing enabled. But before I go into the specifics, I think it is important to know about the framework.
We started with traditional DotNetNuke module development…until EntitySpaces was released. I’m an old ASP/VB developer and personally, it took me a bit to get my head wrapped around how ES worked, but once I figured it out, I was hooked. ES saves the day by automagically generating all the CRUD (create, read, update, delete). While very similar to the logic of a BusinessController and InfoObject, ES uses Collections and Entities. But, where I found ES the most useful is the Dynamic Queries you can write directly into the business logic.
For example, in Seamus we need to check the domain to see if it matches one we are already indexing:
Dim colDomains As New VenexusDomainCollection
colDomains.Query.Select(colDomains.Query.DomainName, colDomains.Query.DomainID)
colDomains.Query.Where(colDomains.Query.DomainName.Equal(GetDomainName(sURL)))
colDomains.Query.Load()
If colDomains.Count > 0 Then
    'a bunch of removed logic goes here..
End If
With the colDomains.Query.Select, we are only returning the data we need rather than all columns. With the colDomains.Query.Where, I eliminated the need to:
- Write a stored proc just to retrieve by DomainName
- Iterate through the entire table, every row of all domains, just to find the one I am looking for.
I won’t even go into the performance gain of not having to loop through those rows of all columns, nor the time (even though it would be simple) to write a stored proc to pass in DomainName and have it return the DomainID.
Here is an example of adding a record to Seamus for a new feed:
Dim entFeed As New VenexusSeamus
entFeed.AddNew()
entFeed.Url = txtURL.Text
entFeed.Title = txtTitle.Text
entFeed.Account = txtAccount.Text
entFeed.Password = txtPassword.Text
entFeed.CacheTime = txtCacheTime.Text
entFeed.FeedTimeOut = txtTimeOut.Text
entFeed.DateAdded = Now()
entFeed.DateUpdated = "1/1/1901"
If chkActive.Checked = True Then
    entFeed.IsActive = True
Else
    entFeed.IsActive = False
End If
entFeed.Save()
Easy enough, eh?
And here is an update of a feed for Seamus:
Dim entFeed As New VenexusSeamus
entFeed.LoadByPrimaryKey(hidRSSID.Value)
entFeed.Url = txtURL.Text
entFeed.Title = txtTitle.Text
entFeed.Account = txtAccount.Text
entFeed.Password = txtPassword.Text
entFeed.CacheTime = txtCacheTime.Text
entFeed.FeedTimeOut = txtTimeOut.Text
entFeed.DateAdded = Now()
entFeed.DateUpdated = "1/1/1901"
If chkActive.Checked = True Then
    entFeed.IsActive = True
Else
    entFeed.IsActive = False
End If
entFeed.Save()
And a delete example:
Dim entFeed As New VenexusSeamus
entFeed.LoadByPrimaryKey(hidRSSID.Value)
entFeed.MarkAsDeleted()
entFeed.Save()
Yeah, it’s that easy. Makes you want to fire up your IDE eh?
Sure, I have used DAL Builder Pro, which was a huge time saver, but EntitySpaces made me never want to develop any other way. Plus, last I checked, DAL Builder Pro was still only for DNN 3 development. The ease of generating the DAL, and the ability to easily REgenerate the DAL if the database schema changes, makes ES the tool of choice for all of our module development. I cannot even begin to count the hours I have previously spent hand coding changes in a DAL due to spec changes. Oh how I wish I had all those hours back!
With the new DNN admin grid templates, it is just ridiculous how much code is generated before having to write the first line. The new template will generate an editable grid of the table(s), with sorting, paging, and search. If you are interested in .Net development (this is not just a DNN tool, it works for all .Net 2.0 development and using C# or VB.Net), you must check it out.
NOTE: Just so you know, we do not have any affiliation or partnership with EntitySpaces, we just think their tool rocks.
So, even though we had much of the initial Seamus development completed, we scrapped it and started development with ES. This will make future modifications and additions so much easier, saving time in the long run.
With that said, here is how Seamus works…
After you install Seamus, you can go into the module settings:

So in this example, the display for Seamus should show the top 10 items last indexed, each with a link to the actual item in the Title and using the “…More” link. A feed icon will also be displayed that provides a link to a RSS feed for the top 10 items.
Here is an example of the display:

Now while the above example does not show any local items (tabs or modules from this site), it does have items indexed from other sites. All of these items were from RSS feeds that were aggregated. As a module editor, you have the ability to manage external feeds (or local feeds if so desired, but we will go into more detail about how Seamus works shortly). But, if there were local items visible, they would only be visible if you have the proper permissions. Seamus checks permissions on any local site at the module and tab level for the display and the RSS feed.
Here is an example of the feeds we are currently indexing on the Venexus Search Engine:

Here is the interface for adding new feeds:

Now we will get into how Seamus works…
First off, on the first load of Seamus, a dump of data from all modules supporting the IPortable interface (currently limited to DNN Core modules) is performed to ensure that there is data in the index. And every X hours (determined in module settings), the index is checked for new, updated, and deleted pages/modules.
Secondly, any feeds that have been added to Seamus are aggregated 5 at a time, ordered by last updated. And, while the user is sitting on the page, 5 more feeds are aggregated via AJAX every 30 seconds. This user-driven aggregation decreases the load on the server, rather than running as a scheduled task like the core DNN Search.
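The "5 at a time, ordered by last updated" selection can be pictured with a short Python sketch. This is only an illustration, not the module's code: the `Feed` record, the `next_batch` name, and the assumption that the feeds updated longest ago go first are mine.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Feed(NamedTuple):
    """Hypothetical stand-in for a feed row; only the fields the sketch needs."""
    url: str
    last_updated: datetime


def next_batch(feeds: Iterable[Feed], batch_size: int = 5) -> List[Feed]:
    """Return the feeds that have gone longest without an update.

    Assumes "ordered by last updated" means oldest first.
    """
    return sorted(feeds, key=lambda feed: feed.last_updated)[:batch_size]
```

Each page load, and each later AJAX call, would ask for another batch, so the work is spread across visits instead of being done in one scheduled run.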
In order to save bandwidth, and to not tick off the owners of the websites you are aggregating data from, Seamus has what I call “smart caching”. Each time a feed is requested, if the information in the feed has not been updated, Seamus will increase the cache time. If the feed has been updated, it will decrease the cache time and request the same feed sooner than it had previously. Over time, and based on how often, on average, a feed is updated, Seamus learns when to check again for updates, all while obeying TTLs.
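Here is a rough Python sketch of the smart caching idea described above. It is an assumption-laden illustration, not what Seamus actually does: the doubling/halving factor, the 5-minute floor, the one-day ceiling, and the function name `next_cache_minutes` are all invented for the example.

```python
from typing import Optional


def next_cache_minutes(
    current: float,
    feed_changed: bool,
    ttl: Optional[float] = None,
    floor: float = 5,
    ceiling: float = 1440,
    factor: float = 2.0,
) -> int:
    """Pick the next cache time (in minutes) for a feed.

    All numeric defaults are illustrative assumptions.
    """
    if feed_changed:
        proposed = current / factor   # feed is active: come back sooner
    else:
        proposed = current * factor   # nothing new: wait longer next time
    # Keep the result within sane bounds.
    proposed = max(floor, min(ceiling, proposed))
    # Never re-request sooner than the feed's own TTL allows.
    if ttl is not None:
        proposed = max(proposed, ttl)
    return int(proposed)
```

For example, a feed cached for 60 minutes that comes back unchanged would be retried after 120 minutes, while one that had new items would be retried after 30, unless the feed's TTL asks for a longer wait.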
Seamus will also index the current page/tab it is sitting on. Now you may be asking why you would index a page that displays items that have already been indexed. Well, Seamus can be set up to not display the top X items and/or the RSS feed. Here is an example:

With the above Seamus settings and the module settings to display on all pages and set to not display the container or using an “invisible” container, when a user lands on any page of the site, the page is indexed. You can index your entire site by letting the users "crawl" the website. Also, when the page is updated, the index will be updated. Here is the module settings example:

So, not only does Seamus index all portals in the DNN installation by doing a dump of all modules that support the IPortable interface and individual page indexing based on user interaction, it will also aggregate and index data from other sites. This gives you the ability to create a full blown search engine for your niche. For example, let's say you have a website about racing. You could have your entire DNN site indexed, along with aggregation of more racing data from the following sites:
http://www.sportsline.com/partners/feeds/rss/auto_news
http://rss.news.yahoo.com/imgrss/events/sp/042103autoformula
http://rss.cnn.com/rss/si_motorsports.rss
Not only are you able to display a list of the last items indexed in order to keep a page from becoming stagnant, you can also provide an RSS feed for your users, giving them a reason to return to your site. I will save a Seamus and SEO discussion for another time, but here is an example site for a legal search engine.
Speaking of time, I am once again out of it. Part III will be a discussion of the second module, the search form module. Stay tuned...
I know a lot of people have been waiting on this and it has literally been over 5 years in the making, but it is now time to tell the story of how the Venexus Search Engine came to be…
Bots, Crawlers, and Spiders, Oh My!
Once upon a time, long, long ago, well over 5 years ago anyway, but that’s like ancient history in terms of the web, I wrote a little script to rip down free fonts off of a font directory website, who shall remain nameless since they are still around today. FontGrabber.vbs crawled their entire website saving zip files of free font packages. If I remember correctly, it pulled down almost 5000 font packages in a few hours. What a time saver! And my crawler addiction began to set in…
MediaGrabber
The next crawler I wrote extracted data from an online database of live music recordings. I dumped about 10 to 12 thousand records into a custom media database. My crawling habit had now increased to an hour or 2 a week, perfecting the use of HTTP GETs using XMLHTTP and making modifications to scrape other data from the site based on URL parameters.
Many variations of MediaGrabber were developed over the years for aggregating data. Some of the variations include:
- PhotoGrabber - For consuming one of the stock photography buffet sites. An interesting note: the one we crawled, which will also remain nameless, started limiting the number of photo requests per day the following month. I wonder if that had anything to do with what we were doing...hehe.
- FDAUpdater - For pulling down pharmaceutical data from the FDA to be used on a pharmacy website. Enough said about that one.
- CategoryDump - For pulling category names from Yahoo and DMOZ.
- And others...
Madhatter
Madhatter was my first bot. It was a VBScript that sat in a Direct Connect P2P Server application. Madhatter started as a trigger bot. A user would type a message into the chat and if it contained keywords or phrases that matched a list of keywords and response(s), the bot would automatically reply with a random response from the list that was associated with that keyword. Over time, I added around 1000 different responses to about 400 keywords. Madhatter then received search capabilities. You could type +search <band> or +search <date> and it would return a top 100 list of matching media records from a database of about 20000 records, with a link pointing to the website with the information. I then gave the Operators the ability to have Madhatter speak on their behalf. So in addition to Madhatter automatically responding, the operator could make new responses to the user messages via Madhatter. This worked so well, and I guess to some degree could be considered my first AI application, that many DC newbies really thought it was a live person responding to their messages, even when Madhatter was running solo. I even set up the bot so that if a user tried to send Madhatter a private message, it would display in the Operators chat. This led to untold hours of entertainment watching people talk to a rude, trash talking bot that would kick them off the hub if they responded in a derogatory manner. Just thinking about it again makes me want to write a DNN Bot, maybe not one as feisty as Madhatter. Or maybe “bot” interactive search anyone?
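For anyone who has never seen a trigger bot, the keyword-to-random-response idea looks roughly like this in Python. The original was a VBScript that I am not reproducing here; the `RESPONSES` table, its contents, and the `reply` function are made up purely for illustration.

```python
import random
from typing import Dict, List, Optional

# Hypothetical keyword table; the real bot had about 400 keywords.
RESPONSES: Dict[str, List[str]] = {
    "hello": ["Hi there.", "Oh, it's you again."],
    "help": ["Try +search <band>."],
}


def reply(message: str, responses: Dict[str, List[str]] = RESPONSES) -> Optional[str]:
    """Return a random canned response for the first keyword found, else None."""
    text = message.lower()
    for keyword, options in responses.items():
        if keyword in text:
            return random.choice(options)
    return None  # no trigger matched, so the bot stays quiet
```

Because the response is picked at random from the list for the matched keyword, the same trigger does not always produce the same line, which is a big part of why a bot like this can pass for a person in a busy chat.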
Tiny IntRAnet Crawler
I started working for Semiconductor Research Corporation in August 2001 as their Web Administrator/Developer. At that time they only had a website and a forums website. The forums website was using a product called SiteScope which was written in TCL, but we will not even go there for fear of recurring nightmares. The SRC main site was not built using a Content Management System, but rather a Staging to Dev push of content. I think it was sometime in early 2002 that I began writing my first true crawler that would consume all items in a domain.
The need was simple…with the amount of content we had on the site, there were bound to be broken links, missing images, orphaned files, and God forbid, 500 server errors. We needed something that would crawl the site and search for any issues, compare the file system, and generate a report for the Content Management Team. I was still using the XMLHTTP component for grabbing the data until I found ASPTear. ASPTear proved to be faster and was the HTTP component of choice until I found NSoftware. NSoft’s HTTP component was far superior to any of the others for speed and had many more methods/objects that could be utilized.
SRC had a pretty big main site and we began developing 2 other websites to fall under the SRC umbrella. This led to TIC 2.0, which crawled all 3 domains, and would (and probably still does) generate a report of any issues. With TIC now crawling more than one website and doing it dynamically (it could jump from one domain to another with the FIFO [First In First Out] URL queue/stack), the need came to check the first link offsite. Why? In case the link moved (301 or 302), or was generating a 404. We have no control over what some site may do to their content, but we sure wanted to know if our users were going to get an error if it was broken. TIC would find those problem links and let the CM Team know they needed to remove the link, or change the URL to the new redirect. Now comes TIC 3.0...
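The FIFO queue with a "first link offsite" check can be sketched in a few lines of Python. This is my own illustration of the idea, not TIC itself: the `crawl` function, the `(url, is_internal)` tuples, and the `fetch_links` callback (a stand-in for the real HTTP component) are all assumptions.

```python
from collections import deque
from typing import Callable, Iterable, List, Set
from urllib.parse import urlparse


def crawl(
    start_urls: Iterable[str],
    own_domains: Set[str],
    fetch_links: Callable[[str], Iterable[str]],
) -> List[str]:
    """Visit URLs in FIFO order; offsite links are checked but never followed.

    fetch_links(url) is a caller-supplied function returning the URLs linked
    from a page. It is a hypothetical placeholder for the HTTP component.
    """
    start = list(start_urls)
    queue = deque((url, True) for url in start)  # (url, is_internal)
    seen = set(start)
    visited = []
    while queue:
        url, internal = queue.popleft()  # FIFO: oldest queued URL first
        visited.append(url)              # a real crawler would record the status here
        if not internal:
            continue                     # first link offsite: check it, go no further
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            queue.append((link, urlparse(link).hostname in own_domains))
    return visited
```

Passing the link fetcher in as a function keeps the queue logic separate from the HTTP component, so the same loop works whether pages come from a live site or from a canned dictionary during testing.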
Tiny IntERnet Crawler
One night I was goofing around with TIC and decided to turn off the function that performs the domain or first link offsite check and just let it run…and run. And it did, all night long. When I got up the next morning, it had crawled almost 30,000 pages and had built a queue of over 100,000. Now I was hooked. How could I get more data and faster? Since TIC was a script and utilized a central database for the URL queue, instead of an in-memory stack, I was able to run multiple instances of the crawler. 10 instances of TIC 3.0 crawling brought my little home router to its knees. In fact, it choked and rolled over tits up. In three hours, over 110,000 pages were crawled, over 500,000 URLs were queued, and over a gig of data had been sucked down. Whoa…this was getting fun.
Over the next year or so I really was tweaking TIC quite a bit. I’d let it run for weeks at a time. I quickly realized I was going to run into a big problem…Disk space. The database was getting bloated and slowing down dramatically after it had indexed over 1 million pages and had over 5 more million queued. While those numbers are a drop in the bucket when compared to the 800 pound gorillas of search, it is still a lot of data for such a small operation. And, TIC would crawl anything, all file types. So I started curbing back what TIC looked for…all the way down to just XML. TIC, as the last version in use, now looks just for XML files anywhere on the Internet. Of course I added tweaks to check domain importance or linking page importance based on keywords and altered the queueing process so that TIC would not get stuck on a crappy domain. But that is a discussion for another time.
Tiny XML Spider
So with TIC crawling the web looking for XML files, TXS was developed to crawl and index the XML files TIC found. TXS runs continuously, iterating through all “approved” RSS feeds (about 2,500 of over 100,000). For each feed it parses through the articles and stores anything new to the database. If the feed has been updated, TXS will return in less time. Feeds that have not been updated will be crawled the next time after a longer duration. I call this “smart caching”, which will be discussed in the features of Seamus later on. TXS has aggregated over 1.7 million articles from only 2500 news feeds. Not bad considering how much other data we have to collect from feeds that have not been approved. We have been stuffing the aggregated data into a combination of DNN websites for SEO reasons.

DNNFind
DNNFind = DotNetNuke Fulltext INDexing. At some point about 2 years ago, and with TXS bringing in the data, we decided to build a DNN module that would perform a SQL Server fulltext index query against the aggregated data and return the results. While this is not a bot, crawler, or spider, it is a fundamental step of searching the data, which we will get into when discussing the search module of VSE.
DNN Spider
I started developing a standalone VB.Net application for crawling DotNetNuke websites. This was my first multi-threaded application. While similar to TIC, this application would allow 1 to many threads to be used to handle the crawling. What we found is that we can use the application for stress testing DotNetNuke websites by throwing a few hundred or thousand requests at it. And, we can use multiple applications running on different servers to really pound away at a box. However, this got me thinking about distributing the load of crawling across the users of the website, which is why we are using AJAX to request more data from Seamus. More on that later on as well.
Okay, so you made it this far and you are probably asking why I have not even started to describe what the Venexus Search Engine does. Well, I think it is important to understand the background of the application and how it came to be. It’s not like we just came up with some flimsy half-brain ideas about how a search engine should be done, but rather years of trial and error. And, I want everyone to realize that our product is not going to disappear, but get stronger as we add more functionality from all of the code we have written over the years. With that said, here are the details...
Sorry, I am out of time and you will have to wait for Part II of this post.
In the meantime, if you want to see Venexus Search Engine in action, go to search.venexus.com. To read more about VSE, go here.
REQUIREMENTS FOR VENEXUS SEARCH ENGINE
- DotNetNuke 4.3.5
- SQL Server supporting Full-Text Indexing
- .Net full trust for EntitySpaces and Reflection usage
If you would like to test our release candidate, please reply in a comment to this post and I will send you the PA's.
I got a new Dell D-820 laptop last week. I spent much of the week installing new software, setting up my dev environment, and moving all of the data from my Dell D-810. I was pretty much ready to rock n' roll with this new setup on Friday and giddy about the performance of the 2 GHz Core 2 Duo and 2 GB of RAM (especially when compiling DNN 4 in VS 2005)....that was until a virus unleashed havoc on my pristine setup. AAAARRRRRRRRR!!!!!!!
I have been a firm believer in Symantec and we have used it for years. I decided to try AVG at the recommendation of our network admin and other users, and at this time I cannot say I recommend it. For whatever reason, AVG did not catch the virus until it had dropped a payload of biblical proportions on my drive. I got the full treatment, including something I had not seen in a while...the Blue Screen of Death. Not knowing at that time that I actually had a virus, I went ahead and rebooted. Before Windows could finish loading, the virus set to installing adware and trojans, more than I have ever seen for any payload. AVG finally decided to come to the rescue. I started a full scan and let AVG clean up. I rebooted in safe mode and ran AVG again. I got everything removed...according to AVG, but my network connections were hosed. I finally gave up and gave it to our network admin to see if he could fix it. On Saturday I got the news that he could not fix the issue with the network connection and that we were going to have to recover from the initial Ghost image we made or from check points. So, sometime in the wee hours of the morning on Sunday, I had my laptop back up and running from a check point that was a few days old. I decided to do another scan. AVG came out clean. Since I am biased toward Symantec, I decided to use TrendMicro Housecall for another scan. It found 2 more trojans that AVG totally missed. That was the last straw. AVG removed, Symantec back on. May the writer of the virus burn in the fiery depths of the blackest....
We had a backup domain controller running on our dev machine. After moving it to a different server, we got the following error on our development DNN websites:
Server Error in '/' Application.
The current identity (NT AUTHORITY\NETWORK SERVICE) does not have write access to 'C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files'.
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.
Exception Details: System.Web.HttpException: The current identity (NT AUTHORITY\NETWORK SERVICE) does not have write access to 'C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files'.
Source Error:
An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.
Stack Trace:
[HttpException (0x80004005): The current identity (NT AUTHORITY\NETWORK SERVICE) does not have write access to 'C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files'.]
   System.Web.HttpRuntime.SetUpCodegenDirectory(CompilationSection compilationSection) +3482363
   System.Web.HttpRuntime.HostingInit(HostingEnvironmentFlags hostingFlags) +226
[HttpException (0x80004005): The current identity (NT AUTHORITY\NETWORK SERVICE) does not have write access to 'C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files'.]
   System.Web.HttpRuntime.FirstRequestInit(HttpContext context) +3434991
   System.Web.HttpRuntime.EnsureFirstRequestInit(HttpContext context) +88
   System.Web.HttpRuntime.ProcessRequestInternal(HttpWorkerRequest wr) +252
I checked all the security permissions and ensured Network Service had full permission for the folder. I even reset permissions to make sure. No dice. The following trick fixed it quickly, although it took me a while to find. In a command prompt, navigate to the following:
C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727
Then run the following command:
aspnet_regiis -ga "NT AUTHORITY\NETWORK SERVICE"
Everything worked fine after running that. Since I am bound to forget this, I am posting it here for me and others, hopefully saving me or someone else some time.
Did you know that SQL Server Express now has Full-Text Indexing? With the upcoming release of our DotNetNuke search module, which requires SQL Server Full-Text Indexing, I thought it would be helpful to make a post here for those who did not know the differences between SQL Server Express versions.
SQL Server Express Edition Comparison
You have several specific products to choose from when you install SQL Server Express Edition. Use the following table to see how features for SQL Server compare across other Express Edition products.
Express Edition Products for SQL Server Compared
| Feature | SQL Server 2005 Express Edition | SQL Server 2005 Express Edition with Advanced Services | SQL Server 2005 Express Edition Toolkit |
|---|---|---|---|
| Database Engine | * | * | |
| Client Components | * | * | * |
| Full Text Search | | * | |
| Reporting Services | | * | |
| Management Studio Express | | * | * |
| Business Intelligence Developer Studio | | | * |
Each SQL Server Express Edition product has a specific use. Read the following sections to learn how each Express Edition product for SQL Server compares to the others.
SQL Server 2005 Express Edition
How does the Express Edition of SQL Server compare to other SQL Server Express Edition products? SQL Server Express Edition is perfect for use as an embedded database for a desktop application that requires a fully functional SQL Server Database Engine. SQL Server Express offers the smallest package size for faster downloads or to conserve space on deployment media.
SQL Server 2005 Express Edition with Advanced Services
How does the Express Edition with Advanced Services for SQL Server compare to the other SQL Server Express Edition products? SQL Server 2005 Express Edition with Advanced Services is perfect for use as a backend to a small, multiuser application that requires more advanced features such as Web reporting or Full-text Search.
SQL Server 2005 Express Edition Toolkit
How does the Express Edition Toolkit for SQL Server compare to other SQL Server Express Edition products? Install this package if you need the management tools and client components, but do not need the Database Engine.
Source: MSDN
Ready to download SQL Server 2005 Express with Advanced Services (has Full-Text Index)? It's free and here.
If you are not familiar with Google Alerts, you should check it out. I have been tracking things with Google Alerts for at least 2 years, maybe longer. While I had been seeing a gradual increase in alerts for "DotNetNuke", starting on October 27th I noticed a lot more of them coming in. What did they all have in common? BLOGS. Also, a few days ago, while looking into the activity of this blog, I noticed a new user agent I had not seen:
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
If you go to that link, you are redirected to a FAQ page and at the bottom is a section called Feedfetcher. Here is an interesting Q and A:
How do I request that Google not retrieve some or all of my site's feeds?
Since Feedfetcher requests are all user-initiated, it does not follow the typical robots.txt guidelines for robots. For detailed information about how to prevent Feedfetcher from requesting all or part of your site, please see our removal instructions.
Very interesting. I was under the assumption that any "bot", and I will define Feedfetcher as a "bot" regardless of whether it is "user-initiated" or not, should obey robots.txt.
With that said, our feed aggregation module for Venexus Search Engine, called Seamus, does obey robots.txt. I am sure this discussion will come up with the release of VSE, so I decided to go ahead and post it now in preparation. And speaking of Venexus Search Engine...we have made the final compile and are finishing testing tonight...but more on that later.
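For anyone wondering what "obeying robots.txt" looks like in practice for a feed fetcher, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It is not the Seamus implementation: the user agent string, the function name, and the sample robots.txt are made up for illustration, and the rules are parsed from a string so the example needs no network access.

```python
from typing import List
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleFeedBot"  # hypothetical user agent, not the real Seamus one

# Sample rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed_feeds(robots_txt: str, feed_urls: List[str], user_agent: str = USER_AGENT) -> List[str]:
    """Return only the feed URLs that the given robots.txt permits."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return [url for url in feed_urls if parser.can_fetch(user_agent, url)]
```

With the sample rules above, a feed under `/private/` is filtered out while a feed under `/rss/` is kept. A real aggregator would download each site's robots.txt first and run this check before requesting any feed from that site.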
I was updating a DNN site today and at the same time was migrating the SQL Server 2000 database to SQL Server 2005. I decided to use the Copy Database Wizard since I had never tried it, and it worked great. However, the logins did not get updated properly. I created the login in the SQL Server 2005 security, but could not access the database via the old login. I tried doing a generic detach and attach, with the same issue. Trying to edit the SQL Server account through SQL Server Management Studio would generate an error of "Login must be specified", yet it would not give me the ability to update it (everything was grayed out). After doing some digging, I found the following stored procedure that did the trick:
EXEC sp_change_users_login 'Auto_Fix', 'USERNAME', NULL, 'PASSWORD'
After running the above, I was able to use the old login to access SQL Server. Now back to the grind...
The news is out: DotNetNuke is going corporate. Perpetual Motion Interactive Systems, Inc., started by Shaun Walker, has been managing the DotNetNuke Project. According to a press release on the DotNetNuke website today, the formation of DotNetNuke Corporation in Seattle, Washington will "serve the growing needs of the project and its ever-expanding community".
This is indeed big news! At this time I am not sure whether to be excited or worried. I understand the past year has been challenging: the huge adoption rate of the project among all types and sizes of business entities (we have seen this first hand) has brought an extra administrative burden to the core team. Still, I had hoped that there was a plan to offset the growth. With any "open source" project, people immediately think "free", which has been the downfall of many projects IMO. In any business model, 0 times 0 is still 0. And let's face it, people just can't afford to work for free. While I am grateful for the core team and their many volunteer hours, and I for one am unable to devote such hours, I do feel these people should be compensated for their hard work. I felt DotNetNuke was on the right path with the Benefactor program (we joined within hours of its announcement), and with the announcement of 3rd party module reviews and a 3rd party marketplace, I felt it was bound to gather the dough required to float the venture. But the idea of DotNetNuke going corporate has changed the possibilities greatly.
From the article:
“DotNetNuke Corporation is not a typical commercial entity,” Walker added. “Rather, it is dedicated to the public benefit goal at the heart of the DotNetNuke project, which is to create opportunities and spread entrepreneurship to the world by providing a superior Open Source web application framework.”
AND...
In addition to spearheading the Open Source project, DotNetNuke Corp. will also focus on developing and delivering services which support the ecosystem, including marketing, sponsorships, and a wide range of partner-related activities. These activities are expected to generate revenue, but the company intends to focus on those opportunities that are consistent with the community values and public goals of the project, Walker said. This includes providing funding for aspects of the project that are difficult or challenging for volunteer teams to solely undertake such as professional marketing, large-scale platform and feature development, product certification and ecommerce initiatives, he added.
With that said, it seems to say that DotNetNuke is going corporate so that they can fund the development of additional activities that need more funding. Now one has to consider the rumors that have been flying about the changes in DNN 4.3 related to membership, and the mysterious source that funded these changes. Also, is there a reason for making the headquarters in Seattle, Washington? To get closer to Microsoft maybe? How will the business model change, or will it? Will DotNetNuke eventually be sold? I think there are still lots of questions in my mind about the reasoning for this move, but we all know the answer...$$$. I am not saying any of the items above are a bad thing. After all, anyone who complains about Microsoft being a monopoly is just jealous of a beautiful business model. At the same time, the words of Google, "Don't be evil", should be taken to heart.
I want to think that this will be the big push DotNetNuke needed to get into the limelight, but only time will tell. In the meantime, we will be keeping busy with the many clients Venexus has accumulated over the last couple of years, all due to a little CMS called DotNetNuke. We can't thank DotNetNuke enough for our own business growth, and hope the new path is one that will continue to benefit the ecosystem and community and allow DotNetNuke Corporation to prosper.
Copyright © 2010 Venexus, Inc. All rights reserved.