When Not To Crawl Content

It is generally accepted that searching for content in MOSS or WSS 3.0 requires the content to first be crawled by the SharePoint Search Service. However, in traditional Enterprise Content Management (ECM) scenarios this often doesn’t make a lot of sense. If you evaluate how most organizations capture and manage content, you will quickly see why. A typical ECM-related business process involves the capture (data stream or scanning), categorization, processing, and archival of content. In many cases significant time, money, and effort are expended in these business processes. So if you spent significant resources to capture and categorize content, why would you rely on a search technology that is better suited to unstructured, full-text queries to retrieve it?

In most (and I say “most” because there are exceptions to this rule) ECM scenarios users are not conducting broadly scoped searches for content. Users’ search criteria are very targeted and specific. For example, an accounting user might want to search for an invoice from a specific vendor based on vendor ID and/or invoice number. A slightly broader search might be executed where the same accounting user is looking for all invoices from a specific vendor for the 2008 calendar year. In either case the search is targeted.

Attempting to crawl this content doesn’t result in a favorable outcome. For starters, crawling in SharePoint doesn’t occur immediately after content is added, and incremental crawls can take a long time to execute depending on how much content was added since the last incremental crawl. In many ECM scenarios users are required to validate content immediately after it’s archived to SharePoint, but requiring the content to first be crawled doesn’t support this process because of the latency before items become available for searching.

The performance challenges with crawling large volumes of content in SharePoint are well documented. If you are not familiar with SharePoint’s limitations, I would recommend reviewing Microsoft’s TechNet article titled Plan for Software Boundaries (Office SharePoint Server), located at http://technet.microsoft.com/en-us/library/cc262787.aspx. If you have ECM scenarios where users are conducting targeted searches in SharePoint, I would suggest evaluating existing search utilities that leverage CAML (Collaborative Application Markup Language) or developing your own. Because a CAML query runs directly against list metadata, items can be found as soon as they are added, without waiting for a crawl. In large-volume scenarios it makes sense to exclude the content from the SharePoint crawl altogether. I have personally experienced extremely unfavorable crawl performance as a result of large content volumes in SharePoint, even when the underlying SharePoint server infrastructure was optimal.
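To make the CAML suggestion a little more concrete, here is a minimal sketch in Python that builds the `<Where>` clause for the "all invoices from one vendor in 2008" example. The field names (`VendorID`, `InvoiceDate`) and the vendor value are hypothetical and would need to match the columns in your own library. How the clause is submitted to SharePoint (object model, web service, and any wrapper element that API expects) is deliberately left out.

```python
import xml.etree.ElementTree as ET


def comparison(op, field, value, value_type="Text"):
    """Build one CAML comparison such as <Eq> or <Geq> on a single field."""
    node = ET.Element(op)
    ET.SubElement(node, "FieldRef", Name=field)
    val = ET.SubElement(node, "Value", Type=value_type)
    val.text = value
    return node


def and_all(conditions):
    """Combine conditions with <And>.

    CAML's <And> takes exactly two operands, so three or more
    conditions have to be nested pairwise.
    """
    if not conditions:
        raise ValueError("at least one condition is required")
    combined = conditions[0]
    for cond in conditions[1:]:
        parent = ET.Element("And")
        parent.append(combined)
        parent.append(cond)
        combined = parent
    return combined


def build_where(conditions):
    """Return the <Where> clause as a string."""
    where = ET.Element("Where")
    where.append(and_all(conditions))
    return ET.tostring(where, encoding="unicode")


# Hypothetical column names and values, for illustration only.
invoices_2008 = build_where([
    comparison("Eq", "VendorID", "V-1001"),
    comparison("Geq", "InvoiceDate", "2008-01-01T00:00:00Z", "DateTime"),
    comparison("Leq", "InvoiceDate", "2008-12-31T23:59:59Z", "DateTime"),
])
print(invoices_2008)
```

The narrower search, a single invoice by vendor ID and invoice number, would use the same helpers with two `Eq` conditions. In practice you would also want the queried columns indexed on the list, since large lists are exactly where unindexed queries tend to struggle.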
