Category Archives: IDOL Indexer

IDOL Indexer – Numeric Searches

By default, a full-text search of the document content returns results only for numbers of four digits or fewer. A search for a longer number returns nothing, even if that number exists in the document.

 

From page 6 and 7 of the WorkSite Search Tips for End Users document:

 

Searching in the document content:

 

• Rules for searching for numbers or alphanumeric terms in the document content (i.e. the actual document text): numbers are searchable only up to four digits.

 

Example: A search for 123 will return hits for documents with the exact number “123”. Likewise, a search for 1234 will return hits for documents with the exact number “1234”. However, a search for 12345 would return zero results, as would a search for 123456 or 1234567, and so forth. Only searches for numbers of four digits or fewer will yield results when searching the document content, even if a longer number exists in the document.

 

• Note that while numbers in the document content are limited by default, alphanumeric terms are searchable up to 250 characters long, as in the profile Description field.
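The default rules above can be sketched as a small check. This is only an illustration of the stated limits (4 digits for pure numbers, 250 characters for alphanumeric terms); the function name and parameters are hypothetical and are not part of WorkSite or IDOL.

```python
def is_searchable_in_content(term: str,
                             max_numeric_len: int = 4,
                             max_alnum_len: int = 250) -> bool:
    """Return True if a full-text search for `term` could match under the default limits."""
    if not term.isalnum():
        # This sketch only models single terms made of letters and digits.
        return False
    if term.isdigit():
        # Pure numbers: limited to four digits by default.
        return len(term) <= max_numeric_len
    # Alphanumeric terms: limited to 250 characters by default.
    return len(term) <= max_alnum_len


print(is_searchable_in_content("1234"))     # True
print(is_searchable_in_content("12345"))    # False
print(is_searchable_in_content("AB12345"))  # True
```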

 

Furthermore, the limitation on indexing only numbers up to four digits in length applies strictly to full-text search. A search on profile metadata only is carried out against the SQL database and has no limit (other than field size) on the number of digits.

 

This full-text limitation is designed to prevent swelling of the index collection. This is primarily enforced on the body of documents, so that spreadsheets will not dramatically increase the number of unique terms stored in the collection. It also applies to some of the metadata fields, but again, only when full-text searching.

 

The number of unique four-digit numeric terms is far greater than the number of alpha terms. As a result, indexing documents like spreadsheets can lead to a significant increase in the number of terms added to the collection. This leads to an increase in the size of the collection. Typically, there is not a need to search for these specific numbers. However, in some cases firms have a need for searching for pure numeric strings longer than four characters. For example, they may have a nine-digit SKU number in the body of documents. In these cases, they will increase the setting prior to building their collection in order to account for this.

 

This setting is configurable.

 

In order to index numbers longer than the default four digits, the WorkSite Content.cfg file must be modified to increase the maximum length a number can have and still be indexed. Remember, if the WorkSite Indexer configuration includes multiple content engines, the change must be made in the config file for each content engine. The following section of the WorkSite Content.cfg file is the one to modify:

 

//---------------------------Properties---------------------------//

 

[IndexFields]

Index=TRUE

IndexNumbersType=TRUE

IndexNumbers=1

IndexNumbers1MaxLength=4

IndexNumbers2MaxLength=250

 

In this case, IndexNumbers1MaxLength needs to be increased to a value that accommodates the longest numeric string that needs to be indexed. For example, if a common numeric string looks like “1234567,” change IndexNumbers1MaxLength=4 to IndexNumbers1MaxLength=7 to cover the maximum numeric string length of 7 characters.

 

If the numeric strings are not always the same format or the same length, the IndexNumbers1MaxLength number will have to be set to allow for the longest expected numeric string.
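As a rough sketch of the change described above, the following works on the text of a config file: it finds the longest pure-numeric sample and rewrites the IndexNumbers1MaxLength line. The helper names and the sample strings are hypothetical; only the setting name and section contents come from the config excerpt above. With multiple content engines, the same edit would be repeated for each engine's config file.

```python
import re


def required_max_length(samples):
    """Length of the longest pure-numeric string among the samples (0 if none)."""
    return max((len(s) for s in samples if s.isdigit()), default=0)


def set_numeric_max_length(cfg_text: str, new_length: int) -> str:
    """Return cfg_text with the IndexNumbers1MaxLength value replaced by new_length."""
    pattern = re.compile(r"^([ \t]*IndexNumbers1MaxLength[ \t]*=[ \t]*)\d+", re.MULTILINE)
    updated, count = pattern.subn(lambda m: f"{m.group(1)}{new_length}", cfg_text)
    if count != 1:
        raise ValueError(f"expected exactly one IndexNumbers1MaxLength line, found {count}")
    return updated


SAMPLE_CFG = """[IndexFields]
Index=TRUE
IndexNumbersType=TRUE
IndexNumbers=1
IndexNumbers1MaxLength=4
IndexNumbers2MaxLength=250
"""

# Hypothetical numeric strings that need to be searchable.
needed = max(4, required_max_length(["1234567", "98765", "AB12"]))
print(set_numeric_max_length(SAMPLE_CFG, needed))
```

Back up the original config file before writing any change, and rebuild the index collections afterwards.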

 

Increase the value only as far as needed to index the longest required numeric string, since each additional character increases the potential for index growth.
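To put numbers on that growth, the sketch below counts how many distinct digit strings are possible up to a given length. This is an upper bound on the numeric term space, not a prediction of actual index size, and it assumes leading zeros produce distinct terms.

```python
def possible_numeric_terms(max_length: int) -> int:
    """Count distinct digit strings of length 1..max_length (leading zeros counted)."""
    return sum(10 ** n for n in range(1, max_length + 1))


for length in (4, 7, 9):
    print(length, possible_numeric_terms(length))
# 4 11110
# 7 11111110
# 9 1111111110
```

Each extra digit multiplies the possible term space by roughly ten, which is why the setting should stay as low as the firm's data allows.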

 

Note: Once the change is made in the config files, the index collections will need to be rebuilt.

 

* For further information, please refer to Chapter 8, specifically, the section titled “Configure the Number Index Process” in the WorkSite Indexer Administrator’s Guide 8.5.

 

IDOL Indexer Stop Words

Which file contains all the stop words for IDOL indexer?

 

First, determine which file is being used as the stop word list. This information is stored in the WorkSite Content configuration file (WorkSite Content.cfg), which is located in the Indexer install directory.

E.g.: C:\Program Files\Autonomy\Indexer\WorkSite Content.

 

Open this file and check the Language types section.

 

In the [English] section, check the value for the stop list. (In the out-of-the-box configuration, it should be English.dat.)

 

The English.dat file is located in the Indexer install directory.

E.g.: C:\Program Files\Autonomy\Indexer\WorkSite Content\langfiles

 

This file contains the stop word list for the IDOL indexer. Any change to this file, whether adding or removing stop words, requires a rebuild of the index collections.
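The lookup described above can be sketched as a small parser over the config text. The key name `Stoplist` and the layout of the sample text are assumptions made for illustration; check your own WorkSite Content.cfg for the exact spelling of the stop list setting.

```python
def find_stoplist(cfg_text: str, section: str = "English", key: str = "Stoplist"):
    """Return the stop list file named in the given language section, or None."""
    current = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and // comment lines
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]  # entering a new section
        elif current == section and "=" in line:
            name, value = line.split("=", 1)
            if name.strip().lower() == key.lower():
                return value.strip()
    return None


# Hypothetical excerpt, not copied from a real config file.
SAMPLE_CFG = """[LanguageTypes]
0=English

[English]
Stoplist=English.dat
"""

print(find_stoplist(SAMPLE_CFG))  # English.dat
```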

Indexer Timing

Subject Indexer timing
From Jason Stavrenos
To WorksiteSupport@microstrat.com
Sent Friday, December 28, 2012 1:52 PM

Here is a nice write-up from Autonomy for when a client wants a better understanding of indexer timing:

1. Customer drags a document into WorkSite (~1-2 seconds, very minimal)

2. WorkSite Crawler crawls for new/updated documents at intervals (it rests for 1 minute between crawls)

3. Document is moved through Ingestion (~1-2 seconds, very minimal)

4. Document is moved through Active DIH (~1-2 seconds, very minimal)

5. Document is moved to Active Content, where every 15 seconds Active Content writes the data to disk, making it available for search.

So the time for a new document to become searchable depends largely on where the Connector and Active Content are in their intervals. Potentially you could see a delay of 1 minute 30 seconds, or potentially a delay of only 30 seconds.
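The stage timings listed above can be added up as a rough estimate. This sketch uses only the per-stage figures from the list and leaves out the time a crawl itself takes, which the write-up does not quantify, so real-world delays such as the ones quoted above will differ. The function and parameter names are hypothetical.

```python
def indexing_delay_bounds(crawl_rest=60, ingestion=(1, 2), dih=(1, 2), content_flush=15):
    """Return (best, worst) seconds from upload to searchable, excluding crawl duration."""
    # Best case: the crawl starts immediately and Active Content flushes right away.
    best = ingestion[0] + dih[0]
    # Worst case: a full crawl rest, slow pass-through, and a full flush interval.
    worst = crawl_rest + ingestion[1] + dih[1] + content_flush
    return best, worst


print(indexing_delay_bounds())  # (2, 79)
```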


Picking up where you left off

So your indexer is running, but nothing new going into the system seems to be getting indexed. Here’s how you can “pick up where you left off”:

1. Stop all of the Indexer Services using the Stop Services batch file.

2. In the WorkSite Connector directory, rename the *.db file(s) to *.db.old

3. Run the _cleanup.bat file that resides in the WorkSite Connector, Active DIH and WorkSite Ingestion folders

4. Rename the *.db.old file(s) back to *.db

5. Open the *.db files in Notepad

6. Set the date/time stamp in each *.db file to a date prior to the issue (use www.onlineconversion.com/unix_time.html to convert the date to a UNIX timestamp).

Note: This date should be a few days earlier than when the indexer last indexed. You can tell by looking at the “Indexer Status” in the Indexer Browser. The last entry will have the date/time stamp of the last document processed. You will see entries for 11pm every night. These are the Synch jobs between the Active Content and Server Content engines. You need to look for the last entry that occurred at a time other than 11pm. Set the search results number on the menu to around 1000 if necessary.

NOTE: Be sure the timestamp you enter keeps the last three digits that the original .db files had after the UNIX timestamp:

Problem:

After backdating the .db files in the connector folder, documents are crawled from the beginning instead of from the timestamp specified in the .db files.

Solution:

The .db files were missing the last three digits of the UNIX timestamp. Copy the last three digits of the UNIX timestamp from the backup .db files, then start the WorkSite Connector service again so that it fetches documents from the specified time onward.

7. Start all services.
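For step 6, the conversion can also be done locally instead of through the website. The sketch below builds a 13-digit value from a date: the 10-digit UNIX timestamp in seconds followed by three extra digits. It assumes those three digits are milliseconds and that `000` is acceptable; the original note only says they must be present, so compare the result against the value in your backup .db file before using it. The function name is hypothetical.

```python
from datetime import datetime, timezone


def to_db_timestamp(moment: datetime) -> str:
    """Return UNIX seconds followed by three extra digits (assumed milliseconds)."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    seconds = int(moment.timestamp())
    return f"{seconds}000"


# A date a few days before the last indexed document (example value only).
print(to_db_timestamp(datetime(2012, 12, 25, tzinfo=timezone.utc)))  # 1356393600000
```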