SharePoint Portal Server 2003 IT Documentation
Editing a Thesaurus File
 

The thesaurus is a query-expansion search feature. It allows the user to type a phrase in a search query and receive results for related words. For example, the user can search for "run" and receive results that contain either "run" or "jog" if those two terms are related by the thesaurus file. The thesaurus also enables the server farm administrator to affect search ranking by assigning weights to words. Microsoft SharePoint Portal Server provides thesaurus files for the following languages:

  • Chinese-Simplified (tschs.xml)
  • Chinese-Traditional (tscht.xml)
  • Czech (tscsv.xml)
  • Dutch (tsnld.xml)
  • English-International (tseng.xml)
  • English-US (tsenu.xml)
  • Finnish (tsfin.xml)
  • French (tsfra.xml)
  • German (tsdeu.xml)
  • Hungarian (tshun.xml)
  • Italian (tsita.xml)
  • Japanese (tsjpn.xml)
  • Korean (tskor.xml)
  • Polish (tsplk.xml)
  • Portuguese (Brazil) (tsptb.xml)
  • Russian (tsrus.xml)
  • Spanish (tsesn.xml)
  • Swedish (tssve.xml)
  • Thai (tstha.xml)
  • Turkish (tstrk.xml)

The thesaurus files contain inactive (commented out) sample content. The neutral thesaurus file (tsneu.xml) is always applied to queries, in addition to the thesaurus file associated with the query language. For queries in a language that has no associated thesaurus file, the neutral thesaurus file is the only one applied.

By default, SharePoint Portal Server stores thesaurus files in the following directory of the server: local_drive\Program Files\SharePoint Portal Server\DATA\Config. The data directory is located elsewhere if you chose to install the data files elsewhere during the server installation process.

The thesaurus files are also copied to local_drive\Program Files\SharePoint Portal Server\Data\Applications\Application UID\Config for each specific instance of the Microsoft Search service (MSSEARCH) or Microsoft SharePointPS Search service (SharePointPSSearch). You can modify the thesaurus at the application level instead of at the server or server farm level. For example, if SharePoint Portal Server and Microsoft SQL Server are installed on the same server, each can have different thesaurus files.

Important  There is one additional file called tsschema.xml. Do not modify this file.

You can edit the thesaurus entries by editing the XML file in a text editor. The file must be well-formed XML (matching opening and closing tags around each entry) to load properly. If the XML is malformed, SharePoint Portal Server logs an error in the Microsoft Windows Server 2003 event log referencing the file and line.
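
Because a malformed file fails to load, it can help to check an edited file before saving it to the server. The following is a minimal sketch, not part of SharePoint Portal Server, that uses Python's standard XML parser to report the line and column of the first well-formedness error. The function name check_well_formed and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))
    return None


# A complete entry, and one where the closing </sub> tag is missing.
good = "<replacement><pat>W2K</pat><sub>Windows 2000</sub></replacement>"
bad = "<replacement>\n<pat>W2K</pat>\n<sub>Windows 2000\n</replacement>"

print(check_well_formed(good))  # None
print(check_well_formed(bad))   # a tuple naming the line of the mismatched tag
```

To check a file on disk rather than a string, ET.parse(path) raises the same ParseError for a malformed file.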

Note  Do not change the case of the tags in the XML file. Only the XML tag is uppercase. All other tags are lowercase. For example, the <replacement> tag must remain lowercase.
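
A changed tag case is easy to miss by eye. As a rough sketch under the rule stated in the note above (only the XML tag is uppercase), the following Python function lists tag names that contain uppercase letters. The function name and the sample string are hypothetical, and the check is not a substitute for the schema in tsschema.xml.

```python
import xml.etree.ElementTree as ET


def tags_with_wrong_case(xml_text):
    """List tag names that are not lowercase, ignoring any tag named XML."""
    offenders = []
    for element in ET.fromstring(xml_text).iter():
        # Drop a "{namespace}" prefix if the parser attached one.
        name = element.tag.rsplit("}", 1)[-1]
        if name != "XML" and name != name.lower():
            offenders.append(name)
    return offenders


# Hypothetical fragment in which <pat> was mistyped as <Pat>.
sample = "<XML><replacement><Pat>W2K</Pat><sub>Windows 2000</sub></replacement></XML>"
print(tags_with_wrong_case(sample))  # ['Pat']
```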

Thesaurus files contain the following types of thesaurus entries:

Replacement set

A replacement set specifies a pattern that is replaced by a substitution or substitutions in a search query. For example, you can add a replacement set where "W2K" is the pattern and "Windows 2000" is the substitution. If users query for "W2K," SharePoint Portal Server returns only search results containing "Windows 2000." It does not return results containing "W2K."

Each replacement set is enclosed within a <replacement> tag. Within the replacement tag, you specify one or more patterns by enclosing them in a <pat> tag. You specify one or more substitutions by enclosing them in a <sub> tag. Patterns and substitutions can contain a word or a sequence of words. For the above example, you would add the following lines:

<replacement>
         <pat>W2K</pat>
         <sub>Windows 2000</sub>
</replacement>

You can have more than one substitution for each pattern.
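
If you maintain many entries, generating them from plain strings avoids unbalanced tags and unescaped characters such as "&", which would make the file malformed. The sketch below is one possible helper, not a SharePoint Portal Server tool. The function name replacement_entry and the second substitution "Win2000" are made up for illustration.

```python
from xml.sax.saxutils import escape


def replacement_entry(patterns, substitutions, indent="         "):
    """Build the text of one <replacement> set from plain strings."""
    lines = ["<replacement>"]
    # escape() converts &, < and > so the entry stays well-formed XML.
    lines += [f"{indent}<pat>{escape(p)}</pat>" for p in patterns]
    lines += [f"{indent}<sub>{escape(s)}</sub>" for s in substitutions]
    lines.append("</replacement>")
    return "\n".join(lines)


# One pattern with two substitutions.
print(replacement_entry(["W2K"], ["Windows 2000", "Win2000"]))
```

The printed text can be pasted into the thesaurus file in a text editor.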

By default, patterns are case sensitive. For example, if your thesaurus file contains the preceding entry and a user searches for "w2k," SharePoint Portal Server does not necessarily return search results containing "Windows 2000." SharePoint Portal Server does not recognize "w2k" as "W2K" because the case of the text differs.

You can specify that patterns are case sensitive or case insensitive by adding a tag to the thesaurus file for your language. For example, if you specify that patterns are case insensitive, the <pat> and <sub> terms will match query terms regardless of the case of the query term. For information about adding the case tag to the thesaurus file, see "Edit a thesaurus file" later in this section.
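
The effect of the case setting on pattern matching can be pictured with a small Python sketch. This is an illustration of the behavior described above, not the search service's actual comparison logic, and the function name pattern_matches is an assumption.

```python
def pattern_matches(query_term, pattern, case_sensitive=True):
    """Compare a query term with a thesaurus pattern under the case setting."""
    if case_sensitive:
        return query_term == pattern
    # Simplification: lowercase both sides to ignore case.
    return query_term.lower() == pattern.lower()


print(pattern_matches("w2k", "W2K"))                        # False
print(pattern_matches("w2k", "W2K", case_sensitive=False))  # True
```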

A query with a CONTAINS FORMSOF thesaurus works as described previously. For more information about the CONTAINS FORMSOF syntax, see the Microsoft SharePoint Products and Technologies 2003 Software Development Kit.

The type of query that the portal site uses by default is called FREETEXT. FREETEXT queries automatically activate the thesaurus. However, if you type your search term or terms in double quotation marks, SharePoint Portal Server disables the FREETEXT query and does not consult the thesaurus. As a result, SharePoint Portal Server returns results based on the exact search term or terms within the double quotation marks. If the thesaurus replaces one word of a phrase with another word, a FREETEXT query returns results for the new version of the entire phrase.

For the replacement set shown earlier, the following table shows results based on different user input typed in the search interface on the portal site. This example assumes that the thesaurus is set as case sensitive, but search is set as not case sensitive.

User input | Thesaurus consulted | Search results include documents that contain
-----------|---------------------|-----------------------------------------------
w2k | Yes (FREETEXT query) | W2k or W2K or w2k or w2K. No results are returned for Windows 2000 because the pattern in the thesaurus is uppercase W2K.
"w2k" | No | w2k or W2K or W2k or w2K
W2K | Yes (FREETEXT query) | Windows 2000 or windows 2000 or case combinations (such as wInDows 2000) or w2k or W2k or w2K. No results are returned for W2K.
"W2K" | No | W2K or w2k or W2k or w2K
W2K Server | Yes (FREETEXT query) | Windows 2000 (and case combinations as shown above) or Server (and case combinations such as server or SeRvEr) or W2K Server (and case combinations). No results are returned for W2K operating system.
"W2K Server" | No | W2K Server or w2k Server or W2k Server or w2K Server or W2K server or w2k server or W2k server or w2K server

Note  In each of the previous examples, the case sensitivity setting for search is specified as false. Otherwise, all the case differences become significant when doing the pattern matching.

If you have two replacement sets with similar patterns being matched, the longer of the two takes precedence. For example, if you have the following two replacement sets, "Internet Explorer" takes precedence over "Internet":

<replacement>
         <pat>Internet</pat>
         <sub>intranet</sub>
</replacement>

and

<replacement>
         <pat>Internet Explorer</pat>
         <sub>IE</sub>
         <sub>IE 5</sub>
</replacement>

For the replacement sets shown above, the following table shows results based on different user input typed in the search interface on the portal:

User input | Thesaurus consulted | Search results include documents that contain
-----------|---------------------|-----------------------------------------------
Internet | Yes (FREETEXT query) | Intranet or intranet or case combinations (such as iNtranEt). No results are returned for IE or IE 5.
Internet Explorer | Yes (FREETEXT query) | IE or IE 5 (and case combinations such as iE or Ie 5). No results are returned for Internet or Internet Explorer or intranet.
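
The longest-pattern rule can be sketched in Python as follows. This is a simplified model of the precedence described above, assuming case-sensitive, whole-word matching. It is not how the search service is implemented, and the function name apply_replacements is hypothetical.

```python
def apply_replacements(query, replacement_sets):
    """Rewrite a query, preferring the longest matching pattern at each word."""
    words = query.split()
    # Try longer patterns first so "Internet Explorer" beats "Internet".
    ordered = sorted(replacement_sets, key=lambda p: len(p.split()), reverse=True)
    alternatives = []  # one list of search terms per span of the query
    i = 0
    while i < len(words):
        for pattern in ordered:
            n = len(pattern.split())
            if words[i:i + n] == pattern.split():
                alternatives.append(replacement_sets[pattern])
                i += n
                break
        else:
            # No pattern starts at this word, so keep the word unchanged.
            alternatives.append([words[i]])
            i += 1
    return alternatives


sets = {
    "Internet": ["intranet"],
    "Internet Explorer": ["IE", "IE 5"],
}
print(apply_replacements("Internet Explorer", sets))  # [['IE', 'IE 5']]
print(apply_replacements("Internet", sets))           # [['intranet']]
```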

Expansion set

An expansion set is a group of substitutions that are synonyms of each other. Queries containing matches in one substitution are expanded to include all other substitutions in the set. For example, you can add an expansion set where "writer," "author," and "journalist" (the substitutions) are synonyms. If you then query for "author," SharePoint Portal Server also returns search results containing "writer" or "journalist."

Each expansion set is enclosed within an <expansion> tag. Within the expansion tag, you specify one or more substitutions enclosed by a <sub> tag. For the preceding example, you would add the following lines:

<expansion>
         <sub>writer</sub>
         <sub>author</sub>
         <sub>journalist</sub>
</expansion>
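
As an illustration of how an expansion set widens a query, the following Python sketch returns a term together with its synonyms. It models only the behavior described above, using exact string comparison, and the function name expand_term is an assumption.

```python
def expand_term(term, expansion_sets):
    """Return the term plus every synonym from any set that contains it."""
    expanded = [term]
    for synonyms in expansion_sets:
        if term in synonyms:
            expanded += [s for s in synonyms if s not in expanded]
    return expanded


sets = [["writer", "author", "journalist"]]
print(expand_term("author", sets))  # ['author', 'writer', 'journalist']
print(expand_term("editor", sets))  # ['editor']
```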

You can also configure the following two options:

Weighting

Substitution entries support weighting. Weighting enables you to rank certain words higher in search results by giving those words a higher value relative to the other words in the substitution set. You can specify a value between 0 and 1. For example, you can weight the following substitutions as shown:

 <expansion>
      <sub weight="0.8">Internet Explorer</sub>
      <sub weight="0.2">IE</sub>
      <sub weight="0.9">IE5</sub>
 </expansion>
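
A weight outside the allowed range is easy to introduce by mistake. The sketch below, which is not a SharePoint Portal Server tool, parses one expansion set and reports weights that are not numbers between 0 and 1. It assumes that the endpoints 0 and 1 are themselves allowed, which the text above does not state, and the function name invalid_weights is hypothetical.

```python
import xml.etree.ElementTree as ET


def invalid_weights(expansion_xml):
    """Return (text, weight) pairs whose weight is not a number in [0, 1]."""
    problems = []
    for sub in ET.fromstring(expansion_xml).iter("sub"):
        raw = sub.get("weight")
        if raw is None:
            continue  # entries without a weight attribute are skipped
        try:
            ok = 0.0 <= float(raw) <= 1.0  # assumption: endpoints allowed
        except ValueError:
            ok = False
        if not ok:
            problems.append((sub.text, raw))
    return problems


sample = """<expansion>
     <sub weight="0.8">Internet Explorer</sub>
     <sub weight="1.5">IE</sub>
     <sub weight="0.9">IE5</sub>
</expansion>"""
print(invalid_weights(sample))  # [('IE', '1.5')]
```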

Stemming

You can specify stemming in pattern and substitution entries. Word stemming maps a linguistic stem to all matching words. For example, in English, the stem "buy" matches "bought," "buying," and "buys."

You can specify stemming by adding "**" at the end of the string. SharePoint Portal Server returns matches for variations of the word you enter when you specify stemming.

For example, you can make queries for "run" also return "running," "jog," and "jogging." You would modify the expansion set as shown:

 <expansion>
      <sub weight="0.5">run**</sub>
      <sub weight="0.5">jog**</sub>
 </expansion>

If you query for "run" or "running," you get search results for "jog," "jogging," and so on. If you query for "running," you get the same results as for "run."

If your thesaurus file includes the pattern <pat>Stefan ran to the store**</pat> or the substitution <sub>Stefan ran to the store**</sub>, the query matches the following strings, or search adds them to the query:

  • Stefan runs to the store
  • Stefan running to the store
  • Stefan ran to the store
  • Stefan runs to the stores
  • Stefan running to the stores
  • Stefan ran to the stores

If you create a thesaurus file that contains large numbers of multi-word expansions or substitutions, and you create a custom query page that uses the CONTAINS search predicate, query performance degrades because each multi-word substitution or expansion is treated as a phrase. Phrasal queries are much more expensive to execute than non-phrasal queries. This should not apply to FREETEXT predicate queries.

Edit a thesaurus file
  1. Open the file in Microsoft Notepad. If double-byte character set (DBCS) characters are used, you must save the files in Unicode.
  2. If you are editing the thesaurus file for the first time, remove the following two comment lines at the beginning and end of the file, respectively:

    <!--Commented out

    -->

  3. If you want the patterns to be case insensitive, add the following tag at the beginning of the file: <case caseflag="false"></case>

    If you later want the patterns to be case sensitive, change false to true in the tag, as shown: <case caseflag="true"></case>

  4. Add, modify, or delete a replacement set, expansion set, weighting, or stemming.

    Note  Entries you add to the thesaurus file should not contain only special characters or be noise words. You can, however, have blank entries. For example, if you want to ensure that queries for a specific word, such as windows, return no results, you would have an entry as follows:

    <replacement>
          <pat>windows</pat>
          <sub></sub>
    </replacement>

  5. Save the file and close Notepad.