Indexing with Microsoft Index Server
Krishna Nareddy
Windows NT Query Group
Microsoft Corporation
January 30, 1998
Introduction:
This is the third in a series of articles intended to help you understand and effectively deploy Microsoft search solutions on your Web site and intranet. The first article, "Anatomy of a Search Solution," helps you work out which kind of search solution meets the needs of your site. The second article, "Introduction to Microsoft Index Server," introduces the features and capabilities of Index Server. The purpose of this article is to help you understand, administer, and tune the indexing process. It is also meant to be easy to cross-reference with the Microsoft Index Server documentation.
The Index Server catalog encompasses every aspect of the indexing infrastructure, so we begin by examining the structure of a catalog. We then explore the indexing process in depth. At each step of the process, you are shown how to control and tune it. Next comes an introduction to the tools that help you administer Index Server and monitor its status and performance. By the end of this article, you will understand the indexing process and know how to adjust it, check its status, and diagnose and solve common problems.
The information in this article applies to Index Server 2.0, which ships with the Microsoft Windows NT® 4.0 Option Pack. Most of it also applies to Index Server 3.0, which will ship with Windows NT 5.0, but there are some differences. These differences will be covered in an updated article after the final release of Index Server 3.0.
An Index Server catalog contains all the information needed to index and query your corpus, that is, the body of documents you want to search. It consists of the set of directories that make up the corpus, a content index that stores the full-text index, a property cache that stores document properties, and control attributes for tuning the indexing process. All information about catalogs is kept in the registry. Many parameters can be overridden for an individual catalog in that catalog's own subkey under the Catalogs subkey. Unless otherwise specified, all registry parameters named in this article are relative to the following key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex
After a successful installation, Index Server 2.0 creates a default catalog named "Web".
Directories:
This is the set of directories that together contain the whole corpus. A directory can be a physical path on the local machine or a remote path that follows the Universal Naming Convention (UNC). At query time, the directory set can be used to restrict the scope of a query, and it supplies the full path name (or UNC name) of each matching file so that the file can be located. Because the directory set is used to restrict query scope, the Index Server documentation often uses the term "scope" in place of directory set.
The directories associated with a catalog are listed in a subkey of that catalog's key under the Catalogs subkey. The following values describe each directory:
0 = directory is not indexed (excluded)
1 = directory is indexed (included)
2 = directory is a virtual path
4 = directory is a physical path
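The values above look like bit flags, so a directory entry could be decoded as in the following minimal sketch. Treating the values as combinable bits is an assumption, and the constant and function names are hypothetical, chosen only for illustration.

```python
# Flag values as listed in the text. Combining them as bits is an assumption.
INDEXED = 1
VIRTUAL_PATH = 2
PHYSICAL_PATH = 4


def describe_scope(flags):
    """Return readable labels for a directory's flag value."""
    labels = ["indexed (included)" if flags & INDEXED else "not indexed (excluded)"]
    if flags & VIRTUAL_PATH:
        labels.append("virtual path")
    if flags & PHYSICAL_PATH:
        labels.append("physical path")
    return labels


if __name__ == "__main__":
    for value in (0, 1, 3, 5):
        print(value, describe_scope(value))
```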
Property cache:
The property cache is an on-disk store that speeds up the retrieval of frequently used properties. The properties stored in the cache fall into the following categories:
Properties reserved by Index Server for internal use only. You cannot control these properties directly.
Frequently used properties defined by Index Server, such as path and file name. These are document attributes obtained during the document gathering process.
Properties defined by Index Server that are extracted or created from a document during filtering. Examples are the document title and document author properties (for HTML and Microsoft Office documents).
User-defined properties extracted from documents. If a custom property exists only in the document, that is not enough for it to be returned with query results. Custom properties of interest should be added to the property cache so that they can be found when results are retrieved. The custom properties that can be obtained directly from a document are the OLE properties associated with it. Index Server can read OLE properties without using a document filter. For efficiency, however, you should consider caching both OLE and non-OLE custom properties. How to view the full list of available properties and change the set of cached properties is described in the section on monitoring status and performance.
Because the property cache holds the properties of every indexed document, it is a physical store whose size grows with the size of the corpus. It can be quite large and usually cannot be loaded entirely into main memory, so it is paged into memory in sections of 64K each. You can control the maximum number of sections that may be in memory at the same time with the PropertyStoreMappedCache registry parameter. The more sections are loaded, the better the performance, at the cost of more physical memory (RAM).
The property cache changes whenever a document is added, deleted, or modified. All of these modifications take place in the sections loaded in memory, and if those pages are not written back to disk, the property cache on disk becomes out of date. If Index Server terminates abruptly, it finds that the property cache and the content index are inconsistent. In that case, the cache is restored to its last known good state. The information needed for this recovery is kept in the property cache backup file. The size of the backup file determines how often the property cache is written to disk: the larger the backup file, the less often the cache is flushed. The size of the backup file is measured in operating system pages and is controlled by the PropertyStoreBackupSize registry parameter. The operating system page size depends on the processor architecture and is fixed by Windows NT. Because the page size varies, the same PropertyStoreBackupSize value produces backup files of different sizes on different platforms.
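The arithmetic behind these two parameters can be made concrete with a small sketch. The 64K section size comes from the text; the parameter values used below (16 sections, 1024 pages) are hypothetical, and the 4 KB and 8 KB page sizes are the usual values for x86 and Alpha processors respectively.

```python
SECTION_BYTES = 64 * 1024  # each property cache section is 64K, per the text


def mapped_cache_bytes(mapped_sections):
    """Upper bound on memory used by property cache sections held in memory."""
    return mapped_sections * SECTION_BYTES


def backup_file_bytes(backup_pages, page_bytes):
    """Size of the backup file for a page count and an OS page size."""
    return backup_pages * page_bytes


if __name__ == "__main__":
    print(mapped_cache_bytes(16))         # 16 sections of 64K each
    print(backup_file_bytes(1024, 4096))  # 1024 pages on a 4 KB-page system
    print(backup_file_bytes(1024, 8192))  # same setting on an 8 KB-page system
```

The same page count yields a backup file twice as large on the system with the larger page size, which is why the setting should be reviewed when moving between platforms.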
Content index
The content index holds all the plain-text information extracted from the documents, organized for efficient matching at query time. It is spread over several files, all of which live in a special directory named Catalog.wci. This directory cannot be split across multiple drives and should be placed on a fixed (that is, non-removable) local drive. When a catalog is defined, the location of this directory must be given as a full path. If you choose to directly modify the location parameters located in the Catalogs /
In a sense, the content index contains a complete digest of your corpus. Anyone with access to this directory can extract information from the index files and may be able to reconstruct parts of documents that they cannot reach through Windows NT file access controls. You should protect the Catalog.wci directory with appropriate security permissions to prevent such abuse.
Control attributes
Index Server supports the creation and use of multiple catalogs. Although each catalog is different, they share many control attributes. Duplicating the common ones in every catalog would be wasteful and error-prone, so all control attributes that affect catalog operation are defined in one central location. A catalog can differ from the others by overriding specific attributes. For example, if a catalog should not generate a summary for each document, it can set the relevant attribute within its own scope. When Index Server needs to know whether a given catalog supports document summaries, it first checks that catalog. If the attribute is not defined for the catalog, the global value is used. The global attributes are stored under the ContentIndex key, while the catalog-specific entries are stored under Catalogs /
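The lookup order described above (catalog first, then global) can be sketched with two dictionaries standing in for the registry. This is only an illustration: the values are hypothetical, and it assumes that GenerateCharacterization is the attribute that controls document summaries.

```python
# Hypothetical global values, standing in for entries under ContentIndex.
GLOBAL_ATTRIBUTES = {"GenerateCharacterization": 1, "MaxCharacterization": 320}

# Hypothetical per-catalog overrides, standing in for entries under Catalogs.
CATALOG_OVERRIDES = {"Web": {"GenerateCharacterization": 0}}


def effective_attribute(catalog, name,
                        overrides=CATALOG_OVERRIDES,
                        global_attributes=GLOBAL_ATTRIBUTES):
    """Return the catalog's own value if it defines one, else the global value."""
    catalog_values = overrides.get(catalog, {})
    if name in catalog_values:
        return catalog_values[name]
    return global_attributes[name]


if __name__ == "__main__":
    print(effective_attribute("Web", "GenerateCharacterization"))  # catalog override
    print(effective_attribute("Web", "MaxCharacterization"))       # global fallback
```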
Filtering parameters control various aspects of the filtering process. They are DaemonResponseTimeout, FilterContents, FilterDelayInterval, FilterDirectories, FilterFilesWithUnknownExtensions, FilterRemainingThreshold, FilterRetries, FilterRetryInterval, MaxFilesizeFiltered, and MaxFilesizeMultiplier.
Language parameters describe the resources associated with each installed language. The InstalledLangs registry parameter lists the installed languages. Each string in InstalledLangs names a subkey under the ContentIndex\Language key. Under each language key, the available parameters are ISAPIDefaultErrorFile, ISAPIIDQErrorFile, ISAPIRestrictionErrorFile, Locale, NoiseFile, StemmerClass, and WBreakerClass.
Merge parameters control the process of building the master index by merging indexes. They are MasterMergeCheckpointInterval, MasterMergeTime, MaxFreshCount, MaxIdealIndexes, MaxIndexes, MaxMergeInterval, MaxWordlistSize, MinDiskFreeForceMerge, MinMergeIdleTime, MinSizeMergeWordlists, and MinWordlistMemory.
Property cache parameters control the memory available to the cache and how often it is flushed to disk. They are PropertyStoreBackupSize and PropertyStoreMappedCache.
CPU management parameters control the amount of CPU available to perform specific tasks. They are ThreadClassFilter, ThreadPriorityFilter, and ThreadPriorityMerge.
The remaining miscellaneous parameters are related to indexing. They are EventLogFlags, GenerateCharacterization, IsapiDefaultCatalogDirectory, IsIndexingNNTPSvc, IsIndexingW3Svc, MaxCharacterization, NNTPSvcInstance, and W3SvcInstance.
Indexing process
An enumeration mechanism identifies all the files to be indexed in the catalog's directories and adds them to a queue. A document filter opens each queued file and emits the properties and content of the document. The content emitted by the filter is passed to a word breaker, which identifies words and numbers in the text stream. The words that are not in the stop list (the list of noise words) are ultimately compiled into a master index that is used to resolve queries.
Building the master index is a multistep process in which the words extracted from documents move first into in-memory word lists, then into intermediate shadow indexes, and finally into the persistent master index used to resolve queries. This multistep process lets Index Server make filtered documents searchable quickly while the data moves gradually toward the persistent master index. Together, the word lists, the shadow indexes, and the master index are referred to as the content index. The process of combining several source indexes to move the intermediate data structures toward their final form is called merging. Merging is time-consuming and disk I/O-intensive, but it is necessary because the resulting index is more efficient than the source indexes it was built from. Index Server provides several ways to control the merge process, which are described later.
In addition to the full-text content, the filter also extracts properties from each document. These properties are stored in the property cache, which is optimized for retrieval. Index Server uses the property cache to resolve queries and to fetch the requested properties from the property store.
In this section, we discuss how to control each step of the indexing process.
Document collection
Index Server gathers documents by scanning and by notification. Scanning is a process that determines which documents should be indexed by inspecting all included directories. Windows NT sends a notification when a file it manages is modified. Whenever possible, Index Server relies on notifications, because this mechanism is more efficient than scanning.
Index Server performs two types of scan: full scans and incremental scans. A full scan takes a complete inventory of all documents and is usually performed when a directory is first added. A full scan is also performed as part of recovery from a serious failure.
While Index Server is shut down, it cannot track changes to documents. When it restarts, it needs to find out which documents were modified while it was not running so that it can update its indexes. An incremental scan detects all documents that need to be refiltered and reindexed. At startup, Index Server performs an incremental scan on all directories. An incremental scan is also performed when Index Server loses notifications. This can happen if the document update rate is very high or if the buffer used to receive change notifications from Microsoft Windows NT overflows.
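The idea behind an incremental scan can be illustrated with a minimal sketch that compares a record of what was indexed with the current state of a directory. The text does not say how Index Server actually detects changes; comparing modification times, and every name used here, is an assumption made only for illustration.

```python
def incremental_scan(indexed_mtimes, current_mtimes):
    """Compare recorded and current modification times.

    Both arguments map a file path to a modification time.
    Returns (to_filter, to_remove): files that are new or changed,
    and files that were indexed but no longer exist.
    """
    to_filter = sorted(
        path for path, mtime in current_mtimes.items()
        if indexed_mtimes.get(path) != mtime  # new file or changed timestamp
    )
    to_remove = sorted(path for path in indexed_mtimes if path not in current_mtimes)
    return to_filter, to_remove


if __name__ == "__main__":
    indexed = {"a.doc": 100, "b.htm": 200, "c.txt": 300}
    current = {"a.doc": 100, "b.htm": 250, "d.xls": 400}
    print(incremental_scan(indexed, current))
```

A full scan, by contrast, would queue every file in `current_mtimes` regardless of what had been recorded before.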
You can force a full scan or an incremental scan on any indexed directory. You should force a full scan after you install a new filter, remove a filter, or change the registration information of a filter. You can also force a scan for any other reason, but keep in mind that a full scan reindexes every file in the directory, which takes a long time. How to start a scan is described later in this article.
During normal operation, if an indexed directory resides on a system running Microsoft Windows NT, all documents in the directory are tracked automatically. Note that a directory can also point to a network share. Such shares may reside on systems, such as Novell NetWare or Microsoft Windows 95 file servers, that do not support change notifications. To handle directories of this type, Index Server scans the share periodically. You can control the frequency of these scans with the ForcedNetPathScanInterval registry parameter.
Document filtering
Documents come in a variety of formats. Index Server cannot know every document format, nor should it be limited to a few well-known ones. The indexing model therefore uses plug-in components, called filters, that extract text from particular formats. The process of extracting text content from a document is called filtering. The default filters installed with Index Server handle Microsoft Office documents (Microsoft Excel, Microsoft Word, and Microsoft PowerPoint), HTML 3.0 or lower, plain-text documents, and binary files. You can add filters or replace existing ones by modifying the registry. The Index Server documentation describes how to modify the registry to change filter DLLs. The additional tool FiltReg (run without parameters) lists the filters and the extensions registered for them. The additional tool FiltDump (run with -? for usage) shows what the filter associated with a file's extension emits to Index Server.
When Index Server is ready to filter a file, it determines the format of the file from its extension. The registry contains the association between file extensions and filter DLLs, and Index Server uses this association to decide which DLL is appropriate for a given file. Not all extensions are listed in the registry. How Index Server handles files with unknown extensions is controlled by the FilterFilesWithUnknownExtensions registry parameter. When set, this parameter makes Index Server filter such files with the default plain-text filter.
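The extension-based selection described above can be sketched as a dictionary lookup with a fallback. The DLL names in the example table are hypothetical placeholders, not the names of the filters shipped with Index Server, and the text does not say what happens to unknown extensions when the parameter is not set, so that case is modeled simply as returning None.

```python
import os

# Hypothetical extension-to-filter table, standing in for the registry association.
REGISTERED_FILTERS = {
    ".htm": "example_html_filter.dll",
    ".doc": "example_office_filter.dll",
    ".txt": "example_text_filter.dll",
}

DEFAULT_TEXT_FILTER = "example_text_filter.dll"  # hypothetical name


def choose_filter(filename, filter_unknown_extensions, registered=REGISTERED_FILTERS):
    """Pick a filter by file extension, falling back to the text filter if allowed."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in registered:
        return registered[extension]
    if filter_unknown_extensions:
        return DEFAULT_TEXT_FILTER
    return None  # behavior in this case is not described in the text


if __name__ == "__main__":
    print(choose_filter("report.DOC", True))
    print(choose_filter("notes.xyz", True))
    print(choose_filter("notes.xyz", False))
```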
Your corpus may contain some "binary" files. From the point of view of Index Server, binary files contain little information that is useful for indexing. You can identify these files and have them filtered by a null filter. This filter emits only simple file properties, such as file size and file name, so you can still find binary files by querying their properties. Once you have identified the binary file types in your corpus, you can register their extensions in the registry. For example, to mark the extension ".nul" as a binary file type, add a .nul subkey whose default string value is BinaryFile, as shown below.
HKEY_LOCAL_MACHINE\SOFTWARE\Classes\.nul = REG_SZ BinaryFile
If a file cannot be filtered, Index Server retries it several times. The FilterRetries registry parameter controls the maximum number of attempts. If a file still cannot be filtered after that, it remains unfiltered. Files can also remain unfiltered because they are corrupt. When a filter encounters a corrupt file, it writes an event to the event log. From the Index Server administration page, you can issue a query for unfiltered documents; the result lists all unfiltered files. Note that password-protected files cannot be filtered. You can control which filtering-related messages are written to the event log with the EventLogFlags registry parameter. The Index Server documentation describes how to configure this parameter.
Index Server uses a child process, CiDaemon, to filter documents. This isolation protects the Index Server process, CiSvc, from faulty or malicious filter DLLs that could otherwise corrupt it. Index Server gives the child process a list of files to filter. The child process filters these files and returns their content to the Index Server process. If the filtering process terminates for some reason, the parent process CiSvc automatically restarts it.
If a document emits too much data relative to its file size, Index Server protects itself by stopping the filtering of that document. How much data is too much? You can control the limit with the MaxFilesizeMultiplier registry parameter.
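One way to picture this safeguard is as a running total that is checked against the file size times the multiplier. The sketch below assumes exactly that interpretation of MaxFilesizeMultiplier; the function name and the chunked representation of filter output are hypothetical.

```python
def filter_with_limit(file_size, chunks, multiplier):
    """Accept filter output until it would exceed file_size * multiplier.

    chunks is an iterable of byte strings emitted by a filter.
    Returns (bytes_accepted, completed).
    """
    limit = file_size * multiplier
    emitted = 0
    for chunk in chunks:
        if emitted + len(chunk) > limit:
            return emitted, False  # too much output: stop filtering this file
        emitted += len(chunk)
    return emitted, True


if __name__ == "__main__":
    output = [b"a" * 40, b"b" * 40, b"c" * 40]  # 120 bytes from a 10-byte file
    print(filter_with_limit(10, output, 8))     # limit of 80 bytes is exceeded
    print(filter_with_limit(10, output, 12))    # limit of 120 bytes is enough
```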
You can control the pace of the filtering process with the ThreadClassFilter and ThreadPriorityFilter registry parameters.
Word breaking
The text extracted from a document can be in any language. Index Server 2.0 supports English, Chinese, French, German, Korean, Spanish, Italian, Dutch, Swedish, and Japanese. For each supported language, Index Server provides all the tools discussed below. You can install only the languages you are interested in. Index Server uses locale information to identify the language of a document and selects the linguistic tools appropriate for that language. The default locale of a document is the locale of the server on which the document resides. Use the MS.Locale meta tag in an individual HTML file to override this default. The raw text extracted from a document is processed by a word breaker, which identifies the words in the text stream. Index Server 2.0 does not allow a custom word breaker to be installed. Otherwise, you can control the words, phrases, numbers, and other tokens identified by the word breaker.
Documents typically contain some very frequent words that do little to distinguish one document from another. The point of specifying particular words in a query is to separate the documents that contain those words (and so are potentially of interest to the user) from the rest of the corpus. If a common word such as "this" is used in a query, it may match every document in the corpus. Such a word has no discriminating power and is therefore called a noise word. Most search solutions let you exclude noise words from the index. The noise word list is also called a stop list because it stops noise words from entering the index. Which words are acceptable as noise words? You should decide on the basis of your users' goals and the subject matter of your data. For example, in a corpus of C++ source code, "this" has a specific meaning in the language and should not be placed in the stop list. If you are not sure whether a word is a noise word, err on the side of caution and leave it out of the stop list.
A careful choice of noise words improves the quality of the result sets and, in turn, user satisfaction with the search solution. Because noise words occur so often, removing them can significantly reduce the size of the index, and a smaller index improves Index Server performance. Note, however, that the performance gain is only a desirable side effect of the stop list. The main goal is a better search experience for users.
Noise words are removed only when a file is filtered. If you change the stop list after the index has been built, the change affects only documents filtered from then on. To get the full benefit of a modified stop list, you must rescan all directories.
Index Server reads the noise words from the file named by the NoiseFile registry parameter, which is stored under the language's key beneath ContentIndex.
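The effect of a stop list can be shown with a short sketch. It assumes the noise file is plain text with words separated by whitespace, which the text does not state, and the function names are hypothetical.

```python
def load_noise_words(noise_file_text):
    """Parse noise words from text; one word per line or space-separated (assumed format)."""
    return {word.lower() for word in noise_file_text.split()}


def remove_noise_words(words, noise_words):
    """Keep only the words that are not in the stop list (case-insensitive)."""
    return [word for word in words if word.lower() not in noise_words]


if __name__ == "__main__":
    noise = load_noise_words("a\nthe\nthis\nof")
    print(remove_noise_words(["This", "is", "the", "index"], noise))
```

Because the stop list is applied at filtering time, changing `noise` here would not alter anything that had already been indexed, which mirrors the need to rescan after editing the list.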
Word list generation
Once the text of a document has been processed by the word breaker, the resulting words are placed in a word list. A word list is a temporary in-memory index that holds the data for a small number of documents. Several word lists can exist at any given time. You can use the word list parameters to make the classic trade-off between memory and speed. MaxWordLists and MaxWordlistSize control the maximum memory used by word lists. MaxWordLists is the maximum number of word lists that Index Server keeps in memory before it starts a shadow merge. MaxWordlistSize is the maximum amount of memory that a word list can use. Increasing the memory available to word lists reduces the number of shadow merges and therefore the amount of disk activity. Conversely, reducing the memory available to word lists increases disk activity. Two other parameters, MinWordlistMemory and MinSizeMergeWordlists, also help control the memory used by word lists. MinWordlistMemory is the minimum free memory required for word list generation. MinSizeMergeWordlists is the minimum combined size of the word lists that triggers a shadow merge.
Shadow index generation
A shadow merge is performed when the number of word lists exceeds MaxWordLists or when their combined memory exceeds MinSizeMergeWordlists. Because word lists are in-memory data, they can be built quickly, but they are not well compressed and they do not survive a shutdown and restart of Index Server. Persistent data solves these problems, and the first step in that direction is the generation of a shadow index. A shadow index is a persistent index created by merging word lists or by merging other shadow indexes. The process of creating a shadow index is called a shadow merge. This quick operation makes the word lists persistent and frees the memory they used. The source indexes for a shadow merge are usually word lists. However, if the total number of shadow indexes exceeds MaxIndexes, some of them are also used as source indexes for the shadow merge.
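The two trigger conditions for a shadow merge can be written down as a single predicate. This is a sketch only: the text does not give the units of the registry values, so the sizes below are in one arbitrary unit, and the function name is hypothetical.

```python
def needs_shadow_merge(word_list_sizes, max_word_lists, min_size_merge_wordlists):
    """Return True when either condition for a shadow merge holds.

    word_list_sizes holds the memory used by each in-memory word list.
    All sizes use the same (unspecified) unit.
    """
    too_many_lists = len(word_list_sizes) > max_word_lists
    too_much_memory = sum(word_list_sizes) > min_size_merge_wordlists
    return too_many_lists or too_much_memory


if __name__ == "__main__":
    print(needs_shadow_merge([10, 10], 20, 1024))    # neither limit reached
    print(needs_shadow_merge([10] * 21, 20, 1024))   # too many word lists
    print(needs_shadow_merge([600, 600], 20, 1024))  # combined size too large
```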
A special kind of shadow merge, called an annealing merge, is performed when the system has been idle for a certain amount of time and the number of persistent indexes exceeds MaxIdealIndexes. The MinMergeIdleTime registry parameter specifies the percentage of time the system must be idle, measured over the interval given by the MaxMergeInterval registry parameter, before an annealing merge is started. An annealing merge improves query performance and reduces the disk space used by reducing the number of shadow indexes.
Master index generation
The master index is the final destination of all the word data generated by Index Server. It is a well-compressed persistent data structure designed to resolve queries efficiently. A master merge builds a new master index from all existing shadow indexes and the current master index, if there is one. A master merge takes considerable time and disk I/O. When the merge is complete, resources are released and the source shadow indexes are deleted. As a result, queries run faster than before.
A master merge is a resource-intensive process, so it must be robust and controllable. The pace of the merge can be controlled with the ThreadPriorityMerge registry parameter. If you are not satisfied with the current pace, you can stop Index Server and change this parameter. When Index Server restarts, the merge continues. A master merge can also survive unexpected events such as a full disk or an abrupt system shutdown. After a restart, the master merge resumes from where it left off. Index Server writes an event to the event log whenever a master merge starts, restarts, or pauses.
You can influence when a master merge starts by adjusting several parameters. A master merge is started for the following reasons.
A master merge is performed daily as routine maintenance, at a specific time of day. The time is given by the MasterMergeTime registry parameter as the number of minutes after midnight. The default is midnight. Set this value to the time at which the load on the server is lowest.
The number of documents changed since the last master merge is called the fresh count. The larger the fresh count, the more memory Index Server uses. The maximum fresh count is controlled by the MaxFreshCount registry parameter. When the fresh count exceeds MaxFreshCount, a master merge is performed to bring the fresh count back to zero, which reduces the memory used by Index Server. Adjust MaxFreshCount according to the memory available. The larger the value of MaxFreshCount, the more memory is used and the faster indexing proceeds.
Word lists take up memory, while shadow indexes take up disk space. For a large or dynamic corpus, a lot of disk space may be temporarily occupied by shadow indexes. To avoid running out of disk space, the disk space used by shadow indexes is controlled with the MinDiskFreeForceMerge registry parameter. If the free disk space on the catalog drive falls below MinDiskFreeForceMerge and the cumulative space occupied by shadow indexes exceeds MaxShadowFreeForceMerge, a master merge is started.
A master merge is also started when the disk space occupied by shadow indexes exceeds the MaxShadowIndexSize registry parameter. This condition takes precedence over the previous one.
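The automatic triggers described so far can be collected into a small sketch. It is an illustration under stated assumptions: the text does not give the units of the size parameters, so all sizes share one arbitrary unit, and the function names, dictionary keys, and sample values are hypothetical.

```python
from datetime import time


def merge_time_of_day(master_merge_time):
    """Convert MasterMergeTime (minutes after midnight) to a clock time."""
    hours, minutes = divmod(master_merge_time % (24 * 60), 60)
    return time(hours, minutes)


def master_merge_reasons(state, params):
    """List which size or count conditions currently call for a master merge."""
    reasons = []
    # Checked first because the text gives this condition precedence.
    if state["shadow_index_size"] > params["MaxShadowIndexSize"]:
        reasons.append("shadow indexes too large")
    if (state["disk_free"] < params["MinDiskFreeForceMerge"]
            and state["shadow_index_size"] > params["MaxShadowFreeForceMerge"]):
        reasons.append("low disk space")
    if state["fresh_count"] > params["MaxFreshCount"]:
        reasons.append("fresh count exceeded")
    return reasons


if __name__ == "__main__":
    params = {"MaxShadowIndexSize": 500, "MinDiskFreeForceMerge": 100,
              "MaxShadowFreeForceMerge": 200, "MaxFreshCount": 5000}
    state = {"shadow_index_size": 250, "disk_free": 80, "fresh_count": 10}
    print(merge_time_of_day(150))
    print(master_merge_reasons(state, params))
```

With a value of 150, the daily merge would run at 02:30; a value of 0 corresponds to the default of midnight.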
Finally, you can force a master merge with the administration tools. This is appropriate when you expect a heavy query load. Although a master merge is resource-intensive, the end result is a better query response time.
Monitoring status and performance
There are three ways to administer Index Server: the Index Server snap-in for Microsoft Management Console (MMC); the Index Server administration pages (available from the Index Server program group); and direct editing of the registry with a registry editor such as Regedit. The MMC snap-in is probably the simplest to use. It is also the recommended tool, because future Windows NT administration tools will be MMC-based snap-ins, so using it now gives you a head start. All three tools can administer Index Server on a remote machine. This article discusses only the Index Server MMC snap-in.
With the snap-in, you can complete the following tasks:
Create and delete catalogs
Start and stop the service
Monitor the status of all catalogs
Add and remove directories, and start scans on directories
Modify the set of properties stored in the property cache
After a successful installation, Index Server creates an item labeled "Index Server Manager" in the Microsoft Index Server program group, which is located under the Windows NT 4.0 Option Pack program group. You can use this item to start the Index Server snap-in. You can also add the Index Server snap-in to MMC yourself with the "Add/Remove Snap-in..." menu item, which is under the Console menu. Click the Add... button to open the Add Standalone Snap-in dialog box, and select Index Server from the list.
Catalog management
Creating a catalog with the snap-in is simple: supply a catalog name and specify the location for the index files. You can add directories and modify the property cache later. The snap-in stores all details of the catalog in the registry and creates a physical Catalog.wci directory at the specified location. Deleting a catalog is even simpler: right-click the catalog icon and choose to delete it. The snap-in then removes all of the catalog's entries from the registry and deletes the Catalog.wci directory. If Index Server is running when a catalog is deleted, the snap-in waits for it to stop before performing the physical deletion.
Path management
Once you have created a catalog, you can define your corpus. Index Server 2.0 supports content managed by a Web server, by a Network News Transfer Protocol (NNTP) server, and by the Windows NT operating system. You can include content from any of these sources with the snap-in.
To include content managed by Web and NNTP servers, open the catalog's Properties dialog box and select the Web tab. To index a Web site, check the option for tracking virtual roots and select the virtual server you want to track. To index an NNTP site, check the option for tracking NNTP roots and select the NNTP server you want to track.
To include content managed by the Windows NT file system, open the catalog folder and locate the directories subfolder beneath it. Directories (paths) are added through an add-directory dialog box that you open from that subfolder. The dialog box also lets you select remote paths.
Property cache management
If your result sets or property value queries regularly use certain properties, those properties should be stored in the property cache. With the Index Server snap-in, you can view all known properties and their definitions, and you can add properties to the cache or remove them from it. The known properties can be listed only while Index Server is running.
All cached properties have a nonzero value in the Cached Size column. Properties with a blank or zero value are not stored in the cache. Open the Properties dialog box for the property you are interested in. To put the property in the cache, check the "Cached" box and supply the size of the property. Except for string properties, most data types have a fixed size, so the size is easy to determine. For string properties that exceed the specified size, Index Server handles the overflow. This means that you do not have to specify the maximum size for a string property; simply choose an average size. Choosing a larger value wastes space and lowers efficiency. Choosing a smaller value does not waste space, but efficiency drops because of excessive overflow handling. The choice of string size matters all the more because the HTML filter installed with Index Server 2.0 reports the values of HTML meta tags only as strings. Most of the content on a typical Web site consists of HTML documents, so the custom properties of those documents are reported as strings. A sensible choice of size for each string property therefore makes a noticeable difference.
Commit the changes once you have made all your changes to the cached properties. The commit option is in the Task menu, which is part of the menu that pops up when you right-click the "Properties" subfolder. Committing applies the changes. Index Server creates a new property cache that reflects the new set of cached properties and copies the existing cached property values of every indexed document into the new cache. This process is quite expensive, so you should keep the number of commits to a minimum. All modifications can be committed together in a single batch.
A document filter extracts properties during filtering. Therefore, when you add a new property to the set of cached properties, the value of that property is empty for all previously filtered documents. Only documents filtered afterward get the appropriate value from the filter. This leads to incorrect result sets, because property values that users expect to be non-empty are in fact empty. To avoid this, you must perform a full scan on all directories so that the documents are refiltered. You can start a scan from the Task menu, which pops up when you right-click a directory. If your commit only removes existing properties from the cache, you do not need to do anything. Those unwanted properties have been removed from the cache and will not be stored in it again.
Monitoring
Index Server provides performance counters for the filtering process and the indexing process. These counters can be viewed in the Windows NT Performance Monitor, PerfMon.exe.
The counters associated with filtering are divided between two objects. The counters that track file progress are under the Content Index object. They are the number of documents filtered, the files still to be filtered, and the total number of documents. The counters directly related to filtering are under the Content Index Filter object. They are the binding time (in milliseconds), the filter speed (per hour), and the total filter speed (per hour).
The counters related to indexing and merging are under the Content Index object. They are the index size, the number of persistent indexes, the merge progress (a percentage), and the number of word lists in memory.
Event log messages
Index Server system errors are reported to the Windows NT application event log under the Ci Filter Service and Ci Service categories. The reports cover filtering problems, out-of-resource conditions, index corruption, and so on. The Index Server documentation includes a detailed list of all messages and the appropriate action for each.
Catalog design
It is easy to create and delete catalogs through the Index Server snap-in. But catalog design is not as simple as it looks. Unless you are building a prototype search solution or working with a small document set, you need to spend some time designing your catalog configuration with usage, performance, size, and maintenance in mind. The following discussions of selecting hardware, deciding between a single catalog and multiple catalogs, and planning for catalog growth cover topics specific to Index Server. Many other topics apply to any Windows NT-based server. Although they are important, they are not discussed here. You can find discussions of these topics elsewhere in the MSDN Library under Windows Resource Kits.
Hardware configuration
Index Server can make effective use of multiple processors and large amounts of RAM. The more resources it has, the better it works. Index Server performs a lot of disk I/O. In a typical configuration, disk I/O is more of a bottleneck than any other factor. The faster the disk drive and the bus, the faster the indexing process. You can also use disk striping to increase I/O throughput. Use the Windows NT File System (NTFS) for both your documents and your catalog. It offers better reliability, performance, and security than the FAT file system.
What about disk mirroring? The Index Server index can be regenerated completely from the registry information and the documents. Therefore, disk mirroring is not as important for the Index Server catalog as it is for the documents themselves. Because mirroring sacrifices throughput, the cost may not be worth it. On the other hand, mirroring does reduce the likelihood of losing the index data. In an environment where the server cannot tolerate downtime during working hours, that may be a good reason to mirror. After all, recovering the index from a mirrored disk is much faster than rebuilding it from the documents.
One catalog or multiple catalogs?
Whenever possible, you should use a single catalog instead of multiple catalogs. There are several benefits. In some situations multiple catalogs look like a good choice, but a closer look will usually lead you back to a single catalog. We first discuss the benefits of using a single catalog, and then discuss the cases where multiple catalogs seem attractive.
The main advantage of using a single catalog is simplified management. The initial installation, configuration, and daily maintenance of one catalog do not require much management. If you have to repeat them several times, even trivial tasks add to the workload. Managing a single catalog is no harder than managing any one of several.
Another advantage of using a single catalog is improved performance and reduced overhead. Every active catalog is loaded into memory while Index Server is running. If the content that was spread over multiple catalogs is consolidated into a single catalog, Index Server concentrates its resources on that one catalog, which improves performance.
Finally, if you want your documents to be searchable as a single entity, the best way is to use a single catalog. Index Server 2.0 cannot merge result sets from multiple catalogs. You can run the query against each catalog and use script to merge the individual result sets, but this adds time to the query round trip. If you have a single catalog, Index Server handles sorting and merging internally. This is more efficient than merging in script.
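To make the cost of script-based merging concrete, here is a minimal sketch of what such a script has to do. It is an illustration only and does not use any Index Server API: the result sets are hypothetical lists of (rank, path) pairs, assumed to arrive from each catalog already sorted by rank in descending order.

```python
import heapq
from itertools import islice


def merge_results(result_sets, limit=10):
    """Merge per-catalog hits, each sorted by rank descending, into one list."""
    merged = heapq.merge(*result_sets, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


# Hypothetical hits from two separate catalogs.
engineering = [(900, "/eng/spec.htm"), (400, "/eng/notes.htm")]
legal = [(750, "/legal/contract.htm"), (100, "/legal/memo.htm")]

top = merge_results([engineering, legal], limit=3)
print(top)
```

Every query has to wait for all catalogs to answer before the merge can finish, which is the extra round-trip time mentioned above. With a single catalog this step disappears.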
If you have a document set so large that it is difficult to handle on one server, you cannot use a single catalog. In this case, a scalable solution such as Microsoft Site Server Search is better than joining catalogs with ad hoc scripts.
A typical organization, regardless of size, has to meet the needs of several sets of users. For example, your organization provides technical documents to the engineering department and legal documents to the legal department. The document sets of these two user groups are independent of each other. It may seem that the only way to handle this situation is to create one catalog per department. In fact, you do not have to. Index Server lets you partition your documents by path. You can use a single catalog and give each document set its own search script, similar to the other but restricted to the appropriate path. If security is a concern, set the correct permissions on the documents. Index Server honors Windows NT security and does not return documents that the user is not allowed to access.
Need more convincing? The same organization may have a third document set, news articles that are of interest to all groups. With multiple catalogs, you either create a third catalog or index the same articles twice, once in each catalog. With a single catalog, you need only the one catalog, and both search scripts can cover the shared documents.
If you do need to build multiple catalogs on a server, remember to deactivate a catalog when you do not need it. The CatalogInactive parameter under the child key of Catalogs /
If your document set is very large, consider a more powerful server before splitting the data across several smaller servers. The cost of one appropriately sized server is lower than the cost of multiple servers.
Catalog growth
It is inevitable that your data will keep growing, so the index must keep up with the growing document set. An important limitation of Index Server 2.0 is that an index can reside on only a single drive. If that drive fills up, the index cannot grow, even if plenty of free space remains on other drives. Choose a drive with enough space for the index for the foreseeable future. So how much disk space is needed? That depends on the size of the original text in your documents and on the properties you store in the property cache. With the default settings, the Index Server index is expected to be about 40% of the data size.
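The 40% rule of thumb can be turned into a quick capacity check. The sketch below is illustrative only: the function names and the growth allowance are assumptions, and the real ratio depends on your text and cached properties as noted above. Integer arithmetic is used so that byte counts stay exact.

```python
def estimate_index_bytes(corpus_bytes, ratio_percent=40):
    """Estimate index size as a percentage of the corpus size (default 40%)."""
    return corpus_bytes * ratio_percent // 100


def drive_has_room(corpus_bytes, free_bytes, growth_percent=100, ratio_percent=40):
    """Check whether the index fits on the drive after the corpus grows.

    growth_percent=100 means no growth; 200 means the corpus doubles.
    """
    projected_corpus = corpus_bytes * growth_percent // 100
    return estimate_index_bytes(projected_corpus, ratio_percent) <= free_bytes


GIB = 1024 ** 3
print(estimate_index_bytes(10 * GIB) // GIB)                   # index size in GiB
print(drive_has_room(10 * GIB, 5 * GIB))                       # today
print(drive_has_room(10 * GIB, 5 * GIB, growth_percent=200))   # corpus doubles
```

In this example a 10 GiB corpus needs roughly 4 GiB of index, which fits in 5 GiB of free space today but not once the corpus doubles.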
Another aspect of data growth is the introduction of new document formats. Before introducing a new document format, make sure that a document filter is available for that format.
Troubleshooting points
Most troubleshooting starts when a query fails to return a document that is known to exist and to contain the keywords used in the query. If a set of documents is not found, use the following troubleshooting procedure. Read all the steps before you begin, because a later step may fit your environment better.
Check the application event log for events generated by CI. If necessary, correct the errors.
Does the file have an extension that is not filtered? Use the filter enumeration tool FiltReg to see what is registered for Index Server in the registry.
Is the file password-protected? Filters cannot access password-protected files.
Is Index Server still scanning? If so, the missing file may simply not have been indexed yet. Wait for the scan to complete (see the catalog status reported by the MMC snap-in).
Does Index Server return "index is out of date" in response to the query? If so, the missing file may not have been filtered yet. Wait until all files have been filtered.
Use the administration page to query for unfiltered documents. This lists the documents that Index Server could not filter. A file may have an extension that is not recognized and therefore cannot be filtered. Or the filter may have failed because the file format was not recognized or the file is corrupt. Index Server also protects itself from documents that hang the filter or that emit an excessive amount of data for their size.
Do all the missing files come from the same path? If so, that path may not have been scanned. Check whether the path is covered by the path settings. Note that paths under catalog.wci are not filtered even if they are covered. If the path is covered, check that it is accessible with programs such as Windows Explorer and DIR. If everything looks normal, force a scan on all affected paths.
Is the path that contains the missing files a remote path? If so, it may not be indexed because of an incorrect user name. When you enter the logon ID for a remote path, specify the domain name and user name in domain\user format. Note that if the account is local to a computer, the domain name is actually the name of that computer.
Run FiltDump on the file and check the output. Do all the keywords appear in the dump? If they do not, the filter should be checked.
If the words are in the filter output, they may not be indexed because they are treated as noise words. Check the noise word list and change it if necessary.
Are different languages used to filter the files and to issue the queries? Word breaking is a language-dependent process. When a query is issued in one language against documents filtered in another, the query results are unpredictable. Some file formats, such as Microsoft Word, mark the document with a language identifier, which the filter uses. Other formats, such as plain text, carry no language identifier. For these documents, most filters default to the locale of the system where the files reside. The locale of the query is set explicitly with CiLocale. If CiLocale is not specified, the locale of the browser is used if it is available, and otherwise the locale of the server.
Every file may be indexed properly, yet you may still not get the expected results because you do not have the right permissions.
The query may be malformed, or Index Server may consider it too complex. In both cases, you should receive an error to that effect. Refer to the Index Server documentation to recognize such errors and to learn how to avoid them.
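The procedure above is an ordered checklist: work down the list and act on the first condition that applies. A minimal sketch of that idea follows. The observation keys are hypothetical names chosen for illustration, and the script does not query Index Server in any way; you would fill in the observations by hand from the event log, the snap-in, and the tools named above.

```python
# Ordered (observation, action) pairs that follow the steps above.
CHECKS = [
    ("event_log_errors", "Correct the errors reported in the application event log."),
    ("extension_not_filtered", "Check the filter registrations with FiltReg."),
    ("password_protected", "Filters cannot access password-protected files."),
    ("scan_in_progress", "Wait for the scan to complete."),
    ("index_out_of_date", "Wait until all files have been filtered."),
    ("listed_as_unfiltered", "Check the file extension, format, and integrity."),
    ("path_not_scanned", "Check the path settings and force a scan."),
    ("remote_logon_wrong", "Re-enter the logon ID in domain\\user format."),
    ("keywords_missing_in_dump", "Check the filter using the FiltDump output."),
    ("keywords_are_noise_words", "Review the noise word list."),
    ("language_mismatch", "Set the query locale with CiLocale."),
    ("no_permission", "Check the permissions on the documents."),
]


def first_action(observations):
    """Return the action for the first check that applies, or None."""
    for key, action in CHECKS:
        if observations.get(key):
            return action
    return None


print(first_action({"scan_in_progress": True, "no_permission": True}))
```

Because the list is ordered, cheaper and more common causes are ruled out before the ones that require running extra tools.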
Acknowledgments
I want to thank David Lee, Kyle Peltonen, and Susan Dumais for their valuable input.