The Standard Librarian: Searching in The Standard Library
Matthew austern
http://www.cuj.com/experts/1911/AUStern.htm?topic=Experts
The Genius As Well As The Standard C Library Surface In A Simple Discussion of Its Linear and Binary Search Algorithms.
-------------------------------------------------- ------------------------------
If you store a series of elements, perhaps one of the reason is that you can find some objects in it. Finding a special entry in a collection is a very important issue; all books will write it. Not surprising, the standard C runtime provides many different search technologies.
In the C run library, specifying a common method of a collection is to indicate the interval through the Iterator. The interval can be written as [first, last), here, * first is the first element in the interval and Last points to the next element. This article shows how we consider a general issue: given a range and a criterion, find the iTerator pointing to the first element that meets the guideline. Because we expressed the interrupt as an asymmetrical form, so as to avoid a special case: the search fails can return to LAST, and there is no elements can be written as [I, I).
Linear search and its variant
The simplest search is linear search, or, such as Knuth, sequential search: Sequentially view each element, check if it is the one we are searching. If there is n elements in the interval, the worst case requires N times comparison.
The standard runtime provides some versions of linear search; two most important versions are Find () (which accepts one interval and value x, and finds an elements equivalent to X) and Find_IF () (accepting a range and decision) P and find an element that meets P. Other versions of linear search are Find_First_of () (accept two intervals, [first1, last1), and [first2, last2), and find any of [first1, Last1) any one in [First1, Last2) Elements of elements and adjacent_find () (accept single intervals, and find the first elements equivalent to its backup elements).
For example, if V is an int. You can find the first 0:
Vector
Find (v.begin (), v.end (), 0);
You can find the first non-0 value like this:
Vector
Find_if (v.begin (), v.end (),
NOT1 (BIND2ND (Equal_to
0))));
You can find the first small number:
A [4] = {2, 3, 5, 7};
Vector
Find_first_of (v.begin (), v.end (),
A 0, A 4);
You can find the first repetition:
Vector
Adjacent_find (v.begin (), v.end ());
There is no independent version to reverse search to the interval, because you don't need: you can use a simple Iterator Adaptor to reach the same effect. For example, find the last 0 in V, you can write: Vector
Find (v.rbegin (), v.rend (), 0);
Linear Search is a simple algorithm that doesn't seem to be discussed. Many books (including me) show a simple implementation of std :: find ():
Template
Initer Find (Initer First, Initer Last,
Const T & Val
{
While (first! = last &&!
(* first == VAL)))
first;
Return first;
}
This is indeed a loyal implementation of the linear search algorithm to meet the requirements of C standards; the name of the first template parameter, initer means that the argument only needs to be very weak in Input Iterator [Note 1]. It seems to be so simple, so that it is better to write directly in the code. Even so, there is still an annoying problem: this implementation does not reach its efficiency. The circulatory conditions are complicated and need to be two tests for each element obtained. Conditional branches are expensive, and complex cycle conditions will not be equally optimized with simple cycle conditions.
One of the answers to the question, and is "unwaped" cycle by some standard runners, it is "unlocked" cycle, each checks 4 elements. This is a more complex solution because Find () must then process the residual element (the interval will not always be a multiple of 4!), And it also requires Find () to decompose based on the type of Iterator - "Unope" can only Working in the interval of the Random Access Iterator indication, the general situation is still required to use the old implementation. However, "unopened" is effective: it will drop the number of tests from each element from 2 to 1.25. It is a technique that realizes the standard library does not need to change any interface.
You will see a different answer in common books that describe algorithms. The reason for the two tests for each element is that if the end of the interval has not found the elements you want to find, we must admit that it has failed. But if we happen to find the elements you want to find, how do you do not fail? In that case, the test ends for the interval is redundant; there is no reason to think that the search algorithm should first master the information of the interval (There Wouldn't Be Any Reason for the Search Algorithm to Keep TRACK OF THE END OF The Range in The First Place. Substitution std :: find (), we can implement linear search algorithms as follows:
Template
Initer
Unguarded_find (Initer First,
Const T & Val
{
While (! (* first == val)))
first;
}
Knuth's linear search version [Note 3] is closer to unguarded_find () instead of std :: find (). Note that unguarded_find () is not part of the C standard. It is also slightly poor in danger than Find (). You can only use it to make sure there is an element equivalent to VAL - this usually means that you have put that element inside and as a sentry ended. The use of the sentinel is not always established. (If you are searching for a read-only interval?) But when it is available, Unguarded_Find () is faster and simpler than all things in the standard library. Two-point search
Linear search is simple, and for cells, it is the best way. However, if the interval is getting longer, it is no longer a reasonable solution. When I recently used XSLT, I remember this question. My XSLT script includes a line similar to this:
I am using the XSLT engine that I will run this is definitely used for linear searches. I searched in a list and perform this search for each entry in the list. My script is O (N2), it takes a few minutes to run.
If you are searching a completely general interval, you can't do better than linear searches. You must check each element, otherwise you may miss it, you are looking for. But if you ask this section to organize some way, you can do better.
For example, you can ask the interval to be sorted. If there is a sequential area, you can use a linear search for an improvement version (when you reach a bigger element than the elements you are looking, you don't need to continue to the interval, you can know that the search has failed), but better The method is to use two-point search. By looking at the elements of the interval, you can say that the searched elements are still in the first half or the second half; repeat this decomposition process, you don't need to traverse all elements can find the elements you want to find. Linear search requires O (n) comparison, while the two-point search only requires O (log n).
Standard Runlation Contains four different versions of the two-point search: Lower_Bound (), Upper_Bound (), Equal_Range () and binary_search (). They all have the same form: accept one interval, an element that tries to look for, and optional comparison functions. The interval must be sorted according to this comparison function; if the comparison function is not provided, it must be sorted according to the usual "<" calculation. So, for example, if you are searching in an ascending INT array, you can use the default behavior. If you search in a descending Int array, you can get into a std :: greater
In the four two-point search functions, the most useless one is the most natural place: binary_search (). It returns to a simple YES or NO: Returns true when existing in the interval, otherwise false. But there is no use of such a message; I have never encountered anything to use binary_search (). If you want to search for elements, you may want to know its location; if you don't exist, you may want to know if it exists, this location is where it exists.
With regard to the location of the element, you can ask a few different questions, and this is the reason for several different versions of the two-point search. When the same elements have several copies, they are important. For example, if you have an INT array, then use Lower_Bound () and Upper_Bound () to find the same value:
INT A [10] =
{1, 2, 3, 5, 5, 5, 5, 7, 8, 9}; int * first =
Std :: Lower_Bound (A 0, A 10, 5);
INT * last =
Std :: Upper_bound (A 0, A 10, 5);
Name First and Last hintd the difference: Lower_Bound () Returns the first value you are looking for (for this example, is & a [3]), and Upper_bound () returns the next Iterator you are looking for (this Example, is & a [7]). If the value you searched does not exist, you will get if it exists, you should be in place. As with the previous, we can write:
INT * first =
Std :: Lower_Bound (A 0, A 10, 6);
INT * last =
Std :: Upper_Bound (A 0, A 10, 6);
First and last will be equal to & a [7] because this is 6 unique position that can be inserted when not brewing.
In practice, you can't see the call of LOWER_BOUND () immediately follows a Upper_Bound (). If you need these two information at the same time, it is the reason for introducing the last two-point search algorithm: equal_range () Returns a pair, the first element is the value to return, the second element is Upper_Bound () return value.
Until at this time, I deliberately compacted in the discussion: I said Lower_Bound () and Upper_Bound () find a value, but did not correctly explain its meaning. If you write
Iterator i =
Std: Lower_Bound (First, Last, X);
And search success, do you guarantee * i and x equal? Not necessarily! Lower_bound () and Upper_bound () are tested from non-equivalence (WQ Note: Logic et al., using Operator == ()). They use the comparison function you provide: Operator <() or some other functionality like "Less Than" comparison function. This means that the search success (WQ note, ie equivalent, mathematical phase) is searched when * i is not less than X and X is not less than * i.
This difference looks like it is blown, but it is not. If you have a complex record with a lot of fields, you use a field as a sorted Key value (for example, the surname). Two records may have the same Key value, so even which one of them is different, which is not less than one.
Once you start thinking of records and key values, another problem of binary search is natural: Can you search for records according to KEY? More specifically, suppose we define a struct x:
Struct x {
Int ID;
... // Other Fields
}
Then there is a vector
One way is to create a specified ID dummy X object and use it in two-point search:
X Dummy;
Dummy.id = 148;
Vector
= Lower_Bound (v.begin (), v.end (),
Dummy,
X_compare);
At present, this is the most reliable method. If you care about the maximum portability, it is what you should use. On the other hand, it is not very elegant. You must create a complete X object, although you need only one of the fields; other fields have to be initialized to default or random values. The initialization may be inconvenient, expensive, or sometimes it is impossible. May pass the ID directly to LOWER_BOUND ()? Perhaps, by incorporating a heterogeneous comparison function, it accepts a x and one ID? This issue has no simple answer. The C standard does not fully explain whether such a heterogeneous comparison function is allowed to be in accordance with it, and the most natural reading of the standard is not allowed. In today's practice, the heterogeneous comparison function is feasible in some practical, and there is still some. On the other hand, the C Standardization Committee considers this to be a defect, and in the future version standard will clear whether heterogeneous comparison functions [Note 4].
to sum up
The C runtime also provides some other form of search algorithms. Using Find () and Lower_Bound (), search is limited to a single element, but the standard runtime also provides serch (), which looks for the entire sub-region. For example, you can search for a word in a string:
Std :: string the = "the";
Std :: string :: Iterator I
= std :: search (s.begin (), s.end (),
The.begin (), the.end ());
Return value, i, will point to "the" in the beginning of the first appearance in S - or, and it is the same, if "the" does not exist will return S.End (). There is also a variant to start searching from the tail:
Std :: find_end (s.begin (), s.end (),
The.begin (), the.end ());
It returns an item, pointing to the beginning of "The" last appearance, not the first one. (If you think this is very strange, Search's reverse variant is called find_end () instead of search_end (), then you are not lonely.)
Search can be encapsulated into the data structure. Most Obviously, the associated container, SET, MULTISET, MAP and MULTIMAP of the standard runtime library are specifically designed to search according to Key, which will be highly efficient [Note 5]. The String class of the running library also provides a number of members of the search: Find (), RFind (), Find_First_of (), Find_Last_Of (), Find_First_not_Of () and Find_Last_not_OF (). I recommend avoiding them. I found that these special member functions are difficult to memorize because they have so many forms, and the interface form is different from the other parts of the running library; no matter what, they will not provide anything that cannot be from Find (), Find_if (), Search () Get the function.
However, if you still think that you have seen some important omissions, you are correct! I didn't mention the Hash table because there was no Hash table in the standard running library. I mentioned the sub-zone matching of Search (), but it is only a special example of pattern matching - the standard run library does not have a regular expression search or anything similar.
The C Standardization Committee has just begun to expand the standard run library, while the Hash table and regular expression is a priority candidate for future versions. If you think that the standard runtime lacks, and you want to submit a proposal, now you should start preparation.
Note
[1] See Table 72 in The C Standard. Some of the Other Search Algorithms, Which I Discuss Later, Rely On The Stronger Forward Iterator Requirements. [2] See, for Example,
[3] See "Algorithm Q," IN § 6.1 of D. E. Knuth, The Art of Computer Programming, Vol. 2, Sorting and Searching, Second Edition (Addison-Wesley, 1998).
[4] See
[5] But The Containers Aren't The Most Efficient Choice As Offense AS One Might Think. See My YOARLIER Column "Why You Shouldn't Use Set - And What You Should Uses Instead," C Report, April 2000.