CUJ: Standard Library: Sort Algorithm in Standard Library

zhaozj2021-02-16  48

The Standard Librarian: Sorting in The Standard Library

Matthew austern

http://www.cuj.com/experts/1908/AUStern.htm?topic=Experts

-------------------------------------------------- ------------------------------

When you have a data sequence, one of your most wants to do is sorted. Sorting data makes it easy to understand, and sorting is the first step of many generic algorithms - even calculating a series of digital and such a micro-end algorithm. Each programming system provides several forms of order; standard C runtime provides 6! (Or possibly, depending on how much you have.) How much differences they have, and when you should use some instead of another?

Sort by generic algorithm

C standard 24 chapter has a section called "Sorting and Related Operations". It contains many operations on the sequential interval, and three sort generic algorithms: sort (), stable_sort (), and partial_sort ().

The first two, sort () and stable_sort (), in nature have the same interface: like the usual STL generic algorithm, incoming the Random Access Iterator that indicates the interval that needs to be sorted. Similarly, as an option, you can also provide the third parameter to indicate how to compare elements: The third parameter is a functor, accepts two parameters (x and y), and returns TRUE when x should be in Y. So, for example, if V is an Int,

Std :: sort (v.begin (), v.end ());

It will be sorted in ascending order. To change to descending, you need to provide different comparison methods:

Std :: sort (v.begin (), v.end (), std :: greater ());

Note that we are using Greater as a third parameter, not greater_equal. This is important, it is a prerequisite for all STL sorting algorithms: The comparison function must return false at the same time (WQ Note: See "Effective STL" ITEM 21). To a certain extent, this is too arbitrary: it looks completely useful to use "<" or "<=" to express a sorting algorithm. However, make a clear choice is required, and always adhere to this choice. The standard C runtime selection the former. If you pass it to Sort () an equivalent parameter to return true, you will get unpredictable, completely dependent on the results of the specific implementation; in some systems, it seems to work, but in other systems It may cause an infinite loop or memory protection error.

For relatively simple occasions, use stable_sort () instead of sort (), you won't see a lot of differences. Like sort (), stable_sort () sort the [First, Last) interval, default is ascending, and, like you can provide a comparison function to change the order. If you read the C standard, you will see two differences between sort () and stable_sort ():

L as indicated by the name, stable_sort () is stable. If the two elements compare the results are equivalent, Stable_sort () does not change their relative order. l The performance of the performance is different.

The first point, stable, easy to understand. It usually does not matter for int type (a "42" and another "42" identical), but sometimes it is very important to other data types. For example, you are sorting TASK based on priority, you may expect high priority Task before being mentioned before the low priority Task, but the same priority task keeps their original order. Sort () does not have this commitment, and Stable_sort () is committed to this.

Standard's description of performance is not intuitive: sort () and stable_sort () commitments are complex and are not complete. The standard says that "average" complexity of sort () is O (n log n), and there is no worst case; the complexity of Stable_Sort () is O (N (log n) 2), but there is enough additional memory The same is O (n log n).

How to figure out all these different situations?

Performance's commitment is to link together with special implementation skills, and if you know what these skills are, this promise is more understanding. Allows Sort () to be implemented in a fast sort (recursive segmentation section), while stable_sort () is sorted by merge (in-order and sequence sub-section) [Note 1]. Rapid sorting is one of the fastest sorting algorithms known, complexity is almost always O (n log n), but for some special sequences is O (N2); if the standard total requirements Sort () is O (n log n) Will not be used quickly. Similarly, the two sub-intervals will be easily realized when there is an additional buffer with a copy result; normal sorting is O (n log n) when the same auxiliary buffer that can use the original section, if any auxiliary buffer cannot be obtained It is O (n (log n) 2). If it can only use a smaller auxiliary buffer, complexity will be between the two - but, in reality, it is closer to O (n log n).

This explains what the standard says, but it is not full. Standard "If the behavior when the worst situation is important", you should not use sort (). However, the fact is not like the standard for the first time. Many standard runtime, including GNU G and Microsoft Visual C , now use a new variant of rapid sorting, called Introsort [Note 2] (WQ Note: See Houjie "STL Source Code"). Introsort is generally the same as fast sorting, but its complexity is always O (n log n). Unless you are concerned with the actual work of the old standard library, behavior when you are worst, no longer a reason to avoid using sort (). And, although Stable_Sort (usually) is also o (n log n), this O masks a lot of details.

In most cases, stable_sort () is much slower than sort () because it must do more for each element; this is the price you want to pay for "stability". Use stable_sort () to deal with equation elements to maintain the original relative order (for example, sorting according to priority, but to maintain the order of the same priority), use SORT () to process other Happening.

Another generic sorting algorithm is partial_sort () (including a small variety, partial_sort_copy ()). Its feature is slightly different, and the syntax is slightly different: it looks out and sorted the first N name elements in the interval. As in other cases, the default is to arrange as sequencing by "<", but it can provide a function to change it. So, for example, if you only interested in the top 10 elements of the sequence, you can find them: Partial_Sort (First, First 10, Last, Greater ());

Then, the maximum 10 elements will be accommodated in [First, Fist 10) (descending order), the remaining elements are accommodated in [First 10, Last).

There is a significant situation here: If you write partial_sort (first, last, last), you are asking for partial_sort () sorting the entire [first, last) interval. Thus, although the syntax is slightly different, you can still use partial_sort () in the same way using sort (). Is there such a reason to do? That is actually no. Check out the description of the C standard for complexity, you will see partial_sort () and sort, but also o (n log n), but again, this is an incomplete description. The operations of two O (n log n) do not have to have the same speed; for this example, sort () is usually much faster.

The existence of Partial_Sort () is for partial sorting. If you have a list of one million names, and you need to find a thousand, sort by letters. You can get this thousand names by sorting and ignoring the entire list, you can get this thousand names, but it will be very wasteful. Use partial_sort () or partial_sort_copy ().

Vector Result (1000);

Partial_sort_copy (names.begin (), Names.end (), Result.egin (), Result.end ());

When you use Partial_Sort () when you are interested in the front part of the sequential interval, and use sort () when you need to sort the entire interval.

If you have done to sort () and stable_sort (), it will be helpful to see how Partial_Sort () is achieved. The usual implementation is also recommended by the C standard settler. It is used in the input area to rearrange in a data structure called HEAP. HEAP is essentially a binary tree implemented by an array. It is easy to find the largest element. It is easy to maintain a valid HEAP when removing the maximum element. If we continuously remove the elements from a HEAP, you will leave the minimum N elements: it is exactly from partial_sort. If all elements are removed from the HEAP, a sequence interval will be obtained.

The standard C runtime contains a generic algorithm for direct manipulating HEAP, so instead of using partial_sort (), you can write:

Make_Heap (First, Last);

Sort_heap (first, last);

This looks and

Partial_sort (First, Last, Last);

Different, but it is not true. In both cases, you have used the stack sorting; the two forms should have exactly the same speed.

Finally, there is the last "generic" sort algorithm, inheriting from the C language: Qsort (). For "generic", it is because Qsort () is actually universal as sort (), stable_sort () and partial_sort (). It cannot be sorted with objects with constructor, virtual functions, base classes, or private data. C language does not know these features; QSort () assumes that it can locate the object, and this is only true for the easiest Class. It is also difficult to use QSort () in the template, because you must pass a parameter is a comparison function of the VOID * type, and know the exact type of object to be sorted inside this function. The C language standard does not make performance commitments to QSORT (). In the case of qsort (), it is usually more slower than sort (). (Mainly because of a simple reason: the interface of sort () can be designed to be inline, and the QSort () interface does not do this.) Unless it is a legacy code, you should avoid using QSort (); Sort () has a simpler and safer interface, which is relatively small, and faster.

Sort in special containers

I started to discuss generic algorithms because the basic principles of standard C runtime are things that unnecessary coupling. The algorithm is separated from the container. In the needs of the container, no sorting is mentioned; sorting a vector is a generic algorithm using a stand-alone at :: Vector:

Sort (v.begin (), v.end ());

However, any complete discussion of sorting in C must include a container. Typically, the container does not provide a sorting method, but some special containers are provided.

You can't sort a vector by writing v.sort () because std :: Vector does not have such a member function, but you can sort a list by writing L.Sort (). As usual, you can explicitly provide a different comparison function: if l is an int type list, then L.Sort (Greater ()) will be sorted in descending order.

In fact, List :: sort () is the only way to sort a list: std :: list only provides Bidirectional Iterator, but independent generic sorting algorithm (sort (), stable_sort () and partial_sort ()) request More powerful Random Access Iterators [Note 3]. This special member function list :: sort () does not use iTerator, which exposes the List to be implemented with each other. It uses a variant of the merged sort by connecting and unlinking the node instead of copying the elements in the list.

Finally, sort a set (or a multiset if you need a repetitive element) is the easiest: it is already sorted! You can't write sort (s.begin (), s.end ()), but you will never do this. A basic feature of Set is its elements in order. When INSERT () When a new element, it automatically places it in the appropriate location to maintain the sort state. Inside, SET uses a database that provides fast (Log N)) and inserted data structures (usually a balance tree).

Time test

Where is this? I have made some arguments for time, and we can also work more intuitive forecasts: For example, inserting an element in the SET will be slower than sorting a vector, because set is a complex data structure, it needs to use dynamic memory allocation Or, sorting a list should be almost the same as using Stable_Sort, which is all sorted variants. However, intuition can't replace time test. The measurement is difficult (more precisely, what do you measure, how to measure?), But this is necessary. The program of Listing 1 constructs a vector , the elements of the elements are arranged, and then each method we discussed (except qsort () (WQ Note: Original)) repeatedly sorts it. When you pass the vector to each test case, we deliberately use the pass value: we don't want any test case to get a sorted Vector!

With the Microsoft Visual C 7 beta compiler (similar to G similar), on 500M Pentium 3, the results are as follows:

Sorting a Vector of 700000 Doubles.

Sorting Method T (SEC)

Sort 0.971

Qsort 1.732

Stable_sort 1.402

HEAP SORT 1.282

List sort 1.993

Set sort 3.194

This is the expectation: sort () fastest; stable_sort (), stacking and QSORT () slightly slow; sort a list and set (it uses dynamic memory allocation, and is a complex data structure), slower. (In fact, QSort () has an unusual achievement: on the old version of G and VC, QSort () is only slow than sort ().)

But this is not enough to be called sorting benchmark - too uninterrupted. I am in a specific system, using a specific (dictated) initialization to test sorting a vector . Only by practicing can determine how much universality these results are. At least, we should do some tests outside of Double.

Listing 2 Sort a Vector : It opens a file and copies each line into a separate String. Because Qsort () cannot handle elements with constructor, QSORT () is no longer included in this test. As an input "Anna Karenina" "Anna Karenina" [Note 4] with Project Gutenberg, the result is:

Sorting a file of 42731 lans.

Sorting Method T (SEC)

Sort 0.431

Stable_sort 1.322

HEAP SORT 0.751

List sort 0.25

SET SORT 0.43

Suddenly, things have changed. We still see that sort () is much better than stable_sort () and stacks, but the results of List and SET have dramatic changes. The speed of using SET is almost almost the same, and the List is actually faster. what's happening?

The change is that Double is a native type, and std :: sting is a complex class. Copy a string or assign it to the other, meaning calling a function - even means that dynamic memory allocation is required or create a MUTEX lock. The balance point is changed; sort String is more impact than the number of assignments is more than sorted Double. When sorting a list, there is no calling operation at all: Define all the reason for a special list :: sort () member function is that it works by manipulating a pointer to the internal node of the List. The pointer to the Node that is reconfiguously connected is fast than copying a String. We discover an old scorpion say: If you are dealing with big records, you must never sort the record itself, and sort points to their pointers. Using List makes this automatic, but you can easily explicitly do this: not sorted the original Vector , replaced, create an index vector, the element type is vector :: const_iterator, Then sort this index vector. You must tell sort () how to compare the elements of the index vector (you must make sure that the value referred to by Iterator is not the Iterator itself), but this is just a small problem. We only need to provide a suitable comparison function object:

Template

Struct Indirect_lt {

Bool Operator () (Iterator X, Iterator Y) Const

{RETURN * x <* y;}

}

Listing 3 shows how to use Indirect_lt and compare the speed of direct ordering and indirect sorting. The increase in speed is significant.

Sorting a file of 42731 lans.

Sorting Method T (SEC)

Indirect Sort 0.29

Sort 0.401

Suggest

Standard C run library provides six sorting methods: QSort (), Sort (), Stable_Sort (), Partial_Sort (), Lsit :: sort (), and SET / MULTISET.

You should not use QSort () in the new code. It is slower than sort (). Its interface is ugly because it requires type conversion and is easy to use. Write a comparison function to pass to Qsort () may be very troublesome, especially in the generic code. You cannot use qsort () sort with something with constructor or virtual functions, or any data structure other than the array of C language. Although qsort () is not officially used, its only real use is backward compatible with the C language.

In other five sorting methods, the top three are generic algorithms, and then two of them use certain containers. All of these methods use Operator <() by default, but allows you to specify your own comparison functions when necessary. Each provides some special features:

l sort () is usually the fastest.

l Stable_sort () guarantees the stability of the equivalent element in order.

l Partial_sort () allows only the first N elements from being ranked.

l List :: sort () manipulate the pointer instead of copying elements.

l Set allows quick insertion and deletion in a sequence area

If you don't need a special feature of other four methods, the preferred should usually be sort ().

Listing 1: Measurements - Sorting a vector

#include

#include #include

#include

#include

#include

#include

#include

#include

#include

#include

#include

Using std :: pair;

Using std :: vector;

Using std :: string;

Struct timer {

Clock_t start;

Timer () {start = clock ();

Double Time () const {

Return Double (Clock () - start) / clocks_per_sec;

}

}

Template

PAIR

Do_Sort (Container C)

{

Timer T;

Std :: sort (c.begin (), c.end ());

Return std :: make_pair (String ("sort"), t.time ());

}

EXTERN "C" INT

DCompare (Const void * px, const void * py)

{

Double x = * static_cast (px);

Double y = * static_cast (py);

Return x y? 1: 0;

}

PAIR

Do_qsort (Vector V)

{

Double * a = new double [v.size ()];

Std :: Copy (v.begin (), v.end (), a);

Timer T;

Qsort (a, v.size (), sizeof (double), & dCompare;

Double Elapsed = t.time ();

delete [] a;

Return std :: make_pair (String ("Qsort"), ELAPSED;

}

Template

PAIR

Do_STable_Sort (Container C)

{

Timer T;

Std :: stable_sort (c.begin (), c.end ());

Return std :: make_pair (String ("stable_sort"), t.time ());

}

Template

PAIR

Do_heap_sort (Container C)

{

Timer T;

Std :: make_heap (c.begin (), c.end ());

Std :: sort_heap (c.begin (), c.end ());

Return std :: make_pair (String ("HEAP Sort"), T.Time ());

}

Template

PAIR

Do_List_Sort (Container C)

{

TypeDef TypeName Container :: Value_Type T; Std :: List L (c.begin (), c.end ());

Timer T;

L.sort ();

Return std :: make_pair (String ("List Sort"), T.Time ());

}

Template

PAIR

Do_set_sort (Container C)

{

Typedef Typename Container :: Value_Type T;

Timer T;

Std :: MultiSet S (c.begin (), c.end ());

Return std :: make_pair (String ("set sort"), t.time ());

}

Void Report (Vector > v,

Std :: Ostream & OS)

{

Typedef Vector > vect;

OS << std :: setw (20) << "Sorting Method" << "

<< "t (sec) << std :: endl;

For (vect :: itemic i = v.begin (); i! = v.end (); i)

OS << std :: setw (20) << I-> first << "

<< I-> Second << std :: endl;

}

Int main (int Argc, const char **

Argv)

{

INT n = 0;

IF (argc == 2)

N = atoi (Argv [1]);

IF (n <= 0) {

Std :: CERR << "Usage:"

<< argv [0] << ""

<< std :: endl;

Return 1;

}

Vector v (n);

For (int i = 0; i

v [i] = i;

Std :: random_shuffle (v.begin (), v.end ());

Std :: cout << "sorting a vector of"

<< v.size () << "Doubles." << std :: end1;

Vector > results;

Results.push_back (do_sort (v));

Results.push_back (do_qsort (v));

Results.push_back (do_stable_sort (v));

Results.push_back (do_heap_sort (v));

Results.push_back (DO_LIST_SORT (V));

Results.push_back (do_set_sort (v));

Report (Results, std :: cout);

Return 0;

}

Listing 2: Measurements - Sorting a Vector (ONLY ELSE IS The Same As in Listing 1) Int Main (int Argc, const char ** argv)

{

IF (argc! = 2) {

Std :: CERR << "Usage:"

<< argv [0] << ""

<< std :: endl;

Return 1;

}

Std :: ifstream in (Argv [1]);

IF (! in) {

Std :: cerr << "can't open" << argv [1] << std :: end1

Return 1;

}

Vector V;

String Str;

While (std :: getLine (in, str))

V.PUSH_BACK (STR);

Std :: cout << "Sorting a file of"

<< v.size () << "LINES."

<< std :: endl;

Vector > results;

Results.push_back (do_sort (v));

Results.push_back (do_stable_sort (v));

Results.push_back (do_heap_sort (v));

Results.push_back (DO_LIST_SORT (V));

Results.push_back (do_set_sort (v));

Report (Results, std :: cout);

Return 0;

}

Listing 3: Measurements - Comparing Direct and Indirect Sort

#include

#include

#include

#include

#include

#include

#include

#include

Using std :: vector;

Using std :: string;

Using std :: pair;

Template

Struct Indirect_lt {

Bool Operator () (Iterator X, Iterator Y) Const

{RETURN * x <* y;}

}

Struct timer {

Clock_t start;

Timer () {start = clock ();

Double Time () const {

Return Double (Clock () - start) / clocks_per_sec;

}

}

Void

Report (Vector > V,

Std :: Ostream & OS)

{

Typedef Vector > vect;

OS << std :: setw (20) << "sorting method" << "" << "t (sec) << std :: endl;

For (vect :: itemic i = v.begin (); i! = v.end (); i)

OS << std :: setw (20) << I-> first << "

<< I-> Second << std :: endl;

}

PAIR

Do_sort (Vector C)

{

Timer T;

Std :: sort (c.begin (), c.end ());

Return std :: make_pair (String ("sort"),

T.Time ());

}

PAIR

Do_indirect_sort (Vector

:: Const_Iterator> C)

{

Timer T;

Std :: sort (c.begin (), c.end (),

Indirect_lt

:: Const_Iterator> ());

Return std :: make_pair (String ("Indirect Sort"),

T.Time ());

}

INT Main (int Argc, const char ** argv)

{

IF (argc! = 2) {

Std :: CERR << "Usage:"

<< argv [0] << ""

<< std :: endl;

Return 1;

}

Std :: ifstream in (Argv [1]);

IF (! in) {

Std :: cerr << "can't open" << argv [1] << std :: end1

Return 1;

}

Vector V;

String Str;

While (std :: getLine (in, str))

V.PUSH_BACK (STR);

Std :: cout << "Sorting a file of"

<< v.size () << "LINES."

<< std :: endl;

Typedef st :: vector :: const_iterator it;

Std :: Vector iv;

For (iter i = v.begin (); i! = v.end (); i)

Iv.push_back (i);

Vector > results;

Results.push_back (do_indirect_sort (iv));

Results.push_back (do_sort (v));

Report (Results, std :: cout);

Return 0;

}

Note

[1] For more details on quicksort and merge sort, and other sorting algorithms, see DE Knuth, The Art of Computer Programming, vol. 2, 1998. [2] DR Musser. "Introspective sorting and selection algorithms," Software Practice and Experience, 27 (8): 983, 1997.

[3] This restriction is partly artificial:. It is possible to implement quicksort and merge sort in terms of Forward Iterators, albeit at the cost of some complexity Artificial or not, however, it's what the Standard requires.

[4] see .

About the Author

转载请注明原文地址:https://www.9cbs.com/read-26699.html

New Post(0)