The most authoritative article about MapReduce online is Jeffrey Dean and Sanjay Ghemawat's "MapReduce: Simplified Data Processing on Large Clusters." You can download it from labs.google.com.
For a company like Google, which has to analyze massive amounts of data, ordinary programming methods are not enough, so Google developed MapReduce. Put simply, MapReduce is syntactically reminiscent of Lisp: under the MapReduce model, you specify a map function that processes key/value data and generates intermediate key/value pairs, and a reduce function that merges all intermediate values sharing the same intermediate key to produce the final result. Google's MapReduce is a programming tool for running over terabytes of data on thousands of machines.
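In the paper's notation, the types of the two user-supplied functions are:

    map    (k1, v1)       -> list(k2, v2)
    reduce (k2, list(v2)) -> list(v2)

The framework collects every intermediate value that shares a key and hands the whole group to a single reduce call.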
It is said that under a programming model like MapReduce, programs can be executed in parallel in a cluster environment automatically. Just as a Java programmer need not worry about memory leaks, a MapReduce programmer does not need to care how the massive data set is partitioned across machines, what happens when a machine fails, or how all those machines coordinate their work.
For example: recently, while building Beta 1 of a Bayesian forum-spam filtering demo system, I needed to compute the frequency of every word in the sample data. My approach was to tokenize first and then count with a hash table; a minimal single-machine sketch of that follows. But if I ever ran into terabytes of data, my Celeron CPU would be eaten alive.
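For reference, here is a minimal single-machine sketch of that hash-table approach in C++ (whitespace tokenization is my simplification; real segmentation of forum text would be more involved):

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        // Count word frequencies from stdin with a hash table.
        std::unordered_map<std::string, int> freq;
        std::string word;
        while (std::cin >> word)   // naive whitespace tokenization
            ++freq[word];
        for (const auto& [w, n] : freq)
            std::cout << w << '\t' << n << '\n';
    }

This works fine right up until the data no longer fits on one machine, which is exactly the gap MapReduce fills.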
So what does the same job look like under MapReduce? Here is a pseudo-code realization, the word-count example from the paper:

    Step 1, map:

        map(String key, String value):
            // key: document name
            // value: document contents
            for each word w in value:
                EmitIntermediate(w, "1");

    Step 2, reduce:

        reduce(String key, Iterator values):
            // key: a word
            // values: frequency counts for this word
            int result = 0;
            for each v in values:
                result += ParseInt(v);
            Emit(AsString(result));
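To make the control flow concrete, here is a tiny single-process C++ sketch of what the framework does around those two functions: run map over each document, group the intermediate pairs by key (the shuffle step), then run reduce once per key. This is only an illustration of the model under my own simplifying assumptions, not Google's actual library interface:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    using KV = std::pair<std::string, std::string>;

    // Map phase: emit (word, "1") for every word in the document.
    std::vector<KV> Map(const std::string& doc) {
        std::vector<KV> out;
        std::istringstream in(doc);
        std::string w;
        while (in >> w) out.push_back({w, "1"});
        return out;
    }

    // Reduce phase: sum the counts emitted for one word.
    // (key is unused here but mirrors the pseudocode signature.)
    std::string Reduce(const std::string& key,
                       const std::vector<std::string>& values) {
        int result = 0;
        for (const auto& v : values) result += std::stoi(v);
        return std::to_string(result);
    }

    int main() {
        std::vector<std::string> docs = {"good post", "spam spam post"};

        // Shuffle: group all intermediate values by their key.
        std::map<std::string, std::vector<std::string>> groups;
        for (const auto& doc : docs)
            for (const auto& [k, v] : Map(doc))
                groups[k].push_back(v);

        // Reduce each key's value list to a final count.
        for (const auto& [key, values] : groups)
            std::cout << key << '\t' << Reduce(key, values) << '\n';
    }

In the real system the map and reduce invocations run on different machines and the shuffle happens over the network; this sketch collapses all of that into one process.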
If you look at this through the lens of the vector space model, it is essentially the computation behind TF and IDF.
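As a reminder of the standard definitions (textbook formulas, not something the MapReduce paper derives): for a word w, a document d, and a corpus of N documents of which df(w) contain w,

    tf(w, d)     = number of occurrences of w in d
    idf(w)       = log(N / df(w))
    weight(w, d) = tf(w, d) * idf(w)

The word count above produces the raw material for tf; one common way to get df(w) is a second MapReduce pass whose map emits (w, document-id) and whose reduce counts distinct documents.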
Google's MapReduce is implemented in C++, and "MapReduce: Simplified Data Processing on Large Clusters" also contains real MapReduce code. Take a look; it is a feast for the eyes.
The current scale of the Sogou.com search engine:

Parameter 1: code efficiency improves roughly tenfold every two weeks.
Parameter 2: after several rounds of optimization, the program occupies 4G of memory, running on 50 servers with 200G of memory and 10T of hard disk.
Parameter 3: tens of thousands of queries per minute.
Parameter 4: controlled by some 20,000 configuration files, with user queries printing 100M of logs.