China's banking, securities, telecommunications and insurance industries are all talking about "data centralization", hoping to build customer relationship management and business intelligence on that foundation. The novel job title "Data Mining Engineer" has also started to appear in company recruitment listings.
Is data mining actually useful? The leaders of some companies have their doubts. Data mining practitioners speak in rare and strange technical terms. Their backgrounds are mixed: they are not quite computer scientists, not quite statisticians, and not marketing planners either. The results they produce are not easy to understand, so what is their work good for? Managers with a technical background may be enthusiastic, hoping to find new business models and new ways to make money as soon as possible. Managers with strong business intuition, on the other hand, tend to resist this kind of precise quantitative analysis, and data mining's own shortcomings make it an easy target.
For data mining to deliver its full value, it needs the understanding of corporate managers and greater effort from data mining practitioners. Drawing on experience from past data mining projects, the author attempts to clear up some common points of confusion.
1. Applying the results
Question: Some data mining results are delivered as probabilities, and this is what most easily invites skepticism. A manager may ask: I want a forecast for my customers, so why can't you tell me which customers will leave next month? Why can you only give me each customer's probability of leaving? I want you to predict which customers will commit insurance fraud, and again you give me a probability of fraud for each one. How do I use these probabilities? Do I dare use them?
Explanation: A predictive model is only an approximation of the real world. The customer behavior information stored in an enterprise database can never be complete, and the information that is not stored may be exactly what is most relevant to churn or fraud. A model built on the available information therefore cannot be perfectly accurate and cannot give a definite answer, only a probability. Such a result is still useful. Among customers predicted to have a high probability of leaving, the proportion who actually leave is usually very high, so the company can concentrate its retention efforts on this group; the targeting is sharp and resources are saved. Likewise, among customers with a high predicted probability of fraud, the proportion of actual fraud is much higher than in other customer groups, so investigators can focus on these customers and often achieve far more with the same effort. The value of the resources saved speaks for itself.
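As a rough illustration of how such probabilities can be put to work, the sketch below ranks customers by predicted churn probability and selects only as many as a retention budget allows. The customer IDs, scores, outcomes and the `budget` parameter are all made up for this example; they do not come from any real project.

```python
# Hypothetical scored customers: (customer_id, predicted churn probability,
# whether the customer actually churned). All values are invented.
scored = [
    ("c01", 0.92, True), ("c02", 0.85, True), ("c03", 0.40, False),
    ("c04", 0.78, False), ("c05", 0.10, False), ("c06", 0.05, False),
    ("c07", 0.33, True), ("c08", 0.15, False), ("c09", 0.08, False),
    ("c10", 0.02, False),
]


def select_targets(scored, budget):
    """Return the `budget` customers with the highest predicted probability."""
    ranked = sorted(scored, key=lambda row: row[1], reverse=True)
    return ranked[:budget]


def churn_rate(rows):
    """Fraction of rows whose observed outcome is churn."""
    return sum(1 for _, _, churned in rows if churned) / len(rows)


targets = select_targets(scored, budget=3)
target_ids = [cid for cid, _, _ in targets]

overall_rate = churn_rate(scored)   # churn rate across all customers
target_rate = churn_rate(targets)   # churn rate among the selected group

print(target_ids, round(overall_rate, 2), round(target_rate, 2))
```

In this toy data the selected group churns at a much higher rate than the customer base as a whole, which is the sense in which a probability, although not a definite answer, still tells the retention team where to spend its effort.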
2. Selection of variables
Question: Building a predictive model is an attractive idea. The target is relatively easy to define: to predict customer churn, the target variable is "customer churned" (a binary variable); to predict whether a stock rises or falls, the target variable is "whether the price rose". Deciding which variables to use as independent variables (recall the definition of a function from high school algebra) is far more troublesome. In other words, one has to work out which factors are related to the target variable, and how well this is done directly affects the performance of the predictive model. So should business personnel decide, or data mining personnel?
Explanation: The best approach is for the two sides to work together. Long business experience lets business personnel spot the factors that are closely related to the target variable. But experience has limits and can even constrain thinking: business personnel will miss many factors that look insignificant on the surface yet are in fact important. And because the human brain has limited processing capacity, complex and subtle interactions among factors sometimes go unnoticed. This is where data mining personnel can contribute. Statistics offers many mature methods that help select the right variables for a predictive model.
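One very simple screening method, shown here only as a sketch, is to rank the candidate variables proposed by business personnel by the absolute value of their correlation with the target. The variable names and data are hypothetical, and a univariate screen like this cannot detect the interactions between factors mentioned above; it is a starting point, not a complete selection procedure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_candidates(candidates, target):
    """Sort candidate variables by |correlation with target|, strongest first."""
    scores = {name: pearson(values, target) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical data for eight customers: 1 = churned, 0 = stayed.
churned = [1, 1, 1, 0, 0, 0, 0, 0]
candidates = {
    "complaints_last_quarter": [3, 2, 4, 0, 1, 0, 0, 1],
    "monthly_fee": [30, 50, 40, 35, 45, 50, 30, 40],
    "months_since_signup": [2, 3, 1, 24, 18, 30, 12, 20],
}

ranking = rank_candidates(candidates, churned)
for name, r in ranking:
    print(f"{name}: {r:+.3f}")
```

In this invented data set, the fee shows no linear relationship with churn and lands at the bottom of the ranking, while the other two variables are strongly related to it in opposite directions.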
There is also a common situation: a variable chosen by data mining personnel turns out to improve model accuracy considerably, yet no reasonable explanation for it can be found. Business staff will then ask for the variable to be removed. In fact, the results of data mining often go beyond what we imagine, and our instinct is to reject what we do not understand, even at the cost of the model's predictive performance. This is harmful. That something cannot be explained now does not mean it can never be explained (it is said that Wal-Mart's "beer and diapers" rule was also explained only after it was discovered). Data mining results are not conjured out of thin air: they are produced by complex algorithms that rest on mathematical theory developed over thousands of years and that have been confirmed effective countless times, so they should not be dismissed lightly. Moreover, if a variable has been shown to improve accuracy once it enters the model, deleting it would be a pity. Do not forget the basic principle that "practice is the sole criterion for testing truth".
3. The superstition about "lift"
Question: When the performance of a predictive model is assessed, business personnel often ask the data mining engineer: "What is the lift of your model?" It seems that anything below 3.0 counts as a bad model. So how high must the lift be before a model is acceptable?
Explanation: Lift is an important measure of a predictive model, but it is not the only one. There are also the confusion matrix, the captured response rate, the ROC curve, threshold-based diagnostics, and so on. Lift differs between industries, and it can differ between regions within the same industry. In our own work, using roughly the same independent variables to predict churn among mobile phone users, the model for one area in Guangdong had a lift of only 2.2, the model for another time period reached 5.2, and a model for an area in Hubei reached 7.0. Whether a model is acceptable therefore cannot be judged by the level of lift alone; it should be judged by the results of its predictions, and its return on investment should be calculated. Data mining personnel should nevertheless take the initiative to try different ways of improving the model and raise its predictive accuracy as far as possible without overfitting, because a one percentage point gain in accuracy may mean millions of yuan in additional income for the business.
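To make these measures concrete, here is a minimal sketch that computes a confusion matrix at a fixed threshold, the lift in the top-scoring fraction of customers, and a simple return on investment for a retention campaign. Lift is taken here in its usual sense: the response rate in the targeted group divided by the overall response rate. The scores, outcomes and all campaign figures (`save_rate`, `value_per_saved`, `cost_per_contact`) are hypothetical assumptions, not figures from the projects described above.

```python
def confusion_matrix(probs, actual, threshold):
    """Count true/false positives and negatives at a probability threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p, y in zip(probs, actual):
        predicted = p >= threshold
        if predicted and y:
            counts["tp"] += 1
        elif predicted and not y:
            counts["fp"] += 1
        elif not predicted and y:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def lift_at(probs, actual, fraction):
    """Response rate in the top `fraction` by score, divided by the base rate.

    Assumes at least one positive outcome, so the base rate is non-zero.
    """
    ranked = sorted(zip(probs, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:top_n]) / top_n
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate


def campaign_roi(contacted, churners_reached, save_rate,
                 value_per_saved, cost_per_contact):
    """(gain - cost) / cost for a retention campaign; all inputs are assumptions."""
    gain = churners_reached * save_rate * value_per_saved
    cost = contacted * cost_per_contact
    return (gain - cost) / cost


# Hypothetical scores and outcomes for ten customers (1 = churned).
probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
actual = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

cm = confusion_matrix(probs, actual, threshold=0.5)
lift = lift_at(probs, actual, fraction=0.2)
roi = campaign_roi(contacted=2, churners_reached=2, save_rate=0.5,
                   value_per_saved=1000, cost_per_contact=50)
print(cm, round(lift, 2), roi)
```

The same lift can correspond to very different returns depending on contact cost and customer value, which is why the lift figure alone does not settle whether a model is worth deploying.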
4. The purpose of segmentation
Question: Compared with traditional experience-based segmentation, customer segmentation produced by data mining can take many more behavioral attributes into account, yields richer segments, and gives each customer group more distinct behavioral characteristics. But what makes a customer segmentation good? How many groups is most appropriate? Is a result in which the group sizes differ widely a reasonable one?
Explanation: Predictive models have many performance measures, but there is no definitive measure for a customer segmentation model, because we do not know in advance which group a customer should belong to. Whether a segmentation is good or bad has to be judged mainly from the business point of view. Dividing customers into hundreds of groups gives a more detailed understanding of each group, but do we have enough customer managers? Can the existing customer management system handle that many groups? If not, the number of groups has to be forced down. Group sizes sometimes differ greatly. This may simply reflect reality: some large groups of customers behave very similarly, while some small groups also share common behavioral characteristics. These small groups may consist of people with abnormal behavior, for example behavior characteristic of fraud. If business handling requires it (for example, each customer manager is responsible for a roughly equal number of customers), companies often ask for more uniform group sizes, and the similarity of customers within a group then suffers somewhat.
In addition, because data mining tools are powerful, data mining personnel may become fascinated by the many possible segmentation results and lose sight of the purpose of the segmentation, while business personnel may assume that the segments are fixed and cannot be adjusted. The best practice is close interaction between business personnel and data mining personnel: determine the segmentation scheme from business needs, and keep adjusting until the scheme and its results are reasonable and appropriate.
For example, if you want to focus on customers' long-distance calling behavior, you can select several factors related to long-distance calls as segmentation variables, and even multiply these variables by a weight to emphasize their role.
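The weighting idea can be sketched as follows, assuming a distance-based segmentation method (such as k-means) is used, which the text above does not specify. Each variable is standardized and then multiplied by its weight before distances are computed, so a heavily weighted variable dominates the notion of "similar customers". The customer data and the weight of 3.0 are invented for illustration.

```python
from math import sqrt


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std if std else 0.0 for v in column]


def weighted_features(rows, weights):
    """Standardize each variable, then multiply it by its weight."""
    columns = [standardize(list(col)) for col in zip(*rows)]
    scaled = [[v * w for v in col] for col, w in zip(columns, weights)]
    return [list(row) for row in zip(*scaled)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical customers: (long-distance minutes, local minutes) per month.
customers = [(300, 100), (280, 400), (60, 120), (50, 380)]

plain = weighted_features(customers, weights=[1.0, 1.0])
emphasized = weighted_features(customers, weights=[3.0, 1.0])

# With equal weights, customer 0 is nearer to customer 2 (similar local usage).
print(squared_distance(plain[0], plain[1]), squared_distance(plain[0], plain[2]))
# With long distance weighted 3x, customer 0 is nearer to customer 1 instead.
print(squared_distance(emphasized[0], emphasized[1]),
      squared_distance(emphasized[0], emphasized[2]))
```

Feeding the weighted features into a clustering algorithm would then tend to group customers by long-distance behavior first, which is the business purpose stated in the example.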
5. Selection of tools
Question: It is well known that data mining tools are expensive. Some cost millions of dollars for a two-year license, while others can be bought for hundreds of thousands. How should one choose?
Explanation: The choice should depend on the needs of the enterprise, its budget, and the skill level of its users. If hundreds of models have to be built, the management of data and models is very complex, the expected benefits of data mining are high, and the users have a solid theoretical foundation and practical skill, then a powerful and flexible mining tool is the right choice. Otherwise, consider products with a narrower feature set that work more like a toolkit. Companies can consult the evaluation reports on mining software published by consulting firms. It is worth mentioning that some popular free software, such as ADE-4, LISP-STAT and R, is gradually being recognized and used in China. R is a standalone programming environment with many packages; it is hardly inferior to giant commercial software such as SAS, but it demands a lot from its users.
6. Problems that "mining" cannot solve
Question: Having long gone without quantitative analysis, the business community tends to hand every such problem to data mining, whether or not it belongs there. For example, companies may ask how to optimize their network resources, how to find the best operating plan for an uncertain system with many random factors (logistics, factory supply chains, queuing systems and so on), or how to project future changes in market share and competitive advantage. Is data mining up to these tasks?
Explanation: In the academic sense, none of these belongs to data mining. They belong respectively to operations research, discrete event simulation, and system dynamics simulation. These techniques have seen little application in China. Data mining personnel should broaden their role and build on their statistical analysis and data modeling skills to meet these new needs of enterprises. Take the "Marketing Preview" often discussed in the telecommunications industry: predicting the outcome of a marketing plan before it is implemented, so that the plan can be adjusted in advance for the best result. This is in fact a typical competitive dynamics simulation problem. It requires taking time into account, considering the positive and negative feedback between factors, and building a structural model of how the factors interact. Once validated, the model can be used to predict real scenarios. Because the model runs on a computer, a manager can test any idea in it without risk: see how adjusting various factors changes the outcome, check whether a response to competitors is appropriate, and observe what effect the company's own actions will have on its environment.
In summary, data mining, together with other mathematical modeling methods, will play an increasingly significant role in making China's enterprises more modern and more efficient. This will depend on hard exploratory work by business personnel, data mining personnel and other analysts.
Author: Yue Yadin