Making Data Mining Work
Ken North (Ken_North@compuserve.com)
Selected from DB2 Magazine, 2004
Data mining, the extraction of valuable information from data, is hotter today than ever. So why are so many people unable to use it correctly?
The Data Miner column first appeared in DB2 Magazine in 1999. Four years later, data mining is still a hot topic. A recent Gartner report (December 2002, "Technology Adoption and Value: Survey Results") ranked data mining third among 37 emerging technologies across all industries. Data mining has been accepted, applied, and has become widespread. Recently, I talked with my colleagues at the IBM Thomas J. Watson Research Center about future trends in data mining. Among them is Chid Apte of the data analytics group; he and his team do fundamental research in data mining and related fields. Members of Apte's team, Naoki Abe, Rick Lawrence, and Ed Pednault, also joined the discussion. They are scientists and business consultants at once (they often spend a good deal of time with IBM customers, helping them form their own point of view).
Hermiz: Before we discuss the future, let's talk about the past. What do you think is data mining's biggest success? Where has the technology fallen short of expectations?
Apte: The greatest success of data mining is that it analyzes and explores business activity involving large amounts of data in a more automated way. In the past, extracting valuable information from that data required industry experts and statisticians. This may not fit the traditional notion of success, but it really did open a door, and I think that is data mining's biggest achievement. As for its shortcomings, I think the main challenge is operationalization, and we still have not solved that problem.
Pednault: I know of a considerable number of companies that do data mining, rely completely on predictive models to run their business activities, and profit from them.
For those companies, data mining has enhanced their capabilities. To me, that is the definition of success. For example, some companies have long used data mining for credit risk assessment, and they rely on it to support their risk management.
Lawrence: To extend Apte's point: if you ask where data mining has been applied successfully, my answer is credit card fraud detection. In that application the volume of data is very large, a single mistake can cause significant losses, and processing has to happen online and very fast. Of course, there are certainly more success stories. On the other hand, if we are going to criticize data mining, we can draw this conclusion: data mining usually fails when it is treated as a cure-all for any particular problem. In those cases it fails because people's expectations are too high. People who do not practice data mining think it is almost magical. They think they can take an ill-defined problem, pour some dirty data into a data mining tool, and somehow get a useful solution out. That is impossible.
Abe: You cannot talk about the achievements of data mining without talking about those of the Web. I would add that data mining fell short of its expected goals to some extent because the Web-based application model did not achieve its own goals. There was a view that on the Web, data would appear automatically and operations would happen automatically. It is not that simple: business also involves people, physical inventory, and operational problems.
Hermiz: If you consider the elements of an information-based solution (people, processes, and technology), where do you see the challenges and opportunities?
Apte: These elements are closely linked. A challenge for one is an opportunity for another. We can apply technology to streamline business processes, reducing people's workload so that they can do what they do best.
Pednault: I think technology is the most important; it creates the opportunities. But a lot of process change is still needed, and of course it is people who ultimately use the technology. So what changes are necessary? Take customer relationship management (CRM) systems. Typically you assign different managers to different campaigns, and the next campaign may be run by someone else. What the customer experiences is not one independent campaign but a series of them, so you can lose the customer relationship along the way. Business processes in CRM need to be reorganized accordingly. Even though technology exists that helps manage individual customer relationships, doing so requires a huge transformation of business processes. Business people must recognize that these changes are necessary for the business to grow; then they can deal with all the people-related questions (who manages whom, who owns what, and how everyone is measured). Some organizations have very strong inertia, and as a result the advantages of the technology cannot be fully exploited.
Lawrence: I do not think advances in technology will make skills such as data analysis and statistics obsolete. However, the IT practitioners who collect data need a stronger understanding of the business; they need to collect data in a form that business intelligence tools can actually use.
Apte: If you can increase the number of people who have these skills, you can of course make better use of the technology we already have today. But is that the investment we should be making? It makes the process more dependent on labor, when the alternative is to improve the technology and reduce the dependence on skilled people.
Abe: I agree that those skills will never go out of date, but I also believe there is a drive to automate more parts of the data mining process. Over the next 3 to 5 years, I think automation will play a huge role in reducing the dependence on skills.
Hermiz: When we talk about data mining, we always have to focus on the data. What is your view of the state of data collection, cleansing, and storage? Is poor business data quality a major constraint?
Lawrence: I believe that for the data mining community, our progress in this area is close to embarrassing. Even looking back over the past 10 years, I think we would be shocked. Take customer databases: from the initial inquiry through order fulfillment, the links across the data collection process are broken so often that it is surprising. It is very hard to assemble data that connects a specific marketing action to the final purchase decision. It is therefore also very hard to use that data to develop a data mining model that improves things.
Apte: Although data warehouses and the associated data cleansing tools have existed for some time, they are not as widely used as we would hope. Moreover, I do not think those who do use the tools have solved some of the problems, including collecting data and organizing it in a format that data mining can use. Our research group spends more time than it should exploring and applying techniques to deal with this problem.
Lawrence: In fact, what I want to say is that the data collection process is so poor that data mining researchers are always being asked to go back and correct the defects in the data collection system. We are now working with a technique that we plan to use to clean data and so correct errors introduced by the data collection system. A small but annoying example is a CRM system that lets the outcome of a specific purchase decision be entered in free format. Because the system does not offer users three or four fixed outcomes to choose from (buy, do not buy, and so on), we are left applying text analysis to the free-format responses to derive the outcome.
Abe: I think automating data cleansing, preprocessing, and text mining will become a very large technical challenge, driven by data accuracy problems.
Pednault: Looking at it from the data side, the customers who have built data mining into their business activities understand the value of every data element they collect. They can therefore arrange their business processes to ensure data quality and to establish the connection between marketing decisions and final results, and so build predictive models that improve their processes. Putting such processes in place requires management that understands the value of data and stays in contact with suitable analysts, who can help design the database so that the data is represented correctly. These customers have put great effort into collecting and cleaning customer attribute data and into making sure the data is sufficient.
Hermiz: Perhaps because of homeland security concerns, people seem to have become interested in text mining and analysis. Do you think data mining and text mining will converge in the future?
Apte: Data mining and text mining may converge to some extent, in that text knowledge bases can serve as an important source of features and attributes for some of the data mining we do today. Text mining also has its own distinct contribution: it focuses on information extraction, trend forecasting, and intelligent assessment of documents and knowledge bases. That makes it a complement to data mining, not something to be merged with it.
Abe: I do see some convergence.
In natural language processing, the use of data mining and machine learning techniques has grown sharply, and they now account for most of the papers. Of course, as Chid Apte said, one part of text mining research applies data mining techniques to extract features from text. However, a very important part of text mining research deals with problems specific to natural language (such as the automatic acquisition of useful syntactic and semantic knowledge).
Lawrence: I have already discussed these problems with some customers, and they want to analyze structured data together with unstructured data from news.
Hermiz: Looking 3 to 5 years ahead, where do you see the biggest opportunity for data mining applications? What improvements will make it possible?
Apte: I think it is in the supply chain. A lot of data is available there, but today's systems and solutions have not yet developed to the point where they can use sophisticated methods such as data mining. They still seem to rely on traditional statistical forecasting techniques.
Pednault: By then there will be a good opportunity to combine data mining, predictive modeling, and optimization, and to take data mining to a larger scale. Many predictive methods for supply chain management were originally based on techniques developed for credit scoring and CRM, so there are many opportunities to move into this area. In the supply chain, the importance of data is already recognized: you can get an end-to-end view from producers and suppliers through distribution channels to customers. Systems will be deployed to collect, manage, and maintain all of that data. However, much of today's data-based decision-making is still very backward, even as management of the entire supply chain becomes more and more important. To make all of this a reality, business processes must be transformed and the different roles in the supply chain must cooperate.
Abe: There is great interest in the financial sector, in connection with operational resilience and risk management. Data analysis will play an important role in these areas in the future.
Lawrence: Homeland security is of course another area in which interest is growing.