Although there are various approaches to data mining that seem to offer distinct features and benefits, many may not be powerful enough to meet your corporate knowledge discovery needs. But in fact, just a few fundamental questions can quickly clarify the business benefits and the power of a data mining system, setting its advantages in a clear perspective. These questions need to be asked from the viewpoints of both business and technical users. Please note, however, that these questions refer to data mining; please also see the many benefits of the knowledge access paradigm, which uses the patterns discovered by data mining within a PatternWarehouse(TM).

Here are two sets of "Top Ten Data Mining Questions" from business and technical perspectives. Each question has three parts that together highlight one specific aspect of a data mining system's power and capability.

The Top Ten Data Mining Business Questions

The top ten business questions are about the benefits, quality and usability of the system. They are:

Question 1: Business Benefits
a) How will this system help us?
b) How well does this system work for our industry-specific applications?
c) What information can we get that we do not already have?

It is essential to ask this question again and again. You should, of course, get new refined information, but it is not enough just to know something: you should have information that allows you to "act" within the context of your industry. And you should measure the bottom-line dollar benefits delivered by a data mining system. See the paper "Measuring the Dollar Value of Mined Information"
for a framework for this.

Question 2: Technical Know-how
a) How technically sophisticated do we need to be to use it?
b) Can business users operate it without calling the IT group all the time?
c) Is it as easy to use as an internet browser?

Business users should be empowered with direct, on-demand access to refined knowledge. They should not have to know statistics, yet should be given consistent and correct answers. The system interface should be as easy to use as a web browser.

Question 3: Understandability and Explanations
a) Are the results intuitive or difficult to understand?
b) Do we get clear explanations for any information item presented?
c) Will the explanations be in technical statistical terms or in a form that we can understand?

Results should be presented to business users in plain English, accompanied by graphs. The system should be able to explain each piece of information it presents in clear, English-like terms that business users can easily comprehend and use.

Question 4: Follow-up Questions
a) What kinds of follow-up questions can we ask of the system?
b) Do we need to go to an analyst for further question answering?
c) How fast can we drill down on the fly to see more patterns?

Response to follow-up questions must be immediate. Business users should not need to use intermediaries such as analysts to get more information after they have seen some results. If follow-up questions take time and involve intermediaries, the business users' effectiveness will be impacted. Business users should get refined information as they need it, when they need it.

Question 5: Business Users
a) How many business users can this system support?
b) Can the business users tailor their own questions for the system?
c) Can users utilize the knowledge for day-to-day decision making?

The system should be able to use the same fundamental knowledge to support a few hundred business users, each with a different group perspective. Yet all of these users must be given consistent answers as they ask their own questions. The information must be presented such that it can be utilized for day-to-day actions.

Question 6: Accuracy, Completeness and Consistency
a) How accurate are the results the system delivers?
b) Can some patterns be missed by the system?
c) Are the results always consistent, or can 100 users get 100 different answers?

The system must cover a wide range of patterns and should provide high-quality information. The knowledge provided to business users should be derived from the entire data set (and not samples) in order to increase accuracy. All business users should access the same knowledge so that they all receive consistent answers, increasing the quality of corporate information.

Question 7: Incremental Analysis
a) Can we automatically analyze weekly or monthly data as it becomes available?
b) Can the system compare the "month to month"
results and patterns by itself?
c) Can we get automatic pattern detection over time, every week or month?

The system should analyze data as it becomes available every week or month and perform on-going trend analysis, highlighting the key items and influence factors that impact significant changes. The incremental analysis should be performed automatically in the background, informing the user of significant trends and the underlying causes.

Question 8: Data Handling
a) How much data can the system deal with?
b) Can it work directly on our database, or do we need to extract data?
c) If it works on extracts, how do we know that some patterns are not missed?

The system should handle moderate to large volumes of data on a powerful server; of course, large data volumes should not be expected to be managed on small servers. The system should work directly on the SQL database, without extracts, so that patterns are not missed and performance is improved.

Question 9: Integration
a) How will it integrate into our computing environment?
b) Will it just work on our existing SQL database?
c) How easily will the system work on our intranet?

The system should run smoothly on existing open server platforms (e.g. Unix) and popular DBMS engines (e.g. Oracle, Sybase, Informix, etc.) on the server. The system should present results to users on the corporate intranet. The absence of data conditioning requirements and extract files will make integration much easier.

Question 10: Support Staff
a) What staff do I need to keep this system installed and running?
b) How do we get support and training to get started?
c) What happens after we install the system?

After the initial system design, the support personnel for the system should be kept minimal. One database administrator should be able to manage the DBMS, and one analyst should occasionally help in setting up discovery models, etc. Thereafter, business users should be able to use the system on their own. There should be no need for a large number of resident support analysts to act as intermediaries for the business users.

The Top Ten Data Mining Technical Questions

The top ten technical questions should be asked by technical users about the architecture, power and scalability of the system. They are:

Question 1: Architecture
a) How are computations distributed between the client and the server?
b) Is any data brought from the server to the client?
c) Can the system run in a three-tiered architecture?

The best option is for the discovery to take place entirely on the server. Any attempt to bring data to the client will seriously limit the applicability of the system to larger databases. The best architecture is a thin-client, three-tiered system that uses the power of a large server-based SQL engine but operates on an intranet.

Question 2: Access to Real Data
a) Does the system work on the real SQL database or on samples and extracts?
b) If it samples or extracts, how do we know that it is accurate?
c) If it builds flat files, who manages this activity and cleans up for on-going analyses, and how can it sample across several tables?

The best option is for a data mining system to work on the real databases and not on samples, extracts and/or flat files. Working on the real database uses the SQL engine's power (e.g. parallel execution) and provides much more accurate results. And the system should be able to access database tables in their native form, reaching across tables by itself.

Question 3: Performance and Scalability
a) How large a database can the system analyze?
b) How long does it take to perform discovery on a large database?
c) Can the system run in parallel on a multi-processor server?

The system should work on databases with a large number of records. It should derive its capabilities from the power of the server and the SQL engine, whenever possible. The system should be able to use the built-in parallelism of the SQL engine, but should also be able to use multiple processors for its own parallel non-SQL computations.

Question 4: Multi-Table Databases
a) Does the system work on a single table or on multiple tables?
b) Does the system need to perform a huge join to access all of our tables?
c) If it works on a single table, how can we feed it our existing data schema?

The real world is full of multi-table databases which cannot be joined and meshed into a single view. In fact, the theory of normalization came about because data needs to be in more than one table. Using single tables is an affront to a decade of work on database design. If you challenge the DBA of a really large database to put things in a single table, you will get either a laugh or a blank stare; in many cases the database size will balloon beyond control. The system should be able to mine large multi-table databases directly by itself on the server.

Question 5: Multi-Dimensional Analysis
a) Does the system analyze data along a single dimension only?
b) How are multi-dimensional patterns discovered and expressed by the system?
c) How do we specify the dimensional structure of our data to the system?

The OLAP phenomenon has conclusively demonstrated that the business world's data is not single-dimensional. Hence a data mining system should be able to automatically discover patterns along multiple dimensions. In fact, there are many cases where no single-dimensional view can correctly represent the semantics of influence, because the influence ratios will always be off regardless of how one aggregates. See the paper OLAP &
Data Mining: Bridging the Gap for a detailed discussion of this.

Question 6: Types and Classes of Patterns Discovered
a) How powerful and general are the patterns the system can discover and express?
b) Can the system mix different pattern types, e.g. influence and affinity patterns?
c) Can the system discover time-based patterns and trends?

The format of the patterns discovered by the system is very general and goes far beyond decision trees or simple affinities. The advantage of this is that the general rules discovered are far more powerful than decision trees. Decision trees are very limited in that they cannot find all the information in a database. Being rule-based keeps the system from being constrained to one part of a search space and makes sure that many more clusters and patterns are found, allowing the system to provide more information and better predictions.

Question 7: System Initiative
a) Does the system use its own initiative to perform discovery, or is it guided by the user?
b) Can the system discover unexpected patterns by itself?
c) Can the system start up by itself on a weekly or monthly basis and perform discovery?

In some cases the user has to interact with and guide the system, e.g. to build a decision tree. However, a better approach is for the system to use its own initiative in the data mining process, forming hypotheses automatically based on the character of the data. The system should start up by itself, select the significant patterns in the data and filter out the unimportant trends. The analysis should be done routinely on a weekly or monthly basis.

Question 8: Treatment of Data Types
a) Are all data types handled in their own form or translated to other types?
b) Can the system find numeric ranges in data by itself?
c) Do a large number of non-numeric values cause problems for the system?

The system should manage all data types in a uniform manner and in their native formats, i.e. numbers, dates and constants in the data should be discovered by the system, not required "Number Bin"