Theory and Methods of Spatial Data Mining Technology

Ge Jike

(Information College, Southwest Agricultural University, Chongqing 400716)

Abstract: This paper briefly discusses the theory and characteristics of spatial database technology and spatial data mining technology, analyzes the levels and methods of spatial data mining, and focuses on the spatial data mining methods in common use today, such as classification, clustering, and association rules. Finally, it points out the open problems, development trends, and directions of current spatial data mining technology.

Key words: spatial data mining, classification, clustering, association rules

0 Preface

A geographic information system (GIS) is an integrated technology that draws on computer science, geography, surveying, cartography, and other disciplines [1]. The basic technologies of GIS are the spatial database, map visualization, and spatial analysis, of which the spatial database is the key. Spatial data mining is the most active branch of current database technology and a method of knowledge acquisition, and its application in GIS promotes the development of GIS toward intelligence and integration.

1 Characteristics of spatial databases and spatial data mining technology

With the continuous development of database technology and the wide use of database management systems, the amount of data stored in databases has grown sharply, and a great deal of information that matters for decision making is hidden behind these massive data. However, most database applications today remain at the query and retrieval stage, and the rich knowledge hidden in the database goes unused. The dramatic growth of databases stands in strong contrast to the difficulty people have in processing and understanding them, which leads to the phenomenon that "people are drowning in data but starving for knowledge".

The spatial data in a spatial database (data warehouse) carry rich implicit information. For example, a digital elevation model (DEM or TIN) carries elevation information but also implies information about geological lithology and structure; the type of vegetation is explicit information, but it also implies information about horizontal climate zones, and so on. Such implicit information can only be revealed by data mining. Spatial data mining (SDM), also called knowledge discovery from spatial databases, is a new research branch of data mining that arose to cope with the massive volume of spatial data. It refers to extracting implicit, user-interesting spatial and non-spatial patterns and general characteristics from a spatial database [2]. The object of SDM is primarily the spatial database, which stores not only the geometric data and attribute data of spatial things or objects but also the graphical spatial relationships between them, so its processing methods differ from those of general data mining. The essential difference between SDM and traditional geoscience data analysis methods is that SDM mines information and discovers knowledge without an explicit prior hypothesis, and the knowledge it mines should have three properties: it is previously unknown, valid, and useful.

Spatial data mining technology must integrate data mining technology with spatial database technology. It can be used to understand spatial data, to discover relationships among spatial data and between spatial and non-spatial data, to build spatial knowledge bases, and to reorganize and optimize spatial databases.

2 Main methods and characteristics of spatial data mining technology

Common spatial data mining techniques include sequence analysis, classification analysis, prediction, cluster analysis, association rule analysis, time series analysis, the rough set method, and cloud theory. From the perspective of mining tasks and mining methods, this paper focuses on three important and commonly used methods: classification analysis, cluster analysis, and association rule analysis.

2.1 Classification analysis

Classification is a very important task in data mining and is currently the one most widely used in commercial applications. Its purpose is to learn a classification function or classification model (also called a classifier) that maps data items in the database to one of a set of given categories. Both classification and the familiar regression methods can be used for prediction: each automatically derives a generalized description of the given data from historical records so that future data can be predicted. Unlike regression, whose output is a continuous value, classification outputs a discrete category value. Both are often expressed as a decision tree: starting at the root, the search follows the branch that the data values satisfy, and the leaf that is reached determines the category. Spatial classification rules are an abstraction and generalization of a given set of data objects and can be represented by macro-tuples. Constructing a classifier requires a training sample data set as input. The training set consists of a group of database records, or tuples; each tuple is a feature vector made up of the values of its features (also called attributes), together with a category label. A specific sample can be written as (V1, V2, ..., VN; C), where Vi is a field value and C is the category.
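As an illustration of the decision-tree walk described above, the following is a minimal Python sketch. The tree, the field names (distance_to_center_km, near_park), the thresholds, and the category labels are all hypothetical and chosen only for the example; the tree is written by hand here rather than learned from data.

```python
# Hand-written decision tree (hypothetical fields, thresholds, and labels).
# Internal nodes test one field; leaves carry a category label.
TREE = {
    "field": "distance_to_center_km",
    "threshold": 5.0,
    "low": {"label": "high_grade"},          # distance <= 5 km
    "high": {                                # distance > 5 km
        "field": "near_park",
        "threshold": 0.5,
        "low": {"label": "low_grade"},       # near_park == 0
        "high": {"label": "medium_grade"},   # near_park == 1
    },
}


def classify(tree, sample):
    """Walk from the root to a leaf, following the branch the sample satisfies."""
    node = tree
    while "label" not in node:
        value = sample[node["field"]]
        node = node["low"] if value <= node["threshold"] else node["high"]
    return node["label"]


def accuracy(tree, labelled_samples):
    """Fraction of (sample, category) pairs that the tree labels correctly."""
    hits = sum(1 for sample, c in labelled_samples if classify(tree, sample) == c)
    return hits / len(labelled_samples)


# Samples in the (V1, ..., VN; C) form: a dict of field values plus a category.
samples = [
    ({"distance_to_center_km": 2.0, "near_park": 0}, "high_grade"),
    ({"distance_to_center_km": 8.0, "near_park": 1}, "medium_grade"),
    ({"distance_to_center_km": 12.0, "near_park": 0}, "low_grade"),
]
print(accuracy(TREE, samples))
```

A real classifier would learn the tree structure and thresholds from the training set; the sketch only shows how a finished tree assigns a category.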

Classifiers can be constructed by statistical methods, machine learning methods, neural network methods, and others. Statistical methods include Bayesian methods and non-parametric methods (nearest-neighbor learning, or instance-based learning); the corresponding knowledge representations are discriminant functions and prototype instances. Machine learning methods include decision tree methods and rule induction methods; the former produce a decision tree or discrimination tree, and the latter generally produce production rules. The main neural network method is the back-propagation (BP) algorithm, whose model is a feed-forward neural network (an architecture of nodes representing neurons and edges representing connection weights); the BP algorithm is in essence a nonlinear discriminant function [3]. In addition, a newer approach has recently emerged, the rough set method, whose knowledge representation is production rules.
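Of the construction methods listed above, instance-based (nearest-neighbor) learning is the simplest to show in code. The sketch below is a plain 1-nearest-neighbor classifier using Euclidean distance; the function name, the training tuples, and the category names are assumptions made for illustration.

```python
import math


def nearest_neighbor_classify(training, query):
    """Return the category of the training tuple closest to `query`.

    training: list of (feature_vector, category) pairs.
    query:    feature vector of the same length as the training vectors.
    """
    _, category = min(training, key=lambda pair: math.dist(pair[0], query))
    return category


# Hypothetical training set: two numeric features per tuple, two categories.
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
]
print(nearest_neighbor_classify(training, (0.9, 1.1)))
print(nearest_neighbor_classify(training, (4.5, 5.2)))
```

Here the stored training tuples themselves act as the prototypes, so there is no separate training step; the cost is paid at query time, when every stored tuple is compared with the query.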

Different classifiers have different characteristics. Three criteria are used to evaluate or compare classifiers: (1) predictive accuracy; (2) computational complexity; (3) conciseness of the model description. Predictive accuracy is the most commonly used measure for comparison, especially for predictive classification tasks, and the currently accepted way to estimate it is 10-fold cross-validation. Computational complexity depends on implementation details and the hardware environment; because data mining operates on massive databases, space and time complexity is a very important issue. For descriptive classification tasks, the more concise the model description, the more welcome it is. For example, classifiers expressed as rules by rule induction methods are easy to understand, whereas the results produced by neural network methods are hard to interpret.
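The cross-validation estimate of predictive accuracy can be sketched as follows. This is a minimal, unshuffled k-fold version; the helper names and the majority-class baseline used as the learner are hypothetical. In practice the samples would normally be shuffled or stratified before they are split into folds.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(samples, train_fn, k=10):
    """Estimate accuracy by holding out each fold once.

    samples:  list of (features, category) pairs.
    train_fn: takes a training list and returns a predict(features) function.
    """
    correct = 0
    for fold in k_fold_indices(len(samples), k):
        held_out = set(fold)
        train = [s for i, s in enumerate(samples) if i not in held_out]
        predict = train_fn(train)
        correct += sum(1 for i in fold if predict(samples[i][0]) == samples[i][1])
    return correct / len(samples)


def train_majority(train):
    """Hypothetical baseline learner: always predict the most common training category."""
    counts = {}
    for _, category in train:
        counts[category] = counts.get(category, 0) + 1
    majority = max(counts, key=counts.get)
    return lambda features: majority


# 15 samples of category "A" followed by 5 of category "B".
samples = [((i,), "A") for i in range(15)] + [((i,), "B") for i in range(15, 20)]
print(cross_val_accuracy(samples, train_majority, k=10))
```

Any learner that fits the train_fn interface can be substituted for the baseline, which makes it possible to compare classifiers on the same folds.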

It should also be noted that classification performance generally depends on the characteristics of the data: some data are noisy, some have missing values, some are sparsely distributed, some have strongly correlated fields or attributes, and some attributes are discrete while others are continuous or mixed. It is generally accepted that no single method suits data of all kinds.

Classification is very important in practical applications. For example, the grade of a house can be determined from its geographic location.

2.2 Cluster analysis

Clustering follows the principle that "like gathers with like": samples that carry no categories of their own are gathered into different groups, and each such group is then described. The aim is for samples in the same group to be similar to one another and for samples in different groups to be dissimilar. Unlike classification analysis, it is not known before clustering how many groups there will be, what kind of groups they are, or which spatial discriminating rules define them. Clustering is also intended to discover functional relationships among the attributes of spatial entities, with the mined knowledge expressed as mathematical equations whose variables are attribute names. Clustering methods include statistical methods, machine learning methods, neural network methods, and database-oriented methods. Spatial data mining algorithms based on cluster analysis include mean approximation algorithms [4], CLARANS, BIRCH, DBSCAN, and others. Research on cluster analysis of spatial data is currently a hot topic.
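To make the idea of grouping without predefined categories concrete, here is a compact sketch of DBSCAN, one of the algorithms named above. It is a simplified teaching version with a brute-force neighbor search; the parameter names eps and min_pts follow common usage, and the sample points are invented.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE (-1) marks points in no cluster."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later be relabelled as a border point
            continue
        cluster += 1                     # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Two dense groups and one isolated point (invented coordinates).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

The number of clusters is not given in advance; it emerges from the density parameters, which matches the description of clustering as grouping without prior knowledge of the groups.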

For spatial data, regions can be partitioned automatically according to geographic location and the presence of obstacles. For example, residential areas can be partitioned according to the distribution of ATMs in different locations. With this information, the placement of ATMs can be planned effectively, avoiding waste while not missing any business opportunity.

2.3 Association rule analysis

Association rule analysis is mainly used to discover correlations between different events, that is, when one event occurs, another often occurs as well. The focus of association analysis is to discover quickly those associations between events that have practical value. Its main basis is that the probability and conditional probability of the events should meet a certain level of statistical significance. A spatial association rule has the form X -> Y [S%, C%], where X and Y are sets of spatial or non-spatial predicates, S% is the support of the rule, and C% is its confidence. Spatial predicates take three forms: predicates that express topological structure, predicates that express spatial direction, and predicates that express distance [5]. Various spatial predicates can make up a spatial association rule, for example distance information (such as Close_to (near) and Far_away (far)), topological relations (such as Overlap and Disjoin (separation)), and spatial orientation (such as Right_of (right) and West_of (west)). In practice, most algorithms use the association characteristics of spatial data to improve classification algorithms so that they are suited to mining correlations in spatial data. The geographic location of one spatial entity can then be determined from another, which facilitates spatial location queries, reconstruction of spatial entities, and similar tasks. The general algorithm can be described as follows: (1) find the relevant spatial data according to the query requirements; (2) describe the spatial attributes and specific attributes using approximation; (3) filter out unimportant data according to the minimum support principle; (4) apply other means to refine the data further (such as overlay); (5) generate the association rules.
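The support S% and confidence C% of a rule X -> Y can be computed directly from counts. The sketch below assumes each spatial object has already been reduced to the set of predicates it satisfies; the predicate strings and the eight sample records are invented for illustration.

```python
def count(records, predicates):
    """Number of records that satisfy every predicate in `predicates`."""
    predicates = set(predicates)
    return sum(1 for record in records if predicates <= record)


def rule_stats(records, x, y):
    """Return (support, confidence) of the rule X -> Y as fractions.

    support    = |records satisfying X and Y| / |records|
    confidence = |records satisfying X and Y| / |records satisfying X|
    """
    n_x = count(records, x)
    n_xy = count(records, set(x) | set(y))
    support = n_xy / len(records)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical records: each town is the set of predicates that hold for it.
towns = (
    [{"is_large_town", "close_to(highway)", "adjacent_to(water)"}] * 3
    + [{"is_large_town", "close_to(highway)"}]
    + [{"is_large_town"}] * 4
)
x = {"is_large_town", "close_to(highway)"}
y = {"adjacent_to(water)"}
print(rule_stats(towns, x, y))
```

Step (3) of the algorithm corresponds to discarding any predicate set whose support falls below a chosen minimum, and rules are reported only when their confidence also reaches a chosen minimum.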

Association rules are usually divided into two kinds: Boolean association rules and multi-valued association rules. Multi-valued association rules are more complicated, and a natural idea is to convert them into Boolean association rules. Because mining spatial association rules requires computing many kinds of spatial relationships among a large number of spatial objects, it is very costly. A progressive-refinement optimization can be used in spatial association analysis: a fast algorithm first mines the larger data set coarsely, and a more expensive algorithm then improves the quality of the mining on the reduced data set. Because of this high cost, spatial association methods still need further optimization.
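One common way to realize this kind of two-stage mining is to filter candidate pairs with cheap bounding-rectangle tests before running an exact, more expensive geometric test. The sketch below applies that idea to a close_to predicate. It is an assumed illustration, not the specific algorithm of the paper: objects are approximated by their vertex lists, and all names, coordinates, and the threshold are hypothetical.

```python
import math


def mbr(points):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def mbr_distance(a, b):
    """Distance between two axis-aligned rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def exact_distance(p, q):
    """Minimum distance between the vertex sets of two objects (the costly step)."""
    return min(math.dist(u, v) for u in p for v in q)


def close_to_pairs(objects, threshold):
    """Return (coarse candidates, confirmed pairs) for the close_to predicate."""
    names = sorted(objects)
    boxes = {name: mbr(objects[name]) for name in names}
    # Stage 1: cheap filter. The rectangle distance never exceeds the vertex
    # distance, so no truly close pair is discarded here.
    candidates = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if mbr_distance(boxes[a], boxes[b]) <= threshold
    ]
    # Stage 2: expensive test, run only on the reduced candidate set.
    confirmed = [
        (a, b) for a, b in candidates
        if exact_distance(objects[a], objects[b]) <= threshold
    ]
    return candidates, confirmed


objects = {
    "town": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "lake": [(3, 0), (5, 0), (5, 2)],
    "road": [(3, 10), (10, 10), (10, 3)],
}
candidates, confirmed = close_to_pairs(objects, threshold=1.5)
print(candidates)
print(confirmed)
```

In this toy data all three pairs pass the coarse rectangle test, but only the town and the lake are confirmed as close by the exact test, which shows how the first stage can admit false candidates that the second stage removes.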

For spatial data, association rule analysis can discover associations between geographic locations. For example, 85% of the large towns close to a highway are adjacent to water, or the objects usually found adjacent to a golf course are parking lots.

3 Research directions of spatial data mining technology

3.1 Handling different types of data

Most databases are relational, so performing data mining effectively on relational databases is critical. However, different applications involve many kinds of data and databases, often with complex data types such as structured data, complex objects, transaction data, and historical data. Because of the diversity of data types and the differing goals of data mining, a single data mining system is unlikely to handle all kinds of data, so specific data mining systems are needed for particular data types.

3.2 Efficiency and scalability of data mining algorithms

Massive databases typically have hundreds of attributes and tables and millions or tens of millions of tuples. Gigabyte-scale databases are no longer rare, and terabyte-scale databases have appeared. High-dimensional large databases not only enlarge the search space but also increase the likelihood of discovering spurious patterns. Domain knowledge must therefore be used to reduce the dimensionality and remove irrelevant data, thereby improving the efficiency of the algorithms. Algorithms that extract knowledge from large spatial databases must be efficient and scalable; that is, the running time of a data mining algorithm must be predictable and acceptable, and algorithms with exponential or high-order polynomial complexity have no practical value. However, when an algorithm uses limited data to find suitable parameters for a particular model, it may overfit the data, which also reduces its effectiveness.

3.3 Interactive User Interface

The results of data mining should accurately reflect the data mining request and be easy to express. Discovered knowledge should be examined from different perspectives and presented in different forms, with mining requests and results expressed through high-level languages and graphical interfaces. Many knowledge discovery systems and tools lack interaction with users and find it difficult to make effective use of domain knowledge. To address this, Bayesian methods and the deductive capability of deductive databases can be used to discover knowledge.

3.4 Interactive mining of knowledge at multiple abstraction levels

It is difficult to predict what kind of knowledge will be mined from a database, so a high-level data mining query should be treated as a probe that provides clues for further exploration. Interactive mining lets users define a data mining request interactively, deepen the mining process, and flexibly view the results from different angles at multiple abstraction levels.

3.5 Mining information from different data sources

LANs, WANs, and the Internet connect multiple data sources into large distributed, heterogeneous databases, and discovering knowledge from formatted and unformatted data with different semantics is a challenge for data mining. Data mining can reveal knowledge in large heterogeneous databases that ordinary queries cannot discover. The huge scale and wide distribution of these databases, together with the computational complexity of data mining methods, call for parallel and distributed data mining.

3.6 Privacy and security

Data mining can view data from different angles and at different abstraction levels, which affects the privacy and security of the data. Studying the illegal intrusion into data that data mining may cause can improve database security methods and help avoid information leakage.

3.7 Integration with other systems

A discovery system with a single method and a single function is bound to have a limited range of application. To discover knowledge in a wider range of fields, a spatial data mining system should be integrated with databases, knowledge bases, expert systems, decision support systems, visualization tools, networks, and other technologies.

4 Problems to be studied

Although great achievements have been made in the research and application of spatial data mining technology, some theoretical and applied problems still urgently need to be solved.

4.1 Efficiency and scalability of data access

The complexity and sheer volume of spatial data, and the appearance of terabyte-scale databases, inevitably enlarge the search space of discovery algorithms and increase the blindness of the search. How to remove task-irrelevant data effectively, reduce the dimensionality of the problem, and design more efficient mining algorithms poses a huge challenge for spatial data mining.

4.2 Improving the lack of time attributes and the static storage of some current GIS software

Because applications of data mining depend to a large extent on temporal relationships, static data storage severely restricts them. The layer-based computing model and the complete separation of spatial data at different scales also put heavy obstacles in the way of spatial data mining. Spatial entities and attribute data are linked only by identification codes. This one-dimensional link inevitably loses much connection information and cannot effectively represent multi-dimensional and implicit internal relationships, which adds to the complexity of data mining computation and greatly increases the workload and manual intervention in the data preparation phase.

4.3 Refinement of discovered patterns

When spatial patterns are discovered, a large number of results are obtained, some of which are irrelevant or meaningless. Domain knowledge can then be used to refine the discovered patterns further and obtain meaningful knowledge.

Other important research and application topics in spatial data mining include raster-vector integrated data mining, data mining in distributed environments, data mining query languages, and new, efficient mining algorithms.

5 Summary

With the continuous development of GIS and data mining and the deepening of scientific research in related fields, spatial data mining technology has been studied in ever greater depth. In the near future, integrated GIS, GPS, and RS systems that incorporate mining technology will move toward intelligence, networking, globalization, and popularization.

References:

[1] Yan Lun et al. Geographic Information Systems: Principles, Methods and Applications. Science Press, 2001.

[2] 邸 昌. Theory and Method for Spatial Data Mining and Knowledge Discovery [D]. Wuhan: Wuhan Surveying and Mapping University, 1999.

[3] Cai Zixing, Xu Guangyou. Artificial Intelligence and Its Applications. Tsinghua University Press, 1999. 206~216.

[4] Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In: Proceedings of the 24th International Conference on Very Large Databases. New York, 1998. 428~439.

[5] Zhu Jianqiu, Zhang Xiaohui, Mei Weijie, Zhu Yang Yong. Data Mining Language Analysis [Z].

http://www.sqlmine.com/warehouse/htm/40.htm.

The Technology and Methods of Spatial Data Mining

GE Ji-ke

(Information College South West Agricultural University Chongqing 400716)

Abstract: This paper introduces the theory and characteristics of spatial databases and spatial data mining, analyses the hierarchy, methods, and knowledge classification of spatial data mining, introduces spatial classification rules, spatial clustering rules, and spatial association rules, and points out unsolved problems, trends, and directions.

Key Words: Spatial Data Mining, Classification, Clustering, Association Rules
