An improved multidimensional index based on R-TREE [

xiaoxiao2021-03-06  66

An improved multidimensional index based on R-TREE

Abstract: In order to adapt to the application of spatial multidimensional index technology in video database

Keywords: R-Tree video database multi-index data spatial segmentation

1 Introduction

The expansion of information has led to more and more problems that modern multimedia database retrieval needs. In building an index, the most important thing is how to construct an efficient index algorithm to support multimedia database systems, especially how effective use algorithms to achieve acceleration retrieval. In summary, the multimedia database index structure should be achieved: support high dimensional data space; effectively divided data space to accommodate an indexed organization; efficient implementation of a variety of query mode systems. Many newly proposed index structures are simply simply to accelerate some kind of query or improvement of performance, but often ignore other effects, which is likely to cause more unnecessary performance consumption.

There are already many scholars that have proposed their spatial models of multimedia databases, especially video databases, have a lot of results in terms of spatial index structure. For example, R-Tree [1], KD-Tree [], mesh file [], and structural body based on these structures [], and more.

Due to index structures such as Grid File (grid file), linear quadrants have poor performance on high dimensional data, and thus dynamically scalable index structure R-Tree [], r * -tree [] is generally application.

This article focuses on dynamic indexing structure of similarity retrieval based on vector spatial model. A vector representation of the fixed dimension in the vector space is represented by the index structure based on this.

In Section 2 of this paper, the data model is proposed, and the improved algorithm and test experiments in Section 3, 4, and the analysis of the experimental results of Section 5, and finally conclusions.

2, database models and basic definitions

2, 1 model hypothesis definition

Various feature vectors extracted by the feature extraction algorithm

The system feedback has different feature corresponding to the weight

Image combination characteristics

Image model

;

The same degree of A and B images

Distance available in European space

To describe it.

It is the weight factor. Feature distance

Small, similarity

The bigger it.

Through the definition of image combination characteristics and similarity, the image content information is taken to make the search conversion to the image content to find the combination of the same degree of similarity to the image to be retrieved, that is, the minimum European distance of the combination feature. The problem.

2, 2 classification model based on feature

As shown in Figure 2-1, the first level is a global or based on the basic semantic classification, which is performed by manual or various types of classifiers or clustering algorithms. This article is divided into video, audio, text, etc., and the second level of classification can be used, such as domain classification, feature classification, or annotation classification. The classification method also uses a manual or classifier; the third level of terminals involve specific video streams, lens segmentation, feature extraction, and final processes, will be a feature picture set of each frame. Finally, based on color histogram information, the index structure is used to use the index structure of this article.

2, 3 image feature information based on PCA

This article focuses on the adaptation of the multimedia database in the high-dimensional index structure, and thus the color characteristics are used as research objects.

For a given image / video frame I, its corresponding chrominance image I 'is obtained. It is formed to form an amplitude vector obtained by the I 'after the discrete payment transformation (DFT), and X is formed into one vector in the sequence of the Fourier Spectrum (amplitude spectrum) matrix. In this way, we get A high dimensional color feature X of the image. In the image retrieval application, high-dimensional typical vector is generally not directly used, but to perform design, then the image retrieval of the low-dimensional feature after design, then we will apply the main component below. The analysis method (PCA) or KL transform performs high-dimensional vectors X, compresses the information contained in X to a few dimension therein, and represents the color of the image to the image retrieval.

The data image herein is represented by the way the PCA, K-L transformation described in the use of []. In this way, each image in the image is used, and the K-L transform is performed in an offline manner, and the resulting image feature vector is stored in the database. In the actual image retrieval system, only the first k KL transform coefficient (the number of K

3, R-Tree data spatial segmentation algorithm improvement

3,1 R-TREE algorithm basic improvement

As shown in Figure 2-1, the model uses a multi-hierarchical structure. Use the hierarchical index to reduce the size of the index file accordingly; for different levels, the optional index structure can also be different, increasing the flexibility of the index system. Of course, this mode is selected, bringing another problem, that is, the implementation of data classification in hierarchies.

With hierarchical structure, the classification of data is very important. After the data classification, the data spatial segmentation is to be performed. The purpose of performing MBR data spatial segmentation is to increase the efficiency of R-TREE on the area query and K-NN (nearest neighbor) query. Through improved data spatial segmentation methods can achieve the purpose of simplifying the K-NN query and regional query, the main improvement is to speed up the data and reduced speed of the query by two MBR structural design. Specific spatial segmentation algorithm see [3, 2 data spatial segmentation algorithm]

3, 2 data spatial segmentation algorithm

The specific design is as follows:

Represents the first level of data blocks

Represents Secondbr, the minimum level of data points

The minimum full coverage rectangle generated by the entire data set

Figure 3-1 Schematic diagram of data distribution in spatial segmentation

The classified data set DataSet is divided into the first layer of data space by offset the optimal rectangular algorithm to obtain a series of FIRSTBRs (rectangles from the first layer divided).

Then, in FirstBRS, the secondary division is performed using the K-offset optimal rectangular algorithm to obtain a series of secondbrs (the rectangle of the second layer division) in FigSTBRS, which contains K data points (there may be less than one rectangle k).

1. For regional queries, as long as the central point of several FirstBRs is similar to the similarity of the query condition, select the suitable firstBRS and then refine the refinement comparison, without the need to query all the data sets. The query data set is reduced from several data sets adjacent.

2. For K-NN queries: First, first, the first-firstBRS is obtained, and the location of the first FIRSTBRS is obtained, and then, as long as the center point of SECONDBRS is compared to the similarity of the SECONDBRS, it is finally determined for several secondbrs, and the basic comparison Operation looks out the k result data point. Also this is also simplified on the query data.

〖Offset Most Rectification Algorithm〗

Prerequisites: N data points, distributed in the R-dimensional space, the representation of each data point

Algorithm function: Regional divide the n data points to obtain multiple FIRSTBRS (rectangular) to ensure that the overall data coverage of the rectangle is minimal. Regional rectangular domain value

(Ie the radius of the regional rectangle center to the rectangular boundary

).

Algorithm step:

1, in N points, take a little A

, Immediately constitute a regional rectangle

, which is

`

;

2, determine if there is a data point B

(Or more)

Middle, judgment method:

for

,in case

So the B point is

in.

If you don't meet the conditions, you will continue to look up. If there is no data point to meet the requirements, the point is reserved (so-called reservation, "the information saves the point information into a temporary list templist, then jumps to 1 operation, reselect points; if one is found B Data points, continue; 3, for the B data point found in 2, two points of A, B constitute a rectangular area

,mobile

Center to

Center

(Translation);

4, loop operation 23 until no data point can be inserted

; Re-point

Perform 1 to 4 operation;

5. Finally, countless sites remain behind after the above process, then the algorithm ends.

If the Templist exists unparalleled data points, find the data point C constructed with the data point distance.

(I represents the number of such areas that may exist).

At this point, the algorithm ends and the data set has been valid, and the result is a series of FIRSTBRs.

〖K- Offset Most Model Algorithm In fact, this algorithm is derived by the offset of the optimal rectangular algorithm, and simple changes are K. The remaining algorithm is the most important rectangular algorithm of offset.

It is undeniable that this split method, for data sets of data, there is poor performance defects, but considers that the data itself in the corresponding multimedia database is basically not available. Because for multimedia data, especially video data, it is unable to add or delete the form of updates, some are just the entire update change of video data, and this article considers index between independent video data, considering each video The index after the internal lens division. Therefore, this split algorithm is used to apply the index of this paper.

4. Implementation and experimental analysis of improved algorithm

In order to improve the above-described data division algorithm, the data structure of the R-Tree is improved.

4,1 Improved R-Trees Structure Illustration

The improved R-Tree structure can be expressed as the following figure:

Figure 4-1 Structure Simplified Schematic of R-Tree

4, 2 improved algorithm verification experiment design

The data source used is processed actual picture data.

The feature vector is shown that the image representation of the PCA conversion feature is used as a feature vector, which is represented as a characteristic vector.

;among them

Represents the value of the color histogram of the first dimension.

Assume that the degree of metrics between query images:

;

Verification experiment design

1. Improve the correctness of the scheme

For array data sets, the build index and similar queries are made by R-TREE and improved algorithms. Compared with the result of the result data set and the time consumption of both.

2. Performance test of improvement

For the same picture set, different data extraction methods are used to extract several sets of dimensions different data sets. Construct indexes and similar queries with R-TREE and improved algorithms. Compared with the results and time consumption.

The data set used in this experiment is manually design, of which data points are approximately as shown below:

The minimum full coverage rectangle generated by the entire data set

Represents the first level of data blocks

Represents Secondbr, the minimum level of data points

representative

Figure 4-2 Schematic diagram of manual data

4, 3 experimental results analysis

Feedback similar node number

Experimental results illustration:

Data set size

Data set size

Random query node number

Random data set

4-3-1

Random data set

Figure

4-3-2

Experiment 1 adopts random distribution data, the number of data is 50, 100, 200, 500, 1000, within 20 data space dimension (20). As shown in Figures 4-3-1, the threshold (similarity) of the query execution under the conditions of 0.75 (5 times each experiment, the average result), and Figure 4-3-2 gives the query execution The result of the number of similar nodes after the post-feedback (5 times each experiment, the average value). The number of query nodes of the improved R-Tree algorithm is significantly less than the original R-Tree and the increase in the amount of data is more obvious (Fig. 4-3-1), and the same feedback of the two is relatively Almost (Figure 4-3-2); Improved R-Tree algorithm is better than original R-TREE in terms of query, especially when the corresponding amount of data is large.

The experiment selection calculates the scheme of the number of access nodes. Since the improved R-Tree algorithm is the same as the R-Tree, that is, the unit consumption is the same, the number of R-TREE queries the number of access points and the improved algorithm have a relationship.

Therefore, it is obviously

This proves the feasibility and correctness of our use.

In summary, from the above two figures, when the same data uses R-TREE and the improved R-TREE query, the latter's average number of access data is basically lower than the former, which reflects the latter from the other side. Low consumption in the CPU time.

From this point, the improved algorithm has effectively reached the function of the query speed.

Evenly distributed data set Figure 4-3-3 Non-uniform distribution data set Figure 4-3-4

Experiment 2 Decisive the effectiveness of the data segmentation, using the data set of different data distribution modes, the determination of the construction of the trees and the size of the tree structure. Figure 4-3-3 Number of nodes under uniform data distribution conditions; Figure 4-3-4 Non-uniform data distribution conditions Number statistics.

Special data set mode, sparse mode

Figure

4-3-3

Graph

4-3-4

The picture given the right image, the data node constitutes a linear representation, it can be seen that with the increase of the number of data, the number of nodes is increased, which reflects

R-Tree

A disadvantage constructed in the index, the index file is increased soon, and the pressure on the database store is getting bigger and bigger. In terms of memory use, the acceleration function of the search has a certain impact.

For uniform distributed data, the experiment is performed according to the variation of data objects. As shown in Figure 4-3-3, the R-TREE algorithm generates an indexed number of nodes to increase the significant algorithm.

Figure 4-4

For non-uniform distributed data, an extreme distribution data and data distribution are very strong, as shown below

4-4

From this point, for the improvement algorithm, the number of primary nodes generated by the group must be due to two rectangular areas (

Firstbr

with

Secondbr

The adoption and reduction, the larger the data amount, the more obvious it will be explicitly. Easy to know, when the size of the times

CPU

Each time the query operation calls the page size image while the improved algorithm will reach the optimal (of course, the ideal state is not considered)

I / O

It is the impact of the size and time consumption of this structure, only when the page size is

=

The data size of the secondary tree). Similarly, from the figure

4-3-4

In the two figures we got the picture

4-3-3

The same result.

After the improved R-Tree algorithm, with the increase of the data object, the increase of the number of nodes produced by the index is no longer as obvious as the R-Tree algorithm, Fig. 4-3-3 and Figure 4-3 -4, the two curves in the comparison chart, the curve generated by the improved algorithm is relatively flat, and the change is alleviated. This shows that the algorithm obtained after the improvement has achieved the expected effect on the reduction in the size of the index file. *********

Experimental data matching result

News Frequency Section Video

Football band video

Film clip video

Table 4-1

Experiment 3 Decide that the index uses a targeted experiment, using different experimental data, and performs the determination of the construction of trees and the size of the tree structure.

**********

****** The background change is not a big video segment its index query function is better!

analysis Summary:

Through the above experimental data analysis, the improved R-TREE algorithm has a good improvement in overall performance than classic R-Tree algorithms. In particular, the support of multimedia streaming data is adopted, and the adaptability and selectivity of the index structure are added by optimizing the combination. Improvements in the data processing mode improve the role of the algorithm in retrieval speed, while avoiding the excessively large huge index file, better avoiding memory call indexing is the possible repetitive page conversion time consumption.

6, conclude

This paper combines high dimensional vector index related knowledge, and has been studied for classic multidimensional spatial indexes (focusing on the R-Tree) algorithm. The basic idea of ​​existing popular index structural models has been expanded, and a multimedia database index model with higher adaption is proposed, and based on this method of using data drop, data set hierarchy, data initialization, etc. Improvement with the design to make up for the defects of the classic algorithm, and achieve the expected experimental effect.

This article only analyzes the data spatial segmentation method in the R-Tree, the next study will face, to build the time of the index, better support the performance of multimedia data storage query, better to K-NN Support for similar queries, optimize the design implementation of the data classifier (according to semantic), and the design of the corresponding storage structure.

转载请注明原文地址:https://www.9cbs.com/read-113812.html

New Post(0)