During a project, a colleague made a mistake while importing data and accidentally loaded the data into a table twice, so every record in that table had a duplicate. The table holds tens of millions of rows and belongs to a production system, so deleting all the records was not an option; the duplicate records had to be removed, and quickly.
With that in mind, this post summarizes the methods for deleting duplicate records, along with the advantages and disadvantages of each.
For ease of explanation, assume a table TBL with three columns, col1, col2 and col3, where col1 and col2 together form the primary key, and there is an index on col1, col2.
1. Create a temporary table
You can copy the data into a temporary table, delete the data from the original table, and then load the data back into the original table. The SQL statements are as follows:
CREATE TABLE tbl_tmp AS SELECT DISTINCT * FROM tbl; -- copy the distinct rows into a temporary table
TRUNCATE TABLE tbl; -- clear the records in the original table
INSERT INTO tbl SELECT * FROM tbl_tmp; -- insert the data from the temporary table back
This method achieves the goal, but for a table with tens of millions of records it is clearly slow. On a production system it places a heavy load on the system, so it is not feasible.
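The three steps above can be tried on a toy table. The following is only a sketch: it assumes Python's built-in sqlite3 module as a stand-in for Oracle, uses made-up sample rows, and replaces TRUNCATE with a plain DELETE because SQLite has no TRUNCATE statement.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for Oracle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 INTEGER, col2 INTEGER, col3 TEXT)")

# Hypothetical sample data: every record is loaded twice.
rows = [(1, 1, "a"), (1, 2, "b"), (2, 1, "c")]
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows + rows)

# Step 1: copy the distinct rows into a temporary table.
conn.execute("CREATE TABLE tbl_tmp AS SELECT DISTINCT * FROM tbl")
# Step 2: empty the original table (DELETE here, TRUNCATE in Oracle).
conn.execute("DELETE FROM tbl")
# Step 3: load the distinct rows back into the original table.
conn.execute("INSERT INTO tbl SELECT * FROM tbl_tmp")
conn.commit()

remaining = conn.execute(
    "SELECT col1, col2, col3 FROM tbl ORDER BY col1, col2"
).fetchall()
print(remaining)  # [(1, 1, 'a'), (1, 2, 'b'), (2, 1, 'c')]
```

Every row of the table is written twice more (once into the temporary table, once back), which is where the cost on a large table comes from.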
2. Use ROWID
In Oracle, every record has a ROWID, which is unique across the entire database. The ROWID identifies the data file, block, and row in which Oracle stores the record. Duplicate records may have identical values in every column, but their ROWIDs always differ. The SQL statement is as follows:
DELETE FROM tbl WHERE rowid IN (SELECT a.rowid FROM tbl a, tbl b WHERE a.rowid > b.rowid AND a.col1 = b.col1 AND a.col2 = b.col2);
This statement is suitable when each record has only one duplicate. If a record can have N duplicates and N is unknown, the self-join has to compare every pair of rows that share a key, so the following methods should be considered.
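As a rough illustration of the self-join delete, the sketch below runs the same statement against SQLite through Python's sqlite3 module. The assumptions: SQLite's implicit rowid plays the role of Oracle's ROWID (it is a plain integer, not a physical address), and the sample rows are invented, with each record loaded exactly twice.

```python
import sqlite3

# Assumption: SQLite's implicit integer rowid stands in for Oracle's ROWID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 INTEGER, col2 INTEGER, col3 TEXT)")

# Hypothetical sample data: each record appears twice (rowids 1 to 6).
rows = [(1, 1, "a"), (1, 2, "b"), (2, 1, "c")]
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows + rows)

# Delete every row that has a partner with the same key and a smaller rowid.
conn.execute("""
    DELETE FROM tbl
    WHERE rowid IN (
        SELECT a.rowid
        FROM tbl a, tbl b
        WHERE a.rowid > b.rowid
          AND a.col1 = b.col1
          AND a.col2 = b.col2
    )
""")
conn.commit()

survivors = conn.execute(
    "SELECT rowid, col1, col2, col3 FROM tbl ORDER BY rowid"
).fetchall()
print(survivors)  # [(1, 1, 1, 'a'), (2, 1, 2, 'b'), (3, 2, 1, 'c')]
```

The row with the smaller rowid in each pair is the one that survives, because only the "greater" side of the comparison is selected for deletion.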
3. Use the MAX or MIN function
This method also relies on ROWID; unlike the previous one, it uses the MAX or MIN function. The SQL statement is as follows:
DELETE FROM tbl a WHERE rowid NOT IN (SELECT MAX(b.rowid) FROM tbl b WHERE a.col1 = b.col1 AND a.col2 = b.col2); -- MIN can be used here instead of MAX
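The correlated MAX form can be checked on a small table where the number of copies differs per key. This is a sketch under assumptions: Python's sqlite3 module stands in for Oracle, SQLite's implicit rowid stands in for ROWID, the sample rows are made up, and the outer table is referenced by its name (tbl) instead of an alias to keep the statement portable across SQLite versions.

```python
import sqlite3

# Assumption: SQLite's implicit integer rowid stands in for Oracle's ROWID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 INTEGER, col2 INTEGER, col3 TEXT)")

# Hypothetical sample data with a different number of copies per key.
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    (1, 1, "a"),                            # no duplicate, rowid 1
    (1, 2, "b"), (1, 2, "b"),               # two copies, rowids 2 and 3
    (2, 1, "c"), (2, 1, "c"), (2, 1, "c"),  # three copies, rowids 4 to 6
])

# For each row, look up the largest rowid among rows with the same key
# and delete the row unless it is that one.
conn.execute("""
    DELETE FROM tbl
    WHERE rowid NOT IN (
        SELECT MAX(b.rowid)
        FROM tbl b
        WHERE b.col1 = tbl.col1
          AND b.col2 = tbl.col2
    )
""")
conn.commit()

kept = conn.execute(
    "SELECT rowid, col1, col2 FROM tbl ORDER BY rowid"
).fetchall()
print(kept)  # [(1, 1, 1), (3, 1, 2), (6, 2, 1)]
```

Exactly one row per key remains regardless of how many copies existed, and swapping MAX for MIN would keep rowids 1, 2 and 4 instead.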
Or use the following statement:
DELETE FROM tbl a WHERE rowid < (SELECT MAX(b.rowid) FROM tbl b WHERE a.col1 = b.col1 AND a.col2 = b.col2); -- if MAX is changed to MIN, the "<" in the WHERE clause must be changed to ">"
The following method is basically the same as the one above, but it uses GROUP BY, which reduces the explicit comparison conditions and improves efficiency. The SQL statements are as follows:
DELETE FROM tbl WHERE rowid NOT IN (SELECT MAX(rowid) FROM tbl t GROUP BY t.col1, t.col2);
DELETE FROM tbl WHERE (col1, col2) IN (SELECT col1, col2 FROM tbl GROUP BY col1, col2 HAVING COUNT(*) > 1) AND rowid NOT IN (SELECT MIN(rowid) FROM tbl GROUP BY col1, col2 HAVING COUNT(*) > 1);
There is also a method that is better suited to tables that have relatively few records and have an index. Suppose there is an index on col1, col2 and the TBL table contains few records; the SQL statement is as follows, again using GROUP BY to improve efficiency: