HP’s Oracle Blog

Notes of a Junior on the Path to Oracle Mastery


Notes of a Junior on the Path to Oracle Mastery
On this blog I am writing about thoughts, solutions, failures and pitfalls on the path to Oracle Mastery. It’s still a long way to go. On this blog, the main focus is learning Oracle.

Removing Duplicate Rows

Author: HP Fuchs | Category: Oracle Database, SQL Server

Oracle
For our example we need to insert a duplicate row into HR.EMPLOYEES first. The new row has to be identical on first_name, last_name, email and department_id columns. To insert such a row, we have to disable the unique constraint on the HR.EMPLOYEES table.

ALTER TABLE employees DISABLE CONSTRAINT emp_email_uk;

INSERT INTO employees (
EMPLOYEE_ID, FIRST_NAME, LAST_NAME, EMAIL, PHONE_NUMBER,
HIRE_DATE, JOB_ID, SALARY, COMMISSION_PCT, MANAGER_ID,
DEPARTMENT_ID
) VALUES (
‘501′, ‘Donald’, ‘OConnell’, ‘DOCONNEL’, ‘650.507.9833′, TO_DATE(’21.06.99′,’DD.MM.RR’), ‘IT_PROG’, ‘2600′,null,’124′,’50′;

Duplicate rows can be removed like this:

DELETE FROM employees
WHERE rowid IN (
SELECT rowid
FROM employees
MINUS
SELECT MAX(rowid)
FROM employees
GROUP BY first_name, last_name, email, department_id
);

Rows will be considered duplicate, if they match on first_name, last_name, email and department_id. In production it is not wise to delete duplicate rows without copying the table first.

SELECT * FROM employees INTO employees_copy;

To enable the emp_email_uk unique constraint again:

ALTER TABLE employees ENABLE CONSTRAINT emp_email_uk;

SQL Server
In SQL Server it is a bit trickier to delete duplicate rows. It is not possible without a second table, for example a temporary table. For our examples we need a table with a few rows, one of them is duplicate.

CREATE TABLE t (
c1 INT NULL,
c2 INT NULL,
c3 INT NULL);

INSERT INTO t (c1, c2, c3) VALUES (1, 2, 3);
INSERT INTO t (c1, c2, c3) VALUES (4, 5, 6);
INSERT INTO t (c1, c2, c3) VALUES (7, 8, 9);
INSERT INTO t (c1, c2, c3) VALUES (7, 8, 9);

It’s easy if a row is considered duplicate when all column values are identical:

SELECT DISTINCT * INTO #t_tmp FROM t;
TRUNCATE TABLE t;

INSERT t SELECT * FROM #t_tmp;
DROP TABLE #t_tmp;

SELECT * FROM t;

If somebody by accident has loaded the same file twice into our table, and we have a column to log the insert date, rows are duplicate but not identical. SELECT DISTINCT won’t help anymore here.

ALTER TABLE t ADD d DATETIME NOT NULL DEFAULT getdate();

INSERT INTO t (c1, c2, c3) VALUES (1, 2, 3);
INSERT INTO t (c1, c2, c3) VALUES (4, 5, 6);
INSERT INTO t (c1, c2, c3) VALUES (7, 8, 9);

SELECT * FROM t;

Here a UNIQUE INDEX with the IGNORE_DUP_KEY option may help.

CREATE TABLE #t_tmp (
c1 INT NULL,
c2 INT NULL,
c3 INT NULL,
d DATETIME NULL);

CREATE UNIQUE INDEX ui ON #t_tmp (c1, c2, c3) WITH IGNORE_DUP_KEY;

INSERT #t_tmp SELECT * FROM t;
TRUNCATE TABLE t;

INSERT t SELECT * FROM #t_tmp;
DROP TABLE #t_tmp;

SELECT * FROM t;

September 25th, 2006.


One Response to “Removing Duplicate Rows”

  1. Hyejin Says:

    I get useful information. Thank you.
    Have a nice day ^^

Your comment