Delete duplicates or old revisions from a Snowflake table
While working on one of my data engineering projects, I faced the challenge of deleting duplicate records from an existing table in a production Snowflake database. I assumed it would be easy, since Snowflake is a well-established cloud data platform.
I started by writing a direct DELETE query, expecting it to just work, but quickly learned that it is not that simple and straightforward. I found solutions eventually, though they are less direct than you might expect.
Since many data engineers are likely to run into this problem, I decided to write this article. I collected the following scenarios, which cover the common cases for deleting duplicate records from an existing production table.
Scenario — 1
Delete duplicate records from a table that has a primary key, but one that is not a sequential number: an auto-generated UUID or id. Rows can be duplicated based on the other columns. So how can you delete the duplicate records?
CREATE TABLE TBL1 (
uuid VARCHAR(100),
name VARCHAR(100),
age NUMBER,
dob DATE,
email VARCHAR(100),
address TEXT,
mobile VARCHAR(15)
)
DELETE FROM TBL1 WHERE uuid IN (
SELECT uuid
FROM TBL1
-- keep the first row of each duplicate group; rows numbered > 1 are the duplicates
QUALIFY ROW_NUMBER() OVER (
PARTITION BY name, age, dob, email, address, mobile
ORDER BY name
) > 1
);
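Before running the delete, it can be worth previewing the rows that would be removed, using the same window definition:

```sql
-- Preview the duplicate rows that the DELETE would remove
SELECT *
FROM TBL1
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY name, age, dob, email, address, mobile
    ORDER BY name
) > 1;
```

If the preview returns unexpected rows, the PARTITION BY column list probably does not match your definition of a duplicate.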
Note — If the table is huge, I would probably use the solution from Scenario 2.2 instead.
Scenario — 2
Delete duplicate records from a table that has no primary key. Rows can be duplicated across all the columns in the table. So how can you delete the duplicate records?
CREATE TABLE TBL1 (
name VARCHAR(100),
age NUMBER,
dob DATE,
email VARCHAR(100),
address TEXT,
mobile VARCHAR(15)
)
Scenario — 2.1
If the table is not huge, the following approach can be used to delete duplicate records. Strictly speaking, we are not deleting rows but overwriting the table.
INSERT OVERWRITE INTO TBL1
SELECT DISTINCT * FROM TBL1;
The OVERWRITE keyword ensures that the table is truncated within the transaction before the data is inserted; once all the distinct records are inserted, the transaction is committed.
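To estimate how many duplicates the overwrite will remove, you can compare the total row count against the distinct row count first. This is just a sanity check, not a required step:

```sql
-- total_rows minus distinct_rows = number of duplicate rows that will be removed
SELECT
    (SELECT COUNT(*) FROM TBL1) AS total_rows,
    (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM TBL1)) AS distinct_rows;
```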
Points to remember —
- If the data size is large and you still use this approach, keep in mind that storage usage won't decrease immediately: because of the data retention period set on the table, the old data is still tracked for the Time Travel facility.
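If reclaiming storage quickly matters more than keeping Time Travel history, the retention period can be lowered on the table. This is a trade-off, not a requirement, and the value of 0 below is only an example (it disables Time Travel for the table):

```sql
-- Check the current retention period on the table
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE TBL1;
-- Reduce it so the superseded data ages out sooner
ALTER TABLE TBL1 SET DATA_RETENTION_TIME_IN_DAYS = 0;
```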
Tested with the following data —
- Records count — 2 million records
- Warehouse — Medium size
- Data size of table — 200 MB
- Duplicates count — 50%.
- Time taken to delete — 6.7 seconds.
Scenario — 2.2
If the table is very large, the following approach can be used to delete duplicate records.
This is the ultimate solution, usable in all the scenarios: create a new table containing only the unique records from the original table, according to your criteria.
Next, validate the new table once, and as a final step, swap the two tables.
The queries for the steps above are listed below.
-- Create a new table containing only the unique records, as per your criteria
CREATE TABLE TBL1_UNIQUE AS
SELECT *
FROM TBL1
QUALIFY ROW_NUMBER() OVER (
PARTITION BY name, age, dob, email, address, mobile -- the columns that define uniqueness
ORDER BY name -- any column works, since all columns hold the same data for duplicate records
) = 1;
/*
Once the table is manually validated and no records are
missing and no duplicates exist
*/
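The manual validation can be sketched as a pair of count checks; both queries should return the same number (column list assumed from the table definition in Scenario 2):

```sql
-- Row count of the deduplicated table
SELECT COUNT(*) FROM TBL1_UNIQUE;
-- Distinct row count of the original table; should match the count above
SELECT COUNT(*) FROM (SELECT DISTINCT name, age, dob, email, address, mobile FROM TBL1);
```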
-- Rename existing table to TBL1_BACKUP
ALTER TABLE TBL1 RENAME TO TBL1_BACKUP;
-- Rename newly created table with unique records to TBL1
ALTER TABLE TBL1_UNIQUE RENAME TO TBL1;
-- Once everything looks good drop the backup table
DROP TABLE TBL1_BACKUP;
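As an alternative to the two RENAME statements, Snowflake's ALTER TABLE ... SWAP WITH exchanges the two tables in a single atomic operation, which avoids the brief window where TBL1 does not exist:

```sql
-- Atomically exchange the two tables
ALTER TABLE TBL1 SWAP WITH TBL1_UNIQUE;
-- TBL1 now holds the deduplicated data; TBL1_UNIQUE holds the original rows
DROP TABLE TBL1_UNIQUE;
```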
Note — With this last solution, you will lose the history of the original table, because you are creating a new table and then renaming it to the original table name.
Scenario — 3
Delete old revisions of records from a table that has a composite key. Look at the table design below and consider how to delete the older revisions of the user details. In this table, user_uuid is the identifier for a user but is not unique, so multiple entries can exist for the same user with the same user_uuid.
CREATE TABLE TBL1 (
user_uuid VARCHAR(100),
name VARCHAR(100),
age NUMBER,
dob DATE,
email VARCHAR(100),
address TEXT,
mobile VARCHAR(15),
updated_at TIMESTAMP_NTZ(0)
)
The table above can contain the same user multiple times because we wanted to maintain a history of that user's details, but now you have decided to delete all revisions other than the latest.
You can use the previous solution in exactly the same way, just with different PARTITION BY and ORDER BY columns.
-- Create a new table that keeps only the latest record for each user
CREATE TABLE TBL1_UNIQUE AS
SELECT *
FROM TBL1
QUALIFY ROW_NUMBER() OVER (
PARTITION BY user_uuid -- user_uuid identifies the user
ORDER BY updated_at DESC -- order by timestamp to keep the latest record
) = 1;
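For this scenario, a simple validation is to confirm the new table has exactly one row per user, before doing the renames while the new table is still named TBL1_UNIQUE:

```sql
-- Both counts should be equal: one row per user_uuid
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT user_uuid) AS distinct_users
FROM TBL1_UNIQUE;
```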
/*
Once the table is manually validated and no records are
missing and no duplicates exist
*/
-- Rename existing table to TBL1_BACKUP
ALTER TABLE TBL1 RENAME TO TBL1_BACKUP;
-- Rename newly created table with unique records to TBL1
ALTER TABLE TBL1_UNIQUE RENAME TO TBL1;
-- Once everything looks good drop the backup table
DROP TABLE TBL1_BACKUP;
Conclusion
Snowflake is still adding features, and I hope they implement a more efficient built-in solution for this. Until then, we have the solutions described above. I hope you learned something from this article. Please subscribe to get more articles like this one.
Thanks for reading, and keep learning.