I’ve been reading a lot of books on SSIS (SQL Server Integration Services) and BI (Business Intelligence) over the course of the year. I want to pass along a little tidbit I haven’t seen in any of them. I’ll preface this by stating our staging tables and data warehouse are all in SQL Server 2005.
Our process is probably similar to others, we pull the data in, and if the warehouse needs to be updated we place the data into a staging table. At the end of the process we do a mass update (via a SQL statement) from the staging table to the main data warehouse tables we use for reporting. Then we delete the records in the staging table. Which seemed like a reasonable thing to do, but wound up getting us in a lot of trouble. Over the course of the last few months our run times for the SSIS job have gotten slower and slower and sloooooooooooooower. Our job was taking as long as 50 minutes to complete sometimes. One of our developers noticed the database seemed to be taking up a lot of space. He found a simple select count(*) was taking eight minutes on what was supposed to be an empty staging table.
Some research on the web explained what we were doing wrong. In one of my favorite SQL Server blogs, I want some Moore, blogger Mladen Prajdic has a great article on the differences between delete and truncate.
The solution then was to not perform a delete, but a truncate on our staging tables. We went ahead and manually issued a truncate on our staging tables, and saw an immediate beneift. Our average run time went from 50 minutes to 8 minutes!
I’m not sure why I haven’t seen this mentioned before, perhaps I just haven’t read the right blog or book yet. But I wanted to pass this along so you could be spared some of the headaches we went through. If your SSIS uses SQL Server 2005 tables, use a truncate and not a delete to avoid speed issues. Alternatively, at least make sure to run truncates on a regular basis to keep those staging areas cleaned out!
Truncate works on older versions of SQL Server too. And even on other database systems like f.e. Oracle, etc.
The thing is however, that you cannot put a transaction around it. That’s also the reason it is faster than a Delete statement that has to write its deleted records to a transaction log.
So it might be OK for your Staging Area but be careful with it 😉
Another performance tip may be dropping and recreating your Staging Area’s database objects between ETL rounds. This requires you to have DDL access to the database, but since you Truncate now I assume that’s not an issue.
Truncate is logged and can be used in a transaction.
create table test(myint int)
insert test values(1);
insert test values(2);
insert test values(3);
select * from test;
begin tran
truncate table test;
rollback tran
select * from test;
My process is very similar. I’m importing XML files from restaurants every 30 minutes based on a schedule. My SSIS Package workflow unzips my file. Imports into repository. Deletes rows from staging tables before staging the data. Once staged running Updates on the reporting Tables. My import was starting to have the same performance issues. Size keep getting slower and slower. Our job also was taking as long as 50 minutes to complete sometimes. I went ahead and tested my process with Truncate instead of delete and Cut my import in 1/2 the time.
@Dave
” February 3, 2009 at 10:58 am
Truncate is logged and can be used in a transaction…”
Let’s clarify something because that statement is misinforming.
When a SQL expert says ‘a logged operation’ they are referring to actions of the operation logged in the Transaction Log itself for begin and commit checkpoints. TRUNCATE is not a logged operation in that sense in most SQL server platforms. Technically, the logged operation of TRUNCATE is that the extents which hold the data pages are flagged to release those pages for reuse. This is the only logged operation of TRUNCATE, there are “begin” and “commit” transactions in this sense but flagging & keeping the data available for rollback is platform specific. Currently, only Microsoft SQL Server and Postgre SQL allow rollbacks on TRUNCATE because of the “logged” occurrence. The data in the data pages still technically exists until the checks are committed. In other versions of SQL this is not acceptable because (say Oracle for example) the rollback or begin segments are only used in DML or at least they are not used in this DDL operation. Oracle uses commit commit for TRUNCATE instead of begin commit, and so do the others from what I am aware. Anyway, that’s it, that’s why it is such a speedy operation and why the removal of the records are not logged, because it technically never touches the records. It just tells the extents to go tell the pages to go blank themselves.
Here are the other differences between TRUNCATE & DELETE and perhaps it will give you a better idea in the future on how to use them.
TRUNCATE is a Data Definition Language (DDL) statement, uses a TABLOCK, doesn’t activate triggers, resets identity counters, and can’t be used on tables that have FK references with OTHER tables. (Tables with FK references to themselves can be truncated).
DELETE is a Data Manipulation Language (DML) statement, locks every row in the range, activates triggers, does not reset the identity, writes each record to the log and can be used on a table with FK references.
That’s the gist of it. I hope this helped to clarify TRUNCATE a little bit without being too technical. (If you know a little SQL you’re already a little technical to begin with).
Addendum – TRUNCATE in SQL Server also uses SCH-M locks as well for the pages. (It locks both Table & Pages). That’s probably important in case somebody wanted to know.
Thanks for the extra info Torch. I imagine Dave was speaking in terms of SQL Server only, since that’s what the original post focuses on, however knowing how this works across a variety of SQL platforms is helpful and will keep people who primarily work in one platform from making mistakes when they venture into other types of database servers.