I have a table that has over 150 Million records. Currently the types are thus:
id (Primary Key, Bigint)
idResult (Foreign key, Bigint null)
idItem (Foreign Key, Bigint null)
Number_1 (Bigint null)
Number_2 (Bigint null)
IsActive (Bigint null)
Number_1 and Number_2 can never be bigger than 10. IsActive is, plainly, a Boolean. And none of these columns are nullable anywhere else in the code base. I'd also like to change the foreign key fields to int, but that is another story altogether. This table was built years before I started, and we're experiencing some growing pains.
I'm looking for the best way to convert these columns (and several others on other tables, though this one is the main offender), and reclaim that disk space. I've tried a straight ALTER TABLE, but that, somewhat expectedly, just didn't work. I can't remember the error specifically, but I believe it was related to the size of the table in question.
Right now I'm considering manually dropping and recreating the table, but I'm struggling to figure out a better way to do this, other than with a SELECT TOP 10000 * FROM dbo.TABLENAME WHERE id > 0, and simply incrementing the WHERE clause several times.
I've looked at SWITCH TO, but this requires that the data types of the target table match the source table, which is what I'm trying to fix!
Any suggestions? Am I looking at this the wrong way?
First off, thank you for doing this. It is such an obvious win that many wouldn't see much value in, but it will be well worth it :). Making the world ever so slightly saner.
Regarding IsActive being a Boolean: my guess is that you are thinking of making it a BIT field. That might be the way to go, but sometimes it is better to go with TINYINT as there is the possibility of expanding the meaning into more than 2 states, in which case it really becomes more of a StatusID. Usually it is a case of something starting out simplistically as Active / Inactive, but later on maybe adding Deleted and/or others. From a sizing perspective, TINYINT is always 1 byte. On the other hand, BIT is 1 byte for up to 8 BIT fields. Meaning, one BIT field is 1 byte, 2 BIT fields are also one byte, and so on, up to 8 BIT fields being stored in a single byte. So there is no space savings in choosing BIT over TINYINT when the table only has 1 BIT field. Just something to consider.
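To put the sizing point in concrete terms, here is a purely illustrative comparison (the table and column names are made up, not taken from your schema):
-- With a single flag column, both of these cost 1 byte per row for that column,
-- but TINYINT leaves room to grow into a multi-state StatusID later on.
CREATE TABLE dbo.Example_Bit (id BIGINT NOT NULL, IsActive BIT NOT NULL);
CREATE TABLE dbo.Example_Tinyint (id BIGINT NOT NULL, StatusID TINYINT NOT NULL);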
Doing an ALTER TABLE is a bit much for a large table, as you saw. One option, though not a great one, is to add a NOT NULL field--Number_1new--with a DEFAULT value (this will be instantaneous due to the default, at least starting with SQL Server 2012) that none of them would naturally have (e.g. 255), and then slowly migrate the values in a loop, as in:
UPDATE TOP (5000) tab
SET tab.Number_1new = tab.Number_1
FROM [table] tab
WHERE tab.Number_1new = 255;
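Tying that together, a rough sketch of the whole sequence might look like the following (illustrative only; it uses the Number_1new name and 255 sentinel from above, and assumes, per the question, that Number_1 has no NULLs and never exceeds 10):
-- Add the replacement column; the NOT NULL + DEFAULT combination is what
-- makes this effectively instantaneous on SQL Server 2012 and later.
ALTER TABLE [table]
    ADD Number_1new TINYINT NOT NULL
        CONSTRAINT DF_table_Number_1new DEFAULT (255);

-- Backfill in batches until no rows are left at the sentinel value.
WHILE 1 = 1
BEGIN
    UPDATE TOP (5000) tab
    SET tab.Number_1new = tab.Number_1
    FROM [table] tab
    WHERE tab.Number_1new = 255;

    IF @@ROWCOUNT = 0 BREAK;
END;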
And when that is done then do:
sp_rename 'table.Number_1', 'Number_1old', 'COLUMN';
sp_rename 'table.Number_1new', 'Number_1', 'COLUMN';
Of course, best to wrap that in a TRANSACTION, and that wrapped in a TRY / CATCH. When the related code has been updated and everything has been tested and the data looks good, then you can drop the Number_1old column.
However, the best way that I have found is to create a new table, slowly transition the data over, then swap the tables and code at the same time. I detailed the steps in an article on SQL Server Central: Restructure 100 Million Row (or more) Tables in Seconds. SRSLY! (free registration required). Just in case there are issues getting to that article, here are the basic steps:
- Create a new table with the ideal structure--[tableNew]. If you are on Enterprise Edition, consider enabling either ROW or PAGE compression as they can sometimes help. But please do some research first as there are some situations where they have a negative effect. There is documentation on MSDN to help you figure it out as well as some tools to help estimate the potential savings. But even if you do enable compression, I wouldn't see that action as replacing the project you are doing here.
- Add an AFTER UPDATE, DELETE trigger on [table] to keep changes in sync (but no need to worry about new rows)
- Create a SQL Agent Job that moves over missing rows in batches. Do this in a loop that does an INSERT INTO [tableNew] (Columns) SELECT TOP (n) Columns FROM [table] WHERE ?? ORDER BY ?? (a rough sketch of this step, along with the trigger, follows after this list).
- The WHERE and ORDER BY clauses depend on the situation. They should be geared towards making the best use of the clustered index. If the clustered index of the new table is structurally the same as that of the old/current table, then at the start of each loop you can get the MAX([id]) from [tableNew] and use it to get WHERE table.[id] > @MaxIdInTableNew ORDER BY table.[id].
- Create the new table, trigger on the current table, and SQL Agent Job a week or so before you need to do the full cut-over. That time-frame might change based on your situation, but just make sure to give yourself plenty of time. It is far better for the job to finish migrating rows and only have a few trickling in at a time as opposed to being 100k shy of the full set as the release is supposed to begin.
- If the plan is to migrate the other related tables (the PK references for the two FKs that you want to turn into INTs), then make those fields INT here now and just don't add the FKs until those other tables are migrated over to having INT fields as their PKs. You don't want to have to rebuild this table again just to make that change for the FK fields.
- During the cut-over (in a TRY / CATCH, of course):
- BEGIN TRAN
- do a final row count on both tables to make sure everything is moved over (might want to sanity check the rows before the release to make sure that the trigger did the updates and deletes as expected)
- rename the current table to "old"
- rename the "new" table to not have the "new"
- drop the SQL Agent job (or at least disable it)
- rename any dependent objects such as constraints, etc.
- COMMIT
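To make the trigger and job steps above a bit more concrete, here is a rough, hypothetical sketch (all object names are placeholders, the column list matches the question, and it assumes [tableNew] is clustered on [id] just like the current table):
-- Keep updates and deletes in sync while the backfill runs; new rows are
-- picked up by the copy job, so inserts are intentionally ignored here.
CREATE TRIGGER trg_table_SyncToNew
ON [table]
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Propagate updates to rows that have already been copied over.
    UPDATE tn
    SET tn.idResult = ins.idResult,
        tn.idItem   = ins.idItem,
        tn.Number_1 = ins.Number_1,
        tn.Number_2 = ins.Number_2,
        tn.IsActive = ins.IsActive
    FROM [tableNew] tn
    INNER JOIN inserted ins ON ins.[id] = tn.[id];

    -- Remove rows that were deleted from the current table.
    DELETE tn
    FROM [tableNew] tn
    INNER JOIN deleted del ON del.[id] = tn.[id]
    WHERE NOT EXISTS (SELECT 1 FROM inserted ins WHERE ins.[id] = del.[id]);
END;
GO

-- The statement at the heart of the Agent job's loop: copy the next batch of
-- missing rows, keyed on the clustered index so each batch is a cheap range scan.
-- The job runs this repeatedly (or on a schedule) until the tables match.
DECLARE @MaxIdInTableNew BIGINT;

SELECT @MaxIdInTableNew = ISNULL(MAX([id]), 0)
FROM [tableNew];

INSERT INTO [tableNew] ([id], idResult, idItem, Number_1, Number_2, IsActive)
SELECT TOP (50000) t.[id], t.idResult, t.idItem, t.Number_1, t.Number_2, t.IsActive
FROM [table] t
WHERE t.[id] > @MaxIdInTableNew
ORDER BY t.[id];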
Don't drop and recreate. Create a new table with the proper column types and then insert into it in a loop.
Or you may be able to do it a column at a time:
UPDATE [table]
SET Number_1tiny = Number_1
WHERE Number_1 IS NOT NULL
AND (Number_1tiny IS NULL OR Number_1tiny <> Number_1);
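On 150 million rows you would likely want to batch that update as well; here is a minimal sketch, assuming the table is dbo.TABLENAME (as named in the question) and Number_1tiny was added as a nullable TINYINT column:
-- Copy the values in small batches so a single huge update doesn't
-- overwhelm the transaction log or hold long blocking locks.
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) dbo.TABLENAME
    SET Number_1tiny = Number_1
    WHERE Number_1 IS NOT NULL
      AND (Number_1tiny IS NULL OR Number_1tiny <> Number_1);

    IF @@ROWCOUNT = 0 BREAK; -- nothing left to copy
END;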