Home Dashboard Directory Help

Multi-threaded rebuilds of Clustered Columnstore Indexes break the sequence of pre-sorted segment ordering (Order Clustering) by Neugebauer



Sign in
to vote
Type: Suggestion
ID: 912452
Opened: 7/3/2014 9:38:00 AM
Access Restriction: Public


In order to get a better performance as well as compression results, we can sort the data in a RowStore engine before creating Columnstore Index to get a better Segment Elimination.

What happens is that if we build a Clustered Columnstore directly after sorted RowStore with (DROP_EXISTING=ON) to get a perfect Order Clustering, we have the order maintained just partially.

To test it, download ContosoRetailDW (http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=18279),
and run the following script:

ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimCurrency];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimCustomer];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimDate];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimProduct];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimPromotion];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimStore];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [PK_FactOnlineSales_SalesKey];

-- Create sorted RowStore
create unique clustered index [PK_FactOnlineSales_SalesKey] on dbo.FactOnlineSales (OnlineSalesKey)

-- Create Columnstore
create clustered columnstore index [PK_FactOnlineSales_SalesKey] on dbo.FactOnlineSales
    with (DROP_EXISTING = ON);

-- Check out the segments being a kind of ordered (depends on the number of MAXDOP being used in the Columnstore Index creation)
select segment_id, row_count, base_id, min_data_id, max_data_id
    from sys.column_store_segments
    where column_id = 1;

I have a more detailed blog post on this matter here:
Sign in to post a comment.
Posted by geoff patterson on 6/29/2016 at 9:59 AM
Thanks for the update, Kevin.

I'd still love to see a version in the future where the Columnstore index build runs in parallel, builds segments in the desired order, and does not require creating an intermediate b-tree index first. But definitely understand there may be tough technical challenges involved.
Posted by Kevin Farlee [MSFT] on 4/5/2016 at 10:13 AM
In SQL Server 2016, parallel index (re)builds of Column Store Indexes will not preserve ordering.
As noted, the workaround is to do the build with MAXDOP = 1, so it becomes a tradeoff between index build performance and query performance.
We clearly understand the problem space, and why this would be beneficial. Unfortunately, we haven't solved some of the technical issues involved.
We will continue to look into this area.

Posted by Microsoft on 12/8/2014 at 5:40 PM
Niko: I am closign this as this is in the plans for next release. If this gets pushed out unepectedly, please feel free to re-activate

Posted by Microsoft on 11/11/2014 at 3:31 PM

Thanks for your suggestions. We are considering this for the next release

Posted by geoff patterson on 8/6/2014 at 1:54 PM

I also agree that the subsequent inserts can be unordered. Neugebauer's logic makes sense to me. This is really about optimizing the initial creation of large columnstore tables. We would typically be loading new data only by building / rebuilding an entire new partition at a time, and it seems like that must be a common use case for many of the large data sets towards which columnstore is geared.
Posted by Neugebauer on 7/4/2014 at 8:40 AM
Thank you for the commentary, Sunil.

In my personal opinion, the subsequent inserts can be unordered, since we can easily rebuild index/partition to get them back in order.
Clustered Columnstore Index is not a truly ordered index and hence there is no need to upheld the order.

It would be awesome to have it ordered one day, but I am not holding my breath:),
especially since one can get items order by Rebuilding CCI.

I like the suggestion by Geoff of eliminating by rowstore index creation before CCI build,
but given it's impact (~0.7 times slower)
and syntax implications (column order specification) - I see this feature coming into a potential ordered version of the Columnstore.
Posted by Microsoft on 7/3/2014 at 10:30 AM
Neugebauer: thanks. you identified the issue correctly. This is something we are actively looking. One question
(1) once the index is build, the subsequent inserts may not be ordered, Is this a big issue?

Posted by geoff patterson on 7/3/2014 at 10:03 AM
This would definitely be a good build. I walked through an initial investigation of clustered columnstore on one of our data sets, and the number of reads was reduced by about 80% for some queries in the workload once I started paying careful attention to segments and loading data in a single-threaded manner to optimize segment elimination. However, this results in slower loading of data, and it would be great to be able to create columnstore indexes in order for segment elimination.

Having an optional ORDER BY clause for the columnstore initial creation, removing the need to first create a clustered index in order to control order and allowing parallelism without breaking order, would be particularly powerful.
Sign in to post a workaround.
Posted by Neugebauer on 7/4/2014 at 8:43 AM
Rebuild CCI without multi-threading:

create clustered columnstore index [PK_FactOnlineSales_SalesKey] on dbo.FactOnlineSales
    with (DROP_EXISTING = ON, MAXDOP = 1);