Home Dashboard Directory Help

Multi-threaded rebuilds of Clustered Columnstore Indexes break the sequence of pre-sorted segment ordering (Order Clustering) by Neugebauer



Sign in
to vote
Type: Suggestion
ID: 912452
Opened: 7/3/2014 9:38:00 AM
Access Restriction: Public


In order to get a better performance as well as compression results, we can sort the data in a RowStore engine before creating Columnstore Index to get a better Segment Elimination.

What happens is that if we build a Clustered Columnstore directly after sorted RowStore with (DROP_EXISTING=ON) to get a perfect Order Clustering, we have the order maintained just partially.

To test it, download ContosoRetailDW (http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=18279),
and run the following script:

ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimCurrency];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimCustomer];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimDate];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimProduct];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimPromotion];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [FK_FactOnlineSales_DimStore];
ALTER TABLE dbo.[FactOnlineSales] DROP CONSTRAINT [PK_FactOnlineSales_SalesKey];

-- Create sorted RowStore
create unique clustered index [PK_FactOnlineSales_SalesKey] on dbo.FactOnlineSales (OnlineSalesKey)

-- Create Columnstore
create clustered columnstore index [PK_FactOnlineSales_SalesKey] on dbo.FactOnlineSales
    with (DROP_EXISTING = ON);

-- Check out the segments being a kind of ordered (depends on the number of MAXDOP being used in the Columnstore Index creation)
select segment_id, row_count, base_id, min_data_id, max_data_id
    from sys.column_store_segments
    where column_id = 1;

I have a more detailed blog post on this matter here:
Sign in to post a comment.
Posted by geoff patterson on 5/16/2017 at 6:42 AM
For pretty much every client that we work with, this keeps coming up as a pain point. We're essentially telling clients to choose between much slower ETL processes or reduced DW query performance for certain query patterns that would benefit hugely from segment elimination, including many of the most common reporting and analytic queries they are performing. This feature would be hugely valuable in order to unlock both good ETL performance and good DW query performance, which are currently at odds with each other.

Our request would be two features:
(1) Allow optional control of the order in which new rows are packing into segments when using INSERT INTO ... WITH TABLOCK to insert data into a clustered columnstore index in parallel
(2) Add optional control of the row order to the REBUILD command for columnstore indexes

The first of these two features is the critical one, as it unlocks the parallel load capability into a single columnstore table (which we would use as a staging table to swap into a partition of a much larger fact table). Even if explicit rebuild support is not present, a rebuild could still be performed by re-inserting all of the data using feature (1). Feature (2) is a nice to have that would be convenient, but we'd be satisfied with (1) even if (2) is not implemented or not implemented in the same release.

Controlling the order in which rows are distributed into row groups will yield huge DW query performance benefits for queries that can benefit from segment elimination by providing the ideal scenario: each segment has a disjoint range (almost, ignoring some possible ties) on the leading column of our ordering. However, it is understood that neither of these features would guarantee the order of future inserts, and neither would impact the delta store in cases where the number of rows inserted is small enough that a new segment is not created. Therefore, the query optimizer would continue to avoid making any assumptions about the order of data for a clustered columnstore index loaded using this feature.
Posted by geoff patterson on 6/29/2016 at 9:59 AM
Thanks for the update, Kevin.

I'd still love to see a version in the future where the Columnstore index build runs in parallel, builds segments in the desired order, and does not require creating an intermediate b-tree index first. But definitely understand there may be tough technical challenges involved.
Posted by Kevin Farlee [MSFT] on 4/5/2016 at 10:13 AM
In SQL Server 2016, parallel index (re)builds of Column Store Indexes will not preserve ordering.
As noted, the workaround is to do the build with MAXDOP = 1, so it becomes a tradeoff between index build performance and query performance.
We clearly understand the problem space, and why this would be beneficial. Unfortunately, we haven't solved some of the technical issues involved.
We will continue to look into this area.

Posted by Sunil [MSFT] on 12/8/2014 at 5:40 PM
Niko: I am closign this as this is in the plans for next release. If this gets pushed out unepectedly, please feel free to re-activate

Posted by Sunil [MSFT] on 11/11/2014 at 3:31 PM

Thanks for your suggestions. We are considering this for the next release

Posted by geoff patterson on 8/6/2014 at 1:54 PM

I also agree that the subsequent inserts can be unordered. Neugebauer's logic makes sense to me. This is really about optimizing the initial creation of large columnstore tables. We would typically be loading new data only by building / rebuilding an entire new partition at a time, and it seems like that must be a common use case for many of the large data sets towards which columnstore is geared.
Posted by Neugebauer on 7/4/2014 at 8:40 AM
Thank you for the commentary, Sunil.

In my personal opinion, the subsequent inserts can be unordered, since we can easily rebuild index/partition to get them back in order.
Clustered Columnstore Index is not a truly ordered index and hence there is no need to upheld the order.

It would be awesome to have it ordered one day, but I am not holding my breath:),
especially since one can get items order by Rebuilding CCI.

I like the suggestion by Geoff of eliminating by rowstore index creation before CCI build,
but given it's impact (~0.7 times slower)
and syntax implications (column order specification) - I see this feature coming into a potential ordered version of the Columnstore.
Posted by Sunil [MSFT] on 7/3/2014 at 10:30 AM
Neugebauer: thanks. you identified the issue correctly. This is something we are actively looking. One question
(1) once the index is build, the subsequent inserts may not be ordered, Is this a big issue?

Posted by geoff patterson on 7/3/2014 at 10:03 AM
This would definitely be a good build. I walked through an initial investigation of clustered columnstore on one of our data sets, and the number of reads was reduced by about 80% for some queries in the workload once I started paying careful attention to segments and loading data in a single-threaded manner to optimize segment elimination. However, this results in slower loading of data, and it would be great to be able to create columnstore indexes in order for segment elimination.

Having an optional ORDER BY clause for the columnstore initial creation, removing the need to first create a clustered index in order to control order and allowing parallelism without breaking order, would be particularly powerful.
Sign in to post a workaround.
Posted by Neugebauer on 7/4/2014 at 8:43 AM
Rebuild CCI without multi-threading:

create clustered columnstore index [PK_FactOnlineSales_SalesKey] on dbo.FactOnlineSales
    with (DROP_EXISTING = ON, MAXDOP = 1);