Managing Slowly Changing Dimensions

In July, Talend announced support for a new feature: Slowly Changing Dimensions. I guess they were playing catch-up, since as far as I know Kettle has supported this for a while. Never mind, I thought I would give it a try and compare how well both tools handle SCDs.

Bottom line: both tools make SCD management super easy. Congratulations guys, you made a pretty difficult concept easy to implement. That said, Talend's implementation is still young: it is missing features such as surrogate keys and the ability to specify the end date. Kettle has more thorough functional coverage.

Something that's missing from both tools, however: Type 3 SCDs. OK, I'll grant you this – in my years of consulting, I have never had to implement a Type 3 SCD. But still, it would be good to have it, just in case you need it 🙂
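
For readers who haven't run into these before, here is a minimal, tool-agnostic sketch of what Type 2 and Type 3 handling boil down to. It is plain Python and SQL with made-up table and column names; it is not what either tool generates, just the shape of the data. The surrogate key is what lets fact rows point at a specific version of the customer rather than just at the customer.

    import sqlite3

    # Illustrative sketch only: table and column names are invented, and this
    # is not the code Talend or Kettle generate.
    conn = sqlite3.connect(":memory:")

    # Type 2: every change becomes a new row, identified by a surrogate key
    # and bracketed by validity dates. The natural key (customer_id) repeats.
    conn.execute("""
        CREATE TABLE dim_customer_type2 (
            customer_sk  INTEGER PRIMARY KEY,   -- surrogate key
            customer_id  TEXT,                  -- natural (business) key
            city         TEXT,
            valid_from   TEXT,
            valid_to     TEXT,                  -- the "end date" mentioned above
            is_current   INTEGER
        )""")

    # Type 3: history lives in extra columns on the same row, so only the
    # previous value is kept, not the full history.
    conn.execute("""
        CREATE TABLE dim_customer_type3 (
            customer_id   TEXT PRIMARY KEY,
            city          TEXT,
            previous_city TEXT
        )""")

    def type2_change(customer_id, new_city, change_date):
        """Close the current version and open a new one."""
        conn.execute(
            "UPDATE dim_customer_type2 SET valid_to = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (change_date, customer_id))
        conn.execute(
            "INSERT INTO dim_customer_type2 "
            "(customer_id, city, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, '9999-12-31', 1)",
            (customer_id, new_city, change_date))

    def type3_change(customer_id, new_city):
        """Shift the current value into the 'previous' column, then overwrite."""
        conn.execute(
            "UPDATE dim_customer_type3 SET previous_city = city, city = ? "
            "WHERE customer_id = ?",
            (new_city, customer_id))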

From a performance standpoint, Talend clearly makes up for its functional gaps. I ran a test with 25,000 source records. When creating the dimension, TOS went through the process in 8.7 seconds, but it took Kettle 675 seconds! Updating the dimension, a much more resource-consuming process, took TOS 512 seconds and Kettle 1,323 seconds.

Which tells me another thing: no vendor can claim to always be 50 or 100 times faster than others! Performance comparisons depend so much on which test you run. In my case, TOS is 78 times faster than Kettle in the first test, but only 2.6 times faster in the second one.


8 Responses to “Managing Slowly Changing Dimensions”

  1. Matt Casters Says:

    If a dimension doesn't have a surrogate key, it's not a slowly changing dimension.

  2. Matt Casters Says:

    I would like to point out that the performance numbers are bogus as well.
    We have a number of test-transformations in our performance test-suite: http://kettle.pentaho.org/svn/Kettle/trunk/test/org/pentaho/di/run/dimensionlookup/

    For a typical result (v2 versus v3), see for example an older run here: http://kettle.pentaho.org/svn/Kettle/trunk/test/org/pentaho/di/run/RunResults-Matt-20070522.txt

    I quote for v3: 250,000 rows at 12,529 rows/s = 20 seconds. (Initial load)

    I still would like to know how you can have a Slowly Changing Dimension without a surrogate key 🙂

    Matt

  3. Fabrice Says:

    I agree, our SCD implementation is young and it is available only in a milestone version right now. In our next main version (2.2, coming out October 5), we will for sure support surrogate keys (in many ways, in fact)!

  4. marcrussel Says:

    Hi Matt,
    Why am I not surprised that you have performance data that makes your tool the greatest? I am sure Fabrice has some, too. Sorry if you don’t like my “bogus” numbers… I won’t try to convince you!

  5. Matt Casters Says:

    Dear Marc,

    All sarcasm aside, our SCD implementation has been around for about 4 years now, so I’m not surprised it’s running well either. However, I resent that you are trying to imply I’m trying to “fix” our numbers somehow.

    All the transformations and input files we use for testing are completely out in the open, and if you want, you can run them yourself. I even included links to the test in question for your convenience (v2.5 vs 3.0, so you have both).

    That is a lot more than I can say about your test claiming our SCD implementation is incredibly slow. You probably forgot to put an index on the table, slowing our transformation down, or something. Who knows, right? A local or a remote database? What type of database? Index configurations? A lot of things can influence the numbers. You mention creating the SCD; what about lookups, partial updates with mostly inserts, mostly updates, etc.?

    If you are interested in those numbers for PDI, I put the link to the transformations and the unit test-cases in my previous post.

    By the way, it should be fairly simple to do a Type 3 SCD in PDI. You can look up the original or (more often required) previous entry of a particular column with a second lookup step before you do the actual update.

    HTH,
    Matt
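
For anyone who wants to try that outside of PDI, the idea translates roughly into the sketch below: a first lookup fetches the value that is about to be overwritten, and the update then writes the new value and stashes the old one in a "previous" column. Table and column names are invented for illustration; in PDI the lookup and the update would each be a step in the transformation.

    import sqlite3

    def apply_type3_row(conn: sqlite3.Connection, row: dict) -> None:
        # The extra lookup step: fetch the value we are about to overwrite.
        found = conn.execute(
            "SELECT city FROM dim_customer WHERE customer_id = ?",
            (row["customer_id"],)).fetchone()
        previous_city = found[0] if found else None

        # The actual update: write the new value and keep the old one in the
        # Type 3 "previous" column.
        conn.execute(
            "UPDATE dim_customer SET city = ?, previous_city = ? "
            "WHERE customer_id = ?",
            (row["city"], previous_city, row["customer_id"]))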

  6. Hugo Says:

    Actually, you can't do a comparison of this nature.

    I bet if you look at the generated code, Talend does a select * type operation on the target table and stores all this data in a hash.
    The problem with this approach is that it isn't parallel-processing safe (what if a separate workflow updates values in the dimension while it is being used?).
    It's also NOT scalable: if I have, say, a customer dimension with millions of customers, I will quickly reach the RAM limit and the transformation will then become very, very slow.
    You may be able to do a join with the source database and restrict the values you cache in the lookup to those in your source, but this is going to affect your performance and require you to stage everything.
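
Hugo's concern is easiest to see in code. Below is a deliberately simplified sketch (not Talend's actual generated code, and the table and column names are made up) contrasting a "load everything into a hash" lookup with one that only caches the natural keys present in the current source batch.

    import sqlite3

    def load_full_cache(conn: sqlite3.Connection) -> dict:
        """'select *' style cache: every active dimension row ends up in RAM.
        Lookups are fast, but memory grows with the dimension, and updates
        made by another process are invisible once the hash is built."""
        cache = {}
        for customer_id, customer_sk, city in conn.execute(
                "SELECT customer_id, customer_sk, city "
                "FROM dim_customer WHERE is_current = 1"):
            cache[customer_id] = (customer_sk, city)
        return cache

    def load_filtered_cache(conn: sqlite3.Connection, source_keys) -> dict:
        """Cache only the natural keys that appear in this batch. Memory is
        bounded by the batch size, at the cost of an extra pass over the
        source (the staging Hugo mentions)."""
        cache = {}
        for key in set(source_keys):
            found = conn.execute(
                "SELECT customer_sk, city FROM dim_customer "
                "WHERE customer_id = ? AND is_current = 1", (key,)).fetchone()
            if found:
                cache[key] = found
        return cache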

  7. Andrew Collins Says:

    Hugo,
    I agree and disagree with you 🙂
    Agree: Kettle allows you to define a max size for the lookup cache, whereas Talend loads all active records (but when Kettle needs data that isn't in the cache, it becomes very slow).
    Disagree: in "slowly changing dimension", there is "slowly". Data in dimension tables isn't supposed to change every second. Even if records get updated during the SCD update, the latest value is used. Parallelizing several SCD updates on a table should be no problem, as long as source keys are spread across components.

    Andrew
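
The "max size" cache Andrew mentions behaves roughly like the toy version below (a simplification, not Kettle's actual code): hits stay in memory, and every miss costs a database round trip, which is where the slowdown comes from once the working set outgrows the cache.

    import sqlite3
    from collections import OrderedDict

    class BoundedDimensionCache:
        """Toy LRU cache in front of a dimension lookup."""

        def __init__(self, conn: sqlite3.Connection, max_size: int = 10_000):
            self.conn = conn
            self.max_size = max_size
            self.entries = OrderedDict()

        def lookup(self, customer_id):
            if customer_id in self.entries:             # cheap: cache hit
                self.entries.move_to_end(customer_id)
                return self.entries[customer_id]
            found = self.conn.execute(                  # expensive: DB round trip
                "SELECT customer_sk FROM dim_customer "
                "WHERE customer_id = ? AND is_current = 1",
                (customer_id,)).fetchone()
            if found is None:
                return None
            self.entries[customer_id] = found[0]
            if len(self.entries) > self.max_size:       # evict least recently used
                self.entries.popitem(last=False)
            return found[0]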

  8. Matt Casters Says:

    Parallel processing & caching…

    Slowly as a slowly changing dimension may be, if you take the typical case of a customer dimension, you often see the same record change multiple times in a day or a week and then not at all for months on end. More often than not, some typo was detected and the data re-entered. As such, you get multiple changes for the same record in your input stream. If you send that same record to multiple copies of a step, you not only get cache misses, you can also create deadlocks if the threads use multiple database connections.

    However, since we allow data to be partitioned in PDI, we can launch a "dimension lookup/update" in parallel while guaranteeing that the data is NOT in the cache of another step. That is because we send the same natural key to the same copy of the step every time it passes.

    We can extend that principle to multiple servers running multiple copies of the same dimension updater as well.

    HTH,
    Matt
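
The partitioning guarantee Matt describes comes down to routing every record for a given natural key to the same step copy. A bare-bones sketch of that routing rule (not PDI's partitioning code, just the principle):

    import hashlib

    def copy_for(natural_key: str, n_copies: int) -> int:
        """Pick a step copy by hashing the natural key, so every change to the
        same customer lands on the same copy and never sits in two caches."""
        digest = hashlib.md5(natural_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % n_copies

    # Two changes to CUST-042 in the same batch go to the same copy, so there
    # are no cross-copy cache misses and no competing updates on that key.
    rows = [{"customer_id": "CUST-042", "city": "Ghent"},
            {"customer_id": "CUST-042", "city": "Gent"},
            {"customer_id": "CUST-007", "city": "Lyon"}]
    buckets = {}
    for row in rows:
        buckets.setdefault(copy_for(row["customer_id"], 4), []).append(row)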
