Tool selection project

I just wrapped up the first part of a tool selection project for an old client of mine, who wants to introduce open source data integration on a pilot project.

Working with the client, we have shortlisted the two open source solutions we deem the most mature: Pentaho’s Kettle and Talend’s Open Studio. Their respective positioning is interesting: Pentaho, being an open source BI vendor, focuses exclusively on ETL for the BI market, whereas Talend says it addresses not only BI but also operational data integration.

There are a number of other open source ETL projects out there, but none of them has the backing of a “real” company. That’s not to say these projects are bad, but my client is just tip-toeing into open source and wanted reassurance about tech support, viability, and so on. So I have only been looking at these two vendors.

A big part of the evaluation was about performance. I have run Kettle and TOS on identical scenarios to see how they perform.

I tried not only the latest versions but also went back one version for each tool. It’s interesting to see that both products have made significant performance improvements in their latest builds.

Anyway, I thought I’d share the results of this benchmark: benchmark-tos-vs-kettle.pdf

I’ll be posting more info as my work with this client progresses. Right now they have not entirely confirmed their choice; they want to look at other criteria beyond pure performance. But the scale is clearly leaning toward one side… (check the benchmark if you want to know which one!)

9 Responses to “Tool selection project”

  1. Matt Casters Says:

    Hi Marc,

    If all you really want to do is write a text file to another text file, I can recommend the “cat” ETL tool. (copy would work as well)

    Seriously, I was under the impression that databases and data warehouses were involved in a typical ETL project. Your own tests show that TOS is slower than Kettle when you hit a database. However, *my* tests show that Kettle is up to 30% faster than Talend for text file handling in Kettle version 3.0. (We couldn’t be bothered with text files in earlier versions :-))

    I also want to add that dismissing Talend/Jaspersoft and Pentaho as “real” companies is not going to land well with either.


  2. James Dixon Says:

    Hi Marc,

    It looks like you are working on a good comparison report.

    While Pentaho as a company focuses primarily on the BI market our ETL offering is not BI specific and can be used for solving many different data integration needs including operational data integration.

    We are still in development with V3.0 of Kettle. If you can make your data files and transformations available to us it will help us remove any defects you are encountering.


  3. marcrussel Says:

    Hi Matt,
    Thanks for the comment. I am aware that every test case is different; however, it would help if you published your results in a format similar to mine: run Kettle and Talend side by side under the exact same conditions and show the values. I don’t pretend to know everything about either tool, and I am sure there are optimization parameters that can be fine-tuned.


  4. Open source data integration in Network World « Marc Russel’s Blog Says:

    […] Marc Russel’s Blog Just another weblog « Tool selection project […]

  5. Fabrice Says:

    I agree with Matt, your results are a bit strange. Here, dealing with Oracle, Ingres & MySQL, the latest Talend Open Studio version is often 40 to 55% faster than the last available PDI version (of course, I’m not talking about our ELT approach, which depends only on the target RDBMS and which is muuuuch faster).

    I’ll send you some numbers about that in a couple of weeks; we are currently working on a large-scale benchmark with Unisys.


    PS: Hi Matt! 😉

  6. YvesM Says:

    I don’t mind Talend being called a “real” company…

  7. marcrussel Says:

    Like I told Matt Casters – please post your results! You guys obviously know your tools better than I do and can tune them better. But I won’t believe numbers thrown at me by vendors unless I see concrete proof.
    BTW, I’d be curious to see how ELT can improve performance here.


  8. Matt Casters Says:

    Hi Marc & all,

    I tried to run a little test of my own to explain the difference and difficulties involved in handling text using Java/Kettle:

    I ran the same test (Text File to Text File) on PDI and TOS. The result: 10 seconds for Kettle 3.0 (rev. 4725) vs. 16 seconds for TOS (v2.0.3 – r3791-20070612-1833).

    See, unlike all these claims above, I have always been very open about performance: anyone can try it themselves and see what performance they are getting from the respective tools they are using.

    Mind you, claiming that Kettle 3.0 is 60% faster than TOS at handling text files would simply be foolish. There is a saying that goes: “Lies, Damned Lies and Benchmarks”.

    Of course, we’re comparing apples and oranges here. As far as I can tell, it’s not really possible to do any real data conversion in the “tFileInputDelimited” or “tFileOutputDelimited” operators in TOS. Kettle lets you process weird data formats, localized number formats, and time zones, trim data, etc. My hunch is that the Perl code is doing lazy conversion as well, treating everything as single-byte text. While that’s fine, reality is different most of the time.
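To make the distinction concrete, here is a minimal Python sketch (illustrative only, not code from either tool) of the difference between lazy, pass-through handling of a field and a real conversion of a localized number:

```python
# Hypothetical sketch: "lazy" text handling versus a real conversion of a
# localized number field (e.g. German "1.234,56") read from a delimited file.

raw_field = "1.234,56"

# Lazy handling: the field is copied through untouched. Fast, but the value
# is never validated, so it cannot be compared, summed, or loaded into a
# numeric database column.
lazy = raw_field

# Real conversion: strip the grouping separator and swap the decimal comma,
# yielding an actual number. This is the kind of work an ETL step must do
# (and pay for) when formats, locales, or time zones come into play.
value = float(raw_field.replace(".", "").replace(",", "."))

print(lazy)   # 1.234,56
print(value)  # 1234.56
```

The conversion step is trivial here, but repeated across millions of rows and many fields it is exactly the overhead that separates a raw file copy from a real transformation.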

    Unfortunately, reality is that if you can’t get your transformations to work at all, performance is equal to zero…

    As far as ELT is concerned, it’s true that certain operations can be done faster on a database. The PDI approach is to allow everyone to execute whatever SQL or procedure they want on the database of their choice, in a way that is as transparent as possible.

    Unfortunately, the way Talend handles all their database operators is pretty horrible. By creating different operators/steps for each individual database, they are helping database vendors lock you in. Try running your development on MySQL and then switching your transformations to a different database: this is completely painless in Kettle and a major bump in TOS (as far as I can tell; perhaps they have a migration tool hidden somewhere).

    Of course, that way it’s easy to get 100+ steps in your environment. If Kettle had followed the same approach, we would have over 125 steps for database operations alone: 25 databases x (Input, Output, Lookup, Database Join, DB Procedure, …).

    Think of it this way: each time we add a feature, ALL our supported databases gain from it. Having 125 database steps must be a maintenance problem, a problem to get it all tested, etc.
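The arithmetic behind this step-count argument can be sketched as follows (the numbers are the illustrative ones from the comment above, not a catalogue of either product):

```python
# Illustrative sketch of the step-count explosion described above: one step
# per (database, operation) pair versus one generic step per operation.

databases = 25  # MySQL, Oracle, ... (number of supported engines, per the comment)
operations = ["Input", "Output", "Lookup", "Database Join", "DB Procedure"]

per_database_steps = databases * len(operations)
print(per_database_steps)  # 125 -- one step for every combination

generic_steps = len(operations)
print(generic_steps)  # 5 -- the target database is chosen via connection metadata
```

The maintenance argument follows directly: a bug fix or feature in a generic step benefits all engines at once, while the per-database design multiplies the code to write, test, and keep in sync.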

    Then again, being firmly convinced that code-generators are not the way forward, I can understand that the Talend people have a different opinion on this matter.

    There! Enjoy the rest of the flame war 😉 I’ve tried to spread as much FUD about TOS as I could. I leave it as an exercise for the reader to figure out what’s true about my claims and what’s not in their specific situation.



  9. Dialog with the vendors… « Marc Russel’s Blog Says:

    […] with the vendors… My post last week on a tool selection project attracted lots of interest, at least from the two vendors whose tools I looked at: Pentaho and […]
