Dialog with the vendors…

My post last week on a tool selection project attracted lots of interest, at least from the two vendors whose tools I looked at: Pentaho and Talend.  I am actually impressed (and proud!) to have gotten comments from Matt Casters and Fabrice Bonan – respectively founders of Kettle and Talend.  Thanks for your interest in my blog!

Something nice Matt said (“the way Talend is handling all their database operators is pretty horrible”) got my attention.  I did not feel this way when looking at Talend… but he may have a point.  The two tools take very different approaches, and it’s likely that each has pluses and minuses, depending on the situation.

If Matt and Fabrice are still reading, I would be very interested in getting their perspective on how differently the two tools handle database operators.  I already know which one each of you will find best, but some factual elements would be interesting.  Please reply in the comments; if I get interesting material I will summarize it in another post.


7 Responses to “Dialog with the vendors…”

  1. Matt Casters Says:

    Hi Marc,

    The goal of Kettle from the beginning was to be as open as possible. My definition of open included:

    – open source with an LGPL license (no GPL)
    – open, readable XML format
    – open, readable relational repository format
    – easy to set up (<2 minutes)
    – open to all kinds of databases
    – easy-to-use GUI

    That list didn’t include any code generation. Although it was an option I seriously considered, I hadn’t liked any of the code generators I had dealt with up to that point (OWB, etc.). The problem with those is that when things go wrong (and they always do :-)) you soon have to become an expert in the generated language/code.
    That was most certainly the case with OWB at the time. Re-arranging operators in a certain order would yield inexplicably different results, and you found yourself digging through megabytes of generated PL/SQL code. That was OK for simple exercises like loading a bit of data, but it became horrible for complex logic like slowly changing dimensions.

    As such, I did not consider this way of working as open as possible, since the generated code is by definition not transparent and is highly subject to change on re-generation.

    That led to the decision to go for a stream-based, metadata-driven engine. Although I had never used Informatica before, people told me later that doing ETL this way is very much like what those nice people do.

    As is the case with every engine, though, there is a drawback. Even though splitting tasks up into different pieces allows you to run a lot of things in parallel, there is an overhead because you need to pass data from one step to another. Because of that overhead it is not unheard of for us to recommend that you run joins or sorts on the database server where the data is coming from. By allowing the Kettle user/developer to execute whatever SQL or procedure he or she wishes, we are again as open as possible to this scenario.
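
    To give a feel for what “stream-based” means, here is a minimal sketch (made-up class names, nothing to do with Kettle’s actual internals): each step runs in its own thread and hands rows to the next step over a bounded queue, which is exactly where the per-row hand-off overhead comes from.

        import java.util.concurrent.ArrayBlockingQueue;
        import java.util.concurrent.BlockingQueue;

        public class StreamPipelineSketch {
            // Sentinel row that tells the downstream step the stream is finished.
            private static final Object[] END_OF_STREAM = new Object[0];

            public static void main(String[] args) throws InterruptedException {
                // Bounded queue between the two steps: the "hop" that buffers rows.
                BlockingQueue<Object[]> hop = new ArrayBlockingQueue<>(1000);

                // Step 1: a fake "table input" producing rows.
                Thread input = new Thread(() -> {
                    try {
                        for (int i = 0; i < 10; i++) {
                            hop.put(new Object[]{i, "customer_" + i});
                        }
                        hop.put(END_OF_STREAM);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });

                // Step 2: a fake transform that uppercases the name and "writes" it out.
                Thread transform = new Thread(() -> {
                    try {
                        while (true) {
                            Object[] row = hop.take();   // per-row hand-off cost lives here
                            if (row == END_OF_STREAM) break;
                            String name = ((String) row[1]).toUpperCase();
                            System.out.println(row[0] + " -> " + name);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });

                // Both steps run in parallel; the queue is the only coupling between them.
                input.start();
                transform.start();
                input.join();
                transform.join();
            }
        }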

    Here’s another way to look at it: suppose you write a nice ELT program, where everything is processed on the database server. Chances are very high that at least one of these conditions is going to occur:
    – the database server is not tuned for high CPU load (rather for I/O)
    – the source system you are reading from is already too occupied
    – parallelism is low or non-existent in a typical stored procedure
    – stored procedure languages are typically interpreted byte-code languages and are CPU-wise pretty slow/costly
    – debugging becomes very hard or impossible
    – you hit some limit on the database (there are many, even on Oracle)
    – deployment must be carefully managed or something goes wrong (stale/uncompiled procedures because tables changed, referenced functions changed, etc.)
    – …

    Enough reasons for me not to go with ELT at the time. Heck, I still don’t like it and personally I think for all the right reasons. This is going to sound harsh, but the reasoning I had and still have to some extent is that if you can’t write an SQL query, I don’t think you should be reading this blog or be doing heavy database lifting.

    One possible problem is latency. Since we don’t run on the database engine, we suffer from it at times. Sometimes it is not a problem, and sometimes it is. Calls to the database have a certain turn-around time. We solve that for the better part by allowing you to launch multiple copies of a step in parallel (database lookup comes to mind), effectively opening multiple connections and reducing overall latency. Of course a bigger reduction of latency is obtained with in-memory lookups, large caches, etc. In a lot of cases it’s even faster to look up the data in an in-memory cache than to do a join on the database.
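
    To make the latency point concrete, here is a tiny sketch (invented names and timings, not real Kettle code) contrasting a per-row database lookup with a lookup table that is bulk-loaded into memory once:

        import java.util.HashMap;
        import java.util.Map;

        public class LookupSketch {
            // Stand-in for a single-row lookup on the database, with some turn-around time.
            static String lookupInDatabase(int customerId) {
                sleep(2);
                return "country_of_" + customerId;
            }

            // Stand-in for one bulk query that returns the whole lookup table at once.
            static Map<Integer, String> bulkLoadLookupTable(int rows) {
                sleep(2);
                Map<Integer, String> all = new HashMap<>();
                for (int id = 0; id < rows; id++) {
                    all.put(id, "country_of_" + id);
                }
                return all;
            }

            static void sleep(long ms) {
                try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }

            public static void main(String[] args) {
                int rows = 1000;

                // Variant 1: one database round trip per row; the latency adds up.
                long t0 = System.currentTimeMillis();
                for (int id = 0; id < rows; id++) {
                    lookupInDatabase(id);
                }
                long perRow = System.currentTimeMillis() - t0;

                // Variant 2: load the lookup table once, then resolve every row from memory.
                long t1 = System.currentTimeMillis();
                Map<Integer, String> cache = bulkLoadLookupTable(rows);
                for (int id = 0; id < rows; id++) {
                    cache.get(id);
                }
                long cached = System.currentTimeMillis() - t1;

                System.out.println("per-row lookups: " + perRow + " ms, cached lookup: " + cached + " ms");
            }
        }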

    On a higher level, we added the ability to do database partitioning (the result of a POC with Google). That is an interesting strategy in situations where one database simply doesn’t cut it anymore, or in grid-like environments. As you probably noticed on my blog recently, version 3.0 is going to build on top of that. In my own tests I have obtained near-linear scalability with database partitioning.
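
    As a generic illustration of the partitioning idea (not the actual Kettle 3.0 implementation), rows can be routed to one of several connections by hashing the partitioning key, so every partition ends up handling roughly the same share of the data:

        import java.util.ArrayList;
        import java.util.List;

        public class PartitioningSketch {
            public static void main(String[] args) {
                int partitions = 4; // e.g. four database connections or shards

                // One bucket of rows per partition/connection.
                List<List<String>> buckets = new ArrayList<>();
                for (int i = 0; i < partitions; i++) {
                    buckets.add(new ArrayList<>());
                }

                // Route each row by hashing its partitioning key (here: the customer id).
                for (int customerId = 0; customerId < 20; customerId++) {
                    int target = Math.floorMod(Integer.hashCode(customerId), partitions);
                    buckets.get(target).add("row for customer " + customerId);
                }

                for (int i = 0; i < partitions; i++) {
                    System.out.println("partition " + i + " gets " + buckets.get(i).size() + " rows");
                }
            }
        }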

    There, I hope that this pretty much answers your questions.

    To finish my long post: the term “horrible” was a bit too strong in my previous blog entry. Fabrice, please forgive a non-English speaker (like yourself) a slip-up. More appropriate would probably have been “un-natural” or “counter-intuitive”. Contrary to what these technical debates might portray, I do wish you guys all the best (as I’ve stated before). Even though I have to believe that the design decisions made in Kettle were very careful and not at all made by chance, that will still not stop me from saying hello or offering you a beer the next time we meet. 😉 Also, like I said in SF last time we met, I’m still interested in working/cooperating with Talend/JasperSoft to leverage our new Pentaho metadata platform based on CWM.

    All the best,

    Matt

  2. Fabrice Says:

    In your answer, Matt, you talk about:
    – definition of “open”
    – architecture (Engine based vs code generator)
    – OWB
    – ELT approach
    – database partitioning
    – and so on…
    It is just too many different topics in one go for me!
    Maybe I am wrong, Marc, but after reading your post again and again, I think your first point was about the way we handle DB operations in Talend Open Studio:
    “…on how differently the tools handle database operators”

    So if I correctly read your post, here is my point then:
    In very old versions of Talend Open Studio, we had only generic DB Input and Output components (as is the case in PDI).
    We decided to move away from this by creating DB-specific input and output components (e.g. tMySqlInput, tOracleOutput…), and this for one main reason: RDBMSs are all different and hence offer different features; a generic component won’t allow us to use more features than the ones that are common to ALL of them.

    We believe that this option gives more power to our users.
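
    As a rough illustration of that trade-off (hypothetical code, not what Talend Open Studio actually generates): a generic component can only emit SQL that every database accepts, while a MySQL-specific component can take advantage of a vendor extension such as ON DUPLICATE KEY UPDATE.

        public class VendorSpecificSqlSketch {
            // Lowest-common-denominator SQL a generic JDBC output component could emit.
            static String genericInsert(String table) {
                return "INSERT INTO " + table + " (id, name) VALUES (?, ?)";
            }

            // MySQL-only upsert that a tMysqlOutput-style component could emit instead.
            static String mysqlUpsert(String table) {
                return "INSERT INTO " + table + " (id, name) VALUES (?, ?) "
                     + "ON DUPLICATE KEY UPDATE name = VALUES(name)";
            }

            public static void main(String[] args) {
                System.out.println(genericInsert("customers"));
                System.out.println(mysqlUpsert("customers"));
            }
        }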

    There are two points upon which I agree with Matt:
    – It’s harder to maintain; it’s a lot of work for us, even though we benefit from our internal virtual component architecture (IVCA) to share the same code across numerous components. IVCA allows us to write code once and deploy it into numerous components.
    – When a TOS user creates a Job that includes an OracleDB, he cannot easily transform it into a job that will work with another DB.
    I really understand your point, but in my opinion it doesn’t matter so much, as this is functionality that our users have hardly ever asked us for.
    And by the way, as breaking news, the next release (early October) will include generic JDBC Input/Output components (so TOS users will be able to do exactly what Matt described).

    On the other hand, we are currently working on VirtualInput/Output components. Those components go much further, allowing you to map them at run time onto any source/target (e.g. flat file, RDBMS, web service…): endless possibilities.

    Did I answer your question?

    Regards,
    Fabrice

    Matt: we started to work on CWM, but it’s quite a tricky specification; I’ll contact you when we have a clearer picture of it.

  3. Matt Casters Says:

    Marc, Fabrice,

    We’ve had a few documented cases for dynamic database connections, where for example a retail HQ gets data from hundreds of different databases (stores) around the world. Not only do these databases NOT run the same versions, they can also be of different database types (!). Kettle allows you to handle this situation with a single shared transformation, with everything else set up dynamically.
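
    A simplified sketch of what “set up dynamically” means here (invented structure, not Kettle’s actual variable mechanism): the same job logic is reused for every store, while the connection details, including the database type, are supplied at run time.

        import java.util.List;

        public class DynamicConnectionSketch {
            // Per-store connection settings, read from a config table or file at run time.
            record StoreDb(String storeName, String dbType, String host, String database) { }

            static String jdbcUrl(StoreDb s) {
                // Different stores may run entirely different database products.
                switch (s.dbType()) {
                    case "mysql":      return "jdbc:mysql://" + s.host() + "/" + s.database();
                    case "postgresql": return "jdbc:postgresql://" + s.host() + "/" + s.database();
                    case "oracle":     return "jdbc:oracle:thin:@" + s.host() + ":1521/" + s.database();
                    default: throw new IllegalArgumentException("unsupported type: " + s.dbType());
                }
            }

            public static void main(String[] args) {
                List<StoreDb> stores = List.of(
                        new StoreDb("Paris",  "mysql",      "paris-db",  "sales"),
                        new StoreDb("Boston", "postgresql", "boston-db", "sales"),
                        new StoreDb("Tokyo",  "oracle",     "tokyo-db",  "SALES"));

                // The same extraction logic runs against every store; only the URL changes.
                for (StoreDb store : stores) {
                    String url = jdbcUrl(store);
                    System.out.println("extracting from " + store.storeName() + " via " + url);
                    // In a real job: DriverManager.getConnection(url, user, password) here.
                }
            }
        }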

    Discarding that functionality as something that is ‘hardly ever asked for’ is somewhat sad.

    As to my previous post being too much for you… well… I’ll write slower next time 😉

    Cheers,
    Matt

    P.S. I was talking about the Common Warehouse Metamodel (CWM). And indeed it is complex stuff. For that very reason we wrote an easy-to-use abstraction layer. I have heard the good news that a Pentaho partner wrote a first Cognos-to-Pentaho Metadata converter (via CWM XMI).
    Because handling CWM is not for sissies, we are offering our open source layer under the Mozilla Public License and are as such completely open to anyone using it. So you are certainly allowed to make pictures of it 😉

  4. Matt Casters on Data Integration » Making the case for Kettle Says:

    […] P.S. This is a re-hash of some of my opinions in a post over on Marc Russel’s blog. […]

  5. Roland Bouman Says:

    “We decided to move away from this by creating DB-specific input and output components (e.g. tMySqlInput, tOracleOutput…), and this for one main reason: RDBMSs are all different and hence offer different features; a generic component won’t allow us to use more features than the ones that are common to ALL of them.”

    Fabrice, can you tell us a little bit more about this? I mean, sure, RDBMSs are different: they have different communication protocols and, at a higher level, different SQL dialects. But usually the communication protocol is tucked away in a driver provided by the database vendor, and the ETL tool allows you to specify whatever dialect of SQL you like to actually get or put the data. In addition, drivers can usually be configured to one’s heart’s content.

    So how is Talend’s approach different from this? Do you write your own drivers? Do you rewrite queries to optimize for a specific database?

    kind regards,

    Roland Bouman
    http://rpbouman.blogspot.com/

  6. Fabrice Says:

    Hi Roland,

    I was talking about features like auto-commit (every # lines); it is supported out of the box by MySQL but not by many other RDBMSs. That is just one example; you can easily find a hundred more!
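
    Concretely, committing every N rows boils down to something like this plain-JDBC sketch (a simplified hand-written illustration, not our actual generated code; the table and column names are made up, and it assumes an already-open Connection):

        import java.sql.Connection;
        import java.sql.PreparedStatement;
        import java.sql.SQLException;
        import java.util.List;

        public class BatchCommitSketch {
            // Inserts rows and commits every `commitEvery` rows.
            static void load(Connection conn, List<String> names, int commitEvery) throws SQLException {
                conn.setAutoCommit(false);
                String sql = "INSERT INTO customers (name) VALUES (?)";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    int pending = 0;
                    for (String name : names) {
                        ps.setString(1, name);
                        ps.addBatch();
                        if (++pending == commitEvery) {
                            ps.executeBatch();
                            conn.commit();   // flush a chunk of work to the database
                            pending = 0;
                        }
                    }
                    if (pending > 0) {       // commit the final partial batch
                        ps.executeBatch();
                        conn.commit();
                    }
                }
            }
        }
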
    Regards,
    Fabrice

  7. Fabrice Says:

    Matt,

    “I’ll write slower next time ;-)”

    Thanks, and if you can write bigger, it will be even better!

    ;-D
