Archive for August, 2007

Dialog with the vendors…

August 27, 2007

My post last week on a tool selection project attracted lots of interest, at least from the two vendors whose tools I looked at: Pentaho and Talend.  I am actually impressed (and proud!) to have gotten comments from Matt Casters and Fabrice Bonan – respectively founders of Kettle and Talend.  Thanks for your interest in my blog!

Something nice Matt said (“the way Talend is handling all their database operators is pretty horrible”) got my attention.  I did not feel this way when looking at Talend… but he may have a point.  The two tools take very different approaches, and each likely has its pluses and minuses, depending on the situation.

If Matt and Fabrice are still reading, I would be very interested in getting their perspective on how differently the tools handle database operators.  I already know which one each of you will find best, but some factual elements would be interesting.  Please reply in the comments; if I get interesting material, I will summarize it in another post.


Open source data integration in Network World

August 24, 2007

Today I ran into an article in Network World in which they have selected eight open source companies to watch. I usually don’t pay too much attention to these marketing things, but I found it interesting that two of the companies selected are doing data integration: Apatar and Talend.

Talend, I know them already (see this post about my recent product selection project). But Apatar was new to me, so I decided to take a closer look. I don’t have bandwidth at this point to try the product (maybe later) but the positioning itself is interesting.

The first thing I noticed (in the Network World article) is that the company is called Apatar because the name starts with an A and the domain name was available. Well, I guess that was a good move, since Apatar is listed first in the article! I guess at some point journalists will have to use reverse alphabetical order, or random order, to be fair to vendors starting with X, Y and Z. Talend also commented on this in their blog (tough luck, they start with a T…).

So what’s Apatar about? They say they are the first provider of on-demand open source data integration. There seems to be a lot of “first something” in this field.

According to their site, it’s about integrating data from Web 2.0 applications such as Flickr and Amazon S3. Well, granted, that’s kind of trendy, but how useful is it? I have been doing data integration for 15 years, and have never seen anyone store enterprise data in Amazon S3 (even recently)! Maybe I am only doing business with dusty companies… but let’s face it: a lot of data is still stored in legacy systems (mainframes, files…); today the majority of systems my clients deal with are RDBMS (usually proprietary, although open source ones such as MySQL and Ingres show up more and more often) and ERP/CRM (hosted in house). A few have successfully deployed a SaaS CRM ( or SugarCRM) but that’s the extent of on-demand data I have seen. So focusing on data stored in on-demand systems sounds like an odd strategy. Maybe in 10 or 15 years… but I doubt their financial backers will wait that long.

Another thing that I don’t get about Apatar is this: what stage is their product in? The logo on their Web site still says “beta”. OK, open source projects (the ones without a company driving them) tend to remain in beta forever. But if Apatar is a real commercial open source company, its customers are entitled to the best of both worlds: the openness and flexibility of open source, and the professional support and backing of a real company. That’s what the client I was working with was getting from MySQL. And that’s what I would expect should I use this product.

Anyway, it’s always interesting to see new companies emerge, and to see that data integration is still a hot space. If I can find time, I’ll try Apatar’s product at some point.

Tool selection project

August 22, 2007

I just wrapped up the first part of a tool selection project for an old client of mine, who wants to introduce open source data integration on a pilot project.

Working with the client, we have shortlisted the two open source solutions we deem to be the most mature: Pentaho’s Kettle and Talend’s Open Studio. Their respective positioning is interesting: Pentaho, being an open source BI vendor, focuses exclusively on ETL for the BI market, whereas Talend says it addresses not only BI but also operational data integration.

There are a number of other open source ETL projects out there, but none of them has the backing of a “real” company. Not to say these projects are bad, but my client is just tip-toeing into open source and they wanted to feel reassured about tech support, viability, etc. So we have looked only at these two vendors.

A big part of the evaluation was about performance. I ran Kettle and Talend Open Studio (TOS) on identical scenarios to see how they perform.

I tried not only the latest versions but also the previous version of each. It’s interesting to see how both products have made significant performance improvements in their latest builds.

Anyway, I thought I’d share the results of this benchmark: benchmark-tos-vs-kettle.pdf

I’ll be posting more info as my work with this client progresses. Right now they have not entirely confirmed their choice; they want to look at other criteria beyond pure performance. But the scale is clearly leaning toward one side… (check the benchmark if you want to know which one!)

Blogging about ETL and data integration

August 9, 2007

Just a quick note to introduce myself.  I am an independent consultant from the greater Chicago area.  I do mostly design and development work around data integration – that is, the processes that load a data warehouse or move data between databases and applications.  I have been in this space for over 15 years, and have used lots of technologies: good old scripts (Perl, Python…) and SQL programs (PL/SQL, Transact-SQL…), super expensive ETL tools, and now I am starting to work with open source products.  My clients range from large corporations and government all the way down to a couple of SMBs who completely outsource their integration problems to me (yeah, SMBs have integration problems, too).

I will be blogging mostly about the technologies I use, how they compare to one another, and some best practices I have discovered in the data integration space.