Congrats to MySQL, already some rumors?

January 22, 2008

Just a quick post to congratulate the MySQL folks on their acquisition by Sun.  Much has been said in the blogosphere, so I don’t need to repeat it.  As far as I am concerned, as an IT consultant doing a lot of open source work, this is a great validation of the open source model.

I find it funny that some journalists are already speculating on what’s going to happen next.  Take, for example, Rich Seeley from SearchSOA.com, who predicts that Sun will now go on to buy Greenplum, Talend, and why not Pentaho and JasperSoft (probably not both, though!).  I don’t know if these vendors find these rumors funny – well, at least Talend had the good grace to comment on them in their blog.

We’ll see what the future has in store for open source.

Busy, on deadline

December 20, 2007

I haven’t posted anything for a while; I’ve been very busy with a project deadline.  I haven’t had much time to enjoy the winter lately, but the weather has been bad anyway!

Right now I am managing a team of 10 engineers for a large national bank.  They use Talend for their project.  And I have to say we are amazed by the reliability of the tool, and by the execution performance of the code it generates.

To give you an idea – this project started as a pure data warehouse loading project, but the client keeps expanding the scope, and my team is turning into a full Integration Competency Center!

As soon as I have some free time, I need to update my ETL comparison… not sure when this will happen though.

Kettle also coming out with a new RC

September 28, 2007

This is RC week for open source ETL! A friend of mine forwarded me this email he got from Pentaho (I need to sign up for their mailing list, I never got around to it). Again, I am not sure when I will have time to look at this RC, but new versions are always good. It shows that Pentaho, like Talend, continues to invest in its product.

Dear friends,

Even though we had our work cut out for us the last couple of months, there
was no sign of a slowdown the last couple of weeks. In fact, a couple of
long standing items on my TODO/WANTED list finally got in:

– The debugger with breakpoints, pause/resume
– Remote execution of jobs and transformations (using job entries)

At the same time, a series of bugs got fixed too, ranging in severity from
cosmetic to blocking.

There is always room for improvement, but it looks like we’ll have to go for a
feature freeze (RC1) sooner or later anyway. Let’s do it sooner rather than
later.
We had a chat internally at Pentaho and I thought that next Monday, October
1st would be a great day to kick RC1 out of the door.

I hope you will take the opportunity with me to do final testing and bug
fixing to make RC1 as stable as possible. For the next 4 to 8 weeks we’ll be
focussing on documentation and testing to ensure that 3.0.0 is as good as we
can humanly make it. If all goes as expected those efforts should bring us
an RC2 on October 29th and a release November 19th.

All the best,

Matt

Talend asked me to beta test their RC

September 26, 2007

A couple of weeks ago I signed up for Talend’s beta tester newsletter. Yesterday I got the following email from them. I am not sure when I will have time to try this new version, but it seems to address some of the points discussed lately.

Here is the email:

Dear Talend Community Member,

We are proud to inform you that Talend Open Studio 2.2.0 Release Candidate is now available. This version contains all the features of Talend Open Studio 2.2.0 and we need you to track all problems that might exist in your Open Source data integration tool, before its release.

What’s new in this version?

The numerous new features of Talend Open Studio 2.2.0 include:
– enhancement of the management of contexts (GUI, new tContextDump component)
– export jobs as Java Web Services
– graphical expression builder

Talend Open Studio is now based on the latest Eclipse version (3.3), so you can benefit from all the improvements of this new framework (including support for Windows Vista).

We have also integrated new components:

Java :
– Support for more databases: AS/400 connector, generic JDBC connector
– Slowly Changing Dimensions for MySQL, Oracle, Ingres, MS SQL, DB2, Sybase (support for types 1, 2 & 3, support for Surrogate Keys, etc.)
– Support for stored procedures in Oracle, MS SQL, Ingres, MySQL, DB2
– Connection sharing for Oracle and PostgreSQL
– Support for LDIF/LDAP
– “Wait for file” and “Wait for SQL Data” to start a job upon the appearance of a file or of certain records in a table
– Flow merge and split (tUnite and tReplicate)
– Support for SCP

Perl :
– Multiple substitutions, simple and complex (tReplace)
– Connection sharing for Oracle and PostgreSQL
– Lookup with multiple matches
– “Wait for file” and “Wait for SQL Data” to start a job upon the appearance of a file or of certain records in a table
– Flow data metering
– File touch
– Flow merge and split (tUnite and tReplicate)
– Support for SCP

Performance of complex jobs has been significantly improved by passing data structures as references. Check out this scenario to feel the performance enhancement: http://www.talendforge.org/wiki/doku.php?id=performances:scenario_3.

Please download and test this Release Candidate, read the documentation, go through the tutorials, chat with us on the Forum, suggest new features and report bugs on our Bugtracker, check out our technical documentation on the Wiki…

Joining the Talend community is the best way to influence the progress of your preferred data integration solution!

The download is available at http://www.talend.com/download.php.
The community tools (Forum, Bugtracker, Change Log, Wiki, Subversion, Trac, Flash tutorials…) are available at http://www.talendforge.org.

Thanks again for your support and your involvement!
Best regards,

The Talend Team

Managing Slowly Changing Dimensions

September 11, 2007

In July, Talend announced support for something new: Slowly Changing Dimensions. I guess they were playing catch-up, because as far as I know this has been supported by Kettle for a while. Never mind, I thought I would give it a try and compare how well both tools support SCDs.

Bottom line: both tools make SCD management super easy. Congratulations guys, you made a pretty difficult concept easy to implement. Clearly, Talend’s implementation is still young: it is missing some features such as surrogate keys or specifying the end date. Kettle has more thorough functional coverage.

Something that’s missing from both tools, however: Type 3 SCDs. OK, I’ll grant you this – in my years of consulting, I have never had to implement a Type 3 SCD. But still, it would be good to have it, just in case you need it 🙂
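
For readers who haven’t dealt with SCDs before, here is a rough Python sketch of what a Type 2 update boils down to: close the current version of the row with an end date and insert a new version under a fresh surrogate key. The layout and names are purely illustrative – both Kettle and Talend generate their own code for this, and a real implementation would of course run against a database.

```python
from datetime import date

# Illustrative Type 2 SCD update: each attribute change closes the current
# dimension row (end_date) and inserts a new row with a fresh surrogate key.
# The "table" is just a list of dicts; the names are made up for the example.

dimension = []          # stands in for the dimension table
next_surrogate_key = 1  # a real database would use a sequence or identity column


def upsert_scd2(natural_key, attributes, load_date):
    """Apply one source record to the dimension, Type 2 style."""
    global next_surrogate_key

    current = next((row for row in dimension
                    if row["natural_key"] == natural_key and row["end_date"] is None),
                   None)

    if current is not None:
        if current["attributes"] == attributes:
            return current["surrogate_key"]   # nothing changed, keep the current version
        current["end_date"] = load_date       # close the old version

    new_key = next_surrogate_key
    next_surrogate_key += 1
    dimension.append({
        "surrogate_key": new_key,             # what the fact table will reference
        "natural_key": natural_key,
        "attributes": attributes,
        "start_date": load_date,
        "end_date": None,                     # None marks the current version
    })
    return new_key


# A customer moves: the dimension keeps both versions of the record.
upsert_scd2("CUST-42", {"city": "Chicago"}, date(2007, 1, 1))
upsert_scd2("CUST-42", {"city": "Evanston"}, date(2007, 9, 1))
print(dimension)
```

A Type 1 change would simply overwrite the attributes in place, and a Type 3 would keep a “previous value” column next to the current one – which is exactly the option both tools are missing today.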

From the performance standpoint, Talend clearly makes up for its functional gaps. I ran a test with 25,000 source records. When creating the dimension, TOS went through the process in 8.7 seconds, but it took Kettle 675 seconds! Updating the dimension, a much more resource-consuming process, took TOS 512 seconds and Kettle 1,323 seconds.

Which tells me another thing: no vendor can claim to always be 50 or 100 times faster than others! Performance comparisons depend so much on which test you run. In my case, TOS is 78 times faster than Kettle in the first test, but only 2.6 times faster in the second one.
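
For the record, those ratios are just the raw timings divided out – a quick check anyone can reproduce:

```python
# Raw timings from my 25,000-record test, in seconds.
tos_load, kettle_load = 8.7, 675
tos_update, kettle_update = 512, 1323

print(f"Initial load:     Kettle/TOS = {kettle_load / tos_load:.0f}x")      # about 78x
print(f"Dimension update: Kettle/TOS = {kettle_update / tos_update:.1f}x")  # about 2.6x
```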

Dialog with the vendors…

August 27, 2007

My post last week on a tool selection project attracted lots of interest, at least from the two vendors whose tools I looked at: Pentaho and Talend.  I am actually impressed (and proud!) to have gotten comments from Matt Casters and Fabrice Bonan – the founders of Kettle and Talend, respectively.  Thanks for your interest in my blog!

Something nice Matt said (“the way Talend is handling all their database operators is pretty horrible”) got my attention.  I did not feel that way when looking at Talend… but he may have a point.  Both tools have very different approaches, and it’s likely that each of them has pluses and minuses, depending on the situation.

If Matt and Fabrice are still reading, I would be very interested in getting their perspective on how differently the tools handle database operators.  I already know which tool each of them will find best, but some factual elements would be interesting.  Please reply in the comments; if I get interesting material, I will summarize it in another post.

Open source data integration in Network World

August 24, 2007

Today I ran into an article in Network World in which they selected eight open source companies to watch. I usually don’t pay too much attention to these marketing things, but I found it interesting that two of the companies selected are doing data integration: Apatar and Talend.

Talend I already know (see this post about my recent product selection project). But Apatar was new to me, so I decided to take a closer look. I don’t have the bandwidth at this point to try the product (maybe later), but the positioning itself is interesting.

The first thing I noticed (in the Network World article) is that the company is called Apatar because the name starts with an A and the domain name was available. Well, I guess that was a good move, since Apatar is listed first in the article! I guess at some point journalists will have to use reverse alphabetical order, or random order, to be fair to vendors starting with X, Y and Z. Talend also commented on this in their blog (tough luck, they start with a T…).

So what’s Apatar about? They say they are the first provider of on-demand open source data integration. There seems to be a lot of “first something” in this field.

According to their site, it’s about integrating data from Web 2.0 applications such as Flickr and Amazon S3. Well, granted, that’s kind of trendy, but how useful is it? I have been doing data integration for 15 years and have never seen anyone store enterprise data in Amazon S3 (even recently)! Maybe I am only doing business with dusty companies… but let’s face it: a lot of data is still stored in legacy systems (mainframes, files…). Today, the majority of systems my clients deal with are RDBMS (usually proprietary, although open source ones such as MySQL and Ingres show up more and more often) and ERP/CRM (hosted in house). A few have successfully deployed a SaaS CRM (Salesforce.com or SugarCRM), but that’s the extent of on-demand data I have seen. So focusing on data stored in on-demand systems sounds like an odd strategy. Maybe in 10 or 15 years… but I doubt their financial backers will wait that long.

Another thing I don’t get about Apatar is this: what stage is their product at? The logo on their Web site still says “beta”. OK, open source projects (the ones without a company driving them) tend to remain in beta forever. But if Apatar is a real commercial open source company, its customers are entitled to the best of both worlds: the openness and flexibility of open source, and the professional support and backing of a real company. That’s what the client I was working with was getting from MySQL. And that’s what I would expect should I use this product.

Anyway, it’s always interesting to see new companies emerge, and to see that data integration is still a hot space. If I can find time, I’ll try Apatar’s product at some point.

Tool selection project

August 22, 2007

I just wrapped up the first part of a tool selection project for an old client of mine, who wants to introduce open source data integration on a pilot project.

Working with the client, we have shortlisted the two open source solutions we deem to be the most mature: Pentaho’s Kettle and Talend’s Open Studio. Their respective positioning is interesting: Pentaho, being an open source BI vendor, focuses exclusively on ETL for the BI market, whereas Talend says it addresses not only BI but also operational data integration.

There are a number of other open source ETL projects out there, but none of them has the backing of a “real” company. Not to say these projects are bad, but my client is just tip-toeing into open source and wanted to feel reassured about tech support, viability, etc. So we have just been looking at these two vendors.

A big part of the evaluation was about performance. I have run Kettle and TOS on identical scenarios to see how they perform.
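
For those curious about the methodology: nothing fancy, essentially wall-clock timing around each tool’s command-line launcher, averaged over a few runs. The sketch below is a rough approximation in Python, not my actual harness – the commands are placeholders for whatever launches your scenario (Kettle’s pan script, a job exported from TOS as a shell script, etc.).

```python
import subprocess
import time

# Placeholder commands standing in for the real launchers
# (e.g. Kettle's pan script with a .ktr file, or a shell script exported from TOS).
SCENARIOS = {
    "kettle": ["./pan.sh", "-file=load_dimension.ktr"],
    "tos":    ["./load_dimension_job.sh"],
}


def time_run(command, runs=3):
    """Run a command a few times and return the average wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.time()
        subprocess.run(command, check=True)   # fail loudly if the job errors out
        timings.append(time.time() - start)
    return sum(timings) / len(timings)


for name, command in SCENARIOS.items():
    print(name, round(time_run(command), 1), "seconds")
```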

I not only tried the latest versions, but also went back one version. It’s interesting to see how both products have made significant performance improvements in their latest builds.

Anyway, I thought I’d share the results of this benchmark: benchmark-tos-vs-kettle.pdf

I’ll be posting more info as my work with this client progresses. Right now they have not entirely confirmed their choice; they want to look at other criteria beyond pure performance. But the scale is clearly leaning toward one side… (check the benchmark if you want to know which one!)

Blogging about ETL and data integration

August 9, 2007

Just a quick note to introduce myself.  I am an independent consultant from the greater Chicago area.  I do mostly design and development work around data integration – the processes that load a data warehouse or move data between databases and applications.  I have been in this space for over 15 years and have used lots of technologies: good old scripts (Perl, Python…) and SQL programs (PL/SQL, Transact-SQL…), super expensive ETL tools, and now I am starting to work with open source products.  My clients range from large corporations and government all the way down to a couple of SMBs who completely outsource their integration problems to me (yeah, SMBs have integration problems, too).

I will be blogging mostly about the technologies I use, how they compare to one another, and some best practices I have discovered in the data integration space.