<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments for Marc Russel's Blog</title>
	<atom:link href="http://marcrussel.wordpress.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://marcrussel.wordpress.com</link>
	<description>Just another WordPress.com weblog</description>
	<lastBuildDate>Wed, 04 Feb 2009 13:29:35 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>Comment on Open source data integration in Network World by Ichiro Kato</title>
		<link>http://marcrussel.wordpress.com/2007/08/24/open-source-data-integration-in-network-world/#comment-192</link>
		<dc:creator>Ichiro Kato</dc:creator>
		<pubDate>Wed, 09 Jul 2008 08:25:43 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/08/24/open-source-data-integration-in-network-world/#comment-192</guid>
		<description>Hello,

I think people who have only skimmed the text may get the wrong idea: Apatar and Talend, though they are sometimes compared, are not designed for the same purpose.
Apatar is not made to deal with large amounts of data. The &quot;zero code&quot; choice limits the flexibility of the software. In any case, Apatar is indeed more geared towards data mashups. Performing ETL requires a tool that can keep up with the volumes of data encountered in practice, and in the open source world, I think Talend is a good choice.

Ichiro.</description>
		<content:encoded><![CDATA[<p>Hello,</p>
<p>I think people who have only skimmed the text may get the wrong idea: Apatar and Talend, though they are sometimes compared, are not designed for the same purpose.<br />
Apatar is not made to deal with large amounts of data. The &#8220;zero code&#8221; choice limits the flexibility of the software. In any case, Apatar is indeed more geared towards data mashups. Performing ETL requires a tool that can keep up with the volumes of data encountered in practice, and in the open source world, I think Talend is a good choice.</p>
<p>Ichiro.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Talend asked me to beta test their RC by Sudhendra</title>
		<link>http://marcrussel.wordpress.com/2007/09/26/talend-asked-me-to-beta-test-their-rc/#comment-191</link>
		<dc:creator>Sudhendra</dc:creator>
		<pubDate>Wed, 04 Jun 2008 15:29:46 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/09/26/talend-asked-me-to-beta-test-their-rc/#comment-191</guid>
		<description>Marc, I am trying to learn this tool. If you have more information, or an application that has already been developed and could be taken as a starting point for some development work, could you please share it? Thanks very much for your blog.</description>
		<content:encoded><![CDATA[<p>Marc, I am trying to learn this tool. If you have more information, or an application that has already been developed and could be taken as a starting point for some development work, could you please share it? Thanks very much for your blog.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Open source data integration in Network World by Planet Apatar &#187; Blog Archive &#187; Open source data integration in Network World</title>
		<link>http://marcrussel.wordpress.com/2007/08/24/open-source-data-integration-in-network-world/#comment-89</link>
		<dc:creator>Planet Apatar &#187; Blog Archive &#187; Open source data integration in Network World</dc:creator>
		<pubDate>Fri, 16 Nov 2007 13:50:27 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/08/24/open-source-data-integration-in-network-world/#comment-89</guid>
		<description>[...] Today I ran into an article in Network World in which they have selected eight open source companies to watch. I usually don&#039;t pay too much attention to these marketing things, but I found it interesting that two of the companies selected are doing data integration: Apatar and Talend.[Link] [...]</description>
		<content:encoded><![CDATA[<p>[...] Today I ran into an article in Network World in which they have selected eight open source companies to watch. I usually don&#8217;t pay too much attention to these marketing things, but I found it interesting that two of the companies selected are doing data integration: Apatar and Talend.[Link] [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Managing Slowly Changing Dimensions by Matt Casters</title>
		<link>http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-45</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Sun, 21 Oct 2007 11:53:52 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-45</guid>
		<description>Parallel processing &amp; caching...

Slow as a slowly changing dimension may be, if you take the typical case of a customer dimension, you can often see that the same record gets changed multiple times in a day or a week and then not again for months on end.  More often than not, some typo was detected and the data re-entered.  As such, you get multiple changes for the same record in your input stream.  If you send this same record to multiple copies of a step, you not only get cache misses, but can also create deadlocks if the threads use multiple database connections.

However, since PDI allows data to be partitioned, we can launch a &quot;dimension lookup/update&quot; step in parallel while guaranteeing that the data is NOT in the cache of another step copy. That is because we can send the same natural key to the same copy of the step each time it passes.

We can extend that principle to multiple servers running multiple copies of the same dimension updater as well.

HTH,
Matt</description>
		<content:encoded><![CDATA[<p>Parallel processing &amp; caching&#8230;</p>
<p>Slow as a slowly changing dimension may be, if you take the typical case of a customer dimension, you can often see that the same record gets changed multiple times in a day or a week and then not again for months on end.  More often than not, some typo was detected and the data re-entered.  As such, you get multiple changes for the same record in your input stream.  If you send this same record to multiple copies of a step, you not only get cache misses, but can also create deadlocks if the threads use multiple database connections.</p>
<p>However, since PDI allows data to be partitioned, we can launch a &#8220;dimension lookup/update&#8221; step in parallel while guaranteeing that the data is NOT in the cache of another step copy. That is because we can send the same natural key to the same copy of the step each time it passes.</p>
<p>We can extend that principle to multiple servers running multiple copies of the same dimension updater as well.</p>
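<p>A minimal sketch of that routing idea, written in Python rather than in PDI itself. It is not PDI's actual partitioning code: the CRC32 hash, the copy_for_key and partition helpers, and the customer_id field are all assumptions made for illustration.</p>

```python
import zlib
from collections import defaultdict


def copy_for_key(natural_key, copies):
    """Map a natural key to a step copy using a stable hash (assumed scheme)."""
    return zlib.crc32(natural_key.encode("utf-8")) % copies


def partition(rows, copies):
    """Route each row to the bucket of the step copy that owns its natural key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[copy_for_key(row["customer_id"], copies)].append(row)
    return buckets


# Hypothetical input stream: the same customer is corrected twice in one day.
rows = [
    {"customer_id": "C-1001", "city": "Gent"},
    {"customer_id": "C-1002", "city": "Paris"},
    {"customer_id": "C-1001", "city": "Ghent"},
]

buckets = partition(rows, copies=4)
for copy_nr, assigned in sorted(buckets.items()):
    print(copy_nr, [r["customer_id"] for r in assigned])
```

<p>Because both C-1001 rows land in the same bucket, only one copy ever caches or updates that customer, and its changes stay in arrival order within that copy.</p>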
<p>HTH,<br />
Matt</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Managing Slowly Changing Dimensions by Andrew Collins</title>
		<link>http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-34</link>
		<dc:creator>Andrew Collins</dc:creator>
		<pubDate>Sat, 29 Sep 2007 17:14:44 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-34</guid>
		<description>Hugo,
I agree and disagree with you :-)
Agree: Kettle allows you to define a max size for the lookup cache, whereas Talend loads all active records (but when Kettle needs data not in the cache it becomes very slow).
Disagree: in &quot;slowly changing dimension&quot;, there is &quot;slowly&quot;. Data in the dimension tables isn&#039;t supposed to change every second. Even if records get updated during the SCD update, the latest value is used. Parallelizing several SCDs on one table should be no problem, as long as the source keys are spread across the components.

Andrew</description>
		<content:encoded><![CDATA[<p>Hugo,<br />
I agree and disagree with you <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /><br />
Agree: Kettle allows you to define a max size for the lookup cache, whereas Talend loads all active records (but when Kettle needs data not in the cache it becomes very slow).<br />
Disagree: in &#8220;slowly changing dimension&#8221;, there is &#8220;slowly&#8221;. Data in the dimension tables isn&#8217;t supposed to change every second. Even if records get updated during the SCD update, the latest value is used. Parallelizing several SCDs on one table should be no problem, as long as the source keys are spread across the components.</p>
<p>Andrew</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Managing Slowly Changing Dimensions by Hugo</title>
		<link>http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-32</link>
		<dc:creator>Hugo</dc:creator>
		<pubDate>Wed, 26 Sep 2007 10:55:19 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-32</guid>
		<description>Actually, you can&#039;t do a comparison of this nature.

I bet that if you look at the generated code, Talend does a select * type of operation on the target table and stores all of this data in a hash.
The problem with this approach is that it isn&#039;t safe for parallel processing (what if a separate workflow updates values in the dimension while it is being used?).
It&#039;s also NOT scalable: if I have, say, a customer dimension with millions of customers, I will quickly reach the RAM limits and the transformation will then become very, very slow.
You may be able to do a join with the source database and restrict the values you cache in the lookup to those in your source, but this is going to affect your performance and require you to stage everything.</description>
		<content:encoded><![CDATA[<p>Actually, you can&#8217;t do a comparison of this nature.</p>
<p>I bet that if you look at the generated code, Talend does a select * type of operation on the target table and stores all of this data in a hash.<br />
The problem with this approach is that it isn&#8217;t safe for parallel processing (what if a separate workflow updates values in the dimension while it is being used?).<br />
It&#8217;s also NOT scalable: if I have, say, a customer dimension with millions of customers, I will quickly reach the RAM limits and the transformation will then become very, very slow.<br />
You may be able to do a join with the source database and restrict the values you cache in the lookup to those in your source, but this is going to affect your performance and require you to stage everything.</p>
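<p>A rough sketch of that last option, assuming a staging table already holds the source rows. The table and column names (dim_customer, stg_customer, customer_sk, customer_id) are made up for the example, and an in-memory SQLite database stands in for whatever database is really used; this is not the code Talend generates.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,  -- surrogate key
        customer_id TEXT,                 -- natural key
        city TEXT
    );
    CREATE TABLE stg_customer (customer_id TEXT, city TEXT);
    INSERT INTO dim_customer (customer_id, city)
        VALUES ('C-1', 'Gent'), ('C-2', 'Paris'), ('C-3', 'Lyon');
    INSERT INTO stg_customer VALUES ('C-2', 'Nice'), ('C-4', 'Tokyo');
    """
)

# Cache only the dimension rows whose natural key appears in the staged
# source, instead of loading the whole dimension into memory.
cache = {
    customer_id: (customer_sk, city)
    for customer_sk, customer_id, city in conn.execute(
        """
        SELECT d.customer_sk, d.customer_id, d.city
        FROM dim_customer AS d
        JOIN stg_customer AS s ON s.customer_id = d.customer_id
        """
    )
}
print(cache)
```

<p>The cache then holds one entry per staged key that already exists in the dimension, at the cost of an extra join and of staging the source first.</p>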
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Managing Slowly Changing Dimensions by Matt Casters</title>
		<link>http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-24</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Fri, 14 Sep 2007 09:34:18 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-24</guid>
		<description>Dear Marc, 

All sarcasm aside, our SCD implementation has been around for about 4 years now, so I&#039;m not surprised it&#039;s running well either.  However, I resent that you are trying to imply I&#039;m trying to &quot;fix&quot; our numbers somehow.  

All the transformations and input files we use for testing are completely out in the open, and if you want, you can run them yourself.  I even included links to the test in question for your convenience (v2.5 vs 3.0, so you have both).

That is a lot more than I can say about your test claiming our SCD implementation is incredibly slow.  You probably forgot to create an index, slowing our transformation down, or something.  Who knows, right?  A local or a remote database?  What type of database?  Index configurations?  A lot of things can influence the numbers.  You mention creating the SCD; what about lookups, partial updates with mostly inserts, mostly updates, etc.?

If you are interested in those numbers for PDI, I put the link to the transformations and the unit test-cases in my previous post.

By the way, it should be fairly simple to do a Type 3 SCD in PDI.  You can look up the original or (more often required) previous entry of a particular column with a second lookup step before you do the actual update.

HTH,
Matt</description>
		<content:encoded><![CDATA[<p>Dear Marc, </p>
<p>All sarcasm aside, our SCD implementation has been around for about 4 years now, so I&#8217;m not surprised it&#8217;s running well either.  However, I resent that you are trying to imply I&#8217;m trying to &#8220;fix&#8221; our numbers somehow.  </p>
<p>All the transformations and input files we use for testing are completely out in the open, and if you want, you can run them yourself.  I even included links to the test in question for your convenience (v2.5 vs 3.0, so you have both).</p>
<p>That is a lot more than I can say about your test claiming our SCD implementation is incredibly slow.  You probably forgot to create an index, slowing our transformation down, or something.  Who knows, right?  A local or a remote database?  What type of database?  Index configurations?  A lot of things can influence the numbers.  You mention creating the SCD; what about lookups, partial updates with mostly inserts, mostly updates, etc.?</p>
<p>If you are interested in those numbers for PDI, I put the link to the transformations and the unit test-cases in my previous post.</p>
<p>By the way, it should be fairly simple to do a Type 3 SCD in PDI.  You can look up the original or (more often required) previous entry of a particular column with a second lookup step before you do the actual update.</p>
<p>HTH,<br />
Matt</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Managing Slowly Changing Dimensions by marcrussel</title>
		<link>http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-23</link>
		<dc:creator>marcrussel</dc:creator>
		<pubDate>Thu, 13 Sep 2007 17:08:23 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-23</guid>
		<description>Hi Matt,
Why am I not surprised that you have performance data that makes your tool the greatest?  I am sure Fabrice has some, too.  Sorry if you don&#039;t like my &quot;bogus&quot; numbers... I won&#039;t try to convince you!</description>
		<content:encoded><![CDATA[<p>Hi Matt,<br />
Why am I not surprised that you have performance data that makes your tool the greatest?  I am sure Fabrice has some, too.  Sorry if you don&#8217;t like my &#8220;bogus&#8221; numbers&#8230; I won&#8217;t try to convince you!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Managing Slowly Changing Dimensions by Fabrice</title>
		<link>http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-21</link>
		<dc:creator>Fabrice</dc:creator>
		<pubDate>Wed, 12 Sep 2007 08:31:41 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-21</guid>
		<description>I agree, our SCD implementation is young and it is only available in a milestone version right now. In our next main version (2.2, coming out October 5), we will for sure support surrogate keys (in many ways, in fact)!</description>
		<content:encoded><![CDATA[<p>I agree, our SCD implementation is young and it is only available in a milestone version right now. In our next main version (2.2, coming out October 5), we will for sure support surrogate keys (in many ways, in fact)!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Managing Slowly Changing Dimensions by Matt Casters</title>
		<link>http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-20</link>
		<dc:creator>Matt Casters</dc:creator>
		<pubDate>Wed, 12 Sep 2007 08:01:15 +0000</pubDate>
		<guid isPermaLink="false">http://marcrussel.wordpress.com/2007/09/11/managing-slowly-changing-dimensions/#comment-20</guid>
		<description>I would like to point out that the performance numbers are bogus as well.
We have a number of test-transformations in our performance test-suite:  http://kettle.pentaho.org/svn/Kettle/trunk/test/org/pentaho/di/run/dimensionlookup/

For a typical result (v2 versus v3), see for example an older run here: http://kettle.pentaho.org/svn/Kettle/trunk/test/org/pentaho/di/run/RunResults-Matt-20070522.txt

I quote for v3: 250,000 rows at 12,529 rows/s = 20 seconds.  (Initial load)

I still would like to know how you can have a Slowly Changing Dimension without a surrogate key :-)

Matt</description>
		<content:encoded><![CDATA[<p>I would like to point out that the performance numbers are bogus as well.<br />
We have a number of test-transformations in our performance test-suite:  <a href="http://kettle.pentaho.org/svn/Kettle/trunk/test/org/pentaho/di/run/dimensionlookup/" rel="nofollow">http://kettle.pentaho.org/svn/Kettle/trunk/test/org/pentaho/di/run/dimensionlookup/</a></p>
<p>For a typical result (v2 versus v3), see for example an older run here: <a href="http://kettle.pentaho.org/svn/Kettle/trunk/test/org/pentaho/di/run/RunResults-Matt-20070522.txt" rel="nofollow">http://kettle.pentaho.org/svn/Kettle/trunk/test/org/pentaho/di/run/RunResults-Matt-20070522.txt</a></p>
<p>I quote for v3: 250,000 rows at 12,529 rows/s = 20 seconds.  (Initial load)</p>
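<p>As a quick check of that figure, here is the division spelled out in Python. This is plain arithmetic on the two numbers quoted above, not part of the test suite.</p>

```python
rows = 250_000           # rows in the initial load
rows_per_second = 12_529  # throughput quoted for v3

# Elapsed time is simply the row count divided by the throughput.
seconds = rows / rows_per_second
print(f"{seconds:.1f} s")
```

<p>The result is just under 20 seconds, which matches the rounded figure quoted above.</p>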
<p>I still would like to know how you can have a Slowly Changing Dimension without a surrogate key <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Matt</p>
]]></content:encoded>
	</item>
</channel>
</rss>
