<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Radino's blog &#187; parsing</title>
	<atom:link href="http://radino.eu/tag/parsing/feed/" rel="self" type="application/rss+xml" />
	<link>http://radino.eu</link>
	<description>Focusing on interesting SQL and PL/SQL problems</description>
	<lastBuildDate>Thu, 13 Jan 2011 16:22:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Parsing CSV string</title>
		<link>http://radino.eu/2009/01/01/parsing-csv-list/</link>
		<comments>http://radino.eu/2009/01/01/parsing-csv-list/#comments</comments>
		<pubDate>Thu, 01 Jan 2009 20:17:20 +0000</pubDate>
		<dc:creator>Radoslav Golian</dc:creator>
				<category><![CDATA[PL/SQL]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Tuning]]></category>
		<category><![CDATA[dbms_profiler]]></category>
		<category><![CDATA[parsing]]></category>

		<guid isPermaLink="false">http://radino.eu/?p=77</guid>
		<description><![CDATA[Parsing a CSV string is something very trivial, but often many people choose an inefficient approach usually based on this pseudo-code: loop if the CSV list is empty then break copy the first value to some variable remove the first value from the CSV list end loop There&#8217;s lots of unnecessary work hidden in the [...]]]></description>
			<content:encoded><![CDATA[<p>Parsing a CSV string is trivial, but many people choose an inefficient approach, usually based on this pseudo-code:</p>
<pre>loop
  if the CSV list is empty then break
  copy the first value to some variable
  remove the first value from the CSV list
end loop
</pre>
<p>There&#8217;s a lot of unnecessary work hidden in the last step of this algorithm. Removing the first value from the list implies copying the entire tail of the list, so the total work grows roughly quadratically with the length of the list. This can be very inefficient, especially when parsing a large CSV list. Over and over again, I have seen developers take this approach, not only in PL/SQL but in other languages as well. Even Tom <a href="http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:210612357425" target="_blank" onclick="pageTracker._trackPageview('/outgoing/asktom.oracle.com/pls/asktom/f?p=100_11_0_P11_QUESTION_ID_210612357425&amp;referer=');">did it</a> <img src='http://radino.eu/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . I have to confess, I did it too, many years ago <img src='http://radino.eu/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</p>
<p>A correct algorithm should instead move a pointer (offset) along the list. It looks like this:</p>
<pre>append a sentinel comma to the end of the CSV list
set the offset to the start of the CSV list
loop
  find the next comma, starting the search at the current offset
  if no comma is found then break
  copy the value between the offset and the comma position to some variable
  update the offset to the value of (the comma position + 1)
end loop</pre>
<p>As you can see, there&#8217;s no unnecessary copying, just moving the offset.</p>
<p>Let&#8217;s test these two implementations with the <a title="DBMS_PROFILER documentation" href="http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_profil.htm#sthref5533" target="_blank" onclick="pageTracker._trackPageview('/outgoing/download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_profil.htm_sthref5533?referer=');">DBMS_PROFILER</a> package. If you don&#8217;t have the DBMS_PROFILER package installed in your database, take a look at <a title="DBMS_PROFILER setup and usage" href="http://www.oracle-base.com/articles/9i/DBMS_PROFILER.php" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.oracle-base.com/articles/9i/DBMS_PROFILER.php?referer=');">this article</a>.</p>
<p>First, we&#8217;ll create a PL/SQL procedure based on the first algorithm:</p>
<pre>
<pre class="brush: sql">
CREATE OR REPLACE PROCEDURE parse_csv_wrong(i_csv_list IN VARCHAR2) IS
  l_comma_pos  PLS_INTEGER;
  l_csv_list   VARCHAR2(32767);
  l_dummy_num  NUMBER;
BEGIN
  l_csv_list := i_csv_list || &#039;,&#039;;

  LOOP
    EXIT WHEN l_csv_list IS NULL;

    l_comma_pos := instr(l_csv_list, &#039;,&#039;);
    l_dummy_num := to_number(substr(l_csv_list, 1, l_comma_pos - 1));

    l_csv_list := substr(l_csv_list, l_comma_pos + 1);
  END LOOP;
END parse_csv_wrong;
/
</pre>
</pre>
<p>Line 6: We can add a <a title="Sentinel (Computer science)" href="http://en.wikipedia.org/wiki/Sentinel_(computer_science)" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Sentinel_computer_science?referer=');">sentinel</a> comma to simplify code.</p>
<p>Line 9: Exit when the list is empty.</p>
<p>Line 11: Find the first comma.</p>
<p>Line 12: Extract the first value.</p>
<p>Line 14: Remove the first value.</p>
<p>Second, we&#8217;ll create a PL/SQL procedure that advances an offset instead of copying the rest of the list:</p>
<pre>
<pre class="brush: sql">
CREATE OR REPLACE PROCEDURE parse_csv_right(i_csv_list IN VARCHAR2) IS
  l_offset     PLS_INTEGER;
  l_comma_pos  PLS_INTEGER;
  l_csv_list   VARCHAR2(32767);
  l_dummy_num  NUMBER;
BEGIN
  l_csv_list := i_csv_list || &#039;,&#039;;
  l_offset := 1;

  LOOP
    l_comma_pos := instr(l_csv_list, &#039;,&#039;, l_offset);
    EXIT WHEN l_comma_pos = 0;

    l_dummy_num := to_number(substr(l_csv_list, l_offset, l_comma_pos - l_offset));
    l_offset := l_comma_pos + 1;
  END LOOP;
END parse_csv_right;
/
</pre>
</pre>
<p>Line 7: We can add a <a title="Sentinel (Computer science)" href="http://en.wikipedia.org/wiki/Sentinel_(computer_science)" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Sentinel_computer_science?referer=');">sentinel</a> comma to simplify code.</p>
<p>Line 8: Set offset to the start of the list.</p>
<p>Line 11: Find the next comma, starting the search from the current offset.</p>
<p>Line 12: Exit when no comma is found.</p>
<p>Line 14: Extract the current value.</p>
<p>Line 15: Advance offset.</p>
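<p>Both procedures above throw the parsed values away, which is fine for a benchmark. As a minimal sketch of how the same offset-based loop could hand the values back to a caller, the function below collects them into a collection. The name <code>parse_csv_numbers</code> is hypothetical and not part of the original test; <code>sys.odcinumberlist</code> is a VARRAY of NUMBER type that ships with Oracle, and the sketch assumes every value in the list is numeric.</p>
<pre class="brush: sql">
-- Hypothetical helper, not part of the profiled test above.
CREATE OR REPLACE FUNCTION parse_csv_numbers(i_csv_list IN VARCHAR2)
  RETURN sys.odcinumberlist IS
  l_result     sys.odcinumberlist := sys.odcinumberlist();
  l_offset     PLS_INTEGER := 1;
  l_comma_pos  PLS_INTEGER;
  l_csv_list   VARCHAR2(32767);
BEGIN
  -- Sentinel comma, so the last value is handled like all the others.
  l_csv_list := i_csv_list || &#039;,&#039;;

  LOOP
    l_comma_pos := instr(l_csv_list, &#039;,&#039;, l_offset);
    EXIT WHEN l_comma_pos = 0;

    -- Append the value between the offset and the comma to the result.
    l_result.EXTEND;
    l_result(l_result.COUNT) :=
      to_number(substr(l_csv_list, l_offset, l_comma_pos - l_offset));

    l_offset := l_comma_pos + 1;
  END LOOP;

  RETURN l_result;
END parse_csv_numbers;
/
</pre>
<p>Under these assumptions, the result can be queried like a table:</p>
<pre class="brush: sql">
SELECT column_value
FROM   TABLE(parse_csv_numbers(&#039;10,20,30&#039;));
</pre>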
<p>Finally, we&#8217;ll create an anonymous PL/SQL block that populates a CSV list with 6002 values, starts the profiler, and calls both procedures, passing the CSV list as the input parameter.</p>
<pre>
<pre class="brush: sql">
DECLARE
  l_csv_list   VARCHAR2(32767);
BEGIN
  l_csv_list := &#039;&#039;;
  for i in 1000..7000 loop
    l_csv_list := l_csv_list || i || &#039;,&#039;;
  end loop;
  l_csv_list := l_csv_list || 7001;

  dbms_profiler.start_profiler(run_comment=&gt;&#039;csv parsing test&#039;);

  parse_csv_wrong(l_csv_list);
  parse_csv_right(l_csv_list);

  dbms_profiler.stop_profiler;
END;
/
</pre>
</pre>
<p>I ran the test five times and used this query to identify the run IDs:</p>
<pre>
<pre class="brush: sql">
SELECT runid,
       to_char(run_date, &#039;DD.MM.YYYY HH24:MI:SS&#039;) run_date,
       run_comment,
       run_total_time
FROM   plsql_profiler_runs
ORDER BY
       runid;
</pre>
</pre>
<p>Now, let&#8217;s take a look at the results:</p>
<pre>
<pre class="brush: sql">
SELECT u.runid,
       u.unit_type,
       u.unit_name,
       sum(d.total_time) / 1000 microsec
FROM   plsql_profiler_units u
       INNER JOIN plsql_profiler_data d ON (u.runid = d.runid AND u.unit_number = d.unit_number)
WHERE  u.runid between 53 and 57
AND    u.unit_name LIKE &#039;PARSE_CSV%&#039;
GROUP BY
       u.runid,
       u.unit_type,
       u.unit_name
ORDER BY
       u.runid,
       u.unit_name;
</pre>
</pre>
<pre> RUNID UNIT_TYPE            UNIT_NAME                          MICROSEC
------ -------------------- -------------------------------- ----------
    53 PROCEDURE            PARSE_CSV_RIGHT                       35790
    53 PROCEDURE            PARSE_CSV_WRONG                      212523
    54 PROCEDURE            PARSE_CSV_RIGHT                       36408
    54 PROCEDURE            PARSE_CSV_WRONG                      214583
    55 PROCEDURE            PARSE_CSV_RIGHT                       35547
    55 PROCEDURE            PARSE_CSV_WRONG                      213146
    56 PROCEDURE            PARSE_CSV_RIGHT                       35832
    56 PROCEDURE            PARSE_CSV_WRONG                      214468
    57 PROCEDURE            PARSE_CSV_RIGHT                       35585
    57 PROCEDURE            PARSE_CSV_WRONG                      213242

10 rows selected.
</pre>
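<p>The ratio between the two procedures can also be computed per run directly from the profiler tables. The query below is a sketch that reuses the join and the runid range from the query above; the column alias <code>speedup</code> is only illustrative. For runid 53 it should give roughly 5.94 (212523 / 35790).</p>
<pre class="brush: sql">
SELECT u.runid,
       -- total time of the slow procedure divided by total time of the fast one
       round(sum(CASE WHEN u.unit_name = &#039;PARSE_CSV_WRONG&#039; THEN d.total_time END) /
             sum(CASE WHEN u.unit_name = &#039;PARSE_CSV_RIGHT&#039; THEN d.total_time END), 2) speedup
FROM   plsql_profiler_units u
       INNER JOIN plsql_profiler_data d ON (u.runid = d.runid AND u.unit_number = d.unit_number)
WHERE  u.runid BETWEEN 53 AND 57
AND    u.unit_name LIKE &#039;PARSE_CSV%&#039;
GROUP BY
       u.runid
ORDER BY
       u.runid;
</pre>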
<p>What a difference! Procedure PARSE_CSV_RIGHT is about 6 times faster than procedure PARSE_CSV_WRONG. We can easily identify the lines where the procedures spend most of their time. I&#8217;ll show this for runid 53.</p>
<pre>
<pre class="brush: sql">
SELECT u.unit_name,
 d.total_occur,
 d.total_time / 1000 microsec,
 substr(s.text, 1, 60) plsql_code
FROM   plsql_profiler_units u
 INNER JOIN plsql_profiler_data d ON (u.runid = d.runid AND u.unit_number = d.unit_number)
 INNER JOIN all_source s ON (s.owner = u.unit_owner AND s.type = u.unit_type AND s.name = u.unit_name AND s.line = d.line#)
WHERE  u.runid = 53
AND    u.unit_name LIKE &#039;PARSE_CSV%&#039;
ORDER BY
 u.unit_number, d.line#;
</pre>
</pre>
<pre>UNIT_NAME        TOTAL_OCCUR   MICROSEC PLSQL_CODE
---------------- ----------- ---------- -------------------------------------------------------------
PARSE_CSV_WRONG            0          4 PROCEDURE parse_csv_wrong(i_csv_list IN VARCHAR2) IS
PARSE_CSV_WRONG            1         85   l_csv_list := i_csv_list || ',';
PARSE_CSV_WRONG         6003       6851     EXIT WHEN l_csv_list IS NULL;
PARSE_CSV_WRONG         6002      10532     l_comma_pos := instr(l_csv_list, ',');
PARSE_CSV_WRONG         6002      11269     l_dummy_num := to_number(substr(l_csv_list, 1, l_comma_p
PARSE_CSV_WRONG         6002     183778     l_csv_list := substr(l_csv_list, l_comma_pos + 1);
PARSE_CSV_WRONG            0          4 END parse_csv_wrong;
PARSE_CSV_RIGHT            0          5 PROCEDURE parse_csv_right(i_csv_list IN VARCHAR2) IS
PARSE_CSV_RIGHT            1         63   l_csv_list := i_csv_list || ',';
PARSE_CSV_RIGHT            1          1   l_offset := 1;
PARSE_CSV_RIGHT         6003       9806     l_comma_pos := instr(l_csv_list, ',', l_offset);
PARSE_CSV_RIGHT         6003       7243     EXIT WHEN l_comma_pos = 0;
PARSE_CSV_RIGHT         6002      11144     l_dummy_num := to_number(substr(l_csv_list, l_offset, l_
PARSE_CSV_RIGHT         6002       7523     l_offset := l_comma_pos + 1;
PARSE_CSV_RIGHT            1          5 END parse_csv_right;

15 rows selected.
</pre>
<p>In the procedure PARSE_CSV_WRONG, most of the time (183778 of 212523 microseconds) was spent removing the first value. In the procedure PARSE_CSV_RIGHT, no single line dominates; the largest share (11144 microseconds) went to extracting the values from the list.</p>
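<p>A rough way to see why removing the first value is so expensive is to count the characters that end up in the shortened list after every removal. The block below is only a sketch: it rebuilds the same 6002-value list (including the sentinel comma) and assumes, as a simplified cost model, that each <code>substr</code> call copies the whole remaining tail. The variable <code>l_copied</code> is introduced here purely for illustration.</p>
<pre class="brush: sql">
-- In SQL*Plus, run SET SERVEROUTPUT ON first to see the output.
DECLARE
  l_csv_list  VARCHAR2(32767);
  l_copied    PLS_INTEGER := 0;
BEGIN
  -- Same values as in the test, each followed by a comma (the last one is the sentinel).
  FOR i IN 1000..7001 LOOP
    l_csv_list := l_csv_list || i || &#039;,&#039;;
  END LOOP;

  -- Repeat the &quot;remove the first value&quot; step and add up the length of each new tail.
  WHILE l_csv_list IS NOT NULL LOOP
    l_csv_list := substr(l_csv_list, instr(l_csv_list, &#039;,&#039;) + 1);
    l_copied   := l_copied + nvl(length(l_csv_list), 0);
  END LOOP;

  dbms_output.put_line(&#039;characters copied: &#039; || l_copied);
END;
/
</pre>
<p>Under this model the block should report 90045005 copied characters, about 3000 times the 30010 characters of the list itself, whereas the offset-based approach only has to walk the list once.</p>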
]]></content:encoded>
			<wfw:commentRss>http://radino.eu/2009/01/01/parsing-csv-list/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

