Jan 01 2009

Parsing CSV string

Category: PL/SQL,SQL,TuningRadoslav Golian @ 10:17 pm

Parsing a CSV string is something very trivial, but often many people choose an inefficient approach usually based on this pseudo-code:

loop
  if the CSV list is empty then break
  copy the first value to some variable
  remove the first value from the CSV list
end loop

There’s lots of unnecessary work hidden in the last step of this algorithm. Removing of the first value from the list implicates copying of the entire tail of the list. This could be very inefficient especially when we are parsing large CSV list. Over and over again, I saw developers take this approach not only in PL/SQL, but also in the other languages. Even Tom did it :). I have to confess, I did it too, many many years ago :).

Correct algorithm should be based on moving a pointer (offset) along the list, it should look like this:

set the offset to the start of the CSV list
loop
  find the next comma starting the search at the actual offset
  if no comma is found then break
  copy the value between the offset and the comma position to some variable
  update offset to the value of (the comma position + 1)
end loop;

As you can see, there’s no unnecessary copying, just moving the offset.

Let’s test these two implementations with DBMS_PROFILER package. If you don’t have DBMS_PROFILER package installed on your database, then you can take a look at this article.

Firstly, we’ll create a PL/SQL procedure based on the first algorithm:

CREATE OR REPLACE PROCEDURE parse_csv_wrong(i_csv_list IN VARCHAR2) IS
  l_comma_pos  PLS_INTEGER;
  l_csv_list   VARCHAR2(32767);
  l_dummy_num  NUMBER;
BEGIN
  l_csv_list := i_csv_list || ',';

  LOOP
    EXIT WHEN l_csv_list IS NULL;

    l_comma_pos := instr(l_csv_list, ',');
    l_dummy_num := to_number(substr(l_csv_list, 1, l_comma_pos - 1));

    l_csv_list := substr(l_csv_list, l_comma_pos + 1);
  END LOOP;
END parse_csv_wrong;
/

Line 6: We can add a sentinel comma to simplify code.

Line 9: Exit when the list is empty.

Line 11: Find the first comma.

Line 12: Extract the first value.

Line 14: Remove the first value.

Secondly, we’ll create a PL/SQL procedure that does not use copying, but advancing the offset:

CREATE OR REPLACE PROCEDURE parse_csv_right(i_csv_list IN VARCHAR2) IS
 l_offset     PLS_INTEGER;
 l_comma_pos  PLS_INTEGER;
 l_csv_list   VARCHAR2(32767);
 l_dummy_num  NUMBER;
BEGIN
  l_csv_list := i_csv_list || ',';
  l_offset := 1;

  LOOP
    l_comma_pos := instr(l_csv_list, ',', l_offset);
    EXIT WHEN l_comma_pos = 0;

    l_dummy_num := to_number(substr(l_csv_list, l_offset, l_comma_pos - l_offset));
    l_offset := l_comma_pos + 1;
  END LOOP;
END parse_csv_right;
/

Line 7: We can add a sentinel comma to simplify code.

Line 8: Set offset to the start of the list.

Line 11: Find the next comma, starting the search from current offset.

Line 12: Exit when comma is not found.

Line 14: Extract the current value.

Line 15: Advance offset.

At the end, we’ll create an anonymous PL/SQL block, which populates CSV list with 6002 values, starts profiler and calls both procedures passing created CSV list as a value for the input parameter.

DECLARE
  l_csv_list   VARCHAR2(32767);
BEGIN
  l_csv_list := '';
  for i in 1000..7000 loop
    l_csv_list := l_csv_list || i || ',';
  end loop;
  l_csv_list := l_csv_list || 7001;

  dbms_profiler.start_profiler(run_comment=>'csv parsing test');

  parse_csv_wrong(l_csv_list);
  parse_csv_right(l_csv_list);

  dbms_profiler.stop_profiler;
END;
/

I ran the test 5 times and I used this select to identify runids.

SELECT runid,
           to_char(run_date, 'DD.MM.YYYY HH24:MI:SS') run_date,
           run_comment,
           run_total_time
FROM    plsql_profiler_runs
ORDER BY
           runid;

Now, let’s take a look at the results:

SELECT u.runid,
       u.unit_type,
       u.unit_name,
       sum(d.total_time) / 1000 microsec
FROM   plsql_profiler_units u
       INNER JOIN plsql_profiler_data d ON (u.runid = d.runid AND u.unit_number = d.unit_number)
WHERE  u.runid between 53 and 57
AND    u.unit_name LIKE 'PARSE_CSV%'
GROUP BY
       u.runid,
       u.unit_type,
       u.unit_name
ORDER BY
       u.runid,
       u.unit_name
 RUNID UNIT_TYPE            UNIT_NAME                          MICROSEC
------ -------------------- -------------------------------- ----------
    53 PROCEDURE            PARSE_CSV_RIGHT                       35790
    53 PROCEDURE            PARSE_CSV_WRONG                      212523
    54 PROCEDURE            PARSE_CSV_RIGHT                       36408
    54 PROCEDURE            PARSE_CSV_WRONG                      214583
    55 PROCEDURE            PARSE_CSV_RIGHT                       35547
    55 PROCEDURE            PARSE_CSV_WRONG                      213146
    56 PROCEDURE            PARSE_CSV_RIGHT                       35832
    56 PROCEDURE            PARSE_CSV_WRONG                      214468
    57 PROCEDURE            PARSE_CSV_RIGHT                       35585
    57 PROCEDURE            PARSE_CSV_WRONG                      213242

10 rows selected.

What a difference! Procedure PARSE_CSV_RIGHT is 6 times faster then procedure PARSE_CSV_WRONG. We can easily identify lines, where procedures spend most of the time.  I’ll show this for runid 53.

SELECT u.unit_name,
 d.total_occur,
 d.total_time / 1000 microsec,
 substr(s.text, 1, 60) plsql_code
FROM   plsql_profiler_units u
 INNER JOIN plsql_profiler_data d ON (u.runid = d.runid AND u.unit_number = d.unit_number)
 INNER JOIN all_source s ON (s.owner = u.unit_owner AND s.type = u.unit_type AND s.name = u.unit_name AND s.line = d.line#)
WHERE  u.runid = 53
AND    u.unit_name LIKE 'PARSE_CSV%'
ORDER BY
 u.unit_number, d.line#;
UNIT_NAME        TOTAL_OCCUR   MICROSEC PLSQL_CODE
---------------- ----------- ---------- -------------------------------------------------------------
PARSE_CSV_WRONG            0          4 PROCEDURE parse_csv_wrong(i_csv_list IN VARCHAR2) IS
PARSE_CSV_WRONG            1         85   l_csv_list := i_csv_list || ',';
PARSE_CSV_WRONG         6003       6851     EXIT WHEN l_csv_list IS NULL;
PARSE_CSV_WRONG         6002      10532     l_comma_pos := instr(l_csv_list, ',');
PARSE_CSV_WRONG         6002      11269     l_dummy_num := to_number(substr(l_csv_list, 1, l_comma_p
PARSE_CSV_WRONG         6002     183778     l_csv_list := substr(l_csv_list, l_comma_pos + 1);
PARSE_CSV_WRONG            0          4 END parse_csv_wrong;
PARSE_CSV_RIGHT            0          5 PROCEDURE parse_csv_right(i_csv_list IN VARCHAR2) IS
PARSE_CSV_RIGHT            1         63   l_csv_list := i_csv_list || ',';
PARSE_CSV_RIGHT            1          1   l_offset := 1;
PARSE_CSV_RIGHT         6003       9806     l_comma_pos := instr(l_csv_list, ',', l_offset);
PARSE_CSV_RIGHT         6003       7243     EXIT WHEN l_comma_pos = 0;
PARSE_CSV_RIGHT         6002      11144     l_dummy_num := to_number(substr(l_csv_list, l_offset, l_
PARSE_CSV_RIGHT         6002       7523     l_offset := l_comma_pos + 1;
PARSE_CSV_RIGHT            1          5 END parse_csv_right;

15 rows selected.

In the procedure PARSE_CSV_WRONG, most amount of the time was spent by removing the first value. In the procedure PARSE_CSV_RIGHT, the time was spent by extracting the values from the list.

Tags: ,

2 Responses to “Parsing CSV string”

  1. Seb says:

    Thank you very much. Works perfectly and easy to adapt thanks to detailed explanation.

  2. Kamran Qadir says:

    Thanks a lot, it saved my time.

Leave a Reply