Jan 01 2009
Parsing CSV string
Parsing a CSV string is something very trivial, but often many people choose an inefficient approach usually based on this pseudo-code:
loop if the CSV list is empty then break copy the first value to some variable remove the first value from the CSV list end loop
There’s lots of unnecessary work hidden in the last step of this algorithm. Removing of the first value from the list implicates copying of the entire tail of the list. This could be very inefficient especially when we are parsing large CSV list. Over and over again, I saw developers take this approach not only in PL/SQL, but also in the other languages. Even Tom did it
. I have to confess, I did it too, many many years ago
.
Correct algorithm should be based on moving a pointer (offset) along the list, it should look like this:
set the offset to the start of the CSV list loop find the next comma starting the search at the actual offset if no comma is found then break copy the value between the offset and the comma position to some variable update offset to the value of (the comma position + 1) end loop;
As you can see, there’s no unnecessary copying, just moving the offset.
Let’s test these two implementations with DBMS_PROFILER package. If you don’t have DBMS_PROFILER package installed on your database, then you can take a look at this article.
Firstly, we’ll create a PL/SQL procedure based on the first algorithm:
CREATE OR REPLACE PROCEDURE parse_csv_wrong(i_csv_list IN VARCHAR2) IS
l_comma_pos PLS_INTEGER;
l_csv_list VARCHAR2(32767);
l_dummy_num NUMBER;
BEGIN
l_csv_list := i_csv_list || ',';
LOOP
EXIT WHEN l_csv_list IS NULL;
l_comma_pos := instr(l_csv_list, ',');
l_dummy_num := to_number(substr(l_csv_list, 1, l_comma_pos - 1));
l_csv_list := substr(l_csv_list, l_comma_pos + 1);
END LOOP;
END parse_csv_wrong;
/
Line 6: We can add a sentinel comma to simplify code.
Line 9: Exit when the list is empty.
Line 11: Find the first comma.
Line 12: Extract the first value.
Line 14: Remove the first value.
Secondly, we’ll create a PL/SQL procedure that does not use copying, but advancing the offset:
CREATE OR REPLACE PROCEDURE parse_csv_right(i_csv_list IN VARCHAR2) IS
l_offset PLS_INTEGER;
l_comma_pos PLS_INTEGER;
l_csv_list VARCHAR2(32767);
l_dummy_num NUMBER;
BEGIN
l_csv_list := i_csv_list || ',';
l_offset := 1;
LOOP
l_comma_pos := instr(l_csv_list, ',', l_offset);
EXIT WHEN l_comma_pos = 0;
l_dummy_num := to_number(substr(l_csv_list, l_offset, l_comma_pos - l_offset));
l_offset := l_comma_pos + 1;
END LOOP;
END parse_csv_right;
/
Line 7: We can add a sentinel comma to simplify code.
Line 8: Set offset to the start of the list.
Line 11: Find the next comma, starting the search from current offset.
Line 12: Exit when comma is not found.
Line 14: Extract the current value.
Line 15: Advance offset.
At the end, we’ll create an anonymous PL/SQL block, which populates CSV list with 6002 values, starts profiler and calls both procedures passing created CSV list as a value for the input parameter.
DECLARE
l_csv_list VARCHAR2(32767);
BEGIN
l_csv_list := '';
for i in 1000..7000 loop
l_csv_list := l_csv_list || i || ',';
end loop;
l_csv_list := l_csv_list || 7001;
dbms_profiler.start_profiler(run_comment=>'csv parsing test');
parse_csv_wrong(l_csv_list);
parse_csv_right(l_csv_list);
dbms_profiler.stop_profiler;
END;
/
I ran the test 5 times and I used this select to identify runids.
SELECT runid,
to_char(run_date, 'DD.MM.YYYY HH24:MI:SS') run_date,
run_comment,
run_total_time
FROM plsql_profiler_runs
ORDER BY
runid;
Now, let’s take a look at the results:
SELECT u.runid,
u.unit_type,
u.unit_name,
sum(d.total_time) / 1000 microsec
FROM plsql_profiler_units u
INNER JOIN plsql_profiler_data d ON (u.runid = d.runid AND u.unit_number = d.unit_number)
WHERE u.runid between 53 and 57
AND u.unit_name LIKE 'PARSE_CSV%'
GROUP BY
u.runid,
u.unit_type,
u.unit_name
ORDER BY
u.runid,
u.unit_name
RUNID UNIT_TYPE UNIT_NAME MICROSEC
------ -------------------- -------------------------------- ----------
53 PROCEDURE PARSE_CSV_RIGHT 35790
53 PROCEDURE PARSE_CSV_WRONG 212523
54 PROCEDURE PARSE_CSV_RIGHT 36408
54 PROCEDURE PARSE_CSV_WRONG 214583
55 PROCEDURE PARSE_CSV_RIGHT 35547
55 PROCEDURE PARSE_CSV_WRONG 213146
56 PROCEDURE PARSE_CSV_RIGHT 35832
56 PROCEDURE PARSE_CSV_WRONG 214468
57 PROCEDURE PARSE_CSV_RIGHT 35585
57 PROCEDURE PARSE_CSV_WRONG 213242
10 rows selected.
What a difference! Procedure PARSE_CSV_RIGHT is 6 times faster then procedure PARSE_CSV_WRONG. We can easily identify lines, where procedures spend most of the time. I’ll show this for runid 53.
SELECT u.unit_name, d.total_occur, d.total_time / 1000 microsec, substr(s.text, 1, 60) plsql_code FROM plsql_profiler_units u INNER JOIN plsql_profiler_data d ON (u.runid = d.runid AND u.unit_number = d.unit_number) INNER JOIN all_source s ON (s.owner = u.unit_owner AND s.type = u.unit_type AND s.name = u.unit_name AND s.line = d.line#) WHERE u.runid = 53 AND u.unit_name LIKE 'PARSE_CSV%' ORDER BY u.unit_number, d.line#;
UNIT_NAME TOTAL_OCCUR MICROSEC PLSQL_CODE ---------------- ----------- ---------- ------------------------------------------------------------- PARSE_CSV_WRONG 0 4 PROCEDURE parse_csv_wrong(i_csv_list IN VARCHAR2) IS PARSE_CSV_WRONG 1 85 l_csv_list := i_csv_list || ','; PARSE_CSV_WRONG 6003 6851 EXIT WHEN l_csv_list IS NULL; PARSE_CSV_WRONG 6002 10532 l_comma_pos := instr(l_csv_list, ','); PARSE_CSV_WRONG 6002 11269 l_dummy_num := to_number(substr(l_csv_list, 1, l_comma_p PARSE_CSV_WRONG 6002 183778 l_csv_list := substr(l_csv_list, l_comma_pos + 1); PARSE_CSV_WRONG 0 4 END parse_csv_wrong; PARSE_CSV_RIGHT 0 5 PROCEDURE parse_csv_right(i_csv_list IN VARCHAR2) IS PARSE_CSV_RIGHT 1 63 l_csv_list := i_csv_list || ','; PARSE_CSV_RIGHT 1 1 l_offset := 1; PARSE_CSV_RIGHT 6003 9806 l_comma_pos := instr(l_csv_list, ',', l_offset); PARSE_CSV_RIGHT 6003 7243 EXIT WHEN l_comma_pos = 0; PARSE_CSV_RIGHT 6002 11144 l_dummy_num := to_number(substr(l_csv_list, l_offset, l_ PARSE_CSV_RIGHT 6002 7523 l_offset := l_comma_pos + 1; PARSE_CSV_RIGHT 1 5 END parse_csv_right; 15 rows selected.
In the procedure PARSE_CSV_WRONG, most amount of the time was spent by removing the first value. In the procedure PARSE_CSV_RIGHT, the time was spent by extracting the values from the list.
