published on in programming querying sql database SQLite
tags: SQL C AWK SQLite GDELT big ideas

Two big ideas of SQL: The What, not How, and querying with ease

There are many factors that have contributed to the slow but steady rise of the popularity and ubiquity of SQL as a way to work with data. Among all these factors, the most significant one in my opinion is the concept of “what” rather than “how”. The second factor of greatness of SQL for me is the ability to quickly and easily answer questions about data.

I choose to start out with these two big ideas that to me, is fundamental to the language.

The “what” rather than the “how”

The “what” rather than “how” idea seems simple, but it contains within it far reaching implications. It is the idea of expressing “what” you want to achieve, rather than stipulating exactly “how” this should be achieved, leaving the “how” up to the system to decide.

If you have had some experience solving problems by programming in a general purpose language such as C, C#, Java or Python, you should be familiar with the necessity to explain to the computer exactly how to do what you want it to do. Failing to correctly specify every step the computer should take can lead to the inability to compile your program, or the wrong output being produced, or no output at all, or at best the correct output but too late. This need for verbosity seems like a reasonable requirement, after all, the computer cannot guess what needs to happen, and you should find it natural that you have to tell it exactly what to do as well as how to do it.

The need for total control and complete verbosity is perhaps most evident in the domain of real time systems, where it is vital for the software not only to produce the correct result, but this result also needs to be produced at the exact correct time. In these real time systems, it can be that the right result at the wrong time would be as disastrous as the wrong result or total failure. A common programming language for building real time applications is C, because C allows the developer to specify the what and how very precisely, making it possible to calculate the exact time each step would take to execute. So I got it into my head to try to do some things that are easy with SQL, in none other than C language. It turned out much harder than I thought, painful really, so read on for the gory details.

The ability to ask easily

The next great idea of SQL is the ability to quickly and easily craft questions, or queries on data. I understand that SQL might look foreign and strange to someone who has never worked with it. I do invite that group of people to consider how much stranger, harder and more foreign other ways of answering this kind of questions can be if you have to use general purpose programming languages; this post being a case in point.

Data : Example from the GDELT data set

The GDELT data set is a very high resolution global collection of events, gathered from numerous online news agencies and publications. For one days worth of data from the GDELT project, we would simply like to know:

  1. How many data points, or events, is contained in the sample file.
  2. How many events had the actor 1 code of ‘AFR’, the code for Africa.

The header columns is not part of the data file. We place the data and header file in a ./gdelt directory after download.

*Nix Solution

While already well away cooking up a C language solution, I remembered AWK, and the wc program.

 $ wc -l gdelt/20190531.export.CSV
 176780

  $ gawk '{ if ($6 == "AFR") sum += 1; } END { print sum; }' gdelt/20190531.export.CSV
 1184

Here I used the wc or word count program, with the -l flag to count lines. For the count of ‘AFR’ valued actor codes I used the GNU version of awk, together with a tiny awk program for a conditional incrementing of a counter. We might want to stop here and conclude that we need go no further since we’ve gotten our answer, but the truth is that we were quite fortunate in that there even exists a tool like wc and that conditionally counting the occurances of a particular value in a particular column is fairly straightforward with awk.

Raising the bar ever so slightly by, for example wanting to know all the actor codes that appear in the file, along with the counts for each becomes tricky using awk. Furthermore, we need to verify our answer above, by computing the same in a few more ways.

C Solution

My C language solution is straightforward but quite a bit more verbose and detailed:

  1. Read the data file line by line.
  2. For each line, increase the lines counter.
  3. For each line, get the value of the column containing the actor 1 code.
  4. If the actor 1 code column contains what we are looking for, increase the second counter.
  5. After all lines were read, print the results to the console.

After some research, we can add some details on how to achieve our goal:

  1. Read the data file line by line.
    • Open the file with the fopen() function from the stdio library.
    • Read the file line by line with successive calls to fgets() function, also from sdio.
  2. For each line, increase the lines counter.
  3. For each line, get the value of the column containing the actor 1 code.
    • Use the strtok() function, from the strlib library to tokenise on a given delimiter.
    • The actor 1 code is in the fith column from the start of each line, where columns are seperated by the tab character.
  4. If the actor 1 code column contains what we are looking for, increase the second counter.
    • Use the strcmp() function from strlib to compare two strings for equality.
  5. After all lines were read, print the results to the console.
    • Use the printf() function with its very easy to use string interpolation system.

And here is the simplest first attempt C solution:

#include <stdio.h>  // fopen, fclose, fgets
#include <string.h>  // strtoc, strcmp

int main() 
{
   const char FILE_PATH[] = "../gdelt/20190531.export.CSV";
   const char DELIMITER[] = "\t";
   const char MATCH_TEXT[] = "AFR";
   const int MATCH_COLUMN_POS = 5; 

   int match_count = 0;

   FILE *fp;
   const int BUFF_SIZE = 2048;
   char buff[BUFF_SIZE];

   fp = fopen(FILE_PATH, "r"); 
   if (fp == NULL) return -1; 
   int lines_count = 0;
   while (fgets(buff, BUFF_SIZE, (FILE*)fp)) {
       lines_count++;
       char *ptr = strtok(buff, DELIMITER);

       // find the value of the ACTOR_CODE
       for (int i=0; i<MATCH_COLUMN_POS; i++)
           ptr = strtok(NULL, DELIMITER);

if (strcmp(ptr, MATCH_TEXT) == 0)
{
    match_count += 1;
}
   }
   printf("Total number of lines: %d\n", lines_count );
   printf("Number of lines matching '%s' in column %d: %d\n", MATCH_TEXT, 
           MATCH_COLUMN_POS,
           match_count );
   fclose(fp);
   return 0;
}

After compiling with gcc ./query.c -o ./query, (gcc .\query.c -o .\query.exe on Windows) and running it, the output is:

Total number of lines: 176780
Number of lines matching 'AFR' in column 5: 1184

Nice. This answer corresponds perfectly with the *nix solution, but what a lot of work for answering such a trivial question! Clearly we cannot write programs each time we want to answer something like this.

SQL Solution

As you might know, SQL is not a general purpose programming language, but rather a kind of mini language implemented in a larger database management system (or DBMS). Because of this, you cannot directly work with flat files, but have to first load the files into the system. For my SQL solution I’ll use SQLite as the SQL query engine; the DBMS.

With the data and header file in the ./gdelt directory, we get them loaded like this:

> sqlite3 gdelt.md
SQLite version 3.28.0 2019-04-16 19:49:53
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .timer on
sqlite> .separator "\t" "\r\n"
sqlite> .import ../gdelt/CSV.header.dailyupdates.txt oneday
sqlite> .import ../gdelt/20190531.export.CSV oneday

Here is the breakdown:

  1. I start by invoking the sqlite3 interactive CLI program with a name of the file where we want our database to live.
  2. All the commands starting with period (‘.’) are system related commands, and not SQL as such.
  3. The data file has no header info, but we do have a file with header info, which we load first.
  4. Loading the header file first conveniently creates the table oneday, which will hold the actual data. Later on we will show how to create tables from scrach.
  5. Loading the data file into the created table oneday does not re-create the table, but appends to it.
  6. The command .timer on is not required but provides interesting timing info output.
  7. The data file is tab delimited, new line record seperated, and we let sqlite know about this with the command .separator "\t" "\r\n".

Now we are ready to answer our question using SQL!

sqlite> select count(1) from oneday;
176780
Run Time: real 0.163 user 0.031250 sys 0.125000
sqlite> select count(1) from oneday where actor1code = 'AFR';
1056
Run Time: real 0.181 user 0.046875 sys 0.140625
sqlite>

Hmm… We arrive at the same total row or lines count, but the count for the matching actor 1 code values is very different. To figure out why this is the case is not trivial, but very instructive to illustrate even more the need for something like SQL and database systems, so stay with me.

Debugging

Why do the figure for the actor 1 code column equal to ‘AFR’ correlate for the *nix and C solutions, yet differ for the SQL sqlite3 solution? Since the majority rules, the SQL sqlite3 solution must be incorect. What is this SQLite3 anyways - it’s probably got a bug. Yet, SQLite is probably the most widely deployed database system, primarily due to it being used on mobile platforms such as Android. Surely they would nnot use a broken thing!

What we know

  1. All solutions correlate on the row / line / record count.
  2. The *nix and C language solution correlates on actor 1 code count.
  3. The SQLite3 solution actor 1 code count is less than the other two solutions.
  4. None of the solutions produced any errors or warnings on data load or query.
  5. SQLite is unlikely to have a bug causing the issue, but we might have loaded data wrongly somehow.
  6. It is unlikely that twoout of our three solutions would produce the same wrong result.

Poking around in the data

Since we have everything setup in SQLite already, let’s query the data a bit more, in hope we find some clues to help us solve the dilemma.

How many values are there for column actor1code?

sqlite> select count(distinct actor1code) from oneday;
2148

Is the count from before for the SQLite solution still the same?

sqlite> select count(1) from oneday where actor1code like 'AFR';
1056

Perhaps the comparison predicate in the where clause actor1code = 'AFR' has a problem. Let us try some LIKE matching instead.

sqlite> select count(1) from oneday where actor1code like '%AFR';
1056
sqlite> select count(1) from oneday where actor1code like '%AFR%';
1241
sqlite> select count(1) from oneday where actor1code like 'AFR%';
1139

Nope, this is not helping much, andsomething else is going on.

Edge cases

It’s always a good idea to consider some edge cases in the data. One such edge case is blank values, so let’s check for them in the actor 1 column:

sqlite> select count(1) from oneday where actor1code = '';
16475

We have seen that the SQLite solution under counts, or, gasp BOTH the AWK *nix solution AND the C language one over counts, and do it the same. Since both the AWK and the C language solution produced the same result, let’s see if we can find the blanks count of above with AWK:

 $ gawk '{ if ($6 == "") sum += 1; } END { print sum; }' ./gdelt/20190531.export.CSV

Well, it produced no output, so gawk wasn’t able to count blanks…. I bet the C language solution will do exactly the same!

Now the actor 1 code of ‘AFR’ is quite special, but we know from the documentation of GDELT and also from the header columns, that there is also an actor 2 code column, and it follows right after all the actor 1 columns. How many rows are there where the column actor1code is blank, and the actor2code is ‘AFR’:

sqlite> select count(1) from oneday where actor1code = 'AFR';
1056

sqlite> select count(1) from oneday where actor1code = '' and actor2code = 'AFR';
128

sqlite> select count(1) from oneday where actor1code = '' and actor2code = 'AFR' or actor1code = 'AFR';
1184

Now that is interesting. The C and AWK solutions count, in column 6 where actor 1 code should live, the sum of the columns where actor 1 code is ‘AFR’, plus the number of rows where actor 1 code is blank AND actor 2 code is ‘AFR’.

Before I touch that C solution again, let’s do something easier, and search a bit online; my money is on the strtok() function: I think it works different to what we assumed…

And yes! Indeed! A fellow Internet citizen came across this already.

We could now proceed and fix up the C solution, but by now I’ve kind of lost interest, and I’m truely thankful I do not have to code C for a living - it must be very, very hard. None the less, perhaps for the love of Linux, let’s at least get AWK right.

$ gawk 'BEGIN {FS = "\t"} { if ($6 == "AFR") sum += 1; } END { print sum; }' gdelt/20190531.export.CSV
1056

Yay for AWK, yay for the Unix tradition, yay for Linux! We explicitly set the field seperator (FS), same as for the SQLite solution.

DDL and DML

You may have heard the acronym DDL and DML before. The objectives of SQL can be categorised into two broad groups. The first group of objectives is related to the definition of artifacts to organise and access data. This group is commonly referred to as the data definition language (or DDL) statements . The second group of objectives relates to manipulation; the creation, the modification and the deletion of data. This group is commonly refered to as the data manipulation language (or DML) statements.

Outside the DDL and DML groups there usually exists additional statements concerned with particulars of the actual SQL-based system, such as those intended for administration and maintenance. For SQLite examples of these more admin-type keywords and functions are the ones starting with ‘.’ as illustrated above when we loaded the CSV file.

Conclusion

  • Answering even the simplest of questions about data, using only imperative programming like C is tough and error prone.
  • The Unix tradition has many, many jewels; AWK and wc two nice examples, and knowing about these can give us an edge.
  • For answering anything but the simplest queries it is best to rope in a proper SQL database system, and this can be light weight such as SQLite.
  • The “What” rather than “How” idea of SQL is huge: don’t worry how to do stuff, just declare to the computer what you want, and let the system sort out how to give it to you.
  • It’s very powerful to be able to answer questions on data, simbply by writing short queries rather than whole programs.