Personal Workflow Blog

To content | To menu | To search

Tag - sql

Entries feed

Saturday, 14 November 2009

Sed is much slower than Perl, or not...

I wanted to do some text replacement with a huge file (think ~18GiB), filled with huge lines (think ~2MiB per ligne)[1].

I naïvely piped it through sed and I was quite shocked that it was CPU bound, and not I/O bound. The average rate was about 5 MiB/s (measured with pv, and the CPU was at almost 100%.The text file was gzipped on the filesystem, but with a 1/100 ratio, so the gzip process just took less than 2% CPU. I replaced then the sed -e with the Perl one-liner perl -lnpe, and .... tadaa, it was flying at a rate of 50MiB/s !

While I'm a big fan of Perl, and know its effectiveness to handle text streams, I'm was still astonished : being 10x faster than sed was something.

But in the good old saying Too good to be true means suspect, I remembered something about the character encoding of the regular expression. Since the system is entirely configured in UTF8, I suspected the infamous UTF8 overhead over plain ASCII.

I was right : a little LANG=C in front of the sed command line restored the rate to 50MiB/s.

So, beware of the performance impact of UTF8 strings, and try to avoid it if you can.


[1] For the record, it was a MySQL dump

Monday, 10 August 2009

A Simple Dns Server for a SOHO Network

I'm in search of a very simple DNS Server for a small network. It should be :

  • recursive & caching (can be used as a proxy)
  • very simple administration (parsing /etc/hosts would be perfect, raw DNS zones like BIND would be a little bit overkill)
  • quite lightweight (aka no dependency on an SQL engine like MySQL, such as MyDNS)
  • Seamless integration to Windows lookups (nmblookup) via proxying functions (DNS to/from NMB)

Tuesday, 19 May 2009

Surrogate Keys : Globally Unique, Application Unique or Type Unique ?

When you just decided to use Surrogate Keys, another problem arises : which value should I use ? It seems to be a very simple problem since the key now is completely in isolation in your application and is not related to any data. The choice is quite bonding, since a change means that every value has to be changed, and you did not let this key leak outside the database did you ?

The keypoint here is isolation. Many differents kinds of isolation are available, and soon you will be able to make a informed decision.

Continue reading...

Friday, 17 April 2009

Databases: Partial Indexing

In huge tables, usually only a fraction of the table is used on a regular basis. The rest is useful either for historical purpose, or is just not being treated right now.

Continue reading...

Monday, 30 March 2009

Databases: Meta-Data (ctime & mtime)

A RDBMS is all about data. Actually also about the relationships between them, but let's focus mostly on the datas.

We can even view a RDBMS as a glorified file system, with the tables like directories and rows like files inside. Every filesystem has some meta-data attached to its elements such as the creation date and the last modification date. With this analogy it becomes quite clear that we could also have this meta-data on the table. The good part of the filesystem meta-data design is that it's handled outside the element. The last modification date is something that isn't stored in the file itself. (Actually it could, but then it had an application meaning, and is to be handled by the application). It's also completely automated, without any intervention required, or even possible, on the application part (except for specialized tools like when we restore a backup).

Continue reading...

- page 1 of 2