Personal Workflow Blog


Saturday, 14 November 2009

Sed is much slower than Perl, or not...

I wanted to do some text replacement on a huge file (think ~18 GiB), filled with huge lines (think ~2 MiB per line)[1].

I naïvely piped it through sed and was quite shocked to find that it was CPU-bound rather than I/O-bound. The average rate was about 5 MiB/s (measured with pv), and the CPU was at almost 100%. The text file was gzipped on the filesystem, but with a compression ratio of about 1/100, so the gzip process took less than 2% CPU. I then replaced the sed -e with the Perl one-liner perl -lnpe, and... ta-da, it was flying at a rate of 50 MiB/s!

While I'm a big fan of Perl and know how effective it is at handling text streams, I was still astonished: being 10x faster than sed was something.

But, as the good old saying goes, too good to be true means suspect, and I remembered something about the character encoding used by regular expressions. Since the system is entirely configured for UTF-8, I suspected the infamous overhead of UTF-8 over plain ASCII.

I was right: a little LANG=C in front of the sed command line brought the rate up to 50 MiB/s.

So beware of the performance impact of UTF-8 strings, and avoid them if you can.
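A minimal sketch of how the locale switch can be tried on a small scale. The sample file, its size and the s/foo/bar/ substitution are made up for illustration; they are not the original dump or the original expression. LC_ALL is used instead of LANG because it takes precedence when both are set.

```shell
# Build a small sample file with a few long ASCII lines (sizes are arbitrary).
sample=$(mktemp)
for i in 1 2 3 4 5; do
    head -c 200000 /dev/zero | tr '\0' 'a'
    echo " foo"
done > "$sample"

# Same substitution twice: once in the current locale, once forced to C.
default_sum=$(sed -e 's/foo/bar/' "$sample" | cksum)
c_sum=$(LC_ALL=C sed -e 's/foo/bar/' "$sample" | cksum)

# For pure ASCII data and patterns both runs should give the same output.
echo "current locale: $default_sum"
echo "C locale:       $c_sum"

rm -f "$sample"
```

Prefixing each sed call with time, or piping through pv as above, shows whether the locale makes a difference on a given system. Note that in the C locale sed works on bytes, so the trick is only safe when neither the pattern nor the replacement depends on multi-byte characters.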

Notes

[1] For the record, it was a MySQL dump

Friday, 11 September 2009

Synchronize clocks between hosts with SSH

NTP is very handy for server clock synchronisation, but it can be cumbersome to deploy.

Sometimes you just need a one-shot clock synchronisation, so you use the standard date command. But it has no flag to easily copy the time from one host to another.

From a remote host

Quite easy:

# date `ssh remoteuser@remotehost date +%m%d%H%M%Y.%S`

To a remote host

It's also very easy[1]:

# ssh root@remotehost date `date +%m%d%H%M%Y.%S`
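As a hedged sketch, the same command can be built step by step and printed as a dry run before touching any clock. The -u variant is an addition that assumes the two hosts may be in different time zones: the timestamp is then generated and interpreted as UTC on both sides. remotehost is a placeholder, as above.

```shell
# Timestamp in the MMDDhhmmCCYY.ss form that date accepts for setting the clock.
stamp=$(date +%m%d%H%M%Y.%S)

# Dry run: echo prints the command instead of executing it.
echo ssh root@remotehost date "$stamp"

# Variant for hosts in different time zones: use UTC on both ends.
utc_stamp=$(date -u +%m%d%H%M%Y.%S)
echo ssh root@remotehost date -u "$utc_stamp"
```

Once the printed command looks right, drop the echo to run it for real.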

Notes

[1] Yes, I do know that logging in remotely as root is a security pitfall...

Tuesday, 8 September 2009

Databases: Efficient Case-insensitive searches with Function-based Indexing

Doing a case-insensitive search is a very common task, but it is quite hard to optimize correctly. Since it is usually written as UPPER(MY_COLUMN) = UPPER('MY_DATA'), it cannot use an index that may exist on MY_COLUMN.

Different RDBMSs mean different approaches.

Continue reading...

Monday, 7 September 2009

Overloading a method is hard: a common pitfall

As I said in my equality article, overloading in Java[1] is resolved by the static type of the argument, not the run-time type.

It's a generic problem in most compiled OO languages, since overloading resolution usually happens at compile time and not at run time.

That argues for the well-known idiom:

Never overload a method with one that has the same number of parameters.

Actually, it should be enough to overload a method only with one whose parameter types are not related by inheritance: String and Number would be OK, but MyClass and Object would not.

Notes

[1] It's not really a Java-ism; it's the same in other languages, such as C++.

Monday, 10 August 2009

A Simple DNS Server for a SOHO Network

I'm in search of a very simple DNS server for a small network. It should be:

  • recursive and caching (usable as a proxy)
  • very simple to administer (parsing /etc/hosts would be perfect; raw DNS zones as in BIND would be overkill)
  • quite lightweight (i.e. no dependency on an SQL engine such as MySQL, unlike MyDNS)
  • seamlessly integrated with Windows name lookups (nmblookup) via proxying functions (DNS to/from NMB)
