Personal Workflow Blog

Saturday, 14 November 2009

Sed is much slower than Perl, or not...

I wanted to do some text replacement in a huge file (about 18 GiB) made of huge lines (about 2 MiB per line)[1].

I naïvely piped it through sed and was quite shocked to find that it was CPU-bound, not I/O-bound. The average rate was about 5 MiB/s (measured with pv), and the CPU was at almost 100%. The text file was gzipped on the filesystem, but with a compression ratio of about 1:100, so the gzip process took less than 2% CPU. I then replaced the sed -e with the Perl one-liner perl -lnpe and... ta-da, it was flying at 50 MiB/s!

While I'm a big fan of Perl and know how effective it is at handling text streams, I was still astonished: being 10x faster than sed was something.

But, as the good old saying goes, too good to be true means suspect. I remembered something about character encoding and regular expressions. Since the system is entirely configured for UTF-8, I suspected the infamous overhead of UTF-8 over plain ASCII.

I was right: a little LANG=C in front of the sed command line brought sed up to 50 MiB/s as well.

So beware of the performance impact of UTF-8 string handling, and avoid it when you can.
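As a minimal sketch of the fix, the pipeline can be reproduced on a tiny sample. The file contents and the s/foo/bar/ substitution here are hypothetical stand-ins, since the post gives neither the real dump nor the real expression. LC_ALL is used instead of LANG because it takes precedence over LANG and the other LC_* variables, so it works even on systems that already set LC_ALL.

```shell
# Hypothetical stand-in for the real dump: a tiny gzipped sample file.
sample=$(mktemp)
printf 'INSERT INTO foo VALUES (1);\nINSERT INTO foo VALUES (2);\n' | gzip -c > "$sample"

# Illustrative substitution only; the real expression is not given in the post.
# LC_ALL=C is set for sed alone, so sed works on plain bytes instead of
# decoding UTF-8 characters. Adding `pv` between gzip and sed would show
# the throughput, as in the measurements above.
result=$(gzip -dc < "$sample" | LC_ALL=C sed -e 's/foo/bar/')
printf '%s\n' "$result"

rm -f "$sample"
```

Setting the variable on the sed command only leaves the rest of the shell session in its usual UTF-8 locale.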

Notes

[1] For the record, it was a MySQL dump.

Friday, 11 September 2009

Quickly replicate the clock between remote hosts with SSH

NTP is very handy for server clock synchronisation, but it can be cumbersome to deploy.

Sometimes you just need a one-shot clock synchronisation, so you use the standard date command. But it has no flag to copy the clock setting easily from one host to another.

From a remote host

Quite easy:

# date `ssh remoteuser@remotehost date +%m%d%H%M%Y.%S`

To a remote host

It's also very easy[1]:

# ssh root@remotehost date `date +%m%d%H%M%Y.%S`
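The one-liners above pass a local-time stamp, so they assume both hosts are in the same timezone. A possible variant, sketched here with hypothetical helper names (clock_stamp, pull_clock, push_clock), exchanges the stamp in UTC with date -u on both sides. The host arguments are placeholders, and the functions that actually set a clock need root on the machine being set.

```shell
# Print the current time in the MMDDhhmmCCYY.ss form that `date` accepts
# when setting the clock, expressed in UTC.
clock_stamp() {
    date -u +%m%d%H%M%Y.%S
}

# Pull: set the local clock from a remote host (run locally as root).
# Usage: pull_clock remoteuser@remotehost
pull_clock() {
    date -u "$(ssh "$1" date -u +%m%d%H%M%Y.%S)"
}

# Push: set a remote clock from the local one (the remote user must be root).
# Usage: push_clock root@remotehost
push_clock() {
    ssh "$1" date -u "$(clock_stamp)"
}

# Harmless demonstration: only prints the stamp, does not touch any clock.
stamp=$(clock_stamp)
printf '%s\n' "$stamp"
```

As with the originals, the result is only accurate to within the SSH round-trip time, which is fine for a one-shot adjustment.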

Notes

[1] Yes, I do know that logging in remotely as root is a security pitfall...

Monday, 10 August 2009

A Simple DNS Server for a SOHO Network

I'm looking for a very simple DNS server for a small network. It should be:

  • recursive and caching (usable as a proxy)
  • very simple to administer (parsing /etc/hosts would be perfect; raw DNS zones as in BIND would be overkill)
  • quite lightweight (i.e. no dependency on an SQL engine such as MySQL, unlike MyDNS)
  • seamlessly integrated with Windows name lookups (nmblookup) via proxying functions (DNS to/from NMB)