GAE: Storing serializable objects in datastore

Google App Engine’s datastore supports a variety of simple types and property classes by default. However, if we want to store something like a dictionary, we typically have to serialize it and store it as a Blob, then deserialize it on fetch. While this approach works, it is repetitive and somewhat error-prone.

Wouldn’t it be great if there were a SerializableProperty class that handled this automatically for us? It doesn’t exist, but according to this article, it is easy to create our own customized Property classes. So here is a simple implementation of SerializableProperty that worked for me:

import cPickle as pickle
import zlib
from google.appengine.ext import db

class SerializableProperty(db.Property):
  """
  A SerializableProperty will be pickled and compressed before it is
  saved as a Blob in the datastore. On fetch, it is decompressed
  and unpickled. It allows us to save serializable objects (e.g. dicts)
  in the datastore.

  The sequence of transformations applied can be customized by calling
  the set_transforms() method.
  """

  data_type = db.Blob
  _tfm = [lambda x: pickle.dumps(x,2), zlib.compress]
  _itfm = [zlib.decompress, pickle.loads]

  def set_transforms(self, tfm, itfm):
    self._tfm = tfm
    self._itfm = itfm

  def get_value_for_datastore(self, model_instance):
    value = super(SerializableProperty,
        self).get_value_for_datastore(model_instance)
    if value is not None:
      value = self.data_type(reduce(lambda x,f: f(x), self._tfm, value))
    return value

  def make_value_from_datastore(self, value):
    if value is not None:
      value = reduce(lambda x,f: f(x), self._itfm, value)
    return value

Usage is as simple as this:

class MyModel(db.Model):
  data = SerializableProperty()

entity = MyModel(data = {"key": "value"}, key_name="somekey")
entity.put()  # the dict is pickled and compressed on save
entity = MyModel.get_by_key_name("somekey")
# entity.data is the original dict again
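
The transform chain can be tried without App Engine. The following is a minimal sketch in Python 3 (the class above is Python 2), and the helper name apply_transforms is hypothetical. It feeds a value through the same two lists of functions and checks that the round trip restores it:

```python
import pickle
import zlib
from functools import reduce


def apply_transforms(value, transforms):
    """Feed value through each function in transforms, left to right."""
    return reduce(lambda x, f: f(x), transforms, value)


# Same chains as in SerializableProperty: pickle then compress, and the inverse.
tfm = [lambda x: pickle.dumps(x, 2), zlib.compress]
itfm = [zlib.decompress, pickle.loads]

original = {"key": "value", "numbers": [1, 2, 3]}
stored = apply_transforms(original, tfm)    # bytes, as they would go into the Blob
restored = apply_transforms(stored, itfm)   # back to a dict

print(restored == original)
```

The order matters: the inverse list must undo the forward list in reverse, which is why decompression comes before unpickling.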

Hope that helps!

Update (20091126): I’ve changed db.Blob to self.data_type as suggested by Peritus in a comment. The same comment also suggested a JSONSerializableProperty subclass:

import simplejson as json
class JSONSerializableProperty(SerializableProperty):
  data_type = db.Text
  _tfm = [json.dumps]
  _itfm = [json.loads]

Thanks Peritus!
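
One thing to keep in mind when choosing JSON over pickle is that JSON does not round-trip every Python object unchanged. A small Python 3 sketch using the standard json module (rather than simplejson) shows two common cases:

```python
import json

data = {"point": (1, 2), 3: "three"}
restored = json.loads(json.dumps(data))

# Tuples come back as lists and non-string keys come back as strings.
print(restored)
```

If the stored objects rely on tuples or integer keys, the pickle-based property is probably the safer choice.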

Building SAGE 4.1.1 on Fedora 11

While building SAGE 4.1.1 on an AMD Phenom II running Fedora 11, GCC 4.4.1 hung the machine when compiling “base3.c” of PARI. Apparently it was using up all the available memory.

According to this thread, it happens on Ubuntu 9.10 too and the solution is to compile with -O1 instead of -O3 optimization. Unfortunately it wasn’t obvious (to me, at least) how to make GCC use -O1 specifically for PARI only.

Digging around in SAGE’s build system, I figured it could be done by repacking the PARI spkg with a modified “get_cc” script:

cd sage-4.1.1/spkg/standard
tar jxf pari-2.3.3.p1.spkg
sed 's/OPTFLAGS=-O3/OPTFLAGS=-O1/g' \
  pari-2.3.3.p1/src/config/get_cc > get_cc
mv get_cc pari-2.3.3.p1/src/config/get_cc
mv pari-2.3.3.p1.spkg pari-2.3.3.p1.spkg.orig
tar jcf pari-2.3.3.p1.spkg pari-2.3.3.p1

After that, I was able to compile SAGE using its standard build procedure. Admittedly this is a quick hack. A better solution may be to set OPTFLAGS according to the version of GCC used.
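
Choosing the flag by compiler version could look roughly like the sketch below. It is written in Python for illustration only and is not part of SAGE's build system. The function names are hypothetical, and treating only GCC 4.4.1 as affected is an assumption based on the single version that hung here:

```python
import re


def parse_gcc_version(text):
    """Extract (major, minor, patch) from text such as '4.4.1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def choose_optflags(version):
    """Return the optimization flag to use for PARI with this GCC version."""
    # Assumption: only 4.4.1 is known to be affected.
    if version == (4, 4, 1):
        return "-O1"
    return "-O3"


print(choose_optflags(parse_gcc_version("4.4.1")))
```

The version string could come from the output of gcc -dumpversion.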

Update: According to this thread, it is fixed in Ubuntu Karmic.

Posted in hacks.

Grep binary string

A co-worker asked me if it is possible to grep arbitrary binary strings, e.g. sequences of non-printable ASCII characters. It turns out that GNU grep does understand binary strings if we use Perl-regex via the -P option.

[sh@pc ~]$ grep -slrP '\x05\x00\xc0' /boot

I couldn’t find this when Googling for “grep binary” so I thought I should pen it down here.
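
Where grep -P is not available, the same kind of search can be sketched in a few lines of Python 3. The function name files_containing is hypothetical, and unlike grep this version reads each file fully into memory, so it is only a rough equivalent:

```python
import os


def files_containing(root, needle):
    """Yield paths under root whose contents include the byte string needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    if needle in handle.read():
                        yield path
            except OSError:
                # Unreadable file: skip it silently, similar to grep -s.
                continue
```

Calling files_containing("/boot", b"\x05\x00\xc0") and printing each result should list roughly the same files as the grep command above.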


Setting up a local YUM repository of installed RPMs

I was looking for a way to set up a local YUM repository of RPMs that are installed on my Fedora 8 system. Searching the web, I came across a HOWTO that uses rsync to mirror a YUM repository. While I liked the idea, it was prohibitively expensive to download the entire Everything folder, which exceeded 9GB. I only wanted the RPMs that are currently installed on my system.

Another way I found that may work is to use the livecd-creator script in livecd-tools. The livecd-creator script can download the packages specified in a kickstart file, but it also tries to create a LiveCD at the same time, which is more than I need. I could hack the script, but it would take some time to understand the yum Python module.

In the end, I settled for a quick-and-dirty shell script that did the job. It first creates a list of all installed RPMs and then it downloads them from the specified YUM repository mirror. Obsolete RPMs are removed and we can also check the signature of the downloaded RPMs. The script is posted below for all to use. It can also be downloaded here: mirrorf8_v1.

Usage is self-explanatory. Just run the script and it will explain to you how it should be used. Note that it requires you to have rsync and createrepo pre-installed.



#!/bin/bash

# REL and UPD must hold the rsync URLs of the "releases" and "updates"
# repositories on a mirror of your choice. Set them before running.

# List the installed RPMs (list_a.txt) that exist in repository $2.
rsync_list() {
  echo "Contacting rsync server ($1) --> $3"
  mkdir -p $1
  rsync --dry-run -v --files-from=list_a.txt $2 . 2>/dev/null | grep rpm > $3
}

# Download the files listed in $1 from repository $2 into directory $3.
rsync_get() {
  rsync -vzP --delete --files-from=$1 $2 $3
}

# Keep only the files in directory $1 that are named in list $2.
remove_obsolete() {
  echo "Removing obsolete files ($1)"
  [ -e $1 ] && mv $1 $1.tmp
  mkdir -p $1
  for j in `cat $2`
  do
    [ -e $1.tmp/$j ] && mv $1.tmp/$j $1
  done
  rm -rf $1.tmp
}

cmd_list() {
  echo "Generating list of installed rpms --> list_a.txt"
  rpm -qa --qf "%{N}-%{V}-%{R}.%{ARCH}.rpm\n" | sort > list_a.txt
  rsync_list releases $REL/ list_r.txt
  rsync_list updates  $UPD/ list_u.txt
  wc list_*.txt
}

cmd_get() {
  remove_obsolete releases list_r.txt
  remove_obsolete updates list_u.txt
  rsync_get list_r.txt $REL/ releases/
  rsync_get list_u.txt $UPD/ updates/
}

cmd_checksig() {
  echo "Checking signatures (releases)"
  for i in releases/*.rpm; do rpm -K $i; done > sigs
  cut -d ":" -f 2- sigs | sort | uniq
  echo "Checking signatures (updates)"
  for i in updates/*.rpm; do rpm -K $i; done > sigs
  cut -d ":" -f 2- sigs | sort | uniq
  rm sigs
}

cmd_createrepo() {
  createrepo releases
  createrepo updates
}

cmd_clean() {
  echo "Cleaning up"
  rm -f list_[aru].txt
}

usage() {
  cat <<-MSG
	  `basename $0` [ list | get | checksig | createrepo | clean ]

	  The commands below should be executed in order.
	  list       : create lists of currently installed RPMs
	  get        : fetches RPMs from repository using rsync
	  checksig   : check signatures of downloaded RPMs
	  createrepo : executes createrepo command on downloaded RPMs
	  clean      : remove lists of currently installed RPMs
	MSG
  exit 1
}

case "$1" in
  list)         cmd_list ;;
  get)          cmd_get ;;
  checksig)     cmd_checksig ;;
  createrepo)   cmd_createrepo ;;
  clean)        cmd_clean ;;
  *)            usage ;;
esac
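
The bookkeeping that the list and get commands perform can be summarized as set arithmetic. The Python 3 sketch below is only an illustration of that logic, not a replacement for the script. The function name plan_mirror and the sample file names are hypothetical:

```python
def plan_mirror(installed, available, local):
    """Return (to_fetch, obsolete) as sorted lists of RPM file names.

    installed: file names built from the rpm -qa output
    available: file names the repository actually offers
    local:     file names already present in the local directory
    """
    wanted = set(installed) & set(available)
    to_fetch = sorted(wanted - set(local))
    obsolete = sorted(set(local) - wanted)
    return to_fetch, obsolete


installed = ["bash-3.2-18.i386.rpm", "zlib-1.2.3-14.i386.rpm"]
available = ["bash-3.2-18.i386.rpm", "zlib-1.2.3-14.i386.rpm"]
local = ["bash-3.2-18.i386.rpm", "zlib-1.2.3-13.i386.rpm"]

to_fetch, obsolete = plan_mirror(installed, available, local)
print(to_fetch)
print(obsolete)
```

In the shell script, rsync does the fetching and skips files that are already up to date, while remove_obsolete handles the second list.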

A progress bar for Bash scripts

When copying large files in Bash shell scripts, it would be nice to have a progress bar displayed. Unfortunately the cp command does not have a progress bar option.

The following script shows how we can implement a copy function in Bash that displays a progress bar with an ETA (estimated time of arrival). It simulates the cp command by connecting cat, dd and cat in a pipeline. The dd command sits inside a while loop so that it copies in chunks of 1MB, and the progress bar is redrawn after each chunk. Note that the progress bar is printed on stderr, since stdin and stdout are used to transfer the bytes.

copy() { # src dst [width]
    srcsize=$(stat -c %s "$1") || return $?
    width=${3:-30}  # bar width in characters; the default of 30 is an assumption
    mega=$(( 1024 * 1024 ))
    start=$(date +%s)
    cat "$1" | (
    dstsize=0
    while [[ dstsize -lt srcsize ]]
    do
        # copy one 1MB chunk; iflag=fullblock (GNU dd) guards against short reads
        dd bs=512 count=2048 iflag=fullblock 2>/dev/null || return $?
        (( dstsize += $mega ))
        [[ dstsize -gt srcsize ]] && dstsize=$srcsize

        # print truncated filename
        name=$(basename "$1" | cut -b -20)
        printf "\r%-20s " "$name" 1>&2

        # print percentage
        percent=$(( 100 * $dstsize / $srcsize ))
        printf "%3d%% [" $percent 1>&2

        # print progress bar
        bar=$(( $width * $dstsize / $srcsize ))
        for i in $(seq 1 $bar); do printf "=" 1>&2; done
        for i in $(seq $bar $(( $width-1 ))); do printf " " 1>&2; done

        # print size of file copied
        if [[ $dstsize -le 1024 ]]; then
            printf -v size "%d" $dstsize;
        elif [[ $dstsize -le $mega ]]; then
            printf -v size "%d kB" $(( $dstsize / 1024  ));
        else
            printf -v size "%d MB" $(( $dstsize / $mega ));
        fi
        printf "] %7s" "$size" 1>&2

        # print estimated time of arrival
        elapsed=$(( $(date +%s) - $start ))
        remain=$(( $srcsize - $dstsize ))
        eta=$(( ($elapsed * $remain) / $dstsize + 1))
        if [[ $remain == 0 ]]; then eta=0; fi
        etamin=$(( $eta / 60 ))
        etasec=$(( $eta % 60 ))
        if [[ $eta -gt 0 ]]; then etastr="ETA"; else etastr="   "; fi
        printf "   %02d:%02d $etastr" $etamin $etasec 1>&2
    done
    echo 1>&2
    ) | cat >"$2"
}

The inspiration for this code came from studying bar, an existing progress bar script. The major differences of the above code from bar are in the progress bar format and the use of the extra cat command at the beginning. The above code is also less generic but that’s fine with me since it is simpler and shorter.

To use the above code, simply cut and paste it into the top of your shell script. Then wherever you have cp srcfile dstfile replace it with copy srcfile dstfile to see the progress bar. Note that the copy function is not a drop-in replacement for cp since it handles neither options nor wildcards.
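
For comparison, the same chunked-copy idea can be sketched in Python 3. This is not a port of the Bash function but a rough equivalent. The names format_progress and CHUNK are hypothetical, and the default bar width of 30 is an assumption:

```python
import os
import sys
import time

CHUNK = 1024 * 1024  # copy in 1MB chunks, like the dd call in the Bash version


def format_progress(name, done, total, elapsed, width=30):
    """Build one progress line: name, percentage, bar, size copied, ETA."""
    percent = 100 * done // total
    filled = width * done // total
    bar = "=" * filled + " " * (width - filled)
    if done <= 1024:
        size = "%d" % done
    elif done <= CHUNK:
        size = "%d kB" % (done // 1024)
    else:
        size = "%d MB" % (done // CHUNK)
    remain = total - done
    # Same estimate as the Bash version: scale elapsed time by what is left.
    eta = 0 if remain == 0 else int(elapsed * remain / done) + 1
    label = "ETA" if eta > 0 else "   "
    return "\r%-20s %3d%% [%s] %7s   %02d:%02d %s" % (
        name[:20], percent, bar, size, eta // 60, eta % 60, label)


def copy(src, dst, width=30):
    """Copy src to dst in chunks, drawing a progress bar on stderr."""
    total = os.path.getsize(src)
    done = 0
    start = time.time()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(CHUNK)
            if not chunk:
                break
            fout.write(chunk)
            done += len(chunk)
            sys.stderr.write(format_progress(
                os.path.basename(src), done, total,
                time.time() - start, width))
    sys.stderr.write("\n")
```

Keeping the formatting in its own function makes the bar easy to check without copying a real file.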
