Beware the Python generators

Generators and list comprehensions in Python are very closely related. After all, each gives you an iterable. However, I really wish that generators came with a "DANGER: Handle with care!" label. The problem is that while the syntax for creating a generator differs from that of a list by exactly two characters (parentheses instead of square brackets), generators have side effects that are both subtle and easy to overlook. Let's take a look at some code:

items = get_items()

for x in items:
    print x

for x in items:
    print x

Does that code look reasonable to you? It does, until I tell you that get_items() returns a generator. You see, a generator keeps track of its current position internally and cannot be reset. Thus the sequence of items will only be printed once; the second loop silently does nothing. This can be mitigated by convention. Some libraries prefix the function's name with an "i", turning get_items() into iget_items(). The built-in Python 2 function xrange(), a lazy counterpart of range(), is another example of using a naming convention to signal laziness (although an xrange object, unlike a true generator, can be iterated more than once).
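
The exhaustion is easy to reproduce. Here is a minimal sketch, assuming a hypothetical get_items() that yields three numbers (the real function is not shown in this post). It also shows that copying the values into a list once makes repeated iteration safe:

```python
def get_items():
    # Hypothetical stand-in for get_items(): lazily yields three values.
    for x in (1, 2, 3):
        yield x

items = get_items()
first_pass = list(items)    # consumes every value the generator has
second_pass = list(items)   # the generator is exhausted, so this is empty

# Materializing the values once gives an object that can be re-iterated.
items = list(get_items())
first_again = [x for x in items]
second_again = [x for x in items]
```

Here first_pass holds all three values while second_pass is empty; the two passes over the list are identical.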

Let's look at another piece of code:

try:
    numbers = (int(x) for x in line.split(','))
except ValueError:
    numbers = [] # Handle the case where the input is invalid

for num in numbers:
    print num

That looks reasonably good, no? Well, generators are lazy; they have to be. Thus the numbers = ... line defines the generator, but not a single call to int() is made at that point. The calls to int() are made during the iteration, while the for loop is executing, which is outside of the try/except block, so a ValueError raised there is never caught. There are several solutions, ranging from using a list comprehension instead to placing the for loop inside the try/except block.
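
To make the difference concrete, here is a small sketch comparing the lazy and the eager version. The function names parse_lazy and parse_eager are made up for this example:

```python
def parse_lazy(line):
    try:
        # Only builds the generator; int() has not been called yet.
        return (int(x) for x in line.split(','))
    except ValueError:
        return []  # never reached for bad input

def parse_eager(line):
    try:
        # The list comprehension calls int() right here, inside the try.
        return [int(x) for x in line.split(',')]
    except ValueError:
        return []  # invalid input is handled as intended

lazy_error_escaped = False
try:
    list(parse_lazy('1,two,3'))
except ValueError:
    lazy_error_escaped = True  # raised during iteration, outside parse_lazy

eager_result = parse_eager('1,two,3')
```

With the lazy version the ValueError surfaces wherever the generator happens to be consumed; with the eager version it is caught where the author expected.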

Another difference between generators and lists, tuples, sets, and other container types is that generators have no length. Calling len(get_items()) raises a TypeError. This is by design: generators may be infinite, and thus it does not make sense to ask what their length is.
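
A quick sketch of what that looks like in practice, again using a hypothetical get_items(). Counting the items is still possible, but only by consuming the generator, and only if it is finite:

```python
def get_items():
    # Hypothetical generator function standing in for get_items().
    for x in ('a', 'b', 'c'):
        yield x

try:
    len(get_items())
    len_failed = False
except TypeError:
    len_failed = True   # generators do not implement __len__

# Counting requires walking the whole generator, which uses it up.
count = sum(1 for _ in get_items())
```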

I love generators as much as the next guy. However, I think care must be taken when using them. My rules of thumb are:

First, if you are using a generator to optimize for speed: don't. In casual observation they are indeed faster than lists, but lists are so flipping fast already that unless you are processing millions of items, it will make no difference. The exception to this rule is when you are in fact processing millions of items, or you routinely need to create a lot of iterables in your hot loop and your profiler tells you that this is the bottleneck.

Second, if you are optimizing for memory usage, use generators only if you have a significant number of records. A list of 100 ints will make little difference. A list of ten million log entries is going to cost you some RAM.

Third, never return a generator, or any other type of opaque object, from a library method. Generators should mostly be used for intermediate iterables until a final result is obtained. Avoid the confusion of get_items() returning a generator.

Lastly, use generators if you must. They are the most natural way to create infinite iterables, and they do have small speed and large memory advantages over other iterables. If you use them, put in several safeguards: document in multiple places the fact that a generator object is used, create a naming convention for these objects (and the functions that return them), and test, test, test. As I mentioned at the beginning of this post, generators should come with a warning so that they do not surprise unsuspecting maintainers of your code.
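
Two of the points above can be illustrated with a short sketch: the memory difference between a list and a generator, and an infinite generator consumed safely. The naturals() function and the size of 100000 items are arbitrary choices for this example; note that sys.getsizeof() measures only the container itself, not the integers it refers to.

```python
import itertools
import sys

# Memory: the list stores a reference to every element,
# while the generator stores only its own state.
as_list = [n * 2 for n in range(100000)]
as_generator = (n * 2 for n in range(100000))
list_is_bigger = sys.getsizeof(as_list) > sys.getsizeof(as_generator)

def naturals():
    # Infinite generator: yields 0, 1, 2, ... forever.
    n = 0
    while True:
        yield n
        n += 1

# islice() takes a bounded prefix, so the infinite generator
# is never consumed in full.
first_five = list(itertools.islice(naturals(), 5))
```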

Making Debian Packages

I recently spoke at the Triangle Linux Users Group meeting about making Debian packages. This is a topic that's near and dear to my heart since I think that apt is one of the more powerful pieces of software that is underutilized in today's world, where deploying code with no build process seems to be the norm.

The notes to my presentation can be found as a separate page here. After following along with the presentation, you should be able to create basic software packages for Debian/Ubuntu. Hopefully, that skill will come in handy one day.

A clever way to fight IE6

A recent conversation I had with a web consultant turned to the topic of browser support. I asked, "What browsers do you support?" He responded with the list of the usual "modern" ones: IE7+, Chrome, Firefox, Safari. "How about IE6?" I inquired. "Well, it is a hassle. What I usually do is add an extra charge for IE6 compatibility. When the client sees it, they usually drop that requirement. Normally, if you ask them whether it is important to support IE6, they say yes and then it's your headache. But showing them the price tag seems to work well."

This makes sense to me: making the cost of IE6 explicit is something that can hasten the departure from supporting this legacy browser. People tend to be very careful with their money when they feel they are not getting the greatest return on it, and I have a feeling that a lot of clients would not want to spend extra on making sure their site works in the half-dead zombie that is IE6.

Proper way to send e-mail from PHP

Depending on your setup, PHP might not be sending properly encoded e-mails if you just use the mail() function. Specifically, headers might not be properly encoded, and this includes Subject, To, Reply-To, and others. Just give it a try using some non-ASCII characters and see if it works. If it doesn't, here's a better way:

// At the beginning of each page load, set internal encoding to UTF-8
mb_internal_encoding('UTF-8');

// ... rest of initialization code

// Headers are an associative array, unlike the original mail() function
function better_mail($email, $subject, $body, array $headers = array(), $additional_parameters = NULL) {
    // Make sure we set Content-Type and charset
    if ( !isset( $headers['Content-Type'] ) ) {
        $headers['Content-Type'] = 'text/plain; charset=utf-8';
    }

    $headers_str = '';
    foreach( $headers as $key => $val ) {
        $headers_str .= sprintf( "%s: %s\r\n", $key, $val );
    }

    // Use mb_send_mail() function instead of mail() so that headers, including subject are properly encoded
    return mb_send_mail( $email, $subject, $body, $headers_str, $additional_parameters );
}

better_mail( 'example@example.com', 'Résumé with non-ASCII characters', 'Résumé content.', array( 'From' => 'noreply@example.com' ) );
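
To see what "properly encoded" means for a header, here is a small illustration written in Python 3 rather than PHP (a choice made for this sketch only; it is not part of the PHP solution above). It uses the standard library's email.header module to turn a non-ASCII subject into RFC 2047 encoded words, the ASCII-only form in which a mail header carries such text, and then decodes it back:

```python
from email.header import Header, decode_header

subject = u'R\u00e9sum\u00e9 with non-ASCII characters'

# Encode the subject as RFC 2047 "encoded words", which are plain ASCII.
encoded = Header(subject, 'utf-8').encode()
is_ascii = all(ord(ch) < 128 for ch in encoded)

# Decoding the header value recovers the original text.
decoded = u''.join(
    part.decode(charset or 'ascii') if isinstance(part, bytes) else part
    for part, charset in decode_header(encoded)
)
```

A subject sent without this kind of encoding may show up garbled in the recipient's mail client.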


Introducing LovelyCo.de

LovelyCo.de is a social website for people to share code samples they find particularly elegant. My goal in creating it was to make the experience as smooth as possible. If you have a piece of code, in any language, that you think is worth sharing, head on over!

How to set up nginx with PHP on Ubuntu

In an environment where RAM is the major constraint, Apache might not be your best bet. Since I have moved all of my web projects over to an unmanaged VPS, I have been looking for ways to either optimize the LAMP stack or replace it with something less resource-hungry. One of the ways I found to decrease the RAM requirement of my web services was to replace Apache with an event-driven web server called nginx. There are quite a few resources on setting it up, but not many discuss how to marry nginx to PHP (to run a WordPress blog, for example). The best guide I found so far is this one. I built upon it to create the simplest nginx+FastCGI/PHP setup possible.

Step 1: Installation

$ sudo apt-get install nginx php5-cgi

Many of the nginx/PHP guides out there will tell you that you need to install spawn-fcgi, and most of them will have you compiling it from source. It turns out that the php5-cgi package contains a FastCGI wrapper already, so the above two packages are all you need to get going.

Step 2: Startup script for FastCGI

We want to create a startup script that launches the FastCGI PHP processes on every boot. Here is the script I use:

#!/bin/bash
BIND_DIR=/var/run/php-fastcgi
BIND="$BIND_DIR/php.sock"
USER=www-data
PHP_FCGI_CHILDREN=8
PHP_FCGI_MAX_REQUESTS=1000

PHP_CGI=/usr/bin/php-cgi
PHP_CGI_NAME=`basename $PHP_CGI`
PHP_CGI_ARGS="- USER=$USER PATH=/usr/bin PHP_FCGI_CHILDREN=$PHP_FCGI_CHILDREN PHP_FCGI_MAX_REQUESTS=$PHP_FCGI_MAX_REQUESTS $PHP_CGI -b $BIND"
RETVAL=0

start() {
    echo -n "Starting PHP FastCGI: "
    mkdir -p $BIND_DIR
    chown -R $USER $BIND_DIR
    start-stop-daemon --quiet --start --background --chuid "$USER" --exec /usr/bin/env -- $PHP_CGI_ARGS
    RETVAL=$?
    echo "$PHP_CGI_NAME."
}
stop() {
    echo -n "Stopping PHP FastCGI: "
    killall -q -w -u $USER $PHP_CGI
    RETVAL=$?
    rm -rf $BIND_DIR
    echo "$PHP_CGI_NAME."
}

case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        stop
        start
        ;;
    *)
        echo "Usage: php-fastcgi {start|stop|restart}"
        exit 1
        ;;
esac
exit $RETVAL

Put the text above into /etc/init.d/fastcgi-php. Then run:

$ sudo chmod 755 /etc/init.d/fastcgi-php
$ sudo update-rc.d fastcgi-php defaults
$ sudo /etc/init.d/fastcgi-php start
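
Once the service is started, one way to confirm that php-cgi is actually listening is to try connecting to its socket. Here is a minimal sketch in Python 3, assuming the socket path from the BIND variable in the script above; the helper name is made up for this example, and it needs to run as a user that is allowed to access the socket:

```python
import socket

def unix_socket_accepts_connections(path, timeout=1.0):
    # Returns True if something is listening on the UNIX socket at `path`.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Missing socket file, permission denied, nothing listening, or timeout.
        return False
    finally:
        sock.close()

php_is_up = unix_socket_accepts_connections('/var/run/php-fastcgi/php.sock')
```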

Note the variables at the top of the script and adjust them to fit your available RAM. The php-cgi processes will get bigger after a while, so allot at least 16-20MB for each. This script is slightly different from most I've found in that it uses a UNIX socket instead of a TCP one for communication. UNIX sockets are faster, and you never have to worry about your firewall setup.
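
The sizing is simple arithmetic, sketched here in Python. The 16 and 20 MB figures are the per-process estimates from the paragraph above; the 256 MB budget is a hypothetical number picked only for illustration:

```python
def pool_memory_mb(children, mb_per_child):
    # Rough upper bound: every child grows to the per-process estimate.
    return children * mb_per_child

def max_children(ram_budget_mb, mb_per_child):
    # Largest PHP_FCGI_CHILDREN value that fits in the RAM set aside for PHP.
    return ram_budget_mb // mb_per_child

low = pool_memory_mb(8, 16)     # 8 children at 16 MB each
high = pool_memory_mb(8, 20)    # 8 children at 20 MB each
fits = max_children(256, 20)    # hypothetical 256 MB budget for PHP
```

With the script's default of 8 children, that works out to roughly 128 to 160 MB for PHP alone.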

Step 3: Enable PHP processing

Create a new file /etc/nginx/fastcgi_php with this content:

# pass the PHP scripts to FastCGI server listening on UNIX socket
location ~ \.php$ {
    fastcgi_pass   unix:/var/run/php-fastcgi/php.sock;
    fastcgi_index  index.php;
    fastcgi_param  SCRIPT_FILENAME  $document_root$fastcgi_script_name;
    include /etc/nginx/fastcgi_params;
}

Now define your virtual servers. Here is a sample config, saved as /etc/nginx/sites-available/example.com:

server {
    listen   80;
    server_name  example.com www.example.com;

    access_log  /var/log/nginx/example.com.access.log;

    root   /var/www/example.com;
    index  index.php index.html index.htm;
    autoindex off;

    error_page  404  /404.html;
    error_page   500 502 503 504  /50x.html;

    # deny access to .htaccess files, if Apache's document root
    # concurs with nginx's one
    location ~ /\.ht {
        deny  all;
    }
    # Enable PHP
    include /etc/nginx/fastcgi_php;
}

Notice the next to last line of the file. You can include this line in all of your server definitions to enable PHP processing through FastCGI.

Step 4: Enable the site

Enable the site and reload the configs:

$ sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled/example.com
$ sudo /etc/init.d/nginx reload

Optional tweaking

There are several other things you might want to consider once this setup is in place. First of all, nginx does not process .htaccess files since those are Apache-specific. For the most part this means that mod_rewrite rules you've had in place won't work. The good news is that nginx has its own URL rewriting engine, which is quite capable. The bad news is that you will have to translate your mod_rewrite rules to the syntax used by nginx.

WordPress compatibility

A "quick fix" exists for WordPress:

server {
    # Your server definition
    # ...

    location / {
        # this serves static files that exist without running other rewrite tests
        if (-f $request_filename) {
            expires 30d;
            break;
        }

        # this sends all non-existing file or directory requests to index.php
        if (!-e $request_filename) {
            rewrite ^(.+)$ /index.php?q=$1 last;
        }
    }

    # End of server definition
}
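
To see what the rewrite rule in the second if block does to a request, here is a rough illustration using Python's re module as a stand-in for nginx's regex engine (nginx uses PCRE and does more than a plain substitution, so treat this as an approximation). The function name and the sample URI are made up for this example:

```python
import re

def wordpress_rewrite(uri):
    # Mirrors: rewrite ^(.+)$ /index.php?q=$1 last;
    return re.sub(r'^(.+)$', r'/index.php?q=\1', uri)

rewritten = wordpress_rewrite('/2010/05/hello-world/')
```

The original path ends up in the q parameter, and index.php takes over from there.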

Note that if you are using the above rules and WP Super Cache, the cache will only be used in "half-on" mode, so you might want to at least consider translating the rewrite rules used by it to take full advantage of the caching.

Security considerations

Security of this setup can be increased if you define separate pools of PHP processes for each virtual server or even application, running as separate users. You can do so by running separate startup scripts. Just define each pool to use its own UNIX socket and, instead of including the /etc/nginx/fastcgi_php file, write its contents with the customized socket name in each virtual server config.

Disc ID DB now supports movies

Thanks to the Movie DB, my pet project Disc ID DB now supports movies as well as TV shows. I have made several other improvements to the code which should help performance and reliability of the service:

  • Now using prepared statements for database queries
  • Returning more detailed errors
  • Database backups available to the public
  • Stats page available to the public

In addition, I have been brewing a client for the database which will allow anyone and everyone to back up their DVDs with all the metadata attached. More to come soon.

This does not scale well

Here is a piece of code I just found:

# new location
# creates new record or seeks empty record marked by group_id = -2
if($CGI_DATA['new'] == 'New Location') {
  # look for empty - abandoned record
  $SQL  = "SELECT * FROM location WHERE group_id = -2";
  $result = mysql_query($SQL);
  if($row = mysql_fetch_array($result)) {
    $CGI_DATA['location_id'] = $row['location_id'];
  } else {
    $SQL  = "INSERT INTO location SET group_id = '-2',location_bname='New Location',location_roomnum='',
            location_capacity=0,location_address='',location_order=0,location_status='active'";
    $result = mysql_query($SQL);
    $CGI_DATA['location_id'] = mysql_insert_id();
  }
  $CGI_DATA['edit']='New';
}

A little background: this application has a number of groups, each of which can specify several locations. What the above code does is insert a location with an "invalid" group_id (-2), and then let you "edit" that record on the next page load. If two people start creating a location at the same time, both are handed the same placeholder record. I don't know how they thought this was acceptable.
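
One concrete problem with this pattern shows up as soon as the form is used by more than one person. The sketch below replays the same select-or-insert logic in Python against an in-memory SQLite table. SQLite is only a stand-in for the MySQL database, and the reduced schema and function name are assumptions made for this example:

```python
import sqlite3

def new_location(conn):
    # Same logic as the PHP above: reuse a placeholder row if one exists,
    # otherwise insert one.
    row = conn.execute(
        "SELECT location_id FROM location WHERE group_id = -2"
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO location (group_id, location_bname) "
        "VALUES (-2, 'New Location')"
    )
    return cursor.lastrowid

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE location ("
    "location_id INTEGER PRIMARY KEY, group_id INTEGER, location_bname TEXT)"
)

# Two users click "New Location" before either one saves the form.
user_a_id = new_location(conn)
user_b_id = new_location(conn)
same_record = (user_a_id == user_b_id)
```

Both calls return the same location_id, so whoever saves last overwrites the other user's location.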
