2008-09-18

YouTube archiver

Here's a fun experiment in combining various tools so that my MythTV box downloads YouTube videos with one click in the browser on another computer.

Proceeding from client to server, the first component is a Firefox bookmarklet for sending the URL of the current page to the MythTV box:

javascript:(function(){window.open('http://mythtvbox.mydomain.com:12345/'+window.location);})()

On the MythTV box I have a Python script listening for incoming HTTP requests on port 12345. The script uses twisted.web, so apt-get install python-twisted-web (on Debian-based distros) is needed.

#!/usr/bin/python
# Written for Python 2; under Python 3, Twisted passes request.args as bytes
# and expects bytes back from render_GET.

from twisted.web import server, resource
from twisted.internet import reactor

class DownloadRequestHandler(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        # The bookmarklet appends the full YouTube URL, so ?v=ID arrives
        # as an ordinary query argument.
        try:
            video_id = request.args['v'][0]
        except (KeyError, IndexError):
            msg = "Can't parse URL %r" % request.path
            return msg
        # Queue the download by creating an empty file named after the video ID.
        open('/home/akaihola/.download-queue/youtube.com/%s' % video_id, 'w').close()
        return 'Downloading YouTube %s' % video_id

site = server.Site(DownloadRequestHandler())
reactor.listenTCP(12345, site)
reactor.run()

At this point, whenever I select the "Download YouTube" bookmarklet while viewing a YouTube video, a file named after the ID of the video magically appears on the MythTV box.
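
To see why the handler can read the video ID from request.args, it helps to look at the URL the bookmarklet actually opens. The sketch below uses only the Python 3 standard library. The video ID abc123 is made up, and the parsing only approximates what twisted.web does with the request; it is not Twisted's own code.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical page being viewed; "abc123" is a made-up video ID.
page_url = 'http://www.youtube.com/watch?v=abc123'

# The bookmarklet simply concatenates the server address and the page URL.
request_url = 'http://mythtvbox.mydomain.com:12345/' + page_url

parts = urlsplit(request_url)
args = parse_qs(parts.query)

print(parts.path)      # the YouTube URL up to the "?" ends up in the path
print(args['v'][0])    # the video ID survives as the "v" query argument
```

Because the page URL is appended unescaped, its query string becomes the query string of the request to the MythTV box, which is what the handler relies on.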

For the actual downloading I use the youtube-dl script. The following bash script downloads content as long as files corresponding to video IDs exist, deletes each file after its download, and then waits for new files to appear. Waiting for new files is implemented efficiently using the inotifywait tool (apt-get install inotify-tools on Debian-based distros). During the daytime downloads are allowed only a small slice of my bandwidth; at night they can freely saturate the connection.

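#!/bin/bash
QUEUEDIR=/home/akaihola/.download-queue/youtube.com
TEMPLATE="/var/lib/mythtv/videos/%(stitle)s-%(id)s.%(ext)s"
cd "$QUEUEDIR" || exit 1
while true; do
    # Work through the queue one video ID at a time until it is empty.
    while [ "`ls`" ]; do
        VIDEO=`ls -1 | head -1`
        # Limit the download rate when the hour is 07-23; no limit at night.
        RATE=$( [ `date +%H` -gt 6 ] && echo "-r 30k" )
        # $RATE is left unquoted on purpose so it expands to two words or none.
        ~/bin/youtube-dl -b $RATE -o "$TEMPLATE" "http://youtube.com/watch?v=$VIDEO"
        rm -- "$VIDEO"
    done
    # Block until a new file is created in the queue directory.
    inotifywait -e create "$QUEUEDIR"
done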

Both scripts are running as services in the background on the MythTV server ready to queue and download videos. The videos appear in the "Watch Videos" menu of the MythTV box ready for enjoyment.

Voilà!* It works! Improvements appreciated.

*or "Viola!" as I've seen many native English speakers happily saluting their favorite string instrument after achieving a desired goal

Update: Insufficient bash experience. The first version wouldn't work for more than one file in the queue. Fixed now, but [ "`ls`" ] feels like a slightly awkward way to test whether a directory is non-empty.
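
For comparison, here is a minimal Python sketch of the two decisions the bash loop makes: whether anything is queued, and whether to throttle. The function names are made up for this illustration, and the demo runs against a temporary directory rather than the real queue. It is not a replacement for the script above, just another way to express the same checks.

```python
import os
import tempfile

def next_queued_id(queue_dir):
    """Return the name of one queued file, or None if the queue is empty."""
    names = sorted(os.listdir(queue_dir))
    return names[0] if names else None

def rate_limit_args(hour):
    """Mirror the script's rule: throttle when the hour is greater than 6."""
    return ['-r', '30k'] if hour > 6 else []

if __name__ == '__main__':
    # Demonstrate on a throwaway directory instead of the real queue.
    with tempfile.TemporaryDirectory() as queue_dir:
        print(next_queued_id(queue_dir))          # empty queue
        open(os.path.join(queue_dir, 'abc123'), 'w').close()
        print(next_queued_id(queue_dir))          # one queued ID
    print(rate_limit_args(3), rate_limit_args(14))
```

Testing the list returned by os.listdir for emptiness avoids parsing the output of ls altogether.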

2008-09-15

Reason for Dovecot's memory hunger

One runs into strange things when using rare combinations of software. I've been wondering why the Dovecot IMAP process on our mailserver grows to 50 or 60 MB when I'm syncing my mailbox with OfflineIMAP. It turns out the reason is in how OfflineIMAP identifies messages and how Dovecot's caching reacts to that.

I won't pretend I understand the technical details, but here's the bug report and some discussion. The quick solution is to delete overgrown Dovecot index/cache files for any large folders to be synchronized. Dovecot 1.1's UIDPLUS support may also help, but that version isn't available for Etch even in the backports repository, so I'm not trying it any time soon.
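
To find candidates for deletion, something like the following sketch could list the overgrown cache files. It assumes the cache files carry Dovecot's usual name dovecot.index.cache and live somewhere under a mail root given on the command line; where they actually are depends on the server's mail location setting. The 10 MB threshold is an arbitrary choice for illustration. The script only lists files and deletes nothing.

```python
import os
import sys

def find_large_cache_files(mail_root, min_bytes=10 * 1024 * 1024):
    """Yield (path, size) for dovecot.index.cache files of at least min_bytes."""
    for dirpath, dirnames, filenames in os.walk(mail_root):
        for name in filenames:
            if name == 'dovecot.index.cache':   # assumed default file name
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size >= min_bytes:
                    yield path, size

if __name__ == '__main__':
    # Mail root is hypothetical; pass the real one as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    for path, size in find_large_cache_files(root):
        print('%8.1f MB  %s' % (size / (1024.0 * 1024.0), path))
```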

2007-10-03

Why Scribus fails to import Sibelius music

I finally found an explanation for why I haven't been able to import EPS files exported from Sibelius (version 2) into Scribus. The reason is they use Type 3 fonts. See this message thread for more information and some possible solutions.

2007-09-26

Goodie for your PDF toolbox: automatic cropping tool

OK, so I couldn't find a PDF auto-cropping tool for Unix anywhere, except for a csh shell script which worked fine but which I didn't like too much. So I rolled up my sleeves and created my own in Python. Here it is:

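#!/usr/bin/python

import re, sys
from subprocess import Popen, PIPE

# Matches the bounding box comment written by pdftops (bytes pattern,
# because the pipes carry raw bytes).
bbsub = re.compile(br'%%BoundingBox: [\d ]+\n')

def fixbbox(pdfpath):
    # Convert the PDF to EPS on stdout. Argument lists instead of shell=True
    # keep paths with spaces intact.
    badeps = Popen(['pdftops', '-eps', pdfpath, '-'],
                   stdout=PIPE).communicate()[0]
    # Ghostscript's bbox device prints the real bounding box on stderr.
    gs = Popen(['gs', '-sDEVICE=bbox', '-dNOPAUSE', '-dBATCH', '-'],
               stdin=PIPE, stderr=PIPE)
    bbox = gs.communicate(badeps)[1]
    # Swap in the computed bounding box; a function as the replacement
    # avoids escape processing of the inserted text.
    goodeps = bbsub.sub(lambda match: bbox, badeps)
    # Convert back to PDF, overwriting the original file.
    epstopdf = Popen(['epstopdf', '--filter', '--outfile=%s' % pdfpath],
                     stdin=PIPE)
    epstopdf.communicate(goodeps)


if __name__ == '__main__':
    for pdfpath in sys.argv[1:]:
        fixbbox(pdfpath)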
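
The heart of the script is the substitution step, which can be tried without pdftops or Ghostscript installed. In the sketch below both byte strings are made-up stand-ins: badeps imitates an EPS header with a full-page bounding box, and bbox imitates the two lines Ghostscript's bbox device prints. The pattern is written as bytes here so that the example runs on Python 3.

```python
import re

bbsub = re.compile(br'%%BoundingBox: [\d ]+\n')

# Made-up stand-ins for the pdftops output and Ghostscript's stderr output.
badeps = b'%!PS-Adobe-3.0 EPSF-3.0\n%%BoundingBox: 0 0 612 792\n%%EndComments\n'
bbox = (b'%%BoundingBox: 36 40 300 200\n'
        b'%%HiResBoundingBox: 36.000000 40.000000 300.000000 200.000000\n')

# Replace the page-sized bounding box with the tight one.
goodeps = bbsub.sub(lambda match: bbox, badeps)
print(goodeps.decode('ascii'))
```

Note that the whole Ghostscript output, including the HiResBoundingBox line, takes the place of the single original BoundingBox line.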

2007-04-14

en-dash and em-dash in Emacs

Finally! I found them both! The em-dash can be typed in Emacs (on Ubuntu Edgy) with the key sequence Compose - - - (that's three hyphens), and the en-dash with Compose - - . (two hyphens and a period).

The en- and em-dashes seem to look identical when using Bitstream Vera Sans Mono, which is annoying but understandable in a monospace font, but at least I can now type them.
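
Since the two dashes are hard to tell apart in a monospace font, one way to check which one actually ended up in a file is to look at the code points. A small Python 3 sketch using only the standard library:

```python
import unicodedata

# Hyphen, en-dash and em-dash, written as escapes so they cannot be confused.
for char in ('-', '\u2013', '\u2014'):
    print('U+%04X %s' % (ord(char), unicodedata.name(char)))
```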

And where's the Compose key, you ask. Well, in addition to other peculiar keyboard customizations, I've re-assigned Compose to the context menu key. On my Dell laptop it's above F9, and on most full-size international keyboards it's next to Alt Gr. The xmodmap rule for this trick is keycode 117 = Multi_key.

Now, if someone fixed copy-pasting extended characters from terminal windows to Emacs... And where's the full list of compose key sequences?

2007-04-10

LaTeX index entries inside footnotes fail

Inside LaTeX footnotes, \index entries which contain special characters or commands may appear multiple times in the index. This happens when extra spaces end up in the entry written to the .idx file (for example \emph { instead of \emph{), which does not happen with \index in normal text. makeindex then treats the two spellings as different entries.

I haven't found a good solution nor really understood what's going on. Let this post act as a reminder for myself in case I run into the same problem later and have forgotten details about it. If you know a good solution, please do send a comment!

Here's a simple test case:

% do e.g. the following:
% pdflatex test.tex ; makeindex test.idx ; pdflatex test.tex ; evince test.pdf
% and take a close look at test.idx to see where the problem is

\documentclass{article}
\usepackage{makeidx}
\makeindex
\begin{document}
First index entry is here\index{Index Entry@\emph{Index Entry}}.
\footnote[1]{Second index entry is here\index{Index Entry@\emph{Index Entry}}}
\printindex
\end{document}

And here's what ends up in test.idx:

\indexentry{Index Entry@\emph{Index Entry}}{1}
\indexentry{Index Entry@\emph {Index Entry}}{1}


This is what the index then looks like:

Index Entry, 1
Index Entry, 1

Here's some insight about what is going on. It isn't clear to me what Bernd is suggesting in his "Quick&Dirty" advice.
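
One possible workaround, which is my own assumption and not taken from the discussion mentioned above, is to normalize the .idx file between the first pdflatex run and makeindex. The Python sketch below removes the spaces between a command name and its opening brace, so both lines from the test case become identical. The function name is made up, and the approach would also alter entries where such a space is intentional.

```python
import re

# A control word (backslash plus letters) followed by spaces and "{".
_SPACE_AFTER_COMMAND = re.compile(r'(\\[A-Za-z]+) +\{')

def normalize_idx_line(line):
    """Remove spaces between a command name and its opening brace."""
    return _SPACE_AFTER_COMMAND.sub(r'\1{', line)

# The two entries from test.idx in the test case.
lines = [
    r'\indexentry{Index Entry@\emph{Index Entry}}{1}',
    r'\indexentry{Index Entry@\emph {Index Entry}}{1}',
]
normalized = [normalize_idx_line(line) for line in lines]
print(normalized[0] == normalized[1])
```

With identical entries, makeindex should be able to merge them into a single index line.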

2007-02-28

Django newforms improvement

It's not the first time during my acquaintance with Django that I've needed functionality not available in the forms (and old manipulators) machinery.

Actually the need is quite simple: edit only a subset of a model's fields in a form, and keep the old values for the rest.

For example, I might have a simplified form for changing only the content of a FlatPage, but keeping the title, url and other properties. In this case I'd like to be able to just say

FlatPageForm = form_for_instance(
    myflatpage,
    include_fields=['content'])

and have Django automagically fill in values from the original myflatpage object when saving form data for all fields except content.

There was a good discussion about this on the #django IRC channel, and eventually I implemented this functionality based on newforms. The module is available in our Subversion repository.
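
The intended behaviour can be illustrated without Django at all. The sketch below is plain Python, not the module mentioned above and not Django's API: Page is a hypothetical stand-in for a model instance, and apply_included_fields is a made-up helper that copies only the listed fields from submitted data.

```python
class Page(object):
    """Hypothetical stand-in for a model instance such as a FlatPage."""
    def __init__(self, url, title, content):
        self.url = url
        self.title = title
        self.content = content

def apply_included_fields(instance, submitted, include_fields):
    """Copy only the listed fields from submitted data onto the instance."""
    for name in include_fields:
        if name in submitted:
            setattr(instance, name, submitted[name])
    return instance

page = Page('/about/', 'About us', 'Old text')
# Values for fields outside include_fields are ignored, even if submitted.
apply_included_fields(page, {'content': 'New text', 'title': 'Changed'},
                      include_fields=['content'])
print(page.title, '|', page.content)
```

Everything not named in include_fields keeps the value it had on the original object, which is the behaviour wished for above.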