Indexing an MP3 database with Python and Lucene
By Mathieu Lecarme, Saturday 15 March 2008, 14:11
MP3 players use a database to manage thousands of songs. Here is a Python test that indexes the XML dump of common MP3 players (Rhythmbox and iTunes) into a Lucene index, via Goniometre, a Passerelle project.
The first Passerelle demo was a PHP example. This one is written in Python, with XML reading handled by the Genshi toolkit.
Building
OS X with the developer tools installed provides everything needed. All required tools are packaged for Linux. On Windows it is harder, but doable.
Getting the source code
The source code is kept in a Subversion repository.
svn checkout https://anonymous@admin.garambrogne.net/subversion/passerelle/trunk passerelle
cd passerelle
Building tools
The Java code is built with Ant. There is no dependency management; the libraries are included in the source tree.
The Python code uses the classic setuptools, with two external packages installed by easy_install. If psyco is available, it is used (20% faster).
ant
sudo easy_install genshi simplejson
cd src/python
sudo python setup.py install
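The optional psyco speed-up mentioned above is typically wired in with a guarded import. The following is a minimal sketch of that pattern, not the actual code from the project; the `HAVE_PSYCO` flag is a name introduced here for illustration. psyco only exists for Python 2, so on a modern interpreter the fallback branch is the one that runs.

```python
# Optional acceleration: use psyco when it is installed, otherwise run unchanged.
try:
    import psyco  # JIT compiler for Python 2; absent on most current installs
    psyco.full()  # ask psyco to compile as much of the program as it can
    HAVE_PSYCO = True  # hypothetical flag, only used for reporting below
except ImportError:
    HAVE_PSYCO = False

print("psyco enabled: %s" % HAVE_PSYCO)
```

The script behaves identically either way; only the speed changes.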
Launching music server
Ant builds and launches the Passerelle server, which listens on port 8042.
cd ../../goniometre/
ant music
Launching indexation
In another terminal window, launch the client. The default XML dump is used: iTunes on OS X and Windows, Rhythmbox on Linux.
cd demo/py/music
./music.py
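Picking a default dump per platform could look roughly like the sketch below. The file locations are assumptions based on where iTunes and Rhythmbox commonly keep their library files; the real music.py may look elsewhere, and `DEFAULT_DUMPS` and `default_dump` are hypothetical names.

```python
import os
import sys

# Assumed default locations, keyed by sys.platform prefix.
# These are common locations, not values taken from music.py.
DEFAULT_DUMPS = {
    "darwin": "~/Music/iTunes/iTunes Music Library.xml",
    "win32": "~/Music/iTunes/iTunes Music Library.xml",
    "linux": "~/.local/share/rhythmbox/rhythmdb.xml",
}


def default_dump(platform=sys.platform):
    """Return the assumed default XML dump path for the given platform."""
    for prefix, path in DEFAULT_DUMPS.items():
        # startswith() also matches the old "linux2" value of sys.platform
        if platform.startswith(prefix):
            return os.path.expanduser(path)
    raise ValueError("no default dump known for platform %r" % platform)
```

Calling `default_dump()` with no argument resolves the path for the machine the client runs on.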
Testing
This test only writes to the index. Luke can be used to inspect the index in /tmp/goniometre/music/.
On common hardware, about 200 songs per second can be indexed. The Java part takes 30% of the time consumed and the Python part 70% (including 12% for JSON serialization). The cost of reading XML is huge.
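To put those percentages in per-song terms, here is a small worked calculation. It assumes the 12% for JSON serialization is a share of the total time, not of the Python part alone; the text does not say which.

```python
songs_per_second = 200
ms_per_song = 1000.0 / songs_per_second  # total budget per song, in ms

java_ms = 0.30 * ms_per_song    # Java side (server and Lucene write)
python_ms = 0.70 * ms_per_song  # Python side (XML reading, JSON, transport)
json_ms = 0.12 * ms_per_song    # JSON serialization, assumed share of the total
other_python_ms = python_ms - json_ms  # mostly XML reading

# Time needed for a library of 10,000 songs at this rate, in seconds
library_seconds = 10000.0 / songs_per_second

print("per song: %.1f ms (Java %.1f, Python %.1f, of which JSON %.1f)"
      % (ms_per_song, java_ms, python_ms, json_ms))
print("10,000 songs: %.0f s" % library_seconds)
```

Under that assumption, roughly 2.9 ms of the 5 ms spent on each song goes to the Python side outside of JSON, which is consistent with XML reading being the dominant cost.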
Inside
Source: music.py
The XML dump is read with a SAX tool; Song objects are created and sent to Java while the file is still being read. Compass is used in batch mode to optimize the massive write process.
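The streaming approach described above can be sketched as follows. This is not the code from music.py: it uses the standard library's `xml.sax` in place of Genshi, the element names are a simplified assumption about the Rhythmbox dump format, and `Song`, `SongHandler`, `index_dump` and the `send` callback are hypothetical stand-ins (in the real demo, sending would go through the Passerelle client, and the Compass batch mode lives on the Java side).

```python
import io
import json
import xml.sax


class Song(object):
    """Hypothetical value object; the fields are assumed, not taken from music.py."""

    def __init__(self):
        self.title = self.artist = self.album = None

    def to_json(self):
        data = {"title": self.title, "artist": self.artist, "album": self.album}
        return json.dumps(data, sort_keys=True)


class SongHandler(xml.sax.ContentHandler):
    """Builds one Song per <entry type="song"> and emits it as soon as it closes."""

    FIELDS = ("title", "artist", "album")

    def __init__(self, emit):
        xml.sax.ContentHandler.__init__(self)
        self.emit = emit
        self.song = None  # the song being built, or None outside a song entry
        self.text = []    # character data seen since the last start tag

    def startElement(self, name, attrs):
        if name == "entry" and attrs.get("type") == "song":
            self.song = Song()
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        if self.song is None:
            return  # ignore everything outside song entries
        if name in self.FIELDS:
            setattr(self.song, name, "".join(self.text).strip())
        elif name == "entry":
            self.emit(self.song)  # hand the song over without waiting for EOF
            self.song = None


def index_dump(source, send):
    """Parse `source` (file name or stream) and pass each song to `send` as JSON."""
    handler = SongHandler(lambda song: send(song.to_json()))
    xml.sax.parse(source, handler)


# Usage with a tiny in-memory dump; `sent.append` stands in for the network call.
SAMPLE = b"""<rhythmdb version="1.4">
  <entry type="song"><title>One</title><artist>A</artist><album>X</album></entry>
  <entry type="iradio"><title>Radio</title></entry>
  <entry type="song"><title>Two</title><artist>B</artist><album>Y</album></entry>
</rhythmdb>"""
sent = []
index_dump(io.BytesIO(SAMPLE), sent.append)
```

Because each song is emitted from `endElement`, memory use stays flat regardless of the size of the dump, which is the main reason to prefer SAX over building a full tree here.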

