Saturday, March 3, 2007

multi-threaded "cat-like" command for web pages...

Following is a very useful script that dumps the HTML of one or more web links, given as arguments, to standard output. If more than one link is passed, it spawns a thread for each link (serializing the dumps on stdout with a mutex). The threaded approach reduces the average waiting time for the connection replies and strongly improves performance when we need to download a lot of pages at the same time (e.g. I usually use this script to dump and grep the Linux kernel changelogs directly from the web...).

BTW: python is great! ;-)

#!/usr/bin/env python

import sys, urllib
from threading import Thread, Lock

class webThread(Thread):
    def __init__(self, url):
        self.url = url
        Thread.__init__(self)

    def run(self):
        # download the page, then dump it to stdout while holding the
        # mutex, so output from different threads is never interleaved
        remotefile = urllib.urlopen(self.url)
        data = remotefile.read()
        remotefile.close()

        stdout_mutex.acquire()
        print "=== %s ===" % self.url
        print data
        stdout_mutex.release()

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "usage: %s <url1> <url2> ... <urlN>" % sys.argv[0]
        sys.exit(1)
    else:
        threads = []
        stdout_mutex = Lock()
        sys.stdout.flush()
        # spawn a downloader thread per URL, then wait for all of them
        for url in sys.argv[1:]:
            t = webThread(url)
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
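
For example, assuming the script is saved as webcat.py (the name and the URLs below are just an illustration of how I invoke it), you can fetch a couple of changelogs in parallel and grep them in one shot:

./webcat.py http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.20 http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.19 | grep -i sched

The Lock around the print statements is what keeps the dumps from being interleaved, so grep always sees each page as a contiguous block preceded by its "=== url ===" header.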
