Saturday, March 3, 2007

multi-threaded "cat-like" command for web pages...

Following is a very useful script that dumps the HTML of one or more web links, given as arguments, to standard output. If more than one link is passed, it spawns a thread for each link (serializing the dumps on stdout with a mutex). The threaded approach reduces the average waiting time for the connection replies and strongly improves performance when we need to download a lot of pages at the same time (e.g. I usually use this script to dump and grep the Linux kernel changelogs directly from the web...).

BTW: python is great! ;-)

#!/usr/bin/env python

import sys, urllib
from threading import Thread, Lock

class webThread(Thread):
    def __init__(self, url):
        self.url = url
        Thread.__init__(self)

    def run(self):
        # download the page, then dump it to stdout while holding the
        # mutex, so output from different threads is never interleaved
        remotefile = urllib.urlopen(self.url)
        data = remotefile.read()
        remotefile.close()

        stdout_mutex.acquire()
        print "=== %s ===" % self.url
        print data
        stdout_mutex.release()

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "usage: %s <url1> <url2> ... <urlN>" % sys.argv[0]
        sys.exit(1)
    else:
        threads = []
        stdout_mutex = Lock()
        sys.stdout.flush()
        # spawn a downloader thread per URL, then wait for all of them
        for url in sys.argv[1:]:
            t = webThread(url)
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
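
For example, assuming the script is saved as webcat.py (the name and the URLs below are just an illustration of how I invoke it), you can fetch a couple of changelogs in parallel and grep them in one shot:

./webcat.py http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.20 http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.19 | grep -i sched

The Lock around the print statements is what keeps the dumps from being interleaved, so grep always sees each page as a contiguous block preceded by its "=== url ===" header.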
