Wednesday, November 10, 2010

Extract the key information from a webpage

I wish to write a script (i.e. a computer program) that can fetch a webpage, parse its HTML code, and remove all advertisements, graphics and supplementary information, leaving only the key content of that page. The script will need to understand the structure of the HTML page and will probably also need some artificial intelligence. I will send the URL of the webpage to the script, and it should return the text of the page's key content.
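As a rough starting point, the idea can be sketched in Python using only the standard library. This is a minimal sketch under assumptions, not a full solution: the set of tags to skip (SKIP_TAGS) and the names TextExtractor, extract_text and fetch_and_extract are illustrative choices, and it applies a fixed rule rather than any artificial intelligence, so advertisements placed inside ordinary tags will still get through.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumption: text inside these tags is treated as non-content.
# The list is illustrative and would need tuning for real pages.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header",
             "footer", "aside", "form", "iframe"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested in SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse whitespace
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_and_extract(url):
    """Download the page at url and return its extracted text."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return extract_text(html)
```

Calling fetch_and_extract with a URL returns the remaining text of that page. Images drop out automatically because only text is collected; deciding which of the remaining text is the "key" content is the harder part that this sketch does not attempt.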

If any expert knows how to write this script, or wishes to give it a try, please send an e-mail to kinlian@gmail.com.

1 comment:

Unknown said...

There is a JavaScript tool that can help you do that:
http://lab.arc90.com/experiments/readability/
