Thursday, August 16, 2007

Extract Text From HTML

Handy! I should try this for searching.

“In the course of improving this website's search engine, I wrote a routine that would extract the text from an article given a URL, strip out the HTML, and then convert all of the white space and carriage returns into single spaces. This was done to compress the size of the text involved, which was then stored in the database and used for full-text searches.”
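The quoted article's own code isn't reproduced here, so what follows is only a rough sketch of the three steps it describes (fetch the page, strip the HTML, collapse whitespace), using nothing but Python's standard library. The names `html_to_text` and `extract_text_from_url` are made up for illustration, and the UTF-8 default and the decision to skip `<script>` and `<style>` contents are my assumptions, not something the quote specifies.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents (an assumption)."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join with spaces so words in adjacent elements do not run together.
        return " ".join(self._chunks)


def html_to_text(html):
    """Strip the markup, then collapse every run of whitespace to one space."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", parser.text()).strip()


def extract_text_from_url(url, encoding="utf-8"):
    """Fetch a page and return its text; the encoding default is a guess."""
    with urlopen(url) as response:
        html = response.read().decode(encoding, errors="replace")
    return html_to_text(html)
```

Joining the text chunks with a space keeps neighbouring paragraphs from fusing into one word, at the cost of splitting a word that has inline markup in the middle of it. For a full-text search index that seemed like the safer trade, but it is a judgment call.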