What is a Web Spider?


Websites often put their content in binary files in an attempt to hide it from unwanted spiders.

> PDFs, Excel spreadsheets, and even Shockwave files can all be parsed and their content extracted.  For PDFs, use pdftohtml or pstotext.  For Excel spreadsheets, use xls2xml or Java Excel libraries.  Shockwave files are a little trickier: you need to launch a Windows application called swfcatcher (a Flash decompiler) and automate the parsing of the Shockwave file to get the content.
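
As a rough illustration of the approach in the answer above, a spider could choose a converter by file extension and hand the downloaded file to that tool. The sketch below is only one way to arrange this. The tool names come from the text, but the extension mapping (including sending `.ps` files to pstotext), the function names, and the assumption that each tool accepts a bare file path and writes text to standard output are all hypothetical and should be checked against the tools actually installed. Shockwave files are left out because they need the Windows application mentioned above.

```python
import os
import shutil
import subprocess

# Tool names are taken from the text; the extension mapping and the way
# each tool is invoked are assumptions made for this sketch.
CONVERTERS = {
    ".pdf": "pdftohtml",
    ".ps": "pstotext",
    ".xls": "xls2xml",
}


def pick_converter(path):
    """Return the converter name for a file, or None if it is not handled."""
    ext = os.path.splitext(path)[1].lower()
    return CONVERTERS.get(ext)


def build_command(path):
    """Build the command line for a file, or None if no converter applies."""
    tool = pick_converter(path)
    if tool is None:
        return None
    # Assumption: the tool takes the input path as its only argument.
    return [tool, path]


def convert(path):
    """Run the converter and return its standard output, or None if unavailable."""
    cmd = build_command(path)
    if cmd is None or shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

A spider would call `convert()` on each binary file it downloads and pass any returned text to its normal parsing step. Files with no matching converter simply return `None`.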
