Websites often put their content in binary files
in an attempt to hide it from unwanted spiders.
> PDFs, Excel spreadsheets, and even Shockwave files can all be parsed and
their content extracted. For PDFs, use pdftohtml or pstotext. For Excel
spreadsheets, use xls2xml or a Java Excel library. Shockwave files are a
little trickier: you need to launch a Windows application called swfcatcher
(a Flash decompiler) and automate the parsing of the Shockwave file to get
the content.
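
As a rough sketch of the PDF case, a spider could hand downloaded files to one of the command-line converters mentioned above. The `CONVERTERS` table and the helper names `build_command` and `extract` are made up for illustration, and the `-stdout`, `-noframes` and `-i` flags assume the Poppler build of pdftohtml; check them against whatever version you have installed.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to a converter command line.
# The flags assume Poppler's pdftohtml: write to stdout, no frames, skip images.
CONVERTERS = {
    ".pdf": ["pdftohtml", "-stdout", "-noframes", "-i"],
    ".ps": ["pstotext"],
}


def build_command(path):
    """Return the converter command for a file, or None if the type is unsupported."""
    args = CONVERTERS.get(Path(path).suffix.lower())
    return None if args is None else args + [str(path)]


def extract(path):
    """Run the converter and return its output, or None if it cannot be run."""
    cmd = build_command(path)
    if cmd is None or shutil.which(cmd[0]) is None:
        return None  # unsupported file type, or the tool is not installed
    result = subprocess.run(
        cmd, capture_output=True, text=True, errors="replace", check=True
    )
    return result.stdout
```

Calling `extract("report.pdf")` would return the converted HTML as a string when pdftohtml is on the PATH, and `None` otherwise. Excel and Shockwave files would need their own entries or a separate code path, since swfcatcher is a Windows GUI application rather than a command-line filter.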