/code/ - Coding/Scripting

Python (2.6, 2.7, or 3.0 preferred) help, web content text extraction Anonymous 11/03/03(Thu)08:14 No. 27

sup /c0de/,

I really like Python and am moving into doing fun stuff with the standard library. Could you give me some tips on "scraping" content off of sites, and what tools or modules (internal to the standard lib OR external resources) I could plug into? No need to reinvent the wheel, right?

Oh, in b4 "use urllib", because it just extracts the HTML code and not simply the "content" (e.g. a sentence or paragraph of text sans the <html> code).

deleted 11/03/03(Thu)20:45 No. 28

I don't know much about Python, but HTML is just plain text. If you know the particular tag the content is housed in, you should be able to request the HTML and use regular expressions to get what you want.
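For what it's worth, here's a rough Python 3 sketch of that regex idea, assuming the content sits in a known tag that isn't nested inside itself. The name `extract_tag_text` and the sample HTML are made up for illustration; on a real page you'd first grab the HTML with something like `urllib.request.urlopen(url).read()`. Regexes get brittle on nested or malformed markup, so treat this as a quick hack, not a real parser.

```python
import re

def extract_tag_text(html, tag):
    """Return the text inside every <tag>...</tag> pair (hypothetical helper)."""
    # Non-greedy match between opening and closing tag; DOTALL lets '.' span newlines.
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}>".format(re.escape(tag)),
        re.DOTALL | re.IGNORECASE,
    )
    results = []
    for inner in pattern.findall(html):
        # Drop any tags nested inside the match, then collapse whitespace.
        text = re.sub(r"<[^>]+>", "", inner)
        results.append(" ".join(text.split()))
    return results

# Made-up sample page standing in for downloaded HTML.
sample = "<html><body><p class='x'>Hello <b>world</b></p><p>Second\n para</p></body></html>"
print(extract_tag_text(sample, "p"))  # ['Hello world', 'Second para']
```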

That's assuming you want textual content. Images will either be provided as direct links, or will reference some server-side code that serves the image back to you. That code might check your HTTP Referer header (that's how the HTTP spec spells it) to see where you're coming from. If you're coming from off-site, it might serve up an alternate image. So basically, you need to spoof your Referer header. Fortunately, HTTP is also plain text.
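Setting that header from Python 3 might look roughly like this. It's a sketch: `build_request` is a made-up helper, the example.com URLs are placeholders, and the `User-Agent` value is just an assumption about what a picky server might want to see. Nothing is downloaded here; the request object is only built and inspected.

```python
import urllib.request

def build_request(url, referer):
    """Build a request that claims to come from `referer` (hypothetical helper)."""
    headers = {
        # The header name is spelled "Referer" on the wire.
        "Referer": referer,
        # Assumption: some servers also turn away the default Python user agent.
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(url, headers=headers)

# Placeholder URLs, not real resources.
req = build_request(
    "http://example.com/images/pic.jpg",
    "http://example.com/gallery.html",
)
print(req.get_header("Referer"))
# To actually download: data = urllib.request.urlopen(req).read()
```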

I might be making this more complicated than it needs to be, but it seems like you'll need to know HTTP, HTTP Headers, Cookies (just in case the server needs them), HTML, and RegExps. And you'll need an idea of the specific section or sections that house the content you're after. Basically, you're building a primitive web browser from scratch.
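The cookie part, at least, doesn't have to be built from scratch. A minimal sketch with the Python 3 standard library, assuming all you need is for cookies to be remembered between requests (the helper name `make_browser_like_opener` and the header value are made up for illustration):

```python
import http.cookiejar
import urllib.request

def make_browser_like_opener():
    """Return (opener, jar); the opener stores and resends cookies (hypothetical helper)."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Default headers sent with every request made through this opener.
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return opener, jar

opener, jar = make_browser_like_opener()
# Placeholder URL; uncomment to make a real request:
# html = opener.open("http://example.com/").read().decode("utf-8", "replace")
# After a request, any cookies the server set end up in `jar`.
print(len(jar))  # 0 until a server has set something
```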

I don't know what Python offers in that regard, but it's too popular not to offer something in this area. Look for anything dealing with sockets, http connections, http parsing, and regexps.
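As it turns out, the standard library does ship an HTML parser (`html.parser`), which gets at the "text sans tags" the OP wants without hand-rolled regexes. A rough sketch, assuming a reasonably recent Python 3 (the `convert_charrefs` argument isn't in 3.0); `TextExtractor` and `html_to_text` are made-up names and the sample page is invented. Third-party parsers such as Beautiful Soup or lxml cover the same ground with less code if external modules are fine.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> (hypothetical helper class)."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))

def html_to_text(html):
    """Return the page's text, one chunk per line (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Made-up sample page standing in for downloaded HTML.
sample = ("<html><head><style>p{color:red}</style></head>"
          "<body><h1>Title</h1><p>Some &amp; text.</p></body></html>")
print(html_to_text(sample))
```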

Anonymous 11/03/21(Mon)12:18 No. 37

>Oh, in b4 "use urllib", because it just extracts the HTML code and not simply the "content" (e.g. a sentence or paragraph of text sans the <html> code).

lrn2regex: http://docs.python.org/howto/regex.html
