>> No. 27
sup /c0de/,

I really like Python and am moving into doing fun stuff with the standard library. I was wondering if you could give me some tips on "scraping" content off of sites, and which tools or modules (internal to the standard lib OR external resources) I could plug into? No need to re-invent the wheel again, right?

Oh, and in b4 "use urllib": it just fetches the HTML code and not simply the "content" (e.g. a sentence or paragraph of text sans the HTML tags).
>> No. 28
I don't know much about Python, but HTML is just plain text. If you know the particular tag the content is housed in, you should be able to request the HTML and use regular expressions to get what you want.
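
For what it's worth, a minimal sketch of that regex idea in Python 3 (an assumption, the thread doesn't say which version). The sample page, the `story` class name and the `extract_story` helper are all made up for illustration, and regexes over HTML tend to break as soon as the markup changes, so treat this as a quick hack for one known page layout:

```python
import re

# Hypothetical page; in practice this string would come from an HTTP response.
html = """<html><body>
<div id="nav">Home | About</div>
<p class="story">First paragraph of the <b>story</b>.</p>
<p class="story">Second paragraph.</p>
</body></html>"""

# Non-greedy match of everything between <p class="story"> and the next </p>.
# re.DOTALL lets "." span newlines inside a paragraph.
PARAGRAPH = re.compile(r'<p class="story">(.*?)</p>', re.DOTALL)

# Anything that looks like a tag, used to strip inline markup such as <b>.
TAG = re.compile(r"<[^>]+>")


def extract_story(page):
    """Return the text of each story paragraph with inline tags removed."""
    return [TAG.sub("", chunk).strip() for chunk in PARAGRAPH.findall(page)]


print(extract_story(html))
```

The non-greedy `(.*?)` is what keeps each match from running on to the last `</p>` on the page.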

That's assuming you want textual content. Images will either be provided as direct links or will reference some server-side code that serves the image back to you. That code might check your HTTP Referer header (the header name really is spelled with one "r") to see where you're coming from. If you're coming from off-site, it might serve up an alternate image. So basically, you need to spoof your Referer header. Fortunately, HTTP is also plain text.
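
In Python 3 that header can probably be set with `urllib.request.Request` from the standard library. A rough sketch, where the URLs, the User-Agent string and the `build_image_request` helper are placeholders I made up; it only builds the request and never touches the network:

```python
import urllib.request


def build_image_request(image_url, page_url):
    """Build a request that claims to come from page_url."""
    req = urllib.request.Request(image_url)
    # The HTTP header name is spelled "Referer" (one "r" in the middle).
    req.add_header("Referer", page_url)
    # Some servers also reject the default Python User-Agent; this value is made up.
    req.add_header("User-Agent", "example-scraper/0.1")
    return req


# Hypothetical URLs, for illustration only.
req = build_image_request(
    "http://example.com/img/123.jpg",
    "http://example.com/thread/27",
)
print(req.get_header("Referer"))

# To actually download the image:
# data = urllib.request.urlopen(req).read()
```

Whether a given site checks the Referer at all is something you'd have to find out by trying it.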

I might be making this more complicated than it needs to be, but it seems like you'll need to know HTTP, HTTP Headers, Cookies (just in case the server needs them), HTML, and RegExps. And you'll need an idea of the specific section or sections that house the content you're after. Basically, you're building a primitive web browser from scratch.
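
If it turns out the server does want cookies, you likely don't have to handle them by hand in Python 3. A sketch assuming the standard `http.cookiejar` and `urllib.request` modules; `make_session`, the User-Agent value and the commented-out URLs are hypothetical:

```python
import http.cookiejar
import urllib.request


def make_session():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Headers sent with every request made through this opener (value is made up).
    opener.addheaders = [("User-Agent", "example-scraper/0.1")]
    return opener, jar


opener, jar = make_session()

# Typical use (not run here, hypothetical URLs):
# opener.open("http://example.com/login")            # server sets a cookie
# page = opener.open("http://example.com/members")   # cookie is sent back
# html = page.read().decode("utf-8", "replace")
```

Every request made through the same `opener` shares the jar, which is roughly what a browser session does.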

I don't know what Python offers in that regard, but it's too popular not to offer something in this area. Look for anything dealing with sockets, http connections, http parsing, and regexps.
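
For reference, Python 3's standard library does seem to cover that list: `socket`, `http.client`, `urllib.request`, `html.parser` and `re`. Since the original question was about getting text without the tags, here is a sketch using `html.parser`. The `TextExtractor` class, the `html_to_text` helper and the choice of which tags count as "block" tags are my own assumptions, not anything from the thread:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Tags after which a space is inserted so adjacent blocks don't run together.
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(page):
    """Return the text content of page with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join("".join(parser.chunks).split())


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello, <b>world</b>!</p><p>Second line.</p></body></html>"
)
print(html_to_text(sample))
```

A real parser copes with nested and sloppy markup better than a regex does, at the cost of a bit more code. External libraries exist for this too, but the sketch above sticks to the standard library.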
>> No. 37
>Oh, and in b4 "use urllib": it just fetches the HTML code and not simply the "content" (e.g. a sentence or paragraph of text sans the HTML tags).
lrn2regex: http://docs.python.org/howto/regex.html

