Webbots, spiders, and screen scrapers: a guide to developing internet agents with PHP/CURL

Bibliographic Details
Main Author	Schrenk, Michael
Format	eBook
Language	English
Published	San Francisco: No Starch Press, Inc., 2012
Edition	2nd ed.
Subjects	Intelligent agents (Computer software) Internet programming Internet searching Web search engines

Abstract	The Internet is bigger and better than what a mere browser allows. Webbots, Spiders, and Screen Scrapers is for programmers and businesspeople who want to take full advantage of the vast resources available on the Web. There's no reason to let browsers limit your online experience, especially when you can easily automate online tasks to suit your individual needs. Learn how to write webbots and spiders that do all this and more: programmatically download entire websites; effectively parse data from web pages; manage cookies; decode encrypted files; automate form submissions; send and receive email; send SMS alerts to your cell phone; unlock password-protected websites; automatically bid in online auctions; exchange data with FTP and NNTP servers. Sample projects using standard code libraries reinforce these new skills. You'll learn how to create your own webbots and spiders that track online prices, aggregate different data sources into a single web page, and archive the online data you just can't live without. You'll learn inside information from an experienced webbot developer on how and when to write stealthy webbots that mimic human behavior, tips for developing fault-tolerant designs, and various methods for launching and scheduling webbots. You'll also get advice on how to write webbots and spiders that respect website owner property rights, plus techniques for shielding websites from unwanted robots. Some tasks are just too tedious, or too important, to leave to humans. Once you've automated your online life, you'll never let a browser limit the way you use the Internet again.
Author	Schrenk, Michael
Author_xml	– sequence: 1 fullname: Schrenk, Michael
ContentType	eBook
DBID	NMGKS
DEWEY	006.3
DatabaseName	No Starch Press
DatabaseTitleList
DeliveryMethod	fulltext_linktorsrc
Discipline	Engineering Computer Science Library & Information Science
EISBN	9781593274320 1593274327
Edition	2nd ed. 2
ExternalDocumentID	EBC3017639 NOSTARCHB0000072
GroupedDBID	AABBV AALIM ABARN ABIAV ABQPQ ACLGV ACNAM ADVEM AERYV AFOJC AHWGJ AJFER AKHYG ALMA_UNASSIGNED_HOLDINGS AZZ BBABE BPBUR GEOUK HF4 J-X JJU MYL NK1 NK2 NMGKS OHILO OODEK PQQKQ WZT ~H6
ID	FETCH-LOGICAL-a51849-f090c8feff01b39a49fa4aeed7bdee9a90e90483b90142704a7aa43e3e2ede373
ISBN	1593271204 1593273975 9781593271206 9781593273972
IngestDate	Wed Aug 27 04:29:04 EDT 2025 Fri Mar 21 18:58:05 EDT 2025
IsPeerReviewed	false
IsScholarly	false
LCCallNum	TK5105.884
LCCallNum_Ident	TK5105.884
Language	English
LinkModel	OpenURL
MergedId	FETCHMERGED-LOGICAL-a51849-f090c8feff01b39a49fa4aeed7bdee9a90e90483b90142704a7aa43e3e2ede373
OCLC	795714370
PQID	EBC3017639
PageCount	396
ParticipantIDs	proquest_ebookcentral_EBC3017639 igpublishing_primary_NOSTARCHB0000072
ProviderPackageCode	J-X
PublicationCentury	2000
PublicationDate	2012. 2012
PublicationDateYYYYMMDD	2012-01-01
PublicationDate_xml	– year: 2012 text: 2012.
PublicationDecade	2010
PublicationPlace	San Francisco
PublicationPlace_xml	– name: San Francisco
PublicationYear	2012
Publisher	No Starch Press, Inc No Starch Press, Incorporated
Publisher_xml	– name: No Starch Press, Inc – name: No Starch Press, Incorporated
SSID	ssj0000285416 ssj0000680961 ssib036191387
Score	1.8625588
Snippet	The Internet is bigger and better than what a mere browser allows. Webbots, Spiders, and Screen Scrapers is for programmers and businesspeople who want to take...
SourceID	proquest igpublishing
SourceType	Publisher
SubjectTerms	Intelligent agents (Computer software) Internet programming Internet searching Web search engines
SubjectTermsDisplay	Intelligent agents (Computer software) Internet programming Internet searching Web search engines
Subtitle	a guide to developing internet agents with PHP/CURL
TableOfContents	Webbots, spiders, and screen scrapers : a guide to developing internet agents with PHP/CURL -- Brief Contents -- Contents in Detail -- About the Author; About the Technical Reviewer -- Acknowledgements -- Introduction -- Part I: Fundamental Concepts and Techniques -- 1. What's in It for You? -- 2. Ideas for Webbot Projects -- 3. Downloading Web Pages -- 4. Basic Parsing Techniques -- 5. Advanced Parsing with Regular Expressions -- 6. Automating Form Submission -- 7. Managing Large Amounts of Data -- Part II: Projects -- 8. Price-Monitoring Webbots -- 9. Image-Capturing Webbots -- 10. Link-Verification Webbots -- 11. Search-Ranking Webbots -- 12. Aggregation Webbots -- 13. FTP Webbots -- 14. Webbots That Read Email -- 15. Webbots That Send Email -- 16. Converting a Website into a Function -- Part III: Advanced Technical Considerations -- 17. Spiders -- 18. Procurement Webbots and Snipers -- 19. Webbots and Cryptography -- 20. Authentication -- 21. Advanced Cookie Management -- 22. Scheduling Webbots and Spiders -- 23. Scraping Difficult Websites with Browser Macros -- 24. Hacking iMacros -- 25. Deployment and Scaling -- Part IV: Larger Considerations -- 26. Designing Stealthy Webbots and Spiders -- 27. Proxies -- 28. Writing Fault-Tolerant Webbots -- 29. Designing Webbot-Friendly Websites -- 30. Killing Spiders -- 31. Keeping Webbots out of Trouble -- Appendix A: PHP/CURL Reference -- Appendix B: Status Codes -- Appendix C: SMS Gateways -- Index.
-- Intro -- Brief Contents -- Contents In Detail -- Introduction -- Old-School Client-Server Technology -- The Problem with Browsers -- What to Expect from This Book -- Learn from My Mistakes -- Master Webbot Techniques -- Leverage Existing Scripts -- About the Website -- About the Code -- Requirements -- Hardware -- Software -- Internet Access -- A Disclaimer (This Is Important)
-- PART I: Fundamental Concepts and Techniques
-- 1: What's in It for You? -- Uncovering the Internet's True Potential -- What's in It for Developers? -- Webbot Developers Are in Demand -- Webbots Are Fun to Write -- Webbots Facilitate "Constructive Hacking" -- What's in It for Business Leaders? -- Customize the Internet for Your Business -- Capitalize on the Public's Inexperience with Webbots -- Accomplish a Lot with a Small Investment -- Final Thoughts
-- 2: Ideas for Webbot Projects -- Inspiration from Browser Limitations -- Webbots That Aggregate and Filter Information for Relevance -- Webbots That Interpret What They Find Online -- Webbots That Act on Your Behalf -- Figure 2-3: An example pokerbot -- A Few Crazy Ideas to Get You Started -- Help Out a Busy Executive -- Save Money by Automating Tasks -- Protect Intellectual Property -- Monitor Opportunities -- Verify Access Rights on a Website -- Create an Online Clipping Service -- Plot Unauthorized Wi-Fi Networks -- Track Web Technologies -- Allow Incompatible Systems to Communicate -- Final Thoughts
-- 3: Downloading Web Pages -- Think About Files, Not Web Pages -- Downloading Files with PHP's Built-in Functions -- Downloading Files with fopen() and fgets() -- Downloading Files with file() -- Introducing PHP/CURL -- Multiple Transfer Protocols -- Form Submission -- Basic Authentication -- Cookies -- Redirection -- Agent Name Spoofing -- Referer Management -- Socket Management -- Installing PHP/CURL -- LIB_http -- Familiarizing Yourself with the Default Values -- Using LIB_http -- Learning More About HTTP Headers -- Examining LIB_http's Source Code -- Final Thoughts
-- 4: Basic Parsing Techniques -- Content Is Mixed with Markup -- Parsing Poorly Written HTML -- Standard Parse Routines -- Using LIB_parse -- Splitting a String at a Delimiter: split_string() -- Parsing Text Between Delimiters: return_between() -- Parsing a Data Set into an Array: parse_array() -- Parsing Attribute Values: get_attribute() -- Removing Unwanted Text: remove() -- Useful PHP Functions -- Detecting Whether a String Is Within Another String -- Replacing a Portion of a String with Another String -- Parsing Unformatted Text -- Measuring the Similarity of Strings -- Final Thoughts -- Don't Trust a Poorly Coded Web Page -- Parse in Small Steps -- Don't Render Parsed Text While Debugging -- Use Regular Expressions Sparingly
-- 5: Advanced Parsing with Regular Expressions -- Pattern Matching, the Key to Regular Expressions -- PHP Regular Expression Types -- PHP Regular Expressions Functions -- Resemblance to PHP Built-In Functions -- Learning Patterns Through Examples -- Parsing Numbers -- Detecting a Series of Characters -- Matching Alpha Characters -- Matching on Wildcards -- Specifying Alternate Matches -- Regular Expressions Groupings and Ranges -- Regular Expressions of Particular Interest to Webbot Developers -- Parsing Phone Numbers -- Where to Go from Here -- When Regular Expressions Are (or Aren't) the Right Parsing Tool -- Strengths of Regular Expressions -- Disadvantages of Pattern Matching While Parsing Web Pages -- Which Are Faster: Regular Expressions or PHP's Built-In Functions? -- Final Thoughts
-- 6: Automating Form Submission -- Reverse Engineering Form Interfaces -- Form Handlers, Data Fields, Methods, and Event Triggers -- Form Handlers -- Data Fields -- Methods -- Multipart Encoding -- Event Triggers -- Unpredictable Forms -- JavaScript Can Change a Form Just Before Submission -- Form HTML Is Often Unreadable by Humans -- Cookies Aren't Included in the Form, but Can Affect Operation -- Analyzing a Form -- Final Thoughts -- Don't Blow Your Cover -- Correctly Emulate Browsers -- Avoid Form Errors
-- 7: Managing Large Amounts of Data -- Organizing Data -- Naming Conventions -- Storing Data in Structured Files -- Storing Text in a Database -- Storing Images in a Database -- Database or File? -- Making Data Smaller -- Storing References to Image Files -- Compressing Data -- Removing Formatting -- Thumbnailing Images -- Final Thoughts
-- PART II: Projects
-- 8: Price-Monitoring Webbots -- The Target -- Designing the Parsing Script -- Initialization and Downloading the Target -- Further Exploration
-- 9: Image-Capturing Webbots -- Example Image-Capturing Webbot -- Creating the Image-Capturing Webbot -- Binary-Safe Download Routine -- Directory Structure -- The Main Script -- Further Exploration -- Final Thoughts
-- 10: Link-Verification Webbots -- Creating the Link-Verification Webbot -- Initializing the Webbot and Downloading the Target -- Setting the Page Base -- Parsing the Links -- Running a Verification Loop -- Generating Fully Resolved URLs -- Downloading the Linked Page -- Displaying the Page Status -- Running the Webbot -- LIB_http_codes -- LIB_resolve_addresses -- Further Exploration
-- 11: Search-Ranking Webbots -- Description of a Search Result Page -- What the Search-Ranking Webbot Does -- Running the Search-Ranking Webbot -- How the Search-Ranking Webbot Works -- The Search-Ranking Webbot Script -- Initializing Variables -- Starting the Loop -- Fetching the Search Results -- Parsing the Search Results -- Final Thoughts -- Be Kind to Your Sources -- Search Sites May Treat Webbots Differently Than Browsers -- Spidering Search Engines Is a Bad Idea -- Familiarize Yourself with the Google API -- Further Exploration
-- 12: Aggregation Webbots -- Choosing Data Sources for Webbots -- Example Aggregation Webbot -- Familiarizing Yourself with RSS Feeds -- Writing the Aggregation Webbot -- Adding Filtering to Your Aggregation Webbot -- Further Exploration
-- 13: FTP Webbots -- Example FTP Webbot -- PHP and FTP -- Further Exploration
-- 14: Webbots That Read Email -- The POP3 Protocol -- Logging into a POP3 Mail Server -- Reading Mail from a POP3 Mail Server -- Executing POP3 Commands with a Webbot -- Further Exploration -- Email-Controlled Webbots -- Email Interfaces
-- 15: Webbots That Send Email -- Email, Webbots, and Spam -- Sending Mail with SMTP and PHP -- Configuring PHP to Send Mail -- Sending an Email with mail() -- Writing a Webbot That Sends Email Notifications -- Keeping Legitimate Mail out of Spam Filters -- Sending HTML-Formatted Email -- Further Exploration -- Using Returned Emails to Prune Access Lists -- Using Email as Notification That Your Webbot Ran -- Leveraging Wireless Technologies -- Writing Webbots That Send Text Messages
-- 16: Converting a Website into a Function -- Writing a Function Interface -- Defining the Interface -- Analyzing the Target Web Page -- Using describe_zipcode() -- Final Thoughts -- Distributing Resources -- Using Standard Interfaces -- Designing a Custom Lightweight "Web Service"
-- PART III: Advanced Technical Considerations
-- 17: Spiders -- How Spiders Work -- Example Spider -- LIB_simple_spider -- harvest_links() -- archive_links() -- get_domain() -- exclude_link() -- Experimenting with the Spider -- Adding the Payload -- Further Exploration -- Save Links in a Database -- Separate the Harvest and Payload -- Distribute Tasks Across Multiple Computers -- Regulate Page Requests
-- 18: Procurement Webbots and Snipers -- Procurement Webbot Theory -- Get Purchase Criteria -- Authenticate Buyer -- Verify Item -- Evaluate Purchase Triggers -- Make Purchase -- Evaluate Results -- Sniper Theory -- Get Purchase Criteria -- Authenticate Buyer -- Verify Item -- Synchronize Clocks -- Time to Bid? -- Submit Bid -- Evaluate Results -- Testing Your Own Webbots and Snipers -- Further Exploration -- Final Thoughts
-- 19: Webbots and Cryptography -- Designing Webbots That Use Encryption -- SSL and PHP Built-in Functions -- Encryption and PHP/CURL -- A Quick Overview of Web Encryption -- Final Thoughts
-- 20: Authentication -- What Is Authentication? -- Types of Online Authentication -- Strengthening Authentication by Combining Techniques -- Authentication and Webbots -- Example Scripts and Practice Pages -- Basic Authentication -- Session Authentication -- Authentication with Cookie Sessions -- Authentication with Query Sessions -- Final Thoughts
-- 21: Advanced Cookie Management -- How Cookies Work -- PHP/CURL and Cookies -- How Cookies Challenge Webbot Design -- Purging Temporary Cookies -- Managing Multiple Users' Cookies -- Further Exploration
-- 22: Scheduling Webbots and Spiders -- Preparing Your Webbots to Run as Scheduled Tasks -- The Windows XP Task Scheduler -- Scheduling a Webbot to Run Daily -- Complex Schedules -- The Windows 7 Task Scheduler -- Non-calendar-based Triggers -- Final Thoughts -- Determine the Webbot's Best Periodicity -- Avoid Single Points of Failure -- Add Variety to Your Schedule
-- 23: Scraping Difficult Websites with Browser Macros -- Barriers to Effective Web Scraping -- AJAX -- Bizarre JavaScript and Cookie Behavior -- Flash -- Overcoming Webscraping Barriers with Browser Macros -- What Is a Browser Macro? -- The Ultimate Browser-Like Webbot -- Installing and Using iMacros -- Creating Your First Macro -- Final Thoughts -- Are Macros Really Necessary?
Title	Webbots, spiders, and screen scrapers
URI	http://portal.igpublish.com/iglibrary/search/NOSTARCHB0000072.html https://ebookcentral.proquest.com/lib/[SITE_ID]/detail.action?docID=3017639
hasFullText	1
inHoldings	1
isFullTextHit
isPrint
link	http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwnV07T8MwEDZ9LDBBAVFe8kCZGuQ2Th2vrYoqhjK0oG6VX2FAaitUFvjz3DluGhUkBIuTWJEj3ef47uy77wi5USmLOaipCMnSIq6FjLS1OuJ-x431stT6aItxb_TEH2bJrLL3Wc4uWes78_FjXsl_UIU-wBWzZP-AbDEodMA94AstIAztjvFbPAbCbYclwDwEEyzxmtc581GYBgNp8KJWbntSMzEYVve6Gygf3H0fN1F298dLtEJxO8SHaISFJJAeF-lQuXcIlkoM1onMa-Ps0EoP-wP4t2F1kVVSFSKtkToowmGx1RGDZ9WJg-2Wq7UUy8Ngctxm3CRwZhXfQb7Xl1Wxf_ZNv3mlPT0kdYeZHEek4hYNclBiXmyQq5CvQW9pSMhClGhY6Y5JK0i4TYN82xSkS3Pp0o10T8jz_XA6GEWhokSkEnBlZZQxyUyauSxjHR1LxWWmuAI7QWjrnFSSOYkk-xpPl7uCcSWU4jBtXddZF4v4lNQWy4U7I1RbxZzR4A6nnBtjNNiGMIJVPSUSZVmTtMqymK9y9pD5-HEyxYPAvvfPRLdJ6EZIc39AHqJy51uEzn9_5YLsb2fLJamt397dFZhLa30dgP0Cp3sTMg
linkProvider	ProQuest Ebooks
openUrl	ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=book&rft.title=Webbots%2C+Spiders%2C+and+Screen+Scrapers&rft.au=Schrenk%2C+Michael&rft.date=2012-01-01&rft.pub=No+Starch+Press%2C+Incorporated&rft.isbn=9781593273972&rft.externalDocID=EBC3017639
thumbnail_s	http://utb.summon.serialssolutions.com/2.0.0/image/custom?url=http%3A%2F%2Fportal.igpublish.com%2Figlibrary%2Famazonbuffer%2FNOSTARCHB0000072_null_0_320.png