Scraping, also known as content scraping, web scraping, data aggregation, database scraping and other terms, refers to the collection of an application’s content and/or other data for use elsewhere. It’s an automated threat that aims to steal sensitive data from a victim or abuse functionality.
OWASP, a worldwide not-for-profit charitable organization focused on improving the security of software, says data commonly misused in scraping incidents includes authentication credentials, payment cardholder data and other financial data, medical and other personal data, intellectual property and other business data and public information.
Some scraping might use fake or compromised accounts, or the information might be accessible without authentication, OWASP notes. The scraper might attempt to read all accessible paths and parameter values for web pages and APIs, collecting the responses and extracting data from them. Scraping can occur in real time or periodically.
OWASP says possible symptoms of scraping include unusual request activity for selected resources, duplicated content from multiple sources in search engine results, and decreased search engine ranking.
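The first symptom, unusual request activity, can be watched for with a simple per-client counter. The following is a minimal sketch, not something OWASP prescribes: the class name `RequestRateMonitor`, the thresholds, and the idea of keying on a client identifier are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class RequestRateMonitor:
    """Flags clients whose request count in a sliding window exceeds a threshold.

    Hypothetical helper: the default limits are illustrative, not recommended values.
    """

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request timestamps

    def record(self, client_id, timestamp):
        """Record one request; return True if the client now exceeds the limit."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard timestamps that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

In practice the client identifier might be an account, an IP address, or a device fingerprint, and a flagged client would more likely be challenged or throttled than blocked outright.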
Countermeasures for Scraping
Countermeasures the group suggests include reducing the data fields collected and subsequently output, and/or reducing the retention period; documenting what is acceptable usage and what is unacceptable scraping; and defining test cases for scraping that confirm an application will detect and/or prevent users attempting to scrape content and other data.
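Reducing the data fields that are output can be as simple as passing every record through an allowlist before it leaves the application. This is a minimal sketch under assumed names: `PUBLIC_FIELDS` and `to_public_view` are hypothetical, and the field names are made up for the example.

```python
# Hypothetical allowlist of fields that are safe to expose publicly.
PUBLIC_FIELDS = ("id", "name", "price")


def to_public_view(record, allowed=PUBLIC_FIELDS):
    """Return a copy of the record containing only allowlisted fields."""
    return {key: record[key] for key in allowed if key in record}
```

An allowlist fails closed: a field added to the underlying record later stays hidden until someone deliberately exposes it.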
Other recommended steps include randomizing the content and URLs of content, tying these changes to an individual user’s session, verifying the changes at each request, and restricting any identified automated usage. Companies can also use fingerprinting to identify and restrict automated usage before a scraping attack can occur; require stronger identity authentication for access; and pre-register users and implement strong authentication for access to any exposed APIs.
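One way to tie URLs to a user’s session and verify them at each request is to append a keyed hash of the session and path. The sketch below is one possible approach, not the method OWASP specifies; `SERVER_KEY`, `sign_path`, `make_url`, `verify`, and the `t` query parameter are all hypothetical names.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real deployment would load it from secure storage.
SERVER_KEY = secrets.token_bytes(32)


def sign_path(session_id, path, key=SERVER_KEY):
    """Return a hex token bound to both the session and the requested path."""
    message = f"{session_id}\n{path}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def make_url(session_id, path, key=SERVER_KEY):
    """Build a URL that is only valid for the given session."""
    return f"{path}?t={sign_path(session_id, path, key)}"


def verify(session_id, path, token, key=SERVER_KEY):
    """Check the token on each request using a constant-time comparison."""
    expected = sign_path(session_id, path, key)
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

A URL harvested in one session then fails verification when replayed from another, which makes bulk collection of links less useful to a scraper.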