Web scraping and GDPR: a practical guide for those who want to dig (but not too deep)

Soledad Musella Rubio
6 min read · May 28, 2020


Everyone knows that there is more information on the Internet than any human can absorb in a lifetime. The amount of documentation and data is so large that it is not easy to access, and a scalable way to rank, collect, organise, and analyse it is needed. This is where WEB SCRAPING comes in: it extracts data and presents it in a format that allows users to easily make sense of it. But what is WEB SCRAPING, and how can it be useful?

Web scraping

The concept of “web scraping” refers to the extraction of data from a website. The most common techniques are presented below. It is important to emphasise that in most cases automated tools are preferred for scraping web data, as they can be cheaper and give higher performance.

Manual scraping

Copy pasting:

Manual scraping involves copying and pasting web content. This method takes a lot of effort and is highly repetitive. It can still be an effective way of lifting content when a website’s defence mechanisms are tuned to detect only automated scraping bots.

Automated Scraping

HTML Parsing:

HTML parsing is often done with a scripting language such as JavaScript or Python, and targets linear or nested HTML pages. This fast and robust method is used for text extraction, link extraction (such as nested links or email addresses), screen scraping, resource extraction, and so on.
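As a minimal sketch of this technique, the snippet below uses only Python’s standard-library `html.parser` to pull links and email addresses out of a page fragment (the HTML string and class name are illustrative):

```python
# Minimal HTML-parsing sketch using only Python's standard library.
from html.parser import HTMLParser
import re

class LinkExtractor(HTMLParser):
    """Collects href targets and any email addresses found in text nodes."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.emails = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Scan text nodes for email-like patterns.
        self.emails.extend(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", data))

html = '<p>Contact <a href="/about">us</a> at info@example.com</p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)   # ['/about']
print(parser.emails)  # ['info@example.com']
```

In practice, third-party parsers (e.g. BeautifulSoup) are more forgiving of broken markup, but the idea is the same: walk the tags and collect what you need.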

DOM Parsing:

The Document Object Model, or DOM, defines the style, structure, and content of an XML or HTML document. DOM parsers are generally used when you want an in-depth view of the structure of a web page: they expose the nodes containing the information, which a tool such as XPath can then query to scrape web pages. Full-fledged web browsers like Internet Explorer or Firefox can be embedded to extract the entire web page or just parts of it, even when the content is generated dynamically.
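A small sketch of DOM parsing with Python’s built-in `xml.dom.minidom` (the document and tag names are made up for illustration):

```python
# DOM parsing sketch: build a node tree, then walk it for the data we want.
from xml.dom.minidom import parseString

doc = parseString(
    "<catalog>"
    "<product id='1'><name>Widget</name><price>9.99</price></product>"
    "<product id='2'><name>Gadget</name><price>19.99</price></product>"
    "</catalog>"
)

# Pull the text content of every <name> node in the tree.
names = [node.firstChild.data for node in doc.getElementsByTagName("name")]
print(names)  # ['Widget', 'Gadget']
```

The same node-walking approach is what browser-embedded scrapers do against the live DOM of a rendered page.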

Vertical Aggregation:

Vertical aggregation platforms are created by companies with access to large-scale computing power to target specific verticals. These platforms create and monitor bots for specific verticals with virtually no human intervention. The bots’ efficiency is measured by the quality of the data they extract.

XPath:

XML Path Language, or XPath, is a query language for XML documents. Since XML documents have a tree-like structure, XPath can be used to navigate the tree, selecting nodes based on a variety of parameters. XPath can be used in conjunction with DOM parsing to extract an entire web page and publish it at a destination website.
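The standard-library `ElementTree` module supports a limited XPath subset, which is enough to show the idea (full XPath 1.0 requires a library such as lxml; the markup below is illustrative):

```python
# XPath-style node selection using the limited XPath subset in
# Python's standard-library ElementTree.
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<html><body>"
    "<div class='listing'><a href='/item/1'>First</a></div>"
    "<div class='listing'><a href='/item/2'>Second</a></div>"
    "</body></html>"
)

# Select every <a> under a <div> whose class attribute is 'listing'.
hrefs = [a.get("href") for a in page.findall(".//div[@class='listing']/a")]
print(hrefs)  # ['/item/1', '/item/2']
```

The predicate syntax (`[@class='listing']`) is what makes XPath convenient for scraping: one expression pinpoints the nodes of interest anywhere in the tree.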

Google Sheets:

Google Sheets can be used as a scraping tool, and it’s quite popular among scrapers. From within Sheets, a scraper can use the IMPORTXML(url, xpath_query) function to scrape data from websites. This is useful when the scraper wants specific data or patterns to be extracted from a website. You can also use this function to check whether your own website is scrape-proof.
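For example, a formula like the following (the URL and XPath expression are placeholders) pulls every matching node straight into a column of the sheet:

```
=IMPORTXML("https://example.com/products", "//h2[@class='product-title']")
```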

What’s the GDPR?

“The General Data Protection Regulation (EU) 2016/679 (“GDPR”) is a regulation in EU law on data protection and privacy for all individuals within the European Union (EU) and the European Economic Area (EEA). It also addresses the export of personal data outside the EU and EEA areas. The GDPR aims primarily to give control to individuals over their personal data and to simplify the regulatory environment for international business by unifying the regulation within the EU.”

The EU’s General Data Protection Regulation (GDPR) is the most comprehensive data protection law ever introduced, radically changing the way companies are allowed to use the personal data of their customers and prospects.

Evaluate your web scraping for GDPR compliance step by step.

Step 1. Personal Data

The very first question to ask yourself is: are you scraping personal data? Personal data is defined as any personally identifiable information (PII) that could be used to directly or indirectly identify a specific individual. The most common examples of personal data are a name, physical address, email address, phone number, credit card details, and so on. If you aren’t scraping personal data, then GDPR does not apply. If you are scraping personal data, move on to step 2.
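A first-pass check for this step could be automated: the hypothetical helper below (the function name, field layout, and the two patterns are illustrative, not an exhaustive PII detector) flags scraped records that look like they contain personal data so they can be routed through a compliance review before storage.

```python
# Hypothetical sketch: flag records that appear to contain personal data
# (here only email and phone-number patterns) before they are stored.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def contains_personal_data(record: dict) -> bool:
    """Return True if any field of the record matches a known PII pattern."""
    return any(
        pattern.search(str(value))
        for value in record.values()
        for pattern in PII_PATTERNS.values()
    )

print(contains_personal_data({"title": "Blue widget", "price": "9.99"}))         # False
print(contains_personal_data({"seller": "jane@shop.example", "price": "9.99"}))  # True
```

A real pipeline would need far broader detection (names, addresses, IDs), but even a crude filter like this helps decide whether the rest of the checklist applies.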

Step 2. Geographical location

If you are planning to scrape personal data, the next question you need to ask yourself is whether or not you are scraping the personal data of EU citizens or residents (note that the GDPR actually covers the EEA, which includes all EU countries, plus Iceland, Liechtenstein, and Norway, so it’s a bit broader than just the EU).

GDPR is an EEA-specific regulation, so it applies to the personal data of people located in the EEA, regardless of their citizenship. If you are scraping the personal information of residents of other countries (e.g. the US, Canada, Australia), then GDPR may not apply.
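In a pipeline, this step might look like splitting scraped records by a country field, as in the hypothetical sketch below (the record layout is illustrative; the set lists the 27 EU member states plus Iceland, Liechtenstein, and Norway):

```python
# Hypothetical sketch: separate records by whether the data subject is
# in the EEA (where GDPR applies), using an ISO country-code field.
EEA = {
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE",
    "GR", "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT",
    "RO", "SK", "SI", "ES", "SE",   # 27 EU member states
    "IS", "LI", "NO",               # plus the rest of the EEA
}

records = [
    {"name": "A. Person", "country": "DE"},
    {"name": "B. Person", "country": "US"},
]

in_scope = [r for r in records if r["country"] in EEA]
out_of_scope = [r for r in records if r["country"] not in EEA]
print(len(in_scope), len(out_of_scope))  # 1 1
```

Records in the `in_scope` bucket are the ones that must pass the remaining steps of this checklist.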

Step 3. Do you have a lawful reason to scrape EU citizens’ data?

Under GDPR, in order to use or hold the personal data of any EU citizen, companies must be able to rely on at least one lawful basis. The two most relevant for web scraping are:

Consent — If you want to scrape the personal data of EU residents, you now need to demonstrate that you have the explicit consent of each individual before scraping their personal data. You might not like this position, but adopting this policy is the only way to prevent you and your company falling foul of GDPR. Obviously, this interpretation of the GDPR regulations will significantly curtail most web scraping projects focused on extracting the personal information of EU residents for lead generation, market analysis, etc.

Legitimate Interest — The other likely lawful basis available to web scrapers is demonstrating a legitimate interest in scraping, storing, and using this personal data. For companies, it is very difficult to demonstrate a legitimate interest in scraping someone’s personal data. In most cases, only governments, law enforcement agencies, and the like will be deemed to have a legitimate interest in scraping the personal data of citizens, as they will typically be doing so for the public good.

Step 4. There is data, and then there is data

Under the GDPR regulation, certain types of data are classed as “sensitive”. If you want to scrape sensitive data, you are subject to additional rules, so unless you have explicit consent and a legitimate reason, it is better to avoid it. Examples of sensitive data are racial or ethnic origin, political opinions, religious beliefs, and genetic data. Even if you can argue that you have a legitimate interest in this data, or have the user’s consent to extract and store it, under GDPR you need a clear and legal reason for doing so and must be able to demonstrate that it will be used for legitimate business purposes.

Step 5. Check your IPs

As the GDPR regulation defines IP addresses as personally identifiable information, you need to ensure that any EU residential IPs you use as proxies are GDPR compliant. This means you need to ensure that the owner of each residential IP has given their explicit consent for their home or mobile IP to be used as a web scraping proxy. If you own your own residential IPs, you will need to handle this consent yourself. If you are obtaining residential proxies from a third-party provider, you need to ensure that they have obtained consent and are in compliance with GDPR before you use the proxy in your web scraping project.
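One way to enforce this in code is a consent gate in front of the proxy pool, as in the hypothetical sketch below (the register structure is illustrative, not a real provider API, and the IPs are documentation addresses):

```python
# Hypothetical sketch: refuse to route traffic through a residential
# proxy unless its owner's consent is on record.
consent_register = {
    "198.51.100.7": True,   # owner has given explicit consent
    "203.0.113.42": False,  # consent withdrawn or never obtained
}

def proxy_allowed(ip: str) -> bool:
    """Allow a residential IP only with a recorded, affirmative consent."""
    return consent_register.get(ip, False)

print(proxy_allowed("198.51.100.7"))   # True
print(proxy_allowed("203.0.113.42"))   # False
print(proxy_allowed("192.0.2.1"))      # False (unknown IP: default deny)
```

The important design choice is default-deny: an IP with no consent record is treated the same as one whose consent was refused.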

Conclusions

In our data-driven world, it is now clear that data is power. In the wrong hands it can be a dangerous tool, one that can even influence election results in the world’s most powerful nations and create scandals such as the recent one involving Cambridge Analytica and Facebook. So, to make ethical use of the internet and avoid any compliance trouble, you need to make sure that you have the owners’ consent for all the personal data in your database. The GDPR should be seen as a guideline that can help small businesses and companies prevent the kinds of lawsuits and out-of-court settlements that have taken place before. Making the most of the framework can, with time, prepare you and other companies for a future in which most countries will have strict rules to regulate data and prevent its misuse.
