Publicly Available Data, Privacy Rights: The Debate Over Web Scraping
Shaileja Verma and Alice Sharma*
LLMs and Their Dependence on Web Scraping
While laws and regulations often evolve slowly, the Internet has changed dramatically in the last few decades. One of the more remarkable changes is the widespread popularity of artificial intelligence (“AI”) tools and technologies, particularly generative AI (“GenAI”) and its subset, large language models (“LLMs”).
A key factor enabling the development and training of GenAI and LLMs is ‘data scraping’ or ‘web scraping’[1]. Simply put, data scraping is a process by which a program automatically scans webpages, picks out useful data (such as text and photos), and organises it to make it easy to use or analyse. Data scraping allows LLMs to learn from vast amounts of information that is publicly available on the Internet, which, in turn, helps them improve their ability to generate accurate and relevant responses to user prompts. Notably, publicly available information forms one of the three primary sources of information used to train LLMs. The other two sources are (i) information that is provided to model developers by third parties (such as data brokers) under contractual arrangements; and (ii) information that is provided or generated by users of the LLMs or by trainers/researchers engaged by model developers.
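To make the mechanics described above concrete, the following is a minimal, illustrative Python sketch of a scraping pipeline that fetches a webpage, picks out useful data (text and image references), and organises it for analysis. The URL and field names are our own hypothetical choices, not drawn from any actual scraper; a real-world scraper would also need to honour robots.txt and the target site’s terms of use.

```python
# Illustrative only: a minimal scraping pipeline of the kind described above.
# The URL is hypothetical; a real scraper must respect robots.txt and site terms.
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> dict:
    """Fetch a webpage, pick out useful data, and organise it for analysis."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return {
        "title": soup.title.string if soup.title else None,
        # Collect visible paragraph text (the kind of data an LLM pipeline might keep).
        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
        # Collect image URLs, analogous to scraping photos.
        "image_urls": [img.get("src") for img in soup.find_all("img")],
    }

if __name__ == "__main__":
    data = scrape_page("https://example.com/public-page")  # hypothetical target
    print(data["title"], len(data["paragraphs"]), "paragraphs scraped")
```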
With the growing prevalence of GenAI and its symbiotic relationship with web scraping technologies, a debate over data privacy rights, legal restrictions, and ethical boundaries has come to the forefront. Jurisdictions across the globe have been actively deliberating over whether personal data available on the Internet should be freely accessible and usable for developing and training GenAI models, or whether such data scraping activities violate an individual’s right to privacy. Against this backdrop, we aim to explore how laws in India confer privacy protections on publicly available (personal) data, the current global regulatory landscape governing the interplay between privacy laws and web scraping, and key lessons India can draw from international experiences.
Exploring the Legal Framework in India
Twenty-five years ago, the Information Technology Act, 2000 (“IT Act”) was enacted in India in line with the Model Law on Electronic Commerce, adopted by the United Nations General Assembly and United Nations Commission on International Trade Law. The IT Act penalises various kinds of online harms, including those arising from unauthorised access to data or failure to protect data. Of note is Section 43 read with Section 66, which envisages a dual penalty-cum-compensation mechanism for any person who - without permission of the owner or person in charge of a computer, computer system or computer network - downloads, copies or extracts any data, computer database or information from the same. As LLMs become more commonplace, the IT Act will enable online platforms, especially those with terms and conditions that expressly prohibit web scraping, to take action against LLM developers for unauthorised data scraping. However, it may not be best placed to protect the privacy rights of individuals whose personal data is extracted from the Internet by LLMs. This is an issue that the recently enacted Digital Personal Data Protection Act, 2023 (“DPDP Act”) is better equipped to deal with.
The DPDP Act was enacted in August 2023 after undergoing several iterations and a long and winding consultation process spread across five years. Touted as a privacy law tailored to the Indian context, the DPDP Act envisages extensive obligations to be undertaken by organisations (i.e., ‘Data Fiduciaries’) before and at the time of processing personal data of individuals (i.e., ‘Data Principals’). Section 3 of the DPDP Act identifies different scenarios where the law applies and where it does not. For example, the DPDP Act does not apply to personal data that is made or caused to be made publicly available by the Data Principal or any other person under an obligation under applicable laws in India to make such personal data publicly available (“Personal Data Exemption”).
The term ‘processing’ has been broadly defined under Section 2(x) of the DPDP Act to include wholly or partly automated operations, such as collection, retrieval, indexing and use - each such activity being inherent to the development and training of LLMs. Given this definition, scraping of personal data available on the Internet by LLMs will be considered a processing activity under the DPDP Act, and Data Fiduciaries involved in developing and training such LLMs will be subject to the requirements under the DPDP Act. However, this is likely to pose compliance hurdles, since it may be difficult for Data Fiduciaries to adhere to key obligations under the law - specifically, sending a notice to and obtaining consent from every Data Principal whose personal data is being scraped, and allowing Data Principals to exercise their rights (such as the right to access information, the right to correction and erasure, and the right of grievance redressal) in relation to such scraped data.
This is where the Personal Data Exemption could become a valuable alternative for LLMs. However, Data Fiduciaries will have to establish that the personal data being scraped by them falls within this exemption in the first place. Establishing this is likely to lead to operational challenges.
For example, there could be situations where:
- A Data Principal has a public profile on a social media platform, and personal data available on this profile is considered fair game under the Personal Data Exemption. However, it is unclear if and how the law would apply in a scenario where such an individual changes the privacy settings of their profile. For example, would the profile settings applicable at the time of scraping (‘public’ in this case) guide a determination of the legality of an LLM’s processing activities, and how would Data Fiduciaries prove this in case of regulatory scrutiny? Similar challenges may arise if a Data Principal deletes personal data that was once considered “publicly available”.[2]
- Another person has made a Data Principal’s personal data available on the Internet. Here, a Data Fiduciary would have no way of verifying if the same was done with the consent of the Data Principal or under an existing legal obligation. Even where personal data was made public by a Data Principal themselves or pursuant to having provided consent to another person (thereby “causing” them to make the information publicly available), the Data Principal may not necessarily anticipate or expect such data to be used to train an AI model.
These challenges could cast doubt over the overall legality of an LLM’s data-scraping operations. They could also undermine an individual’s ability to manage their digital footprint - an aspect other jurisdictions have been striving to address.
Overview of Regulatory Responses Across the Globe
Globally, one of the first coordinated regulatory responses to large-scale data scraping was the ‘Joint Statement on Data Scraping and the Protection of Privacy’ (August 2023) issued by data protection authorities from countries such as the U.K., Canada, Australia and Switzerland (“Initial Joint Statement”). The Initial Joint Statement discussed the growing concerns about data scraping and its risks to individual privacy. It emphasised that even publicly accessible personal data is protected by privacy laws and that mass data scraping may constitute a reportable data breach in many jurisdictions. It also recommended that organisations - including social media companies - implement suitable measures to protect an individual’s personal data available on online platforms from unlawful scraping (such as by limiting excessive profile visits, detecting bot activity, and taking suitable legal action where unlawful scraping is found), as well as continuously update such measures given the dynamic nature of data scraping threats. Building upon the Initial Joint Statement, a ‘Concluding Joint Statement on Data Scraping and the Protection of Privacy’ (October 2024) (“Concluding Joint Statement”) was eventually issued. Notably, the Concluding Joint Statement addressed the issue of data scraping by GenAI models, an aspect the Initial Joint Statement was silent on. It noted that where organisations rely on scraped data sets or use data from their platforms to train GenAI models, they must, at a minimum, comply with applicable data protection laws, as well as AI-specific laws (if any). However, the Concluding Joint Statement did not discuss the hows and whys of compliance in detail.
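As a rough illustration of the kind of platform-side safeguard the Initial Joint Statement contemplates (limiting excessive profile visits and flagging bot-like activity), the sketch below implements a simple sliding-window rate limiter in Python. The threshold, window size, and client identifier are invented for illustration; real systems combine many more signals.

```python
# Illustrative sketch of a platform-side safeguard against bulk scraping:
# flag clients that view an unusually high number of profiles per minute.
# The threshold and window are hypothetical, not drawn from any guidance.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_PROFILE_VIEWS = 30  # hypothetical per-minute ceiling for a human user

class ScrapeDetector:
    def __init__(self):
        # Maps each client identifier to a queue of recent view timestamps.
        self._views = defaultdict(deque)

    def record_view(self, client_id: str) -> bool:
        """Record a profile view; return True if the client looks like a scraper."""
        now = time.monotonic()
        window = self._views[client_id]
        window.append(now)
        # Discard views that fall outside the sliding window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) > MAX_PROFILE_VIEWS

detector = ScrapeDetector()
for _ in range(31):
    flagged = detector.record_view("client-123")
print("flagged as bot-like:", flagged)  # True once the ceiling is exceeded
```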
Parallel developments in the E.U. and the U.K. have addressed some of these compliance aspects. A Task Force constituted by the European Data Protection Board (“ChatGPT Task Force”) issued a report titled ‘Report of the Work Undertaken by the ChatGPT Taskforce’ (May 2024). The ChatGPT Task Force was set up to assess the processing of personal data by LLMs, particularly OpenAI’s ChatGPT, and provide views on compliance with the E.U. General Data Protection Regulation (“GDPR”). According to the ChatGPT Task Force, personal data is processed by LLMs across five different stages. Three of these stages are: (i) collecting training data (which could include personal data) via web scraping; (ii) pre-processing and filtering such data; and (iii) training the model using the same. The ChatGPT Task Force observed that Article 14 of the GDPR (on providing specific information to an individual where their data has not been collected directly from them, such as the identity of the Data Fiduciary, the purposes of processing, and the categories of personal data concerned) should ordinarily apply to web scraping of personal data from public sources. However, since large volumes of data are collected by LLMs via web scraping, it would be impractical (or even impossible) to inform each individual about the circumstances surrounding the processing of their data; therefore, an exemption from this Article 14 requirement could be granted.
The ChatGPT Task Force also touched upon ‘legitimate interests’ as one of the grounds on which the personal data of individuals can be processed by LLMs, without obtaining consent from Data Principals and after undergoing an assessment of the three-part test comprising: (i) purpose(s) of processing; (ii) necessity of processing; and (iii) balancing of interests between parties.[3] While the ChatGPT Task Force did not provide a conclusive opinion or finding on relying on legitimate interests for scraping and processing personal data on the Internet, sans consent, this aspect was addressed by recent guidance issued by the U.K.’s Information Commissioner’s Office (“ICO”). As per the ICO, legitimate interest is the “sole available lawful basis” for training GenAI models using personal data scraped from the Internet, to the extent that model developers can pass the three-part test. This is because other grounds on which personal data can be lawfully processed under the U.K. GDPR cannot be practically implemented. In the context of consent as one such ground, the ICO noted that “this is unlikely to apply here because the organisation training the generative AI model has no direct relationship with the person whose data is scraped.” The ICO also laid down factors under the three-part ‘purpose’ test, ‘necessity’ test, and ‘balancing’ test that model developers need to account for prior to relying on legitimate interests.[4]
Analysing the DPDP Act against International Developments
As noted above, Section 3 of the DPDP Act excludes publicly available personal data from the purview of the DPDP Act, subject to the criteria under the Personal Data Exemption being satisfied. It is worth noting that the approach under the DPDP Act differs from earlier iterations of the law, which envisaged “processing of publicly available personal data” as one of the grounds on which Data Fiduciaries need not obtain consent from and/or provide notice to the concerned Data Principals, but would have to comply with all other obligations.
The Committee of Experts under the Chairmanship of Justice B.N. Srikrishna, in its report on ‘A Free and Fair Digital Economy’ (2018), provides some insight into the rationale behind this:
“Processing of personal data made public by a data principal also falls into this category. Conventional views of privacy would offer little protection to such information made public by an individual as the act of making information publicly available could be said to denude the individual of any reasonable expectation of privacy…On the other hand, limits of fair processing must also be clearly drawn.”
Thus, the DPDP Act has gone one step further in entirely excluding the subset of publicly available personal data from the ambit of the framework altogether. When contrasted against global developments ranging from requiring Data Fiduciaries to comply with privacy laws when scraping data to only exempting them from consent/notice requirements, the DPDP Act grants far more flexibility to LLMs to use web-scraped personal data. However, the utility of the Personal Data Exemption to Data Fiduciaries developing LLMs remains a moot point. This is because, as noted above, the conditions identified thereunder will likely prove challenging to satisfy when scraping personal data. A useful starting point would be for the Indian Government to clarify the scope of the Personal Data Exemption, as well as address its interplay with personal data publicly available on the Internet and the use of such data by LLMs for training purposes.
As part of this, the Indian Government can develop guidance and identify good business practices for Data Fiduciaries to follow when developing GenAI models. Taking a leaf from jurisdictions such as Singapore and Australia, this can include measures such as: (i) deploying privacy enhancing tools and technologies (including anonymisation or de-identification techniques) before using publicly available personal data in the training and development of GenAI models; (ii) adhering to purpose limitation and data minimisation principles and only processing web-scraped personal data that is reasonably necessary to train the model in question or help it achieve its specific purpose (such as generating images or creating videos); (iii) implementing well-defined collection criteria during the training of an AI model so that certain kinds of information are not collected in the first place - this could include public social media profiles or websites that contain large amounts of personal data, or certain types of sensitive data; (iv) deleting data that is not relevant to the training, even if initially collected; and (v) increasing reliance on synthetic datasets to help counter privacy concerns associated with training models on publicly available personal data. A sketch of how some of these measures might look in practice follows this paragraph.
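To illustrate measures (i) to (iv) in code, the sketch below shows a hypothetical pre-training filter that skips blocked sources, drops records containing sensitive identifiers, keeps only the fields needed for training, and pseudonymises direct identifiers before data reaches a training set. The domain names, regex patterns, and field names are our own assumptions for illustration, not prescribed by any of the guidance discussed above.

```python
# Illustrative pre-training filter combining several of the measures above:
# collection criteria (blocked sources), data minimisation (keep only needed
# fields), deletion (drop sensitive records), and de-identification (hash
# direct identifiers). All names, patterns, and categories are hypothetical.
import hashlib
import re

BLOCKED_DOMAINS = {"socialnetwork.example", "peoplesearch.example"}  # hypothetical
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{12}\b"),  # e.g. a 12-digit government ID number
    re.compile(r"\b\d{10}\b"),  # e.g. a 10-digit phone number
]
FIELDS_NEEDED_FOR_TRAINING = {"text"}  # purpose limitation: keep only this

def pseudonymise(value: str) -> str:
    """Replace a direct identifier with a one-way hash."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def filter_record(record: dict) -> dict | None:
    """Return a minimised record, or None if it should not be collected at all."""
    if record.get("source_domain") in BLOCKED_DOMAINS:
        return None  # collection criteria: skip high-risk sources entirely
    text = record.get("text", "")
    if any(p.search(text) for p in SENSITIVE_PATTERNS):
        return None  # deletion: drop records containing sensitive identifiers
    minimised = {k: v for k, v in record.items() if k in FIELDS_NEEDED_FOR_TRAINING}
    if "author" in record:
        minimised["author_pseudonym"] = pseudonymise(record["author"])
    return minimised

sample = {"source_domain": "blog.example", "author": "A. Person", "text": "A public post."}
print(filter_record(sample))
```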
If adopted, these measures will go a long way in bringing the DPDP Act in line with global best practices, as well as in ensuring that personal data, even if “publicly available” and exempt from the law, continues to be processed in a manner that respects an individual’s privacy.
[1] Please note that we have used the terms ‘data scraping’ and ‘web scraping’ interchangeably in this article.
[2] When an individual deletes their personal data from the Internet (say, a social media post or photo), it may not mean that the data has been fully removed from the Internet. This is because data scraping tools may have already collected and stored copies of that information elsewhere. As a result, even though the original post no longer exists, versions of it may continue to be available on the Internet. It is unclear how an LLM that relies on web scraping technologies for training purposes will distinguish such datasets from the ones which actually fall under the Personal Data Exemption.
[3] Please note that ‘legitimate interests’ is one of the six lawful bases that allow an organisation to process personal data under the GDPR.
[4] For example, the factors under the (i) ‘purpose’ test include defining all purposes and justifying the use of each type of data collected, to demonstrate that the chosen approach for achieving a legitimate interest is reasonable; (ii) ‘necessity’ test include developing evidence on why other available or alternative methods of data collection are not feasible; and (iii) ‘balancing’ test include assessing the financial impact on individuals.
*Shaileja Verma and Alice Sharma are tech lawyers based in New Delhi. The views expressed here are those of the authors.