To Train or Not to Train: AI and the Data Privacy Dilemma
- Kshitij Malhotra*
- Jun 2
Abstract
While access to data that is rich in both quality and quantity is essential for India’s AI ecosystem, India’s data protection framework effectively provides minimal protection to personal data in contexts relevant to AI training. This piece discusses the potential first- and second-order effects of that approach and explores a more balanced way of regulating personal data processing for artificial intelligence in the Indian context.
Introduction
The importance of allowing AI companies to train on Indian data cannot be overstated, especially considering, first, the lack of quality data in vernacular languages (a concern also emphasised in the report on the development of AI Governance Guidelines released in January 2025) and, second, the proliferation of synthetic content. If India seeks to remain competitive in the global arena, it must ensure access to quality data. However, scraping the internet (“web scraping”) as a source of data for training Artificial Intelligence cannot be given a free pass either, as the implications of doing so will extend beyond markets and will adversely affect the rights of citizens.
The dilemma is rather complex, and while India has taken the route of granting near-free rein to web scraping in the context of data protection, the recent reply by the Minister of State for IT, if taken as a sign of regulatory intent, signals otherwise.
In response to a question on web scraping for training AI models, the reply stated that the Digital Personal Data Protection Act, 2023 (“Act”) mandates entities involved in web scraping of publicly available data to implement “robust compliance measures and obtain consent.” The reply, although well-intentioned, runs contrary to India’s data protection framework, specifically Sections 3(c)(ii) and 17(2)(b) of the Act.
This article aims to address this dilemma by, first, analysing the challenges and opportunities that the current framework poses and, second, discussing the possible implications and approaches that a revised regulatory intent would entail, so that commercial interests are not achieved on the pyre of privacy.
I. The current legal framework
Section 3(c)(ii) of the Act states that it shall not apply to personal data that has been made publicly available, thereby exempting such publicly available personal data (and user data) from the scope of the Act. The illustration under Section 3(c) states that an individual who blogs their views on social media and makes their personal data available online is not protected under the Act. The reply is consistent with the Act only to the extent that social media platforms need to obtain consent from the user before processing their data to train LLMs on it. The exception was also referred to by the then Minister of State for IT while introducing a non-personal data collection platform to train Indian LLMs.
The exemption is a marked shift from the approach taken by preceding regulations (the Sensitive Personal Data Rules, 2011) and previous drafts of the Digital Personal Data Protection Act, which recognised greater protection for, and diligence on the part of data fiduciaries in the context of, Sensitive Personal Data. The 2023 Act, however, does away with this distinct category of Sensitive Personal Data altogether. Read in the context of the relevance of training LLMs over the past few years and the statement by the then MoS for IT, there is reason to believe that one of the factors behind the provision was to allow the scraping of publicly available data, so that data-driven business models (and Artificial Intelligence development) do not suffer from the impediments and caution that the compliance burden would create if the Act applied to publicly available personal data scraped from the web.
The exemption for research, archiving, and statistical purposes under Section 17(2)(b) of the Act and Rule 15 of the Draft Digital Personal Data Protection Rules, 2025 (supplemented by the Second Schedule) seems to further the same end: since “research purposes” is not defined under the Act or the Draft Rules, it leaves a door open to process personal data for training under that guise without the provisions of the Act applying. Since the exemption under Section 17(2) only prohibits decisions specific to a particular person, it could be leveraged, as a workaround, to influence opinions at the level of groups. Google, for instance, has been experimenting with Topics, a method that anonymises the identity of individual users but aggregates users with similar interests together in order to serve them ads.
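To make the group-level workaround concrete, the toy sketch below groups pseudonymous users into interest cohorts and targets the cohort rather than the person. It is a loose illustration of the idea described above, not Google’s actual Topics API; the user identifiers and interest labels are hypothetical.

```python
# Minimal sketch (illustrative only) of interest-based cohorting: individual
# identities are dropped, users are bucketed by inferred interests, and ads
# are targeted at the bucket rather than the person. This is a toy
# approximation, not Google's actual Topics API.
from collections import defaultdict

# Hypothetical browsing-derived interest labels per (pseudonymous) user.
user_interests = {
    "user_a": ["cricket", "fitness"],
    "user_b": ["cricket", "politics"],
    "user_c": ["fitness", "cooking"],
}

# Aggregate users into topic cohorts; downstream ad serving only ever sees
# the cohort, so no person-specific decision is made, yet ads or messaging
# can still be steered at the group level.
cohorts = defaultdict(set)
for user, topics in user_interests.items():
    for topic in topics:
        cohorts[topic].add(user)

for topic, members in cohorts.items():
    print(f"serve '{topic}' ads to a cohort of {len(members)} users")
```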
While the question of whether “research” specifically includes AI training within its scope has been raised in a public consultation, and clarification is awaited, the larger picture will continue to hold ground. Hence, any data on the internet, unless the platform or the user specifically restricts it, is open to being scraped in India for training (subject to the evolving position on any copyright claims over such publicly available work).
The freedom afforded to the State and for “research purposes” under Section 17(2) of the Act goes beyond the exemptions already provided under the Act and dilutes the processing requirements to those enlisted under the Second Schedule of the Data Protection Rules. It is a welcome measure in the sense that it would allow for a more competitive and open internet, by preventing bigger, older LLMs from entrenching the first-mover advantage they gained by training on authentic data before it was restricted, and by facilitating research in a general sense. However, the potential privacy and inefficiency implications that could arise from it make the provisions look like a Pandora’s box.
II. Challenges under the current framework of law
Privacy challenges
The European Data Protection Board and NIST have highlighted various issues with training on personal data and the insufficiency of standard practices. While challenges to data privacy exist at all stages of an LLM’s lifecycle, from training to deployment, the implications can be grouped into two broad issues: first, the compromise of personal data, and second, the falsification of personal data.
Compromise of personal data: Various reports across industries have indicated how attackers have been able to extract training samples, model parameters, and other sensitive information through model inversion, membership inference, and prompt injection attacks, among others. Studies have demonstrated how such attacks have been used to reverse engineer patients’ genetic markers from a dataset and to infer whether a movement pattern was part of the training dataset, among other implications. Since these datasets are likely to include personal data, or information that can be pieced together to become personally identifiable, the risks arising from personal information being included in such databases are a matter of concern.
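As a rough illustration of how a membership inference attack works, the sketch below exploits a common signal: an overfit model tends to be more confident on records it was trained on than on unseen records. It is a toy approximation on synthetic data, not a reproduction of any study referenced above; the model, dataset, and threshold are all hypothetical.

```python
# Minimal sketch (illustrative only): a confidence-threshold membership
# inference attack. The attacker assumes a model is more confident on
# examples it was trained on than on unseen ones.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A toy "victim" model trained on synthetic data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
victim = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def membership_score(model, X_queries, y_queries):
    """Confidence the model assigns to the true label of each queried record."""
    probs = model.predict_proba(X_queries)
    return probs[np.arange(len(y_queries)), y_queries]

# Overconfidence on training members versus non-members is the signal an
# attacker exploits to guess whether a record was in the training set.
train_conf = membership_score(victim, X_train, y_train).mean()
test_conf = membership_score(victim, X_test, y_test).mean()
print(f"avg confidence on members: {train_conf:.3f}, non-members: {test_conf:.3f}")

# A crude attack: flag records whose confidence exceeds a threshold as "members".
threshold = 0.9
guessed_members = membership_score(victim, X_test, y_test) > threshold
```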
Falsification of personal data: Informational privacy (one of the nine types of privacy recognised in Justice Chandrachud’s plurality judgment in Puttaswamy-I) stands at risk not only because of what information the LLM may scrape and present, but also because of the misinformation that such algorithms could generate about a person as a result of poisoning attacks at the training stage of an LLM, carried out by introducing additional data or by manipulating the parameters and labelling mechanisms.
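One simple form of the poisoning described above is label flipping, sketched below on a toy dataset: an attacker who can inject or relabel a slice of the training data can skew what the model learns. This is an illustrative sketch only; the dataset, model, and poisoning fraction are hypothetical and not drawn from the article’s sources.

```python
# Minimal sketch (illustrative only): label-flipping data poisoning. An
# attacker who can relabel a fraction of the training data degrades or
# skews what the model "learns" about particular records or groups.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

def flip_labels(y, fraction, rng):
    """Flip the labels of a random fraction of training records."""
    y_poisoned = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]
    return y_poisoned

rng = np.random.default_rng(1)
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, flip_labels(y_train, 0.3, rng))

print("clean accuracy:   ", clean_model.score(X_test, y_test))
print("poisoned accuracy:", poisoned_model.score(X_test, y_test))
```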
Enabling a decision-making machinery
Surveillance Machinery: Apart from the abuse of publicly available personal data being harmful in itself, the accompanying lack of safeguards for exempted personal data could enable the creation of surveillance machinery like ClearviewAI. ClearviewAI relied on publicly available data to identify an individual, their profile, and their whereabouts from a single photo. While the tool was largely accurate, the fact that it attained that level of accuracy years ago, despite being inaccurate in identifying minorities owing to a lack of training data, indicates both what mala fide actors will be able to create with the resources available today and the challenges that may arise if the government adopts such tools at scale.
Discrimination and Inefficient Allocation of Resources: Due to the black-box nature of this technology, decision-making machines also pose the risk of becoming a front for discrimination (similar concerns have also been raised in the report on AI governance guidelines development). There are various incidents where minorities, or backward communities, have gotten the short end of the stick or have been wrongfully incriminated by an AI due to a lack of data and/or historical biases. Considering the lack of accountability of an algorithm vis-à-vis a person, the challenge of discrimination is even more pertinent.
Additionally, such investments risk being an inefficient allocation of resources for the exchequer. A recent report by The Plank on Lucknow’s investment in AI cameras as a part of ‘Mission Shakti’ to reduce crimes against women is a case study close to home, where resources were allocated to a project on the promise of a “smarter”, more efficient replacement. Unfortunately, the project has not been able to assist the police in crime prevention: the system has been unable to appropriately identify people or grapple with the practical difficulty of understanding context, a majority of the alerts generated have been false positives, and officers have, for now, found it more effective to extract information from the accused than from the system. The same also serves as a reminder of how “smart” systems often come at the cost of more basic requirements, such as database efficiency, with servers taking a backseat.
III. Handle with care: Course correction?
Given the discussion above on the current framework under the DPDPA, 2023 as it applies to AI, and the uncertainty surrounding it, this section discusses certain provisions and approaches under the GDPR and in other jurisdictions to understand what safeguards the regulatory body or adjudicator could put in place, or consider for the future:
Protections afforded to Publicly available Personal Data
While Article 9(2)(e) of the GDPR contains a proviso about publicly available data similar to the exception under Section 3(c) of the DPDP Act, it is alive to the privacy violations that may ensue once publicly available data, in general, is brought into the mix. Data Protection Authorities across Canada and Europe, and even the EDPB, have acknowledged that merely uploading data online does not imply that its processing is legitimate. The authorities held that the context and purpose of the uploaded data need to be examined before determining whether it was manifestly made public.
Even if the exemption is availed, one would need to justify the legitimacy of the processing or satisfy the balancing-of-interests test under Article 6 of the GDPR, which bears similarity to the three-prong test for justifying a violation of the right to privacy by the State laid down in Paragraph 180 of Justice Chandrachud’s plurality judgment in Puttaswamy-I.
The forthcoming jurisprudence on data protection in India should follow suit and apply a test akin to the Puttaswamy test to non-state entities, with the modification that, instead of a legitimate state aim, what must be shown is a legitimate objective of the data fiduciary. In a dispute relating to publicly available personal data, the judge must duly consider the legality of the processing, the context in which the data was initially processed, and whether such processing was proportionate.
Not all personal data is the same
The GDPR and previous drafts of the Act have differentiated between grades of personal data. Article 9 of the GDPR restricts the processing of certain types of personal data that could be used as a front for discrimination or make the data principal vulnerable. The Court of Justice of the European Union (CJEU) in Meta Platforms (Paragraph 89) held that if sensitive and non-sensitive data in a database cannot be separated, the restrictions on the processing of sensitive data will apply. Article 10(5) of the EU’s AI Act further restricts the processing of certain kinds of personal data in high-risk systems unless such processing is strictly necessary to detect and correct bias.
This stands in stark contrast to our framework, which at best offers vague protection under draft Rule 12(3). The rule states that algorithms used by significant data fiduciaries shall ensure that the rights of data principals are not violated.
In light of this limited regulatory oversight, and the challenges discussed above, a more substantial framework is the need of the hour. The next section discusses the differing approaches, and their challenges.
IV. Towards Balanced AI Regulation in India
The reply on web scraping, if considered a signal of regulatory intent, may indicate a shift closer to Brussels, or an imitation of it; however, aping the West may not suffice. One major limitation of the EU approach is the suggested offering of an ‘unconditional opt-out’ mechanism, which allows the data principal to withdraw their data without any justification, as referred to in the recent European Data Protection Board (“EDPB”) opinion in the context of direct marketing. Such a mandate (in any form) allows established companies to maintain control over the market: they enjoyed unrestricted access to data at a time when no regulation was in force, allowing their LLMs to develop capabilities and iteratively train on that data years before any regulation was in place. Imposing it now would be as good as shooting the industry in the foot [a concern highlighted in multiple articles (here and here), including this blog, when the OpenAI-ANI lawsuit was being filed]. While the idea of an unconditional opt-out policy aligns with the principles of consent, the requirement ought not to be imposed upon companies below a certain threshold of users, as that would force smaller developers to deal with the compliance burden of unlearning information instead of developing their product at an early stage, which could effectively discourage and deter developers from building tools in and for India.
Additionally, considering the current stance of the US government regarding restrictive regulations in the EU, an over-restrictive approach to data privacy and AI regulation in India could lead India’s tech sector at large to be a casualty of the recent restructuring of global trade.
While the applicability of Rule 12(3) and its extent today are narrow at best, the forthcoming framework for AI governance should fill the gap left by the lack of protections afforded to personal data processed by AI. Clause g(iii) of the Second Schedule of the draft Rules [the schedule prescribing standards for activities under Section 17(2)(b), i.e., the exception for research, archival or statistical purposes] states that such processing must comply with the standards under a policy issued by the central government or any law in force. If AI training is recognised as a “research purpose” under Section 17(2)(b), the role of an AI governance framework in filling the gaps discussed above therefore becomes even more pivotal. Including protections in the guidelines would also help avoid the second-order effects that an amendment of the Act or the Rules could cause.
Definition of Research
The Text and Data Mining exception under the EU’s Directive on copyright and related rights in the Digital Single Market offers an exemption phrased similarly to Section 17(2) of the Act. However, it goes on to define a research organisation as a nonprofit entity or an entity tasked with public-interest research, and permits the storage of mined data for scientific research (subject to lawful access). The GDPR also has a similarly worded provision, Article 89, that exempts research; however, its scope is restricted to historical and scientific research and is subject to safeguards similar to (although relatively more specific than) the Second Schedule of the Draft Rules. Considering the narrow nature of the research exemption in other jurisdictions, the framework needs to clarify the scope of the exemption or issue a separate set of rules that sets out the scope of ‘research’.
Specific requirements for AI Models
The EDPB’s opinion on the processing of personal data for AI models has also suggested specific measures along the lines of evaluating data sources, encryption measures, justifying the methods and purposes of data collection, and other technical measures (like differential privacy and homomorphic encryption) to address the dangers discussed in the previous section. While safeguards are not absent from India’s data protection framework, considering these challenges, specific safeguards should be introduced through sector-specific guidelines on Data Protection Impact Assessments (DPIA) or consent management frameworks, and potentially in the AI governance guidelines.
The inclusion of the above measures will address concerns of personal data being compromised by ensuring that the data is anonymised to prevent reidentification. A greater degree of transparency can be ensured through these measures, as the disclosure of data sources in a DPIA will lead developers to be more conscious of the databases they train the AI models upon. While the Act does offer standard protections that may suffice for an ordinary business, specificity in the measures is required to appropriately address the trickle-down effects of AI systems on user privacy.
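To give a rough sense of what a technical measure like differential privacy involves, the sketch below applies the Laplace mechanism, one of its simplest forms: noise calibrated to a query’s sensitivity is added to an aggregate statistic, so the presence or absence of any single individual’s record is hard to infer. This is an illustrative sketch only; the dataset, query, and epsilon value are hypothetical, and it does not describe any specific measure mandated by the EDPB or the Act.

```python
# Minimal sketch (illustrative only): the Laplace mechanism for releasing a
# differentially private count. Calibrated noise hides whether any single
# individual's record was present in the dataset.
import numpy as np

def laplace_count(records, predicate, epsilon):
    """Release a differentially private count of records matching a predicate.

    The sensitivity of a counting query is 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise of scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Toy dataset of ages; an analyst only ever sees the noisy aggregate.
ages = [23, 35, 41, 29, 52, 61, 38, 27]
print(laplace_count(ages, lambda age: age > 30, epsilon=0.5))
```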
Conclusion
Although the draft Digital Personal Data Protection Rules, 2025, bring India a step closer to an operational data protection framework, many questions remain unanswered on the limits of web scraping. While defining those limits is certainly the unenviable task of walking a tightrope between privacy, innovation, and geopolitical considerations, the framework today offers a set of vague protections (and none in the case of the draft AI governance guidelines) that is harmful to both the developer (who would have to walk the tightrope without knowing where the line is whilst developing a product) and the user.
As discussed above, aping the West is not a solution; however, the personal data protection standards and technical safeguards discussed elsewhere ought to be considered by the regulator. Regulators also need to consider that the protections and expectations at different stages of an AI model’s deployment must be distinguished, so that developers at one stage are not held liable for mistakes made at another.
*Kshitij Malhotra is a Second Year Law Student at the National Law University, Delhi. He would like to thank Shikhar Sarangi for his feedback, and comments on previous drafts of the article.