[GDPR] An AI Framework to Support Decisions on GDPR Compliance: Part 6
6 Discussion
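To support the Italian Public Administration (PA) in ensuring the GDPR compliance of public documents and the security of personal data, we formulated INTREPID, an AI-based framework that automatically detects data breaches in PA documents. As the backbone of the framework, we used language resources developed for Italian language processing and adapted them to GDPR-specific knowledge. In addition, we defined a text data engineering module based on Bag-of-Words (BoW) and NER information, and used machine learning algorithms for classification. Finally, we prepared a corpus of Italian PA documents for training and evaluation, using a pipeline designed to balance two constraints: the need to replace any identified or identifiable information with artificial identifiers, and the fact that GDPR checks do not apply to anonymous information. An in-depth evaluation on the prepared corpus highlighted the effectiveness of INTREPID and of the configuration of all the components it is built on.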
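The replacement of identified or identifiable information with artificial identifiers can be pictured with a minimal sketch. It is not the pipeline used for the corpus: the span format, the label names, and the `pseudonymize` helper are hypothetical, and the sketch assumes that entity spans are already available (for example from an NER tool) and do not overlap.

```python
from typing import Dict, List, Tuple

# An entity span: (start offset, end offset, label). Hypothetical format,
# standing in for whatever an NER tool would return.
Span = Tuple[int, int, str]


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Replace each entity span with an artificial identifier such as PER_1.

    The same surface string with the same label always maps to the same
    identifier, so the text still shows that an entity occurs and recurs.
    Assumes the spans do not overlap.
    """
    ids: Dict[Tuple[str, str], str] = {}
    counters: Dict[str, int] = {}
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        key = (label, text[start:end])
        if key not in ids:
            counters[label] = counters.get(label, 0) + 1
            ids[key] = f"{label}_{counters[label]}"
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(ids[key])            # artificial identifier
        cursor = end
    pieces.append(text[cursor:])           # untouched tail of the document
    return "".join(pieces)


# Invented example sentence and spans, for illustration only.
text = "Mario Rossi lives in Bari. Mario Rossi requested a health subsidy."
spans = [(0, 11, "PER"), (21, 25, "LOC"), (27, 38, "PER")]
result = pseudonymize(text, spans)
print(result)
```

Keeping a typed identifier in place of each entity, rather than deleting it, preserves the signal that a person or place is mentioned, which is what a downstream compliance check would still need.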
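Beyond the accuracy of the results achieved by INTREPID, some limitations still need to be addressed in order to move a step further toward an effective tool that reduces the risk of data breaches in PA documents.
- Lack of an explanation mechanism. Nowadays, the ability to explain the decisions of an AI system is crucial for end users to accept an automated decision process. This is in line with the GDPR's recognition of a "right to explanation" for all decisions, including AI-based ones, that may significantly affect individuals. To this end, future work could explore explainable AI mechanisms that enrich data breach alerts with an explanation of how the breach was found in the text.
- Locating data breaches within documents. The proposed framework performs the classification task at the document level. It lets us identify PA documents that may not be GDPR compliant, but it does so without locating where in the document the data breach occurs.
- Variety of data breaches. The classification model of the proposed framework was trained on data breaches concerning the unlawful disclosure of health information. Future research could work on generalizing the classification model to multiple categories of data breaches.
- Multilingual support. The proposed framework was designed for Italian PA documents. However, new multilingual models have recently emerged and have proved highly accurate on a variety of text classification tasks (Conneau et al., 2020). These could be explored in a multilingual system for data breach detection.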
7 Conclusion
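In this paper, we proposed a new AI-based framework to help automate the data protection workflow of the Italian PA. The framework builds on the idea that the data protection of public documents can be formulated as a binary text classification problem. Based on this idea, we prepared a labeled text corpus of public documents published online by several municipalities of the Italian PA. The corpus contains text documents that human experts labeled as either GDPR compliant or not GDPR compliant. We described an AI framework that learns a text classification model from this labeled corpus, so that the learned model can predict whether a new public document is GDPR compliant. To this end, we selected SpaCy and Tint, two NLP tools that can process Italian, and adapted them to GDPR-specific knowledge. Specifically, we used the NER tools to process the prepared corpus and locate named entities of several categories. We introduced three groups of NER features extracted from the occurrences of the recognized named entities. We used these NER features to enrich the traditional BoW representation of the text documents, and trained classifiers to label documents as GDPR compliant or not. As classification algorithms, we used linear Support Vector Machines, Random Forest, and XGBoost.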
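The idea of enriching a BoW representation with NER information can be sketched as follows. This is an illustration under stated assumptions, not the feature set used in the paper: the three groups of NER features are not reproduced here, and a single per-category entity count stands in for them. The vocabulary, the entity categories, and all function names are hypothetical.

```python
from collections import Counter
from typing import List


def bow_features(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def ner_features(entity_labels: List[str], categories: List[str]) -> List[int]:
    """Count recognized entities per category (an illustrative NER feature)."""
    counts = Counter(entity_labels)
    return [counts[category] for category in categories]


def document_vector(
    tokens: List[str],
    entity_labels: List[str],
    vocabulary: List[str],
    categories: List[str],
) -> List[int]:
    """Concatenate BoW counts and NER counts into one feature vector."""
    return bow_features(tokens, vocabulary) + ner_features(entity_labels, categories)


vocabulary = ["contributo", "sanitario", "delibera"]  # hypothetical terms
categories = ["PER", "LOC", "ORG"]                    # hypothetical NER labels

tokens = ["delibera", "contributo", "sanitario", "contributo"]
entity_labels = ["PER", "PER", "LOC"]  # labels an NER tool might emit

vector = document_vector(tokens, entity_labels, vocabulary, categories)
print(vector)
```

Vectors of this shape, one per labeled document, could then be passed to any standard classifier, for example scikit-learn's `LinearSVC` or `RandomForestClassifier`, or `XGBClassifier` from the xgboost package.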
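We evaluated the effectiveness of the proposed framework in terms of the agreement between the annotations predicted by the NER tools and those of domain experts, and of the sensitivity of the text classification model's accuracy to the groups of extracted features. In particular, the evaluation on the prepared corpus shows that Tint outperforms SpaCy in this domain with respect to agreement with the domain experts' annotations. It also shows that the proposed feature extraction stage works reasonably well, since it allows us to train a text classification model that detects documents with data breaches with high accuracy and a low false alarm rate. This conclusion holds regardless of the classification algorithm, although the highest accuracy was achieved by training the classifier with XGBoost on both BoW-based and NER-based features.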
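For readers less familiar with the two classification measures mentioned above, the following sketch shows how accuracy and the false alarm rate are commonly computed from a confusion matrix. The labels are invented toy data, not results from the paper, and treating "not GDPR compliant" as the positive class (label 1) is an assumption made here for illustration.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), with 1 = not GDPR compliant (assumed positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of documents whose predicted label matches the expert label."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_alarm_rate(fp: int, tn: int) -> float:
    """Fraction of compliant documents wrongly flagged as containing a breach."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy labels for eight documents (invented for illustration).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
far = false_alarm_rate(fp, tn)
print(acc, far)
```

A low false alarm rate matters in this setting because every flagged document would presumably need a manual review by a data protection expert.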
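To the best of our knowledge, this study is the first attempt to combine interdisciplinary competences to develop a framework that helps the Italian PA automate (or semi-automate) the analysis of the GDPR compliance of public documents. The next stage of this research will extend the evaluation of the framework's effectiveness by adding to the corpus new documents that may involve different categories of data breaches, and will use our annotated corpus to improve the performance of the NER models. In addition, the framework needs to be extended to other categories of personal data, to integrate XAI techniques that explain data breach alerts, and to include AI techniques that locate data breaches within documents flagged as not GDPR compliant. Finally, we plan to explore how multilingual resources perform on the GDPR compliance analysis problem.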
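Code availability
The code supporting the findings of this study and the data extracted to train the classification algorithms are available from the corresponding author upon reasonable request.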
Notes
-
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), https://eur-lex.europa.eu/eli/reg/2016/679/oj
-
Norms contained in the Italian Personal Data Protection Code (Legislative Decree 196/2003) were aligned with provisions introduced by GDPR with the legislative decree n. 101/2018 published in the Official Gazette n. 205 on September 4, 2018.
-
https://www.dataguidance.com/news/italy-garante-fines-trento-health-authority-150000
-
https://ec.europa.eu/info/sites/default/files/commission-white-paper-artificial-intelligence-feb2020_en.pd (last access: 2021/10/13)
-
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206 (last access: 2021/10/13)
-
We used doccano as the platform for the annotation: https://github.com/doccano/doccano.
-
Legal references were extracted by the Linkoln tool https://gitlab.com/IGSG/LINKOLN/linkoln.
-
https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
References
-
Adams, A., Aili, E., Aioanei, D., Jonson, R., Mickelsson, L., Mikmekova, D., Roberts, F., Mikmekova, D., Fernandez Valencia, J., & Wechsler, R. (2019). Anonymate: a toolkit for anonymizing unstructured chat data. In Proceedings of the workshop on NLP and pseudonymisation, pp. 1–7. Finland: Linköping Electronic Press, Turku.
-
Al-Abdulkarim, L., Atkinson, K., & Bench-Capon, T. (2016). A methodology for designing systems to reason with legal cases using abstract dialectical frameworks. Artificial Intelligence and Law, 24, 1–49. https://doi.org/10.1007/s10506-016-9178-1.
-
Attardi, G., Basile, V., Bosco, C., Caselli, T., Dell’Orletta, F., Montemagni, S., Patti, V., Simi, M., & Sprugnoli, R. (2015). State of the art language technologies for italian: the EVALITA 2014 perspective. Intelligenza Artificiale, 9(1), 43–61. https://doi.org/10.3233/IA-150076.
-
Bansal, A., & Kaur, S. (2018). Extreme gradient boosting based tuning for classification in intrusion detection systems. In M. Singh, P. K. Gupta, V. Tyagi, J. Flusser, & T. Ören (Eds.) Advances in computing and data sciences, communications in computer and information science, (vol. 905 pp. 372–380). https://doi.org/10.1007/978-981-13-1810-8_37. Singapore: Springer.
-
Biesner, D., Ramamurthy, R., Stenzel, R., Lübbering, M., Hillebrand, L. P., Ladi, A., Pielka, M., Loitz, R., Bauckhage, C., & Sifa, R. (2022). Anonymization of german financial documents using neural network-based language models with contextual word representations. International Journal of Data Science and Analytics, 13(2), 151–161. https://doi.org/10.1007/s41060-021-00285-x.
-
Blume, P. (2016). Impact of the EU general data protection regulation on the public sector. Journal of Data Protection & Privacy, 1(1), 53–63.
-
Brandsen, A., Verberne, S., Wansleeben, M., & Lambers, K. (2020). Creating a dataset for named entity recognition in the archaeology domain. In Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, pp. 4573–4577. European Language Resources Association (ELRA).
-
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324 .
-
Chen, T., & Guestrin, C. (2016). Xgboost: a scalable tree boosting system. In B. Krishnapuram, M. Shah, A. J. Smola, C.C. Aggarwal, D. Shen, & R. Rastogi (Eds.) Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 785–794. Association for Computing Machinery (ACM). https://doi.org/10.1145/2939672.2939785.
-
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1), 37–46.
-
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In D. Jurafsky, J. Chai, N. Schluter, & J.R. Tetreault (Eds.) Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, pp. 8440–8451. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.747.
-
Contissa, G., Docter, K., Lagioia, F., Lippi, M., Micklitz, H. W., Palka, P., Sartor, G., & Torroni, P. (2018). CLAUDETTE meets gdpr: automating the evaluation of privacy policies using artificial intelligence. SSRN Electronic Journal, 1–59.
-
Csányi, G. M., Nagy, D., Vági, R., Vadász, J. P., & Orosz, T. (2021). Challenges and open problems of legal document anonymization. Symmetry, 13(8).
-
Dadgostari, F., Guim, M., Beling, P. A., Livermore, M. A., & Rockmore, D. N. (2020). Modeling law search as prediction. Artificial Intelligence and Law, 29, 3–34. https://doi.org/10.1007/s10506-020-09261-5.
-
Datta, P. (2020). Digital transformation of the italian public administration: a case study. Communications of the Association for Information Systems pp. 252–272. https://doi.org/10.17705/1CAIS.04611.
-
Davari, M., & Bertino, E. (2019). Access control model extensions to support data privacy protection based on GDPR. In C. Baru, J. Huan, L. Khan, X. Hu, R. Ak, Y. Tian, R. S. Barga, C. Zaniolo, K. Lee, & Y.F. Ye (Eds.) Proceedings of the 2019 IEEE international conference on big data, big data 2019, pp. 4017–4024. IEEE. https://doi.org/10.1109/BigData47090.2019.9006455.
-
De Felice, I., Dell’Orletta, F., Venturi, G., Lenci, A., & Montemagni, S. (2018). Italian in the trenches: linguistic annotation and analysis of texts of the great war. In E. Cabrio, A. Mazzei, & F. Tamburini (Eds.) Proceedings of the 5th italian conference on computational linguistics, CLiC-it 2018, CEUR Workshop Proceedings, (vol. 2253 pp. 1–5).
-
De Martino, G., Pio, G., & Ceci, M. (2022). PRILJ: an efficient two-step method based on embedding and clustering for the identification of regularities in legal case judgments. Artificial Intelligence and Law, 30, 359–390. https://doi.org/10.1007/s10506-021-09297-1.
-
Di Cerbo, F., & Trabelsi, S. (2018). Towards personal data identification and anonymization using machine learning techniques. In A. Benczúr, B. Thalheim, T. Horváth, S. Chiusano, T. Cerquitelli, C. Sidló, & P. Z. Revesz (Eds.) New trends in databases and information systems, ADBIS 2018, communications in computer and information science, pp. 118–126. https://doi.org/10.1007/978-3-030-00063-9_13. Cham: Springer.
-
Di Nicola, P., Grossi, P., & Preti, A. (2016). Rethinking the organization of public administration through the enhancement of human resources. The Istat case. RIEDS-Rivista Italiana di Economia, Demografia e Statistica- The Italian Journal of Economic, Demographic and Statistical Studies, 70(1), 17–28.
-
Dias, M., Bone, J., Ferreira, J., Ribeiro, R., & Maia, R. (2020). Named entity recognition for sensitive data discovery in portuguese. Applied Sciences, 10, 2303. https://doi.org/10.3390/app10072303.
-
Francopoulo, G., & Schaub, L. P. (2020). Anonymization for the GDPR in the context of citizen and customer relationship management and NLP. In Proceedings of the workshop on legal and ethical issues (Legal2020), pp. 9–14. European Language Resources Association (ELRA).
-
Ghosh, M., Raihan, M. M., Raihan, M., Akter, L., Bairagi, A., Alshamrani, S., & Masud, M. (2021). A comparative analysis of machine learning algorithms to predict liver disease. Intelligent Automation and Soft Computing, 29, 917–928. https://doi.org/10.32604/iasc.2021.017989.
-
Grouin, C., Rosset, S., Zweigenbaum, P., Fort, K., Galibert, O., & Quintard, L. (2011). Proposal for an extension of traditional named entities: from guidelines to evaluation, an overview. In Proceedings of the 5th linguistics annotation workshop (The LAW V), pp. 92–100. USA: Association for Computational Linguistics, Portland, Oregon.
-
Harkous, H., Fawaz, K., Lebret, R., Schaub, F., Shin, K. G., & Aberer, K. (2018). Polisis: automated analysis and presentation of privacy policies using deep learning. In Proceedings of the 27th USENIX conference on security symposium, SEC’18 (pp. 531–548). USA: USENIX Association.
-
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer Series in Statistics. New York: Springer. https://doi.org/10.1007/978-0-387-84858-7.
-
Hoofnagle, C. J., van der Sloot, B., & Borgesius, F. Z. (2019). The European Union general data protection regulation: what it is and what it means. Information & Communications Technology Law, 28(1), 65–98. https://doi.org/10.1080/13600834.2019.1573501.
-
Hripcsak, G., & Rothschild, A. S. (2005). Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3), 296–298. https://doi.org/10.1197/jamia.M1733.
-
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec & C. Rouveirol (Eds.) Proceedings of 10th european conference on machine learning: ECML-98, lecture notes in computer science, (vol. 1398 pp. 137–142). Berlin, Heidelberg: Springer. https://doi.org/10.1007/BFb0026683.
-
Kingston, J. (2017). Using artificial intelligence to support compliance with the general data protection regulation. Artificial Intelligence and Law, 25, 429–443. https://doi.org/10.1007/s10506-017-9206-9.
-
Magnini, B., Pianta, E., Girardi, C., Negri, M., Romano, L., Speranza, M., Bartalesi Lenzi, V., & Sprugnoli, R. (2006). I-CAB: the italian content annotation bank. In Proceedings of the 5th international conference on language resources and evaluation (LREC ’06), pp. 963–968. Italy: European Language Resources Association (ELRA), Genoa.
-
Mc Cullagh, K., Tambou, O., & Bourton, S. (eds.) (2019). National adaptations of the GDPR, 1st edn. Blogdroiteuropéen: Collection Open Access Book.
-
Meszaros, J., & Ho, C. (2021). AI research and data protection: can the same rules apply for commercial and academic research under the GDPR? Computer Law & Security Review, 105532, 41. https://doi.org/10.1016/j.clsr.2021.105532.
-
Mozes, M., & Kleinberg, B. (2021). No intruder, no validity: evaluation criteria for privacy-preserving text anonymization. Preprint at arXiv:2103.09263.
-
Nothman, J., Ringland, N., Radford, W., Murphy, T., & Curran, J. R. (2013). Learning multilingual named entity recognition from wikipedia. Artificial Intelligence, 194, 151–175. https://doi.org/10.1016/j.artint.2012.03.006.
-
Palmero Aprosio, A., & Moretti, G. (2018). Tint 2.0: an all-inclusive suite for NLP in italian. In Proceedings of the 5th italian conference on computational linguistics, CLiC-it 2018, CEUR workshop proceedings, (vol. 2253, pp. 1–7).
-
Passaro, L. C., Lenci, A., & Gabbolini, A. (2017). Informed PA: a NER for the italian public administration domain. In R. Basili, M. Nissim, & G. Satta (Eds.) Proceedings of the 4th italian conference on computational linguistics, CLiC-it 2017, CEUR Workshop Proceedings, Vol. 2006.
-
Ricci, A. (2018). E-government, transparency and personal data protection.: a new analysis’ approach to an old juridical issue. Central and Eastern European eDem and eGov Days, 325, 125–135. https://doi.org/10.24989/ocg.v325.11.
-
Romano, M. F., Baldassarini, A., & Pavone, P. (2020). Text mining of public administration documents: preliminary results on judgments. In D. F. Iezzi, D. Mayaffre, & M. Misuraca (Eds.) Text analytics: advances and challenges. proceedings of the 14th international conference on the statistical analysis of textual data (JADT 2018), studies in classification, data analysis, and knowledge organization, pp. 117–126. Cham: Springer. https://doi.org/10.1007/978-3-030-52680-1_10.
-
Sartor, G., & Lagioia, F. (2020). The impact of the General Data Protection Regulation (GDPR) on artificial intelligence. European Parliamentary Research Service. https://doi.org/10.2861/293.
-
Savic, D., & Veinovic, M. (2018). Challenges of general data protection regulation (GDPR). In Proceeding of the 5th international scientific conference on information technology and data related research, sinteza 2018, pp. 23–30. Serbia: Singidunum University, Belgrade. https://doi.org/10.15308/Sinteza-2018-23-30.
-
Selbst, A. D., & Powles, J. (2017). Meaningful information and the right to explanation. International Data Privacy Law, 7(4), 233–242. https://doi.org/10.1093/idpl/ipx022.
-
Silva, P., Gonçalves, C., Godinho, C., Antunes, N., & Curado, M. (2020). Using natural language processing to detect privacy violations in online contracts. In Proceedings of the 35th annual ACM symposium on applied computing, SAC 2020, pp. 1305–1307. New York: Association for Computing Machinery (ACM). https://doi.org/10.1145/3341105.3375774 (to appear in print).
-
Sovrano, F., Vitali, F., & Palmirani, M. (2020). Modelling GDPR-compliant explanations for trustworthy ai. In A. Kő, E. Francesconi, G. Kotsis, A. M. Tjoa, & I. Khalil (Eds.) Electronic Government and the Information Systems Perspective. Proceedings of the 9th international conference on electronic government and the information systems perspective, EGOVIS 2020, lecture notes in computer science, (vol. 12394 pp. 219–233). Cham: Springer. https://doi.org/10.1007/978-3-030-58957-8_16.
-
Stamova, I., & Draganov, M. (2020). Artificial intelligence in the digital age. In Proceedings of the international scientific conference “digital transformation on manufacturing, infrastructure and service”, IOP conference series: materials science and engineering, vol. 940. https://doi.org/10.1088/1757-899X/940/1/012067.
-
Sánchez, D., Viejo, A., & Batet, M. (2021). Automatic assessment of privacy policies under the GDPR. Applied Sciences 11(4). https://doi.org/10.3390/app11041762.
-
Tagarelli, A., & Simeri, A. (2021). Unsupervised law article mining based on deep pre-trained language representation models with application to the italian civil code. Artificial Intelligence and Law, 30, 417–473. https://doi.org/10.1007/s10506-021-09301-8.
-
van der Aalst, W. M. P. (2016). Process Mining- Data Science in Action, 2nd edn. Berlin Heidelberg: Springer. https://doi.org/10.1007/978-3-662-49851-4.
-
van Engers, T. M. (2005). Legal engineering: a structural approach to improving legal quality. In A. Macintosh, R. Ellis, & T. Allen (Eds.) Proceedings of the 25th SGAI international conference on innovative techniques and applications of artificial intelligence, AI-2005. https://doi.org/10.1007/1-84628-224-1_1 (pp. 3–10). London: Springer.
-
Yadav, V., & Bethard, S. (2019). A survey on recent advances in named entity recognition from deep learning models. Preprint at arxiv:1910.11470.
-
Zaman, R., Cuzzocrea, A., & Hassani, M. (2019). An innovative online process mining framework for supporting incremental GDPR compliance of business processes. In C. Baru, J. Huan, L. Khan, X. Hu, R. Ak, Y. Tian, R.S. Barga, C. Zaniolo, K. Lee, & Y.F. Ye (Eds.) Proceedings of the 2019 IEEE international conference on big data, big data 2019, pp. 2982–2991. https://doi.org/10.1109/BigData47090.2019.9005705.
-
Zaman, R., & Hassani, M. (2020). On enabling GDPR compliance in business processes through data-driven solutions. SN Computer Science, 1(4), 210. https://doi.org/10.1007/s42979-020-00215-x.
Acknowledgements
We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU, as well as the PON “Governance e capacità istituzionale” 2014–2020 project “Modelli, Sistemi e Competenze per l’implementazione dell’Ufficio per il Processo/Start UPP” (CUP: H29J22000390006), funded by the Italian Ministry for Universities and Research (MIUR).