A Dataset of German Legal Documents for Named Entity Recognition

What categories of NEs are typical of legal documents? Which classes should be identified and classified? Which legal documents can be used for a dataset? The dataset consists of 66,723 sentences with 2,157,048 tokens. The sizes of the seven court-specific subsets range from 5,858 to 12,791 sentences and from 177,835 to 404,041 tokens. The share of annotated tokens is about 19-23%. The dataset was built from court decisions of 2017 and 2018 that were published online by the Federal Ministry of Justice and Consumer Protection. The documents come from seven federal courts: the Federal Labour Court (BAG), the Federal Finance Court (BFH), the Federal Court of Justice (BGH), the Federal Patent Court (BPatG), the Federal Social Court (BSG), the Federal Constitutional Court (BVerfG) and the Federal Administrative Court (BVerwG).

Dozier, C., Kondadadi, R., Light, M., Vachher, A., Veeramachaneni, S., Wudali, R.: Named entity recognition and resolution in legal text. In: Francesconi, E., Montemagni, S., Peters, W., Tiscornia, D. (eds.) Semantic Processing of Legal Texts. LNCS (LNAI), vol. 6036, pp. 27-43. Springer, Heidelberg (2010). doi.org/10.1007/978-3-642-12837-0_2

This article is devoted to the recognition of NEs and their respective categories in German legal documents. Legal language is unique and differs greatly from newspaper language. This also applies to the use of NEs: person, location and organization are relatively rare in legal text. Instead, legal text contains domain-specific entities such as designations of legal norms and references to other legal documents (laws, ordinances, regulations, decisions, etc.) that play an essential role. Despite the development of NER for other languages and domains, the legal domain has not yet been dealt with exhaustively. This research also faced the following two challenges. (1) There is no uniform typology of semantic concepts relating to NEs in legal documents; therefore, there are no uniform annotation guidelines for NEs in the legal domain. (2) There are no freely available datasets of legal documents in which NEs have been annotated. In order to adapt the categories to the legal domain, the set of NE classes has been redefined in the approaches described above.

Thus, Dozier et al. [13] focused on legal NEs (e.g., judge, lawyer, court). Cardellino et al. [8] extended the NEs at the NERC level to document, abstraction and act. It is unclear what belongs to these classes and how they were separated from each other. Glaser et al. [18] added reference [23]. However, this was understood as a reference to legal norms, so that other references (to decisions, regulations, legal literature, etc.) were not covered.

Cardellino, C., Teruel, M., Alemany, L.A., Villata, S.: A low-cost, high-coverage legal named entity recognizer, classifier and linker. In: Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law, ICAIL 2017, pp. 9-18. ACM, New York (2017)

Reimers, N., Eckle-Kohler, J., Schnober, C., Kim, J., Gurevych, I.: GermEval-2014: nested named entity recognition with neural networks. In: Faaß, G., Ruppenhofer, J. (eds.) Workshop Proceedings of the 12th Edition of the KONVENS Conference, October 2014, pp. 117-120. Universitätsverlag Hildesheim (2014)

Tkachenko, M., Simanovsky, A.: Named entity recognition: exploring features. In: Jancsary, J. (ed.) 11th Conference on Natural Language Processing, KONVENS 2012, Empirical Methods in Natural Language Processing, Vienna, Austria, 19-21 September 2012. Scientific Series of the ÖGAI, vol. 5, pp. 118-127. ÖGAI, Vienna (2012)

This article describes an approach to named entity recognition (NER) in German-language legal documents. To this end, a dataset consisting of German court decisions was developed. The source texts were manually annotated with 19 semantic classes: person, judge, lawyer, country, city, street, landscape, organization, company, institution, court, trademark, law, ordinance, European legal norm, regulation, contract, court decision and legal literature. The dataset consists of approximately 67,000 sentences and contains about 54,000 annotated entities. The 19 fine-grained classes were automatically generalized to seven coarse-grained classes (person, location, organization, legal norm, case-by-case regulation, court decision and legal literature). Thus, the dataset contains two annotation variants, coarse-grained and fine-grained. For the NER task, conditional random fields (CRFs) and bidirectional long short-term memory networks (BiLSTMs) were applied to the dataset as state-of-the-art models. For each of these two model families, three different models were developed and tested with both the coarse-grained and the fine-grained annotations.
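The generalization from the 19 fine-grained classes to the seven coarse-grained classes described above can be sketched as a simple lookup table. This is a minimal illustration, not the authors' implementation: the labels are English glosses of the class names (with `ordinance` standing for the statutory-ordinance class), the assignment of each fine-grained class to a coarse-grained one is an assumption based on the class names, and the BIO tag format and the helper `generalize` are hypothetical.

```python
# Assumed grouping of the 19 fine-grained classes into the 7 coarse-grained
# classes. The class names follow the text; the assignment itself is an
# assumption made for illustration.
FINE_TO_COARSE = {
    "person": "person",
    "judge": "person",
    "lawyer": "person",
    "country": "location",
    "city": "location",
    "street": "location",
    "landscape": "location",
    "organization": "organization",
    "company": "organization",
    "institution": "organization",
    "court": "organization",
    "trademark": "organization",
    "law": "legal norm",
    "ordinance": "legal norm",
    "European legal norm": "legal norm",
    "regulation": "case-by-case regulation",
    "contract": "case-by-case regulation",
    "court decision": "court decision",
    "legal literature": "legal literature",
}


def generalize(tags):
    """Map BIO tags with fine-grained labels to coarse-grained labels.

    Assumes tags of the form "B-<label>", "I-<label>" or "O" (hypothetical
    format; the text does not specify the tagging scheme).
    """
    coarse = []
    for tag in tags:
        if tag == "O":
            coarse.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        coarse.append(f"{prefix}-{FINE_TO_COARSE[label]}")
    return coarse


if __name__ == "__main__":
    print(generalize(["B-judge", "I-judge", "O", "B-law", "I-law"]))
```

Because the mapping is purely label-based, the coarse-grained variant can be derived from the fine-grained annotations without any additional manual work, which matches the statement that the generalization was automatic.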

BiLSTM models achieve the best performance, with an F1 score of 95.46 for the fine-grained classes and 95.95 for the coarse-grained classes. CRF models reach a maximum of 93.23 for the fine-grained classes and 93.22 for the coarse-grained classes. The work presented in this article was carried out under the umbrella of the European project Lynx, which develops a semantic platform that enables various document processing and analysis applications for the legal domain.

The dataset consists of 66,723 sentences and 2,157,048 tokens; about 19% of the tokens are annotated. It comes in two annotation variants for classifying legal NEs (Table 1), one with a set of 19 fine-grained semantic classes and the other with a set of 7 coarse-grained classes. In total, the dataset contains 53,632 annotated NEs. Person, location and organization account for 25.66% of all annotated instances. The remaining 74.34% fall into categories specific to the legal domain: legal norm NRM, case-by-case regulation REG, court decision RS and legal literature LIT.
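The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the constant and variable names are illustrative, and the 19% annotated-token share is taken at face value, so the derived average entity length is only a rough estimate.

```python
# Figures stated in the text.
SENTENCES = 66_723
TOKENS = 2_157_048
ENTITIES = 53_632
LEGAL_SHARE = 0.7434            # share of domain-specific (legal) entities
ANNOTATED_TOKEN_SHARE = 0.19    # approximate share of annotated tokens

# Average sentence length in tokens.
avg_tokens_per_sentence = TOKENS / SENTENCES

# Absolute entity counts implied by the percentages.
legal_entities = round(ENTITIES * LEGAL_SHARE)
general_entities = ENTITIES - legal_entities

# Rough average entity length in tokens, derived from the approximate
# annotated-token share (an estimate, not a reported figure).
avg_tokens_per_entity = ANNOTATED_TOKEN_SHARE * TOKENS / ENTITIES

if __name__ == "__main__":
    print(f"tokens per sentence: {avg_tokens_per_sentence:.1f}")
    print(f"legal entities:      {legal_entities}")
    print(f"general entities:    {general_entities}")
    print(f"tokens per entity:   {avg_tokens_per_entity:.1f}")
```

Under these assumptions an annotated entity spans several tokens on average, which is plausible given that references to laws, court decisions and legal literature are typically multi-token expressions.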

The largest classes are law GS (34.53%) and court decision RS (23.46%). The other legal entities, i.e. ordinance, European legal norm, regulation, contract and legal literature, are less common (each between 1 and 6% of all annotations). Legal documents differ from texts in other domains and from each other in terms of text-internal and text-external criteria [7, 12, 15, 21], which strongly influences their linguistic and thematic design, citation, structure, etc. This also applies to the NEs used in legal documents. In law texts and administrative regulations, typical NEs such as person, location and organization occur very rarely.