Technical Report Series on Corpus Building




Vol. 2 (March 2013)

Danish Corpora

Uwe Quasthoff, Dirk Goldhahn, Erla Hallsteinsdóttir

Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig

Affiliation of the authors:
Uwe Quasthoff and Dirk Goldhahn: Institut für Informatik, Universität Leipzig, {quasthoff, dgoldhahn}@informatik.uni-leipzig.de
Erla Hallsteinsdóttir: Institut for Sprog og Kommunikation, Syddansk Universitet Odense, erla@sdu.dk

Copyright: Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig, http://asv.informatik.uni-leipzig.de/

Technical Report Series on Corpus Building
Vol. 1: Deutscher Wortschatz 2013
Vol. 2: Danish Corpora

This PDF document was created using the open source tool mwlib. For more information, see http://code.pediapress.com/
PDF generated at: Tue, 15 May 2013 12:19:38 UTC

Contents

Danish corpora
- Introduction to corpus creation
- DAN - a processing related language description
- DAN corpora
- DAN corpus comparison

Unless noted otherwise, each appendix below exists for all ten corpora: dan news 2007, dan news 2008, dan news 2010, dan news 2011, dan newscrawl 2011, dan wikipedia 2007, dan wikipedia 2012, dan web 2002, dan web 2011 and dan mixed 2012.

Processing details
- Appendix: Database summary

Content details
- Appendix: Size of different TLDs (all corpora except the two wikipedia corpora)
- Appendix: Size of largest domains (all corpora except the two wikipedia corpora)
- Appendix: Number of sources by time period (dan news 2007, 2008, 2010 and 2011 only)

Word details
- Appendix: Words by length without multiplicity
- Appendix: Words by length with multiplicity
- Appendix: The most frequent 50 words
- Appendix: Longest words in top 1.000 by rank

Character details
- Appendix: Alphabet as used in the top-100.000 words

Abbreviation details
- Appendix: Most frequent abbreviations
- Appendix: Left neighbors of the full stop
- Appendix: Left neighbors of the full stop with additional internal full stops

Sentences details
- Appendix: Shortest sentences
- Appendix: Longest sentences
- Appendix: Length of sentences in characters
- Appendix: Length of sentences in words

Oddities details
- Appendix: Longest words
- Appendix: Sentences with high average word length (dan news 2007, 2008, 2010 and 2011, dan newscrawl 2011 and dan wikipedia 2012 only)
- Appendix: Problems with sentence segmentation - words ending in a stopword

Danish corpora

Introduction to corpus creation

The Leipzig Corpora Collection (LCC) collects Web-based corpora for many different languages. The main text genres are newspaper texts, Wikipedias and randomly collected web pages. All corpora are processed in the same way:

- Crawling Web pages
- HTML stripping
- Language identification
- Sentence segmentation
- Cleaning: Removal of ill-formed sentences
- Duplicate removal
- Calculation of word frequencies and word co-occurrences

As a result, we have a corpus containing only well-formed sentences in the language under consideration. The sentences are in random order; hence, sharing the corpus does not violate copyright law because it is impossible to reconstruct the original texts.

The pre-processing consists of both language-independent steps (like HTML stripping and duplicate removal) and language-dependent steps (like language identification and sentence segmentation). The language-specific parts in particular are vulnerable to processing problems. The aim of the paper is to identify possible problems and evaluate the results. The following topics are addressed:

- A processing-focused language description
- Language size: How much text is available for this language? What are the biggest sources?
- Corpus description: Genre, size, crawling and processing date.
- Possible problems in language identification: Which languages are similar?
- Character set and alphabet
- Inspecting the word list: Most frequent words, longer high-frequency words and the longest words overall. Word length distribution.
- Can abbreviations confuse sentence segmentation? Information about the abbreviation list.
- Inspecting sentences: Inspect shortest and longest sentences to identify possible segmentation problems. Sentence length distribution.

The paper describes the results of these inspections; the appendices show the exact results for the different corpora. This helps to compare the corpora with respect to quality. In the section Quality Overview, an overall quality description for each corpus is given. All corpora contain only minor problems which are irrelevant for most applications. Where larger problems were found, corpus creation was repeated.
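The language-independent steps at the end of the processing chain described in this introduction (duplicate removal, random sentence order, word frequencies and co-occurrences) can be illustrated with a short Python sketch. It is not the LCC toolchain: the function names, the naive whitespace tokenizer, the restriction to exact duplicates and the fixed random seed are all assumptions made for this example.

```python
import random
from collections import Counter
from itertools import combinations


def build_corpus(sentences, seed=0):
    """Remove exact duplicates and return the sentences in random order."""
    unique = list(dict.fromkeys(sentences))  # keeps the first copy of each sentence
    random.Random(seed).shuffle(unique)      # random order hides the original text flow
    return unique


def word_frequencies(sentences):
    """Count running words with a naive whitespace tokenizer (assumption)."""
    return Counter(word for s in sentences for word in s.split())


def sentence_cooccurrences(sentences):
    """Count unordered pairs of distinct words that occur in the same sentence."""
    pairs = Counter()
    for s in sentences:
        words = sorted(set(s.split()))
        pairs.update(combinations(words, 2))
    return pairs


# Hypothetical toy input with one exact duplicate.
raw = ["Det er en god dag .", "Det er en god dag .", "Vi ses i morgen ."]
corpus = build_corpus(raw)
freq = word_frequencies(corpus)
cooc = sentence_cooccurrences(corpus)
```

Removing duplicates before counting matters: otherwise boilerplate sentences repeated across many pages would inflate both the frequency list and the co-occurrence counts.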

DAN - a processing related language description

General properties of the Danish language

Native name: Dansk
Classification: Indo-European, Germanic, North, East Scandinavian, Danish-Swedish, Danish-Riksmal, Danish
Total number of speakers: 5.6M
Largest countries with number of speakers: Denmark (5.6M)
Source: http://www.dst.dk/en/Statistik/emner/befolkning-og-befolkningsfremskrivning/folketal.aspx

Processing summary

- Latin alphabet with some additional characters
- Full stop is used as sentence boundary and for abbreviations
- Apostrophes are used rarely

Properties important for processing

Alphabet and punctuation

The alphabet is Latin-based, with the following special features (sources: http://en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin and http://en.wikipedia.org/wiki/Danish_and_Norwegian_alphabet):

- Danish includes all 26 base letters and Æ, Ø, Å.
- Additional letter forms: É (a diacritic used for disambiguation: en/et - én/ét).
- In foreign words: Á, À, Â, Ä, É, È, Ê, Ë, Í, Ì, Î, Ï, Ó, Ò, Ô, Ö, Ú, Ù, Û, Ü and more.
- Additional digraphs: EE in foreign words (trainee, frisbee); AA in older texts (replaced by å in 1948) and names (Aalborg, Aarhus). Note: Aa is treated like Å in alphabetical sorting in Danish words only, meaning that Aabenraa is listed under Å (the last letter of the alphabet) and Aachen under A.
- Å, Æ and Ø might occur as AA, AE and OE in newer texts (avoidance of language-specific letters).
- Usual Latin punctuation.
- Usage of uppercase letters: At sentence beginnings and for proper names (of persons, organisations, countries etc.). When a word beginning with Aa is capitalized, only the first letter becomes capital, e.g. Aarhus.

Sentence segmentation and word tokenization

Sentence beginnings

Sentences begin with a capitalized first word.

Abbreviations

Abbreviations can be confused with sentence boundaries: a special abbreviation list has to be inspected. Sources for abbreviations: http://www.dsn.dk/retskrivning/retskrivningsregler/a7-40-60/a7-41-43/a7-42 and http://www.dsn.dk/sprogviden/udgivelser/sprognaevnets-skriftserie-1/flere-udgivelser/Rigtigt%20kort%20indskannet.pdf/at_download/file

Abbreviations with a full stop may appear in the word list without the full stop.

Apostrophes (http://www.dsn.dk/retskrivning/retskrivningsregler/a7-1-6/a7-6)

Use of apostrophes: infrequent.

- in elliptical forms like "bli'", "hva'", "ha'", "ka'" and "la'r" instead of "blive", "hvad", "have", "kan" and "lader" (editorial note: please check why "ha'" is always followed by ";"; this does not fit)
- to mark the combination of a word/radical and inflectional endings: in combination with the definite article: euro'en, PC'en, SMS'erne, OP'ens, CD-ROM'en
- to mark the genitive (instead of "s") in words that end with the letters s, z or x: Marx's ven Wilhelm Liebknecht, Georg Brandes' Plads
- to mark a genitive or plural form with "s": Jan's, foto's (both incorrect but frequent usages), and, in certain cases, other inflectional endings on proper names: Albert'er (2x Albert), Alberte'r (2x Alberte), Borges'ske dimensioner, Crohn's sygdom
- in combination with English (or other foreign) words: chicken satay's, Google's brugsoplevelser
- to mark the combination of numerals and inflectional endings: 60'er-rock
- to mark the combination of foreign words ending in "-ee" and inflectional endings: frisbee'en, yankee'er

Mainly used to mark citations

Sources and ranking (2012)

- Estimated number of web pages containing text:
  - Google.com top-5 words: 3.170.000 results for "i" "og" "at" "er" "på"
  - Google.com top-10 words: 1.190.000 results for "i" "og" "at" "er" "på" "til" "en" "af" "for" "med"
- Rank according to number of speakers (Ethnologue): 111
- Rank according to Wikipedia size (see http://de.wikipedia.org/wiki/Wikipedia:Sprachen): Rank 30 with 172.000 articles.
- Rank according to number of newspapers as found by AbyZ (5/2012): 160 newspapers, rank 15.
- Rank according to number of newspapers with RSS feeds (5/2012): 110 newspapers, rank 14.
- Rank according to our corpus size (9/2012): 19
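The segmentation-relevant properties described in this section (the full stop serves both as sentence boundary and as abbreviation marker, sentences begin with a capitalized word, and the alphabet includes Æ, Ø and Å) can be combined into a simple rule-based splitter. The following Python sketch is an illustration under assumptions, not the segmenter used for the corpora: the abbreviation set is a tiny hypothetical sample, and the boundary rule (sentence-final punctuation, whitespace, then an uppercase letter or digit) is chosen for this example.

```python
import re

# Tiny illustrative sample; a real abbreviation list would be far longer.
ABBREVIATIONS = {"bl.a.", "f.eks.", "ca.", "dvs.", "nr."}

# Candidate boundary: ., ! or ? followed by whitespace and then an uppercase
# letter (including the Danish letters Æ, Ø, Å) or a digit.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÆØÅ0-9])")


def split_sentences(text):
    """Split at candidate boundaries unless the preceding token is an abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        last_token = text[start:match.start()].split()[-1]
        if last_token.lower() in ABBREVIATIONS:
            continue  # the full stop belongs to an abbreviation, not a sentence end
        sentences.append(text[start:match.start()])
        start = match.end()
    sentences.append(text[start:])
    return sentences


text = "Han bor i Aarhus. Prisen er ca. 200 kroner. Hun kommer f.eks. Mandag eller tirsdag."
sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```

With an incomplete abbreviation list, the same rule would cut after "ca." or "f.eks." and produce sentence fragments, which is the kind of error the shortest-sentence inspections are meant to reveal.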

DAN corpora

Quality Overview

Quality Ratings

A: Very good quality. Ready to use (or already used) for a frequency dictionary.
   - Size as large as possible
   - Only minimal errors
   - Multiple genres (if possible)
A-: Small problems identified. They should not affect usage.
B: Native speaker quality.
   - Information about abbreviations and sentence boundaries provided by a native speaker
   - Resulting statistics checked by a native speaker, possible errors corrected
C: Non-native speaker quality.
   - Obvious problems shown in corpus statistics are corrected
D: First version.
   - Pre-processing with default abbreviation list and default sentence boundaries
E: Poor quality: old, outdated or faulty.

Corpus Quality

The quality of the corpora differs slightly because the corpus processing toolchain changed slightly over several years. Moreover, the original data are often no longer available. Hence, improving quality often means removing incomplete or doubtful sentences. Forthcoming editions of all corpora thus might have a slightly smaller number of sentences. This especially applies to near-duplicate sentences, which are removed only sparingly.

The following table shows the quality of the corpora. Minimal errors are still possible and are described in the sections below. All possible major improvements are mentioned here.

Corpus              Quality rating  Known problems                                     To-dos
dan_news_2007       A-              near duplicates, see sentence length distribution  -
dan_news_2008       A               -                                                  -
dan_news_2010       A               -                                                  -
dan_news_2011       A               -                                                  -
dan_newscrawl_2011  A               -                                                  -
dan_wikipedia_2007  A-              near duplicates, see sentence length distribution  -
dan_wikipedia_2012  A               -                                                  -
dan_web_2002        A               -                                                  -
dan_web_2011        A               -                                                  -
dan_mixed_2012      A               -                                                  -

Processing Overview

For more details, see Appendix: Database Summary and Appendix: Number of sources by time period.

Corpus              Size (M sentences)  Size (M running words)  Multiwords  Crawling date   Production date
dan_news_2007       1.0                 18                      0           03-12/2007      2012
dan_news_2008       0.8                 15                      10.311      01-06/2008      2012
dan_news_2010       0.7                 14                      9.693       06-12/2010      2012
dan_news_2011       1.2                 22                      11.762      daily 2011      2012
dan_newscrawl_2011  2.5                 44                      18.202      batch crawling  2012
dan_wikipedia_2007  0.4                 7                       25.833      dump 2007       2011
dan_wikipedia_2012  0.6                 10                      26.875      dump 2012       2012
dan_web_2002        9.5                 155                     0           randomly 2002   2007
dan_web_2011        6.2                 103                     21.247      randomly 2011   2012
dan_mixed_2012      21.4                368                     43.740      -               2012

Content Overview

For more details, see Appendix: Size of different TLDs and Appendix: Size of largest domains.

Corpus              Type of sources  Countries  Number of sources  Publishing date  Biggest source
dan_news_2007       News             dk         42 newspapers      2007             www.dr.dk/
dan_news_2008       News             dk         56 newspapers      2008             www.dr.dk/
dan_news_2010       News             dk         45 newspapers      2010             www.dr.dk/
dan_news_2011       News             dk         36 newspapers      2011             www.dr.dk/
dan_newscrawl_2011  News             dk         73 newspapers      2011 and before  www.arbejderen.dk/
dan_wikipedia_2007  Wikipedia        -          -                  -                -
dan_wikipedia_2012  Wikipedia        -          -                  -                -
dan_web_2002        Web              dk         29.071 domains     2002 and before
dan_web_2011        Web              dk         59.009 domains     2011 and before  aarhus.lokalavisen.dk/
dan_mixed_2012      combined         combined   59.037 domains     2011 and before  www.dr.dk/

Words

Appendix: Words by Length without multiplicity and Appendix: Words by Length with multiplicity show the length distribution for words. The curves should be smooth and decreasing for length >= 5.

Appendix: The Most Frequent 50 Words shows the most frequent stopwords as well as one or more words related to the region.

Appendix: Longest Words in Top-1000 by rank shows the 25 longest words within the top-1000. They usually give an impression of the main topics treated in the corpus.

Appendix: Longest Words with minimum frequency 2 should give an idea of very long words. In the case of processing problems, different types of non-words may appear. This might help to improve the word definition.
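The two word length distributions differ only in how each word is weighted: without multiplicity every distinct word counts once, with multiplicity it counts as often as it occurs. The following Python sketch shows the difference on a hypothetical toy frequency list; the function names and the data are assumptions for illustration.

```python
from collections import Counter


def length_distribution(word_freq, with_multiplicity):
    """Histogram of word lengths; a word counts once or as often as it occurs."""
    hist = Counter()
    for word, freq in word_freq.items():
        hist[len(word)] += freq if with_multiplicity else 1
    return hist


def average_length(hist):
    """Mean word length implied by a length histogram."""
    total = sum(hist.values())
    return sum(length * count for length, count in hist.items()) / total


# Hypothetical toy frequency list (word -> number of occurrences).
word_freq = {"og": 50, "at": 40, "ikke": 10, "København": 2, "universitetet": 1}

types_hist = length_distribution(word_freq, with_multiplicity=False)
tokens_hist = length_distribution(word_freq, with_multiplicity=True)
print(average_length(types_hist), average_length(tokens_hist))
```

Because short stopwords dominate running text, the average with multiplicity is much lower than the average without multiplicity, even in this toy list.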

Corpus              Word length graph without multiplicity  Word length graph with multiplicity  Most Frequent 50 Words  Longest Words in Top-1000  Longest Words with minimum frequency 2
dan_news_2007       okay                                    okay, min. avg. 5.04                 okay                    okay                       URLs, routes
dan_news_2008       okay                                    okay                                 okay                    okay                       okay
dan_news_2010       okay                                    okay                                 okay                    okay                       URLs, routes
dan_news_2011       okay                                    okay                                 okay                    okay                       missing blanks, hex strings
dan_newscrawl_2011  okay                                    okay                                 Publiceret and..        okay                       URLs, routes, missing blanks
dan_wikipedia_2007  okay, min. avg. 10.12                   okay, max. avg. 5.36                 okay                    okay                       URLs, routes
dan_wikipedia_2012  okay                                    okay                                 okay                    okay                       URLs
dan_web_2002        okay, max. avg. 12.07                   okay                                 okay                    okay                       missing blanks, special characters
dan_web_2011        okay                                    okay                                 okay                    okay                       missing blanks, special characters
dan_mixed_2012      okay                                    okay                                 okay                    okay                       all errors as above

Remarks

The average word length (without multiplicity) differs between the text genres. There is an unexpected minimum in the length distribution (with multiplicity) at length 4.

Abbreviations

For sentence boundary detection, abbreviations ending in a full stop are of interest: such abbreviations are usually not used as sentence boundaries. Conversely, abbreviations missing from the list can cause too many sentence boundaries to be generated. The list of abbreviations is of high quality: nearly complete and manually checked.

Due to limitations in the processing chain, this list of abbreviations is only used for sentence boundary detection and is not included in the word list. Hence, abbreviations ending with a full stop appear in the word list without the full stop.

Sentences

Appendix: Shortest sentences shows the shortest declarative, exclamatory and interrogative sentences. In preprocessing, a minimal length for sentences might be specified. Missing abbreviations are often visible as faulty sentence endings.

Appendix: Longest sentences shows the longest declarative, exclamatory and interrogative sentences. Usually, the maximum sentence length is defined as 256 characters (not 256 bytes). Very long exclamatory or interrogative sentences often contain an overlooked sentence boundary.

Appendix: Length of sentences in characters shows the distribution of the sentence length. A large and balanced corpus will result in a smooth and bell-shaped curve. Isolated local maxima usually result from large sets of near-duplicate sentences.
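The sentence length check described here can be approximated by building a histogram of sentence lengths in characters and flagging lengths that stand out sharply from their neighbours. The Python sketch below is only one possible heuristic: the threshold `factor`, the function names and the toy data are assumptions, not the procedure used for the appendices.

```python
from collections import Counter


def char_length_histogram(sentences, max_chars=256):
    """Histogram of sentence lengths; len() counts characters, not bytes."""
    return Counter(len(s) for s in sentences if len(s) <= max_chars)


def isolated_peaks(hist, factor=3.0):
    """Lengths whose count exceeds `factor` times the mean of both neighbours."""
    peaks = []
    for length, count in sorted(hist.items()):
        neighbours = (hist.get(length - 1, 0) + hist.get(length + 1, 0)) / 2
        if count > factor * max(neighbours, 1):
            peaks.append(length)
    return peaks


# Hypothetical toy data: a few sentences of 40-44 characters plus one
# boilerplate sentence of 42 characters repeated 20 times.
sentences = ["x" * n for n in (40, 41, 41, 42, 43, 43, 44)] + ["y" * 42] * 20
hist = char_length_histogram(sentences)
peaks = isolated_peaks(hist)
print(peaks)
```

A flagged length is only a hint: the sentences of that length still have to be inspected to confirm that they are near duplicates rather than a genuine property of the genre.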

+--------------------+---------------------------------------------------------------------+-------------------------------------------+-------------------------------------+--------------------------------+
| Corpus             | Shortest sentences                                                  | Longest sentences                         | Length distribution (in characters) | Length distribution (in words) |
+--------------------+---------------------------------------------------------------------+-------------------------------------------+-------------------------------------+--------------------------------+
| dan_news_2007      | unsymmetric quotation marks                                         | okay                                      | near duplicate peak at 48           | okay                           |
| dan_news_2008      | some unsymmetric quotation marks                                    | okay                                      | sentences longer than 255?          | okay                           |
| dan_news_2010      | okay                                                                | 1 menu list, 2x hex data                  | near duplicate peak at 42?          | okay                           |
| dan_news_2011      | duplicate sentences                                                 | declarative sentences with many time data | near duplicate peak at 42           | okay                           |
| dan_newscrawl_2011 | okay                                                                | declarative sentences with many time data | many near duplicate peaks           | many near duplicate peaks      |
| dan_wikipedia_2007 | declarative sentences beginning with digits and ending with abbrev. | okay                                      | near duplicate peak at 20           | sharp maximum at 10            |
| dan_wikipedia_2012 | okay                                                                | okay                                      | okay                                | okay                           |
+--------------------+---------------------------------------------------------------------+-------------------------------------------+-------------------------------------+--------------------------------+

dan_web_2002 dan_web_2011 dan_mixed_2012 declarative non-sentences, interrogative sentences beginning lowercase or with blank Lowercase beginnings for declarative sentences okay very smooth okay Enumerations, multiple sentences max. 277 characters

Oddities

Appendix: Sentences with high average word length: Average sentences contain many stopwords, and these stopwords are usually short. Hence, they limit the average word length in a sentence. Conversely, sentences with a high average word length are often ill-formed. They may be used to improve pre-processing.

Appendix: Problems with sentence segmentation - Words ending in a stopword: If there are many ill-formed word or sentence boundaries without a blank between two words, they will generate new ill-formed words. The appendix shows the most frequent words ending in an uppercase stopword. If they are infrequent, then the data were of high quality.

+--------------------+-----------------------------------------+-------------------------------+
| Corpus             | Sentences with high average word length | Words ending in a stopword... |
+--------------------+-----------------------------------------+-------------------------------+
| dan_news_2007      | all kinds of errors                     | okay                          |
| dan_news_2008      | okay                                    | okay                          |
| dan_news_2010      | 2x hex strings                          | maxfreq=11                    |
| dan_news_2011      | 2x hex strings, 2x missing blanks       | maxfreq=24                    |
| dan_newscrawl_2011 | 1x missing blanks                       | maxfreq=805                   |
| dan_wikipedia_2007 | (no data)                               | okay                          |
| dan_wikipedia_2012 | okay                                    | maxfreq=27                    |
| dan_web_2002       | missing blanks, underscores             | maxfreq=67                    |
| dan_web_2011       | missing blanks                          | maxfreq=58                    |
| dan_mixed_2012     | missing blanks, underscores             | words containing ";"          |
+--------------------+-----------------------------------------+-------------------------------+
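The two oddity checks can be illustrated with a short sketch. It rests on assumptions: sentences are split on whitespace, "uppercase stopword" is read here as a stopword written with an initial capital (as at the start of a sentence), and the function names and any stopword list passed in are hypothetical.

```python
from collections import Counter


def average_word_length(sentence):
    """Mean number of characters per whitespace-separated word."""
    words = sentence.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def words_ending_in_stopword(words, stopwords):
    """Count words that end in a capitalised stopword, e.g. 'husetDet'.

    Such words typically arise when the blank between two sentences or
    words is missing. A word that is itself just the stopword is skipped.
    """
    capitalised = [s.capitalize() for s in stopwords]
    counts = Counter()
    for word in words:
        for stop in capitalised:
            if len(word) > len(stop) and word.endswith(stop):
                counts[word] += 1
                break  # count each occurrence of a word only once
    return counts
```

The highest count returned by `words_ending_in_stopword` corresponds in spirit to the maxfreq values in the table, assuming the same word list is used.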

DAN corpus comparison

Automated corpus comparison

For the conducted comparisons, the following tests on the top-1000 words are performed: Vectors based on the frequencies of the top-1000 words are created for the analysed languages. As the similarity value, 1-cos(alpha) of the angle alpha between these vectors is computed, so lower values indicate more similar word lists. Identical languages receive a value of 0, distinct languages a value of 1. The same analysis is conducted using the frequencies of the top-1000 typical letter trigrams of the languages.

Monolingual word list comparison (top-1000 words)

As one can expect, the comparisons show:

- The different news corpora have word lists with a maximum distance of 0.19 (dan_newscrawl_2011 and dan_news_2008).
- The web corpora have word lists with a distance of 0.13.
- The wikipedia corpora are similar, with a distance of 0.10.
- The biggest distance of 0.36 can be found between dan_wikipedia_2007 and dan_news_2008.
- The mixed corpus dan_mixed_2012 has a central position within the corpora and has a maximum distance of 0.31 to the wikipedia_2007 corpus.

Multilingual word list comparison (top-1000 words)

Both the comparison of the top-1000 words and the comparison of the letter trigrams used in these words were conducted to find the most similar languages based on these features. Considering words, the nearest language to Danish is Swedish with a distance of 0.47. Considering letter trigrams, the nearest language is Bokmål with a distance of 0.38. These distances are below average. On average, the value for the most similar language to a language in question is 0.58 for trigrams.
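The measure 1-cos(alpha) can be sketched as follows. This is a minimal illustration under assumptions: it uses raw frequencies, whereas the result column name cos_logfreq suggests that the report's values may be based on logarithmic frequencies, and all function names are hypothetical.

```python
import math
from collections import Counter


def top_items(items, n=1000):
    """Frequencies of the n most frequent items (words or trigrams)."""
    return dict(Counter(items).most_common(n))


def letter_trigrams(words):
    """Count all letter trigrams occurring inside the given words."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts


def cosine_distance(freq_a, freq_b):
    """Return 1 - cos(alpha) for two sparse frequency vectors.

    0 means the vectors point in the same direction; 1 means they share
    no item at all.
    """
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[k] * freq_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty list is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

For a word-based comparison, `cosine_distance(top_items(tokens_a), top_items(tokens_b))` would be used; for the trigram variant, the outputs of `letter_trigrams` take the place of the token lists' frequencies.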
The most similar languages based on words: Swedish, Bokmål, Nynorsk

+--------+---------------------+--------------------+-------------+
| source | language_short_name | language_name      | cos_logfreq |
+--------+---------------------+--------------------+-------------+
| dan    | swe                 | Swedish            |    0.469093 |
| dan    | nob                 | Norwegian, Bokmål  |    0.492077 |
| dan    | nno                 | Norwegian, Nynorsk |    0.573548 |
| dan    | fao                 | Faroese            |    0.813491 |
| dan    | isl                 | Icelandic          |    0.828406 |
+--------+---------------------+--------------------+-------------+

The most similar languages based on letter trigrams: Bokmål, Swedish, Dutch

+--------+---------------------+--------------------+-------------+
| source | language_short_name | language_name      | cos_logfreq |
+--------+---------------------+--------------------+-------------+
| dan    | nob                 | Norwegian, Bokmål  |    0.381547 |
| dan    | swe                 | Swedish            |    0.544641 |
| dan    | nld                 | Dutch              |    0.547686 |
| dan    | nno                 | Norwegian, Nynorsk |    0.563022 |
| dan    | deu                 | German             |    0.581681 |
+--------+---------------------+--------------------+-------------+

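Processing details

Appendix to dan news 2007: Database summary

Values for some general parameters

+-----------------------------------------+----------+
| Parameter                               | Value    |
+-----------------------------------------+----------+
| Number of sentences                     |  1019416 |
| Number of running word forms            | 18004757 |
| Number of distinct word forms           |   496351 |
| Number of multiwords                    |        0 |
| Percentage of words with frequency=1    |  54.9835 |
| Number of sentence based co-occurrences |  3323980 |
| Number of neighbour co-occurrences      |   465789 |
+-----------------------------------------+----------+

Appendix to dan news 2008: Database summary

Values for some general parameters

+-----------------------------------------+----------+
| Parameter                               | Value    |
+-----------------------------------------+----------+
| Number of sentences                     |   764570 |
| Number of running word forms            | 14724500 |
| Number of distinct word forms           |   411502 |
| Number of multiwords                    |    10311 |
| Percentage of words with frequency=1    |  53.3893 |
| Number of sentence based co-occurrences |  3271410 |
| Number of neighbour co-occurrences      |   404953 |
+-----------------------------------------+----------+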

Appendix to dan news 2010: Database summary

Values for some general parameters

+-----------------------------------------+----------+
| Parameter                               | Value    |
+-----------------------------------------+----------+
| Number of sentences                     |   734284 |
| Number of running word forms            | 13333010 |
| Number of distinct word forms           |   393468 |
| Number of multiwords                    |     9696 |
| Percentage of words with frequency=1    |  53.8712 |
| Number of sentence based co-occurrences |  2708704 |
| Number of neighbour co-occurrences      |   364115 |
+-----------------------------------------+----------+

Appendix to dan news 2011: Database summary

Values for some general parameters

+-----------------------------------------+----------+
| Parameter                               | Value    |
+-----------------------------------------+----------+
| Number of sentences                     |  1219425 |
| Number of running word forms            | 21802092 |
| Number of distinct word forms           |   520768 |
| Number of multiwords                    |    11762 |
| Percentage of words with frequency=1    |  54.6631 |
| Number of sentence based co-occurrences |  4250226 |
| Number of neighbour co-occurrences      |   538976 |
+-----------------------------------------+----------+

Appendix to dan newscrawl 2011: Database summary

Values for some general parameters

+-----------------------------------------+----------+
| Parameter                               | Value    |
+-----------------------------------------+----------+
| Number of sentences                     |  2495624 |
| Number of running word forms            | 43803329 |
| Number of distinct word forms           |   862781 |
| Number of multiwords                    |    18202 |
| Percentage of words with frequency=1    |  57.0620 |
| Number of sentence based co-occurrences |  6424512 |
| Number of neighbour co-occurrences      |   887568 |
+-----------------------------------------+----------+

Appendix to dan wikipedia 2007: Database summary

Values for some general parameters

+-----------------------------------------+----------+
| Parameter                               | Value    |
+-----------------------------------------+----------+
| Number of sentences                     |   425109 |
| Number of running word forms            |  6890416 |
| Number of distinct word forms           |   399379 |
| Number of multiwords                    |    25833 |
| Percentage of words with frequency=1    |  57.8318 |
| Number of sentence based co-occurrences |  1500578 |
| Number of neighbour co-occurrences      |   213947 |
+-----------------------------------------+----------+
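Some of the general parameters in the database summaries can be approximated from a list of sentences, as in the sketch below. It makes several assumptions: tokens are obtained by whitespace splitting, the percentage of words with frequency=1 is taken relative to the number of distinct word forms, and the function and key names are hypothetical. Multiwords and the co-occurrence counts are not covered, since the report excerpt does not state how they are determined.

```python
from collections import Counter


def database_summary(sentences):
    """Compute basic corpus parameters from a list of sentence strings."""
    tokens = [word for sentence in sentences for word in sentence.split()]
    freq = Counter(tokens)
    # Word forms that occur exactly once in the whole corpus.
    singletons = sum(1 for count in freq.values() if count == 1)
    return {
        "sentences": len(sentences),
        "running_word_forms": len(tokens),
        "distinct_word_forms": len(freq),
        "pct_frequency_1": 100.0 * singletons / len(freq) if freq else 0.0,
    }
```

For example, `database_summary(["en hund", "en kat"])` counts 2 sentences, 4 running word forms and 3 distinct word forms, of which 2 occur only once.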