CINTIL online concordancer

Developed at the University of Lisbon by NLX/FCUL and CLUL



What's in?



Corpus composition

CINTIL is a corpus of Portuguese with 1 million annotated tokens, each of which has been verified by human expert annotators. The annotation covers part of speech, lemma and inflection for the open classes, multi-word expressions belonging to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition).

Over one third of the corpus consists of transcribed spoken material, about half of which is transcribed informal conversation.

The rest of the corpus consists of written material. The majority (58.73%) of the written part consists of articles from newspapers and magazines, such as Jornal Público, Diário de Notícias, Revista Visão, etc. The remainder of the written part is mostly fiction.

A more detailed breakdown of the composition of CINTIL is presented in the following table:

Breakdown of the CINTIL corpus

Type                  Share of corpus      Tokens
Written                                   689,124
  News                         33.96%     404,690
  Fiction                      16.80%     200,194
  Other                         7.07%      84,240
Spoken                                    502,622
  Formal/Natural                8.18%      97,499
  Formal/Media                  7.45%      88,727
  Formal/Phone                  4.05%      48,284
  Informal/Private             18.26%     217,604
  Informal/Public               4.05%      48,221
  Informal/Phone                0.19%       2,287
Total                                   1,191,746
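
The figures in the breakdown can be rechecked with a few lines of arithmetic. The following is a minimal sketch in Python: the token counts are copied from the table, while the helper name `share` and the variable names are illustrative and not part of any CINTIL tool.

```python
# Token counts copied from the CINTIL breakdown table.
WRITTEN = {"News": 404_690, "Fiction": 200_194, "Other": 84_240}
SPOKEN = {
    "Formal/Natural": 97_499,
    "Formal/Media": 88_727,
    "Formal/Phone": 48_284,
    "Informal/Private": 217_604,
    "Informal/Public": 48_221,
    "Informal/Phone": 2_287,
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals (hypothetical helper)."""
    return round(100 * part / whole, 2)


written_total = sum(WRITTEN.values())
spoken_total = sum(SPOKEN.values())
corpus_total = written_total + spoken_total

# Tokens in the three "Informal/..." rows of the spoken part.
informal_total = sum(n for name, n in SPOKEN.items() if name.startswith("Informal/"))

print("written:", written_total, "spoken:", spoken_total, "total:", corpus_total)
print("news as % of corpus:", share(WRITTEN["News"], corpus_total))
print("news as % of written part:", share(WRITTEN["News"], written_total))
print("spoken as % of corpus:", share(spoken_total, corpus_total))
print("informal as % of spoken part:", share(informal_total, spoken_total))
```

The per-row percentages in the table are shares of the whole corpus, whereas the 58.73% quoted for newspaper and magazine articles is a share of the written part only.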

Companion tools and online services

You may also be interested in the companion tools. These tools deliver output in strict accordance with CINTIL's annotation conventions.

They include the following individual tools covering analysis and generation procedures:

These tools were bundled into four suites of self-contained functionality and made available as the following online services:

Annotation guidelines

The linguistic information encoded in CINTIL adheres to the annotation guidelines described here. For practical reasons, however, the concordancer displays the annotation in a slightly different format when the "Show tags" box is checked. For more details, see query outcome.

Tagset

Part-of-speech tags

Tag            Category                                     Examples
ADJ            Adjectives                                   bom, brilhante, eficaz, …
ADV            Adverbs                                      hoje, já, sim, felizmente, …
CARD           Cardinals                                    zero, dez, cem, mil, …
CJ             Conjunctions                                 e, ou, tal como, …
CL             Clitics                                      o, lhe, se, …
CN             Common Nouns                                 computador, cidade, ideia, …
DA             Definite Articles                            o, os, …
DEM            Demonstratives                               este, esses, aquele, …
DFR            Denominators of Fractions                    meio, terço, décimo, %, …
DGTR           Roman Numerals                               VI, LX, MMIII, MCMXCIX, …
DGT            Arabic Numerals                              0, 1, 42, 12345, 67890, …
DM             Discourse Marker                             olá, …
EADR           Electronic Addresses                         http://www.di.fc.ul.pt, …
EOE            End of Enumeration                           etc
EXC            Exclamation                                  ah, ei, …
GER            Gerunds                                      sendo, afirmando, vivendo, …
GERAUX         Gerund "ter"/"haver" in compound tenses      tendo, havendo
IA             Indefinite Articles                          uns, umas, …
IND            Indefinites                                  tudo, alguém, ninguém, …
INF            Infinitive                                   ser, afirmar, viver, …
INFAUX         Infinitive "ter"/"haver" in compound tenses  ter, haver, …
INT            Interrogatives                               quem, como, quando, …
ITJ            Interjection                                 bolas, caramba, …
LTR            Letters                                      a, b, c, …
MGT            Magnitude Classes                            unidade, dezena, dúzia, resma, …
MTH            Months                                       Janeiro, Dezembro, …
NP             Noun Phrases                                 idem, …
ORD            Ordinals                                     primeiro, centésimo, penúltimo, …
PADR           Part of Address                              Rua, av., rot., …
PNM            Part of Name                                 Lisboa, António, João, …
PNT            Punctuation Marks                            ., ?, (, …
POSS           Possessives                                  meu, teu, seu, …
PPA            Past Participles not in compound tenses      sido, afirmados, vivida, …
PP             Prepositional Phrases                        algures, …
PPT            Past Participle in compound tenses           sido, afirmado, vivido, …
PREP           Prepositions                                 de, para, em redor de, …
PRS            Personals                                    eu, tu, ele, …
QNT            Quantifiers                                  todos, muitos, nenhum, …
REL            Relatives                                    que, cujo, tal que, …
STT            Social Titles                                Presidente, drª., prof., …
SYB            Symbols                                      @, #, &, …
TERMN          Optional Terminations                        (s), (as), …
UM             "um" or "uma"                                um, uma
UNIT           Abbreviated Measurement Unit                 kg., km., …
VAUX           Finite "ter" or "haver" in compound tenses   temos, haveriam, …
V              Verbs (other than PPA, PPT, INF or GER)      falou, falaria, …
WD             Week Days                                    segunda, terça-feira, sábado, …

Tags for multi-word expressions

Tag            Category                                     Examples
LADV1…LADVn    Multi-Word Adverbs                           de facto, em suma, um pouco, …
LCJ1…LCJn      Multi-Word Conjunctions                      assim como, já que, …
LDEM1…LDEMn    Multi-Word Demonstratives                    o mesmo, …
LDFR1…LDFRn    Multi-Word Denominators of Fractions         por cento
LDM1…LDMn      Multi-Word Discourse Markers                 pois não, até logo, …
LITJ1…LITJn    Multi-Word Interjections                     meu Deus
LPRS1…LPRSn    Multi-Word Personals                         a gente, si mesmo, V. Exa., …
LPREP1…LPREPn  Multi-Word Prepositions                      através de, a partir de, …
LQD1…LQDn      Multi-Word Quantifiers                       uns quantos, …
LREL1…LRELn    Multi-Word Relatives                         tal como, …

Tags specific to the spoken corpus

Tag            Category
EMP            Emphasis
EL             Extra-linguistic
PL             Para-linguistic
FRG            Fragment
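
The numbered multi-word tags (LADV1…LADVn and the like) suggest that each token of an expression carries the class name plus its position in the expression. Under that reading, a tagged token sequence can be regrouped into whole expressions. The sketch below is only an illustration: the input shape (a list of `(form, tag)` pairs), the function name `group_multiword`, and the interpretation of the numeric suffix as a 1-based position are assumptions, not something this page specifies.

```python
import re
from typing import List, Optional, Tuple

# "L" + class name + 1-based position, e.g. LADV1, LPREP3 (assumed reading of the tagset).
MWE_TAG = re.compile(r"^L([A-Z]+)([1-9][0-9]*)$")


def group_multiword(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge runs tagged LXXX1..LXXXn into one (expression, XXX) pair.

    Tokens with ordinary tags are passed through unchanged.
    """
    merged: List[Tuple[str, str]] = []
    forms: List[str] = []              # forms of the expression being collected
    open_class: Optional[str] = None   # class of the expression being collected

    def flush() -> None:
        nonlocal forms, open_class
        if forms:
            merged.append((" ".join(forms), open_class))
        forms, open_class = [], None

    for form, tag in tokens:
        match = MWE_TAG.match(tag)
        if match is None:
            flush()
            merged.append((form, tag))
            continue
        cls, position = match.group(1), int(match.group(2))
        # A token continues the open expression only if class and position line up.
        if not (cls == open_class and position == len(forms) + 1):
            flush()
            open_class = cls
        forms.append(form)

    flush()
    return merged


# Hypothetical tagged input built from examples in the tagset table.
sample = [("de", "LADV1"), ("facto", "LADV2"), (",", "PNT"), ("falou", "V")]
print(group_multiword(sample))
```

Single-word tags that merely start with "L", such as LTR, are left alone because they carry no numeric suffix.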

Inflection tags

Tag    Description

Tags for nominal categories
m      Masculine
f      Feminine
g      Underspecified gender
s      Singular
p      Plural
n      Underspecified number
dim    Diminutive
sup    Superlative
comp   Comparative

Tags for verbs
1      First Person
2      Second Person
3      Third Person
pi     Presente do Indicativo (present indicative)
ppi    Pretérito Perfeito do Indicativo (preterite indicative)
ii     Pretérito Imperfeito do Indicativo (imperfect indicative)
mpi    Pretérito Mais que Perfeito do Indicativo (pluperfect indicative)
fi     Futuro do Indicativo (future indicative)
c      Condicional (conditional)
pc     Presente do Conjuntivo (present subjunctive)
ic     Pretérito Imperfeito do Conjuntivo (imperfect subjunctive)
fc     Futuro do Conjuntivo (future subjunctive)
imp    Imperativo (imperative)

Tags for infinitive verbs
ifl    Inflected
nifl   Not Inflected
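
The inflection tags can be expanded into readable labels with a simple lookup. The sketch below is hedged: the tag-to-description mapping is copied from the table, but the way tags are combined in the input (hyphen-separated chunks such as `ppi-3s`, with gender, number and person tags written back to back inside a chunk) is an assumption made for illustration and should be checked against the annotation guidelines. The function name `decode_inflection` is hypothetical.

```python
from typing import List

# Tag descriptions copied from the inflection tag table.
INFLECTION = {
    "m": "Masculine", "f": "Feminine", "g": "Underspecified gender",
    "s": "Singular", "p": "Plural", "n": "Underspecified number",
    "dim": "Diminutive", "sup": "Superlative", "comp": "Comparative",
    "1": "First Person", "2": "Second Person", "3": "Third Person",
    "pi": "Presente do Indicativo",
    "ppi": "Pretérito Perfeito do Indicativo",
    "ii": "Pretérito Imperfeito do Indicativo",
    "mpi": "Pretérito Mais que Perfeito do Indicativo",
    "fi": "Futuro do Indicativo",
    "c": "Condicional",
    "pc": "Presente do Conjuntivo",
    "ic": "Pretérito Imperfeito do Conjuntivo",
    "fc": "Futuro do Conjuntivo",
    "imp": "Imperativo",
    "ifl": "Inflected", "nifl": "Not Inflected",
}


def decode_inflection(code: str) -> List[str]:
    """Expand an inflection string into descriptions.

    Assumed layout (not confirmed by this page): chunks separated by "-".
    A chunk is first looked up whole; if that fails it is read as a
    sequence of single-character tags.
    """
    labels: List[str] = []
    for chunk in code.split("-"):
        if chunk in INFLECTION:
            labels.append(INFLECTION[chunk])
            continue
        for char in chunk:
            if char not in INFLECTION:
                raise ValueError(f"unknown inflection tag {char!r} in {code!r}")
            labels.append(INFLECTION[char])
    return labels


print(decode_inflection("ppi-3s"))
print(decode_inflection("ms"))
```

Looking a chunk up whole before splitting it matters for tags such as `fc` or `pc`, which would otherwise be misread as two single-character tags.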

Named-entity tags

Position tags

Tag   Description
B-    beginning
I-    inside
O     outside

Semantic types

Type  Description    Example
PER   person         ...o[O] João[B-PER] Silva[I-PER] disse[O]...
ORG   organization   ...a[O] Universidade[B-ORG] de[I-ORG] Lisboa[I-ORG] comprou[O]...
LOC   location       ...de[O] Londres[B-LOC] a[O] Paris[B-LOC]...
WRK   work           ...a[O] Mona[B-WRK] Lisa[I-WRK] está[O]...
MSC   other cases    ...o[O] RMS[B-MSC] Titanic[I-MSC] afundou[O]...
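
The position tags combine with the semantic types so that a B- tag opens an entity, following I- tags of the same type extend it, and O marks tokens outside any entity. As a rough illustration, the sketch below recovers entity spans from text written in the `token[TAG]` notation used in the examples. The function name `extract_entities` and the decision to drop an I- tag that does not continue an open entity are assumptions of this sketch, not documented CINTIL behaviour.

```python
import re
from typing import List, Optional, Tuple

# One token in the notation of the examples: form[TAG], e.g. João[B-PER] or disse[O].
TOKEN = re.compile(r"([^\s\[\]]+)\[(O|[BI]-[A-Z]+)\]")


def extract_entities(text: str) -> List[Tuple[str, str]]:
    """Return (semantic type, entity text) pairs in order of appearance."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []          # words of the entity being collected
    etype: Optional[str] = None    # semantic type of the entity being collected

    for form, tag in TOKEN.findall(text):
        if tag.startswith("B-"):
            # A new entity begins; close the previous one, if any.
            if words:
                entities.append((etype, " ".join(words)))
            etype, words = tag[2:], [form]
        elif tag.startswith("I-") and tag[2:] == etype:
            words.append(form)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close whatever is open and skip this token.
            if words:
                entities.append((etype, " ".join(words)))
            etype, words = None, []

    if words:
        entities.append((etype, " ".join(words)))
    return entities


print(extract_entities("a[O] Universidade[B-ORG] de[I-ORG] Lisboa[I-ORG] comprou[O]"))
print(extract_entities("de[O] Londres[B-LOC] a[O] Paris[B-LOC]"))
```

Because every entity starts with its own B- tag, two adjacent entities of the same type are kept apart rather than merged.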



© All rights reserved