CINTIL online concordancer

Developed at the University of Lisbon by NLX/FCUL and CLUL



How to?


Cheat sheet / Quick reference

Query syntax cheat sheet
Basic query
    a word matches itself
Query modifiers
    /i        case-insensitive match
    /x        sub-sequence matching
Character expressions
    .         any single character
    [ ]       character from a set
    [^ ]      character from negated set
Repetition operators
    ?         optional
    *         zero or more times
    +         one or more times
    {n}       exactly n times
    {n,}      n or more times
    {,n}      up to n times
    {m,n}     from m to n times
Combining expressions
    e1e2      e1 followed by e2
    |         alternation
    ( )       grouping
Search by annotation
    [keyword=expression]
    [keyword!=expression]
    [key1=exp1 & key2=exp2]
    [key1=exp1 | key2=exp2]
Regular expressions must be enclosed in quotes.
Contractions are expanded and encoded as two tokens, the first of which ends with an underscore.

Quick reference
Field                 Keyword       Values
Orthographic form     orth          any
Part-of-speech tag    pos           full table
Inflection feature    gender        f, m, g
                      number        s, p, n
                      degree        dim, sup, comp
                      person        1, 2, 3
                      time          full table
                      inflection    ifl, nifl
Lemma (base form)     base          any
Named-entity          iob           full table
Metadata              source        writtennews, writtenfiction, writtenother, spoken

Query outcome

The CINTIL Online Concordancer retrieves passages from the CINTIL corpus that contain occurrences of a given target expression.

The target expression is entered in the query text box. The retrieved passages are displayed below that box.

When the "Show tags" box is checked, the concordancer also displays the linguistic annotation.

For each token, this annotation is displayed between square brackets, with a colon separating each field. For instance, the annotation for the common noun gatas will be displayed as follows:

occurrence with annotation    →    gatas [ gato : cn : f : p : O ]
keywords    →    orth base pos gender number IOB
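The displayed annotation can be split back into keyword/value pairs. The following is a minimal sketch in Python, assuming exactly the layout shown above for the common noun gatas; the function name parse_display and the KEYWORDS list are illustrative and not part of the concordancer, and tokens of other categories may display a different set of fields.

```python
# Keyword order taken from the "keywords" line of the example above.
# Assumption: the token is displayed with exactly these five fields.
KEYWORDS = ["base", "pos", "gender", "number", "iob"]

def parse_display(text):
    """Split 'orth [ v1 : v2 : ... ]' into a keyword -> value dict."""
    orth, _, rest = text.partition("[")
    # Drop the closing bracket, then split the fields on the colon separator.
    values = [v.strip() for v in rest.rstrip().rstrip("]").split(":")]
    record = {"orth": orth.strip()}
    record.update(zip(KEYWORDS, values))
    return record

print(parse_display("gatas [ gato : cn : f : p : O ]"))
```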

Note that this annotation is displayed in a slightly different format than the one used in the corpus release. For a description of the latter, check here.

For practical reasons each passage returned with the occurrence contains at most 10 tokens.

Also for practical reasons, not every passage containing the target expression is returned, and the order in which the passages are displayed does not necessarily follow their order of occurrence in the corpus. Note, however, that the outcome of the CINTIL online concordancer can be used as a reference in research, since identical queries return identical outcomes.

Where it is imperative to have access to every occurrence, interested users can acquire a copy of the corpus and run a concordancer of their choice over that local copy.

User interface

The online concordancer user interface is quite self-explanatory.

The "Sort" buttons sort the results alphabetically according to their context.

The right-hand side button sorts the passages using their right side context.

The left-hand side button sorts the passages by their left context, read from right to left, that is, starting with the word closest to the target expression.

The following example illustrates the use of these two buttons over the outcome of the same search for carro (with 2 words of left context and 1 word of right context):

no sort
...guiar um carro novo...
...ir de carro para...
...levar o carro até...
right sort
...levar o carro até...
...guiar um carro novo...
...ir de carro para...
left sort
...ir de carro para...
...levar o carro até...
...guiar um carro novo...
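The two orderings in this example can be reproduced with a short Python sketch. It assumes each passage is stored as a (left context, target, right context) triple and that plain string comparison is a good enough stand-in for the alphabetical order used by the concordancer; the function names are illustrative.

```python
# Each passage: (left context words, target word, right context words).
passages = [
    (["guiar", "um"], "carro", ["novo"]),
    (["ir", "de"], "carro", ["para"]),
    (["levar", "o"], "carro", ["até"]),
]

def right_sort(passages):
    # Compare the right context word by word, left to right.
    return sorted(passages, key=lambda p: p[2])

def left_sort(passages):
    # Compare the left context from right to left: the word closest
    # to the target is compared first.
    return sorted(passages, key=lambda p: p[0][::-1])

def show(passage):
    left, target, right = passage
    return "..." + " ".join(left + [target] + right) + "..."

for p in right_sort(passages):
    print(show(p))
for p in left_sort(passages):
    print(show(p))
```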

Searching orthographic forms

Case-sensitiveness
Search is case-sensitive. For a case-insensitive search, append /i to the orthographic form:

Sub-sequence matching
A query expression matches whole tokens. For instance, gato will not match parts of words, so it will not return regato or obrigatoriamente.

To allow sub-sequence matching, append /x to the orthographic form (which can be combined with the /i mentioned previously).

For instance:
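The effect of the two modifiers on a plain word can be emulated offline with a few lines of Python. This is only a sketch of the behaviour described above, not of how the concordancer is implemented; the function word_matches and its parameters icase and subseq are hypothetical names standing in for /i and /x.

```python
def word_matches(query, token, icase=False, subseq=False):
    """Emulate a plain-word query against one token.

    icase  stands in for the /i modifier (case-insensitive match).
    subseq stands in for the /x modifier (sub-sequence matching).
    """
    if icase:
        query, token = query.lower(), token.lower()
    # Without /x the query must equal the whole token.
    return query in token if subseq else query == token

print(word_matches("gato", "gato"))                 # whole-token match
print(word_matches("gato", "regato"))               # no match without /x
print(word_matches("gato", "regato", subseq=True))  # match with /x
print(word_matches("gato", "Gato", icase=True))     # match with /i
```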

Contractions
Note that in the CINTIL corpus, contractions (e.g. daquela, aos, nas) are expanded and encoded as two tokens, the first of which ends with an underscore symbol (e.g. de_ aquela, a_ os, em_ as).
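When preparing search terms, it can help to expand contractions the same way. The sketch below is a Python illustration whose lookup table holds only the three examples given above; the names EXPANSIONS and expand are hypothetical, and a complete table would have to be built from the corpus documentation.

```python
# Only the three contractions mentioned in the text; not a complete list.
EXPANSIONS = {
    "daquela": ("de_", "aquela"),
    "aos": ("a_", "os"),
    "nas": ("em_", "as"),
}

def expand(word):
    """Return the token sequence for a word, expanding known contractions."""
    return list(EXPANSIONS.get(word, (word,)))

print(expand("daquela"))
print(expand("carro"))
```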

Searching for patterns

It is possible to search with general patterns (also known as regular expressions). A query can include a regular expression, provided that the expression is enclosed in quotes. The usual notational conventions are followed:

Alternation
Alternatives are introduced by the | (vertical bar) character:

Character sets
A set of characters within square brackets matches occurrences of any of those characters: A set can be negated by placing a ^ (caret) symbol immediately after the opening bracket.

Period
The "." (period) matches any single character (letter, digit or symbol):

Optionality
The "?" (question mark) makes the character or expression preceding it optional:

Iteration
There are three ways of expressing iteration. The * (star) operator matches the character or expression preceding it zero or more times:

The + (plus) operator is similar, but requires at least one occurrence of the character or expression preceding it:

Finally, {l,u} bounds the number of iterations by a lower (l) and an upper (u) value. Either bound may be omitted: {l,} means "at least l times", {,u} means "at most u times", and {n} means "exactly n times":

Grouping
Parentheses are used to group expressions. The operators described above can then be applied to the whole expression in parentheses as if it were a single character:

Note that any of these expressions may also be modified by the /i and /x described previously.

For instance:
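The operators of this section, together with the two modifiers, can be tried out offline against a small word list. The following Python sketch uses the standard re module as an approximation; it assumes that the concordancer's pattern syntax behaves like Python's for these operators, which the text above does not confirm in detail. The word list and the function matching (with icase and subseq standing in for /i and /x) are made up for illustration.

```python
import re

TOKENS = ["gato", "gata", "gatos", "gatas", "Gato",
          "rato", "regato", "carro", "carros"]

def matching(pattern, tokens=TOKENS, icase=False, subseq=False):
    """Return the tokens matched by a pattern.

    By default the pattern must cover the whole token; subseq=True
    (standing in for /x) allows a match anywhere inside the token,
    and icase=True (standing in for /i) ignores case.
    """
    flags = re.IGNORECASE if icase else 0
    test = re.search if subseq else re.fullmatch
    return [t for t in tokens if test(pattern, t, flags)]

print(matching(r"gat[oa]"))               # character set
print(matching(r"gat[oa]s?"))             # optionality
print(matching(r"(g|r)ato"))              # grouping and alternation
print(matching(r"car{2}os?"))             # bounded iteration
print(matching(r"gato", icase=True))      # with /i
print(matching(r"gat[oa]", subseq=True))  # with /x
```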

Searching through linguistic information

Each token is associated with linguistic information, encoded by means of annotation tags. Each tag consists of a field and its value, written in square brackets ([field=value]). For example, [gender=m], [time=pi], etc.

Each field is instantiated by a keyword.

The values can be matched with any of the methods described above:

Field-pattern pairs can be combined by using logical operators: & (ampersand) for conjunction and | (vertical bar) for disjunction:

In addition, the negation symbol ! (exclamation mark) matches tokens whose field values do not conform to a given pattern:
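The logic of these annotation constraints can be sketched in Python by treating a token as a dictionary of keyword/value pairs. This only illustrates how =, !=, & and | combine; all function names are hypothetical, and the treatment of a missing field as an empty value is an assumption of this sketch, not documented behaviour.

```python
import re

# Annotation of the example token gatas, as shown under "Query outcome".
token = {"orth": "gatas", "base": "gato", "pos": "cn",
         "gender": "f", "number": "p", "iob": "O"}

def field_is(keyword, pattern):
    """[keyword=pattern]: the field value must match the whole pattern."""
    # Assumption: a missing field is compared as an empty string.
    return lambda tok: re.fullmatch(pattern, tok.get(keyword, "")) is not None

def field_is_not(keyword, pattern):
    """[keyword!=pattern]: the field value must not match the pattern."""
    return lambda tok: not field_is(keyword, pattern)(tok)

def both(a, b):
    """[... & ...]: conjunction of two constraints."""
    return lambda tok: a(tok) and b(tok)

def either(a, b):
    """[... | ...]: disjunction of two constraints."""
    return lambda tok: a(tok) or b(tok)

print(both(field_is("pos", "cn"), field_is("gender", "f"))(token))
print(either(field_is("gender", "m"), field_is("number", "p"))(token))
print(field_is_not("number", "s")(token))
```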

Orthographic form (again)

The orthographic form itself can be matched via the keyword orth:

Part-of-speech

To select occurrences with a given part-of-speech (POS) category, use the pos keyword:

Here is the list of POS tags.

Nominal inflection

The keywords gender and number take, respectively, the values f (feminine) or m (masculine), and the values s (singular) or p (plural). They match occurrences with the selected inflection features:

Some tokens may bear degree features, accessed through the degree keyword:

Verbal inflection

To match tokens according to their verbal inflection features, use the person, time and number keywords:

Here is the list of verbal inflection tags.

Infinitives can occur inflected or not inflected. This information is matched through the inflection keyword, with the values ifl and nifl.

Lemma

In order to match tokens by their lemma, the base keyword can be used:

Named-entity

To match tokens that are part of an expression naming an entity, use the iob keyword:

Here is the list of named-entity tags.

Metadata

Metadata can be used to restrict matches to a given type of text by means of the meta command:

For a list of metadata fields and values, see here.

Advanced queries

Through the combination of the different search options described above, it is possible to construct advanced queries and uncover relevant linguistic information:
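When many queries have to be prepared, the bracketed annotation constraints can be assembled programmatically. The Python sketch below only builds query strings in the forms listed in the cheat sheet ([keyword=expression], [keyword!=expression], and pairs joined by & or |); the helper names are hypothetical, and the resulting strings still have to be pasted into the query box.

```python
def constraint(keyword, value, negate=False):
    """Build 'keyword=value' or, when negated, 'keyword!=value'."""
    return f"{keyword}{'!=' if negate else '='}{value}"

def token_query(*constraints, joiner="&"):
    """Wrap one or more constraints in square brackets.

    joiner is '&' for conjunction or '|' for disjunction.
    """
    return "[" + f" {joiner} ".join(constraints) + "]"

# Feminine common nouns, using values shown elsewhere on this page.
print(token_query(constraint("pos", "cn"), constraint("gender", "f")))
# Tokens that are not masculine.
print(token_query(constraint("gender", "m", negate=True)))
```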

Aligning matches

To make the outcome of a query more readable, it can be split into two columns by using the ^ (caret) symbol:




© All rights reserved