Natural Language Processing Terminology

General terminology related to NLP (Natural Language Processing):

  • Collocation – A sequence of words that occur together unusually often. Thus “red wine” is a collocation, whereas “the wine” is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, “maroon wine” sounds very odd.
  • Bigram – A pair of adjacent words, such as “red wine” (see the sketch after this list).
  • Corpus (plural corpora) – A large and structured set of texts, also called a text corpus (now usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language domain.
  • Entropy – A measure of the uncertainty associated with a random variable. In this context the term by itself usually refers to the Shannon entropy, which is a measure of disorder or, more precisely, unpredictability. For example, a series of tosses of a fair coin has maximum entropy, since there is no way to predict what will come next. A series of tosses of a two-headed coin has zero entropy, since the coin will always come up heads. Most collections of data in the real world lie somewhere in between (see the sketch after this list).
  • Maximum Entropy Modelling – A general-purpose machine learning technique originally developed for statistical physics, which has since been employed in a wide variety of fields, including computer vision and natural language processing. The guiding principle: when characterizing unknown events with a statistical model, always choose the model with maximum entropy among those consistent with the data. The technique can be usefully applied whenever there is a complex process whose internals are unknown but which must be modelled by the computer. Like any statistical modelling technique, it relies on the existence of a data sample that shows, for given sets of input, what output the process generates. This sample is analysed, and from it a model is generated that encapsulates all the rules about the process that could be inferred from the sample. The model is then used to predict the output of the process when supplied with sets of input not found in the sample data (see the sketch after this list).
  • Conditional Frequency Distribution – A collection of frequency distributions, each one for a different “condition”; the condition will often be the category of the text. A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition (see the sketch after this list).
  • Semantic Network – Represents semantic relations among concepts, and is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent the semantic relations between them.
  • Language Resource – Refers to data-only resources such as lexicons, corpora, thesauri or ontologies.
  • Morpheme – The smallest component of a word, or other linguistic unit, that has semantic meaning. The term is used as part of the branch of linguistics known as morpheme-based morphology. A morpheme is composed of phonemes (the smallest linguistically distinctive units of sound) in spoken language, and of graphemes (the smallest units of written language) in written language.
  • Ontology – Formal representation of a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain, and may be used to define the domain.
  • Anaphora – An instance of an expression referring to another, usually earlier, expression (its antecedent). In “The monkey took the banana and ate it,” the word “it” is anaphoric under the strict definition: it refers back to “the banana.”
  • Lexicon – The vocabulary of a language, including its words and expressions.
  • Lexeme – An abstract unit of morphological analysis in linguistics that roughly corresponds to the set of forms taken by a single word. For example, in the English language, run, runs, ran and running are forms of the same lexeme, conventionally written as “run.”
  • Lemma – The canonical form of a lexeme. Lexeme, in this context, refers to the set of all the forms that have the same meaning, and lemma refers to the particular form that is chosen by convention to represent the lexeme (see the lemmatizer sketch after this list).
  • Coreference – Occurs when multiple expressions in a sentence or document refer to the same thing; or, in linguistic jargon, they have the same “referent.” For example, in the sentence “Mary said she would help me,” “she” and “Mary” most likely refer to the same person, in which case they are coreferent. Similarly, in “I saw Scott yesterday. He was fishing by the lake,” “Scott” and “he” are most likely coreferent.
  • POS Tagging – The process of classifying words into their parts of speech and labeling them accordingly, known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset (see the sketch after this list).
  • tf–idf weight (term frequency–inverse document frequency) – A weight often used in information retrieval and text mining. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query. One of the simplest ranking functions is computed by summing the tf–idf weight of each query term; many more sophisticated ranking functions are variants of this simple model (see the sketch after this list).
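
The sketches below are minimal Python illustrations of some of the terms above, using invented toy data throughout; they are not definitive implementations. First, bigrams and collocation candidates: pairing adjacent words and counting the pairs is the usual first step toward spotting pairs that occur together unusually often.

    from collections import Counter

    # Toy token list; in practice this would be a full tokenized corpus.
    tokens = ["red", "wine", "is", "nice", "and", "red", "wine",
              "is", "cheap", "but", "the", "wine", "is", "old"]

    # A bigram is simply a pair of adjacent words.
    bigrams = list(zip(tokens, tokens[1:]))

    # Frequent bigrams are collocation candidates; a real collocation
    # finder would also test against chance co-occurrence.
    counts = Counter(bigrams)
    for pair, n in counts.most_common(3):
        print(pair, n)   # ('red', 'wine') appears twice in this toy corpus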
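
Shannon entropy, in bits, is H = −Σ p(x) log₂ p(x). A small sketch computing it for the coins mentioned above:

    import math

    def shannon_entropy(probs):
        """Entropy in bits: -sum(p * log2(p)) over outcomes with p > 0."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(shannon_entropy([0.5, 0.5]))  # fair coin: 1.0 bit (maximum for two outcomes)
    print(shannon_entropy([1.0, 0.0]))  # two-headed coin: 0.0 bits (fully predictable)
    print(shannon_entropy([0.9, 0.1]))  # biased coin: ~0.47 bits, somewhere in between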
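
For maximum entropy modelling, logistic regression is a familiar concrete instance: among the models consistent with the feature constraints observed in the training sample, it selects the one with maximum entropy. A sketch using scikit-learn, with invented toy features and labels:

    from sklearn.linear_model import LogisticRegression

    # Invented features [contains_digit, token_length] for a made-up
    # "is this token a year?" task -- purely illustrative.
    X = [[1, 4], [1, 4], [0, 3], [0, 7], [1, 4], [0, 5]]
    y = [1, 1, 0, 0, 1, 0]

    # Fit the model on the sample, then predict outputs for inputs
    # not found in the sample, as described above.
    model = LogisticRegression()
    model.fit(X, y)
    print(model.predict([[1, 4], [0, 6]]))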
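
A conditional frequency distribution can be kept as one plain frequency distribution (a Counter) per condition; the (condition, word) pairs below are invented, with the text category as the condition. NLTK ships a ready-made ConditionalFreqDist built the same way.

    from collections import Counter, defaultdict

    # Invented (condition, word) pairs; the condition is a text category.
    pairs = [("news", "election"), ("news", "court"), ("news", "election"),
             ("romance", "love"), ("romance", "heart"), ("romance", "love")]

    # One frequency distribution per condition.
    cfd = defaultdict(Counter)
    for condition, word in pairs:
        cfd[condition][word] += 1

    print(cfd["news"]["election"])       # 2
    print(cfd["romance"].most_common())  # [('love', 2), ('heart', 1)]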
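
To make lemma and lexeme concrete, a sketch with NLTK’s WordNet lemmatizer, assuming NLTK is installed and its WordNet data has been downloaded (resource names can vary between NLTK versions):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # fetch the WordNet data if missing

    lemmatizer = WordNetLemmatizer()
    # Every form of the lexeme maps to the lemma "run" when treated as a verb.
    for form in ["run", "runs", "ran", "running"]:
        print(form, "->", lemmatizer.lemmatize(form, pos="v"))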
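
A POS tagging sketch with NLTK’s default tagger, which labels tokens with Penn Treebank tags; it assumes the tokenizer and tagger models have been downloaded (again, resource names vary between NLTK versions):

    import nltk

    # Fetch the tokenizer and tagger models if missing.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("The monkey took the banana and ate it")
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('monkey', 'NN'), ('took', 'VBD'), ...]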
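
Finally, tf–idf. There are several variants; one simple formulation is tf(t, d) × log(N / df(t)), where tf(t, d) is the count of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. A sketch on an invented three-document corpus:

    import math
    from collections import Counter

    docs = [["red", "wine", "is", "nice"],
            ["the", "wine", "is", "old"],
            ["the", "banana", "is", "ripe"]]

    def tf_idf(term, doc, docs):
        tf = Counter(doc)[term]                 # count of term in this document
        df = sum(1 for d in docs if term in d)  # documents containing the term
        idf = math.log(len(docs) / df) if df else 0.0
        return tf * idf

    print(tf_idf("red", docs[0], docs))   # highest: unique to this document
    print(tf_idf("wine", docs[0], docs))  # lower: appears in 2 of 3 documents
    print(tf_idf("is", docs[0], docs))    # 0.0: appears in every document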

    I will be updating this list as I encounter new terms…