Getting started with spaCy

Manoj Kumar Patra

Artificial Intelligence | August 28, 2019

spaCy is a popular library for advanced NLP in Python.

At the center of spaCy is the nlp object, which contains the processing pipeline.

# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

The nlp object here:

  1. contains the processing pipeline
  2. includes language-specific rules used for tokenization, etc.

spacy.lang contains language classes for a variety of languages. See the spaCy documentation for the full list of supported languages.
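As a minimal sketch of the point above, other language classes can be imported from spacy.lang in the same way as English. The German class and the sample sentence are illustrative choices, not from the original example. spacy.blank() is a shortcut that builds the same kind of object from a language code.

```python
import spacy
from spacy.lang.de import German  # illustrative choice of language

# Create an nlp object for German, just like English() above
nlp_de = German()
doc_de = nlp_de("Hallo Welt!")
tokens_de = [token.text for token in doc_de]
print(tokens_de)

# spacy.blank() creates an empty pipeline from a language code
nlp_blank = spacy.blank("de")
print(nlp_blank.lang)
```

Neither call needs a downloaded model, because only the language-specific tokenization rules are used.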

The Doc object

The Doc object is created by processing a string of text with the nlp object.

doc = nlp("Hello world!")

print(doc.text)

# Iterate over tokens in a doc
for token in doc:
  print(token.text)  

OUTPUT

Hello world!

Hello
world
!

Some points about the doc object:

  1. short for document
  2. lets us access information about the text in a structured way, and no information is lost
  3. behaves like a normal Python sequence
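The points above can be sketched in a few lines. This is only an illustration using the same "Hello world!" text: it checks the length, uses negative indexing as with a list, and rebuilds the original string from the tokens to show that no information is lost.

```python
from spacy.lang.en import English

nlp = English()
text = "Hello world!"
doc = nlp(text)

# A Doc supports len() and indexing like a normal Python sequence
num_tokens = len(doc)
last_token = doc[-1]
print(num_tokens, last_token.text)

# text_with_ws keeps each token's trailing whitespace,
# so joining the tokens gives back the original text
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```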

The Token object

Token objects represent the tokens in a document.

# Index into the doc to get a token at a specific index
token = doc[1]  
# Get the verbatim token text via the .text attribute
print(token.text)  

OUTPUT

world

The Span object

  1. A Span object is a slice of the document consisting of one or more tokens.
  2. It is only a view of the Doc and doesn't contain any data itself.

# A slice from the Doc is a Span object (tokens 1 and 2 of "Hello world!")
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)
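To make the "view" idea concrete, here is a small sketch (assuming the same "Hello world!" text) showing that a Span only records where it starts and ends and points back to its parent Doc.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

# Slice out the last two tokens
span = doc[1:3]
print(span.text)

# The span stores token offsets into the parent doc
print(span.start, span.end)

# It refers back to the same Doc object rather than copying it
print(span.doc is doc)

# Tokens inside the span keep their index in the doc
print(span[0].i)
```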

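Lexical attributes of the Token object

These depend only on the entry in the vocabulary, not on the token's context. The exception in the list below is token.i, which is the token's position in the Doc.

token.i -> Index of the token in the Doc
token.text -> Text
token.is_alpha -> Consists of alphabetic characters
token.is_punct -> Is punctuation
token.like_num -> Resembles a number (e.g. "10" or "ten")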
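A short sketch of these attributes in use. The sentence "It costs $5." is an illustrative choice. No statistical model is needed because the attributes come from the vocabulary and the tokenizer.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("It costs $5.")

# Collect index, text and a few lexical flags for every token
rows = [
    (token.i, token.text, token.is_alpha, token.is_punct, token.like_num)
    for token in doc
]

for row in rows:
    print(row)
```

Here "$" and "5" are split into separate tokens, and only "5" is flagged by like_num.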

spaCy’s Statistical Models

Statistical models enable spaCy to predict linguistic attributes in context, such as whether a word is a verb or whether a span of text is a person's name.

Linguistic attributes generally include:

  1. Part-of-speech tags
  2. Syntactic dependencies
  3. Named entities

These models are trained on large datasets of labeled example texts and can be further fine-tuned with our own data.

Some of the available pre-trained model packages from spaCy are:

  1. en_core_web_sm
  2. en_core_web_md
  3. en_core_web_lg

More models are listed in the spaCy documentation. To download a package:

python -m spacy download en_core_web_sm

To load a pre-trained package:

import spacy

nlp = spacy.load("en_core_web_sm")

The pre-trained package provides:

  1. Binary weights – Enable spaCy to make predictions
  2. Vocabulary
  3. Meta information – Information related to which language class to use and how to configure the processing pipeline

NOTE: The models provided by spaCy make predictions using their binary weights, so they are not shipped with the data they were trained on.

Predicting Part-of-Speech tags

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Process the text
doc = nlp("He drank the coffee")

# Iterate over the tokens
for token in doc:
  # Print the text and the predicted part-of-speech tag
  print(token.text, token.pos_)

OUTPUT

He PRON
drank VERB
the DET
coffee NOUN

NOTE: In spaCy, attributes that return strings usually end with an underscore. Attributes without the underscore return an ID.
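The same convention can be seen without a statistical model. The sketch below uses token.lower and token.lower_ as an illustrative pair (pos and pos_ behave the same way once a model is loaded) and looks the ID up in the shared string store.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")
token = doc[0]

# With the underscore: the readable string
print(token.lower_)

# Without the underscore: an integer ID (a hash of the string)
print(type(token.lower))

# The vocabulary's string store maps the ID back to the string
print(nlp.vocab.strings[token.lower] == token.lower_)
```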

The dep_ attribute returns the predicted dependency label.
The head attribute returns the syntactic head token for a word (the parent token this word is attached to).

for token in doc:
  print(token.text, token.pos_, token.dep_, token.head.text)  

OUTPUT

He PRON nsubj drank
drank VERB ROOT drank
the DET det coffee
coffee NOUN dobj drank

spaCy uses a standardized label scheme. Some common labels are:

Label   Description                                                  Example
nsubj   nominal subject attached to the verb (in this case drank)    He
dobj    direct object attached to the verb (in this case drank)      coffee
det     determiner (article)                                         the
Predicting Named Entities

The doc.ents property lets us access the named entities predicted by the model. It returns a tuple of Span objects.

# Process a text
doc = nlp(u"Apple is looking to buy U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
  print(ent.text, ent.label_)

OUTPUT

Apple ORG
U.K. GPE
$1 billion MONEY
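To look at the structure of doc.ents without downloading a model, the sketch below sets two entities by hand instead of predicting them. The labels and token offsets are assigned manually for illustration. A real model would produce them itself, as in the example above.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("Apple is looking to buy U.K. startup for $1 billion")

# Hand-made entity spans (not model predictions)
org = Span(doc, 0, 1, label="ORG")                        # "Apple"
money = Span(doc, len(doc) - 3, len(doc), label="MONEY")  # last three tokens
doc.ents = [org, money]

# doc.ents is a tuple of Span objects carrying a label
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)
```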

TIP: To get a quick definition of the most common tags and labels, use the spacy.explain() function.

Example 1:

# GPE stands for Geopolitical entity
spacy.explain('GPE')

OUTPUT

'Countries, cities, states'

Example 2:

spacy.explain('NNP')

OUTPUT

'noun, proper singular'

Example 3:

spacy.explain('dobj')

OUTPUT

'direct object'

Rule-based matching

spaCy's Matcher lets us write rules to find words and phrases in text.

Token-based matching opens up a lot of new possibilities for information extraction.

Why not just regular expressions?

  1. The Matcher works on Doc and Token objects, not just strings.
  2. It can search for text as well as other lexical attributes.
  3. It lets us write rules that use the model's predictions.

Match patterns are lists of dictionaries, with each dictionary describing one token. The keys are the names of token attributes, mapped to their expected values.

Example 1:

import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern
pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]

# Add the pattern to the matcher
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc) # returns a list of tuples

# Iterate over the matches
for match_id, start, end in matches:
  # Get the matched span
  matched_span = doc[start:end]
  print(matched_span.text)

OUTPUT

iPhone X

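For more on token-based matching, see the spaCy documentation.

matcher.add, as called above (spaCy v2), takes three arguments:

  1. Unique ID for the pattern
  2. Optional callback (None here)
  3. Pattern to match

In spaCy v3 the signature changed: patterns are passed as a list in the second argument, e.g. matcher.add('IPHONE_PATTERN', [pattern]), and the callback became the optional on_match keyword argument.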

matches in the above code is a list of tuples, each consisting of three values:

  1. match_id: hash value of the pattern name
  2. start: start index of the matched span
  3. end: end index of the matched span (exclusive)
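Since match_id is a hash, it can be turned back into the pattern name through the vocabulary's string store. The sketch below assumes spaCy v3, where the patterns are passed to matcher.add as a list, and uses a blank English pipeline because the ORTH attribute needs no model.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# spaCy v3 style: second argument is a list of patterns
matcher.add("IPHONE_PATTERN", [[{"ORTH": "iPhone"}, {"ORTH": "X"}]])

doc = nlp("New iPhone X release date leaked")

results = []
for match_id, start, end in matcher(doc):
    # Look up the readable pattern name for the hash
    name = nlp.vocab.strings[match_id]
    results.append((name, start, end, doc[start:end].text))

print(results)
```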

Example 2:

import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern
pattern = [
  {'IS_DIGIT': True},
  {'LOWER': 'fifa'},
  {'LOWER': 'world'},
  {'LOWER': 'cup'},
  {'IS_PUNCT': True}
]

# Add the pattern to the matcher
matcher.add('FIFA_PATTERN', None, pattern)

# Process some text
doc = nlp("2018 FIFA World Cup: France won!")

# Call the matcher on the doc
matches = matcher(doc) # returns a list of tuples

# Iterate over the matches
for match_id, start, end in matches:
  # Get the matched span
  matched_span = doc[start:end]
  print(matched_span.text)

OUTPUT

2018 FIFA World Cup:

Example 3:

import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern
pattern = [
  {'LEMMA': 'love', 'POS': 'VERB'},
  {'POS': 'NOUN'}
]

# Add the pattern to the matcher
matcher.add('LOVE_PATTERN', None, pattern)

# Process some text
doc = nlp("I loved dogs but now I love cats more.")

# Call the matcher on the doc
matches = matcher(doc) # returns a list of tuples

# Iterate over the matches
for match_id, start, end in matches:
  # Get the matched span
  matched_span = doc[start:end]
  print(matched_span.text)

OUTPUT

loved dogs
love cats

Operators and quantifiers

Operators and quantifiers define how often a token should be matched.

import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern
pattern = [
  {'LEMMA': 'buy'},
  {'POS': 'DET', 'OP': '?'}, # optional: match 0 or 1 times
  {'POS': 'NOUN'}
]

# Add the pattern to the matcher
matcher.add('BUY_PATTERN', None, pattern)

# Process some text
doc = nlp("I bought a smartphone. Now I'm buying apps.")

# Call the matcher on the doc
matches = matcher(doc) # returns a list of tuples

# Iterate over the matches
for match_id, start, end in matches:
  # Get the matched span
  matched_span = doc[start:end]
  print(matched_span.text)

OUTPUT

bought a smartphone
buying apps

Operator      Description
{'OP': '!'}   Negation: match 0 times
{'OP': '?'}   Optional: match 0 or 1 times
{'OP': '+'}   Match 1 or more times
{'OP': '*'}   Match 0 or more times

NOTE: Operators make our patterns more powerful but also more complex, so use them wisely.
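Operators also work with purely lexical attributes, so they can be tried without a statistical model. The sketch below is an illustration only: the pattern name and the sample text are made up, and the matcher.add call assumes the spaCy v3 signature with a list of patterns.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# "hello", then an optional punctuation token, then "world"
pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},  # optional: match 0 or 1 times
    {"LOWER": "world"},
]
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! Hello world!")

# Both variants match: with and without the comma
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```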

Check out the next blog post for a deep dive into large-scale data analysis with spaCy.
