Getting started with spaCy
spaCy is a popular library for advanced NLP in Python.
At the center of spaCy is the nlp object, which holds the processing pipeline.
# Import the English language class
from spacy.lang.en import English
# Create the nlp object
nlp = English()
The nlp object created here contains:
- the processing pipeline
- language-specific rules used for tokenization, etc.
spacy.lang contains a variety of language classes. See the spaCy documentation for the full list of supported languages.
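For instance (the German class here is just one illustrative choice), any supported language class can be imported and used to create a blank pipeline:
# Import the German language class (other languages work the same way)
from spacy.lang.de import German
# Create a German nlp object with German tokenization rules
nlp_de = German()
# Tokenize a German greeting
doc = nlp_de("Guten Tag!")
print([token.text for token in doc])  # ['Guten', 'Tag', '!']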
The Doc object
The Doc object is created by processing a string of text with the nlp object.
doc = nlp("Hello world!")
print(doc.text)
# Iterate over tokens in a doc
for token in doc:
    print(token.text)
OUTPUT
Hello world!
Hello
world
!
Some points about the doc object:
- short for document
- lets us access information about the text in a structured way, and no information is lost
- behaves like a normal Python sequence
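A quick sketch of that sequence-like behaviour, reusing the doc created above:
# The Doc supports len() and indexing like a regular Python sequence
print(len(doc))      # 3
print(doc[0].text)   # Hello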
The Token object
Token objects represent the tokens in a document.
# Index into the doc to get a token at a specific index
token = doc[1]
# Get the verbatim token text via the .text attribute
print(token.text)
OUTPUT
world
The Span object
- A span object is a slice of the document consisting of one or more tokens.
- It is only a view of the doc and doesn’t contain any data itself.
# A slice from the Doc is a Span object
span = doc[1:4]
# Get the span text via the .text attribute
print(span.text)
Lexical attributes of the Token object (these depend only on the entry in the vocabulary, not on the token's context):
token.i -> index of the token within the parent Doc
token.text -> verbatim token text
token.is_alpha -> consists of alphabetic characters
token.is_punct -> is punctuation
token.like_num -> resembles a number (e.g. "10" or "ten")
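A short illustration of these attributes (the example sentence is made up for demonstration):
doc = nlp("It costs $5.")
# Lexical attributes are read per token; no statistical model is needed
print("index:   ", [token.i for token in doc])
print("text:    ", [token.text for token in doc])
print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])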
spaCy’s Statistical Models
Statistical models enable spaCy to predict linguistic attributes in context such as whether a word is a verb or whether a Span of text is a person name.
Linguistic attributes generally include:
- Part-of-speech tags
- Syntactic dependencies
- Named entities
These models are trained on large datasets of labeled example texts, and they can be further fine-tuned with our own data.
Some of the available pre-trained model packages from spaCy are:
- en_core_web_sm
- en_core_web_md
- en_core_web_lg
More models are listed in the spaCy models directory. To download the small English package, run:
python -m spacy download en_core_web_sm
To load a pre-trained package:
import spacy
nlp = spacy.load("en_core_web_sm")
The pre-trained package provides:
- Binary weights – enable spaCy to make predictions
- Vocabulary
- Meta information – which language class to use and how to configure the processing pipeline
NOTE: The models provided by spaCy ship only the binary weights needed to make predictions; the training data itself is not included.
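As a quick sanity check (assuming en_core_web_sm is installed), the meta information and pipeline components of a loaded package can be inspected like this:
import spacy
nlp = spacy.load("en_core_web_sm")
# Meta information: language code and package name, among other details
print(nlp.meta["lang"], nlp.meta["name"])
# Names of the components in the processing pipeline
print(nlp.pipe_names)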
Predicting Part-of-Speech tags
import spacy
# Load the small English model
nlp = spacy.load("en_core_web_sm")
# Process the text
doc = nlp("He drank the coffee")
# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)
OUTPUT
He PRON
drank VERB
the DET
coffee NOUN
NOTE: In spaCy, attributes that return strings usually end with an underscore. Attributes without the underscore return an integer ID.
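For example, comparing the two forms of the part-of-speech attribute on the token "drank" from the doc above:
token = doc[1]      # the token "drank"
print(token.pos_)   # string label, e.g. VERB
print(token.pos)    # corresponding integer ID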
Predicting Syntactic Dependencies (How the words are related)
The dep_ attribute returns the predicted dependency label.
The head attribute returns the syntactic head token for a word (the parent token this word is attached to).
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
OUTPUT
He PRON nsubj drank
drank VERB ROOT drank
the DET det coffee
coffee NOUN dobj drank
spaCy uses a standardized label scheme. Some common labels are:
Label | Description | Example |
---|---|---|
nsubj | nominal subject attached to the verb (in this case drank) | He |
dobj | direct object attached to the verb (in this case drank) | coffee |
det | determiner (article) | the |
Predicting Named Entities
The doc.ents property lets us access the named entities predicted by the model. It returns a sequence of Span objects.
# Process a text
doc = nlp(u"Apple is looking to buy U.K. startup for $1 billion")
# Iterate over the predicted entities
for ent in doc.ents:
    print(ent.text, ent.label_)
OUTPUT
Apple ORG
U.K. GPE
$1 billion MONEY
TIP: To get a quick definition of the most common tags and labels, use the spacy.explain() method.
Example 1:
# GPE stands for Geopolitical entity
spacy.explain('GPE')
OUTPUT
'Countries, cities, states'
Example 2:
spacy.explain('NNP')
OUTPUT
'noun, proper singular'
Example 3:
spacy.explain('dobj')
OUTPUT
'direct object'
Rule-based matching
spaCy's matcher lets us write rules to find words and phrases in text.
Token-based matching opens up a lot of new possibilities for information extraction.
Why not just regular expressions?
- Works on Doc and Token objects, not just strings
- Can match on text as well as other lexical attributes
- Can use rules that reference the model's predictions
Match patterns are lists of dictionaries, with each dictionary describing one token. The keys are the names of token attributes, mapped to their expected values.
Example 1:
import spacy
# Import the Matcher
from spacy.matcher import Matcher
# Load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')
# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern
pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
# Add the pattern to the matcher
matcher.add('IPHONE_PATTERN', None, pattern)
# Process some text
doc = nlp("New iPhone X release date leaked")
# Call the matcher on the doc
matches = matcher(doc) # returns a list of tuples
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
OUTPUT
iPhone X
For more on token-based matching, see the spaCy documentation.
matcher.add takes three arguments:
- a unique ID for the pattern
- an optional callback
- the pattern to match
matches in the above code is a list of tuples, with each tuple consisting of three values:
- match_id: hash value of the pattern name
- start: start index of the matched span
- end: end index of the matched span
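As a side note, the hash can be converted back into the pattern name through the shared vocabulary's string store, as sketched here:
for match_id, start, end in matches:
    # Look up the string name of the matched pattern from its hash value
    string_id = nlp.vocab.strings[match_id]
    print(string_id, doc[start:end].text)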
Example 2:
import spacy
# Import the Matcher
from spacy.matcher import Matcher
# Load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')
# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
# Add the pattern to the matcher
matcher.add('FIFA_PATTERN', None, pattern)
# Process some text
doc = nlp("2018 FIFA World Cup: France won!")
# Call the matcher on the doc
matches = matcher(doc) # returns a list of tuples
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
OUTPUT
2018 FIFA World Cup:
Example 3:
import spacy
# Import the Matcher
from spacy.matcher import Matcher
# Load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')
# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
# Add the pattern to the matcher
matcher.add('LOVE_PATTERN', None, pattern)
# Process some text
doc = nlp("I loved dogs but now I love cats more.")
# Call the matcher on the doc
matches = matcher(doc) # returns a list of tuples
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
OUTPUT
loved dogs
love cats
Operators and quantifiers
Operators and quantifiers define how often a token should be matched.
import spacy
# Import the Matcher
from spacy.matcher import Matcher
# Load the model and create the nlp object
nlp = spacy.load('en_core_web_sm')
# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
# Add the pattern to the matcher
matcher.add('BUY_PATTERN', None, pattern)
# Process some text
doc = nlp("I bought a smartphone. Now I'm buying apps.")
# Call the matcher on the doc
matches = matcher(doc) # returns a list of tuples
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
OUTPUT
bought a smartphone
buying apps
Pattern | Description |
---|---|
{'OP': '!'} | Negation: match 0 times |
{'OP': '?'} | Optional: match 0 or 1 times |
{'OP': '+'} | Match 1 or more times |
{'OP': '*'} | Match 0 or more times |
NOTE: Operators can make our patterns more powerful, but they also increase complexity, so use them wisely.
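For example, a small sketch using the '+' operator (the sentence and pattern name here are made up for illustration, and the three-argument matcher.add call follows the examples above):
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# Match a form of "buy" followed by one or more noun tokens
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'NOUN', 'OP': '+'}
]
matcher.add('BUY_PLUS_PATTERN', None, pattern)
doc = nlp("She bought concert tickets yesterday.")
for match_id, start, end in matcher(doc):
    # '+' can produce overlapping matches of different lengths
    print(doc[start:end].text)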
Check out the next blog for a deep dive into large-scale data analysis with spaCy.