The parse API is similar to Python regular expressions, mainly consisting of the parse, search, and findall methods. Basic usage is covered in the parse documentation.
Pattern format
The parse format is similar to the Python format syntax. You can capture matched text using {} or {field_name}.
For example, in the following text, if I want to get the profile URL and username, I can write it like this:
content:
Hello everyone, my Medium profile url is https://qtalen.medium.com,
and my username is @qtalen.
parse pattern:
Hello everyone, my Medium profile url is {profile},
and my username is {username}.
Or you want to extract several phone numbers. However, the phone numbers have different formats of country codes in front, and the phone numbers themselves have a fixed length of 11 digits. You can write it like this:
from parse import Parser

compiler = Parser("{country_code}{phone:11.11},")
content = "0085212345678901, +85212345678902, (852)12345678903,"

# findall returns one result per phone number found in the text
results = compiler.findall(content)
for result in results:
    print(result)
Or if you need to process a piece of text in an HTML tag, but the text is preceded and followed by an indefinite amount of whitespace, you can write it like this:
content:
<div> Hello World </div>
pattern:
<div>{:^}</div>
In the phone number pattern above, {:11} refers to the width, which means to capture at least 11 characters (regular expression: .{11,}?). {:.11} refers to the precision, which means to capture at most 11 characters (regular expression: .{1,11}?). So when combined as {:11.11}, it means exactly 11 characters (regular expression: .{11,11}?).
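To see what the width and precision amount to, here is a sketch that uses only the standard `re` module. The regular expression is a hand-written stand-in for the phone pattern, based on the equivalences described above; it is not output taken from parse itself.

```python
import re

content = "0085212345678901, +85212345678902, (852)12345678903,"

# Stand-in for "{country_code}{phone:11.11},": a lazy catch-all group
# followed by a group of exactly 11 characters and a literal comma.
pattern = re.compile(r"(?P<country_code>.+?)(?P<phone>.{11,11}?),")

# The lazy group also picks up the space after each comma, so strip it.
numbers = [
    (m["country_code"].strip(), m["phone"]) for m in pattern.finditer(content)
]
print(numbers)
```

Because the country code group is lazy and the phone group has a fixed length, the split between the two always falls 11 characters before the comma.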
One of the most powerful features of parse is its handling of time text, which can be parsed directly into Python datetime objects. For example, if we want to parse the time in an HTTP log:
content:
[04/Jan/2019:16:06:38 +0800]
pattern:
[{:th}]
Retrieving results
There are two ways to retrieve the results:
- For captures that use {} with no field name, you can use result.fixed to get the results as a tuple.
- For captures that use {field_name}, you can use result.named to get the results as a dictionary.
Custom Type Conversions
Although using {field_name} is already quite simple, the source code reveals that {field_name} is internally converted to (?P<field_name>.+?). So parse still uses regular expressions for matching. .+? matches one or more arbitrary characters in non-greedy mode.
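To illustrate why the non-greedy mode matters, the following sketch writes such named groups by hand with the standard `re` module. The `key=value;` format is a made-up example, and the expressions are stand-ins rather than what parse generates verbatim.

```python
import re

text = "a=1;b=2;"  # made-up sample text

# Non-greedy groups, in the style of "{key}={value};".
lazy = re.compile(r"(?P<key>.+?)=(?P<value>.+?);")
lazy_pairs = [(m["key"], m["value"]) for m in lazy.finditer(text)]

# The same expression with greedy groups swallows the first pair.
greedy = re.compile(r"(?P<key>.+)=(?P<value>.+);")
greedy_match = greedy.search(text)
greedy_pair = (greedy_match["key"], greedy_match["value"])

print(lazy_pairs, greedy_pair)
```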
However, we often want to match more precisely. For example, for the text “my email is xxx@xxx.com”, the pattern “my email is {email}” can capture the email. Sometimes we may get dirty data, for example, “my email is xxxx@xxxx”, and we don’t want to capture it.
Is there a way to use regular expressions for more accurate matching?
That’s when the with_pattern decorator comes in handy.
For example, for capturing email addresses, we can write it like this:
from parse import Parser, with_pattern

@with_pattern(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
def email(text: str) -> str:
    return text

compiler = Parser("my email address is {email:Email}", dict(Email=email))
legal_result = compiler.parse("my email address is xx@xxx.com")  # legal email: returns a result
illegal_result = compiler.parse("my email address is xx@xx")  # illegal email: returns None
Using the with_pattern decorator, we can define a custom field type, in this case Email, which matches the email address in the text. We can also use this approach to match other complicated patterns.