Sunday, March 31, 2024

Introducing Python’s Parse: The Ultimate Alternative to Regular Expressions | by Peng Qian | Jun, 2023



The parse API is similar to Python’s regular expression API, mainly consisting of the parse, search, and findall methods. Basic usage is covered in the parse documentation.

Pattern format

The parse format is very similar to Python’s format syntax. You can capture matched text using {} or {field_name}.

For example, in the following text, if I want to get the profile URL and username, I can write it like this:

content:
Hello everyone, my Medium profile url is https://qtalen.medium.com,
and my username is @qtalen.

parse pattern:
Hello everyone, my Medium profile url is {profile},
and my username is {username}.

Or suppose you want to extract multiple phone numbers. The numbers are prefixed with country codes in different formats, but each phone number has a fixed length of 11 digits. You can write it like this:

from parse import Parser

# country_code takes whatever precedes the 11-character phone field
compiler = Parser("{country_code}{phone:11.11},")
content = "0085212345678901, +85212345678902, (852)12345678903,"

results = compiler.findall(content)

for result in results:
    print(result)

Or if you need to process a piece of text inside an HTML tag, but the text is preceded and followed by an indefinite amount of whitespace, you can write it like this:

content:
<div> Hello World </div>

pattern:
<div>{:^}</div>

In the phone number example above, {:11} refers to the width, which means to capture at least 11 characters, roughly equivalent to the regular expression (.{11,}?). {:.11} refers to the precision, which means to capture at most 11 characters, roughly equivalent to the regular expression (.{,11}?). Combined as {:11.11}, they mean (.{11,11}?), so the phone field captures exactly 11 characters.

Capture fixed-width characters. Image by Author

The most powerful feature of parse is its handling of time text, which can be parsed directly into Python datetime objects. For example, if we want to parse the time in an HTTP log:

content:
[04/Jan/2019:16:06:38 +0800]

pattern:
[{:th}]

Retrieving results

There are two ways to retrieve the results:

  1. For captures that use {} without a field name, you can use result.fixed to get the results as a tuple.
  2. For captures that use {field_name}, you can use result.named to get the results as a dictionary.

Custom Type Conversions

Although using {field_name} is already quite simple, the source code reveals that {field_name} is internally converted to (?P<field_name>.+?). So parse still uses regular expressions for matching. .+? matches one or more arbitrary characters in non-greedy mode.

The transformation process of parse format to regular expressions. Image by Author

However, we often want to match more precisely. For example, for the text “my email is xxx@xxx.com”, the pattern “my email is {email}” can capture the email. But sometimes we get dirty data, such as “my email is xxxx@xxxx”, and we don’t want to capture it.

Is there a way to use regular expressions for more accurate matching?

That’s when the with_pattern decorator comes in handy.

For example, to capture email addresses, we can write it like this:

from parse import Parser, with_pattern

# The regex restricts what the custom Email type is allowed to match
@with_pattern(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
def email(text: str) -> str:
    return text

compiler = Parser("my email address is {email:Email}", dict(Email=email))

legal_result = compiler.parse("my email address is xx@xxx.com")  # legal email
illegal_result = compiler.parse("my email address is xx@xx")  # illegal email, returns None

Using the with_pattern decorator, we can define a custom field type, in this case Email, which will match the email address in the text. We can also use this method to match other complicated patterns.


