Python in Plain English


To RegEx or Not To RegEx? (Part II)

7 min read · Feb 25, 2021


The Re Module

re.match(regex, string)
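As a quick illustration of the signature above, `re.match` only tries the pattern at the beginning of the string and returns either a match object or `None`. The sample strings below are made up for this sketch and do not come from the Twitter dataset.

```python
import re

# Hypothetical sample text, not taken from the Twitter dataset
text = "2017-12-01 was a busy day"

# re.match only looks at the start of the string
m = re.match(r"\d{4}", text)
year = m.group() if m else None

# The same pattern does not match when the digits appear later in the string
no_match = re.match(r"\d{4}", "posted on 2017-12-01")
```

Because a failed match returns `None`, checking the result before calling `.group()` avoids an `AttributeError`.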

re.sub(regex, repl, string)

df.column = df.column.apply(lambda x: re.sub(r'pattern', "", str(x)))
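The column-wise substitution above can be sketched without a DataFrame. This is a minimal example: `strip_pattern` is a hypothetical helper name, and the values are invented for illustration. It applies the same `re.sub(pattern, "", str(x))` idea to each item of a plain list.

```python
import re

def strip_pattern(value, pattern):
    """Remove every occurrence of `pattern` from the string form of `value`."""
    # str() mirrors the str(x) cast in the lambda, so non-string values are handled too
    return re.sub(pattern, "", str(value))

# Hypothetical values standing in for a DataFrame column
values = ["abc123", 45, "no digits"]
cleaned = [strip_pattern(v, r"\d+") for v in values]
```

The `str()` cast matters when a column mixes strings with numbers or missing values, since `re.sub` raises a `TypeError` on non-string input.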

re.findall(regex, string)
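A short sketch of `re.findall`, using a made-up tweet-like string: it returns every non-overlapping match as a list, and when the pattern contains a single capturing group it returns only the group contents.

```python
import re

# Hypothetical tweet-like text
text = "Loving #python and #regex today"

# Without a group, findall returns the full matches
tags = re.findall(r"#\w+", text)

# With one capturing group, findall returns only what the group captured
names = re.findall(r"#(\w+)", text)
```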

Application with Twitter Dataset

# import modules
import json
import pandas as pd
import re
import regex
from datetime import datetime
pd.set_option("display.max_rows", 999)
pd.set_option("display.max_columns", 999)
# import twitter data
data_json = open('data/tweets/2017-12-01.json', mode='r').read()
df = pd.read_json('data/tweets/2017-12-01.json')
df.head()

1. Removal of #Hashtag, @Callouts, and &Character References
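One possible way to approach this step, sketched with the standard `re` module. The patterns and the `remove_tags` helper are assumptions made for illustration, not necessarily the ones used on the dataset: `[#@]\w+` targets hashtags and callouts, and `&#?\w+;` targets character references such as `&amp;`.

```python
import re

def remove_tags(text):
    """Strip #hashtags, @callouts, and HTML character references from text."""
    # A '#' or '@' followed by one or more word characters
    text = re.sub(r"[#@]\w+", "", text)
    # Named or numeric character references, e.g. &amp; or &#39;
    text = re.sub(r"&#?\w+;", "", text)
    return text

# Hypothetical sample tweet
sample = "@alice loves #python &amp; regex"
result = remove_tags(sample)
```

The removals leave extra spaces behind, which a later whitespace-normalization step can collapse.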

2. Emoji Conversion and URL Link Removal
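A rough sketch of this step using only the standard library. The emoji ranges below are a simplifying assumption rather than a complete definition of "emoji", and converting each emoji to its Unicode name via `unicodedata.name` is just one option; dedicated third-party packages exist for this. The URL pattern is likewise a simple assumption that treats any run of non-whitespace after `http://` or `https://` as a link.

```python
import re
import unicodedata

# Assumed URL pattern: scheme followed by any non-whitespace characters
URL_RE = re.compile(r"https?://\S+")
# Rough emoji ranges (an assumption, not exhaustive)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def emoji_to_text(match):
    """Replace one emoji with a :lower_snake_case: version of its Unicode name."""
    # The "" default avoids a ValueError for code points without a name
    name = unicodedata.name(match.group(), "")
    return ":" + name.lower().replace(" ", "_") + ":" if name else ""

def clean(text):
    text = URL_RE.sub("", text)                # remove links first
    return EMOJI_RE.sub(emoji_to_text, text)   # then convert emoji to text

# Hypothetical sample tweet
result = clean("Great day \U0001F600 https://t.co/abc123")
```

Passing a function as the replacement argument lets each match be converted individually instead of being replaced by a fixed string.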

Capturing Groups
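To make the idea concrete, here is a small sketch with an invented address-like string. Parentheses in a pattern capture the text they match, `.groups()` returns the captures as a tuple, and `\1`, `\2` refer back to them in a replacement string.

```python
import re

# Hypothetical input string
m = re.match(r"(\w+)@(\w+)\.com", "steven@example.com")

whole = m.group(0)          # the entire match
user, domain = m.groups()   # the two captured groups, in order

# Backreferences in the replacement swap the two captured parts
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "steven@example")
```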

3. Punctuation and Digit Removal
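A minimal sketch of this step, assuming "punctuation" means anything that is neither a word character nor whitespace. The helper name and sample string are hypothetical. Note that `\w` includes digits and the underscore, so both are removed explicitly.

```python
import re

def remove_punct_digits(text):
    """Delete punctuation, digits, and underscores; keep letters and whitespace."""
    # [^\w\s] : any character that is not a word character or whitespace
    # \d      : digits (part of \w, so they need their own alternative)
    # _       : underscore (also part of \w)
    return re.sub(r"[^\w\s]|\d|_", "", text)

# Hypothetical sample text
result = remove_punct_digits("Wow!!! 100% true, isn't it?")
```

Apostrophes are removed as well, so contractions such as "isn't" collapse to "isnt".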

4. Removing Extraneous Whitespaces
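One common way to do this, shown as a sketch with a hypothetical helper name: collapse every run of whitespace (spaces, tabs, newlines) into a single space with `\s+`, then trim the ends.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample text with mixed whitespace
result = normalize_whitespace("  too   many\t spaces\n here ")
```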

5. Working with Time

regex.match(r'^...(?P<name1>regex1)...(?P<name2>regex2)...(?P<name3>regex3)...$', string)
# Extract year, month, day, and time into a dictionary
df['match'] = df.time.apply(lambda x: regex.match(r'^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})T(?P<time>\d{2}:\d{2}:\d{2})-\d{2}:\d{2}$', str(x)))
# groupdict() maps each group name to its captured text; skip rows that did not match
df['match'] = df['match'].apply(lambda m: m.groupdict() if m else None)
# Convert into datetime ISO format
# (named fmt rather than format to avoid shadowing the built-in)
fmt = "%Y-%m-%dT%H:%M:%S%z"
df['datetime'] = df.time.apply(lambda x: datetime.strptime(str(x), fmt))
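The named-group approach can be tried on a single string without the DataFrame. This sketch swaps in the standard `re` module, which supports the same `(?P<name>...)` syntax as the third-party `regex` package used above. The timestamp is a made-up value in the assumed `YYYY-MM-DDTHH:MM:SS-HH:MM` layout.

```python
import re
from datetime import datetime

# Same named groups as the pattern above, split over two lines for readability
TIME_RE = re.compile(
    r"^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"T(?P<time>\d{2}:\d{2}:\d{2})-\d{2}:\d{2}$"
)

stamp = "2017-12-01T08:30:00-05:00"  # hypothetical timestamp

m = TIME_RE.match(stamp)
parts = m.groupdict() if m else None  # dict keyed by group name

# %z accepts a UTC offset written with a colon on Python 3.7 and later
parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z")
```

The dictionary gives the individual pieces as strings, while `strptime` yields a timezone-aware `datetime` that supports arithmetic and comparison.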



Written by Steven Yan

Data Scientist for Social Good. Former MCAT Tutor and Content Writer. Pianist and Linguaphile. UChicago and Flatiron Alum.
