Program for parsing and processing the given URL using regex

Introduction

A URL is made up of three to four components: the scheme (protocol), the hostname, the path, and an optional query string. The task is to extract the protocol and hostname from a given URL.
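To make these components concrete, here is a small sketch that splits a sample URL using Python's standard urllib.parse module instead of regex. The URL is a made-up example chosen only for illustration; it does not come from the program in this article.

```python
from urllib.parse import urlsplit

# Hypothetical example URL, chosen only to show every component.
sample_url = "https://www.example.com:8080/docs/index.html?lang=en"

parts = urlsplit(sample_url)
print("Scheme:  ", parts.scheme)    # https
print("Hostname:", parts.hostname)  # www.example.com
print("Port:    ", parts.port)      # 8080
print("Path:    ", parts.path)      # /docs/index.html
print("Query:   ", parts.query)     # lang=en
```

Comparing these values with the regex results is a quick way to check that a pattern picks out the part of the URL you expect.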

Program

Approach 1

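import re
# Read the URL from the user.
ip_url = input("Enter the url: ")
# The protocol is the run of word characters just before "://".
protocol = re.findall(r'(\w+)://', ip_url)
print("Protocol: ", protocol)
# The hostname is captured after "://www."; the dot is escaped so it matches literally.
hostname = re.findall(r'://www\.([\w\-\.]+)', ip_url)
print("Hostname: ", hostname)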

Output

[Image: program output for Approach 1]

Approach 2

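import re
# Read the URL from the user.
ip_url = input("Enter the url: ")
# The protocol is the run of word characters just before "://".
protocol = re.findall(r'(\w+)://', ip_url)
print("Protocol: ", protocol)
# Each match is a tuple (hostname, ':port', port); the last two are empty strings if no port is given.
hostname = re.findall(r'://([\w\-\.]+)(:(\d+))?', ip_url)
print("Hostname: ", hostname)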

Output

[Image: program output for Approach 2]

Explanation

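In the first approach, the regular expression for extracting the protocol is '(\w+)://', and the hostname is captured by '([\w\-\.]+)' after the literal prefix '://www.', so only hostnames that begin with www are matched. The metacharacter '\w' matches any alphanumeric character or underscore.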
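As a rough sketch of how these two patterns behave, the snippet below wraps them in a helper and runs them on fixed strings instead of user input. The function name protocol_and_host and the sample URLs are hypothetical, introduced only for this illustration.

```python
import re

def protocol_and_host(url):
    """Return (protocols, hostnames) found in url using the first approach's patterns."""
    protocols = re.findall(r'(\w+)://', url)
    # The dot after "www" is escaped so that it matches a literal dot only.
    hostnames = re.findall(r'://www\.([\w\-\.]+)', url)
    return protocols, hostnames

# Hypothetical sample URLs.
print(protocol_and_host("https://www.example.com/index.html"))
# (['https'], ['example.com'])
print(protocol_and_host("https://example.com/index.html"))
# (['https'], [])
```

The second call returns an empty hostname list because the URL has no www prefix for the pattern to anchor on.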

In the second approach, the protocol and hostname are extracted from URLs that may also carry a port number. The metacharacter '\d' matches any digit, and '?' makes the preceding group optional, so URLs without a port still match.
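One possible variation on the second approach, sketched below, uses a non-capturing group (?:...) around the colon so that only the hostname and the port digits are captured. The names HOST_PORT and host_and_port, and the sample URLs, are assumptions made for this example and are not part of the original program.

```python
import re

# Hypothetical pattern: hostname, then an optional ":port" whose colon is not captured.
HOST_PORT = re.compile(r'://([\w\-\.]+)(?::(\d+))?')

def host_and_port(url):
    """Return (hostname, port) for the first match in url, or None if nothing matches."""
    match = HOST_PORT.search(url)
    if match is None:
        return None
    host, port = match.groups()
    # The port group is None when the URL carries no port number.
    return host, (int(port) if port else None)

# Hypothetical sample URLs.
print(host_and_port("http://localhost:8080/home"))    # ('localhost', 8080)
print(host_and_port("https://www.example.com/path"))  # ('www.example.com', None)
```

Using search with groups returns None for a missing port, which can be easier to handle than the empty strings that findall produces for unmatched groups.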
