How to find a link in text with Python - AvtoShod.ru - solutions to various problems

You can use the following monstrous regex:

\b((?:https?://)?(?:(?:www\.)?(?:[\da-z.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w.-]*)*/?)\b

Demo regex101

This regex will accept urls in the following format:

INPUT:

add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 192.168.1.1/test.jpg.
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg.

OUTPUT:

http://mit.edu.com
https://facebook.jp.com
www.google.be
https://www.google.be
www.website.gov.us
www.test.com
http://192.168.1.1/test.jpg
www.test.com:8080/test.jpg
www.website.gov.us/login.html
192.168.1.1/test.jpg
google.co.jp/maps
2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg

Explanations:

\b is a word boundary; it delimits the URL from the rest of the text
(?:https?://)? matches http:// or https:// if present
(?:www\.)?(?:[\da-z.-]+)\.(?:[a-z]{2,6}) matches a standard URL that may start with www. (let's call it STANDARD_URL)
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) matches a standard IPv4 address (let's call it IPv4)
to match IPv6 URLs: (?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])) (let's call it IPv6)
to match the port part if present (let's call it PORT): (?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])
(?:/[\w.-]*)*/? matches the target object part of the URL (HTML file, jpg, …) (let's call it RESOURCE_PATH)

This gives the following regex:

\b((?:https?://)?(?:STANDARD_URL|IPv4|IPv6)(?:PORT)?(?:RESOURCE_PATH))\b
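
The composition above can be sketched in Python by building the pattern from named pieces. This is only a minimal sketch, not the answer's own code: the variable names are made up for illustration, the IPv6 alternative is left out for brevity, and PORT is simplified to a colon followed by up to five digits, so it does not cap the value at 65535.

```python
import re

# Building blocks (names are illustrative assumptions)
STANDARD_URL = r"(?:www\.)?[\da-z.-]+\.[a-z]{2,6}"
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
IPV4 = rf"(?:{OCTET}\.){{3}}{OCTET}"
PORT = r":[0-9]{1,5}"  # simplified: no upper bound check
RESOURCE_PATH = r"(?:/[\w.-]*)*/?"

# Same overall shape as the composed regex, minus the IPv6 branch
URL_RE = re.compile(
    rf"\b((?:https?://)?(?:{STANDARD_URL}|{IPV4})(?:{PORT})?{RESOURCE_PATH})\b"
)

sample = "see www.test.com:8080/test.jpg and http://192.168.1.1/test.jpg. ok"
print(URL_RE.findall(sample))
# ['www.test.com:8080/test.jpg', 'http://192.168.1.1/test.jpg']
```

Keeping each piece in its own variable makes it possible to test the pieces separately before joining them.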

Sources:

IPv6: Regular expression that matches valid IPv6 addresses

IPv4: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9780596802837/ch07s16.html

PORT: https://stackoverflow.com/a/12968117/8794221

Other sources:
https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149

$ more url.py

import re

inputString = """add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 (192.168.1.1/test.jpg).
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg."""

regex = r"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w.-]*)*/?)\b"

matches = re.findall(regex, inputString)
print(matches)

OUTPUT:

$ python url.py 
['http://mit.edu.com', 'https://facebook.jp.com', 'www.google.be', 'https://www.google.be', 'www.website.gov.us', 'www.test.com', 'http://192.168.1.1/test.jpg', 'www.test.com:8080/test.jpg', 'www.website.gov.us/login.html', '192.168.1.1/test.jpg', 'google.co.jp/maps', '2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg']

Source

Today we are going to learn how to find and extract a website URL from a string in Python, using Python's regular expression module, re. Given a string, we want to check whether it contains a URL and, if it does, extract and print it.

First, we need a way to detect a URL. For that we will use a regular expression that covers the combinations of symbols that can make up a URL.

This is the regular expression that will help us detect a URL:

# regular expression to find a URL in a string in Python

r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

Then we scan the string with this regular expression, using the findall() function from Python's re module, and check whether any URL was found.

Let us begin the code.

Program to find the URL from an input string

We import the regular expression module and define a function that holds the logic.

Code Example

# How to extract a URL from a string in Python

import re

def URLsearch(stringinput):
    # regular expression
    regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

    # find the URLs in the passed string
    urlsrc = re.findall(regularex, stringinput)

    # findall() returns one tuple of groups per match; index 0 is the whole URL
    return [url[0] for url in urlsrc]

textcontent = 'text :a software website find contents related to technology https://devenum.com https://google.com,http://devenum.com'

# using the function defined above
print("Urls found: ", URLsearch(textcontent))

Output:

Urls found:  ['https://devenum.com', 'https://google.com,http://devenum.com']
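
Note that the last two URLs in the output come back as a single match, because the pattern accepts a comma inside a URL. If the input is known to separate URLs with commas, a simpler pattern that stops at whitespace or a comma is one possible workaround. This is a sketch under that assumption (a URL that legitimately contains a comma would be cut short), and split_pattern is a hypothetical name.

```python
import re

textcontent = 'text :a software website find contents related to technology https://devenum.com https://google.com,http://devenum.com'

# Assumed pattern: a scheme, then everything up to whitespace or a comma
split_pattern = r"https?://[^\s,]+"

print(re.findall(split_pattern, textcontent))
# ['https://devenum.com', 'https://google.com', 'http://devenum.com']
```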

Find URL in string of HTML format

In this code example we search for URLs inside HTML <p> and <a> tags, using the same regular expression as above.

import re

def URLsearch(stringinput):
    # regular expression
    regularex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"

    # find the URLs in the passed string
    urlsrc = re.findall(regularex, stringinput)

    # findall() returns one tuple of groups per match; index 0 is the whole URL
    return [url[0] for url in urlsrc]

textcontent = '<p>Contents <a href="https://www.google.com">Python Examples</a><a href="https://www.devenum.com">Even More Examples</a> <a href="http://www.devenum.com"></p>'

# using the function defined above
print("Urls found: ", URLsearch(textcontent))

Output

Urls found:  ['https://www.google.com', 'https://www.devenum.com', 'http://www.devenum.com']
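
When the input really is HTML, reading the href attributes with a parser tends to be more robust than a regex. The sketch below is not part of the tutorial; it uses html.parser from the Python standard library, and HrefCollector is a hypothetical class name.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

html_text = '<p>Contents <a href="https://www.google.com">Python Examples</a><a href="https://www.devenum.com">Even More Examples</a> <a href="http://www.devenum.com"></p>'

collector = HrefCollector()
collector.feed(html_text)
print(collector.hrefs)
# ['https://www.google.com', 'https://www.devenum.com', 'http://www.devenum.com']
```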

Source

The variable text holds some set of words. It may also contain a link.
I need to write a function that checks whether such a link is present.

For example:
The text «Hello, pythonworld.ru!» contains the link pythonworld.ru
The text «Checking…..гто.рф!» contains the link гто.рф

Question asked more than three years ago

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(?P<domain>\w+\.\w{2,3})"

test_str = ("Hello, pythonworld.ru!\n"
    "Checking гто.рф\n"
    "microsoft.com")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
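
The question asks for a function that only reports whether a link is present. A minimal sketch of such a function is shown here; has_link is a hypothetical name, and the pattern is a naive assumption (word characters, a dot, then at least two word characters), so it will also accept things like 3.14.

```python
import re

# Naive "domain-like" pattern; in Python 3, \w also matches Cyrillic letters
DOMAIN_RE = re.compile(r"\w+\.\w{2,}")

def has_link(text):
    """Return True if the text contains something that looks like a domain."""
    return DOMAIN_RE.search(text) is not None

print(has_link("Hello, pythonworld.ru!"))   # True
print(has_link("Checking…..гто.рф!"))       # True
print(has_link("Hello, world!"))            # False
```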


This is usually solved with regular expressions, something like:

import re

myString = "This is a link http://www.google.com"
re.search(r"(?P<url>https?://[^\s]+)", myString).group("url")

But if you are parsing HTML, lxml is more effective:

result = self._openurl(self.mainurl)
content = result.read()
html = lxml.html.fromstring(content)
urls = html.xpath('//a/@href')


Source

How can I parse text and find all instances of hyperlinks within a string? The hyperlink will not be in the HTML format of <a href="http://test.com">test</a> but just http://test.com

Secondly, I would like to then convert the original string and replace all instances of hyperlinks into clickable html hyperlinks.

I found an example in this thread:

Easiest way to convert a URL to a hyperlink in a C# string?

but was unable to reproduce it in python

asked Apr 6, 2009 at 2:37

Here’s a Python port of Easiest way to convert a URL to a hyperlink in a C# string?:

import re

myString = "This is my tweet check it out http://tinyurl.com/blah"

r = re.compile(r"(http://[^ ]+)")
print(r.sub(r'<a href="\1">\1</a>', myString))

Output:

This is my tweet check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
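
The same idea can be extended to cover https and to HTML-escape the surrounding text, which matters if the input comes from users. This is a sketch rather than the answer's code: linkify is a hypothetical name, and the pattern is an assumption that treats everything up to whitespace, an angle bracket, or a double quote as part of the URL.

```python
import html
import re

# Assumed pattern: http or https, then no whitespace, angle brackets, or quotes
URL_RE = re.compile(r'https?://[^\s<>"]+')

def linkify(text):
    """Return text as HTML with plain URLs wrapped in <a> tags."""
    parts = []
    last = 0
    for match in URL_RE.finditer(text):
        # Escape the plain text between the previous URL and this one
        parts.append(html.escape(text[last:match.start()]))
        url = html.escape(match.group(0), quote=True)
        parts.append(f'<a href="{url}">{url}</a>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)

print(linkify("This is my tweet check it out http://tinyurl.com/blah"))
# This is my tweet check it out <a href="http://tinyurl.com/blah">http://tinyurl.com/blah</a>
```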

answered Apr 6, 2009 at 2:53

maxyfc

Here is a much more sophisticated regexp from 2002.

@yoniLavi minified this to:

re.compile(r'\b(?:https?|telnet|gopher|file|wais|ftp):[\w/#~:.?+=&%@!\-.:?\\-]+?(?=[.:?\-]*(?:[^\w/#~:.?+=&%@!\-.:?\-]|$))')

answered Jan 20, 2010 at 15:45

dfrankow

Django also has a solution that doesn’t just use regex. It is django.utils.html.urlize(). I found this to be very helpful, especially if you happen to be using django.

You can also extract the code to use in your own project.

Erock

answered Jan 24, 2012 at 6:16

Kekoa

Jinja2 (Flask uses this) has a filter urlize which does the same.

Docs

answered Oct 25, 2012 at 22:57

jmoz

I would also recommend taking a look at urlextract.

You can install it running: pip install urlextract

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
print(urls) # prints: ['janlipovsky.cz']

The main advantage is that urlextract finds URLs even when no scheme (http, ftp, etc.) is specified. It also has many configuration options for tuning the extractor to your needs. Everything can be found in the documentation.

answered Jan 2 at 14:04

Source

aalexandrov

15.08.2017, 19:43

Hi, everyone)

I have a question: how do I find a url in a string? For example, given the string

Python

s = 'some string [url]http://vk.com/club123456768[/url] from which the url needs to be extracted'

I want to pull the url out of this string

Python

http://vk.com/club123456768

How could this be done?

P.S. The forum script stubbornly adds its own markup to links

0x10

15.08.2017, 20:42

Python

import re
 
 
text = 'some string [url]http://vk.com/club123456768[/url] from which the url needs to be extracted [url]http://vk.com/club42[/url]'
url_pattern = r'\[url\](.*?)\[/url\]'
 
urls = re.findall(url_pattern, text)  # ['http://vk.com/club123456768', 'http://vk.com/club42']

Garry Galler

15.08.2017, 22:32

This post was marked by aalexandrov as the solution

0x10,
His string does not contain the word [url] :-). The forum adds it.

Python

import re

text = 'some string http://vk.com/club123456768 from which the url needs to be extracted http://vk.com/club42'
url_pattern = r'http://[\S]+'

urls = re.findall(url_pattern, text)
print(urls)

aalexandrov,
P.S. In the advanced editing mode, under Additional options, there is an «Other» checkbox group: clear the «Automatically insert links» checkbox.

Рыжий Лис

16.08.2017, 05:17

Garry Galler, you forgot about https…

Python

#!/usr/bin/env python3
import re
text = 'some string [url]http://vk.com/club123456768[/url] and [url]https://vk.com/club42[/url]'
urls = re.findall(r'http(?:s)?://\S+', text)
print(urls)

Added 2 minutes later
Or even like this, though it is less readable:

Python

urls = re.findall(r'https?://\S+', text)
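
One thing to keep in mind with \S+ is that it also swallows whatever is glued to the URL, such as a closing [/url] tag or a full stop at the end of a sentence. A possible refinement is sketched below; find_urls is a hypothetical helper, and the choice of which trailing characters to strip is an assumption that may not suit every input.

```python
import re

def find_urls(text):
    """Find http(s) URLs, stopping at square brackets and trimming trailing punctuation."""
    candidates = re.findall(r'https?://[^\s\[\]]+', text)
    # Strip punctuation that usually belongs to the sentence, not to the URL
    return [url.rstrip('.,;:!?)') for url in candidates]

text = 'see [url]http://vk.com/club123456768[/url] and https://vk.com/club42.'
print(find_urls(text))
# ['http://vk.com/club123456768', 'https://vk.com/club42']
```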

Dominatrix

16.08.2017, 09:26

Here is the first pattern that came up in a search for universal URL matching:

Python

r'/^(https?://)?([\da-z.-]+)\.([a-z.]{2,6})([/\w .-]*)*/?$/'

I have not the slightest desire to figure it out

0x10

16.08.2017, 09:46

Quote from Dominatrix:

I have not the slightest desire to figure it out

What is there to figure out? Break it into groups:

Code

# Start of the string
^

# http(s), 0 or 1 times
(https?://)?

# A digit, a character in the range a-z, a dot, or a minus.
# The author intended this group to handle several domain levels,
# but it also matches invalid names with several dots in a row: g..gle
([\da-z.-]+)

# A dot
\.

# A character from a-z or a dot, in a sequence 2 to 6 long. The same problem with dots.
# The author's expectation: the domain zone.
([a-z.]{2,6})

# The author's expectation: words separated by slashes.
# In fact: sequences made up of word characters, slashes, dots, and hyphens.
([/\w .-]*)*

# Optional trailing slash
/?$
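
Python can keep a breakdown like this inside the pattern itself through the re.VERBOSE flag, which ignores whitespace and allows comments. The sketch below is a simplified variant, not the exact pattern from the post: the path group is written as a single group without the space or the nested repetition, and URL_VERBOSE is a made-up name. It also reproduces the problem with consecutive dots.

```python
import re

URL_VERBOSE = re.compile(r"""
    ^
    (https?://)?      # optional scheme
    ([\da-z.-]+)      # host part; also accepts consecutive dots
    \.                # a literal dot
    ([a-z.]{2,6})     # domain zone
    ([/\w.-]*)        # path (simplified from the original group)
    /?                # optional trailing slash
    $
""", re.VERBOSE)

ok = URL_VERBOSE.match("https://ru.wikipedia.org/wiki/")
print(ok.groups())   # ('https://', 'ru.wikipedia', 'org', '/wiki/')

bad = URL_VERBOSE.match("g..gle.com")
print(bad.groups())  # (None, 'g..gle', 'com', '')
```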

For visual thinkers

aalexandrov [topic starter]

16.08.2017, 11:22

0x10, thanks) there really is no '[url]' there

Added 55 seconds later
Рыжий Лис, VK complained about that right away) fixed it

Added 28 seconds later
Garry Galler, everything works perfectly. Thank you very much)

Added 26 seconds later
Dominatrix, it looks scary at first glance)

Рыжий Лис

16.08.2017, 13:51

Dominatrix, it is grouped differently there:

Python

#!/usr/bin/env python3
import re
 
t = '''
[url]https://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0[/url]
'''
r = re.findall(r'(https?://)?([\da-z.-]+)\.([a-z.]{2,6})([/\w .-]*)*/', t)
print (r)  # [('https://', 'ru.wikipedia', 'org', '')]

Source
