How do you validate a URL with a regular expression in Python?
- 2025-02-11 09:51:00
- admin (original post)
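Problem description:
I am building an app on Google App Engine. I am very new to Python and have been struggling with the following problem for the past 3 days.
I have a class that represents an RSS feed, and in this class I have a method called setUrl. The input to this method is a URL.
I am trying to use the re Python module to validate the URL against the RFC 3986 regex (http://www.ietf.org/rfc/rfc3986.txt).
Below is a snippet which should work?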
import re

# Regex from RFC 3986, Appendix B; note the escaped "\?" before the query group
p = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')
m = p.match(url)
if m:
    self.url = url
    return url
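The expression in RFC 3986, Appendix B is meant for splitting a URI reference into its components, not for rejecting bad input: every part is optional, so it matches any string. The sketch below illustrates this; the names URI_SPLIT and split_uri are hypothetical and not part of the question's code.

```python
import re

# Component-splitting regex from RFC 3986, Appendix B.
# The "?" that introduces the query must be escaped as "\?".
URI_SPLIT = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')


def split_uri(uri):
    """Return (scheme, authority, path, query, fragment) for any string."""
    m = URI_SPLIT.match(uri)  # never None: every component is optional
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


print(split_uri('http://www.ietf.org/rfc/rfc3986.txt'))
# ('http', 'www.ietf.org', '/rfc/rfc3986.txt', None, None)
print(split_uri('not a url at all'))
# (None, None, 'not a url at all', None, None)
```

Because the second call also "matches", a truthy match object on its own says nothing about whether the input is a usable feed URL.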
Solution 1:
Here is the complete regular expression for parsing a URL.
(?:https?://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?)
.)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d
+)){3}))(?::(?:d+))?)(?:/(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA
-Fd]{2}))|[;:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd
]{2}))|[;:@&=])*))*)(?:?(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd
]{2}))|[;:@&=])*))?)?)|(?:s?ftp://(?:(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),
]|(?:%[a-fA-Fd]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:
%[a-fA-Fd]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Z\nd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(
?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?))(?:/(?:(?:(?:(?:[a-zA-Zd$-
_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!
*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*))*)(?:;type=[AIDaid])?)?)|(?:news
:(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;/?:&=])+@(?:
(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:
(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3})))|(?:[a-zA
-Z](?:[a-zA-Zd]|[_.+-])*)|*))|(?:nntp://(?:(?:(?:(?:(?:[a-zA-Zd](?:
(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA
-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?)/(?:[a-zA-Z](?:[a-
zA-Zd]|[_.+-])*)(?:/(?:d+))?)|(?:telnet://(?:(?:(?:(?:(?:[a-zA-Zd$\n-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;?&=])*)(?::(?:(?:(?:[a-zA-Zd$-_.+!
*'(),]|(?:%[a-fA-Fd]{2}))|[;?&=])*))?@)?(?:(?:(?:(?:(?:[a-zA-Zd](?:(
?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-
Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?))/?)|(?:gopher://(?
:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z
](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?
:d+))?)(?:/(?:[a-zA-Zd$-_.+!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))(?:(?:
(?:[a-zA-Zd$-_.+!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))*)(?:%09(?:(?:(?:[
a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;:@&=])*)(?:%09(?:(?:[a-zA-
Zd$-_.+!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))*))?)?)?)?)|(?:wais://(?:(?
:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?
:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d
+))?)/(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)(?:(?:/(?:(?:[
a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)/(?:(?:[a-zA-Zd$-_.+!*'()
,]|(?:%[a-fA-Fd]{2}))*))|?(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-
Fd]{2}))|[;:@&=])*))?)|(?:mailto:(?:(?:[a-zA-Zd$-_.+!*'(),;/?:@&=]|
(?:%[a-fA-Fd]{2}))+))|(?:file://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-
Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))
|(?:(?:d+)(?:.(?:d+)){3}))|localhost)?/(?:(?:(?:(?:[a-zA-Zd$-_.+!
*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'()
,]|(?:%[a-fA-Fd]{2}))|[?:@&=])*))*))|(?:prospero://(?:(?:(?:(?:(?:[a-
zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd
]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d+))?)/(?:(?:(
?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*)(?:/(?:(?:(?
:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&=])*))*)(?:(?:;(?:(?:
(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&])*)=(?:(?:(?:[a-zA
-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[?:@&])*)))*)|(?:ldap://(?:(?:(?
:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](?
:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:d
+))?))?/(?:(?:(?:(?:(?:(?:(?:[a-zA-Zd]|%(?:3d|[46][a-fA-Fd]|[57][Aa
d]))|(?:%20))+|(?:OID|oid).(?:(?:d+)(?:.(?:d+))*))(?:(?:%0[Aa])?(
?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-
fA-Fd]{2}))*))(?:(?:(?:%0[Aa])?(?:%20)*)+(?:(?:%0[Aa])?(?:%20)*)(?:(
?:(?:(?:(?:[a-zA-Zd]|%(?:3d|[46][a-fA-Fd]|[57][Aad]))|(?:%20))+|(?
:OID|oid).(?:(?:d+)(?:.(?:d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[
Aa])?(?:%20)*))?(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)))*)
(?:(?:(?:(?:%0[Aa])?(?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))(?:(?:(?:
(?:(?:(?:[a-zA-Zd]|%(?:3d|[46][a-fA-Fd]|[57][Aad]))|(?:%20))+|(?:O
ID|oid).(?:(?:d+)(?:.(?:d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa
])?(?:%20)*))?(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*))(?:(?
:(?:%0[Aa])?(?:%20)*)+(?:(?:%0[Aa])?(?:%20)*)(?:(?:(?:(?:(?:[a-zA-Zd
]|%(?:3d|[46][a-fA-Fd]|[57][Aad]))|(?:%20))+|(?:OID|oid).(?:(?:d+
)(?:.(?:d+))*))(?:(?:%0[Aa])?(?:%20)*)=(?:(?:%0[Aa])?(?:%20)*))?(?:(
?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))*)))*))*(?:(?:(?:%0[Aa])?(
?:%20)*)(?:[;,])(?:(?:%0[Aa])?(?:%20)*))?)(?:?(?:(?:(?:(?:[a-zA-Zd$\n-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+)(?:,(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%
[a-fA-Fd]{2}))+))*)?)(?:?(?:base|one|sub)(?:?(?:((?:[a-zA-Zd$-_.+
!*'(),;/?:@&=]|(?:%[a-fA-Fd]{2}))+)))?)?)?)|(?:(?:z39.50[rs])://(?:(
?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?).)*(?:[a-zA-Z](
?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:.(?:d+)){3}))(?::(?:\nd+))?)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+)(?:+(?
:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+))*(?:?(?:(?:[a-zA-Zd
$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+))?)?(?:;esn=(?:(?:[a-zA-Zd$-_.+!*
'(),]|(?:%[a-fA-Fd]{2}))+))?(?:;rs=(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[
a-fA-Fd]{2}))+)(?:+(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))+
))*)?))|(?:cid:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;?
:@&=])*))|(?:mid:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[
;?:@&=])*)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[;?:
@&=])*))?)|(?:vemmi://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-
zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)
(?:.(?:d+)){3}))(?::(?:d+))?)(?:/(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?
:%[a-fA-Fd]{2}))|[/?:@&=])*)(?:(?:;(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?
:%[a-fA-Fd]{2}))|[/?:@&])*)=(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA
-Fd]{2}))|[/?:@&])*))*))?)|(?:imap://(?:(?:(?:(?:(?:(?:(?:[a-zA-Zd$\n-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[&=~])+)(?:(?:;[Aa][Uu][Tt][Hh]=(?:*|
(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[&=~])+))))?)|(?:(
?:;[Aa][Uu][Tt][Hh]=(?:*|(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-F\nd]{2}))|[&=~])+)))(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}
))|[&=~])+))?))@)?(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a-zA-Z
d])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+)(?:\n.(?:d+)){3}))(?::(?:d+))?))/(?:(?:(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]
|(?:%[a-fA-Fd]{2}))|[&=~:@/])+)?;[Tt][Yy][Pp][Ee]=(?:[Ll](?:[Ii][Ss][
Tt]|[Ss][Uu][Bb])))|(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{
2}))|[&=~:@/])+)(?:?(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}
))|[&=~:@/])+))?(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?
:[1-9]d*)))?)|(?:(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|
[&=~:@/])+)(?:(?:;[Uu][Ii][Dd][Vv][Aa][Ll][Ii][Dd][Ii][Tt][Yy]=(?:[1-9
]d*)))?(?:/;[Uu][Ii][Dd]=(?:[1-9]d*))(?:(?:/;[Ss][Ee][Cc][Tt][Ii][Oo
][Nn]=(?:(?:(?:[a-zA-Zd$-_.+!*'(),]|(?:%[a-fA-Fd]{2}))|[&=~:@/])+))
)?)))?)|(?:nfs:(?:(?://(?:(?:(?:(?:(?:[a-zA-Zd](?:(?:[a-zA-Zd]|-)*[a
-zA-Zd])?).)*(?:[a-zA-Z](?:(?:[a-zA-Zd]|-)*[a-zA-Zd])?))|(?:(?:d+
)(?:.(?:d+)){3}))(?::(?:d+))?)(?:(?:/(?:(?:(?:(?:(?:[a-zA-Zd$-_.
!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Zd$-_.!~*
'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*))*)?)))?)|(?:/(?:(?:(?:(?:(?:[a-zA
-Zd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA-Z\nd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*))*)?))|(?:(?:(?:(?:(?:[a
-zA-Zd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*)(?:/(?:(?:(?:[a-zA
-Zd$-_.!~*'(),])|(?:%[a-fA-Fd]{2})|[:@&=+])*))*)?)))
Given its complexity, I think you should go the urlparse way.
For completeness, here is the pseudo-BNF of the above regex (as documentation):
; The generic form of a URL is:
genericurl     = scheme ":" schemepart
; Specific predefined schemes are defined here; new schemes
; may be registered with IANA
url            = httpurl | ftpurl | newsurl |
                 nntpurl | telneturl | gopherurl |
                 waisurl | mailtourl | fileurl |
                 prosperourl | otherurl
; new schemes follow the general syntax
otherurl       = genericurl
; the scheme is in lower case; interpreters should use case-ignore
scheme         = 1*[ lowalpha | digit | "+" | "-" | "." ]
schemepart     = *xchar | ip-schemepart
; URL schemeparts for ip based protocols:
ip-schemepart  = "//" login [ "/" urlpath ]
login          = [ user [ ":" password ] "@" ] hostport
hostport       = host [ ":" port ]
host           = hostname | hostnumber
hostname       = *[ domainlabel "." ] toplabel
domainlabel    = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
toplabel       = alpha | alpha *[ alphadigit | "-" ] alphadigit
alphadigit     = alpha | digit
hostnumber     = digits "." digits "." digits "." digits
port           = digits
user           = *[ uchar | ";" | "?" | "&" | "=" ]
password       = *[ uchar | ";" | "?" | "&" | "=" ]
urlpath        = *xchar    ; depends on protocol, see section 3.1
; The predefined schemes:
; FTP (see also RFC959)
ftpurl         = "ftp://" login [ "/" fpath [ ";type=" ftptype ]]
fpath          = fsegment *[ "/" fsegment ]
fsegment       = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
ftptype        = "A" | "I" | "D" | "a" | "i" | "d"
; FILE
fileurl        = "file://" [ host | "localhost" ] "/" fpath
; HTTP
httpurl        = "http://" hostport [ "/" hpath [ "?" search ]]
hpath          = hsegment *[ "/" hsegment ]
hsegment       = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search         = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
; GOPHER (see also RFC1436)
gopherurl      = "gopher://" hostport [ / [ gtype [ selector
                 [ "%09" search [ "%09" gopher+_string ] ] ] ] ]
gtype          = xchar
selector       = *xchar
gopher+_string = *xchar
; MAILTO (see also RFC822)
mailtourl      = "mailto:" encoded822addr
encoded822addr = 1*xchar               ; further defined in RFC822
; NEWS (see also RFC1036)
newsurl        = "news:" grouppart
grouppart      = "*" | group | article
group          = alpha *[ alpha | digit | "-" | "." | "+" | "_" ]
article        = 1*[ uchar | ";" | "/" | "?" | ":" | "&" | "=" ] "@" host
; NNTP (see also RFC977)
nntpurl        = "nntp://" hostport "/" group [ "/" digits ]
; TELNET
telneturl      = "telnet://" login [ "/" ]
; WAIS (see also RFC1625)
waisurl        = waisdatabase | waisindex | waisdoc
waisdatabase   = "wais://" hostport "/" database
waisindex      = "wais://" hostport "/" database "?" search
waisdoc        = "wais://" hostport "/" database "/" wtype "/" wpath
database       = *uchar
wtype          = *uchar
wpath          = *uchar
; PROSPERO
prosperourl    = "prospero://" hostport "/" ppath *[ fieldspec ]
ppath          = psegment *[ "/" psegment ]
psegment       = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
fieldspec      = ";" fieldname "=" fieldvalue
fieldname      = *[ uchar | "?" | ":" | "@" | "&" ]
fieldvalue     = *[ uchar | "?" | ":" | "@" | "&" ]
; Miscellaneous definitions
lowalpha       = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
                 "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
                 "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
                 "y" | "z"
hialpha        = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
alpha          = lowalpha | hialpha
digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                 "8" | "9"
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
national       = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation    = "<" | ">" | "#" | "%" | <">
reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                 "a" | "b" | "c" | "d" | "e" | "f"
escape         = "%" hex hex
unreserved     = alpha | digit | safe | extra
uchar          = unreserved | escape
xchar          = unreserved | reserved | escape
digits         = 1*digit
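To show how a grammar like this turns into a regular expression, here is a minimal sketch that composes only the hostport production from its parts. The constant names and the is_hostport helper are hypothetical; the sub-patterns mirror the host fragment that recurs throughout the large regex, but this is an illustration rather than the generator that produced it.

```python
import re

# One pattern per BNF production, composed bottom-up.
ALPHA = r"[a-zA-Z]"
ALPHADIGIT = r"[a-zA-Z\d]"
# domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
DOMAINLABEL = rf"{ALPHADIGIT}(?:(?:{ALPHADIGIT}|-)*{ALPHADIGIT})?"
# toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
TOPLABEL = rf"{ALPHA}(?:(?:{ALPHADIGIT}|-)*{ALPHADIGIT})?"
# hostname = *[ domainlabel "." ] toplabel
HOSTNAME = rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}"
# hostnumber = digits "." digits "." digits "." digits
HOSTNUMBER = r"\d+(?:\.\d+){3}"
# host = hostname | hostnumber ; hostport = host [ ":" port ]
HOST = rf"(?:{HOSTNAME}|{HOSTNUMBER})"
HOSTPORT_RE = re.compile(rf"{HOST}(?::\d+)?")


def is_hostport(text):
    """True if the whole string matches the hostport production."""
    return HOSTPORT_RE.fullmatch(text) is not None


print(is_hostport("www.example.com:8080"))  # True
print(is_hostport("-bad-.com"))             # False: a label cannot start with "-"
```

Building the pattern from named pieces keeps each production checkable against the grammar, which is much harder once everything has been flattened into a single expression.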
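Solution 2:
An easy way to parse (and validate) URLs is the urlparse module (urlparse in Python 2, urllib.parse in Python 3).
A regex is too much work.
There is no "validate" method because almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.
Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.
For example, ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.
Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///", which is equivalent.
Something like "bad://///worse/////" is perfectly valid. Dumb but valid.
Bottom line: parse it, and look at the pieces to see if they are displeasing in some way.
Do you want the scheme to always be "http"? Do you want the netloc to always be "www.somename.somedomain"? Do you want the path to look Unix-like? Or Windows-like? Do you want to remove the query string? Or preserve it?
These are not RFC-specified validations. These are validations unique to your application.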
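The claims about those odd-looking strings can be checked directly. This is a small sketch using urlsplit from Python 3's urllib.parse; the describe helper is a hypothetical name introduced only for the demonstration.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return the (scheme, netloc, path) that urlsplit assigns to url."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


# None of these raise an error; the parser simply splits what it is given.
for candidate in (':::::', '/////', 'bad://///worse/////'):
    print(repr(candidate), '->', describe(candidate))
```

Each string parses without complaint, so any rejection has to come from rules the application applies to the resulting pieces.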
Solution 3:
I am using the one used by Django, and it seems to work pretty well:
import re

def is_valid_url(url):
    regex = re.compile(
        r'^https?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)
    return url is not None and regex.search(url)
You can always check the latest version here: https://github.com/django/django/blob/master/django/core/validators.py#L74
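Solution 4:
I admit, I find your regular expression totally incomprehensible. I wonder if you could use urlparse instead? Something like:
import string
from urllib.parse import urlparse  # Python 2: import urlparse

pieces = urlparse(url)
assert all([pieces.scheme, pieces.netloc])
assert set(pieces.netloc) <= set(string.ascii_letters + string.digits + '-.')  # and others?
assert pieces.scheme in ['http', 'https', 'ftp']  # etc.
It might be slower, and maybe you will miss conditions, but it seems (to me) a lot easier to read and debug than a regular expression for URLs.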
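The same checks can be packaged as a function that returns a boolean, which avoids relying on assert statements (they are skipped when Python runs with -O). This is a minimal sketch for Python 3; the function name, the allowed schemes, and the allowed host characters are assumptions to adapt per application.

```python
import string
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {'http', 'https', 'ftp'}  # assumption: adjust as needed
ALLOWED_HOST_CHARS = set(string.ascii_letters + string.digits + '-.')


def is_acceptable_url(url):
    """Apply application-specific checks to the parsed pieces of url."""
    pieces = urlsplit(url)
    if pieces.scheme not in ALLOWED_SCHEMES:
        return False
    host = pieces.hostname  # netloc without userinfo and port, lowercased
    if not host:
        return False
    return set(host) <= ALLOWED_HOST_CHARS


print(is_acceptable_url('http://example.com:8080/feed.xml'))  # True
print(is_acceptable_url('mailto:someone@example.com'))        # False
```

Note that a host such as "----" still passes these character checks, so stricter rules (label shape, a required dot, and so on) would have to be added if the application needs them.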
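Solution 5:
Nowadays, in 90% of cases, if you are working with URLs in Python you are probably using python-requests. Hence the question here: why not reuse the URL validation from requests?
from requests.models import PreparedRequest
import requests.exceptions

def check_url(url):
    prepared_request = PreparedRequest()
    try:
        prepared_request.prepare_url(url, None)
        return prepared_request.url
    except requests.exceptions.MissingSchema as e:
        raise SomeException  # placeholder: substitute your application's own exception
Features:
Don't reinvent the wheel
DRY
Works offline
Minimal resources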
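Solution 6:
urlparse quite happily accepts invalid URLs; it is more a string-splitting library than any kind of validator. For example:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse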
urlparse('http://----')
# returns: ParseResult(scheme='http', netloc='----', path='', params='', query='', fragment='')
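Depending on the situation, this might be fine.
If you mostly trust the data and just want to verify that the protocol is HTTP, then urlparse is perfect.
If you want to make sure the URL is actually a legal URL, use the ridiculous regex.
If you want to make sure it is a real web address:
import urllib.request
try:
    urllib.request.urlopen(url)
except (OSError, ValueError):  # URLError is an OSError; ValueError covers unknown URL types
    print("Not a real URL")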
Solution 7:
http://pypi.python.org/pypi/rfc3987 provides regular expressions consistent with the rules in RFC 3986 and RFC 3987 (that is, not with scheme-specific rules).
The regular expression for IRI_reference is:
(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*):(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[
a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU0002
0000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU
00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009ff
fdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U00
0dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\n[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4]
[0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0
-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]
?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(
?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|
[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F
]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?
:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[
0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:
[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3
}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,
4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0
-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-
9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]
|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|
(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6
}[0-9A-F]{1,4})?::|v[0-9A-F]+\\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\\]|(?:(?:(
?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][
0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-\nU0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU000500
00-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00
090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffd
U000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(
?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/uf
dcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffd\nU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007f
ffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U0
00bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-
F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7
ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000
-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU0007
0000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU
000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000eff
fd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff
/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-\nU0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU000700
00-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU00
0b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd
])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[\nxa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU
00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006ff
fdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U00
0afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-
U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa
0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00
030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffd
U00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000a
fffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U
000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\\?(?P<iquery
>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U000
1fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-\nU0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU000900
00-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU00
0d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[\nue000-/uf8ffU000f0000-U000ffffdU00100000-U0010fffd]|/|\\?)*))?(?:\\#(?P<ifra
gment>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-
U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050
000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU0
0090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfff
dU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|
@)|/|\\?)*))?|(?:(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[xa
0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00
030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffd
U00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000a
fffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U
000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\\[(?:(?:[0-9A-F]{1,
4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-
9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:
[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3
}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4
}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\
.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1
,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-
4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?
:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A
-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][
0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1
,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]
?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-
9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[
0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}
:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|
v[0-9A-F]+\\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\\]|(?:(?:(?:25[0-5]|2[0-4][0-
9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA
-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000
-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU0006
0000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU
000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dff
fdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*)
)?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU0
0010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fff
dU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U000
8fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-\nU000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*
+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufd
f0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU000400
00-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00
080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffd
U000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A
-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0
-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000
-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU0008
0000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU
000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F
]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/u
fdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffd
U00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007
fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U
000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A
-F][0-9A-F]|[!$&'()*+,;=]|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf
/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00
040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffd
U00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000b
fffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][
0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9.
_~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U000
2fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-\nU0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a00
00-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU00
0e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[/ue000-/uf8ffU000f000
0-U000ffffdU00100000-U0010fffd]|/|\\?)*))?(?:\\#(?P<ifragment>(?:(?:(?:[a-zA-
Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-
U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060
000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU0
00a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfff
dU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\\?)*))?)
In one line:
(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*):(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|v[0-9A-F]+\\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\\]|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU000
30000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU0008000
0-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[/ue000-/uf8ffU000f0000-U000ffffdU00100000-U0010fffd]|/|\\?)*))?(?:\\#(?P<ifragment>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|/|\\?)*))?|(?:(?://(?P<iauthority>(?:(?P<iuserinfo>(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:)*)@)?(?P<ihost>\\[(?:(?:[0-9A-F]{1,4}:){6}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|::(?:[0-9A-F]{1,4}:){5}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|[0-9A-F]{1,4}?::(?:[0-9A-F]{1,4}:){4}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:)?[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){3}(?:[0-9A-F]{1,4}:[0-9A-F]{
1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,2}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:){2}(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,3}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:)(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,4}[0-9A-F]{1,4})?::(?:[0-9A-F]{1,4}:[0-9A-F]{1,4}|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)))|(?:(?:[0-9A-F]{1,4}:){,5}[0-9A-F]{1,4})?::[0-9A-F]{1,4}|(?:(?:[0-9A-F]{1,4}:){,6}[0-9A-F]{1,4})?::|v[0-9A-F]+\\.(?:[a-zA-Z0-9_.~-]|[!$&'()*+,;=]|:)+)\\]|(?:(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=])*)(?::(?P<port>[0-9]*))?)(?P<ipath>(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>/(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000eff
fd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)?)|(?P<ipath>(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|@)+(?:/(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)*)*)|(?P<ipath>))(?:\\?(?P<iquery>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()*+,;=]|:|@)|[/ue000-/uf8ffU000f0000-U000ffffdU00100000-U0010fffd]|/|\\?)*))?(?:\\#(?P<ifragment>(?:(?:(?:[a-zA-Z0-9._~-]|[xa0-/ud7ff/uf900-/ufdcf/ufdf0-/uffefU00010000-U0001fffdU00020000-U0002fffdU00030000-U0003fffdU00040000-U0004fffdU00050000-U0005fffdU00060000-U0006fffdU00070000-U0007fffdU00080000-U0008fffdU00090000-U0009fffdU000a0000-U000afffdU000b0000-U000bfffdU000c0000-U000cfffdU000d0000-U000dfffdU000e1000-U000efffd])|%[0-9A-F][0-9A-F]|[!$&'()
*+,;=]|:|@)|/|\\?)*))?)
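Solution 8:
Note: Lepl is no longer maintained or supported.
RFC 3696 defines "best practices" for URL validation - http://www.faqs.org/rfcs/rfc3696.html
The latest release of Lepl (a Python parser library) includes an implementation of RFC 3696. You would use it something like this: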
from lepl.apps.rfc3696 import Email, HttpUrl
# compile the validators (do once at start of program)
valid_email = Email()
valid_http_url = HttpUrl()
# use the validators (as often as you like)
if valid_email(some_email):
    pass  # email is ok
else:
    pass  # email is bad
if valid_http_url(some_url):
    pass  # url is ok
else:
    pass  # url is bad
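Although the validators are defined in Lepl, which is a recursive descent parser, they are largely compiled internally to regular expressions. That combines the best of both worlds: a (relatively) easy-to-read definition that can be checked against RFC 3696, and an efficient implementation. There is a post on my blog showing how this simplifies the parser - http://www.acooke.org/cute/LEPLOptimi0.html
Lepl is available at http://www.acooke.org/lepl and the documentation for the RFC 3696 module is at http://www.acooke.org/lepl/rfc3696.html
This is completely new in this release, so it may contain bugs. Please contact me if you have any problems and I will fix them as soon as possible. Thanks.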
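Solution 9:
The regex provided should match any URL of the form http://www.ietf.org/rfc/rfc3986.txt, and it does when tested in the Python interpreter.
What format are the URLs you have been having trouble parsing?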
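For reference, here is a minimal sketch of the generic pattern from RFC 3986, Appendix B, with the `?` before the query group escaped. Every component in it is optional, so it splits a string into parts rather than validating it, and it will match almost any input. The names `URI_SPLIT` and `split_uri` are illustrative and not part of any library.

```python
import re

# Generic URI-splitting regex from RFC 3986, Appendix B.
# Note the escaped "\?" that introduces the query group.
URI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Return the five RFC 3986 components as a dict (missing parts are None)."""
    # Every part of the pattern is optional, so match() always succeeds.
    m = URI_SPLIT.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_uri("http://www.ietf.org/rfc/rfc3986.txt")
print(parts["scheme"], parts["authority"], parts["path"])

# A string that is clearly not a URL still "matches": it is all path.
print(split_uri("not a url"))
```

Since a successful match says nothing about validity, a check built on this pattern would at least need to confirm afterwards that the scheme and authority are present and acceptable.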
Solution 10:
Modified Django URL validation regex:
import re
ul = "\u00a1-\uffff"  # Unicode letters range (must not be a raw string).
# IP patterns
ipv4_re = (
    r"(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)"
    r"(?:\.(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)){3}"
)
ipv6_re = r"\[[0-9a-f:.]+\]"  # (simple regex, validated later)
# Host patterns
hostname_re = (
    r"[a-z" + ul + r"0-9](?:[a-z" + ul + r"0-9-]{0,61}[a-z" + ul + r"0-9])?"
)
# Max length for domain name labels is 63 characters per RFC 1034 sec. 3.1
domain_re = r"(?:\.(?!-)[a-z" + ul + r"0-9-]{1,63}(?<!-))*"
tld_re = (
    r"\."  # dot
    r"(?!-)"  # can't start with a dash
    r"(?:[a-z" + ul + "-]{2,63}"  # domain label
    r"|xn--[a-z0-9]{1,59})"  # or punycode label
    r"(?<!-)"  # can't end with a dash
    r"\.?"  # may have a trailing dot
)
host_re = "(" + hostname_re + domain_re + tld_re + "|localhost)"
regex = re.compile(
    r"^(?:http|ftp)s?://"  # http(s):// or ftp(s)://
    r"(?:[^\s:@/]+(?::[^\s:@/]*)?@)?"  # user:pass authentication
    r"(?:" + ipv4_re + "|" + ipv6_re + "|" + host_re + ")"
    r"(?::[0-9]{1,5})?"  # port
    r"(?:[/?#][^\s]*)?"  # resource path
    r"\Z",
    re.IGNORECASE,
)
Source: https://github.com/django/django/blob/master/django/core/validators.py#L74
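The IPv4 part of that pattern can be tried on its own. The sketch below isolates the octet alternation (0 to 255, no leading zeros) and joins four octets with escaped dots. The names `OCTET`, `IPV4` and `is_ipv4` are illustrative and do not come from Django.

```python
import re

# One decimal octet, 0-255, without leading zeros.
OCTET = r"(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)"
# Four octets separated by literal dots.
IPV4 = re.compile(OCTET + r"(?:\." + OCTET + r"){3}")


def is_ipv4(text):
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None


for candidate in ["192.168.0.1", "255.255.255.255", "256.1.1.1", "01.2.3.4", "1.2.3"]:
    print(candidate, is_ipv4(candidate))
```

Using `fullmatch` here plays the same role as the `^` ... `\Z` anchors in the full expression: a prefix such as `25` in `256.1.1.1` must not count as a match.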
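Solution 11:
I've needed to do this many times over the years, and I always end up copying the regular expression of someone who has thought about it far more than I want to.
Having said that, there is a regex in the Django forms code that should do the trick:
http://code.djangoproject.com/browser/django/trunk/django/forms/fields.py#L534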
Solution 12:
import re
urlfinders = [
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]"),
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?"),
    re.compile("(~/|/|\\./)([-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]|\\\\)+"),
    re.compile("'\\<((mailto:)|)[-A-Za-z0-9\\.]+@[-A-Za-z0-9\\.]+"),
]
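Note: it looks ugly in the browser, but if you copy and paste it the formatting should come out fine.
Found on the Python mailing list, where it was used for gnome-terminal.
Source: http://mail.python.org/pipermail/python-list/2007-January/595436.html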
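Patterns like these are meant for finding URLs inside free text rather than validating a single string. The sketch below shows that usage with `finditer`. To stay short, it uses one deliberately simplified pattern as a stand-in for the list above. Both that pattern and the names `url_patterns` and `find_urls` are assumptions for illustration.

```python
import re

# Simplified stand-in for a list of URL-finding patterns (an assumption
# for illustration; it only recognises http, https and ftp URLs).
url_patterns = [
    re.compile(
        r"(?:https?|ftp)://[-A-Za-z0-9.]+"         # scheme and host
        r"(?::[0-9]+)?"                            # optional port
        r"(?:/[-A-Za-z0-9_$.+!*(),;:@&=?/~#%]*)?"  # optional path
    ),
]


def find_urls(text):
    """Collect every substring of text matched by any of the patterns."""
    found = []
    for pattern in url_patterns:
        for match in pattern.finditer(text):
            found.append(match.group(0))
    return found


print(find_urls("see http://example.com/a and ftp://host:21/x for details"))
```

A finder like this is intentionally permissive, so the strings it returns are candidates that may still need a stricter check before use.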
Solution 13:
A simple approach:
import re
def is_valid_url(url):
    regex = re.compile(
        r'^(?:http|ftp)s?://'  # http(s):// or ftp(s)://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)
    return regex.match(url) is not None
Examples:
print(is_valid_url("http://www.example.com")) # True
print(is_valid_url("https://example.com/path")) # True
print(is_valid_url("ftp://example.com")) # True
print(is_valid_url("://example.com")) # False
print(is_valid_url("http:///example.com")) # False