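Domain Name Handling Tips in Python

String-processing problems come up all the time in Python, and network requests in particular constantly involve URLs or domain names. This article focuses on techniques for handling URLs and domain names in Python, using the following Python libraries: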

urlparse (urllib.parse in Python 3)
tldextract
dnspython
domaintools

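Validating Domain Names

To check whether a domain name is valid, you can use the domaintools package, which can also parse the name. For example, the code below checks validity purely from the structure of the name (that is, it does not verify that the host can be pinged), and it also returns the subdomain and the suffix: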

>>> from domaintools import Domain
>>> d = Domain('www.example.com')

>>> d.valid
True

>>> d.domain
u'example.com'

>>> d.subdomain
u'www'

>>> d.tld
u'com'

>>> d.sld
u'example'

For more, see Domain parsing with Python.
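As a rough illustration of what a purely structural check involves, here is a minimal sketch that uses only the standard library. It is not how domaintools works internally; the function name is_valid_hostname and the exact rules (labels of 1 to 63 letters, digits, or hyphens, at most 253 characters overall, at least two labels, no all-numeric last label) are assumptions chosen for this example.

```python
import re

# One DNS label: 1-63 letters, digits, or hyphens, with no leading or trailing hyphen.
_LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name):
    """Return True if `name` looks like a well-formed domain name.

    The check is purely structural: no DNS lookup is performed.
    """
    if name.endswith("."):
        name = name[:-1]  # accept a single trailing dot (fully qualified form)
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    if len(labels) < 2:
        return False  # require at least "domain.suffix"
    if labels[-1].isdigit():
        return False  # an all-numeric last label looks like an IPv4 address
    return all(_LABEL_RE.fullmatch(label) for label in labels)
```

A check like this accepts names such as www.example.com and rejects ones such as -bad.example.com or exa_mple.com, but it says nothing about whether the suffix actually exists.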

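Domain Parsing

This mainly means extracting the host (that is, the requested domain name), the scheme, the path, and even the parameters from a given URL. For example, from http://blog.ourren.com/2015/04/14/ip-information-with-python/ we want to get "blog.ourren.com", "http", and "/2015/04/14/ip-information-with-python/".

This kind of task is usually handled with urlparse (urllib.parse.urlparse in Python 3); the pieces you need can then be picked out of the parsed result and recombined. Sample code: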

from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse
print(urlparse('http://blog.ourren.com/2015/04/14/ip-information-with-python/'))
ParseResult(scheme='http', netloc='blog.ourren.com', path='/2015/04/14/ip-information-with-python/', params='', query='', fragment='')
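To make the "pick out and recombine" step concrete, the sketch below gathers the commonly needed fields into a dictionary and rebuilds the scheme-plus-host prefix. The helper name split_url and the dictionary keys are made up for this example; urlparse, parse_qs, and urlunparse are standard-library functions in urllib.parse.

```python
from urllib.parse import parse_qs, urlparse, urlunparse


def split_url(url):
    """Break a URL into the pieces most often needed when handling requests."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,           # lowercased, without port or credentials
        "port": parts.port,               # None when the URL has no explicit port
        "path": parts.path,
        "params": parse_qs(parts.query),  # each key maps to a list of values
        # Recombine only scheme and netloc to get the "origin" of the URL.
        "origin": urlunparse((parts.scheme, parts.netloc, "", "", "", "")),
    }


info = split_url("http://blog.ourren.com/2015/04/14/ip-information-with-python/")
print(info["scheme"], info["host"], info["path"])
```

Using hostname rather than netloc is a deliberate choice here: netloc still contains any port or user information, which is usually not wanted when comparing domain names.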

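Primary Domain

This means getting the primary (registered) domain of a name, with the subdomain stripped off, together with the domain suffix. For example, from http://blog.ourren.com/2015/04/14/ip-information-with-python we want to get "ourren.com" and "com".

This kind of task can be handled with tldextract, which splits a name into subdomain, domain, and suffix based on the Public Suffix List and is convenient to use. Sample code: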

import tldextract
tldextract.extract("http://blog.ourren.com/2015/04/14/ip-information-with-python/")

ExtractResult(subdomain='blog', domain='ourren', suffix='com')
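The three fields above still have to be joined to obtain "ourren.com". The sketch below shows the idea with the standard library only: it matches the longest known suffix and then joins the domain and suffix. It is a toy stand-in, not tldextract's implementation; KNOWN_SUFFIXES is a hypothetical, hand-picked set, and the function names are invented for this example. A real suffix list is far larger, which is the main reason to rely on a library.

```python
from urllib.parse import urlparse

# Hypothetical, tiny stand-in for a real public suffix list.
KNOWN_SUFFIXES = {"com", "org", "net", "cn", "com.cn", "uk", "co.uk"}


def split_host(url):
    """Return (subdomain, domain, suffix) for a URL that includes a scheme."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Scan from the longest candidate suffix so that "co.uk" wins over "uk".
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[:i - 1]) if i > 1 else ""
            return subdomain, domain, suffix
    return "", host, ""  # no known suffix matched


def registered_domain(url):
    """Join domain and suffix, e.g. 'ourren.com'; empty string if either is missing."""
    _, domain, suffix = split_host(url)
    return "{}.{}".format(domain, suffix) if domain and suffix else ""


print(registered_domain("http://blog.ourren.com/2015/04/14/ip-information-with-python/"))
```

Simply taking the last two labels would give the wrong answer for multi-part suffixes such as co.uk, which is why the lookup is driven by a suffix list.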

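Subdomains

There are several main approaches to finding the subdomains of a given domain: brute-force resolution based on a dictionary; deduplicating and analyzing search engine results; exploiting DNS zone transfer vulnerabilities; and filtering global website datasets. Many open-source tools are also available for reference, and SecWiki's roundup of second-level domain search tools collects them, so they are not discussed in detail here.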
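Of the approaches listed, the dictionary-based one is the simplest to sketch. The example below is a minimal illustration using the standard library's socket.gethostbyname; the function name brute_force_subdomains, the resolve parameter, and the word list are assumptions made for this example and do not come from any of the tools mentioned above.

```python
import socket


def brute_force_subdomains(domain, words, resolve=socket.gethostbyname):
    """Return {hostname: ip} for every candidate 'word.domain' that resolves.

    `resolve` can be swapped out, for example to plug in a different resolver.
    """
    found = {}
    for word in words:
        host = "{}.{}".format(word, domain)
        try:
            found[host] = resolve(host)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            continue
    return found


if __name__ == "__main__":
    # Hypothetical word list; real dictionaries contain thousands of entries.
    print(brute_force_subdomains("example.com", ["www", "mail", "blog"]))
```

If the target zone has a wildcard record, every candidate will appear to resolve, so results from a sketch like this would need to be compared against the address returned for a random, nonexistent label.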

Ourren
