Urlnorm
Passes the tests defined in PaceCanonicalIds. Passes all but three of the tests defined in MNot's urlnorm.py, as I interpret the specs differently for these three.
Only exercised significantly for http URIs.
Testcases welcome.
It’s just data
License?
Posted by Mark

I want to check with mnot first. Default_ports and the second set of tests are the only substantial reuse from his codebase. My preference is the Python License.
Posted by Sam Ruby

Hi Sam,
Interesting. Given that you're shooting for RFC2396bis, which gets rid of separate path params, you should probably use urlparse.urlsplit instead of urlparse.urlparse (and likewise urlunsplit instead of urlunparse).
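To make the difference concrete, here is a small sketch. It assumes Python 3, where both functions live in urllib.parse rather than the urlparse module discussed here, and the example URL is made up. urlparse splits a ";params" component off the last path segment, while urlsplit leaves it in the path:

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

url = "http://example.com/a;b?c=d"  # hypothetical URL with a path parameter

six = urlparse(url)    # 6-tuple: ";b" is split off into .params
five = urlsplit(url)   # 5-tuple: ";b" stays part of .path

print(six.path, six.params)      # /a b
print(five.path)                 # /a;b
print(urlunsplit(five) == url)   # True
```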
This line:
(auth,host,port)=re.search('([^@]*@)?([^:]*):?(.*)',auth).groups()
seems to find (userinfo, host, port); is that what you meant?
Similarly what's going on here?
if auth=="@": auth=""
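For readers following along, this sketch runs the quoted pattern on a few authority strings. It assumes Python 3; the helper name split_authority and the sample inputs are hypothetical, only the pattern itself comes from the line quoted above.

```python
import re

# Pattern copied from the line quoted above.
AUTHORITY = re.compile(r'([^@]*@)?([^:]*):?(.*)')

def split_authority(authority):
    """Return the three groups the pattern captures."""
    return AUTHORITY.search(authority).groups()

for sample in ("example.com", "user@example.com:8080",
               "@example.com", ":@example.com"):
    print(sample, split_authority(sample))
# example.com (None, 'example.com', '')
# user@example.com:8080 ('user@', 'example.com', '8080')
# @example.com ('@', 'example.com', '')
# :@example.com (':@', 'example.com', '')
```

The first group keeps its trailing "@", which would explain a comparison against "@" for empty userinfo. An empty user and password, ":@", would not be caught by that same comparison.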
WRT atom:id, why go through all this when you can just do a lexical compare on the strings; if they're just IDs, why do they need to be normalised at this level?
WRT license, just ack me and link to the original; license yours however you like. I'm also amenable to folding changes back in if you like.
Posted by Mark Nottingham

Possibly this is outside of the scope you're imagining for the function, but in the context of normalizing urls 'in the wild', it would be useful to strip() whitespace and line returns from the url as a whole and also perhaps in the parsed fragments as well. This would resolve cases like:
"http:// www.mysite.com" and
"""http://www.mysite.com
"""
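A minimal sketch of such a cleanup step, assuming Python 3. The function name strip_whitespace is hypothetical and this is not part of urlnorm. Note that it removes embedded spaces outright, which is wrong for any URL where the space was meant to become %20:

```python
def strip_whitespace(url):
    """Remove every whitespace character, including embedded spaces and newlines."""
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(url.split())

print(strip_whitespace("http:// www.mysite.com"))     # http://www.mysite.com
print(strip_whitespace("http://www.mysite.com\n"))    # http://www.mysite.com
```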
Posted by Phil McCluskey

This is going to be difficult since your comment parser will munge these, but here goes:
Is your intention to produce an ultra-liberal URL normalizer? If not, you'll need some defensive code to guard against invalid URLs, such as ones that include unescaped high-bit characters.
Posted by http://diveintomark.org/

Wow, that didn't work at all. Let's try without the scheme. These are all http URLs:
@example.com/
:@example.com/
127.0.0.1/
127.0.0.1:80/
OK, I've updated urlnorm based on the feedback above.
The way http://:@example.com/ was handled previously was a bug; it is now normalized to http://example.com/. Unless I'm missing something, http://127.0.0.1/ is correctly normalized.
I'd like non-ASCII characters in URIs to be handled the same way a typical query works: these characters are escaped.
My comment parser won't munge URIs found in code.
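As an illustration of escaping non-ASCII characters, here is a sketch that assumes Python 3 and UTF-8 as the encoding. The helper name escape_non_ascii is hypothetical and this is not the code urlnorm uses. It leaves every ASCII character alone and percent-encodes the UTF-8 bytes of everything else:

```python
from urllib.parse import quote

def escape_non_ascii(url):
    """Percent-encode the UTF-8 bytes of each non-ASCII character."""
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in url)

print(escape_non_ascii("http://example.com/caf\u00e9"))
# http://example.com/caf%C3%A9
```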
I reckon maybe canonicalization isn't such a good idea after all. Too fiddly. I didn't think so before, but after looking at the source, there's an awful lot for someone to get wrong in those 200 loc, and then what if there's any minor change to RFC 2396? Unless someone's prepared to maintain a public domain repository of normalizers for every language under the sun, the additional complexity is asking for non-compliant feeds.
Posted by Danny

Danny, well then perhaps you shouldn't look at the source. ;-)
Seriously: can you name one URI on your site that is not normalized? Actually, looking at your feed, I can name exactly one: http://dannyayers.com, and that particular one is not likely to appear as an entry id.
My plans are to add this code to the feedvalidator - whether this results in error, warning, or informational messages. So, I'm quite prepared to worry about the "fiddly bits", but I seriously doubt that many other people will have to.
P.S. Most of the 200 loc are comments, tests, and blank lines.
Posted by Sam Ruby

Should also replace backslashes in the path with slashes:
r"http://example.com\test.html" --> "http://example.com/test.html"
r"http://example.com/a\test.html" --> "http://example.com/a/test.html"
r"http://example.com\a\test.html" --> "http://example.com/a/test.html"
No, backslashes should be translated to %5C.
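The two positions can be put side by side in a short sketch, assuming Python 3. The variable names are made up, and neither line is taken from urlnorm itself:

```python
from urllib.parse import quote

raw_path = r"/a\test.html"

# Strict reading: a backslash is ordinary data, so percent-encode it.
strict = raw_path.replace("\\", quote("\\"))
# "Fixer" reading: treat the backslash as a mistyped path separator.
lenient = raw_path.replace("\\", "/")

print(strict)    # /a%5Ctest.html
print(lenient)   # /a/test.html
```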
Posted by Sam Ruby

Looks good so far, but you could add a "bozo" mode to the parser that handles backslashes and unintended spaces as requested above. Just a thought.
Posted by Asbjørn Ulsberg

This is what I meant when I asked about whether you were planning on making this an ultra-liberal normalizer. There are lots of goofy things you can do to URLs that work in IE, or particular versions of Netscape, or something.
Posted by Mark

There are a number of "goofy" things that are legal. My intent is not to mimic IE, Netscape, etc., but to faithfully implement the rules listed in the initial comment in this source file. Of course, people are free to compare the output of this function with the input to see if anything changed, and may choose to make value judgments based on this. In fact, when I originally wrote this function, it was my intent for the feedvalidator to do exactly that.
Suggestions for new rules, comments on the existing rules, and testcases are all welcome.
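The compare-output-with-input idea might look like the sketch below. It assumes Python 3 and uses a deliberately tiny stand-in normalizer: normalize, is_normalized, and the two-entry DEFAULT_PORTS table are all hypothetical, cover only case, default ports, and the empty path, and silently drop userinfo. It is not urlnorm.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}  # illustrative subset only

def normalize(url):
    """Lowercase scheme and host, drop a default port, use "/" for an empty path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""          # .hostname is already lowercased
    netloc = host                        # simplification: userinfo is discarded
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

def is_normalized(url):
    """True when normalizing the URL leaves it unchanged."""
    return normalize(url) == url

print(normalize("HTTP://Example.COM:80"))        # http://example.com/
print(is_normalized("http://dannyayers.com"))    # False
print(is_normalized("http://127.0.0.1/"))        # True
```

A validator could report any URL for which is_normalized returns False without rewriting it.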
Posted by Sam Ruby

Hey, this is a nice article! I took the mentioned test cases (and some more) and adapted them to my own URL "normalizer".
But it is more a URL "fixer" than a normalizer, so RFC compliance is not guaranteed :) For example, I replace backslashes with slashes in the path part (as suggested above). This fixes some broken Windows-ish URL paths.
Implementation (as url_norm()) and unit tests.
[link]...
Excerpt from del.icio.us/jonas/webstandards

Hey, Sam. Mind if we use this in iPodder?
Regards,
Garth.
Garth: you are welcome to do so. My understanding is that the Python license is GPL compatible.
Posted by Sam Ruby

Sam Ruby: Urlnorm by benoit python url [link]...
Excerpt from Public marks from user benoit

Because this page is a top result for Google searches on the topic, I think it’s worth mentioning some work that has been done on URL normalization with Python that goes beyond urlencoding space characters. For example, dealing with default ports,...
Excerpt from How can I normalize a URL in python - Stack Overflow

In my environment (Python 2.5.1, urllib 1.17), I had one offending http:// URL:
fer-martin.com/flying-suit-up/
that returned with an exception:
TypeError: decoding Unicode is not supported
I used to pass the URLs to be normalised as url.encode('utf-8'), so they are <type 'str'>, but this one kept being <type 'unicode'> even after .encode(). Something was going wrong with the unquote(string) call. I changed clean() as follows:
import unicodedata
from urllib import unquote  # Python 2

def clean(string):
    string = unquote(string)
    # unquote() can hand back unicode; force a UTF-8 byte string first
    if isinstance(string, unicode):
        print 'changing from unicode to str'
        string = string.encode('utf-8')
    string = str(string)
    string = unicode(string, 'utf-8', 'replace')
    return unicodedata.normalize('NFC', string).encode('utf-8')
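For comparison, a rough Python 3 counterpart is sketched below, under the assumption that the goal is the same: percent-decode, then apply NFC. In Python 3 unquote always returns str, so the unicode/str juggling above disappears. Unlike the Python 2 version, this sketch returns text rather than UTF-8 bytes.

```python
import unicodedata
from urllib.parse import unquote

def clean(text):
    """Percent-decode as UTF-8 (invalid bytes become U+FFFD), then apply NFC."""
    decoded = unquote(text, encoding="utf-8", errors="replace")
    return unicodedata.normalize("NFC", decoded)

# "e" + combining acute accent (U+0301) composes to a single U+00E9 under NFC.
print(clean("cafe%CC%81") == clean("caf%C3%A9"))   # True
```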
Bandwidth is often one of the first bottlenecks you’ll hit when web crawling. So, it’s in your best interest to crawl each page only once (ignoring recrawls). In order to know that you’ve already crawled a page you need to keep an...
Excerpt from X-Combinator

Bandwidth is often one of the first bottlenecks you’ll hit when web crawling. So, it’s in your best interest to crawl each page only once (ignoring recrawls). In order to know that you’ve already crawled a page you need to keep an...
Excerpt from Eigenjoy