Truncate data urls #219
|
In my opinion, data: URLs should not be collected at all by linkcheck at the Linklist level, as they don't point to any checkable content. Thoughts? |
|
I agree, as long as the intention of linkcheck is to track clickable links. In integreat-cms we kind of abuse it to track other stuff as well – though not through data URLs, so your suggestion would not impact us, but now that I think about it, that might have been a cleaner way to do things in our case. In any case, since linkcheck needs to bring the infrastructure to track links in content in order to serve its main purpose, it is convenient to use that as an index of where arbitrary URLs are used and thus might impact people. |
|
Would you be OK with #221? (current test failure is unrelated) |
Linkcheck limits URL length to `MAX_URL_LENGTH`; longer URLs are skipped and a warning mentioning the URL is logged. This can create disadvantageous situations:
1. Someone decided it was a good idea to put a multi-megabyte image directly into the content as a base64-encoded data URL, and the project using django-linkcheck did not think to prevent that situation.
   Now there are multi-megabyte data URLs in the log every time linkcheck scans this content. Inspecting logs is nearly impossible, since one has to scroll past huge blocks of garbage data, and if another data URL is logged just after it, the important log line between the two is easily missed.
2. Maybe the warning is for a data URL, or maybe just for a conventional URL that happens to exceed `MAX_URL_LENGTH`; either way, one decides to investigate where in the content it was used and whether it should be changed somehow.
   Unfortunately, the usual solution of looking at Link objects to find the content object the URL is in does not work, since the URL was rejected for being too long.
I propose a solution to each of these:
1. If the URL exceeds `MAX_URL_LENGTH` and also starts with `data:`, truncate it to only 64 characters.
   (Expectation: the data is not useful for identifying the URL)
2. In the log message emitted when a URL exceeds `MAX_URL_LENGTH`, also log the instance it came from, to aid doing something about it.
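
To make the two proposals concrete, here is a minimal, framework-free sketch of the idea. It is not the code in this PR or in django-linkcheck itself: the function names `loggable_url` and `should_track`, the constant `DATA_URL_LOG_LENGTH`, the logger name, and the value used for `MAX_URL_LENGTH` are all assumptions made for illustration. It also assumes the truncation is applied to the text that ends up in the warning.

```python
import logging

logger = logging.getLogger("linkcheck_sketch")  # hypothetical logger name

MAX_URL_LENGTH = 255       # assumed value, for illustration only
DATA_URL_LOG_LENGTH = 64   # hypothetical constant: the 64 characters proposed above


def loggable_url(url: str) -> str:
    """Return the text to log for a URL that exceeded MAX_URL_LENGTH.

    data: URLs are cut to their first 64 characters, on the expectation
    that the payload does not help identify the URL. Other URLs are
    returned unchanged.
    """
    if url.startswith("data:"):
        return url[:DATA_URL_LOG_LENGTH]
    return url


def should_track(url: str, instance: object) -> bool:
    """Return True if the URL is short enough to be tracked.

    Over-long URLs are skipped with a warning that names the instance
    the URL was found in, so it can be located without a Link object.
    """
    if len(url) <= MAX_URL_LENGTH:
        return True
    logger.warning(
        "URL exceeds max length %d and is skipped: %s (found in %r)",
        MAX_URL_LENGTH,
        loggable_url(url),
        instance,
    )
    return False
```

With this shape, a multi-megabyte `data:image/png;base64,...` URL contributes at most 64 characters of URL text to the log line, while an over-long conventional URL is still logged in full together with the instance it came from.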