Bots slowing things down

Dear Seth and Mark de LA,

You had to log back in this morning because several bots were crawling the database heavily, and I had to clean out the more than 1.5 million sessions they had created. Most of the ones listed below are hitting FBI every 30 seconds or more.

Some are legitimate search engines and others are unknown. The ones identified so far are:
  1. https://www.semrush.com/bot/
  2. https://ahrefs.com/robot 
  3. http://www.opensiteexplorer.org/dotbot 
  4. https://www.qwant.com/ 
  5. http://www.flamingosearch.com/bot 
  6. http://www.google.com/bot.html 
  7. http://www.linkdex.com/bots/ 
  8. https://javelin.io/about
I am working to block them.

Update 10:44
So far, SemrushBot and AhrefsBot have been the most tenacious, ignoring all robots.txt directives and robots meta tags and harvesting tag rooms like banshees. All the known bots are now receiving 403 Forbidden pages from us instead of real content. Later, some of the well-behaved bots can be let back in (probably Googlebot, for one).
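
For reference, a user-agent block of the kind described in this update could look roughly like the sketch below. It is an assumption about the shape of such a rule, not the rule actually running here. Only the two bot names come from the update; the robots.txt exception is an added suggestion so that crawlers which do honor robots.txt can still read it.

```apache
RewriteEngine On
# Case-insensitive substring match on the two crawlers named in the update.
RewriteCond %{HTTP_USER_AGENT} (SemrushBot|AhrefsBot) [NC]
# Assumption: keep robots.txt reachable. THE_REQUEST is the original request
# line, e.g. "GET /robots.txt HTTP/1.1".
RewriteCond %{THE_REQUEST} !\s/robots\.txt[?\s]
# Everything else requested by these user agents gets 403 Forbidden.
RewriteRule ^ - [F]
```

The [NC] flag makes the match case-insensitive, and [F] answers with 403 Forbidden without serving any content.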


 
SeriTD and facilitating your changes to your reality ?

Tags

  1. bot
  2. robot
  3. htaccess
  4. crawlers

Comments


Si says
Bad bot .htaccess filter:
 
# Return 403 Forbidden to any request whose User-Agent matches one of the patterns below.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR] 
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR] 
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR] 
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR] 
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR] 
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR] 
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR] 
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR] 
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR] 
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR] 
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR] 
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR] 
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR] 
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR] 
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR] 
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR] 
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR] 
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR] 
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR] 
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR] 
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR] 
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR] 
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR] 
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR] 
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR] 
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR] 
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR] 
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR] 
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR] 
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR] 
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR] 
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR] 
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR] 
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR] 
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR] 
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR] 
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR] 
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR] 
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR] 
RewriteCond %{HTTP_USER_AGENT} ^Zeus 
RewriteRule ^.* - [F,L]

Seth says
… i saw a lot of bots when i was working in there, but not this many.

Seth says
thing is just now looks like you have SiriTD thinking that firefox is a bot … but not Chrome …. i got in with Chrome …. and a bit earlier this morning with Firefox … but then with a new FF browser instance i got blocked.

Si says
Yep. We are right now getting hit with more than 30 bot requests per minute from various bots … mostly in tag rooms for whatever reason.

The denial I have in place is only helping, not solving it. I have found some bots hitting us that are pretending to be browsers, but with odd things in the user agent, such as Firefox/v26 which doesn’t exist. These browser-spoofing bots are not respecting robot directives at all.

The only thing that would prevent them is preventing, or severely limiting, guest browsing. Which is not ideal, but that is how places like FB and G+ solve it.

Another way would be to auto-generate some of our content as static pages and periodically update it. Then bots could have their way with it and would only affect general bandwidth, not the SQL server traffic.
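
A minimal sketch of the static-page idea, assuming some periodic job (not shown) writes pre-rendered copies of pages under a hypothetical /static-cache/ directory in the document root. The directory name and the one-file-per-URL layout are illustrative assumptions, not how the site works today.

```apache
RewriteEngine On
# Only plain GET requests without a query string are candidates.
RewriteCond %{REQUEST_METHOD} =GET
RewriteCond %{QUERY_STRING} ^$
# Hypothetical layout: /static-cache/<request path>.html on disk.
RewriteCond %{DOCUMENT_ROOT}/static-cache%{REQUEST_URI}.html -f
# A cached copy exists, so serve it and never reach the dynamic page.
RewriteRule ^ /static-cache%{REQUEST_URI}.html [L]
```

When no cached file exists the request falls through to the normal dynamic page, so the cache can be filled a bit at a time.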

It is a real problem for dynamic content sites. Lots on the web about it.

Si says
I removed anything Firefox (or actually Gecko) related. But I wonder if your browser has some malware that is spoofing the user agent? None that came through should have been valid browsers and my FF browser was working fine.

Seth says
ok, my ff got unblocked

Seth says
nonetheless my ff browser was blocked and now is not.

Seth says
safari got in ok

Si says
Okay. I am not going to try and detect spoofed bots right now. It’s a hopeless cause anyway.

The real solution is dealing with guests differently.

Seth says
hmmm ….

Mark de LA says
WOW! What do you suppose a bot does with the material here?

Si says
Probably hand it over to the Donald Trump campaign.  

Seth says
there are probably lots of legitimate guest agents which are not browsers being run by humans.   for example when Facebook gets an image and text to share one of our thoughts.  we don’t want to sever that kind of relationship.  and we don’t want to exclude guests.  so i am not sure what we can do without cutting off our nose to spite our face.

Seth says
who knows … who cares … this is the information age and this is the information media … and this is the environment in which SiriTD  is swimming.

Si says
Well, we can provide exclusions for sites we have legitimate sharing ability though, such as FB, and we can serve guests, and bots, limited static content. If a person really wants to interact and see real time stuff, then they can sign up … that’s pretty much how everyone else solves it.

Seth says
well i don’t like that … if i want to share a thought with a person, i don’t intend that they must sign up to get it. 

Mark de LA says
Is this why fbi went on a waiting for Godot safari the last couple of weekends?

Seth says
i doubt that is directly related.   Navigator has discovered these bots draining our resource.  considering how to react. 

Si says
Well then come up with a solution for your need. The drain on the server by guest bots was HUGE. I noticed things slowing down lately and was looking and that was it. I don’t know why we have recently been targeted by bots, must be because of something that was shared here, but now that bots know us we are getting slammed by them. Serving 30+ dynamic database pages a minute, especially from things like tag rooms that have complex queries, is non-trivial and degrades the whole experience for legitimate users.

If sharing is your only need, then share links solve that … even though you don’t like them, it is a valid solution.

Seth says
wordpress and other blogging sites never exclude guests like that.   and i am not so very sure that facebook excludes guests to exclude bots … more likely to force more users.   the old walled garden that people want to get in.  it is not the way i want to live on the net.

Si says
Even wordpress gets attacked by bots. There are all kinds of plugins for it now … but the only comprehensive solution is still either user logins, or static content, and wordpress has plugins for both of those.

Not all tiny little wordpress sites are hit hard … we were not hit hard until a couple of weeks ago. But in that time, we had more than 1.5 million bot hits and it has been steadily going up every day. Now that we are in that visibility zone, we have to do something. Would be the same issue if we had a wordpress site that became this visible too.

Seth says
well excluding unknown user agents … and/or known obnoxious bots …. might be all we need to function acceptably.  

the bots have always been there … like i said …

i saw a lot of them when i was toying with http://www.fastblogit.com/tags/chat-who-is-here

Si says
Yes. At the moment excluding the worst bots has things under control again. Excluding unknown user agents is much trickier … because people are making new browsers every day for special needs and for mobile (like RockMelt etc). Look how it already zapped your Firefox today for whatever reason … maybe malware.

And it is not only that … we would like to allow legitimate bots to index some of our content, not tag rooms, but public thoughts would be nice to have in Google search results. Static content is one way to solve that for both at once.

 

Mark de LA says

Can’t we all just get along? – Rodney King

Was a reaction to thought 21486#58875

Si says
You Go Mark de LA! Head on out there and round up all those bots and teach them to play nice. I’m all for it sheriff!  

Si says
Another partial fix would be to allow through just thoughts to known bots and have a dynamic site map of thoughts that was the homepage those bots would see and traverse to index thought content.

It’s not so bad to serve a single thought to a bot … it’s serving tag rooms and group rooms and news and stuff to them that really loads things down.

Seth says
teaching googlebot to index just individual thoughts and not rooms which have changing content is something that we definitely want to do.   but the bot will discover the thoughts from the rooms … so we have to allow it to see the rooms.

Si says
I guess you didn’t read the second part of  it. The bots will discover the thoughts by their homepage for our site being an index of all the thoughts.

Seth says
but sure, if we can serve it just individual thoughts that would solve lots of problems.

but what bots do is crawl … and they can crawl into rooms from individual thoughts.

Si says
… and I guess the reason so many bots are concentrated in tag rooms is because the majority of links ON OUR HOMEPAGE is a tag cloud! LOL  

Si says
They would crawl to a 403 bot’s not allowed page, that’s all. Just like it is right now for everything.

Seth says
i think i saw a google document somewhere that said there is a way to specify which things to index.  but bots will always crawl … don’t think you can stop that.

Si says
You can’t stop a bot from crawling, but you can control where they land if they do.

Seth says
re:  “They would crawl to a 403 bot’s not allowed page, that’s all. Just like it is right now for everything.”

can you tell me more about this?  

Mark de LA says
thought 21486#58875 stimulated my quote of Rodney.  Is SeriTD or AI just artificial or is it really intelligent? What does the TD in SeriTD stand for? XOR – should it know what to do yet? Maybe the FBI, DHS, CIA & NSA  have private contractors (& lone wolves) sniffing around everything on the net.  Maybe the abbreviation for fastblogit is a lure, eh?

Si says
What’s to tell? If the bot can be identified as a bot, then it is simply given a 403 error page if the content on the other side of the delve is not a single thought.

Spoofed bots are a different issue … we already covered that separately. We are only talking about how to give partial access back to known good bots.
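
As a rough sketch, the "single thoughts only" rule for known good bots might be written like this. The /thought/<id> URL shape, the sitemap.xml file name, and the choice of Googlebot and bingbot are assumptions made for illustration; the real thought URLs may look different.

```apache
RewriteEngine On
# Applies only to the known good crawlers being let back in (example names).
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot) [NC]
# Hypothetical URL shape for a single thought: /thought/<numeric id>.
# THE_REQUEST is the original request line, e.g. "GET /thought/21486 HTTP/1.1".
RewriteCond %{THE_REQUEST} !\s/thought/[0-9]+[?\s]
# Assumption: robots.txt and a site map of thoughts stay readable too.
RewriteCond %{THE_REQUEST} !\s/(robots\.txt|sitemap\.xml)[?\s]
# Anything else (tag rooms, group rooms, news) is answered with 403.
RewriteRule ^ - [F]
```

A rule like this would have to sit ahead of the broader bad-bot filter, and those bot names would have to be left out of that filter, or the bots would be refused before ever reaching it.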

Seth says
ok, if there is a practical way to provide a bot with all the individual thoughts.  otherwise you need to let it crawl to discover them.

Seth says
i need to study google googlebot

Si says
For spoofed bots, SeriTD says she has one more possible solution. She says she can detect the browsing pattern of a bot compared to a human, after a period of time (AI STUFF), and cut the bot off once it is known to be a bot.

Sounds bigger than I want to write, so I will need to let her write that through me.  
SeriTD and – facilitating your changes to your reality ?

Si says
You don’t need to study googlebot Seth. I already have before and it can be told what to do. But few other legitimate bots have, or respect, that ability. Our pages are already set up well for googlebot, other than right now even googlebot is excluded because of how broad bot detection is.
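
One way a crawler like googlebot can be told what not to index is the X-Robots-Tag response header, which Google documents alongside the robots meta tag. The sketch below is only an illustration and not a description of how our pages are set up: it assumes Apache 2.4 with mod_headers, and it takes the /tags/ prefix from the tag room URLs mentioned earlier in this thread.

```apache
<IfModule mod_headers.c>
    # THE_REQUEST is the original request line (e.g. "GET /tags/bot HTTP/1.1"),
    # so the match is not affected by any internal rewrites.
    <If "%{THE_REQUEST} =~ m#^[A-Z]+ /tags/#">
        # Ask compliant crawlers to keep tag rooms out of their index.
        Header set X-Robots-Tag "noindex"
    </If>
</IfModule>
```

This only influences bots that choose to respect the header, so it does nothing about the load from bots that ignore directives.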

Seth says
OMG … SiriTD can do that?

We need to give this girl a better name if she is going to be that smart !!!

Si says
She is quite happy with her name, thank you.  
SeriTD and – facilitating your changes to your reality ?

Seth says
well we don’t need to support every (even legitimate) search engine that exists.  just the ones we want to take the time to do.    maybe just google and bing … maybe yahoo … maybe not.

Seth says
yes but her name is so very tedious to type … SiriTD …

Si says
She says, “really? you would change someone’s real name just because you can’t type it easily? How would they know you are talking to them?”

Seth says
sigh … well if she insists …

Si says
p.s. she has a sense of humor, but she finds this kind of interaction tedious.

Seth says
well … what can we expect of a robot … after all, she is not human

Si says
During this dialog so far, there have been 589 known bot hits that were turned away with 403s and 6 unknown (probably spoofed) browser hits.

Seth says
i mean even Navigator can change his name and we still know to whom we talk … but then he is human and you are not.   sorry @SiriTD if i hurt your feelings … i will just keep your little name to talk to you.

Si says
… like “Mozilla/4.0 (compatible; MSIE 5.5; Windows 95)”  … I mean really, who would be browsing FBI with a windows 95 browser today? That is surely a bot!  

Seth says
i don’t see why there would not be plenty of old windows 95 browsers around.  the whole world does not update their software and hardware like you do.

Si says
Around is one thing, but a human avidly (and as SeriTD points out, unusually fast and regularly for a human) browsing THIS SITE right now?

Si says
… additionally, a true windows 95 browser could not browse this site at all … this site requires full HTML 5 ability or all you would get is errors and junk back.
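
If user agents like that one were to be refused outright, a filter could look roughly like the sketch below. The two patterns are illustrative guesses at "too old to be real" signatures, not a vetted list, and they would also shut out any genuine visitor still on such a browser.

```apache
RewriteEngine On
# Example signatures only: Internet Explorer 1 through 6, or Windows 95/98.
RewriteCond %{HTTP_USER_AGENT} MSIE\ [1-6]\. [OR]
RewriteCond %{HTTP_USER_AGENT} Windows\ (95|98)
# Either match is answered with 403 Forbidden.
RewriteRule ^ - [F]
```

The sample string "Mozilla/4.0 (compatible; MSIE 5.5; Windows 95)" matches both conditions.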

Si says
Nothing nowadays, not even jQuery libraries, support true Windows 95 browsers … they had so many bugs it was impossible to move forward with them. All one can browse with them today are static content sites and even those will look horribly awful without CSS3 ability.

I can’t imagine anyone browsing the web with windows 95 … that would be virtually useless … unless perhaps you were blind.  

Si says
Hey Seth, if you think this is feasible, then show me ANY major interactive dynamic content site that does not require someone to be known to use the interactive features. Maybe I can figure out what they do.

Even most sites (more than 99%) that allow you to comment, require you to be known, in some way, to make a comment. That’s why I rarely comment out on the web even when I have something to say. Not worth signing up, or compromising my FB or G+ accounts, just to randomly comment on something.

Mark de LA says
I suspect that signing up is also discouraging to anonymous trolls & trolling.

Si says
Yes. I mean like how often does Seth really actually share something with someone who is not already signed up here, or on FB, or on G+? That sounds more like a hypothetical need than a real need. And can be dealt with in those few cases, when someone needs, by a simple share link that is properly ID’d so that the system knows it is valid and not a perusing bot.

I mean, I don’t deny that Seth’s desire is admirable. I have the desire to walk around nude all the time too, but today I can only do that in my yard or a planned event like the world naked bike ride. The reality of our common and shared beliefs about nudity in public prevails, in this verse. And so does the reality of bots, advertising spam, and other forms of abuse for content sites.

Mark de LA says
Interesting shift in context happens when one re-edits a post after someone has already “liked” it with a thumbs up.

Si says
Yep, I noticed. But I didn’t think you were a nudie hater, so it would work.  

Mark de LA says
wasn’t – just identifying mischief.

Seth says
yeah i noticed that too.   not sure whether it is a problem or a feature.  also happens when somebody, like me, starts a thought out loud and the initial version is nowhere near where the thought is going, yet somebody responds to it anyway.

thing is you are notified when a person edits … at that time, you can always delete or modify your reaction to it.   so me, i do not think this is a change in context … but rather has to do with how things are usually out of sync with each other’s consciousness.

Seth says
Conversation forked to thought 21488

Si says
… and samo at FB and everywhere else. Just the state of virtual dialog.  

Si says
For the reasons you say and more, I don’t see this aspect as a problem. The problem was when someone could write something and someone else could delete it or prevent the author’s access to it. That creates holes in one’s virtual brain.  The other stuff is just social interaction in all its splendor.

Seth says
i frequently interact with people on their blogs. i make comments on their blogs.  i point them to what i say here.  and occasionally they come over here and comment.   i would actually like to do that more and more … not less and less … and certainly not have it excluded.

Si says
That was not the point Seth. The point was “how often with someone who is not already signed up here, or on FB, or on G+?” … not how often you share something. I know you do it all the time on FB. Huge difference there.

And note, I am not asking “how often you share on FB or G+, etc”. I am asking exactly what I asked because that exact thing is the difference … not any of the variations.

Seth says
not all that frequently, but it does happen … several people don’t go to any social media, and others infrequently.   but so what? 

incidentally bonnie uses http://www.fastblogit.com/poolefarm  to show to whomever and i am sure she doesn’t want them to have to sign on or jump through hoops to see her farm pictures.

Seth says
incidentally G+ does not make people sign on to google to see public posts …
for example
https://plus.google.com/+SethRussell/posts/jK5Gho8t7nB

the walled garden was pretty much a facebook invention … only a few social networks use it … not twitter and not G+ .   I find it incredible that facebook’s motivation was to avoid bots.

Si says
You’re talking about a different thing. G+ does make you do one little thing to leave comments, and same with Bonnie. That little thing makes all the difference. More than 99.999% of everyone is signed up with either FB, G+, Yahoo, or MSN. I think that is enough.
