| | | |
|---|---|---|
| author | aethrvmn <me@aethrvmn.gr> | 2025-08-30 13:18:29 +0000 |
| committer | aethrvmn <me@aethrvmn.gr> | 2025-08-30 13:18:29 +0000 |
| commit | 29a570918721fd5d73bb140a9fb3bfa3e5647b9f | |
| tree | cfa54b3b7c1515ac6ae41d56f9b5e15ada4092b1 /content/blog/anthropic.md | |
| parent | added non-content | |
added content
Diffstat (limited to '')
| -rw-r--r-- | content/blog/anthropic.md | 229 |
1 files changed, 229 insertions, 0 deletions
diff --git a/content/blog/anthropic.md b/content/blog/anthropic.md
new file mode 100644
index 0000000..0fc8a24
--- /dev/null
+++ b/content/blog/anthropic.md
@@ -0,0 +1,229 @@
+---
+title: Amodei's goons are cowards
+date: 2025-06-29
+showdate: true
+bookToC: false
+tags: [rant, legal]
+---
+
+I received a notification from my VPS provider that my VPS was running at more than 90% CPU utilization, and when I checked the `nginx` logs I saw the following. I was effectively being DDOS'd by Dario's (and his troupe of researchers)[^1] ClaudeBot.
+<!--more-->
+---
+```
+# ... more before
+18.221.167.11 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+3.144.89.42 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+3.22.70.169 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+18.117.154.134 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+18.117.172.189 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+18.223.195.127 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+3.147.48.105 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+13.58.61.197 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+3.21.46.68 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+3.15.149.24 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+18.224.54.61 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+18.118.32.7 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+3.145.7.187 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+3.133.137.10 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+3.147.86.143 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+18.118.144.109 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+3.135.206.25 ... (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
+# ... more after
+```
+---
+Because I host my own git server, I am in essence at the mercy of the crawlers. In any case this, followed by another DDOS attack by Dario's team some time later, finally gave me enough determination to use [`ai.robots.txt`](https://github.com/ai-robots-txt/ai.robots.txt), which sets up an automatic blocker for known AI crawlers.
+
+At the beginning I only used it to generate a `robots.txt`, without the actual blockers on the reverse proxy level, hoping that the different crawlers would respect it.
+
+The more pressing matter however was the fact that I had used a custom license. One that is not `free and open source`, but rather, my own, anti-crawler license.
+
+To that end I tried to reach out to Dario's legal team, to notify them of my license, so that they wouldn't use my data to train their model (license details will be coming up).
+
+Since I am just "a guy", I decided that the best option was to reach out via LinkedIn to a member of their legal team. I found another "a guy", and I had the following conversation:
+
+---
+>Nov 25, 2024
+>sent the following message at 5:40 PM
+>
+>Good afternoon,
+>
+>Anthropic's ClaudeBot recently crawled my web server and scraped my public git repositories hosted on my web server (I can send the server traffic logs if you wish).
+>
+>My code is licensed under a custom license, the DBEL, which has non-commercial provisions about the software (as defined in the license), especially for use in AI training.
+>
+>Specifically, clauses 1-4 read:
+>
+>Don’t Be Evil License (DBEL) 1.0
+>
+>1. Acceptance
+>
+>By using, copying, modifying, or distributing the source code, training data, training environment, or its associated machine learning model weights (collectively the “Software”), you agree to comply with all terms outlined in this license.
+>
+>2. Copyright License
+>
+>The Licensor (defined below) grants you a non-exclusive, worldwide, royalty-free, non-sublicensable, non-transferable license to use, copy, modify, and distribute the Software, including associated model weights, training data, and training environments, subject to the conditions set forth in this license.
+>This includes the right to create and distribute derivative works of the Software, provided that the limitations below are observed.
+>
+>3. Non-Commercial Use Only
+>
+>You may use, copy, modify, and distribute the Software and derivative works solely for non-commercial purposes.
+>Non-commercial purposes include, but are not limited to:
+>
+>Personal research and study.
+>Educational and academic projects.
+>Public knowledge and hobby projects
+>Religious observance.
+>Non-commercial research, or AI and machine learning (ML) experimentation.
+>
+>4. Distribution and Monetization Provisions
+>
+>Any use of the Software or derivative works for profit, or in a business context, including in monetized services and products, requires explicit, separate permission from the Licensor.
+>The restrictions on commercial use apply to both the source code and any model weights produced by the Software.
+>
+>Any distribution must include this license, and the non-commercial restriction must be maintained. Weights resulting from use of the Software, including but not limited to training or fine-tuning models, must be shared under this same license, ensuring all restrictions and conditions are preserved.
+>[...]
+>
+>And in clause 12 you can find the definitions:
+>“Licensor”: The entity or individual offering the Licensed Materials under this license.
+>
+>“Licensed Materials”: The software, source code, training data, training environment, model weights, and any associated AI/ML components provided under this license.
+>
+>“You”: The individual or entity accepting the terms of this license, including any organization or entity that this individual or entity might work for or represent, including any entities under common control.
+>
+>“Your license”: The license granted to you for the software under this terms.
+>
+>“Model weights”: The machine learning model parameters generated by training or fine-tuning models using the Licensed Materials.
+>
+>“Use”: Anything you do with the software requiring your license
+>
+>As Anthropic is a for profit company, that crawls the web with the intent of scraping data to train their commercial models on, this is a violation of the DBEL.
+>
+>Furthermore the resulting model from the training must also have the DBEL, as per clause 4, which gives me legal rights on how any LLM generated by my data is used, distributed, or monetized.
+>
+>Since there is no easily findable way to communicate with the Anthropic legal team, I am reaching out to you to notify them to either delete all DBEL licensed code, or to reach out to discuss with my lawyer how to best move forward.
+>
+>Thank you very much,
+
+>>sent the following message at 10:43 PM
+>>Good afternoon, I would appreciate it if you would send an email to legal@anthropic.com. I’m on vacation this week, and this is the fastest path to the correct people at Anthropic.
+
+>Nov 26, 2024
+>sent the following messages at 1:51 AM
+>Thank you very much, have a good rest of the vacations
+
+---
+Things went quiet after applying `robots.txt`, at least for a little while; Dario's crew does seem to respect it, and being myself, I decided to let the whole thing pass, at least for the time being. In truth, my license is about commercial use of models trained on my data, and I didn't have any proof that my data wasn't filtered out in between the crawling and the training.
+
+So I set up a honeypot repo, which is full of a specific phrase that is unique, so that if any LLMs do scrape my git repo and they do train on my data, I could easily prove that they did it.[^2]
+
+After a couple of months, where I hadn't updated my setup, I got DDOS'd again, this time by the combined might of Zuckerberg and his ilk, as well as Bezos' minions, both sending bots with instructions to completely disregard the `robots.txt`, and hammer at the poor VPS I have set up.
+
+I checked Zuckerberg's and Jassy's legal, as I naively assumed that they would have an email much like Dario, but to my surprise, I realised that they maliciously expect you to send physical mail to an address in Ireland[^3] (Zuckerberg, I couldn't find any contact info for Jassy's folk, probably because I didn't look enough).
+
+In any case. Yann LeCun tends to make the models that come out of his department weight-available (not to be confused with open-weight or open-source) which I am ok with, at least to the extent that if they do train their model on my data, at least I get to run the model for myself.
+
+Concerning the humans that work for Andy Jassy, I never heard of them, nor did I know that they do AI research/models, but it made sense; Alexa is bad, they want to improve it, so they want more data to go from a symbolic AI model to an LLM. I had heard also that they wanted to use Amodei's crew to contribute to the research, so I decided to be more thoughtful; I would send an email to Dario's legal team.
+
+---
+Topic: Possible violation of software license
+
+>Hello,
+>
+>I noticed that AmazonBot crawled my git server at https://erga.apotheke.earth a couple of days ago. I’m reaching out because I understand that Amazon provides training data to your systems, and I want to give you a heads up regarding my projects.
+>
+>My repositories are licensed under two custom licenses: the Don't Be Evil License 1.0 (DBEL 1.0) and the Don't Be Evil License 1.1 (DBEL 1.1). Both licenses include explicit provisions restricting the use of the repository contents for training data or any commercial application without meeting the stated terms.
+>
+>For example, in DBEL 1.0:
+>
+> • Section 3 ("Non-Commercial Use Only") restricts use to non-commercial purposes.
+>
+> • Section 4 ("Distribution and Monetization Provisions") specifically states that:
+>
+> Any use of the Software or derivative works for profit, or in a business context, including in monetized services and products, requires explicit, separate permission from the Licensor. The restrictions on commercial use apply to both the source code and any model weights produced by the Software.
+>
+> Weights resulting from use of the Software, including but not limited to training or fine-tuning models, must be shared under this same license, ensuring all restrictions and conditions are preserved.
+>
+>Similarly, DBEL 1.1 includes analogous requirements, particularly in its provisions regarding commercial use and the need for fair compensation if your systems use the Software (or derivatives thereof) in a commercial context.
+>
+>You can review the full licenses here:
+>
+> • DBEL 1.0: https://erga.apotheke.earth/aethrvmn/protest/src/branch/master/LICENSE
+>
+> • DBEL 1.1: https://erga.apotheke.earth/aethrvmn/alectors/src/branch/master/LICENSE [^4]
+
+>Given these terms, I kindly request that you review your data collection practices to ensure compliance with my licenses. If my work is being used for training or any commercial applications without the appropriate permissions or compensation, I would like to discuss how we can address this situation.
+>
+>Thank you for your attention to this matter.
+>Sincerely,
+
+
+>>Dear Mr.
+>>Thank you for your April 3, 2025 email. We understand that the activity at issue relates to AmazonBot. We do not control the AmazonBot crawler or receive crawl data from Amazon for training. Accordingly, we will consider this inquiry resolved.
+>>Sincerely,
+
+>Thank you for the response.
+>I also have records of ClaudeBot DDOSing my server on the 24th and 28th of November. Back then only the DBEL 1.0 was used, so if Anthropic has trained its models without filtering said data/code the DBEL 1.0 applies.
+>Yours,
+
+>>Thank you for letting us know about traffic related to Anthropic’s web crawling. Anthropic aims to limit the impact of our crawling on website operators. We respect industry standard robots.txt instructions, including any disallows for the CCBot User-Agent (we use ClaudeBot as our UAT. Documentation is available at https://anthropic.com/crawl). Our crawler also respects anti-circumvention technologies and does not attempt to bypass CAPTCHAs or logins. Please note also that claudebot@anthropic.com is the correct point of contact for this bot going forward, as is documented in every request as part of the user agent string.
+>>I understand that you've disallowed our web-crawler ClaudeBot from accessing https://erga.apotheke.earth, so we should not be crawling it or training on data from it. Can you share more information about the activity you're seeing? If you have logs, I can share that with our team to investigate further.
+>>Best regards,
+
+>Good morning,
+>
+>These messages have not been about informing you about ClaudeBot, they have been about your possible violations of my copyright license.
+>
+>If you have used DBEL 1.0 licensed code to train any ML model, (which is possible because of the indication of ClaudeBot in my logs), then I have commercial rights to the resulting ML model, and it must also carry the DBEL 1.0.
+>
+>4. Distribution and Monetization Provisions
+>Any use of the Software or derivative works for profit, or in a business context, including in monetized services and products, requires explicit, separate permission from the Licensor. The restrictions on commercial use apply to both the source code and any model weights produced by the Software. Any distribution must include this license, and the non-commercial restriction must be maintained. Weights resulting from use of the Software, including but not limited to training or fine-tuning models, must be shared under this same license, ensuring all restrictions and conditions are preserved.
+
+>This would mean that Anthropic would need permission from me personally in order to use any of their LLMs, that have been trained on my code, in a commercial setting.
+>Therefore I have been sending these emails to inform you that in case you do train a model using my code, you have accepted this License and therefore agree to come to contract with me concerning the commercial use of the model, as per Clause 1.
+>Once again, you can find DBEL 1.0 here:
+>https://erga.apotheke.earth/aethrvmn/protest/src/branch/master/LICENSE
+
+
+>>Thank you. We disagree with your interpretation of this hypothetical scenario. In any event, I can confirm that your site is not in our training corpus so we will consider this matter closed and will not be responding to further communications. Anthropic reserves all rights.
+
+
+>Good afternoon,
+>
+>My "hypothetical scenario" is based solely on the information provided by Anthropic at <https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler>
+>
+>"
+>ClaudeBot
+>ClaudeBot helps enhance the utility and safety of our generative AI models by collecting web content that could potentially contribute to their training.
+>"
+>
+>Given the fact that ClaudeBot appears in my server logs, it would then make sense to be alarmed at the idea of Anthropic using my data for training purposes.
+>
+>
+>Since you only disagree with my interpretation of the following facts:
+>
+>that Anthropic is using ClaudeBot to crawl websites for information to be used in training,
+>
+>that Anthropic crawled my server with ClaudeBot,
+>
+>that my software was under a restrictive license,
+>
+>and that said license, which you accept upon using the Software as per Clause 1, has special clauses regarding the use of the Software as training data.
+>
+>
+>Would you be willing to point out the error in my interpretation that the license should hold?
+>
+>Obviously crawling a server to collect training data and actually using the data for training are two separate processes, however there is no way for me to know whether my data has been filtered out or not by your internal processes, hence this discussion.
+>
+>Would you be willing to sign a declaration that Anthropic has not used the data, that they have acquired from my server, for training purposes?
+>
+>Yours,
+>Vasileios Valatsos.
+---
+at which point the response was an 'Out Of Office' template reply.
+
+Cowards.
+
+[^1]: I do not recognize metaphysical entities such as "corporations". They are human-made constructs in order to hide blame. Equivalent to golems in Hebrew mysticism, or the Kapparot. The humans that control them are the actors and the offenders.
+[^2]: Note that the reason for such an aggressive license is not to make me money. I just want them to leave me alone, hence why I informed them.
+[^3]: I am a citizen of an EU state, and the European HQ for most US megacorps are in Ireland.
+[^4]: This has since been moved to GPLv3+ and a new repo set up.
