I’m no expert, but dreaming here - is there a FOSS search engine that can be run in a distributed way by a community? I would happily switch to that if there was.
That’d be awesome. I’m just curious how you’d go about constructing such a thing so that it would be resilient against millions, and potentially billions, of dollars invested in trying to break it and make it serve results it otherwise wouldn’t. Because those investments will happen if such a search engine gains traction. I really like the idea though.
I have heard of search engines that are non-commercial, so it can’t be impossible. I read this recently: https://lemmy.world/post/10979517
Do you mean like YaCy?
YES
It’s a really interesting question and I imagine scaling a distributed solution like that with commodity hardware and relatively high latency network connections would be problematic in several ways.
There are several orders of magnitude between the population of people who would participate in providing the service and those who would consume the service.
Those populations aren’t local to each other. In other words, your search is likely global across such a network, especially given the size of the indexed data.
To put some rough numbers together for perspective, for search nearing Google’s scale:
A single copy of a 100PB index would require 10,000 network participants each contributing 10TB of reliable and fast storage.
100K searches/sec, if evenly distributed and each resolvable by a single node, would be at least 10 req/sec/node. Realistically it’s much higher than that, depending on how many copies of the index exist, how requests are routed, and how many nodes participate in a single query (probably on the order of hundreds). Of that 10TB of storage per node, a substantial amount would need to be kept in memory to sustain the likely hundreds of req/sec a node might see on average.
The index needs to be updated. Let’s suppose the index is 1/10th the size of the crawled data (so about 1,000PB crawled for a 100PB index) and the oldest data is 30 days old (which is pretty stale for popular sites). That’s at least 33PB of data to crawl per day, or roughly 3,000Gbps minimum sustained data ingestion. Spread across those 10,000 nodes, that is about 0.3Gbps sustained per node, so each would realistically need a 1Gbps connection just to index fresh data.
These are all rough numbers but this is not something the vast majority of people would have the hardware and connection to support.
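For anyone who wants to play with these figures, here is a minimal Python sketch of the same back-of-envelope arithmetic. The inputs are the rough assumptions stated above (100PB index, 10TB per node, 100K searches/sec, a 10:1 crawl-to-index ratio, a 30-day refresh window); the function name `estimate` and its parameter names are just made up for illustration.

```python
# Back-of-envelope estimate; every input is a rough assumption, not a measurement.
PB = 10**15  # bytes in a petabyte (decimal)
TB = 10**12  # bytes in a terabyte (decimal)
SECONDS_PER_DAY = 86_400


def estimate(index_bytes=100 * PB,
             storage_per_node=10 * TB,
             searches_per_sec=100_000,
             crawl_to_index_ratio=10,
             max_age_days=30):
    # Nodes needed to hold ONE copy of the index.
    nodes = index_bytes // storage_per_node
    # Lower bound: assumes each query is answered by exactly one node.
    req_per_node = searches_per_sec / nodes
    # Raw crawled data is assumed to be 10x the index size.
    crawled_bytes = index_bytes * crawl_to_index_ratio
    # Everything must be re-crawled within the refresh window.
    crawl_bytes_per_day = crawled_bytes / max_age_days
    # Convert bytes/day to gigabits/second.
    ingest_gbps = crawl_bytes_per_day * 8 / SECONDS_PER_DAY / 1e9
    return {
        "nodes": nodes,
        "req_per_sec_per_node": req_per_node,
        "crawl_pb_per_day": crawl_bytes_per_day / PB,
        "ingest_gbps_total": ingest_gbps,
        "ingest_gbps_per_node": ingest_gbps / nodes,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.2f}")
```

With the defaults this works out to 10,000 nodes, 10 req/sec/node, about 33PB crawled per day, and a bit over 3,000Gbps of total ingestion. None of this accounts for index replicas or query fan-out, both of which push the per-node numbers up.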
You’d also need many copies of this setup around the world for redundancy and lower latency. You’d also want to protect the network against DDoS, abuse and malicious network participants. You’ll need some form of organizational oversight to support removal of certain data.
Probably the best way to support such a distributed system in an open manner would be to have universities and other public organizations run the hardware and support the network (at a non-trivial expense).
So this is starting to sound more like something that needs to explicitly be paid for in some way (as opposed to just crowdsourcing personal hardware), at least if we want to maintain the same level of service.
It seems like there are others in the thread with good options
Yes, at least currently. There may be better options as multi-gigabit internet access becomes more commonplace and commodity hardware gets faster.
The other options mentioned in this thread are basically toys in comparison (either obtaining results from existing search engines or operating at a scale less than a few terabytes).
SearXNG?
Good one!
No, but we can build it. It’s called a Directory. This is how Yahoo! worked before it got enshittified and eventually replaced by Google search.
https://en.wikipedia.org/wiki/Yahoo!_Directory
Here’s the summary for the wikipedia article you mentioned in your comment:
The Yahoo! Directory was a web directory which at one time rivaled DMOZ in size. The directory was Yahoo!’s first offering and started in 1994 under the name Jerry and David’s Guide to the World Wide Web. When Yahoo!
Bot thinks the exclamation point at the end of Yahoo! is the end of the sentence. Cute bot.
Dmoz was great.
even netscape/mozilla was in on this game, with the open directory project (dmoz). aol eventually shut it down, but it apparently lives on independently as curlie.org - but i have no idea how current it is or anything.
FOSS won’t change this matter unless it somehow can filter out all the low quality AI-generated articles better than Google’s filters.
Don’t allow sites in the index by default, use an allowlist.
Is this allowlist supposed to work on an article-by-article basis? Because that’s what would have to be done for publishing platforms like Medium.
By default I was thinking per site. But for sites with huge variance in page quality you’d need it per page/article, as you say.
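A rough Python sketch of what that two-level check could look like, assuming a site-level allowlist plus per-page entries for mixed-quality platforms. The hostnames, the path, and the names `SITE_ALLOWLIST`, `PAGE_ALLOWLIST` and `is_indexable` are all hypothetical examples, not a real curated list or an existing tool.

```python
from urllib.parse import urlsplit

# Hypothetical example data: whole sites that are trusted as-is.
SITE_ALLOWLIST = {"en.wikipedia.org"}

# Hypothetical example data: platforms with mixed quality,
# where only individually reviewed pages are allowed.
PAGE_ALLOWLIST = {
    "medium.com": {"/@someauthor/a-good-article"},
}


def is_indexable(url: str) -> bool:
    """Return True only if the URL is covered by one of the allowlists."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Site-level rule: every page on a trusted site is allowed.
    if host in SITE_ALLOWLIST:
        return True
    # Page-level rule: normalize a trailing slash, then require an exact match.
    path = parts.path.rstrip("/") or "/"
    return path in PAGE_ALLOWLIST.get(host, set())


if __name__ == "__main__":
    print(is_indexable("https://en.wikipedia.org/wiki/DMOZ"))
    print(is_indexable("https://medium.com/@someauthor/a-good-article"))
    print(is_indexable("https://medium.com/@other/some-spam"))
```

Anything not listed is rejected by default, which is the whole point of an allowlist. The hard part is still who curates the two lists, not the lookup itself.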
Comes down to editorial quality maybe; what sites do you trust?
Jimmy Wales has a social media project with “whom to trust” built into the algorithm. I’m not sure if it is an idea with legs, but I like where his head is at.