distributed_music_ideas
- Goal: brainstorm potential systems that could store media metadata in a fault-tolerant, geographically distributed database.
This is NOT about file sharing or even torrent files. What is needed is a distributed system for storing the information that would be visible in a user interface. It would contain records for artists, releases, uploads, forums, collages, and (perhaps) users.
It would likely lack some features. Enforcing ratios would be extremely difficult, but hopefully not actually too much of a problem.
There would likely be a large group of semi-trusted people running data nodes of some sort, but the system should not depend on every node operator being honest, since that is impossible to guarantee.
Problem: what.cd had a great database of music metadata, along with collages, releases, forums, etc. The data was not distributed redundantly, so there was a single point of failure. Copies of the data were not saved; there was only one interface to the data and only one authoritative source of truth. This made it possible to shut the whole thing down.
Therefore, a system with a distributed database of metadata, divorced from interfaces or file hosting of any sort, should be harder to take down, whether by legal or illegal means. It could also be expected to sit further toward the legal end of the gray area in which what.cd operated. It would be highly redundant and could have different systems of access control and different interfaces hooked up to the different nodes in the metadata replication network.
Naturally there would be trade-offs. Quality will not be as good as what.cd's, period. Deal with it. Ratio enforcement will likely be impossible. These things may not matter - TPB works pretty well, even if you scoff at its poor quality, bad tags, and malware. The trade-off in content quality could be combated by a voting system and a powerful search function with filters.
Meteor-gazelle design notes made by devs prior to shutdown:
https://github.com/meteor-gazelle/meteor-gazelle/tree/master/doc/specs
Has some valuable ideas, user flows, and requirements for a successor to what.cd/gazelle. Good read.
Idea: database-level distributed consensus, such as with the Paxos extension for postgresql - https://github.com/citusdata/pg_paxos#pg_paxos
Multiple independent postgresql nodes could be set up with their own systems of access control (public, or VPN, or totally private, tor, whatever)
When updates/inserts/deletes happen on a node they are asynchronously replicated to the other nodes via the Paxos SQL statement log.
If there are conflicts they are resolved by the consensus of a majority of the nodes.
- Pros:
- Fast - does not have to read or write data synchronously across many geographically distributed nodes
- Handles conflicts automatically
- In theory has a log of all statements and could be rolled back to any point in time if bad things happened, not sure exactly how that would work in practice
- Integrated at the RDBMS level, not block storage level
- Can mix and match normal tables and paxos tables
- Could possibly use multi-master configuration for data that should be accessed and written synchronously across nodes, paxos tables for everything else
- Each node could have its own UI
- Cons:
- Bad actors could insert/update/delete, either users via scripted attacks on the frontend or via DB queries directly
- Requires a predefined list of nodes, may not be super easy to add nodes to the group
- Network partitioning probably leads to data loss
- Gazelle wants MySQL? There's an Ocelot fork with PG support
I don't imagine Gazelle would be hard to convert; there's a base DB driver, so it's just a matter of swapping that out. (Counterpoint: Gazelle has a gazillion raw SQL queries in the source code and almost zero abstraction.)
Combine with pg_paxos maybe? https://github.com/begriffs/postgrest "REST API for any Postgres database" -- yes, PostgREST is awesome. Note that pg_paxos is a non-standard extension, so it can't be used on Heroku/RDS (do we care?).
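As a rough sketch of what querying a node through PostgREST could look like (the torrents table and its columns here are hypothetical, purely for illustration), a UI or script on top of one node might do:

    import requests

    NODE = "https://node1.example.org"   # hypothetical node running PostgREST in front of its synced Postgres

    # PostgREST exposes each table/view as an endpoint; filters use column=operator.value
    resp = requests.get(
        f"{NODE}/torrents",
        params={
            "artist": "eq.Autechre",     # exact match
            "format": "eq.FLAC",
            "order": "year.desc",
            "limit": "25",
        },
        timeout=10,
    )
    resp.raise_for_status()

    for row in resp.json():              # PostgREST returns a JSON array of rows
        print(row["year"], row["album"], row["info_hash"])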
Possible architecture: Dockerize Postgres with pg_paxos and paxos-replicated tables across a semi-private network of semi-trusted people (as long as a majority are not actively trying to shit in the pool it should work out).
- PostgREST connects the synchronized RDBMS to UIs, of which there can be many.
- or DHT for .torrent files or magnet info?
- Maybe magnet links can be stored directly in DB? Legal issues?
Idea: Strong separation of metadata from 'content':
All metadata / documentation exists on some public web site with a good API, and any sharing service queries that site for its metadata needs. Imagine a website similar to Discogs, for instance, where the site would serve a dual purpose: to document a release, but then also include a magnet link to download the release. Or, divorce the magnet link from the site completely and have it stored elsewhere so that the main site containing the metadata is not targeted for removal.
Maybe an interface so that when a user navigates to any bandcamp, discogs, amazon, itunes album page, torrent links are provided?
(alternate idea: releases have IDs that are included in torrent metadata so that one can look up the torrent in some other closed system using the public id - the closed system could even mirror data from the public system in order to integrate the information better. Oh wait, this is the same idea as described below)
The site can use Discogs, MusicBrainz, etc. as an index, with torrents/files 'linked' to them. Non-Discogs releases can be uploaded in the same format and potentially submitted to Discogs. Symbiosis.
Discogs is not free or open source; its data is proprietary. MusicBrainz is free/public domain.
Pros:
- - Tracklists
- - Genres
- - Record Labels
- - Album art
- - Release info (down to the format/year!)
- - Tons more metadata
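To make the 'release ID next to the magnet link' idea concrete: if a torrent record stores a MusicBrainz release MBID, any interface can pull all of the metadata above from the public MusicBrainz web service. A minimal sketch, with a placeholder MBID and contact address:

    import requests

    MBID = "00000000-0000-0000-0000-000000000000"   # placeholder release MBID stored alongside a magnet link

    resp = requests.get(
        f"https://musicbrainz.org/ws/2/release/{MBID}",
        params={"fmt": "json", "inc": "recordings"},
        headers={"User-Agent": "distributed-music-ideas/0.1 (contact@example.org)"},  # MusicBrainz asks for a real UA
        timeout=10,
    )
    resp.raise_for_status()
    release = resp.json()

    print(release["title"], release.get("date", "?"))
    for medium in release["media"]:
        for track in medium["tracks"]:
            print(f'{track["position"]:>2}. {track["title"]}')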
Question regarding Discogs: Can we really just pull Discogs metadata willy-nilly? What restrictions are in place on their API? Would we just scrape it all, store it offline, and check regularly for changes, or what?
- we could scrape it and use it for a basis until people start directly contributing to the new system
- No need to scrape it. Full monthly data dumps in XML are available at http://data.discogs.com/, including historical files. They have a hierarchy with top-line genre tagging plus subgenre "style" tagging, which could be useful for distillation since what.cd only had one field for tags. Check out https://www.discogs.com/help/doc/submission-guidelines-release-genres-styles
- ^This. Also, utilizing both Discogs and MusicBrainz for a lot of initial seed data (multiple fetching hosts so as not to push API limits), in addition to continued community input, would really be a great start for a new property.
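A minimal sketch of chewing through one of those monthly dumps with a streaming parser (the filename is a placeholder and the element names are based on the published releases dump, so they may need adjusting):

    import gzip
    import xml.etree.ElementTree as ET

    DUMP = "discogs_releases.xml.gz"   # placeholder: grab the real file from http://data.discogs.com/

    with gzip.open(DUMP, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag != "release":
                continue
            title = elem.findtext("title")
            genres = [g.text for g in elem.findall("genres/genre")]
            styles = [s.text for s in elem.findall("styles/style")]   # the subgenre "style" tags mentioned above
            print(elem.get("id"), title, genres, styles)
            elem.clear()               # keep memory bounded while streaming a multi-GB file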
Could we pull album info from the MP3/FLAC metadata?
- - Probably not if we stay with .torrents, which still seems by far the most likely. There wouldn't be a way to get at the metadata without getting pieces of the files. Though if there were a client-side way to extract FLAC/MP3 metadata, we could ask the user to point at the directory containing the files and grab the data that way. Alternatively, the tracker could host a server that downloads each torrent and retains it for a few days: it could extract metadata and help seeding during the first few days (after that it would be deleted because of size constraints).
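For the 'point the client at a directory' approach, pulling the Vorbis comments out of local FLACs is straightforward with mutagen; a sketch, with a placeholder directory path:

    from pathlib import Path
    from mutagen.flac import FLAC

    MUSIC_DIR = Path("/path/to/upload")          # placeholder: directory the user pointed us at

    for flac_path in sorted(MUSIC_DIR.rglob("*.flac")):
        tags = FLAC(str(flac_path))
        artist = tags.get("artist", ["?"])[0]    # Vorbis comments are lists of strings
        album = tags.get("album", ["?"])[0]
        title = tags.get("title", ["?"])[0]
        info = tags.info
        print(f"{artist} - {album} - {title} "
              f"({info.sample_rate} Hz / {info.bits_per_sample}-bit, {info.length:.0f}s)")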
Requests can also be linked to Discogs releases; essentially, all Discogs releases would start out as unfilled 'requests'.
Request bounty?
Bounties can work as usual: by default a request has 0 votes and 0 bounty, with a link to the appropriate Discogs page.
From the IRC:
16:53 < glittershark> 100 trusted people get access to the database, they each host 100 UIs with stricter auth
- 16:53 < glittershark> and those UIs serve 1000 people each in the end though the most popular one (or maybe two) will prevail
16:53 < glittershark> etc
essentially like a multi-layered thing
Backup Idea:
Once a month or so, an encrypted (choose your poison) backup of the site is made and released as a freeleech torrent. Thousands download, and therefore a backup is in the hands of users not the tracker.
- Why once a month? I think all metadata which doesn't contain any user/mgmt/administrative specific information should be completely available to all users all the time via download/mirror/api, etc..
Possible issues: it needs to be secure and verifiable (nobody can read or tamper with it), and also: who exactly holds the keys? I'm not sure it's worth it. Say what.cd had had this: we would all have the DB, but what would we do with it?
- What is done with the database copy all users hold after the fact doesn't matter. The idea is that it's completely distributed and therefore can't be totally taken down, so there is no loss of community effort, which is the most painful part of this whole thing.
RE keys: still requires some single point/s of trust. There are some ceremonies for keeping a shared secret, but that is fairly OTT and difficult. (ZCash did this)
RE what to do with the database: it simply ensures that there is definitely a copy out there; in the event something like this happens again, we can recover (assuming the hardware/systems are in place).
~How many people would have access to the keys? (Estimate.) I can't really estimate the backup size accurately (depends on experience + what data we want to back up), but let's say a few dozen GB. If it were freeleech, you should have at least hundreds of people with the space. Only the most recent copy needs to be kept for this to work.
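A minimal sketch of the encrypt-and-sign step using PyNaCl (not GPG, just to show the shape of 'nobody can read or tamper with it'); who holds secret_key and signing_key is exactly the open question above:

    import nacl.secret
    import nacl.signing
    import nacl.utils

    # Hypothetical: the metadata dump we want to publish as a freeleech torrent
    dump = open("metadata_dump.sql", "rb").read()       # placeholder filename

    # Symmetric key for confidentiality - only key holders can read the backup
    secret_key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
    ciphertext = nacl.secret.SecretBox(secret_key).encrypt(dump)

    # Signing key for authenticity - anyone with the published verify key can check for tampering
    signing_key = nacl.signing.SigningKey.generate()
    signed_blob = signing_key.sign(ciphertext)          # this is what goes into the torrent

    # Any downloader can verify integrity with the public verify key...
    verify_key = signing_key.verify_key
    verify_key.verify(signed_blob)                      # raises BadSignatureError if tampered with
    # ...but only whoever holds secret_key can actually decrypt it later.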
Idea: Freenet
Pros:
Anonymous
Censorship Proof
No need to be online to "seed"
Cons:
- Old content will fall out
- Slow
- No dynamic content
Idea: ZeroNet - using blockchain
https://zeronet.readthedocs.io/en/latest/using_zeronet/sample_sites/
- Pros:
- Looks pretty dang secure
- Simple to use and anonymity is simple with Tor
- Has a user system through something called ZeroID - this lets you do quality control on content maybe
- Cons:
- Nobody really knows WTF it is
- Requires every user to have a BTC wallet? I don't think so; it just sounds like some nice crypto built on top of BitTorrent's DHT?
- The same technology as BTC but it's not BTC
- Does every user have to do a complete mirror of everything? No, you only host sites you visit per "How does it work?" on the wiki
- "Namecoin's blockchain is being used for domain registrations." - you'll mirror all name registrations
- Blockchain all the things
quick description after having read about it:
* ZeroNet stores name registrations in the blockchain
* content.json is distributed over bittorrent, signed by the "wallet"/name registration
* content.json links to the other files of the site so they too will be distributed over bittorrent
* DHT or trackers bind addresses to IPs that host them (site address = magnet link)
* Differs from torrents in that this "torrent" will be updated live, and the newest signed & valid update always wins
* has a websocket api so your webpage can be notified when resources are updated
* normal users update sites by signing updates with their own key and sending them to the site owner, and the site owner does whatever they want with that data. This can mean authorizing the user to update a certain file with their own key
- * "authorization provider" is a mechanism that helps work around the problem of every user needing a wallet, by letting another user with a wallet be responsible
* this means that the site owner can store private data outside the network - but normal centralization problems apply to that
* Observation: A site could easily be a repository of files (using the optional files feature)
* What's the catch: if the authorities actually found the IP that issues updates for a site, the PC could be seized and the key used to take down or take over the site
See: https://docs.google.com/presentation/d/1_2qK1IuOKJ51pgBvllZ9Yu7Au2l551t3XBgyTSvilew/pub?start=false&loop=false&delayms=3000#slide=id.g9a1cce9ee_0_4
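To make the role of content.json concrete: it is just a JSON manifest listing the site's files with sizes and hashes, and the whole manifest is signed by the site key. A rough local check might look like the sketch below; the field names follow ZeroNet example sites, and the exact digest convention should be verified against the docs (this only checks sizes and prints the computed SHA-512):

    import hashlib
    import json
    import os

    SITE_DIR = "my_zeronet_site"                       # placeholder: local copy of a ZeroNet site

    with open(os.path.join(SITE_DIR, "content.json")) as fh:
        manifest = json.load(fh)

    for rel_path, meta in manifest.get("files", {}).items():
        data = open(os.path.join(SITE_DIR, rel_path), "rb").read()
        size_ok = (len(data) == meta["size"])
        digest = hashlib.sha512(data).hexdigest()
        print(f"{rel_path}: size {'ok' if size_ok else 'MISMATCH'}, sha512={digest[:16]}...")
        # ZeroNet stores a per-file hash in meta (e.g. meta.get('sha512')); a peer rejects any
        # update whose manifest signature or file hashes do not check out.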
User accounts/identity:
Thoughts: federated user access? Something like diaspora but for file sharing would be super cool.
- OAuth/FB connect/Twitter/openID/google?
- but doesn't that make it easier for your acc to be associated with you? E.g. I log in with Facebook, gov can just get fb to reveal it. Though it is super convenient
- - yes, you gotta use an email address at some point though. it'd be optional.
- Possible technology: AWS Cognito - supports OAuth (FB etc) and SAML
- Signed distributed identity is a problem already solved by PGP. OAuth is not great for a lot of reasons, not least of which is that you need a central, federated provider.
- It's not super accessible to users (but hey, it's a private tracker, so whatever), but a distributed PGP web-of-trust-based authentication model for a file sharing network is a super neat idea imo (see the sketch after this list)
- what does xmpp use? kk
- just identity with an @ sign for the server you come from iirc
- What about using the Diaspora fork, or a sharable comm platform like Signal; but encoding user info into an encrypted anonymizing token (think TouchID/Applepay)?
- Could also use the platform to generate QR codes within each torrent, and thereby make the content sharable by mobile device.
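A toy sketch of what key-based login could look like, using PyNaCl Ed25519 signatures rather than actual PGP, purely to show the challenge-response shape; a real web-of-trust scheme would layer key signing on top of this:

    import os
    from nacl.signing import SigningKey, VerifyKey

    # --- user side: generate an identity once, register the public half with a node ---
    user_key = SigningKey.generate()
    registered_pubkey = user_key.verify_key.encode()   # what the node stores instead of a password

    # --- node side: issue a random challenge at login time ---
    challenge = os.urandom(32)

    # --- user side: prove key ownership by signing the challenge ---
    signed_challenge = user_key.sign(challenge)

    # --- node side: verify against the registered public key ---
    VerifyKey(registered_pubkey).verify(signed_challenge)   # raises BadSignatureError on a forgery
    print("login ok - no shared password ever stored on the node")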
Idea: IPFS - distributed filesystem https://ipfs.io
...
- projects based on ipfs:
- https://github.com/ipfspics/ipfspics-server "Distributed image hosting"
- this seems really promising
- https://github.com/haadcode/orbit "Distributed, serverless, peer-to-peer chat application on IPFS"
- https://github.com/fazo96/ipfs-boards "a truly distributed social platform for the browser with no backend and no external applications required"
-
- tools:
- https://github.com/whyrusleeping/ipns-pub publish things with ipns
- IPFS's "official" solution to a distributed database is IPLD
- spec:
- code:
- (work in progress/confusing af imo)
- IPFS can be an absolute PITA to work with (from direct and indirect experience), it's early days but could work once matured a bit more. Maybe others have some clever ideas to get around the pain ;)
- Major drawbacks include speed, synchronicity, and implementation differences (go-ipfs is the most popular afaik)
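For reference, talking to a local go-ipfs daemon is plain HTTP against its API port. A sketch of adding a file and reading it back, assuming `ipfs daemon` is running with the default API address:

    import requests

    API = "http://127.0.0.1:5001/api/v0"               # default go-ipfs API address

    # Add a file; the daemon returns its content hash (CID)
    resp = requests.post(f"{API}/add", files={"file": ("hello.txt", b"hello distributed music\n")})
    resp.raise_for_status()
    cid = resp.json()["Hash"]
    print("added as", cid)

    # Read it back by hash from the local node
    content = requests.post(f"{API}/cat", params={"arg": cid}).content
    print(content.decode())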
Idea: Tahoe-LAFS - similar to IPFS, but it's an encrypted distributed filesystem, so it is more suited to private content
More distributed dbs:
https://github.com/haadcode/orbit-db "Distributed peer-to-peer database on IPFS" (JavaScript)
https://github.com/amark/gun "A realtime, decentralized, offline-first, graph database engine. http://gun.js.org/" (JavaScript)
https://github.com/bigchaindb/bigchaindb "BigchainDB is a scalable blockchain database https://www.bigchaindb.com/" (Python)
I could see a NoSQL backend working well for this kind of content - it scales well horizontally, is easily clustered in a distributed manner, and eventual consistency should work well for this kind of content. Maybe Couchbase: http://www.couchbase.com/
What about incorporating Hadoop toolset for parsing the various datapoints? HDFS, MapReduce (or Spark), Sqoop, Pig, Avro, Zookeeper, and Flume all seem very applicable here.
Decentralized Web
https://github.com/cjb/GitTorrent "A decentralization of GitHub using BitTorrent and Bitcoin"
https://github.com/blockstack/blockstack-core
https://github.com/mediachain/mediachain
https://github.com/datproject/dat
https://github.com/HelloZeroNet/ZeroNet -- https://zeronet.io/
http://www.mediachain.io/
https://morph.is/v0.8/
https://www.wikipediap2p.org/
Miscellaneous:
http://telehash.org/ "encrypted mesh protocol"
https://github.com/feross/webtorrent "Streaming torrent client for the web https://webtorrent.io"
- projects based on webtorrent:
- (NOTE: webtorrent is incompatible with regular bittorrent as of right now; requires special webtorrent clients/plugins)
https://clickhouse.yandex/ "open-source column-oriented database management system"
https://github.com/gitchain/gitchain "Decentralized, peer-to-peer Git repositories aka "Git meets Bitcoin" "
https://github.com/bitchan/bitmessage "Bitmessage is a P2P communications protocol " https://bitmessage.org/wiki/Main_Page [* worth checking out imo]
https://github.com/cjdelisle/cjdns "Cjdns implements an encrypted IPv6 network using public-key cryptography for address allocation and a distributed hash table for routing. This provides near-zero-configuration networking, and prevents many of the security and scalability issues that plague existing networks."
https://github.com/adiitya/p2pstream "P2P Live streaming using centralized architecture http://adityaprakash.in/p2pstream "
https://www.tribler.org/ "Tribler is an open source decentralized BitTorrent client which allows anonymous peer-to-peer by default."
Encrypted email server by lavabit:
https://github.com/lavabit/magma
"Classic" BitTorrent:
https://github.com/chihaya/chihaya "A customizible, multi-protocol BitTorrent Tracker" (Go)
https://github.com/drbawb/babou "Babou is a combination web-framework/ torrent-tracker written in Go."
https://github.com/mdlayher/goat "Goat: Go Awesome Tracker. BitTorrent tracker implementation, written in Go. MIT Licensed."
https://github.com/leighmacdonald/mika "mika: Go based torrent tracker using redis for a backend and designed for private site use"
- Don't forget: It needs to be simple for end-users who just want to share content, and preferably not blocked on corporate firewalls or by ISPs
- Move away from old PHP systems for sure. Gazelle's great and all but PHP encourages laziness and sloppy code (and it's PHP) (+1)(+1)(-1)(+1)(+1)
- True but: best if we can build a decentralized layer on top of existing tech (e.g. Gazelle) and then slowly replace it piecemeal. Rebuilding Gazelle from scratch should be the ultimate goal but is unlikely to happen; the current momentum won't last long. Something like wikipediap2p would help make the transition smooth
- - Hence the RDBMS-based solution: not ideal, but best for right now if we can, say, make Gazelle work with Postgres and then replicate some of the tables with pg_paxos. Then we get a distributed architecture but keep the old UI (for now)
- If a group of people is willing to work on it, I'm willing to help rebuild Gazelle from scratch, possibly in Go. (hahaha) We can argue all day about languages/frameworks, though I will suggest that Go is great but maybe a little more low-level than we want.
- -100 to Go (-- = + hehe), Go makes me cry: http://yager.io/programming/go.html - without parametric polymorphism there are only so many libraries you can write before you run into the `interface{}` upcasting problem over and over again.
- Oh whatever, we're talking about a database-backed web app.
- Even web apps have to build abstractions, or else you get completely tied up in boilerplate and cruft, writing the same thing over and over again. That can be done in almost any language, though, with some work.
- *with some work* - read the above link
- Honestly I would vote for Rails.
- The reason I suggest Go is that goroutines are well suited to a web application, and there are many frameworks that can simplify web development; you don't have to start at the HTTP-server level. Also, please not Rails, it doesn't scale well.
- There are plenty of perfectly good programming languages with green threads that also have a type system that isn't crap.
Also, "doesn't scale well" is a non-issue for a private tracker with 10,000 users.
- The language really doesn't matter much at this point. We have to figure out a distributed architecture to share the data first; then we can pick technologies. If we have a semi-distributed RDBMS, then everyone can write their own UI on top of their local synced copy of the data. You can make yours in Rails, someone else can use adapted Gazelle, someone else can do Go. The UI is the least important part. And I just realized I could find someone's blog post about how bad any language is. I give up; let's just work on the arch.
fine by me
N.B. what.cd was on gazelle (PHP) and we all liked it. https://eev.ee/blog/2012/04/09/php-a-fractal-of-bad-design/ read that once, it's a good article
- What if there was a way to have a "swarm" of self-replicating trackers? For argument's sake, let's say there are 10. When you go to download/upload using the website/UI, it randomly picks one of the 10 trackers, and then the others replicate it and vice versa (ideally they would be continuously replicating if possible, otherwise once a day, once a week, etc.). This way, if one goes down you lose either nothing or just a very small amount of data. There could still be a central authority moderating one of the trackers (which effectively moderates all of them via replication) - but if they get taken down we would only be losing governance, not all the data.
quick rant..
Public trackers have a pretty easy job - if all the information is public then there is no problem with scattering backups all over so that the next site can just pick it up and move on.
Private trackers have to deal with having private information, typically authentication. Information which must also be protected from unauthorized access.
So, say you have a distributed file system: who gets to make an update? If you force all updates to be cryptographically signed, then you can grant keys to privileged servers. But if the police get a key, they can hijack things, so there's a problem.
Even if all privileged servers went down, all the data would be safe for migration into a new network.
Similarly, private information could be signed and encrypted. For the end user to access this information, the user would have to talk with a privileged server.
I suppose this is just a less refined version of ZeroNet. I think the lesson I want to take from this is that the owner of a zeronet site could store information encrypted in the blockchain that only the owner (+ friends) would be able to decrypt
STRUCTURE:
001 PROBLEM OF ACQUIRING AND MAINTAINING TOP QUALITY METADATA
1 requirement/wish "best quality ever descriptions, tags, collages"
- FLAC-Quality Detector:
- use machine learning to detect transcodes to FLAC, and score FLAC quality
- Example: give it spectral analyses and FLACs to train on. The computer would then say "FLAC score is low, most likely a shit transcode",
- or "spectral image looks good / matches nicely with known-good FLACs". (A rough heuristic sketch follows this list.)
- Album-Quality Detector:
- Score "album" content quality by matching it against Discogs and MusicBrainz.
- Compare the "album" file list against the Discogs/MusicBrainz track list: if tracks are missing, misnamed, etc., score low; if the file list matches the track list, score well (note though - MusicBrainz has a lot of typos for less mainstream music).
- Run it as a bot on any content network, and store its results somewhere else.
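Before any machine learning, even a dumb heuristic catches a lot of lossy-to-FLAC transcodes, because lossy encoders low-pass somewhere around 16-20 kHz. A rough sketch (the threshold and filename are illustrative only; a real detector would be trained on known-good rips as described above):

    import numpy as np
    import soundfile as sf
    from scipy.signal import spectrogram

    def highband_energy_ratio(path, cutoff_hz=19000):
        """Fraction of spectral energy above cutoff_hz - near zero suggests a lossy transcode."""
        samples, rate = sf.read(path)
        if samples.ndim > 1:                 # fold stereo down to mono
            samples = samples.mean(axis=1)
        freqs, _times, power = spectrogram(samples, fs=rate, nperseg=4096)
        total = power.sum()
        return float(power[freqs >= cutoff_hz].sum() / total) if total > 0 else 0.0

    ratio = highband_energy_ratio("01 - some track.flac")   # placeholder path
    print("likely transcode" if ratio < 1e-4 else "spectrum looks plausible", f"(ratio={ratio:.2e})")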
1a. All legal; this can live on MusicBrainz/MetaBrainz. Take backups of MusicBrainz (already available) and share them via torrent.
1b. As this is legal, we can use the MusicBrainz website with identity/OAuth/whatever to curate content and prevent spam.
Reward is MISSING? *** Why would we do manual labor? Somehow we must be rewarded with internet points. TODO
002 PROBLEM OF DISTRIBUTING CONTENT WITHOUT RISK TO USER/ADMINS OR RISK OF HAVING THE DISTRIBUTION SYSTEM DESTROYED/DEGRADED
2. Torrents over I2P work, and would be fast if there were more people there. IPFS doesn't provide anonymity; the other options above also don't provide anonymity.
003 LINKING "ILLEGAL" CONTENT WITH MUSIC LOVERS
Link/associate a MusicBrainz release with track/album content such as a magnet link.
Here an IPFS or ZeroNet website can be created which lists those links and can be updated (hence ZeroNet may be better;
last I heard IPFS didn't support updating a website well). A user submits their previous what.cd torrent to a tracker on I2P (public, DHT-like), and can then provide
the torrent's magnet link to the ZeroNet website together with the MusicBrainz ID it is supposed to correspond to,
for others to find and begin seeding/downloading. The ZeroNet website would be the index of torrent links on I2P plus MusicBrainz IDs. The website must be free/open-source,
so others can run their own if they like, along with the database mapping torrent links on the I2P tracker to MusicBrainz IDs ...
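The index entry itself could be tiny; the JSON below is just a guess at the fields such an index would need per release, nothing more:

    import json

    # Hypothetical shape of one entry in the ZeroNet/IPFS-hosted index:
    entry = {
        "musicbrainz_release_id": "00000000-0000-0000-0000-000000000000",  # placeholder MBID
        "magnet": "magnet:?xt=urn:btih:INFOHASH_PLACEHOLDER",              # torrent on the I2P tracker
        "format": "FLAC",
        "submitted_by": "SUBMITTER_KEY_FINGERPRINT",
        "signature": "SIGNATURE_OVER_THE_FIELDS_ABOVE",
    }
    print(json.dumps(entry, indent=2))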
Idea/Concept: Technical specification for using the Tor network to configure an endpoint to another server.
1. Set up Tor exit and entry nodes
2.
Want to make yourself useful?
I'd suggest, if anyone wants to make themselves useful: try setting up a Docker or Vagrant thing with pg_paxos in a replicated cluster
or Kubernetes or docker-compose
get the Gazelle/WCD schema, load it in, and replicate some tables with Paxos
share your code/results
figure out how much effort it would be to port Gazelle to run on Postgres instead of MySQL
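As a starting point for 'share your code/results', here is a tiny smoke test that writes on one node and polls another for the row. It assumes you already have two Postgres containers with pg_paxos configured to replicate a test_items table per the pg_paxos README; hostnames, credentials, and the table are all placeholders:

    import time
    import psycopg2

    NODE_A = dict(host="node-a", dbname="gazelle", user="paxos", password="secret")  # placeholders
    NODE_B = dict(host="node-b", dbname="gazelle", user="paxos", password="secret")

    # Write on node A (test_items is assumed to already be set up as a paxos-replicated table)
    with psycopg2.connect(**NODE_A) as conn_a, conn_a.cursor() as cur:
        cur.execute("INSERT INTO test_items (name) VALUES (%s) RETURNING id", ("smoke-test",))
        new_id = cur.fetchone()[0]

    # Poll node B until the replicated row shows up (or we give up)
    with psycopg2.connect(**NODE_B) as conn_b:
        for attempt in range(30):
            with conn_b.cursor() as cur:
                cur.execute("SELECT name FROM test_items WHERE id = %s", (new_id,))
                row = cur.fetchone()
            if row:
                print(f"replicated after ~{attempt}s: {row[0]}")
                break
            time.sleep(1)
        else:
            print("row never showed up - replication is not working")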