Organizing dumps
by uninspiredusername - Tuesday April 9, 2024 at 05:17 PM
#1
Rebuilding my collection after some unfortunate data loss (was careless) (still am)

Considering a more consistent solution this time around. Y'all have any interesting storage methods that might be worth exploring?

At this point, I'm probably gonna end up with a simple Postgres db with a table for each dump and consistent column names for common field types. Something that allows for more consistent processing, rather than dealing with a dozen different formats and such.

(I don't care what you do with the data -- just contemplating raw original formats vs. relational dbs vs. NoSQL vs. Elastic, etc.)
Reply
#2
I have the same issue. I'm considering Solr, as search.0t.rocks suggests. I've tried it, but honestly there's a lot of work per file. I'm also considering Elasticsearch with Kibana. Another option, if you have all your data on SSDs and a decent CPU, is simply running fgrep in parallel. But the Solr option still looks like the best one, since it lets you do as you please and associate the data across sources. You'll still be left with one "little" issue: some SQL dumps don't bother adding a newline after each record, which simply means you'll have to fix that before you can index them.
A Postgres db isn't really a priority for me, since it only lets me query things I already know about, not dive across everything at once. What do you think about that?
(If anyone has already done the indexing part, please share your Python script, since we need one script per file/dump/db. The "unstructured" option is really bad for the association feature.)
Reply
#3
(Apr 09, 2024, 05:33 PM)ygrek Wrote: [ . . . ]

I've actually never messed with Solr. I should set some time aside soon to try it out -- at least to have it on the back burner for later

> some sql dumps don't bother adding a newline after each record

Honestly, that doesn't bother me too much. Even if it's not consistent across dumps, it's (usually) consistent within any one file. Five minutes to figure out the formatting and you can load it into a db pretty easily. It's definitely an extra step, though.

> A psql db isn't really a priority for me, since it only lets me query things I already know about, not dive across everything at once. What do you think about that?

I don't see it that way actually

Let's say I want unique email addresses from one particular leak. Theoretically, I could just run something like SELECT DISTINCT email FROM whatever_tf_the_table_is_called, and then process the results however I normally would.
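To make that concrete, here's a minimal sketch of the "one table per dump, consistent column names" idea using stdlib sqlite3 (the table name, column names, and sample rows are all placeholders I made up, not anything from this thread):

```python
import sqlite3

# One table per dump, with consistent column names across tables.
# "dump_example" / "email" / "name" are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dump_example (email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO dump_example VALUES (?, ?)",
    [("a@example.com", "A"), ("b@example.com", "B"), ("a@example.com", "A2")],
)

# Unique emails from a single source, as described above.
unique_emails = [row[0] for row in
                 conn.execute("SELECT DISTINCT email FROM dump_example ORDER BY email")]
print(unique_emails)  # ['a@example.com', 'b@example.com']
```

The point being: once column names are consistent, the same DISTINCT query works against any table without per-dump logic.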

Anything you think I'm overlooking on that part?
Reply
#4
Buy PCIe 4.0 x4 NVMe SSDs and just grep the raw files...
Reply
#5
(Apr 11, 2024, 09:23 AM)joepa Wrote: Buy PCIe 4.0 x4 NVMe SSDs and just grep the raw files...

too easy

if you're not gonna reinvent the wheel why bother (/s)
Reply
#6
(Apr 11, 2024, 09:23 AM)joepa Wrote: Buy PCIe 4.0 x4 NVMe SSDs and just grep the raw files...

that's actually what i already have, but it's still not fast enough. Think a few TB of data, not a few GB.

(Apr 10, 2024, 11:45 PM)uninspiredusername Wrote:
[ . . . ]

Also, about selecting from only one leak or source: you simply pass the source leak as a field when indexing, and then when querying you have the option to filter on it. Solr is very useful and efficient for data association, OSINT in short Smile If you need the data for something else, like selling or exploiting it (I don't know how, but that looks evil to me at this point), it's better to use an SQL solution, since that fits an industrialized process.
Reply

