Data Mining Websites like LinkedIn, IMDB, and Glassdoor - How Would You Do This If You Had To?

Hey guys. Forgive me for having this not be about WK or Japanese but my questions are about web programming and this is the best community I know for it. I wanted to ask a general question of “if I were to do ____, what might be the best approach?”

My Goal ( background: I am a Film Pipeline TD in-the-making and I’m going to graduate from SCAD soon ):
- To create scripts that automatically go through LinkedIn and IMDB and aggregate data about companies and individual people, so I can keep an eye on industry trends. A secondary goal would be to implement the same system for Glassdoor, because it has useful company/job statistics.
- I would be able to input specific links and get data from them, but I’d also like to be able to do reverse lookups. For example, if I have a name or e-mail address, I want to be able to send a query and grab data about that person if their name or email is registered to a LinkedIn/IMDB account.
- Create filters and exception cases so I can look at a specific time range, say March 2013 to February 2015, and exclude all other information (a rough sketch of such a filter follows at the end of this post).
- Do a general search by profession - like typing in “vfx”, “film producer”, or “film director” - and get back all of the data currently associated with that job title.

I’m a Python programmer by trade; it’s the core of a ton of my tools. Is that the best language for this, though? I have some knowledge of PHP, CSS, and HTML - at least enough to make a basic website. C++ is my weakest language at the moment; I’ve never had the proper moment to learn it.

What are your thoughts? Are there potential pitfalls and how can I avoid them?
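
(To make the time-range filter above concrete, here is a minimal Python sketch; the record structure and field names are hypothetical, purely for illustration:)

import datetime

def in_range(record, start, end):
    """Keep records whose start_date falls inside [start, end]."""
    date = datetime.date.fromisoformat(record["start_date"])
    return start <= date <= end

records = [
    {"name": "Jane Doe", "title": "VFX Artist", "start_date": "2014-06-01"},
    {"name": "John Roe", "title": "Producer", "start_date": "2016-01-15"},
]

window_start = datetime.date(2013, 3, 1)
window_end = datetime.date(2015, 2, 28)
filtered = [r for r in records if in_range(r, window_start, window_end)]
print(filtered)  # only the 2014 record is kept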

Hmm, honestly, I don’t think C++ would be a good fit - it’s doable, sure, but I don’t think it would be the best language for this. In my humble opinion, I would stick with Python or Ruby. Java might also work, but it’s in the same vein as C++ - you could use it, but it may not be the best route.

As a data scientist, I could never recommend you anything other than Python, although I’m a C, C++, and C# programmer as well.

Edit: For crawling itself, I once used a very good piece of software made for exactly that purpose: Visual Web Ripper. It won’t work for your project, though for other projects you might consider it.

If I were going to make this I would use Python.

As a programmer, I would probably never build this, since if I really wanted that information I would just pay for the service as a token of gratitude to each site’s developers.
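
(To make the “use Python” suggestion concrete, a minimal fetch-and-parse sketch using requests and BeautifulSoup; the URL and the ".job-title" CSS selector are placeholders, not real LinkedIn/IMDB markup:)

import requests
from bs4 import BeautifulSoup

# Fetch one page and pull out text by CSS selector.
response = requests.get("https://example.com/profile/12345", timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
titles = [el.get_text(strip=True) for el in soup.select(".job-title")]
print(titles)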

Hey,

I once made one of these using Java (I also work as a Java analyst). I collected the whole database of my state transport company and crossed it with another database from my region.

Not my fault they had an “open link” for consulting those members’ profiles.

It resulted in a database of 800 thousand people: addresses, phone numbers, documents, and school/workplace.

To avoid detection, it took 15 days to collect it all. I used the TOR network to access both websites, and I controlled how many requests per minute I made through each hour of the day.

You might want to do that too if you want to avoid attracting attention.
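
(As a rough illustration of that throttling idea, a minimal Python sketch; the URLs are placeholders, and the proxy lines assume a local Tor client listening on port 9050, which needs "pip install requests[socks]":)

import random
import time
import requests

# Placeholder URLs standing in for the pages being collected.
urls = ["https://example.com/page/%d" % i for i in range(1, 6)]

# Optional: route traffic through a local Tor SOCKS proxy.
proxies = {"http": "socks5h://127.0.0.1:9050",
           "https": "socks5h://127.0.0.1:9050"}

for url in urls:
    resp = requests.get(url, proxies=proxies, timeout=30)
    print(url, resp.status_code)
    # Sleep a random 20-40 seconds so the traffic looks human-paced.
    time.sleep(random.uniform(20, 40))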

alucardeck said... You might want to do that too if you want to avoid attracting attention.
 These websites are bigger, so they would probably not notice the activity as easily. And I don’t think he’s trying to grab all the content there.

Hey guys, thank you for your replies! Just to chime in - I’m mainly interested in people in my own field. Film is a fairly small industry, so I’m not too concerned about raising any flags. Plus, everything I’d be grabbing is publicly available (people manage what they put in their own LinkedIn accounts, after all), so I don’t think that’d be a problem.

prawn said… I would just pay for the service as a token of gratitude to each site’s developers.
 Unfortunately, no such service exists that collects this stuff! You can buy IMDB Pro and LinkedIn Premium, but that only gives you access to more accounts and info - it won’t collect it in a nice package. Also, I love your username - District 9 reference, maybe?

Plus, I will use this for the rest of my career. I don’t want to spend hours every year or half year updating my research. Ideally, it should be a couple of shiny buttons and filters that pull from a database each year so I can plot industry trends.


alucardeck said… Not my fault they had an “open link” for consulting those members’ profiles.
 Can you elaborate on “open link”, please? I get the concept, but what makes something open or closed?

I knew you guys were the right people to ask ^_^!

meneldal said... These websites are bigger, so they would probably not notice the activity as easily. And I don't think he's trying to grab all the content there.
Oh, they are... they really are monitoring for this kind of activity. However, that doesn't mean they will take any action against it.

The number of requests coming from the same IP, accessing random pages in no logical order, is very suspicious. To avoid being a possible target of retaliation, I really recommend never doing this from your own machine.
alucardeck said... To avoid being a possible target of retaliation, I really recommend never doing this from your own machine.
 ....how many requests do you think OP will be making?
korinkite said... Can you elaborate on "open link", please? I get the concept, but what makes something open or closed?

Sure,

For example, check this link:
http://www.emtu.sp.gov.br/passe/formulariocompleto.htm?regiao=255&seq=235840000

Inside you can find: where and what he studies, his name, CPF/RG (personal documents), his mom's name, his dad's name, his birthday, address, e-mail, phone number, and which bus he takes every day to get to school.

Besides being a total security failure, this link is supposed to be distributed only to him, after he logs into his account and clicks the "Pay" button that displays this bank payment slip.

But... it's open. You can access it, I can, everybody can. No login or security of any kind is needed.

Considering that, we can say this link is "open" - you didn't need to hack any account to reach it. But considering I could only obtain this link after a security check, I could also say this link is "closed" or "private".

Now the million-dollar question:
   - Do we have the right to mine it?

LinkedIn, IMDB, etc... they are "open" out there (except where you need to log in to your own LinkedIn account to reveal non-public profile information).

IMO, if it's "open", you can do whatever you want with it... and that is what I did :)
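
(For illustration: a quick way to test whether a link is "open" in this sense is to fetch it with a fresh client that has no cookies or login. A minimal Python sketch, using a placeholder URL:)

import requests

url = "https://example.com/some/deep/link"  # substitute the link to test

resp = requests.get(url, allow_redirects=False, timeout=10)
if resp.status_code == 200:
    print("open: the server returned the page with no authentication")
elif resp.status_code in (301, 302, 303, 307, 308):
    print("redirected, probably to a login page:", resp.headers.get("Location"))
else:
    print("closed: got status", resp.status_code)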




rlgreen91 said... ....how many requests do you think OP will be making?
 It's hard to put a specific number on it. It depends on how much a company chooses to reveal in its job listings - or how much of that info a website chooses to log. Glassdoor is awesome because it has info on all sorts of jobs, even ones that aren't being actively recruited for right this second.

I have a list of 80 specific people I want to keep track of - and that list grows every time I find an industry name or connection I don't know - so that's 160 profiles between LinkedIn and IMDB.

I'm currently looking at 47 companies. So if I go the "search by company" route, I'll probably end up with hundreds of results per company, which will then be filtered down to a few dozen per company to actually query data from (for example, I don't care about the job history of a janitor or an HR consultant at Weta, so I'd make sure I only grab data from people who actually had VFX roles at that company).

So in the end, it's safe to say I'd be getting at least a few hundred results, maybe even a couple thousand depending on how wide I cast my net, but nothing close to 800 thousand :) .

For those results, I'd then grab the full job history of each person and, if I go with Python, load it into a JSON dict to store and display it.
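
(A minimal Python sketch of that filter-then-store step; the records, the keyword list, and the file name are all hypothetical:)

import json

# Hypothetical scraped records; in practice these come from the crawler.
people = [
    {"name": "A. Artist", "company": "Weta", "title": "VFX Compositor"},
    {"name": "B. Janitor", "company": "Weta", "title": "Janitor"},
    {"name": "C. Super", "company": "Weta", "title": "VFX Supervisor"},
]

# Keywords marking the roles worth keeping; tune to taste.
KEEP_KEYWORDS = ("vfx", "compositor", "lighting", "pipeline")

def is_relevant(person):
    title = person["title"].lower()
    return any(keyword in title for keyword in KEEP_KEYWORDS)

relevant = [p for p in people if is_relevant(p)]

# Persist as JSON so each yearly run can be compared against the last.
with open("industry_snapshot.json", "w") as handle:
    json.dump(relevant, handle, indent=2)
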
rlgreen91 said... ....how many requests do you think OP will be making?
 Let's estimate:

korinkite said... To create scripts that automatically go through LinkedIn and IMDB and aggregate data about companies and individual people, so I can keep an eye on industry trends.
 Let's suppose he wants to track only one person for now.

Koichi Senpai
hello@tofugu.com

LinkedIn:
- at least one request per day on the profile itself (I would do hourly during the day, but let's keep things simple here)
- one request per day for each contact (to track new connections, referrals, comments, etc.)
- one request per day for each profile at his current job

I'm not sure what the average number of LinkedIn contacts is, but that works out to roughly 200~1,000 requests per day, every day, per person tracked. Of course, LinkedIn is monitoring for that, since they are always improving the UI experience. (And they changed the UI again; I checked my login 2 months ago.)

IMO, if you really want to get relevant information about your target's network, I would go 3~4 levels deep into their connections. That would consume maybe 5,000~10,000 requests per day per target.

As for IMDB, they removed the user search functionality a few years ago, so now you have to stumble on something your target posted to get their profile link, or use a trick on Google to search it for you. That said, IMDB needs very few requests per day - about 5~6 per target - to track each section (reviews, ratings, comments, etc.).
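
(To sanity-check those numbers: a connection graph fans out fast. A back-of-envelope Python sketch, assuming an average of 200 contacts per profile, shows why deep crawls have to be sampled or capped:)

# One profile fetch per connection per day, 200 contacts per profile assumed.
AVG_CONTACTS = 200

def daily_requests(depth):
    """Profiles visited when following connections `depth` levels out."""
    total, frontier = 1, 1  # start with the target's own profile
    for _ in range(depth):
        frontier *= AVG_CONTACTS
        total += frontier
    return total

for depth in (1, 2):
    print(depth, daily_requests(depth))
# depth 1 -> 201, depth 2 -> 40,201: past depth 2 the raw numbers explode,
# so a real 3~4-depth crawl has to sample or cap connections per profile.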

I don’t think it would make sense to make that many requests that frequently, though. All of the things you outlined should be sufficiently tracked by sending a new request every 4-7 days, in my opinion. People other than recruiters just don’t get that much day-to-day use out of LinkedIn, despite what their emails try to portray.

Edit: this is a reply to alucardeck, not OP.

rlgreen91 said... All of the things you outlined should be sufficiently tracked by sending a new request every 4-7 days, in my opinion.
 Agreed - checking LinkedIn every day would probably be overkill.

See, now this makes me want to write a script that will go through the しりとり (shiritori) post and gather all of the entries.

I would check Glassdoor every day for sure, but even the most prolific people I know on LinkedIn don’t update even weekly, so it doesn’t make sense to run searches more often than that. Likewise with IMDB - it gets updated very slowly, so I wouldn’t need to worry about the overhead of checking often.
Also, IMDb definitely has a person search feature (top of the page), unless you mean something else.
The alternative is to use “site:www.imdb.com firstname lastname” in a Google search and apply their site filters. Comes in handy for projects!
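
(A minimal Python sketch of building that Google query programmatically; quoting the name is my addition, to force an exact-phrase match:)

from urllib.parse import quote_plus

def imdb_google_query(first, last):
    """Build a Google search URL scoped to IMDb, as described above."""
    query = 'site:www.imdb.com "%s %s"' % (first, last)
    return "https://www.google.com/search?q=" + quote_plus(query)

print(imdb_google_query("Jane", "Doe"))
# -> https://www.google.com/search?q=site%3Awww.imdb.com+%22Jane+Doe%22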

rlgreen91 said... See, now this makes me want to write a script that will go through the しりとり post and gather all of the entries.
 Do it!
alucardeck said... IMO, if you really want to get relevant information about your target's network, I would go 3~4 levels deep into their connections.
 I never thought of this, but I will definitely incorporate a depth-level setting in the tool. Thank you for the idea!
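
(A minimal sketch of what such a depth setting could look like in Python: a breadth-first walk with a max_depth cutoff. The get_connections argument is a placeholder for whatever actually fetches a profile's connection list:)

from collections import deque

def crawl_connections(start, get_connections, max_depth):
    """Breadth-first walk over profile connections, stopping at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        profile, depth = queue.popleft()
        visited.append((profile, depth))
        if depth == max_depth:
            continue  # don't expand connections past the cutoff
        for neighbor in get_connections(profile):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return visited

# Toy usage with a hard-coded graph standing in for real profile fetches:
fake_graph = {"koichi": ["a", "b"], "a": ["c"], "b": [], "c": []}
print(crawl_connections("koichi", lambda p: fake_graph.get(p, []), max_depth=2))
# -> [('koichi', 0), ('a', 1), ('b', 1), ('c', 2)]
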
korinkite said... What are your thoughts? Are there potential pitfalls and how can I avoid them?
 Probably more of a campfire topic.