I’ve spent the last few days at the University of Dundee, working on my masters project and preparing for the DDD day.
Here’s a quick review of the sessions I went to – it was a great day, shame it’s only once a year.
Data mining the social web
Gary’s (@garyshort) talk focused on Twitter for marketing analysis. He’s an accomplished, very natural presenter and easily held my attention for an hour, so not much else to say on that.
I picked out the calculations used to measure the data. I probably would have called the session “reporting on the social web” rather than data mining, but maybe that’s being too pedantic and not in the spirit of the day.
Posts by time of day – if people are posting, they’re probably reading too.
Busiest hour = best time to put out marketing messages
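A rough sketch of how the posts-by-hour count could be done, assuming each tweet comes with an ISO 8601 timestamp. The sample timestamps and the helper names are made up for illustration; this is not Gary's code.

```python
from collections import Counter
from datetime import datetime


def posts_by_hour(timestamps):
    """Count posts per hour of day (0-23) from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def busiest_hour(timestamps):
    """Return the hour of day with the most posts."""
    counts = posts_by_hour(timestamps)
    return counts.most_common(1)[0][0]


# Made-up sample data, not real tweets.
sample = [
    "2012-03-10T09:15:00",
    "2012-03-10T13:05:00",
    "2012-03-10T13:40:00",
    "2012-03-11T13:20:00",
    "2012-03-11T18:30:00",
]

print(busiest_hour(sample))
```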
Acceleration graph – how fast do people start to talk? This is the first derivative of the posting rate
Shows conversations – how engaging is the product?
Use standard deviation to alert when acceleration falls outside the norm
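To make the acceleration idea concrete, here is a minimal sketch, assuming tweet counts have already been bucketed per hour. The counts, the function names and the threshold of 2 standard deviations are my own assumptions, not figures from the talk.

```python
from statistics import mean, stdev


def acceleration(counts):
    """First difference of a series of tweet counts per time bucket."""
    return [b - a for a, b in zip(counts, counts[1:])]


def unusual_buckets(counts, threshold=2.0):
    """Return indexes into `counts` where the change from the previous
    bucket is more than `threshold` standard deviations from the mean change."""
    accel = acceleration(counts)
    if len(accel) < 2:
        return []  # not enough data to estimate a standard deviation
    mu = mean(accel)
    sigma = stdev(accel)
    if sigma == 0:
        return []  # perfectly steady change, nothing stands out
    return [i + 1 for i, a in enumerate(accel) if abs(a - mu) > threshold * sigma]


# Made-up hourly counts: chatter jumps sharply in the sixth hour.
hourly = [10, 12, 11, 13, 12, 60, 62, 61]

print(unusual_buckets(hourly))
```

One thing to watch: a big spike also inflates the standard deviation itself, so in practice the "norm" would probably be better calculated from a quieter historical window.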
Share of tweets by day, in market space you’re in. Buzz or volume of tweets
Doesn’t show sentiment, but all tweets will be referenced by search engines
I thought this would be difficult to measure in practice; getting all the tweets for the market space you’re in could end up with many different definitions.
Top 10 words by frequency
Reveals words to use in marketing – Google AdWords
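A quick sketch of a top-words count, assuming the tweets are plain strings. The stop-word list and sample tweets are invented for illustration, and filtering out stop words is my own addition rather than something from the talk.

```python
import re
from collections import Counter

# A tiny, made-up stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "it", "for"}


def top_words(tweets, n=10):
    """Return the n most frequent words across tweets as (word, count) pairs."""
    words = []
    for tweet in tweets:
        # Lower-case and keep only runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word not in STOP_WORDS:
                words.append(word)
    return Counter(words).most_common(n)


# Made-up sample data, not real tweets.
tweets = [
    "Hadoop is great for big data",
    "Getting started with Hadoop",
    "Big data and Hadoop in the cloud",
]

print(top_words(tweets, 3))
```

The simple regex here would mangle URLs, hashtags and @mentions, so real tweets would need a more careful tokeniser.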
Most frequent posters – Influencers
Most retweets – also influencers, remove the bots (whose tweets are retweeted the most)
A measure of how much people agree with what you say
Evangelist engagement – replies by post (number of replies / number of posts)
Indication of quality of relationship
Lexical diversity – measurement of vocabulary (distinct words / all words)
Indication of new content injected into community for each new tweet.
10% equals 1 in 10 tweets contains new information
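The lexical diversity ratio is simple enough to sketch, assuming a naive whitespace split counts as a "word". The sample tweets and function name are made up.

```python
def lexical_diversity(tweets):
    """Distinct words divided by all words across a list of tweet strings."""
    words = [word for tweet in tweets for word in tweet.lower().split()]
    if not words:
        return 0.0  # avoid dividing by zero when there is no text
    return len(set(words)) / len(words)


# Made-up sample: 7 words in total, 3 of them distinct.
sample = ["big data is big", "data is data"]

print(lexical_diversity(sample))
```

The same shape works for the evangelist engagement figure, which is just number of replies divided by number of posts.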
What do conversations look like?
Outliers from main network are called cliques – need to be brought into network to be engaged
Where are my customers – geospatial info. Shows where the tweet was made from
How do I make links more effective, link in 2/3 of tweet
Bit.ly API to show how many links were clicked…
Sentiment – 48hours, minimal, hygiene
Trigrams, collocations
Overall a very good presentation. I would have liked to hear more on how the location data works and the sentiment info. I have no idea how a trigram works! Also a shame not to see the code, but that boy can talk LOL
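For what it's worth, a word trigram in the usual sense is just a run of three consecutive words (they can also be built from characters). How Gary fed them into the sentiment analysis I can't say, so the sketch below only shows the splitting step, with a made-up function name and sample sentence.

```python
def ngrams(text, n=3):
    """Return every run of n consecutive words in text, as tuples."""
    words = text.lower().split()
    # Slide a window of n words along the text, one word at a time.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


for trigram in ngrams("getting started with Hadoop today"):
    print(trigram)
```

Counting which word groups turn up together more often than chance would suggest is, as far as I understand it, what finding collocations is about.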
SQL, one language to rule them all
Duncan Irving presented on this topic, showing the Teradata approach to data management following the acquisition of Asterdata.
Good breakdown of data-related tasks into 3 areas:
Knowledge discovery – data science, deriving new information – vaguer questions, how does something respond to factors
Decision support. Operational – business grade data
Deep freeze – data storage from an IT point of view
He then looked at the overlap between KD and DS as operational usage, and at the elements of data analysis and their users:
Mining – statisticians
Management – DBAs, data architects
Analysis – marketeers
Development – programmers
I like this definition of the law of big data:
More data outperforms more complex models.
I’d not heard that before… Also the phrase “repurposing data” is a great one: changing the way data is analysed, stored or formatted to fit a new purpose.
the rise of investigative analytics requires an investigative architecture
Duncan gave a brief overview of the day-to-day role of a data scientist (big buzzword today):
Computer science, maths, data mining, choosing when data warehousing or distributed processing is appropriate.
Integration
Investigation
Implementation – feedback to integrate
Output data driven products, insight for decisions, data warehouse
He finished with the different types of data activity; the EDW should be good at handling all of them.
The Teradata approach is to bring the world of NoSQL under the RDBMS umbrella using the Asterdata connectors, which allow SQL querying of all data.
Nice idea to fit it all together if you can afford it! One of the main ideas behind Hadoop is cheap commodity kit; not sure how that translates to the Teradata world.
It must be difficult for speakers tied to a particular company to present at an open event like DDD, but Duncan did an excellent job. It never felt like a sales pitch, it was clear and he’s an entertaining speaker. I also loved the oil and gas seismology images.
Mobile CouchDB
Next I saw the talk on mobile CouchDB by Dale Harvey.
Nice relaxed style with clear, simple slides (even if it was on a Mac)
Started with some great stats
3 billion users online by 2015
With 15 billion connected devices
Mobile issues with:
Reliability
Latency
Bandwidth
Security
Topology
0.5 sec delay = 20% drop in traffic
I knew nothing about DBs on mobile devices or Couch. I thought Dale pitched it just right, so that a novice like me or some of the more experienced programmers could get something out of it. I’ll definitely download the app on my iPad and have a play around… I like the idea of creating a shopping list that gets sent to my wife’s phone.
Getting Started with Hadoop
Next up was my talk on getting started with Hadoop, I’ll leave you to tell me what you thought!
My impression was that, as usual, I speak too fast, miss stuff out and bugger about with the PC too much, but hey, I’m a geek, that’s all allowed! Only one bit of code messed up, when I forgot to transfer the tweet file to HDFS, and I didn’t trip up or knock anything over.
I must get in contact with Gary Short re: creating all his measures in Hadoop.
Then I had to shoot off to get to Edinburgh airport to fly home, so missed Prof. Whitehorn’s talk – hope Andy videoed it!