Location History Analysis

This post analyzes and visualizes personal Location History data provided by Google, using Python and related libraries. It first shows all the places the author has visited, then focuses on Israel and, in particular, on the frequency of visits to Jerusalem. By processing and visualizing the data, the author analyzes how his Jerusalem visit frequency changed across different periods and explores possible underlying reasons, such as moving house and lifestyle changes. Visualization tools turn the abstract data into intuitive graphics, letting readers grasp the information behind the data more clearly.

🤔 **Data acquisition and preprocessing:** The post first downloads the Location History data (in JSON format) from Google Takeout, then reads and preprocesses it with the Pandas library, extracting longitude, latitude, and timestamps and converting them into a format convenient for analysis.

🗺️ **Location visualization:** The Basemap library is used to plot the location data on a map, showing all the places the author has visited as well as the points inside Israel (Jerusalem in particular). The visualization gives an intuitive picture of the author's range of activity.

📅 **Jerusalem visit frequency analysis:** Timestamps are converted to dates, and Pandas' resample function is used to determine, for each day, whether Jerusalem was visited. A custom function computes the number of days elapsed since the most recent Jerusalem visit, and the result is visualized to show how the visit frequency changed over time.

📈 **Trends in visit frequency:** The visualization shows that the author's frequency of Jerusalem visits varies across time periods. The post speculates that these changes relate to factors such as moving house and lifestyle; for example, the visit frequency dropped after the author moved away from Jerusalem.

🤔 **Missing data and future analysis:** The post also points out that parts of the data are missing and that these gaps affect the results. Future work could investigate how to handle the missing data and how to define the Jerusalem area more precisely, yielding a more accurate analysis.

Google has been tracking your footsteps.

In case you don't know already, Google knows where you have been ever since you started carrying a mobile device (unless, of course, you turned off Location History).

Some people might feel uncomfortable with Big Brother watching them. I feel pretty OK with it. It feels safe to know that wherever I go, Google watches me; if I'm lost, Google knows where I am and can send a rescue mission. Go Google!

Anyhow, in case you want to know what Google knows about you, you can download your Location History data from https://www.google.com/settings/takeout.
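
The downloaded file is one big JSON document with a "locations" array (that's the key the loading code below reads). Here's a rough sketch of a single record, using only the fields that appear in this analysis; real files carry more fields per record:

{
    "locations": [
        {
            "timestampMs": "1475514329219",
            "latitudeE7": 320796661,
            "longitudeE7": 347841758,
            "accuracy": 26
        }
    ]
}

Note that coordinates come as integers scaled by 10^7 and timestamps as milliseconds since the epoch, which is why the code below divides by 10000000 and multiplies by 1000000 (milliseconds to nanoseconds), respectively.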

I downloaded mine and started exploring the data. It would be nice to see if we can get some insights from it.

The basic imports:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
from mpl_toolkits.basemap import Basemap
import json
from pandas.io.json import json_normalize

%matplotlib inline
rcParams['figure.figsize'] = (20, 3)

First thing, we'll load the data from the downloaded JSON file.

In [2]:
data = json_normalize(json.load(open('LocationHistory.json', 'r'))['locations'])
data.head()
Out[2]:
accuracy activitys altitude heading latitudeE7 longitudeE7 timestampMs velocity verticalAccuracy
0 26 [{u'activities': [{u'confidence': 100, u'type'... NaN NaN 320796661 347841758 1475514329219 NaN NaN
1 20 NaN NaN NaN 320797239 347841462 1475514261172 NaN NaN
2 20 [{u'activities': [{u'confidence': 100, u'type'... NaN NaN 320797239 347841462 1475514141123 NaN NaN
3 20 NaN NaN NaN 320797239 347841462 1475514008682 NaN NaN
4 20 [{u'activities': [{u'confidence': 75, u'type':... NaN NaN 320797365 347841623 1475513948078 NaN NaN
In [3]:
print 'available data:', ' to '.join(map(lambda t: str(pd.to_datetime(int(t) * 1000000).date()),
                                         [data.timestampMs.min(), data.timestampMs.max()]))
available data: 2013-02-18 to 2016-10-03
In [4]:
data['longitude'] = data.longitudeE7 / 10000000
data['latitude'] = data.latitudeE7 / 10000000
data.drop(['longitudeE7', 'latitudeE7'], axis=1, inplace=True)

The first thing crying out to be done is to plot all the locations I've been to.

In [5]:
def plot_places(data, title, padding, markersize):
    plt.figure(figsize=(10, 10))
    plt.title(title)
    m = Basemap(projection='gall',
                llcrnrlon=data.longitude.min() - padding,
                llcrnrlat=data.latitude.min() - padding,
                urcrnrlon=data.longitude.max() + padding,
                urcrnrlat=data.latitude.max() + padding,
                resolution='h',
                area_thresh=100)
    m.drawcoastlines()
    m.drawcountries()
    m.fillcontinents(color='gainsboro')
    m.drawmapboundary(fill_color='steelblue')
    x, y = m(data.longitude.values, data.latitude.values)
    m.plot(x, y, 'o', c='r', markersize=markersize, alpha=0.2)

plot_places(data=data, title='all the places I have visited', padding=30, markersize=10)

I've been abroad a couple of times. But obviously most of the time I've been in my own country, so let's visualize that:

In [6]:
radius = 3
country_data = data[((data.longitude - data.longitude.median()).abs() <= radius) &
                    ((data.latitude - data.latitude.median()).abs() <= radius)]
plot_places(data=country_data,
            title='places I have visited in my country',
            padding=0.5,
            markersize=3)

In order to find my country's data I just used the median location (using the mean wouldn't work well here, since the other countries I've visited are outliers and would skew it).

Choosing a radius of 3 degrees was done empirically, so that the whole country fits in the plot.
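
As a toy illustration (not part of the original notebook) of why the median is the robust choice here, imagine mostly-local samples plus a short trip abroad:

import pandas as pd

# 95 samples near Israel's longitude, plus 5 outliers from a hypothetical trip to New York
longitudes = pd.Series([34.8] * 95 + [-74.0] * 5)

print 'mean:  ', longitudes.mean()    # 29.36 -- dragged far off by the outliers
print 'median:', longitudes.median()  # 34.8  -- still in Israel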

In case you don't recognize it, that's Israel. Most of my location mass is in Tel-Aviv and Jerusalem. That's pretty reasonable, since I used to live in Jerusalem before I moved to Tel-Aviv. Additionally, my parents live in Jerusalem, so it's expected that I'll visit them occasionally.


An interesting question to ask is how often I visit my parents. Since I moved out of Jerusalem, it isn't reasonable to expect me to visit them more frequently than once a week. But let's be realistic: once a week might be too much. Once a month, however, would make me look like a bad son...

Let's analyze that! First, we'll mark the samples located inside Jerusalem.

In [7]:
data['is_jerusalem'] = ((data.longitude >= 35.148037) &
                        (data.longitude <= 35.288230) &
                        (data.latitude >= 31.724724) &
                        (data.latitude <= 31.831740))
plot_places(data=data[data.is_jerusalem],
            title='Jerusalem samples',
            padding=3,
            markersize=3)

I didn't perform a precise query of the samples located inside Jerusalem. I selected the points inside a rectangle that contains Jerusalem (I just picked the rectangle's corners using Google Maps). It's sufficient for our analysis: whenever I've been inside the rectangle, it has to be that I visited Jerusalem on the same day (since I have no other place to visit in the area). This rectangle defines a cut in the map with a good enough separation between Jerusalem and non-Jerusalem points.
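
If the rectangle ever proved too crude, a more precise point-in-polygon query is one possible refinement (not used in this post). Here's a minimal sketch using the shapely library, where the polygon vertices are hypothetical placeholders (just the rectangle's corners; a real refinement would trace the actual city boundary):

from shapely.geometry import Point, Polygon

# Hypothetical boundary polygon -- replace these placeholder vertices
# with the actual municipal boundary for a precise query.
jerusalem_boundary = Polygon([(35.148037, 31.724724),
                              (35.288230, 31.724724),
                              (35.288230, 31.831740),
                              (35.148037, 31.831740)])

data['is_jerusalem'] = [jerusalem_boundary.contains(Point(lon, lat))
                        for lon, lat in zip(data.longitude, data.latitude)]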

Next, we'll convert the index into a DatetimeIndex so it'll be easier to do time-series manipulations.

In [8]:
data = data.set_index(pd.to_datetime(data.timestampMs.map(int) * 1000000)).drop('timestampMs', axis=1)
data.index.name = 'time'
data.head()
Out[8]:
accuracy activitys altitude heading velocity verticalAccuracy longitude latitude is_jerusalem
time
2016-10-03 17:05:29.219 26 [{u'activities': [{u'confidence': 100, u'type'... NaN NaN NaN NaN 34.784176 32.079666 False
2016-10-03 17:04:21.172 20 NaN NaN NaN NaN NaN 34.784146 32.079724 False
2016-10-03 17:02:21.123 20 [{u'activities': [{u'confidence': 100, u'type'... NaN NaN NaN NaN 34.784146 32.079724 False
2016-10-03 17:00:08.682 20 NaN NaN NaN NaN NaN 34.784146 32.079724 False
2016-10-03 16:59:08.078 20 [{u'activities': [{u'confidence': 75, u'type':... NaN NaN NaN NaN 34.784162 32.079737 False

Now we can easily mark the days on which I've visited Jerusalem.

In [9]:
is_jerusalem_daily = data.is_jerusalem.resample('D').max()
plt.ylim((-0.1, 1.1))
is_jerusalem_daily.plot.line(style='.')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x2005dcf8>

There's a lot of missing data, but we'll make the best out of what we've got...

We can see two behavioural transitions:

    1. It seems that before April 2013 I've always been in Jerusalem. That's because my wife and I moved out of Jerusalem to Tel-Aviv around June 2013.
    2. It seems like the frequency of the visits changed sometime during 2015 (the Jerusalem dots seem less dense). We'll come back to that later.

Let's better visualize this data: for every day, we want to know how many days have passed since the last visit to Jerusalem.

In [10]:
def count_days_since_most_recent_marker(marked_data):
    '''
    marked_data marks some of data's days. The i'th day is marked
    if marked_data's i'th cell is evaluated to True.
    Return a time series in which every day contains the number
    of days passed since the most recent marked day.
    If a day is marked, the result for that day will be 0.
    '''
    # put NaN in the non-marked days, and 1 in the marked days
    marked_data = marked_data.map(lambda d: 1 if d else np.nan)
    # multiply each marked day by the day's index
    marked_data *= xrange(len(marked_data))
    # 'ffill' will propagate the last non-NaN value forward to NaN values.
    # This way, every entry will contain the index of the last non-NaN entry
    marked_data.fillna(method='ffill', inplace=True)
    # translate indices to dates
    marked_data = marked_data.map(lambda i: marked_data.index[int(i)])
    # calculate the time delta between every day and its most recent marked day
    marked_data = marked_data.index - marked_data
    # extract the days component
    marked_data = marked_data.map(lambda timedelta: timedelta.days)
    return marked_data

days_since_last_visit = count_days_since_most_recent_marker(is_jerusalem_daily == 1)
plt.ylim((0, days_since_last_visit.max() + 5))
plt.title('days since last visit to Jerusalem')
days_since_last_visit.plot()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x5cb6e1d0>

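As a quick sanity check (not from the original notebook), running the helper on a toy series behaves as expected:

toy = pd.Series([True, False, False, True, False],
                index=pd.date_range('2016-01-01', periods=5, freq='D'))
print list(count_days_since_most_recent_marker(toy))
# expected: [0, 1, 2, 0, 1]
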
This graph contains some high values. But let's not forget: we have missing data, e.g. in the period between April 2013 and October 2013. The missing data was treated as non-Jerusalem days. Plotting the missing data periods, we get:

In [11]:
days_since_last_available_data = count_days_since_most_recent_marker(is_jerusalem_daily.notnull())
plt.ylim((0, days_since_last_available_data.max() + 5))
plt.title('days since last available data')
days_since_last_available_data.plot()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x3bc5b198>

It seems that besides the long periods of missing data, there are also short ones (even a single day).

This analysis is aimed at proving (or failing to prove) that I visit Jerusalem satisfactorily often. It's like a trial: if there's no data to convict me, then I'm acquitted. I'll treat the long missing-data periods that way, meaning they won't be counted as non-Jerusalem days (since most probably I did visit Jerusalem a couple of times during those periods). The short periods, on the other hand, are too short for me to make the same claim, so I'll treat them as non-Jerusalem days. I'll define short periods to be periods of less than a week:

In [12]:
data_availability = days_since_last_available_data <= 7
plt.figure()
plt.ylim(-0.1, 1.1)
plt.title('data availability')
data_availability.plot()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x5177fa20>

data_availability is a boolean, less noisy version of days_since_last_available_data.

In order to discard the long periods with no data, we'll first find the points in time where data starts to be available again, i.e. the switch points from False to True:

In [13]:
data_availability_shifted = data_availability.shift(1).fillna(True)
availability_on_switch = ((data_availability_shifted == False) &
                          (data_availability == True)).loc[lambda df: df == True]
availability_on_switch
Out[13]:
time
2013-10-01    True
2014-11-05    True
2015-04-24    True
2015-12-12    True
2016-01-12    True
dtype: bool

Now we can discard these missing data periods:

In [14]:
days_since_last_visit = count_days_since_most_recent_marker((is_jerusalem_daily == 1) | availability_on_switch)
days_since_last_visit *= data_availability
plt.ylim((0, days_since_last_visit.max() + 5))
plt.title('days since last visit to Jerusalem (discarding periods without enough data)')
days_since_last_visit.plot()

# if I don't come for 7 days in a row, it's ok by my parents
grace_period = 7
plt.hlines(y=grace_period,
           xmin=days_since_last_visit.index.min(),
           xmax=days_since_last_visit.index.max(),
           linestyles='--',
           color='gray')
Out[14]:
<matplotlib.collections.LineCollection at 0x5f1cbba8>

I plotted a horizontal line symbolizing the grace period: the amount of time I can go without visiting while my parents are still OK with it.

Let's analyze my parents' sentiment regarding my visit frequency (which occasionally exceeds the grace-period line).

My parents aren't robots. If I don't visit for a long time they get angry at me, and once I do visit they don't all of a sudden become happy; they hold a grudge. I'll model the grudge using a rolling window of width 30 days: when analyzing some date, only the data available in the previous 30 days can affect the grudge.

In [15]:
anger_with_grudge = days_since_last_visit.rolling(window=30).mean()
anger_with_grudge = (anger_with_grudge - grace_period).where(lambda anger_with_grudge: anger_with_grudge >= 0,
                                                             other=0)
anger_with_grudge.plot.line()
plt.title('anger over time')
Out[15]:
<matplotlib.text.Text at 0x5b736748>

Let's draw the same graph, but with colors and labels, so I can present it to my parents:

In [16]:
max_y = int(anger_with_grudge.values.max())
z = [[z] * len(anger_with_grudge.index) for z in range(max_y)]
plt.yticks(range(max_y / 3, max_y + 1, max_y / 3),
           ['mildly angry', 'angry', 'furious'])
plt.contourf(anger_with_grudge.index, range(max_y), z, 100, cmap='Reds')
plt.fill_between(x=anger_with_grudge.index,
                 y1=anger_with_grudge.values,
                 y2=max(anger_with_grudge.values),
                 color='w')
plt.title('anger over time')
Out[16]:
<matplotlib.text.Text at 0x5d8d1748>

It seems that most of the time I'm kind of OK and don't get my parents too upset with my visit frequency...

But there's that huge red peak! It occurred around the birth of my first child, so I think it's reasonable that I didn't drive all the way to Jerusalem to pay a visit... And the joy of becoming grandparents might be greater than the grudge, so I'm OK there.

One more thing I didn't consider is how often my parents visited Tel-Aviv. It's reasonable that I don't come to Jerusalem the day after they came to visit me in Tel-Aviv. But I don't think my parents' phones have GPS turned on, so I can't extract their data...


To conclude:

Mother, father - if you're reading this, I hope you don't hold a grudge against me (or just didn't understand this analysis).
